Optimized Deep Learning Pipelines: A Deep Dive into TFRecords and Protobufs (Part 2)

Kelly “Scott” Sims · Published in Heartbeat · 17 min read · Jul 27, 2023


Note: This is the second part of a two-part series. Check out part 1 here.

Protobufs for TFRecords

So why the heck did I just spend all that time covering protobufs? Because at the heart of Tensorflow’s TFRecords are protobufs. You can view the actual proto file here. We are going to step our way through this file, message by message.

The comments at the top of the proto file already provide an example data structure, using movie data as the example. We will use that same information to make our way through the explanation, so go ahead and open up that file now and follow along.

The most basic component of a TFRecord (and by extension, the protobufs that make up TFRecords) is data that consists of one of three types:

  • bytes — Would be used for text, audio, or video based features/inputs.
  • float — Would be used for features/inputs with floating point precision e.g. 3.14
  • int64 — Would be used for features/inputs with simple integer values e.g. 100

The basics

To represent these in the TFRecord, there are three proto messages defined.


Initializing and using any of them requires zero or more entries of data of the specified type. This is due to the “repeated” keyword, which signals to the proto compiler that the field is not a single value but an array of values of the defined data type.

I’ve taken the proto file from their git repo and have compiled it. I’m using the generated code in the examples that follow.

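To ground this, here is a rough sketch of what those examples look like. The message definitions quoted in the comments are paraphrased from Tensorflow’s feature.proto, and I’m assuming the compiled Python module is named feature_pb2 (protoc’s default output name); adjust it to whatever your generated module is called.

# feature.proto defines one wrapper message per base type (paraphrased):
#
#   message BytesList { repeated bytes value = 1; }
#   message FloatList { repeated float value = 1 [packed = true]; }
#   message Int64List { repeated int64 value = 1 [packed = true]; }
#
# Because "value" is a repeated field, the generated classes accept a list
# of zero or more entries of that type.
import feature_pb2

movie_names = feature_pb2.BytesList(value=[b"The Shawshank Redemption", b"Fight Club"])
movie_ratings = feature_pb2.FloatList(value=[9.0, 9.7])
movie_years = feature_pb2.Int64List(value=[1994, 1999])

print(list(movie_ratings.value))   # the repeated field behaves like a Python list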

Since this is just the raw values with no labels, we need a way to itemize the data. Obviously this would be necessary for data understanding as well as feature engineering. To do this, we first create a “Feature” proto.

[Image: Tensorflow’s Feature proto definition; see the feature.proto file in Tensorflow’s GitHub repo for more information.]

The “oneof” keyword signals that this proto will be a feature containing one, and only one, of the base proto types. When protoc compiles a proto containing a “oneof” member, it also generates extra API for type checking, so user code can inspect the proto and determine which “kind” it contains. It also enforces that an instance of the Feature proto can, and will, have a single type; otherwise a run-time or compile-time error will be thrown, depending on the programming language being used. You can read more about the “oneof” keyword here.

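Here is a small sketch of that behavior using the generated code (again assuming a module named feature_pb2); the Feature message itself is paraphrased in the comments.

# feature.proto wraps the three list types in a single "Feature" message (paraphrased):
#
#   message Feature {
#     oneof kind {
#       BytesList bytes_list = 1;
#       FloatList float_list = 2;
#       Int64List int64_list = 3;
#     }
#   }
import feature_pb2

rating_feature = feature_pb2.Feature(
    float_list=feature_pb2.FloatList(value=[9.0, 9.7])
)

# The generated "oneof" API lets us ask which kind this Feature currently holds...
print(rating_feature.WhichOneof("kind"))   # float_list

# ...and setting a different kind clears the previous one, so a Feature
# only ever contains a single type.
rating_feature.int64_list.value.append(1994)
print(rating_feature.WhichOneof("kind"))   # int64_list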

But wait, why did I create the individual protos of BytesList, FloatList, and Int64List just to wrap them in yet another proto? Well, this is mostly a design choice, and whether you feel like it’s a good one or not simply boils down to philosophical differences. But the next part might shed some light on this design choice.

Feature proto map

After we have created our Feature protos, we still need a way to assign a label to them. And we do this by aggregating all of our newly created features in a feature map proto called “Features” (unique naming, I know). In this proto map, each feature we have created is indexed by a string key. If you’ve only been programming in Python land your whole life, and have no clue what I mean when I say map, you can think of it as no different than a Python dictionary. It’s a key-value data structure.

[Image: Tensorflow’s “Features” proto definition]
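A sketch of building that map with the generated code (feature_pb2 again being my assumed module name):

# feature.proto defines "Features" as a map from a string key to a Feature (paraphrased):
#
#   message Features {
#     map<string, Feature> feature = 1;
#   }
import feature_pb2

movie_features = feature_pb2.Features(feature={
    "movie": feature_pb2.Feature(
        bytes_list=feature_pb2.BytesList(value=[b"The Shawshank Redemption", b"Fight Club"])
    ),
    "movie_ratings": feature_pb2.Feature(
        float_list=feature_pb2.FloatList(value=[9.0, 9.7])
    ),
})

# Just like a Python dictionary, each Feature is looked up by its string key.
print(list(movie_features.feature["movie_ratings"].float_list.value))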

Because our raw data is contained as either BytesList, FloatList, or Int64List and wrapped in a “oneof” Feature proto, that simplifies the map (and thus justifies the design choice). If we weren’t wrapping the base protos in the Feature proto, then we would have to create an individual map member for all the base types. For example, “Features” would have to become:

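(The following is just a sketch of that alternative; nothing like it exists in Tensorflow’s actual proto files.)

# Hypothetical "Features" without the Feature wrapper: one map per base type.
#
#   message Features {
#     map<string, BytesList> bytes_feature = 1;
#     map<string, FloatList> float_feature = 2;
#     map<string, Int64List> int64_feature = 3;
#   }
#
# Client code would then have to check three separate maps to find any one feature:
def find_feature(name, features):
    for field in ("bytes_feature", "float_feature", "int64_feature"):
        feature_map = getattr(features, field)
        if name in feature_map:
            return feature_map[name]
    return None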

Again, if you’ve only ever programmed in Python, or something of the sort, this might seem strange to you. But unlike Python, which is dynamic and doesn’t enforce declared data types, protobufs are strongly typed. You can’t create arrays or maps and insert mixed data types into them. Hence we would have to make our Features proto like the above. But this would be more cumbersome to deal with in practice: instead of having a single map containing all of our data, we would now have to inspect 3 separate maps for the possibility of any data.


TFRecords in Tensorflow

If you’ve ever looked at Tensorflow’s documentation for utilizing TFRecords, or if you’ve ever just used them in practice, you may have noticed that the API docs don’t actually import any generated protobuf files, nor do they mention protobufs apart from the fact that TFRecords are backed by protos. This is because the Tensorflow API has its own wrappers and abstractions around the protos. Quite conveniently, however, the Tensorflow API matches what we did with the raw protobufs almost exactly, 1:1.

Writing TFRecord Files

To demonstrate creating TFRecords using the Tensorflow API, I’m going to use the Stanford Cars Dataset. This is a great example dataset for this demonstration since all the training images are contained in a single folder and their labels and names are in a separate “.mat” file. We can use this opportunity to not only convert these images from JPG to TFRecords, but also to write each one with its appropriate label and any other metadata we wish to store with the image itself.


Before we start though, let’s do some setup steps. Because the names and labels for each car are in a separate file, let’s create a dictionary where the keys are the image names (e.g. “00151.jpg”) and the values are another dictionary containing the car name and the class label of the car for classification. This label is represented as an integer in the data. Some example entries of the dictionary would be:

'00234.jpg': {'class': 132, 'label': 'Hyundai Tucson SUV 2012'},
'00235.jpg': {'class': 111, 'label': 'Ford Ranger SuperCab 2011'},
'00236.jpg': {'class': 7, 'label': 'Acura ZDX Hatchback 2012'},
'00237.jpg': {'class': 65, 'label': 'Chevrolet Avalanche Crew Cab 2012'},
'00238.jpg': {'class': 26, 'label': 'BMW ActiveHybrid 5 Sedan 2012'},
'00239.jpg': {'class': 16, 'label': 'Audi V8 Sedan 1994'},
'00240.jpg': {'class': 138, 'label': 'Hyundai Sonata Sedan 2012'},
'00241.jpg': {'class': 170, 'label': 'Nissan Juke Hatchback 2012'},
'00242.jpg': {'class': 102, 'label': 'Ferrari California Convertible 2012'},

Since this isn’t an article on data cleaning/preparation, for this initial step, I’m just going to show my code with comments. I’m not going to explain it.

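The original code was shown as a screenshot; here is a rough equivalent, assuming the standard Stanford Cars devkit files cars_train_annos.mat and cars_meta.mat (the exact field names and indexing in your copy of the devkit may differ):

import scipy.io

def build_img_meta_dict(annos_path: str, meta_path: str) -> dict:
    '''
    Builds {"00151.jpg": {"class": <int>, "label": <car name>}, ...}
    from the Stanford Cars devkit .mat files.
    '''
    class_names = scipy.io.loadmat(meta_path)["class_names"][0]    # index -> car name
    annotations = scipy.io.loadmat(annos_path)["annotations"][0]   # one record per image

    img_meta_dict = {}
    for anno in annotations:
        fname = str(anno["fname"][0])       # e.g. "00151.jpg"
        clss = int(anno["class"][0][0])     # 1-indexed class id
        img_meta_dict[fname] = {
            "class": clss,
            "label": str(class_names[clss - 1][0]),
        }
    return img_meta_dict

img_meta_dict = build_img_meta_dict("cars_train_annos.mat", "cars_meta.mat")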

To better organize the training data, each individual image will be converted from its JPEG form to a Tensorflow “Features” object. Remember that the “Features” proto is a map of string to feature. The same holds true for the Tensorflow “Features” object. Thus for a single image, it will be represented in the following manner:

features = {
    "image":  # Feature( BytesList( image in bytes ) )
    "width":  # Feature( Int64List( pixel width ) )
    "height": # Feature( Int64List( pixel height ) )
    "label":  # Feature( BytesList( car name ) )
    "class":  # Feature( Int64List( integer class value ) )
}

Defining our helper functions

Following the same pattern as above when we were using the actual protobufs, we first need to convert each of these features into either a Tensorflow BytesList, FloatList, or Int64List object. We then need to wrap that newly created bytes, float, or int64 list in a Tensorflow “Feature” object (not “Features”; again, unique naming, I know). We will create helper functions to do so. This part is where most tutorials on TFRecords typically start!

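These helpers typically look something like the following (a sketch; the screenshotted versions in the original article may differ slightly):

import tensorflow as tf

def _bytes_feature(value: bytes) -> tf.train.Feature:
    '''Wraps a bytes value in a BytesList-backed Feature.'''
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value: float) -> tf.train.Feature:
    '''Wraps a float value in a FloatList-backed Feature.'''
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value: int) -> tf.train.Feature:
    '''Wraps an integer value in an Int64List-backed Feature.'''
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))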

Next, we will create a helper function that will:

  1. Read an image into memory
  2. Extract its name and label from the dictionary in the preprocessing step
  3. Convert the image to bytes
  4. Convert all features into either a bytes, int64, or float feature
  5. Create the final “Features” map object.
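A sketch of that helper (the name create_image_features_map comes from the write_shard function further down; the screenshotted implementation may differ):

import os

def create_image_features_map(img_path: str, img_meta_dict: dict) -> tf.train.Features:
    '''
    Reads one image from disk, looks up its metadata by file name, and
    builds the "Features" map for it.
    '''
    img_name = os.path.basename(img_path)                # e.g. "00151.jpg"
    meta = img_meta_dict[img_name]

    image_bytes = tf.io.read_file(img_path)              # raw JPEG bytes
    image = tf.io.decode_jpeg(image_bytes, channels=3)   # decoded only to get the dimensions
    height, width = image.shape[0], image.shape[1]

    feature_map = {
        "image":  _bytes_feature(image_bytes.numpy()),
        "width":  _int64_feature(width),
        "height": _int64_feature(height),
        "label":  _bytes_feature(meta["label"].encode("utf-8")),
        "class":  _int64_feature(meta["class"]),
    }
    return tf.train.Features(feature=feature_map)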

With our helper functions done, all we need to do now is work through the three steps below (a short usage sketch tying them together follows at the end of this section):

  1. Get a list of all the images in the train directory (their paths)
from tqdm import tqdm

def write_tfrecords(file_paths: list, num_shards: int, img_meta_dict: dict, train_set: bool):
    '''
    Takes in a list of image paths and an integer for how many shards the images should be
    split into. Each shard is written as a separate TFRecord file.
    '''
    print(f"Writing {len(file_paths)} images into {num_shards} TFRecord files")
    sharded_file_paths = shard_list(file_paths, num_shards)
    for count, shard in tqdm(enumerate(sharded_file_paths, start=1), desc="Shard", colour="red"):
        # base output name e.g. cars_train_dataset_2of10.tfrecord
        output_filename = f'cars_{"train" if train_set else "test"}_dataset_{count}of{num_shards}.tfrecord'
        write_shard(shard, output_filename, img_meta_dict)

2. Shard them into a certain number of sets (e.g. a list of 50 pictures sharded into 5 sets would result in 5 lists of 10 images)

def shard_list(data: list, num_shards: int) -> list[list]:
    '''
    Takes in a list of data and returns a list of lists of the same data
    sharded "num_shards" times.
    e.g.:
    data = [1,2,3,4,5,6,7,8,9], num_shards = 3
    returns [[1,2,3], [4,5,6], [7,8,9]]
    '''
    n = len(data)
    return [data[(i*n // num_shards) : ((i+1)*n // num_shards)] for i in range(num_shards)]

3. Write each shard as a separate TFRecord file.

def write_shard(shard: list, file_name: str, img_meta_dict: dict):
    '''
    Takes in a list of image paths, loads them into memory,
    converts them to tf.train.Example objects and writes them to a TFRecord file
    '''
    with tf.io.TFRecordWriter(file_name) as writer:
        for example in shard:
            features_map = create_image_features_map(example, img_meta_dict)
            example = tf.train.Example(features=features_map)
            writer.write(example.SerializeToString())

Last thing of note: before we write the data as a TFRecord, we wrap the features map object in one last object, the tf.train.Example object. Once in that form, we can write it to disk. There is no requirement to use tf.train.Example in TFRecord files; tf.train.Example is just a method of serializing dictionaries to byte-strings.

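Tying the three steps together, a usage sketch might look like this (the directory name and shard count are illustrative):

import glob
import os

# Step 1: gather the training image paths
train_img_paths = glob.glob(os.path.join("cars_train", "*.jpg"))

# Steps 2 and 3: shard the paths and write each shard to its own TFRecord file
write_tfrecords(train_img_paths, num_shards=10, img_meta_dict=img_meta_dict, train_set=True)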

Reading TFRecords

Reading the TFRecords and preparing them for model training is straightforward and doesn’t deviate very much from all the examples in the tf.Dataset docs. We will write a couple of helper functions. The first helper function will allow us to parse a TFRecord file that is loaded into memory. Since the data is serialized, we need a way to deserialize it. We do that by first defining the expected schema of the data in a dictionary format so the parser knows what to expect.

fmt = {
    'image': tf.io.FixedLenFeature((), tf.string, default_value=""),
    'width': tf.io.FixedLenFeature((), tf.int64, default_value=0),
    'height': tf.io.FixedLenFeature((), tf.int64, default_value=0),
    'label': tf.io.FixedLenFeature((), tf.string, default_value=""),
    'class': tf.io.FixedLenFeature((), tf.int64, default_value=0),
}

parsed = tf.io.parse_single_example(example, fmt)

You can see that this follows the same dictionary format we created when we first wrote the TFRecord files. It informs the parser that it should expect to deserialize the data into the 5 feature fields of image, width, height, label, and class, and it also tells the parser the expected data types as well as the default value to use if the data is missing for a particular feature.

Decoding the binary file

You might be asking yourself why you need to yet again define the structure of the data if you already did so when writing the TFRecords. This is because the data is serialized into one long string of information; without the expected structure, the parser doesn’t know how to interpret it. Think of it this way: if I just gave you the binary of:

01000001

In pure binary, this is just the number 65. However I might have intended for you to parse it as ASCII, and in that case this is actually the letter ‘A’. Without the extra information, there’s ambiguity as far as what the data could actually mean.

Next we need to convert the serialized image back into matrix form. I’m also going to take this opportunity to resize the image since I’m going to use a pre-trained model that expects image input to be (299,299,3).

image = tf.io.decode_jpeg(parsed["image"], channels=3)
image = tf.image.convert_image_dtype(image, tf.float32)
image = tf.image.resize(image, [299,299])

return image, parsed['label'], parsed['class']
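Putting those two snippets together, the full parsing function (referred to as parse_img_data when we build the pipeline below) looks roughly like this:

def parse_img_data(example):
    '''Deserializes one serialized tf.train.Example and returns (image, label, class).'''
    fmt = {
        'image': tf.io.FixedLenFeature((), tf.string, default_value=""),
        'width': tf.io.FixedLenFeature((), tf.int64, default_value=0),
        'height': tf.io.FixedLenFeature((), tf.int64, default_value=0),
        'label': tf.io.FixedLenFeature((), tf.string, default_value=""),
        'class': tf.io.FixedLenFeature((), tf.int64, default_value=0),
    }
    parsed = tf.io.parse_single_example(example, fmt)

    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [299, 299])

    return image, parsed['label'], parsed['class']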

Configuring our input pipeline

For the next step we will use the Tensorflow Dataset API and configure our input pipeline. To do this, we need to tell Tensorflow where the files are and how to load them.

list_of_tfrs = os.path.join(path, "*.tfrecord")
files = tf.data.Dataset.list_files(list_of_tfrs)
dataset = files.interleave(
    map_func=tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)

The actual loading of the TFRecords is handled by the mapping function TFRecordDataset. The interleave method spawns as many threads as you specify, or as many as it thinks is optimal if you use the AUTOTUNE parameter. Each thread will load and process its own part of the data concurrently, so instead of processing a single file at a time, many can be processed at once. As each thread finishes processing a portion of its data, Tensorflow will “interleave” the processed data from the various threads to make a batch. Hence, instead of loading and processing a single file and making all of its data part of a batch, Tensorflow gathers data from the various threads and creates a batch.

After the data is loaded, we need to apply our parsing function we created above as well as other pipeline parameters such as batch size and prefetching.

dataset = dataset.map(map_func=parse_img_data)
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
return dataset

The prefetch method basically tells the CPU to prepare the next batch of data and have it ready to go while the GPU is working. That way, once the GPU is done with its current batch, there’s minimal idle time while it waits for the CPU to prepare the next batch.


And that’s it for reading the data. We can now test it out and plot an image as well as inspect the image size transformation.

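A quick check might look like the sketch below. I’m assuming the pipeline code above is wrapped in a function, which I’ll call make_dataset_tfrecord here (a hypothetical name), and that matplotlib is installed:

import matplotlib.pyplot as plt

dataset = make_dataset_tfrecord(path="cars_tfrecords", batch_size=32)

# Pull one batch and inspect the first image and its label
for images, labels, classes in dataset.take(1):
    print(images.shape)   # e.g. (32, 299, 299, 3)
    plt.imshow(images[0].numpy())
    plt.title(labels[0].numpy().decode("utf-8"))
    plt.show()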

Training with TFRecords vs Raw Input

Most deep learning tutorials, both Pytorch and Tensorflow, typically show you how to prepare your data for model training by using simple DataGenerators which read the raw data. With this method, you are (lazily) reading the data from disk in its raw form. In our case, a DataGenerator with a batch size of 10 would have to read 10 separate JPG files into memory prior to any other preprocessing in the pipeline. To compare how (in)efficient this is compared to TFRecords, let’s create our own DataGenerator that still uses the same pipeline as our TFRecords. The steps for this will be:

  • Create a helper function that gets the class and label from the meta-dictionaries we made in the preprocessing step.
def parse_labels(metadata):
    def inner(img_name):
        str_img_name = img_name.numpy().decode("utf-8")
        label = metadata[str_img_name]['label']
        clss = metadata[str_img_name]['class']
        return label, clss
    return inner
  • Create a helper function that reads the image into memory and bundles it with its label and class as retrieved by the previous helper function.
def parse_raw_data(metadata):
    def inner(filename):
        img_name = tf.strings.split(filename, os.sep)[-1]

        label, clss = tf.py_function(parse_labels(metadata), [img_name], [tf.string, tf.int64])

        img = tf.io.read_file(filename)
        img = tf.io.decode_jpeg(img, channels=3)
        img = tf.image.convert_image_dtype(img, tf.float32)
        img = tf.image.resize(img, [299,299])
        return img, label, clss
    return inner
  • Create a function that returns the data generator with the pipeline settings of “batch size” and “prefetching” applied.
def make_dataset_jpg(img_files: list, metadata: dict, batch_size: int):
    list_ds = tf.data.Dataset.list_files(img_files)
    ds = list_ds.map(parse_raw_data(metadata))
    ds = ds.batch(batch_size=batch_size)
    ds = ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return ds

Monitoring Training

We don’t care so much about the actual model for our testing, so we will just use transfer learning and load the InceptionResNetV2 model. What we do care about is:

  • CPU Utilization — When is it active and how long is it active for
  • Memory Utilization — How much memory is being used for processing of the data prior to loading to the GPU
  • GPU Utilization — How much idle time does the GPU experience while it is waiting for the CPU to prepare the next batch
  • Time to train between each batch
  • Time to train between each epoch
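For reference, a minimal version of that transfer-learning setup could look like the sketch below (the classification head, optimizer, and loss are illustrative; the article’s exact model may differ):

import tensorflow as tf

base = tf.keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3), pooling="avg"
)
base.trainable = False   # freeze the pre-trained backbone

model = tf.keras.Sequential([
    base,
    # 196 car classes; one extra unit because the dataset's class ids are 1-indexed
    tf.keras.layers.Dense(197, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)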

To monitor the utilization, we will do two separate things that ultimately achieve the same goal. The first one requires more work by you, the second handles everything for you after minimal setup using Comet’s online dashboard.

Manually logging system resources

1. We will spawn a separate thread (so it’s not blocking our model training) and sample utilization metrics 5x a second. Below is the code for this operation. Again, since this isn’t an article on logging system resources, I leave it to the reader to figure out the code.

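Since the original code was a screenshot, here is a rough sketch of the idea, assuming the psutil and GPUtil packages are installed (the article’s actual implementation may differ):

import threading
import time
import psutil
import GPUtil

def monitoring_func(monitoring_data: dict, stop_event: threading.Event):
    '''Samples CPU, memory, and GPU utilization 5x a second until stop_event is set.'''
    for key in ("time", "cpu_util", "mem_util", "gpu_util", "gpu_mem_util"):
        monitoring_data.setdefault(key, [])
    while not stop_event.is_set():
        gpu = GPUtil.getGPUs()[0]
        monitoring_data["time"].append(time.time())
        monitoring_data["cpu_util"].append(psutil.cpu_percent())
        monitoring_data["mem_util"].append(psutil.virtual_memory().percent)
        monitoring_data["gpu_util"].append(gpu.load * 100)
        monitoring_data["gpu_mem_util"].append(gpu.memoryUtil * 100)
        time.sleep(0.2)   # 5 samples per second

# Run the sampler on its own thread so it doesn't block training
monitoring_data = {}
stop_event = threading.Event()
threading.Thread(target=monitoring_func, args=(monitoring_data, stop_event), daemon=True).start()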

Logging with Comet

2. We will use Comet.ml to log our metrics to a web-based dashboard. We will do this so we can chart the performance of model training with respect to both datasets in real time as it’s training. If you’re not familiar with Comet, you can head over to Comet.ml and sign up for free. You will need your API key: after you create your account, click your profile pic in the top right corner, go to Account settings, and on the left navigation panel you will see API Keys. That is where you can find your key.

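Creating the experiment only takes a few lines (a sketch; the project and workspace names are placeholders):

from comet_ml import Experiment

exp = Experiment(
    api_key="YOUR_API_KEY",           # found under Account settings -> API Keys
    project_name="tfrecords-vs-jpg",  # placeholder project name
    workspace="your-workspace",       # placeholder workspace name
)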

To log the time between batches and epochs, we will simply have the data written to a list during training. We could log the batch and epoch information to Comet as well; however, I prefer to collect that data locally so I can do some extra analysis and plotting after the fact for this blog post. The entire training loop is the following:

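The original loop was shown as a screenshot; a sketch that captures the idea is below: a Keras callback appends per-batch and per-epoch times to lists, and then we call model.fit as usual (the article’s exact loop may differ):

import time
import tensorflow as tf

class TimingCallback(tf.keras.callbacks.Callback):
    '''Records per-batch and per-epoch wall-clock times into lists.'''
    def __init__(self):
        super().__init__()
        self.batch_times, self.epoch_times = [], []

    def on_train_batch_begin(self, batch, logs=None):
        self._batch_start = time.time()

    def on_train_batch_end(self, batch, logs=None):
        self.batch_times.append(time.time() - self._batch_start)

    def on_epoch_begin(self, epoch, logs=None):
        self._epoch_start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.time() - self._epoch_start)

timing = TimingCallback()

# The parsed dataset yields (image, label, class); keep only (image, class) for training
train_ds = dataset.map(lambda img, label, clss: (img, clss))

model.fit(train_ds, epochs=10, callbacks=[timing])
exp.end()          # tell Comet the run is finished
stop_event.set()   # stop the local monitoring thread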

Parameters

System parameters:

  • RTX 3090 24GB Ram
  • Ryzen 5950x 16 Core 32 Thread CPU
  • 64GB DDR4 Ram
  • ASUS Crosshair III Formula X570 Motherboard
  • 2 TB Samsung EVO SSD

Test parameters

  • batch size = 64
  • epochs = 10
  • prefetching = True

Below shows the setup, testing, and logging, using the TFRecords first. As the model is training, Comet logs system resources for us after we have set up the experiment as seen above. Notice that we call exp.end() after the model is done training; this signals to Comet to stop logging. Also, our custom logging function is recording system resource utilization locally on a separate thread, writing that data to the “monitoring_data” dictionary we passed into “monitoring_func”.

[Image: TFRecords (left) vs Raw JPG (right) realtime system metrics logging using Comet.ml]

RESULTS

While monitoring the metrics during training on Comet (image above), it was already clear how much more effective TFRecords were versus raw JPGs on disk. The dashboard shows that the GPU was utilized at nearly 100% for almost the entire training loop. Conversely, the JPGs induced a lot of idle time. In the dashboard above, we can see that TFRecords took 18 minutes to train for 10 epochs, whereas the JPGs took about 9 minutes longer, at 27 minutes. The Comet data was aggregated and uploaded at 1-minute intervals.

Looking at the data logged locally at a higher resolution (5x a second), when training with the original JPGs there is a lot of “dead time” on the GPU while it waits on the CPU to read each image, process it, and load it. In the training session with TFRecords, by contrast, the GPU stayed busy almost the whole time, just as we saw in real time on Comet.

[Image: locally logged CPU and GPU utilization over time while training with TFRecords vs raw JPGs]

Note:

As an FYI, when reading the x-axis, it is plotting a Python DateTime as the value, so the first number on the axis is the date, not the time. Hence the value of 22:21:10 actually means day=22nd, hour=21, min=10. It also looks as if the CPU was never quite as busy as when it was training with the TFRecords; it was almost working double time to keep up with the demand of the GPU, which is why the GPU stayed so much busier than with the raw JPG files. Let’s apply a Gaussian filter to the time series and see if the smoothed curves reveal any further trends.

[Image: Gaussian-smoothed CPU and GPU utilization curves for the two training runs]

After smoothing, it’s very evident when a new epoch was starting while training with the TFRecords. There are 10 spaced-out spikes in CPU utilization that correspond with 10 spikes in GPU memory utilization. This is clearly the CPU processing and loading the GPU with the next batches for the next epoch, which makes sense given that we trained the model for 10 epochs and there are exactly 10 spikes! Meanwhile, while training with the JPG images, there is a lot of down time on the GPU at each epoch: 10 significant drops in GPU utilization, each appearing to last for about a minute or so. It also seems that the CPU is just taking its time processing the images.

Time distributions

Let’s now take a look at the distributions of the time it took to process each batch. One would assume that we shouldn’t see a significant difference in the time it takes to process a batch, since for both formats the data is ultimately converted to a (None, 299, 299, 3) tensor. Any differences between the two inputs would be purely due to stochasticity.

[Image: distribution of per-batch training times for TFRecords vs raw JPGs]

Sure enough, this is what we see: the time to process each batch is more or less the same. The bit we should be more interested in is the time for each epoch, since that involves the entire pipeline of CPU and GPU processing. The distribution won’t be very interesting to look at since we only did 10 epochs. Hence, setting the bins on the histogram to 3, this is what we have:

[Image: histogram (3 bins) of per-epoch training times for TFRecords vs raw JPGs]

Although not a graph bountiful of data, it adds more context to what we already saw on Comet. The time spent per epoch is significantly less when training with TFRecords than with the raw data on disk. The data breaks down as follows:

|        | TFRecords | JPG    |
|--------|-----------|--------|
| Mean | 215.4 | 271.43 |
| Median | 205.65 | 280.78 |
| STD | 26.34 | 22.27 |

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
