A Gentle Introduction to Data Pipelines Using tf.data

Viraj Kadam
2 min read · Jun 1, 2022


The first question is: why is there a need for data pipelines at all?

The need for an input pipeline arises because:

  • The data might not fit in memory.
  • The data might require some form of preprocessing.
  • We want to use the hardware efficiently.

Enter the ETL process (Extract, Transform, Load):

  • Extract: Read the data from memory/storage.
  • Transform: Apply the necessary transformations to the data.
  • Load: Transfer the prepared data to the GPU/accelerator for further operations.

We can build pipelines for our deep learning models using tf.data. Let's take a look at how to do that.

We will select 5000 images from the training and test sets and build a pipeline for prediction. In the next cell block, we pick out the files in the directory that have a “.jpg” extension and randomly sample 5000 of those file paths.

Randomly selecting 5000 images from the train and test sets.
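A minimal sketch of that cell, assuming a hypothetical data/train and data/test directory layout and a fixed seed for reproducibility (the paths and seed are placeholders, not the article's exact values):

```python
import random
import tensorflow as tf

# Hypothetical directory layout; the actual paths are not given in the article.
train_paths = tf.io.gfile.glob("data/train/*.jpg")
test_paths = tf.io.gfile.glob("data/test/*.jpg")

# Randomly sample 5000 file paths from each split.
random.seed(42)  # assumed seed, for reproducibility
train_sample = random.sample(train_paths, 5000)
test_sample = random.sample(test_paths, 5000)

# Build a tf.data pipeline over the training sample.
dataset = tf.data.Dataset.from_tensor_slices(train_sample)
```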

Before going to the next step, let's talk about parallelization. We load the data on the CPU while model training/inference happens in parallel on the GPU. We can further optimize the extraction/transformation steps by parallelizing the pipeline operations.

Now the question is: how do we decide how many samples to extract/transform in parallel? There are two ways:

  • Trial and error (while watching the hardware utilization) to arrive at a good value.
  • Use tf.data.experimental.AUTOTUNE and let TensorFlow tune the value at runtime, as the loading step below does.

Next, we load the selected images using a custom load_image function.

Then we tell the pipeline to load the images we selected earlier with .map(load_image, num_parallel_calls=AUTOTUNE). Notice that we pass AUTOTUNE so that TensorFlow parallelizes the image loading for us.

We define the AUTOTUNE constant and use it in .map to parallelize the loading operation.
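A sketch of what load_image and the .map call might look like; decoding the JPEGs to three RGB channels is an assumption:

```python
# AUTOTUNE lets tf.data pick the level of parallelism at runtime.
AUTOTUNE = tf.data.experimental.AUTOTUNE

def load_image(path):
    # Read the raw bytes from disk and decode the JPEG into a uint8 image tensor.
    image = tf.io.read_file(path)
    image = tf.io.decode_jpeg(image, channels=3)
    return image

# Load the images in parallel across CPU cores.
dataset = dataset.map(load_image, num_parallel_calls=AUTOTUNE)
```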

Next, we apply some preprocessing to the images using a custom preprocessing function.

Preprocess Images
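A minimal version of such a preprocessing function, assuming we resize to 224×224 and scale pixel values to [0, 1] (both choices are placeholders, not the article's exact code):

```python
IMG_SIZE = 224  # assumed target size

def preprocess_image(image):
    # Resize to a fixed spatial size; tf.image.resize returns float32,
    # so we can scale pixel values from [0, 255] to [0, 1] directly.
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    return image / 255.0

dataset = dataset.map(preprocess_image, num_parallel_calls=AUTOTUNE)
```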

Next, we optimize the pipeline. We shuffle the data, split it into batches of a predefined batch size, and prefetch the data to avoid bottlenecks (shuffling before batching ensures that individual samples, not whole batches, get randomized).
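A sketch of those three steps, with an assumed batch size of 32 and shuffle buffer of 1000 (both are tunable placeholders):

```python
BATCH_SIZE = 32     # assumed batch size
BUFFER_SIZE = 1000  # assumed shuffle buffer size

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)   # randomize the sample order each epoch
    .batch(BATCH_SIZE)      # group samples into batches
    .prefetch(AUTOTUNE)     # overlap preprocessing with model execution
)
```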

Phew, now our pipeline is ready to be used for training or prediction. You can simply pass it to the predict or fit method of your TensorFlow model.
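For instance, with a hypothetical Keras model (the model below is only a stand-in to show where the dataset plugs in):

```python
# A stand-in Keras model; any model with a matching input shape works.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# The pipeline can be passed directly to predict; for fit, each dataset
# element would also need to carry a label.
predictions = model.predict(dataset)
```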
