A Gentle Introduction to Data Pipelines Using tf.data
The first question is: why do we need data pipelines at all?
An input pipeline is needed because:
- Data might not fit in memory.
- Data might require some form of preprocessing.
- The available hardware should be used efficiently.
Enter the ETL process (Extract, Transform, Load):
- Extract: Read the data from memory/storage.
- Transform: Apply the necessary transformations to the data.
- Load: Transfer the prepared data to the GPU/accelerator for further operations.
We can build pipelines for our deep learning models using tf.data. Let's take a look at how to do that.
We will select 5000 images from the training and test sets and build a pipeline to run predictions on them. In the next cell block, we list the files in the directory that have a “.jpg” extension and randomly pick 5000 of those file paths.
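A minimal sketch of what that cell might look like is shown below; the directory path data/images and the variable names are placeholders rather than the ones used in the original notebook.

```python
import pathlib
import random

import tensorflow as tf

# Hypothetical image directory; replace with your own dataset location.
IMAGE_DIR = pathlib.Path("data/images")

# Collect every ".jpg" file path in the directory and randomly keep 5000 of them.
all_paths = [str(p) for p in IMAGE_DIR.glob("*.jpg")]
selected_paths = random.sample(all_paths, 5000)

# Start the tf.data pipeline from the selected file paths.
dataset = tf.data.Dataset.from_tensor_slices(selected_paths)
```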
Before going to the next step, let's talk about parallelization. The data is loaded using the CPU, while model training/inference happens in parallel on the GPU. We can optimize the pipeline further by parallelizing the data extraction/transformation operations themselves.
Now the question is: how do we decide how many samples to extract/transform in parallel? There are two ways to do this:
- Trial and error (while watching the hardware resources) to come up with an ideal value.
- Use tf.data.experimental.AUTOTUNE, as shown in the snippet below.
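The second option is a one-liner; tf.data then tunes the degree of parallelism dynamically at runtime (in recent TensorFlow releases the same constant is also exposed as tf.data.AUTOTUNE).

```python
import tensorflow as tf

# Let tf.data choose the degree of parallelism dynamically at runtime.
AUTOTUNE = tf.data.experimental.AUTOTUNE
```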
Next, we load the selected images using the custom function load_image.
Then we set up the pipeline to load the images we selected earlier with .map(load_image, num_parallel_calls=AUTOTUNE). Notice that we use the AUTOTUNE option to parallelize the image loading.
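Here is a plausible sketch of load_image and the .map call, continuing from the file-path dataset above; the exact decoding steps (JPEG decoding into three channels) are assumptions, not necessarily what the original notebook does.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def load_image(path):
    # Read the raw bytes from disk and decode them into an RGB image tensor.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    return image

# Map the loader over the file-path dataset; AUTOTUNE decides how many
# images to load in parallel.
dataset = dataset.map(load_image, num_parallel_calls=AUTOTUNE)
```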
Next, we apply some preprocessing to the images using a custom preprocessing function.
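A sketch of such a preprocessing step is shown below; the 224×224 target size and the scaling to [0, 1] are assumptions and depend on the model being used.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def preprocess_image(image):
    # Resize to a fixed input size and scale pixel values to the [0, 1] range.
    image = tf.image.resize(image, [224, 224])
    image = tf.cast(image, tf.float32) / 255.0
    return image

# Apply the preprocessing step in parallel as well.
dataset = dataset.map(preprocess_image, num_parallel_calls=AUTOTUNE)
```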
Next, we optimize the pipeline: we split the data into batches of a predefined batch size, shuffle it, and prefetch batches to avoid bottlenecks.
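A sketch of these final steps, assuming a batch size of 32 and a shuffle buffer of 1000 (both arbitrary choices for illustration):

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 32        # assumed batch size
SHUFFLE_BUFFER = 1000  # assumed shuffle buffer size

dataset = (
    dataset
    .shuffle(SHUFFLE_BUFFER)   # randomize the sample order
    .batch(BATCH_SIZE)         # group samples into batches
    .prefetch(AUTOTUNE)        # prepare upcoming batches while the current one is consumed
)
```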
Phew, now our pipeline is ready to be used for training/prediction. You can simply pass the pipeline to the predict or fit function of your TensorFlow model.
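For example, assuming model is a compiled tf.keras model that accepts the preprocessed image batches:

```python
# Run inference directly on the tf.data pipeline.
predictions = model.predict(dataset)

# For training, a labelled (image, label) pipeline would be passed the same way:
# model.fit(dataset, epochs=10)
```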