How to create data pipelines as Docker containers

In this tutorial, we'll cover building data pipelines as Docker containers and demonstrate the process with examples.

Did you know that 2.5 quintillion bytes of data are created every day (Robinson, M. A., 2021, Dihuni)?

Data is often called the new oil: it drives the world today, and like oil it needs to be transported to where it is processed. Data pipelines make that movement efficient.

Docker has made packaging our solutions much easier. It provides a platform for automating the deployment of projects and software applications, and it even lets us ship trained machine learning and deep learning models inside containers, which not only simplifies production but also adds automation and abstraction.

Containerizing your data pipelines with Docker brings advantages such as reproducibility, scalability, easier integration testing, and more. In this tutorial, we'll discuss how to create and containerize your data pipelines.

Setting up environment

To get started, first, install the following:

  1. Python 3

  2. Docker

You can check whether you have Python installed by running the following command in your terminal:

python3 --version

For Docker, download and install Docker Desktop (or Docker Engine on Linux) by following the official installation guide at docs.docker.com; note that pip install docker only installs the Docker SDK for Python, not Docker itself. You can verify the installation by running:

docker --version

Creating a data pipeline

A data pipeline is simply a series of data processing elements connected so that the output of one element is the input of the next.
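
To make the idea concrete, here is a minimal, hypothetical sketch (not the example we containerize below) in which each stage is a function and the output of one stage feeds the next:

def extract():
    # Stage 1: produce some raw records (hard-coded here purely for illustration)
    return [{"day": "Monday", "sales": 120}, {"day": "Tuesday", "sales": 80}]

def transform(records):
    # Stage 2: keep only the records with sales above a threshold
    return [r for r in records if r["sales"] > 100]

def load(records):
    # Stage 3: "load" the result; here we simply print it
    for record in records:
        print(record)

# Chain the stages: the output of each element is the input of the next
load(transform(extract()))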

In the example below, I create a simple data pipeline that prints a sentence and demonstrates how to pass command-line arguments. Using this example, you will be able to create more elaborate and scalable data pipelines that involve real data processing.

import sys  # gives access to command-line arguments
import pandas as pd  # data manipulation library (installed in the image we build below)

# Print the full list of command-line arguments
print(sys.argv)

# The first argument after the script name is the day we are processing
day = sys.argv[1]

print(f'Job finished successfully for day = {day}')

We will name our file data_pipeline.py
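
Since the Dockerfile in the next section installs pandas, the same script could be extended to do real processing. The sketch below is only an illustration and assumes a hypothetical sales.csv file with a day column; it is not required for the rest of the tutorial.

import sys
import pandas as pd

day = sys.argv[1]

# Hypothetical input file; replace it with your own data source
df = pd.read_csv('sales.csv')

# Keep only the rows for the requested day and write them to a new file
df[df['day'] == day].to_csv(f'sales_{day}.csv', index=False)

print(f'Job finished successfully for day = {day}')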

Creating a Dockerfile

A Dockerfile is a text file that describes the specifications for building our container image.

To create the container that runs the data pipeline above, these are the specifications we will need:

FROM python:3.9

RUN pip install pandas

WORKDIR /app
COPY data_pipeline.py pipeline.py

ENTRYPOINT [ "bash" ]

In this example, we start from the python:3.9 base image and install pandas on top of it. WORKDIR /app sets a directory called /app inside the container as our working directory, and COPY copies data_pipeline.py into the image under the name pipeline.py. Finally, ENTRYPOINT sets the command that runs when the container starts.
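
The bash entrypoint is convenient for exploring the container interactively. If you would rather have the container execute the pipeline directly, one common alternative (shown here only as a sketch) is to make Python the entrypoint, so that any arguments passed to docker run go straight to the script:

FROM python:3.9

RUN pip install pandas

WORKDIR /app
COPY data_pipeline.py pipeline.py

# Run the pipeline directly instead of dropping into a shell
ENTRYPOINT [ "python", "pipeline.py" ]

With this entrypoint, docker run data_pipeline:v001 Monday would run the pipeline immediately and exit when it finishes. We will stick with the bash entrypoint for the rest of the tutorial.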

Building the Docker Container

After creating the Dockerfile, we build the container image that will run our data pipeline with the following command:

docker build -t name:tag dir
#Building our example container
docker build -t data_pipeline:v001 .

The docker build command builds a container image from a Dockerfile. With the -t argument, we give the image a name and, optionally, a tag in the format name:tag; in our example, that is data_pipeline:v001.

We also have to provide the directory where our Dockerfile is (the build context). In our example this is ., which tells the command to use the Dockerfile in the same directory from which we are running the command.
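
Once the build finishes, you can confirm that the image exists locally by listing the images for that repository name:

docker images data_pipeline

This should show the data_pipeline repository with the v001 tag.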

In the output below, we can see that the container image has been successfully built and stored in the local image store (the docker.io/library/ prefix in the name is just the default registry namespace; the image has not been pushed to Docker Hub).

 ---------------------------------------------------------------------
 => [2/4] RUN pip install pandas                                                                                                                                                                     
 => [3/4] WORKDIR /app                                                                                                                                                                               
 => [4/4] COPY data_pipeline.py pipeline.py                                                                                                                                                           
 => exporting to image                                                                                                                                                                                
 => => exporting layers                                                                                                                                                                               
 => => writing image sha256:c29e1fef73ef97c228162bff2a8cc88b55e8d6715cbbb7c90fcb12250a7aad51                                                                                                          
 => => naming to docker.io/library/data_pipeline:v001

Running the Container Image

Now that everything is set, we can run the container and execute our data pipeline inside it.

To run a container image, use the following command:

docker run -i -t data_pipeline:v001

The -i and -t flags give us an interactive terminal inside the container. The container starts with the ENTRYPOINT you specified; in our example, it drops us into bash, since that was our ENTRYPOINT.

root@f53b599d13fa:/app# ls
pipeline.py
root@f53b599d13fa:/app# python pipeline.py Monday
['pipeline.py', 'Monday']
Job finished successfully for day = Monday
root@f53b599d13fa:/app#

In the example above, we can see that after running our container image, it opens bash in the /app directory. When we list the files in the directory, we can see our pipeline.py file.

Now you can execute the data pipeline with no worries. In our example, we run python pipeline.py Monday, where Monday is the command-line argument the script reads.
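
You do not have to run the pipeline interactively. Because the entrypoint is bash, anything placed after the image name is passed to bash as arguments, so you can execute the script in a single step:

docker run data_pipeline:v001 -c "python pipeline.py Monday"

The container runs the command, prints its output, and exits when the job finishes.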

Hurray! You have successfully executed your data pipeline inside a container.

Conclusion

There are several advantages to using Docker to containerize your solutions as a developer. The main ones are portability and reproducibility of results across different environments, without worrying about details such as the operating system or conflicting library versions. As a data engineer, you can benefit from Docker too, for example by containerizing your data pipelines.

In this tutorial, we have learned how to create data pipelines in Python, containerize them using Docker, and run them in any environment.

Connect with me on Twitter & LinkedIn.

Thank you and I wish you Happy Learning!