Airbyte 101: Building Data Pipelines with Airbyte

In this tutorial, we will use Airbyte to transfer data from HubSpot to a PostgreSQL database.

Have you ever been in a situation where you needed to move data quickly from one system to another? Airbyte is a tool that can save the day when it comes to fast, scalable ELT data pipelines.

Airbyte is an open-source data integration platform that enables users to quickly and easily move data from a source to a destination, e.g. from the cloud to an on-premise application.

Thanks to its simple design, developers can easily create data pipelines with Airbyte to move data from one platform to another. They can also manage those pipelines, automate data synchronization, and monitor data flows.

Airbyte makes it easy to connect to almost any data source, including databases, SaaS applications, cloud storage, and more, using its catalog of connectors. If the connector you need is not supported, you can build your own.

In this tutorial, we'll:

  • Deploy the Airbyte application

  • Create source and destination connections

  • Transfer data from HubSpot to a PostgreSQL database

Setting up the environment

Airbyte comes in different offerings: Airbyte Open Source, Airbyte Cloud, and Airbyte Enterprise. We will deploy the open-source version on our local machine, but remember that you can also deploy it elsewhere, such as on a cloud VM (AWS EC2, GCP, Azure) or on Kubernetes.

First, we need to make sure we have the following installed in our environment:

  • Docker engine

  • Docker compose plugin

You can check using the following commands:

# checking if docker is installed
docker --version

And for Docker compose,

# Checking if docker-compose is installed
docker-compose --version

After fulfilling the prerequisites, we can set up Airbyte using the following commands:

# cloning Airbyte from GitHub
git clone https://github.com/airbytehq/airbyte.git

# switching into Airbyte directory
cd airbyte

# starting Airbyte using docker
docker compose up

Once the Git repository is cloned, we change into the airbyte directory and invoke docker compose up from there.
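
If you want to confirm that the containers came up, you can use standard Docker Compose commands from the same directory. This is an optional sanity check, not an Airbyte-specific step:

# listing the Airbyte containers and their status
docker compose ps

# optionally, following the combined container logs
docker compose logs -f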

When the containers are up, go to http://localhost:8000/, and you will see the Airbyte app UI running and ready to start ETL-ing! To log in, we will use the default credentials (username airbyte, password password), but you can change them in the .env file.
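
As a minimal sketch of changing the credentials, assuming a docker-compose distribution of Airbyte where the basic-auth credentials live in the .env file at the repository root (the exact variable names, such as BASIC_AUTH_USERNAME and BASIC_AUTH_PASSWORD, may differ between Airbyte versions, so check your own .env):

# showing the current basic-auth settings (variable names are assumptions)
grep BASIC_AUTH .env

# after editing the values in .env, restart the containers so they take effect
docker compose down
docker compose up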

Creating a source

We will search and select our source connector from the built-in connectors provided by Airbyte. Since we will be sourcing our data from HubSpot, we will select HubSpot as our source type.

Note: We first need to retrieve an access token for a HubSpot private app. HubSpot used to provide API keys, but they have since been deprecated in favor of private app access tokens.
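
As an optional sanity check before plugging the token into Airbyte, you can verify it against HubSpot's CRM v3 contacts endpoint. HUBSPOT_TOKEN below is a placeholder for your own private app token:

# verifying the private app access token against HubSpot's CRM API
# HUBSPOT_TOKEN is a placeholder; export your own token first
export HUBSPOT_TOKEN="your-private-app-token"
curl -s "https://api.hubapi.com/crm/v3/objects/contacts?limit=1" \
  -H "Authorization: Bearer $HUBSPOT_TOKEN"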

We will set up our source by completing the form Airbyte provides, which captures the connection details of our source system.

These are the values that are needed to create the source successfully:

  • Name: Hubspot

  • Source type: Hubspot

  • Access Token: My Hubspot Access Token for the private app

  • start_date: I used 2017-01-25T00:00:00Z, which is the machine-readable timestamp for 1/25/2017

Creating the destination

It is time to create the destination where we will persist our data. In our case, we will use a PostgreSQL database as the destination. These are the values I used:

  • Name: I called it “Postgres-app” but it could be anything you want

  • Host: Use host.docker.internal; you can also use localhost

  • Port: 5432

  • Database Name: Mine is “postgres.”

  • Schema: I left it as “public.” You might want to use schemas for organizational purposes. Schemas are “namespaces” and you can think of them as folders for your tables.

  • User: Again, mine is “postgres”

  • Password: You can leave this blank unless you set a password. I had set mine to “password”.
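
If you don't already have a Postgres instance running, here is a minimal sketch for starting one with Docker that matches the values above (the container name and password here are illustrative assumptions, not something Airbyte requires):

# starting a local Postgres container matching the settings above
# the official postgres image defaults to user "postgres" and database "postgres"
docker run --name airbyte-postgres-destination \
  -e POSTGRES_PASSWORD=password \
  -p 5432:5432 \
  -d postgres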

From there you can test your connection, and you should see a message that says, “All connection tests passed!”

Creating the ELT connection

We've already created a source and a destination; now all we have to do is tell Airbyte we want to extract data from the source and load it into the destination.

We'll go back to the Destinations screen, open the “Postgres-app” destination, click “add source,” and choose the source. For me, this is the source I created called “Hubspot.”

Airbyte will then test both the source and the destination. Once both tests succeed, you can set up your sync.

There are many settings available! Fortunately, you can leave most of them alone until you wish to be more precise about how you save and organize your data.

Set the Sync frequency to "manual" for the time being, and deselect all Hubspot objects other than Contacts.

Starting with Contacts is a good idea because the initial load will finish much more quickly and the analyses will still be rather compelling. Later, you can load more objects for more complex analysis.

At the bottom of the screen, select "Set up connection."

"Huraaaaay!!" You have created your first connection successfully. Click “Sync now” and start the ETL job!

After the sync has finished running, you will see lots of logs.

What’s happening in Postgres now?

Airbyte first loads all the data into temporary (raw) tables that it constructs, and then replicates that data into the final tables. The final tables are structured in a way that makes analysis a little simpler. This is where more of the T and L of the ETL process happens!

You can return to Postico throughout the sync to refresh the table list (cmd+R) and see new tables as they are created.
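
If you prefer the command line to a GUI client like Postico, here is a quick sketch using psql with the connection values from the destination step. Note that the exact table names Airbyte creates vary by version (older versions keep _airbyte_raw_* staging tables alongside the final ones), so the contacts table name below is an assumption:

# listing the tables Airbyte created in the public schema
psql -h localhost -U postgres -d postgres -c "\dt public.*"

# counting the synced rows in the contacts table (name may vary by Airbyte version)
psql -h localhost -U postgres -d postgres -c "SELECT count(*) FROM public.contacts;"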

Conclusion

In this tutorial, we successfully set up Airbyte locally. We then created a source connection (HubSpot) and a destination connection (PostgreSQL) and synced data from the source to the destination.