Implement data warehousing solution using dbt on Amazon Redshift

Amazon Redshift is a cloud data warehousing service that provides high-performance analytical processing based on a massively parallel processing (MPP) architecture. Building and maintaining data pipelines is a common challenge for all enterprises. Managing the SQL files, integrating cross-team work, incorporating all software engineering principles, and importing external utilities can be a time-consuming task that requires complex design and lots of preparation.

dbt (DataBuildTool) offers this mechanism by introducing a well-structured framework for data analysis, transformation and orchestration. It also applies general software engineering principles like integrating with git repositories, setting up DRYer code, adding functional test cases, and including external libraries. This mechanism allows developers to focus on preparing the SQL files per the business logic, and the rest is taken care of by dbt.

In this post, we look into an optimal and cost-effective way of incorporating dbt within Amazon Redshift. We use Amazon Elastic Container Registry (Amazon ECR) to store our dbt Docker images and AWS Fargate as an Amazon Elastic Container Service (Amazon ECS) task to run the job.

How does the dbt framework work with Amazon Redshift?

dbt has an Amazon Redshift adapter module named dbt-redshift that enables it to connect and work with Amazon Redshift. All the connection profiles are configured within the dbt profiles.yml file. In an optimal environment, we store the credentials in AWS Secrets Manager and retrieve them.

The following code shows the contents of profile.yml:

SampleProject:

target: dev
outputs:
dev:
type: redshift
host: "{{ env_var('DBT_HOST') }}"
user: "{{ env_var('DBT_USER') }}"
password: "{{ env_var('DBT_PASSWORD') }}"
port: 5439
dbname: "{{ env_var('DBT_DB_NAME') }}"
schema: dev
threads: 4
keepalives_idle: 240 # default 240 seconds
connect_timeout: 10 # default 10 seconds
sslmode: require
ra3_node: true

The following diagram illustrates the key components of the dbt framework:

The primary components are as follows:

Models – These are written as a SELECT statement and saved as a .sql file. All the transformation queries can be written here which can be materialized as a table or view. The table refresh can be full or incremental based on the configuration. For more information, refer SQL models.
Snapshots – These implements type-2 slowly changing dimensions (SCDs) over mutable source tables. These SCDs identify how a row in a table changes over time.
Seeds – These are CSV files in your dbt project (typically in your seeds directory), which dbt can load into your data warehouse using the dbt seed command.
Tests – These are assertions you make about your models and other resources in your dbt project (such as sources, seeds, and snapshots). When you run dbt test, dbt will tell you if each test in your project passes or fails.
Macros – These are pieces of code that can be reused multiple times. They are analogous to “functions” in other programming languages, and are extremely useful if you find yourself repeating code across multiple models.

These components are stored as .sql files and are run by dbt CLI commands. During the run, dbt creates a Directed Acyclic Graph (DAG) based on the internal reference between the dbt components. It uses the DAG to orchestrate the run sequence accordingly.

Multiple profiles can be created within the profiles.yml file, which dbt can use to target different Redshift environments while running. For more information, refer to Redshift set up.

Solution overview

The following diagram illustrates our solution architecture.

The workflow contains the following steps:

The open source dbt-redshift connector is used to create our dbt project including all the necessary models, snapshots, tests, macros and profiles.
A Docker image is created and pushed to the ECR repository.
The Docker image is run by Fargate as an ECS task triggered via AWS Step Functions. All the Amazon Redshift credentials are stored in Secrets Manager, which is then used by the ECS task to connect with Amazon Redshift.
During the run, dbt converts all the models, snapshots, tests and macros to Amazon Redshift compliant SQL statements and it orchestrates the run based on the internal data lineage graph maintained. These SQL commands are run directly on the Redshift cluster and therefore the workload is pushed to Amazon Redshift directly.
When the run is complete, dbt will create a set of HTML and JSON files to host the dbt documentation, which describes the data catalog, compiled SQL statements, data lineage graph, and more.

Prerequisites

You should have the following prerequisites:

A good understanding of the dbt principles and implementation steps.
An AWS account with user role permission to access the AWS services used in this solution.
Security groups for Fargate to access the Redshift cluster and Secrets Manager from Amazon ECS.
A Redshift cluster. For creation instructions, refer to Create a cluster.
An ECR repository: For instructions, refer to Creating a private repository
A Secrets Manager secret containing all the credentials for connecting to Amazon Redshift. This includes the host, port, database name, user name, and password. For more information, refer to Create an AWS Secrets Manager database secret.
An Amazon Simple Storage (Amazon S3) bucket to host documentation files.

Create a dbt project

We are using dbt CLI so all commands are run in the command line. Therefore, install pip if not already installed. Refer to installation for more information.

To create a dbt project, complete the following steps:

Install dependent dbt packages:
pip install dbt-redshift
Initialize a dbt project using the dbt init <project_name> command, which creates all the template folders automatically.
Add all the required DBT artifacts.
Refer to the dbt-redshift-etlpattern repo which includes a reference dbt project. For more information about building projects, refer to About dbt projects.

In the reference project, we have implemented the following features:

SCD type 1 using incremental models
SCD type 2 using snapshots
Seed look-up files
Macros for adding reusable code in the project
Tests for analyzing inbound data

The Python script is prepared to fetch the credentials required from Secrets Manager for accessing Amazon Redshift. Refer to the export_redshift_connection.py file.

Prepare the run_dbt.sh script to run the dbt pipeline sequentially. This script is placed in the root folder of the dbt project as shown in sample repo.

-- Import the dependent external libraries
dbt deps --profiles-dir . --project-dir .

-- Create tables based on the seed files
dbt seed --profiles-dir . --project-dir .

-- Run all the model files
dbt run --profiles-dir . --project-dir .

-- Run all the snapshot files
dbt snapshot --profiles-dir . --project-dir .

-- Run all inbuilt and custom test cases prepared
dbt test --profiles-dir . --project-dir .

-- Generate dbt documentation files
dbt docs generate --profiles-dir . --project-dir .

--Copying dbt outputs to s3 bucket - for hosting
aws s3 cp --recursive --exclude="*" --include="*.json" --include="*.html" dbt/target/ s3://<bucketName>/REDSHIFT_POC/

Create a Docker file in the parent directory of the dbt project folder. This step builds the image of the dbt project to be pushed to the ECR repository.

FROM python:3

ADD dbt_src /dbt_src

RUN pip install -U pip

# Install DBT libraries
RUN pip install --no-cache-dir dbt-core

RUN pip install --no-cache-dir dbt-redshift

RUN pip install --no-cache-dir boto3

RUN pip install --no-cache-dir awscli

WORKDIR /dbt_src

RUN chmod -R 755 .

ENTRYPOINT [ "/bin/sh", "-c" ]

CMD ["./run_dbt.sh"]

Upload the image to Amazon ECR and run it as an ECS task

To push the image to the ECR repository, complete the following steps:

Retrieve an authentication token and authenticate your Docker client to your registry:

aws ecr get-login-password --region <region_name> | docker login --username AWS --password-stdin <repository_name>

Build your Docker image using the following command:

docker build -t <image tag> .

After the build is complete, tag your image so you can push it to the repository:

docker tag <image tag>:latest <repository_name>:latest

Run the following command to push the image to your newly created AWS repository:

docker push <repository_name>/<image tag>:latest

On the Amazon ECS console, create a cluster with Fargate as an infrastructure option.
Provide your VPC and subnets as required.
After you create the cluster, create an ECS task and assign the created dbt image as the task definition family.
In the networking section, choose your VPC, subnets, and security group to connect with Amazon Redshift, Amazon S3 and Secrets Manager.

This task will trigger the run_dbt.sh pipeline script and run all the dbt commands sequentially. When the script is complete, we can see the results in Amazon Redshift and the documentation files pushed to Amazon S3.

You can host the documentation via Amazon S3 static website hosting. For more information, refer to Hosting a static website using Amazon S3.
Finally, you can run this task in Step Functions as an ECS task to schedule the jobs as required. For more information, refer to Manage Amazon ECS or Fargate Tasks with Step Functions.

The dbt-redshift-etlpattern repo now has all the code samples required.

Cost for executing dbt jobs in AWS Fargate as an Amazon ECS task with minimal operational requirements would take around $1.5 (cost_link) per month.

Clean up

Complete the following steps to clean up your resources:

Delete the ECS Cluster you created.
Delete the ECR repository you created for storing the image files.
Delete the Redshift Cluster you created.
Delete the Redshift Secrets stored in Secrets Manager.

Conclusion

This post covered the basic implementation of using dbt with Amazon Redshift in a cost-efficient way by using Fargate in Amazon ECS. We described the key infrastructure and configuration set-up with a sample project. This architecture can help you take advantage of the benefits of having a dbt framework to manage your data warehouse platform in Amazon Redshift.

For more information about dbt macros and models for Amazon Redshift internal operation and maintenance, refer to the following GitHub repo. In subsequent post, we’ll explore the traditional extract, transform, and load (ETL) patterns that you can implement using the dbt framework in Amazon Redshift. Test this solution in your account and provide feedback or suggestions in the comments.

About the Authors

Seshadri Senthamaraikannan is a data architect with AWS professional services team based in London, UK. He is well experienced and specialised in Data Analytics and works with customers focusing on building innovative and scalable solutions in AWS Cloud to meet their business goals. In his spare time, he enjoys spending time with his family and play sports.

Mohamed Hamdy is a Senior Big Data Architect with AWS Professional Services based in London, UK. He has over 15 years of experience architecting, leading, and building data warehouses and big data platforms. He helps customers develop big data and analytics solutions to accelerate their business outcomes through their cloud adoption journey. Outside of work, Mohamed likes travelling, running, swimming and playing squash.