AWS Big Data Blog
Manage data transformations with dbt in Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Amazon Redshift enables you to use your data to acquire new insights for your business and customers while keeping costs low.
Beyond price-performance, customers want to manage data transformations (SQL SELECT statements written by data engineers, data analysts, and data scientists) in Amazon Redshift with features such as modular programming and data lineage documentation.
dbt (data build tool) is a framework that supports these features and more to manage data transformations in Amazon Redshift. There are two interfaces for dbt:
- dbt CLI – Available as an open-source project
- dbt Cloud – A hosted service with added features including an IDE, job scheduling, and more
In this post, we demonstrate some features in dbt that help you manage data transformations in Amazon Redshift. We also provide the dbt CLI and Amazon Redshift workshop to get started using these features.
Manage common logic
dbt enables you to write SQL in a modular fashion. This improves maintainability and productivity because common logic can be consolidated (maintain a single instance of logic) and referenced (build on existing logic instead of starting from scratch).
The following figure is an example showing how dbt consolidates common logic. In this example, two models rely on the same subquery. Instead of replicating the subquery, dbt allows you to create a model for the subquery and reference it later.
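As a minimal sketch of this pattern, the shared subquery can live in its own model and be referenced with dbt's ref function. The model names base_public_users and rpt_tech_all_users come from the lineage example later in this post; the second report model, the column names, and the filters are hypothetical and only illustrate the idea. Each comment header marks a separate file in the dbt project.

```sql
-- File: models/base_public_users.sql
-- The shared subquery, written once. Columns and filter are assumed for illustration.
select
    user_id,
    email,
    department
from public.users
where is_active = true

-- File: models/rpt_tech_all_users.sql
-- Builds on the shared model instead of repeating the subquery.
select
    user_id,
    email
from {{ ref('base_public_users') }}
where department = 'Tech'

-- File: models/rpt_finance_all_users.sql (hypothetical second consumer)
select
    user_id,
    email
from {{ ref('base_public_users') }}
where department = 'Finance'
```

If the shared logic changes, only base_public_users needs to be edited, and both report models pick up the change the next time they are built.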
The concept of referencing isn’t limited to logic related to subqueries. You can also use referencing for logic related to fields.
The following is an example showing how dbt consolidates common logic related to fields. In this example, a model applies the same CASE statement to two fields. Instead of replicating the CASE statement for each field, dbt allows you to create a macro containing the CASE statement and reference it later.
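The macro approach might look like the following sketch. The macro name yes_no_to_boolean, the column names, and the mapped values are all hypothetical; the point is only that the CASE logic is defined once and called once per field. Each comment header marks a separate file.

```sql
-- File: macros/yes_no_to_boolean.sql (hypothetical macro)
-- Takes a column name and returns the CASE logic applied to that column.
{% macro yes_no_to_boolean(column_name) %}
    case
        when {{ column_name }} = 'Yes' then true
        when {{ column_name }} = 'No' then false
        else null
    end
{% endmacro %}

-- File: models/rpt_user_preferences.sql (hypothetical model)
-- The same CASE logic is applied to two fields without being written twice.
select
    user_id,
    {{ yes_no_to_boolean('opted_in_email') }} as opted_in_email,
    {{ yes_no_to_boolean('opted_in_sms') }} as opted_in_sms
from {{ ref('base_public_users') }}
```

When dbt compiles the model, each macro call is replaced by the full CASE statement for the given column.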
How is a model in dbt subsequently created in Amazon Redshift? dbt provides you with the command dbt run, which materializes models as views or tables in your targeted Amazon Redshift cluster. You can try this out in the dbt CLI and Amazon Redshift workshop.
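Whether a model becomes a view or a table is controlled by its materialization setting. dbt materializes models as views by default; the following sketch assumes you want one model built as a table instead, using a config block at the top of the model file. The column names are hypothetical.

```sql
-- File: models/rpt_tech_all_users.sql
-- Ask dbt to build this model as a table rather than the default view.
{{ config(materialized='table') }}

select
    user_id,
    email
from {{ ref('base_public_users') }}
```

Running dbt run after this change would rebuild rpt_tech_all_users as a table in the target schema.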
Manage common data mappings
Although you can use macros to manage data mappings (for example, mapping “1” to “One” and “2” to “Two”), an alternative is to maintain data mappings in files and manage the files in dbt.
The following is an example of how dbt manages common data mappings. In this example, a model applies one-to-one data mappings on a field. Instead of creating a macro for the one-to-one data mappings, dbt allows you to create a seed for the one-to-one data mappings in the form of a CSV file and then reference it later.
Creating or updating a seed is a two-step process. First, create or update the CSV seed file. Then run the command dbt seed, which loads the CSV seed as a table in your targeted Amazon Redshift cluster. Models can reference the seed only after this table exists.
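A seed is referenced the same way as a model, with the ref function. The following sketch assumes a hypothetical seed file named number_names.csv that holds the mapping of 1 to One and 2 to Two mentioned earlier; the model name and the level_code column are also hypothetical. The CSV contents are shown as comments because the seed itself is a plain CSV file, not SQL.

```sql
-- File: seeds/number_names.csv (hypothetical seed), contents:
--   number_code,number_name
--   1,One
--   2,Two

-- File: models/rpt_users_with_level_name.sql (hypothetical model)
-- Joins to the seed table instead of hard-coding the mapping in a CASE statement.
select
    u.user_id,
    n.number_name as level_name
from {{ ref('base_public_users') }} as u
left join {{ ref('number_names') }} as n
    on u.level_code = n.number_code
```

Adding a new mapping then only requires a new row in the CSV file followed by another dbt seed run.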
Manage data lineage documentation
After you have created models and seeds in dbt, and used dbt’s referencing capability, dbt provides you with a method to generate documentation on your data transformations.
You can run the command dbt docs generate followed by dbt docs serve to launch a locally hosted website containing documentation on your dbt project. When you choose a model on the locally hosted website, information about the model is displayed, including columns in the final view or table, dependencies to create the model, and the SQL that is compiled to create the view or table. The following screenshot shows an example of this documentation.
You can also visualize dependencies, which makes the documentation easier to navigate during impact analysis. In the following example graph, we can see that the model rpt_tech_all_users is built by referencing the model base_public_users, which in turn references the table users in the public schema.
Conclusion
This post covered how you can use dbt to manage data transformations in Amazon Redshift. As you explore dbt, you will come across other features like hooks, which you can use to manage administrative tasks, for example, continuous granting of privileges.
For a hands-on experience with dbt CLI and Amazon Redshift, we have a workshop with step-by-step instructions to help you create your first dbt project and explore the features mentioned in this post—models, macros, seeds, and hooks. Visit dbt CLI and Amazon Redshift to get started.
If you have any questions or suggestions, leave your feedback in the comments section. If you need any further assistance to optimize your Amazon Redshift implementation, contact your AWS account team or a trusted AWS partner.
About the authors
Randy Chng is an Analytics Acceleration Lab Solutions Architect at Amazon Web Services. He works with customers to accelerate their Amazon Redshift journey by delivering proofs of concept on key business problems.
Sean Beath is an Analytics Acceleration Lab Solutions Architect at Amazon Web Services. He delivers proofs of concept with customers on Amazon Redshift, helping customers drive analytics value on AWS.