AWS Big Data Blog
Build a DataOps platform to break silos between engineers and analysts
Organizations across the globe are striving to provide better service to internal and external stakeholders by enabling various divisions across the enterprise, like customer success, marketing, and finance, to make data-driven decisions. Data teams are the key enablers in this process, and usually consist of multiple roles, such as data engineers and analysts. However, according to an IDC report from early 2020, these roles often work in silos, which is one of the biggest barriers to organizations developing their intelligence capabilities and hinders the data-driven decision-making process.
In this post, we discuss and build a data platform that fosters effective collaboration between engineers and analysts. We show you how to enable data analysts to transform data in Amazon Redshift by using software engineering practices—DataOps. We demonstrate how analytics transformations, defined as data build tool (dbt) models, can be submitted as part of a CI/CD pipeline, and then be scheduled and orchestrated by Apache Airflow.
Siloed data teams
Roles within a data team are very well defined in most companies. Data engineers are responsible for creating and maintaining the infrastructure such as data lakes, data warehouses, and CI/CD pipelines. They also build pipelines to ingest data from different sources. Data analysts have domain knowledge and are responsible for serving analytics requests from different stakeholders, such as marketing and business development teams, so that a company can make data-driven decisions.
However, data analysts often rely on data engineers to implement business logic, because of the technical knowledge required and the complexity of applying changes to production systems. As the company grows, this process becomes a bottleneck and creates silos between engineers and analysts, which results in delays and overall inefficiency in serving the business with data insights.
Overview of modern data team and DataOps
We can break these silos by implementing the DataOps methodology. Teams can operationalize data analytics with automation and processes to reduce the time in data analytics cycles. In this setup, data engineers enable data analysts to implement business logic by following defined processes and therefore deliver results faster. Engineers can now focus on evolving the data platform and system implementation to further streamline the process for analysts.
To implement the DataOps process for data analysts, you can complete the following steps:
- Implement business logic and tests in SQL.
- Submit code to a Git repository.
- Perform code review and run automated tests.
- Run the code in a production data warehouse based on a defined schedule.
Analysts need a way to write data transformations in plain SQL, without the overhead of knowing how different systems interact. They also need to define dependencies between transformations. The open-source data build tool (dbt) enables data analysts to transform data in data warehouses using SQL and to test those transformations. This tool does the “T” in ELT (extract, load, transform). With dbt, all transformations are expressed as SQL SELECT statements called models, which can be tested through schema and data tests.
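For example, once a dbt project is configured, an analyst can build and test models from the command line. The following is a minimal sketch; the model name is hypothetical, and a dbt profile pointing at your warehouse is assumed to be in place:

```bash
# Compile and run every model in the project against the warehouse
dbt run

# Run a single model and everything downstream of it (model name is hypothetical)
dbt run --models orders_daily+

# Execute the schema and data tests defined for the models
dbt test
```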
We also need to enable data analysts to define the schedule on which transformations are run. There are a few tools that we could use for scheduling and orchestration, such as AWS Step Functions and Apache Airflow. In this post, we use Airflow, which is an open-source workflow management platform. Airflow handles dependency management through the concept of Directed Acyclic Graphs (DAGs), which provide a powerful mechanism to express complex dependencies between tasks. It also comes with a user-friendly web interface that enables observability over defined workflows.
Solution overview
Let’s define an architecture that supports our data platform. Containerized Airflow is deployed on AWS Fargate. As part of the setup, it uses an Amazon RDS for PostgreSQL database as a metadata store, and Amazon ElastiCache for Redis as a Celery backend. Docker images for Fargate are stored in Amazon Elastic Container Registry (Amazon ECR). Whenever Airflow runs a dbt workflow, it creates a new Fargate task that triggers dbt transformations in the Amazon Redshift data warehouse.
Airflow tasks and dbt transformations definitions are pushed to an Amazon Simple Storage Service (Amazon S3) bucket as part of the CI/CD pipeline, so Airflow can pull the latest changes in near-real time. Updates to the infrastructure are performed through CI/CD, which triggers AWS CodeBuild, which applies the changes. Data engineers and analysts can interact with Airflow’s user interface through the Application Load Balancer.
The following diagram illustrates our solution architecture.
Prerequisites
You can deploy this solution from your local machine. Make sure you have the following prerequisites:
- An AWS account
- npm package manager
- Docker
- The AWS Command Line Interface (AWS CLI) installed
- AWS Cloud Development Kit (AWS CDK) version 1.90.0 installed
- Python version equal to or greater than 3.6
Clone and set up the repositories
To get started, clone the following GitHub repository to your local machine:
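For example (the repository URL is a placeholder; use the repository referenced in this post):

```bash
git clone <repository-url>
cd <repository-name>
```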
The analytics folder contains dbt and Airflow DAGs. The dataops-infra folder contains code and instructions to deploy our architecture.
Navigate to dataops-infra and create a Python virtual environment:
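A minimal sketch, assuming the repository's Makefile exposes a rule for this step (the rule name venv is an assumption):

```bash
# Create the virtual environment and install project dependencies (rule name is an assumption)
make venv
```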
This rule creates a virtual environment in infra/venv and installs all the required dependencies for this project.
Generate a Fernet key
Apache Airflow uses Fernet to encrypt passwords for connection and variable configurations. To generate a Fernet key, run the following command:
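For example, you can generate a key with the cryptography package, which is the approach documented by Airflow (the package must be installed in your Python environment):

```bash
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```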
This outputs the generated Fernet key to your console. Store it as a new secret in AWS Secrets Manager:
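A minimal sketch using the AWS CLI (the secret name airflow/fernet_key is an assumption; use whatever name your configuration expects):

```bash
aws secretsmanager create-secret \
    --name airflow/fernet_key \
    --secret-string "<your-fernet-key>"
```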
Set environment variables
This project uses environment variables to set the required parameters for AWS CDK. To set them up, edit the .env file located in the dataops-infra folder. The following shows an example of the file:
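The variable names below are purely illustrative, to show the shape of the file; use the names that the AWS CDK code in the repository actually reads:

```bash
# Illustrative example only: the variable names are hypothetical
AWS_ACCOUNT_ID=123456789012
AWS_REGION=eu-west-1
FERNET_SECRET_ARN=arn:aws:secretsmanager:eu-west-1:123456789012:secret:airflow/fernet_key-AbCdEf
```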
You can use the AWS CLI to get the Amazon Resource Name (ARN) for your Fernet key secret:
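For example (assuming the secret name used earlier):

```bash
aws secretsmanager describe-secret \
    --secret-id airflow/fernet_key \
    --query ARN \
    --output text
```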
Deploy infrastructure services
As a first step, you deploy the services that compose the core of this infrastructure and don’t change frequently. These include a VPC, an S3 bucket, Amazon ECR repositories, Redis, Amazon Redshift, and PostgreSQL.
To deploy these services, run the following code:
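A minimal sketch, assuming the Makefile wraps the underlying AWS CDK commands in a bootstrap rule (the rule name follows from the description below; you could also run cdk deploy for the corresponding stacks directly):

```bash
# Deploy the core infrastructure stacks (VPC, S3, ECR, Redis, Amazon Redshift, Postgres)
make bootstrap
```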
The AWS CDK CLI asks for permission to deploy specific resources, so make sure to acknowledge by entering y in your terminal and pressing Enter. The bootstrap process can take up to 20 minutes.
Now you can upload Docker images for Airflow and dbt to Amazon ECR. To do so, run the following code:
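One way to do this manually with Docker and the AWS CLI; the repository names, Dockerfile paths, account ID, and Region are assumptions, and the repository may also provide a Makefile rule for this step:

```bash
# Authenticate Docker against your private Amazon ECR registry
aws ecr get-login-password --region eu-west-1 | \
    docker login --username AWS --password-stdin <account-id>.dkr.ecr.eu-west-1.amazonaws.com

# Build, tag, and push the Airflow image (repository name and path are assumptions)
docker build -t airflow ./airflow
docker tag airflow:latest <account-id>.dkr.ecr.eu-west-1.amazonaws.com/airflow:latest
docker push <account-id>.dkr.ecr.eu-west-1.amazonaws.com/airflow:latest

# Repeat for the dbt image (repository name and path are assumptions)
docker build -t dbt ./dbt
docker tag dbt:latest <account-id>.dkr.ecr.eu-west-1.amazonaws.com/dbt:latest
docker push <account-id>.dkr.ecr.eu-west-1.amazonaws.com/dbt:latest
```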
Finally, you can deploy Airflow:
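A minimal sketch, assuming either a Makefile deploy rule or a dedicated CDK stack for Airflow (both names are assumptions):

```bash
make deploy
# or deploy the Airflow stack directly (stack name is an assumption):
# cdk deploy airflow
```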
Seed sample data for Amazon Redshift
To test dbt transformations in this project, you need to insert sample data to the Amazon Redshift data warehouse. For instructions, see Step 6: Load sample data from Amazon S3. This tutorial guides you through creating sample tables and inserting data from an existing S3 bucket. For your convenience, you can run this tutorial using the query editor feature in Amazon Redshift. Connect to your cluster in the query editor using the following credentials:
- Database name – redshift-db
- Database user – redshift-user
You need the ARN for the AWS Identity and Access Management (IAM) role that was created during deployment to run COPY statements from the tutorial. Run the following command to get the ARN:
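One way to look it up, assuming the role name contains redshift (the filter string is an assumption; adjust it to match the role created by the deployment):

```bash
aws iam list-roles \
    --query "Roles[?contains(RoleName, 'redshift')].Arn" \
    --output text
```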
Run transformations as a data analyst
Before you can run the redshift_transformations DAG, you need to update its network configuration to reflect your VPC subnets and security groups. To do that, navigate to the analytics folder and edit the airflow_dags/redshift_transformations.py file. Update the securityGroups and subnets values of the network_configuration dictionary. To get the ID of the required security group, run the following command:
You can use the AWS CLI to get the IDs of the required subnets. The following command assumes that you deployed this project to the eu-west-1 Region; if you chose a different Region, make sure to use the correct subnet names:
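A sketch that filters subnets by their Name tag (the tag value and Region are assumptions; adjust them to your deployment):

```bash
aws ec2 describe-subnets \
    --region eu-west-1 \
    --filters "Name=tag:Name,Values=*Private*" \
    --query "Subnets[].SubnetId" \
    --output text
```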
Finally, upload these changes to your S3 bucket. Navigate to the analytics folder and run the following code:
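For example, using aws s3 sync (the bucket name is a placeholder and the key prefixes are assumptions based on the folder layout; use the bucket created during deployment):

```bash
# Upload the Airflow DAGs and dbt models to the bucket that Airflow syncs from
aws s3 sync airflow_dags/ s3://<your-dataops-bucket>/dags/
aws s3 sync dbt_dags/ s3://<your-dataops-bucket>/dbt/
```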
Let’s now get the domain name of the Application Load Balancer (ALB) so you can access the Airflow UI. Navigate to the ALB on the AWS Management Console and copy the DNS name value. Open the copied link in your browser and log in to the Airflow UI with the following credentials:
- Username – user
- Password – bitnami
When you open the Airflow UI, you can see two DAGs were deployed: example_dag and redshift_transformations. The first contains examples of a few Airflow tasks. The second triggers the creation of a new Fargate task, which then runs dbt transformations in Amazon Redshift. We have five dbt models located in analytics/dbt_dags/models that contain our transformations.
Let’s trigger the redshift_transformations DAG from the Airflow user interface.
When the DAG run is complete, we can open the Amazon Redshift query editor and look for the five materialized views created by dbt. You can also access logs for this specific run from both the Airflow UI and the Amazon CloudWatch console under the ecs/dbt-cdk log group.
You can now change dbt models and upload the code to your S3 bucket. If you trigger the redshift_transformations DAG again, it runs your latest code. Airflow also syncs the code from the S3 bucket every minute, so any changes or new DAGs become available. Alternatively, you can create AWS access keys and store them for this repository; in that case, pushing your code to the GitHub repository triggers an action that deploys the code to the S3 bucket.
Update your infrastructure as a data engineer
You can make changes to the infrastructure in the dataops_infra folder. To apply those changes, you can use the AWS CDK CLI and deploy only specific services. For example, to deploy changes to the Airflow services, run the following code:
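A minimal sketch (the stack name airflow is an assumption; run cdk ls first to see the stacks that are actually defined):

```bash
# From the dataops_infra folder, list the stacks and deploy only the Airflow stack
cdk ls
cdk deploy airflow
```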
Alternatively, you can use the bootstrap and deploy rules in the Makefile like you did previously.
If you want an automated CI/CD pipeline to deploy changes to the infrastructure, you can use the GitHub actions we created, which rely on the AWS-built GitHub action for CodeBuild. To make this work, create a new CodeBuild project, then edit the actions in dataops_infra/.github/workflows to replace <AWS_ACCOUNT_ID>, <AWS_REGION>, and <YOUR_CODEBUILD_PROJECT_NAME> with your values.
Now you can push changes to your GitHub repository, which triggers CodeBuild to deploy the infrastructure.
Cleanup
When you’re finished exploring this solution, you might want to clean up your account to avoid unnecessary costs. To delete all resources that you created in this post, run the following command:
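A minimal sketch, assuming you tear everything down with the AWS CDK CLI (the repository may also provide a Makefile rule for this):

```bash
# From the infrastructure folder, destroy all deployed stacks
cdk destroy --all
```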
Conclusion
Automating processes such as testing, deployment, scheduling, and orchestration in data teams has a significant impact. It enables faster delivery of data insights with uncompromised data quality and effective cross-team collaboration.
In this post, we showed you how to build a reliable, scalable, and highly available data platform. You also built a CI/CD pipeline with AWS services like CodeBuild, Amazon ECR, and Amazon S3. You experimented with Fargate, Secrets Manager, and Amazon Redshift. We talked about the DataOps culture and demonstrated an end-to-end solution with Apache Airflow and dbt, which you can reuse for your own implementation.
You can find the source code from this post on GitHub. You can deep dive into the implementation and use it to build your own solution.
About the Authors
Dzenan Softic is a Solutions Architect at AWS. He works with startups in incubators and accelerators to help them define and implement their ideas. Previously, his main focus was in data engineering and infrastructure. He was also a startup founder and machine learning researcher.
Riccardo Bassetto is a Solutions Architect at AWS, helping startup customers implement their ideas in the cloud.
David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.
As a Solutions Architect for Amazon Web Services, Tim Berger supports startups across Germany, Austria, and Switzerland in building scalable and cost-effective applications on the AWS Cloud.