AWS Big Data Blog

Build a DataOps platform to break silos between engineers and analysts

Organizations across the globe are striving to provide better service to internal and external stakeholders by enabling various divisions across the enterprise, such as customer success, marketing, and finance, to make data-driven decisions. Data teams are the key enablers in this process, and usually consist of multiple roles, such as data engineers and analysts. However, according to an IDC report from early 2020, these roles often act in silos, which is the biggest barrier to organizations developing their intelligence capabilities and hinders data-driven decision-making.

In this post, we discuss and build a data platform that fosters effective collaboration between engineers and analysts. We show you how to enable data analysts to transform data in Amazon Redshift by using software engineering practices—DataOps. We demonstrate how analytics transformations, defined as data build tool (dbt) models, can be submitted as part of a CI/CD pipeline, and then be scheduled and orchestrated by Apache Airflow.

Siloed data teams

Roles within a data team are very well defined in most companies. Data engineers are responsible for creating and maintaining the infrastructure such as data lakes, data warehouses, and CI/CD pipelines. They also build pipelines to ingest data from different sources. Data analysts have domain knowledge and are responsible for serving analytics requests from different stakeholders, such as marketing and business development teams, so that a company can make data-driven decisions.

However, data analysts often rely on data engineers to implement business logic because of the technical knowledge required and the complexity of applying changes to production systems. As the company grows, this process becomes a bottleneck and creates silos between engineers and analysts, which results in delays and overall inefficiency in serving the business with data insights.

Overview of modern data team and DataOps

We can break these silos by implementing the DataOps methodology. Teams can operationalize data analytics with automation and processes to shorten data analytics cycles. In this setup, data engineers enable data analysts to implement business logic by following defined processes, and therefore deliver results faster. Engineers can then focus on evolving the data platform and system implementation to further streamline the process for analysts.

To implement the DataOps process for data analysts, you can complete the following steps:

  1. Implement business logic and tests in SQL.
  2. Submit code to a Git repository.
  3. Perform code review and run automated tests.
  4. Run the code in a production data warehouse based on a defined schedule.


Analysts need a way to write data transformations in plain SQL, without the overhead of knowing how different systems interact. They also need to define dependencies between transformations. The open-source data build tool (dbt) enables data analysts to transform data in data warehouses using SQL and to test those transformations. This tool covers the “T” in ELT (extract, load, transform). With dbt, all transformations are expressed as SQL SELECT statements called models, which can be tested through schema and data tests.

We also need to enable data analysts to define the schedule on which transformations are run. There are a few tools that we could use for scheduling and orchestration, such as AWS Step Functions and Apache Airflow. In this post, we use Airflow, which is an open-source workflow management platform. Airflow handles dependency management through the concept of Directed Acyclic Graphs (DAGs), which provide a powerful mechanism to express complex dependencies between tasks. It also comes with a user-friendly web interface that enables observability over defined workflows.
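For example, a minimal Airflow DAG that expresses a linear dependency between three tasks might look like the following sketch (the task names are illustrative, and the import path for the no-op operator varies slightly between Airflow 1.10 and 2.x):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator  # airflow.operators.dummy_operator in Airflow 1.10

with DAG(
    dag_id="example_dependencies",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform")
    publish = DummyOperator(task_id="publish")

    # Airflow builds the DAG from these dependency declarations
    extract >> transform >> publish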

Solution overview

Let’s define an architecture that supports our data platform. Containerized Airflow is deployed on AWS Fargate. As part of the setup, it uses an Amazon RDS for PostgreSQL database as a metadata store, and Amazon ElastiCache for Redis as a Celery backend. Docker images for Fargate are stored in Amazon Elastic Container Registry (Amazon ECR). Whenever Airflow runs a dbt workflow, it creates a new Fargate task that triggers dbt transformations in the Amazon Redshift data warehouse.
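The following sketch illustrates this mechanism with the Amazon provider’s ECS operator for Airflow. The cluster, task definition, and container names are placeholders rather than the exact values used in the repository, and the operator’s name and import path differ between Airflow releases:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

with DAG(
    dag_id="dbt_on_fargate_sketch",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # Launch a one-off Fargate task that runs dbt against Amazon Redshift
    run_dbt = ECSOperator(
        task_id="run_dbt_transformations",
        cluster="dataops-cluster",       # placeholder ECS cluster name
        task_definition="dbt-task",      # placeholder Fargate task definition
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "dbt", "command": ["dbt", "run"]}  # placeholder container name
            ]
        },
        # network_configuration with your VPC subnets and security groups is also
        # required; see the "Run transformations as a data analyst" section
    )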

Airflow DAG and dbt transformation definitions are pushed to an Amazon Simple Storage Service (Amazon S3) bucket as part of the CI/CD pipeline, so Airflow can pull the latest changes in near-real time. Updates to the infrastructure are also performed through CI/CD, which triggers AWS CodeBuild to apply the changes. Data engineers and analysts can interact with Airflow’s user interface through the Application Load Balancer.

The following diagram illustrates our solution architecture.

Prerequisites

You can deploy this solution from your local machine. Make sure you have the following prerequisites:

  • An AWS account with permissions to create the resources used in this post
  • The AWS Command Line Interface (AWS CLI) installed and configured
  • The AWS CDK CLI installed
  • Docker, to build and push container images to Amazon ECR
  • Git, GNU Make, and Python 3, to clone the repositories and create the virtual environment

Clone and set up the repositories

To get started, clone the following GitHub repository to your local machine:

bash $ > git clone https://github.com/aws-samples/dataops-platform-airflow-dbt dataops-platform-airflow-dbt

The analytics folder contains the dbt models and Airflow DAGs. The dataops-infra folder contains the code and instructions to deploy our architecture.

Navigate to dataops-infra and create a Python virtual environment:

bash $ > make venv

This rule creates a virtual environment in infra/venv and installs all the required dependencies for this project.

Generate a Fernet key

Apache Airflow uses Fernet to encrypt passwords for connection and variable configurations. To generate a Fernet key, run the following command:

bash $ > make generate_fernet

This outputs a generated Fernet key to your console:

This is your Fernet key: <your fernet key>
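Under the hood, generating a Fernet key amounts to a single call to the cryptography library that Airflow itself relies on; the make rule presumably wraps something similar to this snippet:

from cryptography.fernet import Fernet

# Produces a URL-safe, base64-encoded 32-byte key suitable for Airflow's fernet_key setting
print(f"This is your Fernet key: {Fernet.generate_key().decode()}")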

Store it as a new secret to AWS Secrets Manager:

bash $ > aws secretsmanager create-secret --name fernetKeySecret --description "Fernet key for Airflow" --secret-string YOUR_FERNET_KEY

Set environment variables

This project uses environment variables to set required parameters for AWS CDK. To set up variables, you need to edit the .env file located in the dataops-infra folder. The following shows an example of the file:

AWS_REGION=eu-west-1
BUCKET_NAME=my-unique-dataops-bucket-name
FERNET_SECRET_ARN=arn:aws:secretsmanager:eu-west-1:123456789012:secret:airflow/fernet_key-AbCdEf
ECR_URI=<YOUR_ACCOUNT_ID>.dkr.ecr.eu-west-1.amazonaws.com

You can use the AWS CLI to get the Amazon Resource Name (ARN) for your Fernet key secret:

bash $ > aws secretsmanager list-secrets
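These variables are consumed when the AWS CDK app synthesizes the stacks. As a rough sketch only (assuming a Python CDK v1 app; the stack class and parameter names here are hypothetical, not the ones in the repository), the app entry point might read them like this:

import os

from aws_cdk import core


class DataPlatformStack(core.Stack):
    # Hypothetical stack that consumes the .env parameters
    def __init__(self, scope, construct_id, *, bucket_name, fernet_secret_arn, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        # ... define the S3 bucket, ECR repositories, Redshift cluster, and so on ...


app = core.App()
DataPlatformStack(
    app,
    "data-platform",
    bucket_name=os.environ["BUCKET_NAME"],
    fernet_secret_arn=os.environ["FERNET_SECRET_ARN"],
    env=core.Environment(region=os.environ["AWS_REGION"]),
)
app.synth()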

Deploy infrastructure services

As a first step, you deploy the services that compose the core of this infrastructure and don’t change frequently. These include the VPC, S3 bucket, Amazon ECR repositories, ElastiCache for Redis, Amazon Redshift, and RDS for PostgreSQL.

To deploy these services, run the following code:

bash $ > make bootstrap

The AWS CDK CLI asks for permission to deploy specific resources, so make sure to acknowledge by entering y in your terminal and pressing Enter. The bootstrap process can take up to 20 minutes.

Now you can upload Docker images for Airflow and dbt to Amazon ECR. To do so, run the following code:

bash $ > make push_images

Finally, you can deploy Airflow:

bash $ > make deploy

Seed sample data for Amazon Redshift

To test the dbt transformations in this project, you need to insert sample data into the Amazon Redshift data warehouse. For instructions, see Step 6: Load sample data from Amazon S3. This tutorial guides you through creating sample tables and inserting data from an existing S3 bucket. For your convenience, you can run the tutorial using the query editor feature in Amazon Redshift. Connect to your cluster in the query editor using the following credentials:

  • Database name: redshift-db
  • Database user: redshift-user

You need the ARN for the AWS Identity and Access Management (IAM) role that was created during deployment to run COPY statements from the tutorial. Run the following command to get the ARN:

bash $ > aws redshift describe-clusters --query 'Clusters[*].IamRoles[*].IamRoleArn'

Run transformations as a data analyst

Before you can run the redshift_transformations DAG, you need to update its network configuration to reflect your VPC subnets and security groups. To do that, navigate to the analytics folder and edit the airflow_dags/redshift_transformations.py file. Update the securityGroups and subnets values of the network_configuration dictionary. To get the ID of the required security group, run the following command:

bash $ > aws ec2 describe-security-groups \
    --filters Name=group-name,Values=airflow-sg-cdk \
    --query "SecurityGroups[*].{Name:GroupName,ID:GroupId}"

You can use the AWS CLI to get the IDs of required subnets. The following command assumes that you deployed this project to the eu-west-1 Region. Make sure to use the correct subnet names if you choose a different Region and run the following code:

bash $ > aws ec2 describe-subnets \
    --filters Name=tag:Name,Values=isolated-subnet-eu-west-1a,isolated-subnet-eu-west-1b --query 'Subnets[*].SubnetId'
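With the security group and subnet IDs in hand, the updated values in redshift_transformations.py should look roughly like the following (the IDs below are placeholders; the dictionary follows the awsvpcConfiguration structure that the ECS operator passes to Fargate):

network_configuration = {
    "awsvpcConfiguration": {
        "securityGroups": ["sg-0123456789abcdef0"],   # ID returned for airflow-sg-cdk
        "subnets": [
            "subnet-0123456789abcdef0",               # isolated-subnet-eu-west-1a
            "subnet-0fedcba9876543210",               # isolated-subnet-eu-west-1b
        ],
    }
}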

Finally, upload these changes to your S3 bucket. Navigate to the analytics folder and run the following code:

bash $ > aws s3 sync . s3://<YOUR_BUCKET_NAME> --exclude '.git/*' --exclude '.github/*' --exclude '.gitignore' --exclude 'README.md'

Let’s now get the domain name of the Application Load Balancer (ALB) so you can access the Airflow UI. Navigate to your load balancer on the AWS Management Console and copy its DNS name value. Open the copied link in your browser and log in to the Airflow UI with the following credentials:

  • Username: user
  • Password: bitnami

When you open the Airflow UI, you can see two DAGs were deployed: example_dag and redshift_transformations. The first contains examples of a few Airflow tasks. The second triggers the creation of a new Fargate task, which then runs dbt transformations in Amazon Redshift. We have five dbt models located in analytics/dbt_dags/models that contain our transformations:

-- All time gross sales model
SELECT eventid, sum(pricepaid) total_price
FROM sales
GROUP BY eventid

-- Percentile of all time gross sales
SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile
FROM {{ ref('all_gross_sales_model') }}

-- Events in 99.9 percentile
SELECT eventname, total_price
FROM {{ ref('percentile_sales_model') }} q, event e
WHERE q.eventid = e.eventid
AND percentile = 1
ORDER BY total_price DESC

-- Top 10 buyers by quantity
SELECT buyerid, sum(qtysold) total_quantity
FROM sales
GROUP BY buyerid
ORDER BY total_quantity DESC
LIMIT 10

-- Extract user data from top buyers
SELECT firstname, lastname, total_quantity
FROM {{ ref('top_buyers_by_quantity_model') }} q, users
WHERE q.buyerid = userid
ORDER BY q.total_quantity DESC

Let’s trigger the redshift_transformations DAG from the Airflow user interface.

When the DAG run is complete, we can open the Amazon Redshift query editor and look for five materialized views created by dbt. You can also access logs for this specific run from both the Airflow UI and Amazon CloudWatch console under the ecs/dbt-cdk log group.


You can now change the dbt models and upload the code to your S3 bucket. If you trigger the redshift_transformations DAG again, it runs your latest code. Airflow also syncs the code from the S3 bucket every minute, so any changes or new DAGs become available automatically. Alternatively, you can store AWS access keys as secrets in your GitHub repository. In that case, pushing your code to GitHub triggers an action that deploys the code to your S3 bucket.

Update your infrastructure as a data engineer

You can make changes to the infrastructure in the dataops-infra folder. To apply those changes, you can use the AWS CDK CLI and deploy only specific services. For example, to deploy changes to the Airflow services, run the following code:

bash $ > cdk deploy airflow

Alternatively, you can use bootstrap and deploy rules in the Makefile like you did previously.

If you want an automated CI/CD pipeline to deploy changes to the infrastructure, you can use the GitHub actions we created, which rely on the GitHub action for CodeBuild built by AWS. To make this work, create a new CodeBuild project, then edit the actions in dataops-infra/.github/workflows to replace <AWS_ACCOUNT_ID>, <AWS_REGION>, and <YOUR_CODEBUILD_PROJECT_NAME> with your values.

Now you can push changes to your GitHub repository, which triggers CodeBuild to deploy the infrastructure.

Cleanup

When you’re finished exploring this solution, you might want to clean up your account to avoid unnecessary costs. To delete all resources that you created in this post, run the following command:

bash $ > make destroy

Conclusion

Automating processes such as testing, deployment, scheduling, and orchestration in data teams has a significant impact. It enables faster delivery of data insights with uncompromised data quality and effective cross-team collaboration.

In this post, we showed you how to build a reliable, scalable, and highly available data platform. You also built a CI/CD pipeline with AWS services like CodeBuild, Amazon ECR, and Amazon S3. You experimented with Fargate, Secrets Manager, and Amazon Redshift. We talked about the DataOps culture and demonstrated an end-to-end solution with Apache Airflow and dbt, which you can reuse for your own implementation.

You can find the source code for this post on GitHub. You can dive deep into the implementation and use it to build your own solution.


About the Authors

Dzenan Softic is a Solutions Architect at AWS. He works with startups in incubators and accelerators to help them define and implement their ideas. Previously, his main focus was in data engineering and infrastructure. He was also a startup founder and machine learning researcher.


Riccardo Bassetto is a Solutions Architect at AWS, helping startup customers implement their ideas in the cloud.


David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.


As a Solutions Architect for Amazon Web Services, Tim Berger supports startups across Germany, Austria, and Switzerland in building scalable and cost-effective applications on the AWS Cloud.