Managing the Evolution of an Amazon Redshift Data Warehouse Using a Declarative Deployment Pipeline
By Sean Beath, Manager, Strategy, AI & Transformation – Deloitte Australia
By Athar Shah, Director, Strategy, AI & Transformation – Deloitte Australia
By Greg Webber, APJC Solutions Architecture Manager (Deloitte) – AWS Australia
By Special thanks to Rick Fraser and Avnish Jain, Analytics Specialist SAs – AWS
Enterprise data warehouses are complex and can consist of tens of thousands of database objects (such as tables, views, or grants) that need to be carefully modified to reflect the changing needs of business, data analytics, and machine learning (ML) teams.
In this post, we will examine an approach to managing the evolution of enterprise-scale data warehouses based on the experience of Deloitte‘s Data and Artificial Intelligence (AI) global practice teams.
We’ll look at a declarative tool developed by Deloitte that can automatically generate data definition language (DDL) statements to align Amazon Redshift’s state to an approved baseline configuration. We’ll also discuss some methods of incorporating generation, approval, and deployment steps into CI/CD workflows and tooling.
Deloitte is an AWS Premier Consulting Partner and Managed Service Provider (MSP) with the Amazon Redshift delivery designation. Deloitte’s end-to-end capabilities and understanding of your business and industry help amplify the transformative value of cloud.
Changes to data warehouse environments supporting the full software development lifecycle (SDLC) can be made inconsistently and lead to environment drift.
If this happens, environments can become impossible to consistently reproduce and cause both DDL and ETL/ELT operations to fail along with subsequent analytics, business intelligence, and machine learning tasks.
Consider the following causes of failure or inconsistency in managing data warehouse object configuration:
- Out-of-sequence DDL statements: Instructing a database to modify a table can fail if the table isn’t in the expected state.
- Forgotten grants: All operations in a database require that the user executing them has the requisite permissions granted. Grant drift—including missing users, roles, or associated grants—can exist between environments and cause ETL operations to fail, including missing users, roles or associated grants.
- Missing views: Views are secondary objects that control the visibility of data in underlying database tables and failures will be experienced if a missing table or misspecified view is referenced in a SQL query, ETL/ELT process, or reporting solution.
- Audit fields: Ensures the existence and consistent definition of audit fields.
- Inconsistent constraints: Whilst Amazon Redshift doesn’t enforce constraints, they are used as planning hints which can affect statistical computations and, consequently, query performance.
- Suboptimal column positioning: Failing to group related columns, failing to push commonly used columns to the left, or neglecting to maintain audit columns to the right per convention can cause presentation issues for end users in business intelligence (BI) and other data exploration tools, leading to reduced usability or even errors.
When managed manually in an enterprise data warehouse, these issues translate to increased operational risk, protracted time to deployment and business value, and increased cost.
The following sections show how such issues can be mitigated without the need for explicit coding, and how they can be incorporated into existing deployment automation tooling.
Many organizations have adopted modern approaches to the delivery of IT projects. The agile methodology and DevOps practices encourage the breaking down of organizational silos to achieve greater collaboration, faster time to value, and the avoidance of issues throughout the SDLC by employing high levels of automation.
The learnings from these movements are permeating business project management, as well as areas of IT beyond traditional application delivery including artificial intelligence (AI), machine learning, and data.
DataOps is an adaptation of the approaches outlined above to apply to the building and ongoing change management of data lakes and lake houses, data staging, crawling, ETL/ELT and data pipelines, analytics, and business intelligence. To do this successfully, cross-team collaboration and automation are just as critical in the data domain as they are in the application domain.
The diagram in Figure 1 below depicts a high-level architecture and tools devised by Deloitte and applied at complex data engagements to address the challenges above. The two main tools we’ll focus on are the Comparator and Deployment tools.
The role of the Comparator is to identify any differences between a baseline Amazon Redshift DDL configuration specified in YAML and a target Amazon Redshift cluster. The Comparator generates any DDL statements that must be applied to address any drift, plus additional logging information summarizing these operations for the benefit of data engineers.
The Deployment tool is responsible for taking the output of the Comparator tool and applying the generated DDL statements to address any identified drift.
Figure 1 – Amazon Redshift Comparator and Deployment tools with associated pipeline stages.
In the following sections, we’ll expand upon the details of each stage in the data definition pipeline.
In the definition stage, a data engineer describes data warehouse objects in YAML files and checks these into a source code repository such as Git or AWS CodeCommit.
Initial commits will define new objects in their base form, whilst later commits will represent both new objects and changes to existing objects. Commits can trigger build and deploy actions, however this is not shown.
The definition stage also allows a data engineer to specify the order in which objects are to be deployed, or to rely on predefined ordering within the solution.
The diagram below shows part of a YAML baseline configuration file used by the Comparator tool. It defines a dimension table
dim_business_address in the
edz schema. The definition includes common information like sort and distribution keys, as well as column names and types.
The definition also specifies a view to be created in the
bdz schema applying any schema rules such as active record only filter or masking of personally identifiable information (PII). Common rules for a schema (not shown) can be specified, which will be transparently applied to ensure audit and other mandatory columns are consistently applied to all table definitions.
Figure 2 – YAML table definition.
The build stage is responsible for taking the YAML data definitions prepared in the definition phase and comparing these to the target Amazon Redshift environment. Where the target environment object differs from the YAML definition, this phase emits DDL statements in a SQL file that, when applied, would bring the environment into line with the YAML definition.
The build stage can also identify a lack of adherence to naming conventions and a range of other standards, such as the existence of audit fields and field ordering, amongst others. It can also take certain actions like performing backups when potentially destructive operations may be involved.
The final object emitted is a log summarizing details of the change list and deployment sequence information. Logs and automatically generated data definition SQL files are stored centrally in Amazon Simple Storage Service (Amazon S3) where they are available to subsequent phases.
Below, we see an example output log file generated from the table YAML configuration shown in Figure 2. It has defined a table in the
edz schema to be created, a
bdz view to be created, and due to a schema rule for
edz there will be a staging table created in the stg schema.
This highlights the power of automatic DDL generation, as a developer hasn’t needed to define the
bdz view or staging table. We also see there’s a collection of other tables that have not been changed, and therefore will have no action applied.
Figure 3 – Example Comparator tool log output.
The deploy stage applies the automatically generated DDL produced in the previous phase to the corresponding Amazon Redshift target cluster, after human review and approval if required.
Like previous stages, AWS CodePipeline can be used to orchestrate CI/CD stages and actions. However, organizations that have based their CI/CD automation around other tools such as Jenkins could base their orchestration around pre-existing tooling, if desired.
The tool supporting the deploy stage operates in a declarative manner. That is, you specify what you want your Amazon Redshift environment to look like, and the tool takes the necessary actions to effect the required changes.
This is in contrast to the imperative approach in which each change would need to be uniquely scripted (possibly per environment, to account for any differences) and carefully sequenced, requiring considerably more effort and possibly introducing errors along the way due to the manual nature of the effort.
In this post, we have seen how the evolution of objects in an Amazon Redshift data warehouse can be managed with best practices from DataOps using a declarative approach as supported by Deloitte’s Comparator and Deployment tools.
We have shown the high-level stages of deployment pipeline, from defining objects in YAML, automatically generating DDL against development, test, and production environments all the way through to review/approval and deployment into a target environment.
We have shown how the deployment pipeline architecture works well with AWS CodeCommit and AWS CodeDeploy services, though it could be made to work with other CI/CD orchestration tooling. The approach to data warehouse management automates complex, granular activities typically requiring many steps and dependencies.
In addition to automatically discovering what needs to be done and taking the necessary actions, we saw that other actions can be automated including enforcement of naming conventions and other standards, logging of actions, and performing backups and clean-up actions. This leads to less effort, fewer errors, and faster time to business value.
For more information, reach out to the Deloitte Data & AI team to see how they can bring this approach and tooling together with extensive experience in big data and analytics to bear on your analytics projects.
Deloitte – AWS Partner Spotlight
Deloitte is an AWS Premier Consulting Partner and MSP. Through a network of professionals, industry specialists, and an ecosystem of alliances, they assist clients in turning complex business issues into opportunities for growth, helping organizations transform in the digital era.
*Already worked with Deloitte? Rate the Partner
*To review an AWS Partner, you must be a customer that has worked with them directly on a project.