AWS Database Blog
Amazon Aurora PostgreSQL zero-ETL integration with Amazon Redshift is generally available
Data-driven decision making is critical for modern businesses, driving customer experiences and fueling generative AI, innovation, and growth. According to Forrester, advanced insights-driven organizations are 8.5 times more likely to report at least 20% revenue growth than companies in the early stages of putting data at the center of business decisions. However, this growth hinges on simplifying data management at scale, including data preparation, data movement, and data security.
At AWS, we have heard from customers that this data wrangling is challenging. Therefore, we’re investing in a zero-ETL future, enabling builders to focus on creating value from data rather than preparing it for analysis. This zero-ETL vision involves integrating data services to eliminate the need for you to build complex extract, transform, and load (ETL) pipelines, while providing fast, scalable data transformation capabilities through AWS Glue for tasks like cleansing, deduplication, and combining datasets. Additionally, AWS continues to expand support for diverse data sources, such as the general availability of zero-ETL integrations from Amazon Aurora MySQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for MySQL to Amazon Redshift (see the announcement posts for Aurora MySQL and Amazon RDS for MySQL, respectively). These improvements empower organizations to seamlessly act on their data across departments, services, on-premises tools, and third-party applications.
In this post, we discuss the challenges with traditional data analytics mechanisms, our approach to solving them, and how you can use Amazon Aurora PostgreSQL-Compatible Edition zero-ETL integration with Amazon Redshift, which is generally available as of October 15, 2024.
Challenges with analytical workloads using PostgreSQL
Setting up and managing data pipelines for analytical workloads using PostgreSQL logical replication can be a complex and challenging endeavor. Much of this complexity stems from PostgreSQL logical replication's reliance on replication slots, which track the changes that need to be replicated. One major challenge is the lack of native support in PostgreSQL logical replication for handling Data Definition Language (DDL) changes. Although DDL changes are common in most database workloads, they can disrupt the PostgreSQL replication process and potentially lead to data inconsistencies or replication failures.
When a DDL statement is run, such as creating or altering a table, the replication process might stop and require manual operator intervention. PostgreSQL logical replication also requires a replication slot per logical database, along with a corresponding consumer for each slot. Creating and continuously managing hundreds or thousands of replication slots can become impractical for some customers. Another challenge lies in handling schema changes and data type mismatches between the source and target systems. A transactional database often undergoes frequent schema changes to accommodate evolving business requirements, whereas analytical systems and data lakes may use entirely different data models and data type representations. Making sure that these changes are propagated correctly and consistently across the ETL pipeline can be a complex and error-prone task, requiring careful monitoring and manual intervention, such as running frequent validation checks to verify that queries against the source and destination systems yield the same results.
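To make the replication slot overhead concrete, the following minimal sketch shows the kind of per-database plumbing that self-managed PostgreSQL logical replication requires: creating a slot and continuously draining it from a consumer. The host, credentials, slot name, and the built-in test_decoding plugin are illustrative placeholders; this is not part of the zero-ETL setup described later in this post.

```python
# Minimal sketch of manual logical replication plumbing in vanilla PostgreSQL.
# Hostnames, credentials, and the slot name below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="source-postgres.example.com",
    dbname="salesdb",
    user="replication_admin",
    password="example-password",
)
conn.autocommit = True  # run slot management calls outside an explicit transaction

with conn.cursor() as cur:
    # One slot is needed for every logical database you want to replicate.
    # test_decoding is a built-in output plugin, used here purely for illustration.
    cur.execute(
        "SELECT pg_create_logical_replication_slot(%s, %s);",
        ("salesdb_analytics_slot", "test_decoding"),
    )

    # Some consumer must drain the slot continuously, or WAL accumulates on the
    # source. Note that DDL statements are not emitted by logical decoding, so
    # schema changes still have to be applied to the target out of band.
    cur.execute(
        "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes(%s, NULL, NULL);",
        ("salesdb_analytics_slot",),
    )
    for lsn, xid, change in cur.fetchall():
        print(lsn, xid, change)

conn.close()
```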
Depending on your workload, enabling logical replication on a PostgreSQL database can introduce noticeable performance overhead. The PostgreSQL logical decoding process requires additional CPU and memory resources to parse and extract data from the write-ahead log (WAL). This overhead can be particularly significant in high-throughput environments with heavy write workloads, because the logical decoding process must keep up with the rate of changes being generated and competes with the workload for CPU resources. Additionally, batch data processing jobs can place significant load on the transactional database when large amounts of data are extracted during peak operational hours, potentially impacting the performance and responsiveness of the database for end users or other dependent systems. This can lead to increased latency, resource contention, and potential scalability issues.
Although some of these challenges can be mitigated by scaling up the deployed infrastructure, it can increase costs. In some cases, database administrators must carefully tune the logical decoding settings, allocate dedicated resources for the logical decoding process, or explore alternative replication strategies, such as physical replication or third-party change data capture (CDC) solutions. However, these alternative approaches often introduce their own management complexities and trade-offs, further complicating the setup and management of data pipelines for analytical workloads.
Setting up and managing ETL data pipelines from a PostgreSQL transactional database to an analytical database or data lake can be a complex and resource-intensive endeavor, both in terms of staffing and infrastructure cost. One of the primary challenges arises from the need to maintain data consistency and integrity throughout the ETL process while minimizing the impact on the transactional database's performance. Traditionally, ETL pipelines involve extracting data from the source database, applying transformations to conform to the target system's schema and requirements, and then loading the transformed data into the analytical system or data lake. This process often relies on batch operations, which can introduce latency and potentially result in stale data in the target system.
Maintaining data quality and consistency during the ETL process at scale can be a significant challenge. Transactional databases may contain data access limitations, complex data relationships, constraints, and business rules that need to be preserved and enforced in the target system. Developing and maintaining ETL processes that accurately capture and translate these relationships can be time-consuming and prone to errors, especially in large and complex data models. Traditional batch ETL processes might struggle to keep up with the increasing demand for near real-time data integration, requiring significant infrastructure investments and complex load balancing and parallelization strategies.
Additionally, managing and monitoring ETL pipelines can be challenging, especially in environments with multiple data sources, complex transformations, and stringent Service Level Agreements (SLAs). Providing end-to-end visibility, troubleshooting issues, and maintaining audit trails can be resource-intensive and may require specialized tools and expertise.
Aurora PostgreSQL zero-ETL integration with Amazon Redshift
Aurora PostgreSQL zero-ETL integration with Amazon Redshift enables you to run near real-time analytics and machine learning (ML) on petabytes of transactional data in Amazon Redshift by offering a no-code solution for making transactional data from Aurora available in Amazon Redshift within seconds of being written.
With zero-ETL integrations, you can choose the Aurora databases containing the data you want to analyze with Amazon Redshift, and zero-ETL replicates the schema, historical data, and any data or schema changes into Amazon Redshift. You can use Aurora for transaction processing and Amazon Redshift for analytics because there are no shared compute resources, yielding a performant and operationally stable solution. You can replicate data from multiple Aurora database clusters into the same Redshift instance to derive holistic insights across several applications or across different workloads of the same application, while also consolidating your core analytics assets, gaining significant cost savings and operational efficiencies.
With zero-ETL, you can take advantage of the rich analytical capabilities of Amazon Redshift, such as high-performance SQL, built-in ML and Spark integrations, workload optimization autonomics, concurrency scaling, materialized views, data sharing, data masking, and direct access to multiple data stores and data lakes. This allows transaction processing on data in Aurora while simultaneously using Amazon Redshift for analytics workloads such as reporting and dashboards.
We launched the preview of Aurora PostgreSQL zero-ETL integration with Amazon Redshift at AWS re:Invent 2023, and we are now announcing its general availability.
The following figure illustrates how Aurora PostgreSQL zero-ETL integration with Amazon Redshift works. The zero-ETL integration automatically seeds the data from an Aurora PostgreSQL source into Amazon Redshift, sets up ongoing CDC streaming, and reseeds the data as needed. Additionally, the zero-ETL integration provides ongoing operational capabilities through automated monitoring and recovery mechanisms.
General availability experience
Aurora PostgreSQL zero-ETL integration with Amazon Redshift delivers exceptional data replication performance by taking advantage of investments in Aurora that enhance PostgreSQL's logical replication capabilities. We have mitigated traditional PostgreSQL logical replication challenges by separating the storage of the transaction log from the storage of the logical replication records (WALs). We have built a specialized storage layer optimized for storing and decoding WALs. This storage layer has added logic that makes it possible for the database engine to push down the filtering, sorting, and ordering of WALs to the storage layer. This allows us to increase parallelization, reduce locking, and shorten commit times in the database engine, while still achieving ordered writes. These architectural improvements allow zero-ETL's performance to scale vertically with your workload better than the existing logical replication solution, as highlighted in our performance benchmarks later in this post.
The general availability of Aurora PostgreSQL zero-ETL integration with Amazon Redshift introduces several features designed to enhance data replication, improve performance, and provide greater flexibility and control over the integration process. One of the most notable is support for expanded DDL events, which allows handling of create, alter, drop, and rename operations on databases, schemas, and tables, including advanced relationships such as cascade operations. This makes sure that schema changes made to the source Aurora PostgreSQL database are propagated to the target Redshift data warehouse, minimizing the need for manual intervention or reseeding and maintaining data consistency. Our DDL support is also designed to be efficient in the amount of data transmitted between the systems.
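As a simple illustration, a routine schema change like the following (the cluster endpoint, table, and column are hypothetical) can be applied to the source database and is reflected in the target Redshift data warehouse by the integration, rather than requiring an out-of-band schema migration or reseed.

```python
# Hypothetical example: a DDL change on the source Aurora PostgreSQL database.
# With expanded DDL support, zero-ETL propagates the schema change to Amazon
# Redshift without manual intervention. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",  # placeholder
    dbname="salesdb",
    user="app_admin",
    password="example-password",
)
with conn.cursor() as cur:
    # A routine schema change on the transactional source.
    cur.execute("ALTER TABLE public.orders ADD COLUMN loyalty_tier text;")
conn.commit()
conn.close()
```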
We also support replication slots using the aforementioned optimized WAL storage, and you can now create and manage replication slots on the Aurora PostgreSQL compute nodes when zero-ETL is enabled. Additionally, zero-ETL integrations can now be enabled or disabled on existing Aurora PostgreSQL clusters running database engine version 16.4 or higher. These features enable integration with existing PostgreSQL replication mechanisms, providing flexibility and compatibility for a diverse set of use cases.
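If you're preparing an existing cluster for zero-ETL, the integration depends on the cluster's enhanced logical replication settings in a custom DB cluster parameter group. The following boto3 sketch shows what such an update could look like; the parameter group name is hypothetical, and the specific parameter names and values are assumptions that you should confirm against the zero-ETL prerequisites in the Aurora documentation.

```python
# Sketch of updating a custom DB cluster parameter group before enabling zero-ETL.
# The parameter group name is hypothetical, and the parameter names/values are
# assumptions; confirm the prerequisites in the Aurora zero-ETL documentation.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-aurora-pg16-zeroetl-params",  # hypothetical name
    Parameters=[
        {
            "ParameterName": "rds.logical_replication",
            "ParameterValue": "1",
            "ApplyMethod": "pending-reboot",
        },
        {
            "ParameterName": "aurora.enhanced_logical_replication",
            "ParameterValue": "1",
            "ApplyMethod": "pending-reboot",
        },
    ],
)
```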
The general availability release also expands data type support to include the majority of data types and data values supported by Aurora PostgreSQL-Compatible, including larger data values, such as those stored as off-row data in Aurora PostgreSQL-Compatible using The Oversized-Attribute Storage Technique (TOAST). You can now replicate data from multiple logical databases within a single Aurora PostgreSQL cluster using a single zero-ETL integration. This minimizes the management overhead of creating and managing individual replication slots for each logical PostgreSQL database while providing a simple mechanism to consolidate data from multiple logical database sources within a cluster.
We are also simplifying the customer experience with respect to data security, costs, and maintenance. During the preview, you could specify only one logical database within the zero-ETL integration, and all the schemas and tables within the selected database of the source Aurora PostgreSQL cluster were replicated to Amazon Redshift. However, you might need additional control to selectively replicate data based on your specific requirements, for example to avoid transmitting personally identifiable information (PII) and to reduce costs. You now have the flexibility to specify which Aurora PostgreSQL databases, schemas, or tables should be replicated to Amazon Redshift. To streamline deployment and management of zero-ETL integrations, you can also use AWS CloudFormation templates to apply infrastructure as code (IaC) practices and facilitate automation. Zero-ETL integrations also support Amazon Redshift enhanced VPC routing for both provisioned and serverless deployments, enabling enhanced network security.
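For example, data filtering is expressed through the integration's data filter at creation time. The following boto3 sketch creates an integration that replicates two logical databases while excluding a schema that holds PII; the ARNs, names, and the exact filter expression syntax shown are placeholders to validate against the zero-ETL data filtering documentation.

```python
# Sketch of creating a zero-ETL integration with a data filter (boto3 RDS API).
# ARNs, names, and the filter expression below are illustrative placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

response = rds.create_integration(
    IntegrationName="aurora-pg-to-redshift-filtered",
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-pg-cluster",
    TargetArn="arn:aws:redshift:us-east-1:123456789012:namespace:example-namespace-id",
    # Include two logical databases, but exclude a schema that holds PII.
    DataFilter="include: salesdb.*.*, include: inventorydb.public.*, exclude: salesdb.pii.*",
)
print(response["IntegrationArn"], response["Status"])
```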
We have also removed the preview limitation to enable full subtransaction support. For additional details on PostgreSQL’s subtransaction processing, refer to Subtransactions. To improve the overall zero-ETL integration creation and ongoing management experience, we have further expanded our error handling and debugging features. These improvements aim to provide faster integration failure detection, graceful error handling, and comprehensive information logging, improving transparency and troubleshooting capabilities.
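To take advantage of the improved failure detection and logging, you can also poll an integration's status and surfaced errors programmatically. The following is a minimal sketch using the boto3 RDS API; the integration identifier is a placeholder.

```python
# Sketch of checking the health of a zero-ETL integration with the boto3 RDS API.
# The integration identifier below is a placeholder.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

resp = rds.describe_integrations(
    IntegrationIdentifier="arn:aws:rds:us-east-1:123456789012:integration:example-integration-id"
)
for integration in resp["Integrations"]:
    print(integration["IntegrationName"], integration["Status"])
    # When an integration needs attention, error details are surfaced here.
    for error in integration.get("Errors", []):
        print(error)
```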
Performance
Based on our performance tests, Aurora PostgreSQL zero-ETL integration with Amazon Redshift processed over 1.4 million transactions per minute (an equivalent of 23.8 million insert, update, or delete row operations per minute) from two Aurora databases and made them available in Amazon Redshift in single-digit seconds. We benchmarked CDC throughput and end-to-end replication lag achieved with two Aurora PostgreSQL zero-ETL integrations to a single Redshift cluster running 12 ra3.4xlarge nodes. The tests were conducted with db.r6i.32xlarge Aurora instances using a HammerDB TPC-C workload and repeated with 128 and 256 virtual users. The following table shows the detailed performance results from our benchmark.
Virtual Users | Number of Integrations | Total Transactions Per Minute | Total Row Operations Per Minute | P50 End-to-End Replication Lag (Seconds)
128 | 2 | 1.27 million | 21.59 million | 8.23
256 | 2 | 1.65 million | 28.05 million | 9.76
How to get started
Zero-ETL integrations are available for Aurora PostgreSQL-Compatible version 16.4 and higher. You can create and manage integrations using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the Amazon RDS API in the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) AWS Regions. For step-by-step guidance on how to set up a zero-ETL integration from Aurora PostgreSQL-Compatible to Amazon Redshift, see Amazon Aurora PostgreSQL and Amazon DynamoDB zero-ETL integrations with Amazon Redshift now generally available.
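After the integration is active, the remaining step on the Amazon Redshift side is to create a destination database from the integration. The following sketch uses the Redshift Data API; the workgroup name, integration ID, and source database name are placeholders, and you should confirm the exact CREATE DATABASE ... FROM INTEGRATION syntax for PostgreSQL sources in the Amazon Redshift documentation.

```python
# Sketch of creating a destination database in Amazon Redshift from a zero-ETL
# integration, using the Redshift Data API. Names and IDs are placeholders, and
# the SQL syntax should be confirmed in the Amazon Redshift documentation.
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

redshift_data.execute_statement(
    WorkgroupName="my-redshift-serverless-workgroup",  # or use ClusterIdentifier
    Database="dev",
    Sql=(
        "CREATE DATABASE zeroetl_salesdb "
        "FROM INTEGRATION '<integration id>' DATABASE salesdb;"
    ),
)
```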
Conclusion
In this post, we discussed the general availability of Aurora PostgreSQL zero-ETL integration with Amazon Redshift, which enables you to run near real-time analytics and ML on your transactional data in Amazon Redshift with minimal effort. This integration uses the enhanced logical replication capabilities of Aurora to overcome traditional PostgreSQL replication challenges, delivering exceptional performance and support for expanded DDL events, larger data types, and nested transactions. With added features for data filtering, cost optimization, and improved manageability, you can seamlessly replicate data from multiple Aurora PostgreSQL databases to Amazon Redshift, consolidate your analytics workloads, and unlock valuable insights rapidly and cost-effectively. For more information, see Working with Aurora zero-ETL integrations with Amazon Redshift and Zero-ETL integrations.
About the author
Abhinav Dhandh is a Product Management Lead on the Amazon Aurora team. He leads the long-term product vision and delivery of critical capabilities, such as zero-ETL, change data capture streaming, and logical replication for Aurora MySQL and Aurora PostgreSQL databases within AWS. He also leads the Aurora MySQL product management with a focus on horizontal scalability, versioning strategy, and customer experience with multiple launches around migrations, upgrades, version currency, availability, and performance. He brings expertise of launching and managing products in diversified verticals, such as cloud computing, advertising, digital media, security, identity protection, and digital learning.