How Etleap uses AWS DMS to make data analytics pipelines easy and robust
This blog post was co-written with Archie Menzies from Etleap.
You can use AWS Database Migration Service (AWS DMS) as a robust and configurable solution for change data capture (CDC) from all major databases into AWS. In this post, we discuss how Etleap integrates AWS DMS into its AWS-native ETL/ELT solution, and the usability features that Etleap offers on top of AWS DMS such as proactive source validation, destination schema management, and error handling and notifications.
Etleap is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency and Amazon Redshift Service Ready designation. In this post, we show how to configure data pipelines in Etleap, using AWS DMS to capture database changes from source databases, and easily monitor and maintain these pipelines with Etleap.
AWS DMS is a cloud service that simplifies the migration of relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.
With AWS DMS, you can perform one-time migrations, and you can replicate ongoing changes to keep sources and targets in sync.
The following diagram illustrates the AWS DMS replication process.
Etleap is a fully-managed extract, transform, and load (ETL) solution built on AWS that doesn’t require extensive engineering work to set up, maintain, and scale. It reduces ETL setup time from months to days, automates most maintenance work, and provides transparency and full control over data pipelines.
Etleap runs either as a hosted software as a service (SaaS) solution or inside your virtual private cloud (VPC).
Etleap provides both a no-code UI and an API solution for ingesting data from any data source to targets such as Amazon Redshift, Snowflake, and Amazon Simple Storage Service (Amazon S3) with AWS Glue. Etleap supports ingesting from all major database sources via SQL queries, and it uses AWS DMS to capture changed data when access to database transaction logs is available.
AWS DMS is a great solution for capturing data from databases and loading it to analytics destinations, and it even provides a set of transformation rules to manipulate the data. However, this transformation rule set requires a certain level of engineering experience to use. Etleap builds on this by offering a comprehensive set of transformations that can be used without an engineering background.
Etleap integrates with AWS DMS to capture newly inserted, updated, and deleted rows from database source tables as well as changes to the tables’ schemas. Etleap’s wrangler then allows you to specify powerful and expressive transformations without needing to write a single line of code. These transformations are then applied before loading into the data warehouse or data lake. All data captured by AWS DMS is stored in Amazon S3 for the duration of the pipeline’s lifetime so the changes can be replayed and transformed in other ways without needing to fetch the data again from the source.
The following diagram shows the architecture of an Etleap AWS DMS-enabled pipeline. The arrows show the flow of data from source to destination.
Setting up CDC via AWS DMS on your own can be complicated, and requires deep familiarity with both AWS DMS and the source databases. Etleap simplifies this through its connection and pipeline creation wizards, and reduces the effort of monitoring and managing ongoing CDC pipelines.
Create a CDC pipeline in Etleap with no AWS DMS expertise
Let’s start by looking at the simple three-step process to create a pipeline that uses AWS DMS for CDC. In this example, the database source is PostgreSQL and the destination is an Amazon Redshift data warehouse.
The first step is to configure the PostgreSQL source connection in Etleap, ensuring that CDC is enabled and that all prerequisites are met. Etleap automatically validates all requirements before creating the connection, ensuring that the source is configured correctly for AWS DMS.
The following screenshots show the configuration for a PostgreSQL connection in Etleap (left), and the expanded CDC group with the CDC prerequisite checklist (right).
Next, you configure the Amazon Redshift destination and open the pipeline creation wizard, as shown in the following screenshot. You can select the tables to replicate and specify which pre-loaded transformations should be applied to the extracted data, if any.
The final step confirms the pipeline creation, as shown in the following screenshot. Upon completing the wizard, a pipeline is set up for each selected source table.
Etleap automatically handles the creation and configuration of the endpoints and tasks in AWS DMS, which capture changes to these tables from the PostgreSQL Write-Ahead Log (WAL).
An initial snapshot of each table is taken using SQL queries. Etleap then seamlessly begins ingesting the changes captured by AWS DMS. All data, both from the initial snapshot and AWS DMS, is transformed according to the pipeline’s transformation script, which is configurable in Etleap’s wrangler.
From this point, Etleap will manage the pipeline, detect any changes to the data or the table schema, and notify you of any errors or required actions. Etleap will monitor for data correctness and latency to ensure both data integrity and recency meets the requirements you set.
Intuitive AWS DMS issue resolution with Etleap
In addition to being challenging to set up, CDC pipelines can also consume countless hours of engineering time in maintenance and issue resolution. This is another key area where Etleap saves you time and minimizes your need to navigate the details of AWS DMS.
Errors in AWS DMS can occur for a variety of reasons. Two of the most common types of error causes are an incorrectly configured source, and a source entering a bad state (for example, if the log files are truncated before AWS DMS is able to read all of the changes).
Reduce errors by ensuring proper CDC setup
Etleap’s pipeline setup steps ensure that users don’t experience the first type of errors. Etleap runs through a series of source database validation tests and provides an alert for any issues. This additional check is added when creating a source database connection in Etleap that is enabled for CDC. This helps you catch common problems early so you can confidently create CDC-ready pipelines in minutes.
Reduce errors by detecting AWS DMS failures during ongoing replication
After your pipelines have been set up, Etleap constantly monitors the state of the replication in AWS DMS. Any errors during CDC generate alerts to users. You can see errors quickly while disruption to analytics is still minimal. This allows you to resolve these errors with minimal concern for the boilerplate involved in recovering from a failed replication state.
Catch errors quickly
A key challenge to minimizing the disruption of pipeline errors is fast detection. When a task fails in AWS DMS, it enters a failed state. At this point, data stops flowing and any downstream processes that rely on that data quickly become out of date. The problem can quickly snowball the longer the error goes undetected or ignored.
Etleap natively monitors and sends alerts about any pipeline issues. It monitors for specific AWS DMS failures for relevant pipelines and data latency for all pipelines.
The state of the operational PostgreSQL AWS DMS task appears on the user’s connection page, as shown in the following screenshot.
When an issue arises, you receive an email notification about the nature of the problem and specific steps to resolve the issue.
Guided issue resolution
After you’re aware of an issue, the next key challenge is minimizing the effort to respond to and resolve the issue. Errors that occur during database replication are usually domain-specific and surfaced directly from the source database. This requires a detailed understanding of the replication system and the specific issue before you can determine how to respond.
Etleap is able to parse and interpret errors that occur in AWS DMS. It then presents this information as specific instructions to return the CDC process to an operational state. This gives a less technical user the power to monitor and manage CDC replication directly without needing to involve an engineering team. If it’s an unknown error, Etleap’s team of support agents are alerted and engaged to help resolve the issue. After the issue has been resolved, Etleap restores the data integrity by automatically re-syncing the source to remove any gaps in the data.
End-to-end example: Error detection, notification, and resolution
The following is a specific example of this straightforward troubleshooting workflow. The user has already established a PostgreSQL pipeline in Etleap, which is actively replicating changes to Amazon Redshift.
An issue arises when AWS DMS can no longer read changes because the replication slot has been dropped on the source. The AWS DMS task fails and the flow of data stops. When this occurs, Etleap registers that the AWS DMS migration task is in a failed state. It parses the task log and determines the reason for the failure as a missing replication slot.
Etleap triggers an email notification alerting you that there is an issue with the source connection. An Amazon Simple Notification Service (Amazon SNS) notification is sent as well if you have configured it. The notification explains the issue and provides instructions to restart the AWS DMS task to automatically recreate the replication slot.
You can find information about the issue on Etleap’s CDC page, as shown in the following screenshot. This includes how long the replication has been halted, the prescribed steps to resolve the problem, and controls for restarting the AWS DMS migration task.
After the issue on the source has been addressed, you can choose Restart, and the AWS DMS migration task is restarted with a new replication slot. Etleap will then continue to monitor the restarting of the task.
After the task is confirmed to be in a good state, you receive another notification. Etleap checks that there are no other issues with the migration task before automatically refreshing the pipelines that ingest from this source connection. This refresh ensures that there are no gaps in Amazon Redshift caused by expired WAL logs and that data integrity is maintained.
AWS DMS is a powerful tool for both one-time database migrations as well as for replicating ongoing changes to keep sources and targets in sync. In this post, we showed how AWS DMS powers the CDC feature of Etleap’s AWS-native ETL/ELT solution. Etleap can bring the power of AWS DMS to a wider, less technical audience through straightforward setup validation, schema management, and error handling and notifications.
About the authors
Archie Menzies is an engineer at Etleap with 3 years of experience in developing robust and performant ETL software. He holds an MEng in Computer Science from the University of Southampton and a Big Data Specialty certification from AWS.
Mahesh Kansara is a database engineering manager at Amazon Web Services. He closely works with development and engineering teams to improve the migration and replication service. He also works with our customers to provide guidance and technical assistance on various database and analytical projects, helping them improving the value of their solutions when using AWS.
Sathisan Vannadil is a Senior Partner Solutions Architect at Amazon Web Services (AWS). His primary focus is on helping independent software vendor (ISV) partners design and build solutions at scale on AWS. Prior to AWS, Sathisan held several technical positions and has over 20 years of experience in the field of data and analytics.