AWS Database Blog

Simplify data integration using zero-ETL from self-managed databases to Amazon Redshift

In this post, we demonstrate how to set up a zero-ETL integration between self-managed databases such as MySQL, PostgreSQL, SQL Server, and Oracle to Amazon Redshift. The transactional data from the source gets replicated in near real time on the destination, which processes analytical queries.

Organizations need real-time access to their data across multiple sources to make informed business decisions. Although extract, transform, and load (ETL) processes have traditionally helped integrate data, they can be complex and time-consuming to maintain. ETL is the process of combining, cleaning, and normalizing data from different sources to prepare it for analytics, AI, and machine learning (ML) workloads. Moving to zero-ETL architecture gives your teams more responsive analytics capabilities while eliminating complex data pipeline maintenance, so insights are available when decisions need to be made.

A self-managed database refers to a database environment where you have complete control and responsibility over the database instance running on your own infrastructure, whether that’s an Amazon Elastic Compute Cloud (Amazon EC2) instance, an on-premises virtual machine (VM), or any other cloud provider’s virtual machine.

Solution overview

To create a zero-ETL integration, you specify a self-managed database as the source and an Amazon Redshift data warehouse as the target. The integration replicates data from the source database into the target data warehouse. The following diagram illustrates this architecture.

Prerequisites

Before you begin, ensure you have the following prerequisites in place:

  1. On-premises database server or an EC2 instance with database configured with:
    1. Network connectivity to Amazon Web Services (AWS) using AWS Direct Connect, AWS Site-to-Site VPN, or Amazon Virtual Private Cloud (Amazon VPC).
    2. VPC security groups allowing database port connection to zero-ETL integration.
    3. Appropriate subnets and network configuration for zero-ETL replication traffic.
  2. Database configured for on-going replication. Refer to AWS Database Migration Service (AWS DMS) documentation for more information: Oracle, Microsoft SQL Server, MySQL, or PostgreSQL.
  3. Create a new Redshift cluster or modify an existing cluster to use a custom parameter group with the enable_case_sensitive_identifier parameter set to on, which must be enabled to create zero-ETL integrations in the data warehouse.
  4. Make sure your AWS user or role has the necessary AWS Identity and Access Management (IAM) permissions. Complete the following steps to configure the permissions:
    1. On the IAM console, choose Policies in the navigation pane.
    2. Choose Create policy.
    3. Create a new policy called zeroetl-selfmanaged-integrations using the following JSON:
      {
         "Version":"2012-10-17",
         "Statement":[
            {
               "Sid":"ZetlGlueIntegrationAccess",
               "Effect":"Allow",
               "Action":[
                  "glue:CreateIntegration",
                  "glue:ModifyIntegration",
                  "glue:DeleteIntegration",
                  "glue:DescribeIntegrations",
                  "glue:DescribeInboundIntegrations"
               ],
               "Resource":"*"
            },
            {
               "Sid":"ZetlRedshiftFullAccess",
               "Effect":"Allow",
               "Action":[
                  "redshift:GetResourcePolicy",
                  "redshift:PutResourcePolicy",
                  "redshift:DescribeClusterParameters",
                  "redshift:CreateInboundIntegration",
                  "redshift:DescribeClusters"
               ],
               "Resource":"*"
            },
            {
               "Sid":"ListNetworkSettings",
               "Effect":"Allow",
               "Action":[
                  "ec2:DescribeVpcs",
                  "ec2:DescribeSubnets",
                  "ec2:DescribeSecurityGroups"
               ],
               "Resource":"*"
            },
            {
               "Sid":"ZetlRedshiftDataAPI",
               "Effect":"Allow",
               "Action":[
                  "redshift-data:ExecuteStatement",
                  "redshift-data:CancelStatement",
                  "redshift-data:ListStatements",
                  "redshift-data:GetStatementResult",
                  "redshift-data:DescribeStatement",
                  "redshift-data:ListDatabases",
                  "redshift-data:ListSchemas",
                  "redshift-data:ListTables",
                  "redshift-data:DescribeTable"
               ],
               "Resource":"*"
            },
            {
               "Sid":"DMSIntegrationAccess",
               "Effect":"Allow",
               "Action":[
                  "dms:CreateOutboundIntegration",
                  "dms:ModifyOutboundIntegration",
                  "dms:CreateEndpoint",
                  "dms:DescribeEndpoints",
                  "dms:ModifyEndpoint",
                  "dms:DeleteEndpoint",
                  "dms:TestConnection"
               ],
               "Resource":"*"
            },
            {
               "Sid":"ZetlKMSAccess",
               "Effect":"Allow",
               "Action":[
                  "kms:CreateKey",
                  "kms:DescribeKey",
                  "kms:Encrypt",
                  "kms:Decrypt",
                  "kms:GenerateDataKey",
                  "kms:ListKeys",
                  "kms:CreateAlias",
                  "kms:ListAliases",
                  "kms:CreateGrant"
               ],
               "Resource":"*"
            },
            {
               "Sid":"ZetlSecretsManagerAccess",
               "Effect":"Allow",
               "Action":[
                  "secretsmanager:GetSecretValue",
                  "secretsmanager:PutSecretValue",
                  "secretsmanager:CreateSecret",
                  "secretsmanager:UpdateSecret",
                  "secretsmanager:DeleteSecret",
                  "secretsmanager:DescribeSecret",
                  "secretsmanager:ListSecrets",
                  "secretsmanager:GetResourcePolicy",
                  "secretsmanager:PutResourcePolicy",
                  "secretsmanager:ValidateResourcePolicy"
               ],
               "Resource":"*"
            }
         ]
      }
  5. Attach the policy you created to your IAM user or role permissions.

Solution walkthrough

To create a DMS endpoint, follow these steps:

  1. Open the AWS DMS console.
  2. In the navigation pane, under Migrate or replicate, choose Endpoints.
  3. Choose Create endpoint, select Source endpoint.
  4. Select the Source engine, such as Oracle, MySQL, PostgreSQL and Microsoft SQL Server.
  5. Choose access endpoint type and provide information. Select Choose AWS Secrets Manager to provide a secret ID and IAM role.
  6. Choose Create endpoint.
  7. Upon creation of source DMS endpoint, copy DMS endpoint ARN.

To configure the Redshift cluster destination, follow these steps:

  1. Navigate to the Redshift cluster and choose the Resource policy tab.
  2. Choose Edit Authorized principals.
  3. Enter either the ARN of the AWS user or role or the AWS account ID (IAM principals) that are allowed to create integrations in this namespace.
  4. Choose Save changes.
  5. Choose Edit Authorized integration sources.
  6. Enter the ARN of the DMS source endpoint that’s the data source for the zero-ETL integration.
  7. Choose Save changes.

To create a zero-ETL integration, follow these steps:

  1. Open AWS Glue console.
  2. In the navigation pane, under Data Integration and ETL, select Zero-ETL integrations.
  3. Choose Create zero-ETL integration.
  4. Select the source type, as shown in the following screenshot. It can be any of the self-managed databases such as Microsoft SQL Server, MySQL, Oracle, or PostgreSQL. Choose Next.
  5. On this page, configure the source and target.
    1. To complete source details, Browse DMS endpoints and select the DMS endpoint created in step 1.
    2. Under Data filtering options, in the Choose filter type dropdown menu, select either Include or Exclude, as shown in the following screenshot. Expressions must use Maxwell filter syntax.
    3. In Network settings, provide details such as Virtual private cloud (VPC), Subnets, and VPC security groups to make a successful connection to the source and run zero-ETL integrations, as shown in the following screenshot.
    4. To complete target details, under AWS account select the option appropriate for where the Amazon Redshift data warehouse is hosted, either Use the current account or Specify a different account, as shown in the following screenshot.
  6. For Security and data encryption, you can select either AWS managed KMS key or a custom AWS KMS key managed by AWS Key Management Service (AWS KMS).
  7. In the Integration details section, for Name, enter zeroetl-selfmanaged-redshift and choose Next.
  8. Review the details and choose Create and launch integration.
  9. The newly created zero-ETL integration will show as Active in about a minute, as shown in the following screenshot.

To create a database from the integration in AWS Glue, follow the instructions in the next section.To create your database, complete the following steps:

  1. Open AWS Glue console.
  2. In the navigation pane, under Data integration and ETL, choose zero-ETL integrations.
  3. Choose Create database from integration
  4. Enter Destination database name that will be created in Amazon Redshift.
  5. Enter Source database name from which data will be replicated, as shown in the following screenshot.

To validate the zero-ETL integration, perform the following steps:

  1. Open Amazon Redshift.
  2. In the navigation pane, under Integrations, choose zero-ETL integrations.
  3. Navigate to the zeroetl-selfmanaged-redshift integration. Review the details of zero-ETL integration.
  4. Choose Connect to database and provide Database name and Database user, as shown in the following screenshot.
  5. To view the integration-related metrics published to Amazon CloudWatch, choose the Integration metrics, as shown in the following screenshot.
  6. Switch to Table statistics to view all tables that are currently active or having errors.
  7. Query the following system views and tables in Amazon Redshift to get information about the zero-ETL integrations with Amazon Redshift:
    1. SVV_INTEGRATION provides configuration details of zero-ETL integrations.
    2. SYS_INTEGRATION_ACTIVITY provides completed zero-ETL integrations.
    3. SVV_INTEGRATION_TABLE_MAPPING provides mapping metadata values from source to target.
    4. SVV_INTEGRATION_TABLE_STATE provides information about integration state.
    5. SYS_INTEGRATION_TABLE_ACTIVITY provides information about insert, delete, and update activity of integrations.
    6. SYS_INTEGRATION_TABLE_STATE_CHANGE provides information about table state change log for integrations.
  8. To query the preceding tables or query the data from the replicated database, choose Query data on the Zero-ETL integrations.

Clean up

When you delete zero-ETL integration, your transactional data isn’t deleted from self-managed database or Amazon Redshift, but source database doesn’t send any new data to Amazon Redshift.

To avoid incurring future charges, remove the resources used in this post from your AWS account by taking the following steps:

  1. Delete the zero-ETL integration:
    1. On the AWS Glue console, choose Zero-ETL integrations in the navigation panel.
    2. Select the zero-ETL integration that you want to delete and on the Actions menu, choose Delete.
    3. To confirm the deletion, enter confirm and choose Delete.
  2. Delete Redshift cluster:
    1. On the Amazon Redshift console, choose the Clusters in the navigation panel.
    2. Select the Redshift cluster that you want to delete and on the Actions menu, choose Delete.
    3. If a final snapshot is required, select Create final snapshot and provide Final snapshot identifier.
    4. To confirm the deletion, enter confirm and choose Delete.
  3. Delete source DMS endpoint:
    1. On the AWS DMS console, choose the Endpoints in the navigation panel.
    2. Select the DMS endpoint that you want to delete and on the Actions menu, choose Delete.
  4. Delete the AWS Secrets Manager secret

Conclusion

In this post, we showed how you can run zero-ETL integration from Oracle, PostgreSQL, MySQL, or SQL Server databases to Amazon Redshift. This minimizes the need to maintain complex data pipelines and enables near real-time analytics on transactional and operational data. With zero-ETL integrations, you can focus more on deriving value from your data and less on managing data movement.

As next steps, review the Amazon Redshift zero-ETL integration documentation to understand implementation details. Consider expanding your analytics capabilities by exploring Amazon Redshift ML for machine learning (ML) integration or implementing federated queries for cross-database analytics. For additional insights, learn how other customers are implementing zero-ETL with Amazon Aurora (MySQL-Compatible and PostgreSQL-Compatible) in the zero-ETL integration with Amazon Aurora blog post.


About the authors

Alex Anto

Alex Anto

Alex is as a Senior Data Migration Specialist Solutions Architect on the AWS DMS team. With deep expertise in AWS DMS, he supports migrations across all sources and target database combinations that the service enables. Beyond customer engagement, Alex contributes to improving AWS data migration products and has successfully guided hundreds of customers through their cloud transformation journeys.

Shashank Kalki

Shashank Kalki

Shashank is a Data Migration Specialist Solutions Architect on the AWS DMS team. He works on challenges related to data migration and works closely with customers to help them realize the true potential of AWS DMS.

Johnson Mushonga

Johnson Mushonga

Johnson is a Data Migration Specialist Solutions Architect on the AWS DMS team. Through data discovery sessions, architectural reviews, and deep dives into customer configurations, Johnson advises customers to successfully migrate their database workloads to the AWS Cloud.

Changsheng Su

Changsheng Su

Changsheng Su is a Software Engineer on the AWS DMS team, where he contributes to feature development for DMS streaming endpoints. He is also a main contributor for zero-ETL streaming engine and infrastructure.

Yupeng Su

Yupeng Su

Yupeng Su is a Senior Software Engineer at Amazon Web Services (AWS), working on AWS DMS. He specializes in building and operating scalable, secure, distributed data migration and ingestion platforms. His areas of interest include open source data warehouse and streaming frameworks, plus AI technologies. Outside of his work, he likes to read tech blogs, papers, and customer reviews of AWS services.