AWS Database Blog
Load CDC data from relational databases to Amazon Kinesis using Database Migration Service
UPDATE, March 2019: Since this blog post was published, DMS has launched a new capability to set Kinesis as a target. For the most updated guidance, please visit Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams.
Many large enterprises are moving their data processing from batch to real-time in order to get more timely insights. The challenge of doing so is that a real-time data processing architecture must be able to keep up with the incoming data stream. This requires strong fault tolerance and elasticity.
In this blog post, we discuss real-time data processing architecture and how to capture changes in Amazon RDS for SQL Server database and send them to Amazon Kinesis Data Streams. We show how to use Amazon S3 and AWS Lambda for the initial load of data and the AWS Database Migration Service for ongoing replication.
AWS offers many services for successful data processing, such as Amazon Kinesis Data Streams, AWS Database Migration Services (DMS), AWS Lambda, and more. DMS helps you migrate existing data and also replicate ongoing change data from a source system to a target system.
Prerequisites and assumptions
- To use this solution yourself, you need an AWS account that provides access to AWS services.
- The services need to be created in us-east-1 Region and are configured in the same VPC in us-east-1 Region to simplify networking considerations.
- The compiled Java code for the Lambda functions is included in the ZIP files located in this Amazon Repository. Create a S3 bucket and upload the ZIP files. You need to provide the name of this S3 bucket while creating the CloudFormation stack.
- The default master user name for the RDS database is AWSUser. You select the master password while creating the CloudFormation stack.
When you use AWS CloudFormation, you manage related resources as a single unit called a stack. You create, update, and delete a collection of resources by creating, updating, and deleting stacks. All the resources in a stack are defined by the stack’s AWS CloudFormation template.
Steps to implement the solution
The code and the CloudFormation template to implement this solution can be found in this Amazon GitHub Repository.
We will provision the following:
- An Amazon RDS for SQL Server database from a publicly available snapshot.
- A S3 Bucket for DMS that has permissions to invoke the Lambda function.
- An IAM Role for DMS.
- AWS Database Migration Service (AWS DMS).
- An Amazon Kinesis Data Stream.
- An IAM Role for the Lambda functions.
- Lambda function to load data from source S3 bucket to target Kinesis Data Stream.
- Lambda function to read data from the Kinesis Data Stream.
The architecture diagram below shows the flow of data from RDS for SQL Server to Kinesis Data Stream using the Database Migration Service.
Step 1: Create CloudFormation Stack
Download the CloudFormation Script from this Amazon Repository.
Go to the CloudFormation page on the AWS Console and click Create Stack as shown below.
Upload the CDCCloudFormation.yml file as shown below.
On the next screen provide:
- Stack name
- CodeS3Bucket name (this is bucket for Lambda code ZIP files)
- Master user password for the database
- TargetS3BucketName (this is target bucket for DMS). Choose different bucket names for each S3 bucket. They are globally unique.
Click Next as shown below.
Click Next until you are on the Review Details page. Acknowledge the creation of IAM resources and click Create as shown below.
After the CloudFormation stack is complete, verify that the resources have been created. If the CloudFormation stack does not complete, check the events for the CloudFormation stack on the AWS Management Console.
Step 2: Verify Amazon RDS for SQL Server database from a publicly available snapshot
A SQL Server Database on RDS is created after the CloudFormation script is complete. You can check the AWS Management Console, as shown below, to check that the RDS instance was created.
You must also enable change data capture (CDC) for SQL Server. For more information about this, see Using Change Data Capture in the Amazon RDS User Guide. First enable CDC for the Database by executing
exec msdb.dbo.rds_cdc_enable_db 'CustomerDB'. Then enable CDC for the table by executing the stored procedure below.
If you are not able to access the SQL RDS instance via SQL Server Management Studio, open the port that your RDS instance is listening on within your security group. If you are using the CloudFormation template provided with this blog, the port number used is 1433.
Step 3: Verify the S3 Bucket for DMS
A target S3 bucket for Database Migration Service is created for the CDC data CSV files. You can check the AWS Management Console, as shown below, to check that the bucket was created.
Step 4: Setup DMS Database Migration Service
AWS Database Migration Service tasks require 3 entities:
Your source is the database you want to move data from and the target is the database you’re moving data to. In our case, the source database is the SQL Server on RDS and the target database is the S3 bucket. The replication instance processes the migration tasks and requires access to the source and target endpoints inside your VPC. A replication instance initiates the connection between the source and target databases, transfers the data, and caches any changes that occur on the source database during the initial data load.
After the CloudFormation script has completed successfully, check that the replication instance, task, and the endpoints have been created. Verify and start the DMS task as shown below.
Step 5: Verify Lambda Functions
After the CloudFormation script is complete, check the console to verify creation of two Lambda functions. The first function reads the CSV file on S3, which is generated by the Database Migration Service. The second function consumes data from the Kinesis Data Stream.
On the console, see two Lambda functions shown below.
The compiled Java code for the Lambda functions is included in the ZIP files located in this Amazon GitHub Repository. In order to use the code, create an S3 bucket and provide its name when you create the CloudFormation stack. After this, upload the ZIP files to your S3 bucket.
When Database Migration Service loads CSV data to S3, the Kinesis Data Stream is populated. The consumekinesis stream Lambda function polls the stream for the new data. You can see the data consumed by the consumekinesis Lambda function in CloudWatch under the Log Group for consumekinesis as shown below.
The Java code for the first Lambda function gets executed when Lambda is triggered by S3 after an Object is created on S3. We then use the S3 event to retrieve the Object metadata and content. We process the content in the CSV file and it to Kinesis Data Stream. The Java code for the second Lambda function gets executed when Lambda finds new records published to the Kinesis Data Stream. In this case the Lambda function polls the Kinesis Data Stream for new records published to the stream.
Feel free to modify the code for either Lambda functions to suit your use case.
In this blog post, we showed end-to-end solution that loads CDC data from a relational database to Amazon Kinesis using Database Migration Service. The solution can be used to analyze real-time data changes. For example, you might want to monitor transactions for fraud or to satisfy regulatory laws relating to anti-money laundering.
If you have comments about this blog post, submit them in the Comments section below. If you have questions about implementing the solution in this post, you can also leave that in the Comments section below too.
About the Author
Zafar Kapadia is a cloud application architect at Amazon Web Services. He works on Application Development and Optimization projects. He is also an avid cricketer and plays in various local leagues.
Udayasimha Theepireddy is a database cloud architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance about database migrations and big data projects.