AWS Storage Blog

Creating an ETL pipeline trigger for existing AWS DataSync tasks

Organizations look for ways to leverage the compute power of the cloud to analyze their data and produce reports that drive business decisions. They want to load their data sets into extract, transform, load (ETL) pipelines for processing, and once that processing completes, decision makers rely on accurate, timely reports to guide key decisions.

If these organizations have low-bandwidth connections and try to load large data sets from on-premises legacy systems to AWS, they might run into a couple of issues. First, their legacy systems and data sources may not natively write to cloud services such as Amazon Simple Storage Service (Amazon S3). These organizations can usually remedy this issue by synchronizing the data in the source system and Amazon S3 with the help of AWS DataSync, which lets you schedule tasks at intervals and creates a bridge between on-premises legacy systems and AWS storage services. This allows organizations to leverage the cloud’s scale for data processing even when files are generated elsewhere. The second issue organizations run into when trying to extract value from the data moved to the cloud is starting the pipeline. When transferring large volumes of data over a slow connection, files are sometimes still being written to source systems when AWS DataSync begins committing them to Amazon S3. If an Amazon S3 Event Notification is used to launch the data pipeline, the pipeline can start on these incomplete files. Processing incomplete or partially transferred data can lead to inaccurate output or a complete pipeline failure. Implementing idempotency in the data pipeline would likely solve this, but that can be time consuming and would require several code changes.

In this post, we explain how to deploy an event-driven solution that avoids these issues without changing your pipeline logic or data source. The solution monitors DataSync activity and kicks off your existing data pipeline only when DataSync validation is complete for all files. It includes instructions on how to configure an existing DataSync task and deploy an AWS CloudFormation template that creates AWS Lambda, Amazon DynamoDB, and Amazon SQS resources. With this solution, you can ensure that incomplete data does not negatively affect your downstream processes or the accuracy and timeliness of your pipeline reports.

Solution overview

This solution is intended for customers who use DataSync to transfer files from an on-premises storage system to Amazon S3 for processing and assumes you have an existing DataSync task with an S3 bucket as its destination.

Diagram showing on-premises to AWS data transfer using DataSync

After configuring your DataSync task, you will deploy the CloudFormation template, which builds a solution composed mainly of two Lambda functions, a DynamoDB table, an SQS queue, and an Amazon CloudWatch Logs subscription filter. These components work together, driven by the additional logging you enable on your DataSync task, to confirm that file validation has completed before processing is initiated. When DataSync detects that a file has changed mid-transfer, it logs a warning in CloudWatch Logs but still commits the file to S3. The subscription filter only triggers the first Lambda function when DataSync successfully verifies files. This Lambda function then gathers the required file metadata from the log event and saves it in an Amazon DynamoDB table. Once the DataSync task is complete, a second Lambda function is triggered that takes that metadata from DynamoDB and passes it to an SQS queue.

DataSync Pipeline Trigger Solution Architecture
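To make the flow concrete, here is a minimal sketch of what the first Lambda function could look like. It is not the exact code deployed by the CloudFormation template; the table name (read from a METADATA_TABLE environment variable), the attribute names, and the way the verification message is interpreted are assumptions for illustration. What it does show accurately is how a CloudWatch Logs subscription filter delivers its payload: base64-encoded, gzipped JSON containing the log stream name and log events.

# Sketch of the first Lambda function: receives CloudWatch Logs subscription
# filter events from the DataSync task's log group and records file metadata
# in DynamoDB. Table and attribute names are illustrative assumptions.
import base64
import gzip
import json
import os

import boto3

TABLE_NAME = os.environ.get("METADATA_TABLE", "datasync-file-metadata")  # assumed env var
table = boto3.resource("dynamodb").Table(TABLE_NAME)


def handler(event, context):
    # CloudWatch Logs delivers the subscription payload as base64-encoded, gzipped JSON.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    for log_event in payload["logEvents"]:
        message = log_event["message"]
        # Keep the raw DataSync log message plus the log stream so the second
        # Lambda function can assemble the SQS payload once the task completes.
        table.put_item(
            Item={
                "log_stream": payload["logStream"],
                "event_id": log_event["id"],
                "message": message,
                "verified_state": "Verified" in message,  # assumption about log wording
            }
        )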

After deploying the solution, you can verify it is working by placing files in your source location, running the DataSync task, and confirming that the SQS queue contains the appropriate file information in the payload. This confirms that files were transferred completely and that the information needed to kick off your data pipeline was successfully pushed to your SQS queue. From that point, you only need to modify your existing pipeline to be triggered by these messages, which have the following format.

{
  "log_stream": "<logstream-id>",
  "fileName": "<original file name and path>", 
  "s3uri": "<s3 uri to object>",
  "state": {
    "XXXtransferred_state" : "true or false",
    "verified_state" : "true or false", 
    "completed_state" : "true or false"
  }
}
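As an example of consuming this format, the following is a minimal sketch of an SQS-triggered Lambda handler that could sit in front of your existing pipeline. The start_pipeline() function is hypothetical and stands in for however you kick off processing today (for example, starting a Step Functions execution or submitting a Glue or EMR job).

# Sketch of an SQS-triggered Lambda function that parses the message format
# above and launches the pipeline only for fully verified files.
# start_pipeline() is a hypothetical placeholder for your own logic.
import json


def start_pipeline(s3_uri: str) -> None:
    # Placeholder for your existing ETL kickoff.
    print(f"Starting pipeline for {s3_uri}")


def handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["body"])
        state = message["state"]
        # Only launch the pipeline when the file is transferred, verified,
        # and the DataSync task execution has completed.
        if all(state.get(key) == "true" for key in ("transferred_state", "verified_state", "completed_state")):
            start_pipeline(message["s3uri"])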

Prerequisites

Before you begin the walkthrough, you must have an AWS account. If you don’t have one already, you can sign up here. You should also have basic knowledge of CloudFormation, AWS DataSync, Amazon S3, and Amazon SQS.

This solution assumes you already have the following in place:

  • An S3 bucket to use as your DataSync task’s destination location
  • A DataSync agent installed on your on-premises server
  • A DataSync task using your existing source location and the destination S3 bucket mentioned previously

If you have not configured your DataSync task, please follow the documentation.

After you’ve completed the prerequisites, we’ll walk you through enabling enhanced logging for your DataSync task specifically for this solution. Then, we’ll walk you through deploying the CloudFormation template.

Walkthrough

Once you’ve completed the prerequisites, you can start deploying the solution in the AWS console.

Step 1: Enable task logging in your DataSync task

  1. From the DataSync console, select your existing task, then click Edit from the task page. Scroll down to Task Logging and change the Log level to Log all transferred objects and files. Click Save changes. Note the CloudWatch log group information, as you will need it later. If you prefer to script this change, see the sketch after the screenshot.

Enable task logging in your DataSync task image
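If you would rather make the same change from code, here is a minimal sketch using boto3. The task and log group ARNs are placeholders for your own resources; the LogLevel value TRANSFER corresponds to the Log all transferred objects and files option in the console.

# Sketch: enable per-file task logging on an existing DataSync task.
# Both ARNs below are placeholders.
import boto3

datasync = boto3.client("datasync")

datasync.update_task(
    TaskArn="arn:aws:datasync:us-east-1:111122223333:task/task-EXAMPLE",  # placeholder
    CloudWatchLogGroupArn="arn:aws:logs:us-east-1:111122223333:log-group:/aws/datasync",  # placeholder
    Options={"LogLevel": "TRANSFER"},  # "Log all transferred objects and files"
)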

Step 2: Launch the stack from the AWS CloudFormation console

  1. Use the Launch Stack button below to open the CloudFormation stack in the AWS Management Console.
     Launch Stack button
  2. On the Create stack page, choose Next.
  3. On the Specify stack details page, in the Stack name field, give your stack a name. Under Parameters, specify your S3TargetBaseURI (with no trailing slash) from your current DataSync task’s destination. Optionally, you can change the other parameters if your environment is different, for example the CloudWatch log group noted earlier. Then click Next.
     CloudFormation Parameters image
  4. On the Configure stack options page, select Next.
  5. On the Review page, scroll to the bottom. You’ll need to acknowledge the three statements about the capabilities that the stack’s transforms and IAM resources might require, then click Submit. You’ll know the stack was created successfully when its status is CREATE_COMPLETE. If you prefer to launch the stack from code, see the sketch after this list.
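For a scripted deployment, the following is a minimal sketch using boto3. The template URL is a placeholder for the template behind the Launch Stack button, S3TargetBaseURI is the parameter described above, and the Capabilities list corresponds to the acknowledgements on the Review page.

# Sketch: launch the same CloudFormation stack from code.
# TemplateURL and the parameter value are placeholders.
import boto3

cloudformation = boto3.client("cloudformation")

cloudformation.create_stack(
    StackName="datasync-pipeline-trigger",
    TemplateURL="https://example-bucket.s3.amazonaws.com/datasync-pipeline-trigger.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "S3TargetBaseURI", "ParameterValue": "s3://your-destination-bucket/prefix"},  # no trailing slash
    ],
    # Acknowledges IAM resource creation and transform expansion.
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)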

Verify your setup works

You can use the following steps to test your CloudFormation deployment and confirm the architecture works as expected.

Step 1: Place a file in your source location

Step 2: Run your DataSync task

  1. From the DataSync page, select Tasks. Check the box next to the task you created. Then, from the Actions pull-down, select Start. The status changes to Running. You can also start the task from code, as shown after the screenshot.

Run your DataSync task image
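As an alternative to the console, here is a minimal sketch that starts the task execution with boto3 and polls until it finishes. The task ARN is a placeholder for your own task.

# Sketch: start a DataSync task execution and wait for it to finish.
import time

import boto3

datasync = boto3.client("datasync")

execution = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:us-east-1:111122223333:task/task-EXAMPLE"  # placeholder
)

while True:
    status = datasync.describe_task_execution(
        TaskExecutionArn=execution["TaskExecutionArn"]
    )["Status"]
    print(f"Task execution status: {status}")
    if status in ("SUCCESS", "ERROR"):
        break
    time.sleep(30)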

Step 3: Check S3, DynamoDB and SQS

  1. When the task is complete, the Status changes to Available and your file(s) should be in the destination S3 bucket. To verify the process worked, you can also view DynamoDB and SQS.
  2. From the DynamoDB page, click Tables, then your table name. You should be able to see the metadata for the file you just transferred.
     DynamoDB table list image
  3. From the SQS console, select the radio button next to your queue, then click Send and receive messages. From the Send and receive messages page, click Poll for messages. Click on the message, then view the Body of the message to verify your data has been transferred completely.
     SQS message body
  4. You can now create a new (or update an existing) Lambda function that is subscribed to this SQS queue, as shown in the sketch below. When a new file is processed, you can use the information in SQS to kick off your processing pipeline knowing that the data has been validated successfully.
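Here is a minimal sketch of subscribing an existing Lambda function to the queue created by the stack. The queue ARN and function name are placeholders; note that the function’s execution role also needs sqs:ReceiveMessage, sqs:DeleteMessage, and sqs:GetQueueAttributes permissions on the queue.

# Sketch: subscribe a Lambda function to the solution's SQS queue.
# The queue ARN and function name are placeholders.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:111122223333:datasync-pipeline-trigger-queue",  # placeholder
    FunctionName="your-pipeline-trigger-function",  # placeholder
    BatchSize=1,
)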

Cleaning up

To avoid incurring future charges, delete the CloudFormation stack. For instructions, refer to Deleting a stack on the CloudFormation console. You might also want to revert the task logging settings that were changed in Walkthrough Step 1: Enable task logging in your DataSync task.

Conclusion

In this post, we discussed overcoming challenges imposed by low bandwidth and large data sets to ensure a complete data set is transferred to AWS before an ETL pipeline is triggered. We discussed how this solution monitors CloudWatch Logs for DataSync tasks and only adds file metadata to the SQS queue after DataSync validation is complete. This is important because it ensures that even if your source data changes while DataSync transfers it, the data pipeline does not get triggered until the file is completely transferred to the AWS Cloud. This helps avoid negative downstream effects that come from processing incomplete data.

You can now combine your legacy data source with the processing power of the AWS cloud to run your ETL pipelines without having to worry about partial data sets polluting outcomes. This contributes to more accurate report generation and the ability to make business decisions based on complete information.

If you have feedback about this post, please submit your thoughts in the comments section.

Sue Tarazi

Sue Tarazi is a Customer Solutions Manager within the AWS Industries - Strategic Accounts organization. She is passionate about helping enterprise customers find and adopt the right AWS solutions to meet their strategic business objectives.

David Selberg

David Selberg is a Solutions Architect at AWS who is passionate about helping customers build Well-Architected solutions in the AWS Cloud. He works with customers of all sizes, from startups to enterprises, and loves to dive deep on security topics.

Max Tybar

Max Tybar is a Solutions Architect at AWS with a background in computer science and application development. He enjoys leveraging DevOps practices to architect and build reliable cloud infrastructure that helps solve customer problems. Max is passionate about enabling customers to efficiently manage, automate, and orchestrate large-scale hybrid infrastructure projects.