AWS Data Pipeline Documentation
AWS Data Pipeline is a managed ETL (extract, transform, load) service designed to let you define data movement and transformations across AWS services and on-premises resources. With Data Pipeline, you define the dependent processes that make up your pipeline: the data nodes that contain your data; the activities, or business logic, such as EMR jobs or SQL queries, that run sequentially; and the schedule on which your business logic executes.
For example, to move clickstream data stored in Amazon S3 to Amazon Redshift, you would define a pipeline with an S3DataNode that stores your log files, a HiveActivity that converts your log files to a .csv file using an Amazon EMR cluster and stores it back in S3, a RedshiftCopyActivity that copies your data from S3 to Redshift, and a RedshiftDataNode that connects to your Redshift cluster. You might then pick a schedule that runs at the end of the day.
Figure: Using AWS Data Pipeline to move clickstream data from Amazon S3 to Amazon Redshift.
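The clickstream example can be sketched as a pipeline definition built in Python and serialized to JSON. This is a minimal, incomplete skeleton and not a definition you can activate as-is: every id, bucket path, script location, and table name is a hypothetical placeholder, the field names (`directoryPath`, `scriptUri`, `insertMode`, and so on) should be checked against the pipeline object reference, and the compute resources and Redshift database connection object that a real definition would need are left out. The `ref` and `unresolved_refs` helpers are illustrative and are not part of any AWS library.

```python
import json


def ref(object_id):
    """Build a reference to another pipeline object by its id."""
    return {"ref": object_id}


def build_clickstream_pipeline():
    """Return a skeleton definition for the S3 -> Hive -> Redshift example."""
    # Every id, path, and table name below is a hypothetical placeholder.
    # Compute resources and the Redshift database object are omitted.
    objects = [
        {"id": "EndOfDay", "type": "Schedule", "period": "1 day",
         "startAt": "FIRST_ACTIVATION_DATE_TIME"},
        {"id": "RawLogs", "type": "S3DataNode", "schedule": ref("EndOfDay"),
         "directoryPath": "s3://example-bucket/clickstream/raw/"},
        {"id": "CsvLogs", "type": "S3DataNode", "schedule": ref("EndOfDay"),
         "directoryPath": "s3://example-bucket/clickstream/csv/"},
        {"id": "ConvertLogsToCsv", "type": "HiveActivity",
         "schedule": ref("EndOfDay"),
         "input": ref("RawLogs"), "output": ref("CsvLogs"),
         "scriptUri": "s3://example-bucket/scripts/convert_logs.q"},
        {"id": "ClickstreamTable", "type": "RedshiftDataNode",
         "schedule": ref("EndOfDay"), "tableName": "clickstream"},
        {"id": "LoadIntoRedshift", "type": "RedshiftCopyActivity",
         "schedule": ref("EndOfDay"),
         "input": ref("CsvLogs"), "output": ref("ClickstreamTable"),
         "insertMode": "APPEND"},
    ]
    return {"objects": objects}


def unresolved_refs(definition):
    """List (object id, target id) pairs whose target id is not defined."""
    known = {obj["id"] for obj in definition["objects"]}
    missing = []
    for obj in definition["objects"]:
        for value in obj.values():
            if isinstance(value, dict) and value.get("ref") not in known:
                missing.append((obj["id"], value.get("ref")))
    return missing


if __name__ == "__main__":
    definition = build_clickstream_pipeline()
    assert unresolved_refs(definition) == []
    print(json.dumps(definition, indent=2))
```

Expressing each link as a `{"ref": ...}` pointer keeps the objects flat, which makes it easy to check that every activity's input and output actually exist before the definition is submitted.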
You can also define preconditions that check whether your data is available before a particular activity starts. In the example above, you can place a precondition on the S3DataNode that checks whether the log files are available before the HiveActivity starts.
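As a rough sketch of how such a precondition might be attached, the Python below adds a precondition object to a data node and links the two by reference. It assumes the `S3PrefixNotEmpty` precondition type with an `s3Prefix` field and a `precondition` reference on the data node; confirm these names, and any other required fields omitted here, against the pipeline object reference. The ids and bucket path are hypothetical, and `add_s3_precondition` is an illustrative helper, not an AWS API.

```python
def add_s3_precondition(data_node, s3_prefix, precondition_id="LogsAvailable"):
    """Return a guarded copy of data_node plus its precondition object."""
    # Assumed object type and field name; verify before use.
    precondition = {
        "id": precondition_id,
        "type": "S3PrefixNotEmpty",
        "s3Prefix": s3_prefix,
    }
    # Copy the node rather than mutating the caller's dictionary.
    guarded = dict(data_node, precondition={"ref": precondition_id})
    return guarded, precondition


if __name__ == "__main__":
    # Hypothetical data node for the raw clickstream logs.
    raw_logs = {
        "id": "RawLogs",
        "type": "S3DataNode",
        "directoryPath": "s3://example-bucket/clickstream/raw/",
    }
    guarded, precondition = add_s3_precondition(
        raw_logs, "s3://example-bucket/clickstream/raw/"
    )
    print(guarded)
    print(precondition)
```

Because the HiveActivity reads from the guarded data node, the intent is that it does not start until the precondition reports that the log files are present.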
AWS Data Pipeline is designed to handle:
- Scheduling, executing, and retrying your jobs.
- Tracking the dependencies between your business logic, data sources, and previous processing steps to ensure that your logic does not run until its dependencies are met.
- Sending any necessary failure notifications.
- Creating and managing any compute resources your jobs may require.
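To make the dependency-tracking and retry responsibilities in the list above concrete, here is a small local Python sketch. It is only an illustration of the idea, not how AWS Data Pipeline is implemented: `run_in_dependency_order`, the step names, and the `max_attempts` parameter are all hypothetical. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) so that no step runs before the steps it depends on, and it retries a failing step a fixed number of times before raising an error that stands in for a failure notification.

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(steps, depends_on, max_attempts=3):
    """Run each callable in steps after its dependencies, with retries.

    steps: dict mapping a step name to a zero-argument callable.
    depends_on: dict mapping a step name to the set of names it needs first.
    Returns the names of the steps in the order they completed.
    """
    graph = {name: set(depends_on.get(name, ())) for name in steps}
    completed = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                steps[name]()
            except Exception as error:
                if attempt == max_attempts:
                    # Stand-in for a failure notification.
                    raise RuntimeError(
                        f"{name} failed after {max_attempts} attempts"
                    ) from error
            else:
                completed.append(name)
                break
    return completed


if __name__ == "__main__":
    calls = {"convert": 0}

    def convert():
        # Fails on the first call to exercise the retry path.
        calls["convert"] += 1
        if calls["convert"] == 1:
            raise IOError("transient failure")

    def load():
        pass

    order = run_in_dependency_order(
        {"load": load, "convert": convert},
        {"load": {"convert"}},
    )
    print(order)
```

In this sketch the load step is declared first but still runs second, because the ordering comes from the declared dependencies rather than from the order in which the steps are listed.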
Popular Use Cases
ETL Data to Amazon Redshift
Copy RDS or DynamoDB tables to S3, transform the data structure, run analytics using SQL queries, and load the data into Redshift.
ETL Unstructured Data
Analyze unstructured data, such as clickstream logs, using Hive or Pig on EMR, combine it with structured data from RDS, and upload it to Redshift for querying.
Load AWS Log Data to Amazon Redshift
Load log files, such as AWS billing logs or AWS CloudTrail, Amazon CloudFront, and Amazon CloudWatch logs, from Amazon S3 to Redshift.
Data Loads and Extracts
Copy data from your RDS or Redshift table to S3 and vice versa.
Move to Cloud
Copy data from your on-premises data store, such as a MySQL database, to an AWS data store, such as S3, to make it available to a variety of AWS services such as Amazon EMR, Amazon Redshift, and Amazon RDS.
Amazon DynamoDB Backup and Recovery
Periodically back up your DynamoDB table to S3 for disaster recovery purposes.
For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.