AWS Data Pipeline is a managed ETL (extract, transform, load) service that lets you define data movement and transformations across AWS services and on-premises resources. With Data Pipeline, you define the dependent processes that make up your pipeline: the data nodes that contain your data; the activities, or business logic, such as EMR jobs or SQL queries, that run sequentially; and the schedule on which your business logic executes.
For example, if you want to move clickstream data stored in Amazon S3 to Amazon Redshift, you would define a pipeline with an S3DataNode that stores your log files, a HiveActivity that will convert your log files to a .csv file using an Amazon EMR cluster and store it back to S3, a RedshiftCopyActivity that will copy your data from S3 to Redshift, and a RedshiftDataNode that will connect to your Redshift cluster. You might then pick a schedule to run at the end of the day.
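The clickstream example above could be written down roughly as follows. This is a minimal sketch that builds the pipeline definition as a Python dictionary and prints it as JSON. The object types come from the example; the field names follow the pipeline definition syntax as commonly documented and should be checked against the current reference. All ids, bucket paths, the script location, the cluster and table names, and the two compute resources (an EmrCluster and an Ec2Resource) are assumptions added for illustration.

```python
import json


def ref(object_id):
    """Build a reference to another pipeline object by its id."""
    return {"ref": object_id}


def build_clickstream_pipeline():
    """Return a definition for the S3 -> Hive -> Redshift example (hypothetical names)."""
    daily = ref("DailySchedule")
    objects = [
        # Run once a day; the start time is a placeholder.
        {"id": "DailySchedule", "type": "Schedule",
         "period": "1 day", "startDateTime": "2024-01-01T23:00:00"},
        # Data nodes: raw logs, converted CSV output, and the Redshift table.
        {"id": "RawLogs", "type": "S3DataNode", "schedule": daily,
         "directoryPath": "s3://example-bucket/clickstream/raw/"},
        {"id": "CsvLogs", "type": "S3DataNode", "schedule": daily,
         "directoryPath": "s3://example-bucket/clickstream/csv/"},
        {"id": "ExampleRedshiftDatabase", "type": "RedshiftDatabase",
         "clusterId": "example-cluster", "databaseName": "analytics",
         "username": "example_user", "*password": "REPLACE_ME"},
        {"id": "ClickstreamTable", "type": "RedshiftDataNode", "schedule": daily,
         "database": ref("ExampleRedshiftDatabase"), "tableName": "clickstream"},
        # Compute resources the activities run on (assumed, not in the text).
        {"id": "HiveCluster", "type": "EmrCluster", "schedule": daily},
        {"id": "CopyInstance", "type": "Ec2Resource", "schedule": daily},
        # Activities: convert the logs with Hive, then copy the CSV to Redshift.
        {"id": "ConvertLogsToCsv", "type": "HiveActivity", "schedule": daily,
         "input": ref("RawLogs"), "output": ref("CsvLogs"),
         "runsOn": ref("HiveCluster"),
         "scriptUri": "s3://example-bucket/scripts/convert_logs.q"},
        {"id": "LoadIntoRedshift", "type": "RedshiftCopyActivity", "schedule": daily,
         "input": ref("CsvLogs"), "output": ref("ClickstreamTable"),
         "runsOn": ref("CopyInstance"), "insertMode": "KEEP_EXISTING"},
    ]
    return {"objects": objects}


if __name__ == "__main__":
    print(json.dumps(build_clickstream_pipeline(), indent=2))
```

The activities are chained through their data nodes: the HiveActivity writes to CsvLogs, and the RedshiftCopyActivity reads from it, which is how the ordering between the two steps is expressed.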
You can also define preconditions that can check if your data is available before kicking off a particular activity. In the above example, you can have a precondition on the S3DataNode that will check to see if the log files are available before kicking off the HiveActivity.
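As a rough illustration, a precondition is one more pipeline object that a data node refers to by id. The sketch below assumes the S3PrefixNotEmpty precondition type with an s3Prefix field; the ids and bucket path are hypothetical, and other fields a real definition may need (such as IAM roles) are left out.

```python
def attach_precondition(data_node, precondition):
    """Return a copy of data_node that references the precondition by id."""
    gated = dict(data_node)
    gated["precondition"] = {"ref": precondition["id"]}
    return gated


# Hypothetical precondition: succeeds once objects exist under the prefix.
logs_available = {
    "id": "LogsAvailable",
    "type": "S3PrefixNotEmpty",
    "s3Prefix": "s3://example-bucket/clickstream/raw/",
}

# Hypothetical data node holding the raw log files.
raw_logs = {
    "id": "RawLogs",
    "type": "S3DataNode",
    "directoryPath": "s3://example-bucket/clickstream/raw/",
}

gated_raw_logs = attach_precondition(raw_logs, logs_available)
```

Any activity that uses the gated node as its input, such as the HiveActivity, would then wait until the precondition is satisfied.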
AWS Data Pipeline handles:
- Your jobs' scheduling, execution, and retry logic.
- Tracking the dependencies between your business logic, data sources, and previous processing steps to ensure that your logic does not run until all of its dependencies are met.
- Sending any necessary failure notifications.
- Creating and managing any compute resources your jobs may require.
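To make the first three items concrete, the following sketch shows the general idea of running steps in dependency order with retries and failure notifications. It is an illustration written for this overview, not how the service is implemented; the function name, step names, and retry count are all hypothetical. It uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(steps, depends_on, max_retries=2):
    """Run each step only after its dependencies succeeded, retrying failures.

    steps maps a step name to a callable; depends_on maps a step name to the
    names it depends on. Returns (succeeded names, notification messages).
    """
    graph = {name: list(depends_on.get(name, ())) for name in steps}
    unknown = {dep for deps in graph.values() for dep in deps} - set(steps)
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")

    succeeded, notifications = [], []
    # static_order() yields every step after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        if any(dep not in succeeded for dep in graph[name]):
            notifications.append(f"{name}: skipped, dependency not met")
            continue
        for _attempt in range(1 + max_retries):
            try:
                steps[name]()
            except Exception as exc:
                last_error = exc
            else:
                succeeded.append(name)
                break
        else:
            # Every attempt raised, so report the failure.
            notifications.append(
                f"{name}: failed after {1 + max_retries} attempts ({last_error})"
            )
    return succeeded, notifications


def make_flaky_step(failures_before_success):
    """Return a step that raises the given number of times, then succeeds."""
    state = {"calls": 0}

    def step():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("transient error")

    return step


if __name__ == "__main__":
    steps = {
        "check_logs": lambda: None,
        "convert": make_flaky_step(1),  # fails once, succeeds on the retry
        "load": lambda: None,
    }
    depends_on = {"convert": ["check_logs"], "load": ["convert"]}
    print(run_in_dependency_order(steps, depends_on))
```

In this sketch, a step whose dependency never succeeds is skipped and reported, so downstream logic does not run on missing data.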
ETL Data to Amazon Redshift
Copy RDS or DynamoDB tables to S3, transform the data structure, run analytics using SQL queries, and load the results into Redshift.
ETL Unstructured Data
Analyze unstructured data like clickstream logs using Hive or Pig on EMR, combine it with structured data from RDS, and upload it to Redshift for easy querying.
Load AWS Log Data to Amazon Redshift
Load log files, such as AWS billing logs or AWS CloudTrail, Amazon CloudFront, and Amazon CloudWatch logs, from Amazon S3 to Redshift.
Data Loads and Extracts
Copy data from your RDS or Redshift table to S3 and vice versa.
Move to Cloud
Easily copy data from an on-premises data store, like a MySQL database, to an AWS data store, like S3, to make it available to a variety of AWS services such as Amazon EMR, Amazon Redshift, and Amazon RDS.
Amazon DynamoDB Backup and Recovery
Periodically back up your DynamoDB table to S3 for disaster recovery purposes.