As a managed ETL (Extract-Transform-Load) service, AWS Data Pipeline allows you to define data movement and transformations across various AWS services, as well as on-premises resources. Using Data Pipeline, you define the dependent processes to create your pipeline, which is composed of the data nodes that contain your data; the activities, or business logic, such as EMR jobs or SQL queries, that will run sequentially; and the schedule on which your business logic executes.
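To make these building blocks concrete, here is a minimal sketch of a pipeline definition, written as a Python dict that mirrors the JSON pipeline-definition format: one schedule, one data node, one activity, and the compute resource the activity runs on. The bucket, role names, object IDs, and the echo command are placeholder assumptions, not values from this page.

```python
# Minimal sketch of a pipeline definition (JSON pipeline-definition format as a Python dict).
# Bucket names, roles, and IDs below are placeholder assumptions.
minimal_pipeline = {
    "objects": [
        {   # Pipeline defaults: schedule type, IAM roles, and a log location
            "id": "Default",
            "name": "Default",
            "scheduleType": "cron",
            "role": "DataPipelineDefaultRole",
            "resourceRole": "DataPipelineDefaultResourceRole",
            "pipelineLogUri": "s3://my-example-bucket/logs/",
        },
        {   # The schedule on which the business logic executes
            "id": "DailySchedule",
            "name": "DailySchedule",
            "type": "Schedule",
            "period": "1 days",
            "startAt": "FIRST_ACTIVATION_DATE_TIME",
        },
        {   # A data node: the S3 location that contains the input data
            "id": "InputData",
            "name": "InputData",
            "type": "S3DataNode",
            "directoryPath": "s3://my-example-bucket/input/",
            "schedule": {"ref": "DailySchedule"},
        },
        {   # An activity: the business logic that runs against the data node
            "id": "ProcessData",
            "name": "ProcessData",
            "type": "ShellCommandActivity",
            "command": "echo processing #{input.directoryPath}",
            "input": {"ref": "InputData"},
            "runsOn": {"ref": "WorkerInstance"},
            "schedule": {"ref": "DailySchedule"},
        },
        {   # The compute resource Data Pipeline creates to run the activity
            "id": "WorkerInstance",
            "name": "WorkerInstance",
            "type": "Ec2Resource",
            "instanceType": "t1.micro",
            "terminateAfter": "30 Minutes",
            "schedule": {"ref": "DailySchedule"},
        },
    ]
}
```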

For example, if you want to move clickstream data stored in Amazon S3 to Amazon Redshift, you would define a pipeline with an S3DataNode that stores your log files, a HiveActivity that converts your log files to a .csv file using an Amazon EMR cluster and stores it back in S3, a RedshiftCopyActivity that copies your data from S3 to Redshift, and a RedshiftDataNode that connects to your Redshift cluster. You might then pick a schedule to run at the end of the day.
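Here is a hedged sketch of how those objects might be wired together in a pipeline definition, again as Python dicts mirroring the JSON format. The bucket, cluster, table names, and credentials below are placeholders, and the Hive script and data formats are simplified rather than a working log parser.

```python
# Sketch of the clickstream example: S3 logs -> Hive on EMR -> CSV in S3 -> Redshift.
# All identifiers, paths, and credentials are placeholder assumptions.
clickstream_objects = [
    {"id": "DailySchedule", "type": "Schedule",
     "period": "1 days", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
    {   # Raw clickstream logs in S3
        "id": "RawLogs", "type": "S3DataNode",
        "directoryPath": "s3://my-example-bucket/clickstream/raw/",
        "schedule": {"ref": "DailySchedule"},
    },
    {   # HiveActivity converts the logs to CSV on an EMR cluster
        "id": "ConvertLogs", "type": "HiveActivity",
        "input": {"ref": "RawLogs"},
        "output": {"ref": "CsvLogs"},
        "runsOn": {"ref": "EmrClusterForHive"},
        "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};",
        "schedule": {"ref": "DailySchedule"},
    },
    {   # CSV output written back to S3
        "id": "CsvLogs", "type": "S3DataNode",
        "directoryPath": "s3://my-example-bucket/clickstream/csv/",
        "dataFormat": {"ref": "CsvFormat"},
        "schedule": {"ref": "DailySchedule"},
    },
    {"id": "CsvFormat", "type": "CSV"},
    {   # RedshiftCopyActivity loads the CSV data into the Redshift table
        "id": "LoadToRedshift", "type": "RedshiftCopyActivity",
        "input": {"ref": "CsvLogs"},
        "output": {"ref": "ClickstreamTable"},
        "insertMode": "TRUNCATE",
        "runsOn": {"ref": "CopyInstance"},
        "schedule": {"ref": "DailySchedule"},
    },
    {   # RedshiftDataNode pointing at the destination table
        "id": "ClickstreamTable", "type": "RedshiftDataNode",
        "tableName": "clickstream_events",
        "database": {"ref": "RedshiftCluster"},
        "schedule": {"ref": "DailySchedule"},
    },
    {   # Connection details for the Redshift cluster
        "id": "RedshiftCluster", "type": "RedshiftDatabase",
        "clusterId": "my-example-cluster",
        "databaseName": "analytics",
        "username": "etl_user", "*password": "replace-me",
    },
    {"id": "EmrClusterForHive", "type": "EmrCluster",
     "coreInstanceCount": "2", "schedule": {"ref": "DailySchedule"}},
    {"id": "CopyInstance", "type": "Ec2Resource",
     "instanceType": "t1.micro", "schedule": {"ref": "DailySchedule"}},
]
```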

Figure: Use AWS Data Pipeline to move clickstream data from Amazon S3 to Amazon Redshift.

AWS Free Tier includes 3 Low Frequency Preconditions and 5 Low Frequency Activities with AWS Data Pipeline.

You can also define preconditions that check whether your data is available before kicking off a particular activity. In the example above, you could attach a precondition to the S3DataNode that verifies the log files are present before the HiveActivity starts.
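A sketch of what that precondition might look like in the definition, assuming the RawLogs data node and DailySchedule from the earlier sketch; the bucket and prefix are placeholders.

```python
# Sketch of a precondition on the input data node: the activity that consumes
# RawLogs will not start until the S3 prefix actually contains log files.
# The bucket and prefix are placeholder assumptions.
precondition_objects = [
    {   # Precondition: succeed only if at least one object exists under the prefix
        "id": "LogsArePresent",
        "type": "S3PrefixNotEmpty",
        "s3Prefix": "s3://my-example-bucket/clickstream/raw/",
    },
    {   # Input data node, now guarded by the precondition
        "id": "RawLogs",
        "type": "S3DataNode",
        "directoryPath": "s3://my-example-bucket/clickstream/raw/",
        "precondition": {"ref": "LogsArePresent"},
        "schedule": {"ref": "DailySchedule"},
    },
]
```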

AWS Data Pipeline handles the following (see the sketch after this list):

  • Your jobs' scheduling, execution, and retry logic.
  • Tracking the dependencies between your business logic, data sources, and previous processing steps to ensure that your logic does not run until all of its dependencies are met.
  • Sending any necessary failure notifications.
  • Creating and managing any compute resources your jobs may require.
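The sketch below suggests how these behaviors surface in a definition, reusing IDs from the clickstream sketch above: retry counts and failure notifications are fields on the activity, and the compute resource is just another pipeline object that Data Pipeline provisions and terminates for you. The SNS topic ARN and instance settings are placeholder assumptions.

```python
# Sketch: retries, failure notification, and a managed compute resource.
# The SNS topic ARN, IDs, and instance settings are placeholder assumptions.
reliability_objects = [
    {   # Activity with retry logic, an upstream dependency, and a failure notification
        "id": "LoadToRedshift",
        "type": "RedshiftCopyActivity",
        "input": {"ref": "CsvLogs"},
        "output": {"ref": "ClickstreamTable"},
        "dependsOn": {"ref": "ConvertLogs"},   # wait for the upstream step to finish
        "maximumRetries": "3",                 # retry before marking the run as failed
        "onFail": {"ref": "NotifyOnFailure"},  # send an alert if it still fails
        "runsOn": {"ref": "CopyInstance"},
        "schedule": {"ref": "DailySchedule"},
    },
    {   # SNS notification sent when the activity fails
        "id": "NotifyOnFailure",
        "type": "SnsAlarm",
        "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
        "subject": "Data Pipeline failure",
        "message": "The Redshift load for #{node.@scheduledStartTime} failed.",
    },
    {   # Compute resource that Data Pipeline creates and terminates on your behalf
        "id": "CopyInstance",
        "type": "Ec2Resource",
        "instanceType": "t1.micro",
        "terminateAfter": "1 Hours",
        "schedule": {"ref": "DailySchedule"},
    },
]
```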

 


ETL Data to Amazon Redshift

Copy RDS or DynamoDB tables to S3, transform the data structure, run analytics using SQL queries, and load the results into Redshift.

ETL Unstructured Data

Analyze unstructured data such as clickstream logs using Hive or Pig on EMR, combine it with structured data from RDS, and upload the result to Redshift for easy querying.

Load AWS Log Data to Amazon Redshift

Load log files, such as AWS billing logs or logs from AWS CloudTrail, Amazon CloudFront, and Amazon CloudWatch, from Amazon S3 into Redshift.

Data Loads and Extracts

Copy data from your RDS or Redshift table to S3, and vice versa.
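A hedged sketch of the extract direction: a CopyActivity reads an RDS table through a SqlDataNode and writes it to S3 (swapping input and output would sketch the load direction). The instance ID, credentials, table, and paths are placeholders, and a DailySchedule object like the one in the earlier sketches is assumed.

```python
# Sketch: copy an RDS table to S3 with a CopyActivity.
# Instance ID, credentials, table, and paths are placeholder assumptions.
copy_objects = [
    {   # Connection details for the source RDS database
        "id": "SourceDatabase",
        "type": "RdsDatabase",
        "rdsInstanceId": "my-example-rds-instance",
        "databaseName": "orders",
        "username": "etl_user", "*password": "replace-me",
    },
    {   # The source table, exposed as a SQL data node
        "id": "OrdersTable",
        "type": "SqlDataNode",
        "table": "orders",
        "database": {"ref": "SourceDatabase"},
        "selectQuery": "SELECT * FROM #{table}",
        "schedule": {"ref": "DailySchedule"},
    },
    {   # Destination in S3
        "id": "OrdersExport",
        "type": "S3DataNode",
        "directoryPath": "s3://my-example-bucket/exports/orders/",
        "schedule": {"ref": "DailySchedule"},
    },
    {   # CopyActivity moves rows from the table to S3; swap input/output to load instead
        "id": "ExportOrders",
        "type": "CopyActivity",
        "input": {"ref": "OrdersTable"},
        "output": {"ref": "OrdersExport"},
        "runsOn": {"ref": "CopyInstance"},
        "schedule": {"ref": "DailySchedule"},
    },
    {"id": "CopyInstance", "type": "Ec2Resource",
     "instanceType": "t1.micro", "schedule": {"ref": "DailySchedule"}},
]
```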

Move to Cloud

Easily copy data from an on-premises data store, such as a MySQL database, to an AWS data store, such as Amazon S3, to make it available to a variety of AWS services such as Amazon EMR, Amazon Redshift, and Amazon RDS.
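One way this might look, as a sketch: instead of a runsOn resource that Data Pipeline creates in AWS, the activity names a workerGroup, and a Task Runner process installed on the on-premises host polls for work under that group name. The group name, connection details, and paths are placeholder assumptions, and a DailySchedule object is again assumed.

```python
# Sketch: run the copy on an on-premises host via a worker group instead of runsOn.
# A Task Runner installed on that host polls for tasks tagged with this workerGroup.
# Group name, connection details, and paths are placeholder assumptions.
onprem_objects = [
    {   # On-premises MySQL table, reached by the Task Runner on the local network
        "id": "OnPremOrders",
        "type": "MySqlDataNode",
        "connectionString": "jdbc:mysql://onprem-db.example.internal:3306/orders",
        "username": "etl_user", "*password": "replace-me",
        "table": "orders",
        "schedule": {"ref": "DailySchedule"},
    },
    {   # Destination in S3
        "id": "OrdersInS3",
        "type": "S3DataNode",
        "directoryPath": "s3://my-example-bucket/onprem/orders/",
        "schedule": {"ref": "DailySchedule"},
    },
    {   # The activity runs wherever a Task Runner with this workerGroup is polling
        "id": "MoveOrdersToS3",
        "type": "CopyActivity",
        "input": {"ref": "OnPremOrders"},
        "output": {"ref": "OrdersInS3"},
        "workerGroup": "onprem-mysql-workers",
        "schedule": {"ref": "DailySchedule"},
    },
]
```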

 

Amazon DynamoDB Backup and Recovery

Periodically back up your DynamoDB table to S3 for disaster recovery purposes.
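One possible shape for such a backup, as a hedged sketch: a HiveCopyActivity on a transient EMR cluster copies the table contents to a dated S3 prefix once a day. The table name, bucket, and instance settings are placeholder assumptions.

```python
# Sketch: daily DynamoDB-to-S3 backup using a HiveCopyActivity on a transient EMR cluster.
# Table name, bucket, and instance settings are placeholder assumptions.
backup_objects = [
    {"id": "DailySchedule", "type": "Schedule",
     "period": "1 days", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
    {"id": "ExportFormat", "type": "DynamoDBExportDataFormat"},
    {   # Source DynamoDB table
        "id": "SourceTable",
        "type": "DynamoDBDataNode",
        "tableName": "my-example-table",
        "dataFormat": {"ref": "ExportFormat"},
        "schedule": {"ref": "DailySchedule"},
    },
    {   # Backup destination in S3, partitioned by the scheduled start date
        "id": "BackupLocation",
        "type": "S3DataNode",
        "directoryPath": "s3://my-example-bucket/dynamodb-backups/#{format(@scheduledStartTime, 'YYYY-MM-dd')}/",
        "dataFormat": {"ref": "ExportFormat"},
        "schedule": {"ref": "DailySchedule"},
    },
    {   # HiveCopyActivity copies the table contents to S3
        "id": "BackupTable",
        "type": "HiveCopyActivity",
        "input": {"ref": "SourceTable"},
        "output": {"ref": "BackupLocation"},
        "runsOn": {"ref": "BackupCluster"},
        "schedule": {"ref": "DailySchedule"},
    },
    {   # Transient EMR cluster created for the copy and terminated afterwards
        "id": "BackupCluster",
        "type": "EmrCluster",
        "coreInstanceCount": "1",
        "terminateAfter": "2 Hours",
        "schedule": {"ref": "DailySchedule"},
    },
]
```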

Start using AWS Data Pipeline now via the AWS Management Console, the AWS Command Line Interface, or the service APIs.
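For the service-API route, here is a hedged sketch using boto3, the AWS SDK for Python: it creates a pipeline, uploads a definition, and activates it. The to_fields helper and the tiny definition are illustrative assumptions; create_pipeline, put_pipeline_definition, and activate_pipeline are the actual API calls.

```python
"""Sketch: create, define, and activate a pipeline via the service APIs with boto3.
The helper and the tiny definition are illustrative; the three client calls are real."""
import boto3

def to_fields(obj):
    """Convert a {'id', 'name', key: value} dict into the key/stringValue/refValue
    field list that put_pipeline_definition expects. Hypothetical helper for brevity."""
    fields = []
    for key, value in obj.items():
        if key in ("id", "name"):
            continue
        if isinstance(value, dict) and "ref" in value:
            fields.append({"key": key, "refValue": value["ref"]})
        else:
            fields.append({"key": key, "stringValue": str(value)})
    return {"id": obj["id"], "name": obj["name"], "fields": fields}

# Placeholder definition: defaults plus a daily schedule; a real pipeline would
# also include data nodes, activities, and resources as sketched above.
objects = [
    {"id": "Default", "name": "Default", "scheduleType": "cron",
     "role": "DataPipelineDefaultRole",
     "resourceRole": "DataPipelineDefaultResourceRole"},
    {"id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
     "period": "1 days", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
]

client = boto3.client("datapipeline", region_name="us-east-1")

# Create the pipeline shell; uniqueId makes the call idempotent on retries.
pipeline_id = client.create_pipeline(
    name="example-pipeline", uniqueId="example-pipeline-definition-v1"
)["pipelineId"]

# Upload the definition and check for validation errors before activating.
result = client.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[to_fields(o) for o in objects],
)
if result.get("errored"):
    raise RuntimeError(f"Definition rejected: {result.get('validationErrors')}")

# Start the pipeline on its schedule.
client.activate_pipeline(pipelineId=pipeline_id)
```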