AWS Data Pipeline

AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premise data sources at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).

AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premise data silos.

AWS Free Tier includes 3 Low Frequency Preconditions and 5 Low Frequency Activities with AWS Data Pipeline.

AWS Data Pipeline Functionality

AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email. AWS Data Pipeline handles:

  • Your jobs' scheduling, execution, and retry logic
  • Tracking the dependencies between your business logic, data sources, and previous processing steps to ensure that your logic does not run until all of its dependencies are met
  • Sending any necessary failure notifications
  • Creating and managing any temporary compute resources your jobs may require

To ensure that data is available prior to the execution of an activity, AWS Data Pipeline allows you to optionally create data availability checks called “preconditions.” These checks repeatedly attempt to verify data availability and block any dependent activities from executing until the preconditions succeed.
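
The blocking behavior of preconditions can be illustrated with a minimal Python sketch. This is a toy model, not the AWS Data Pipeline API: the `Precondition` and `Activity` classes, the `run_when_ready` helper, and the `max_checks` parameter are all hypothetical names chosen for illustration, and a real service would wait between checks rather than looping immediately.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Precondition:
    """A named data-availability check (hypothetical model, not the AWS API)."""
    name: str
    check: Callable[[], bool]


@dataclass
class Activity:
    """A unit of business logic gated by zero or more preconditions."""
    name: str
    run: Callable[[], None]
    preconditions: List[Precondition] = field(default_factory=list)


def run_when_ready(activity: Activity, max_checks: int = 5) -> bool:
    """Re-check the preconditions up to max_checks times.

    The activity runs once, as soon as every precondition passes.
    Returns True if the activity ran, False if it stayed blocked.
    """
    for _ in range(max_checks):
        if all(p.check() for p in activity.preconditions):
            activity.run()
            return True
    return False


# Usage: the input "arrives" on the third check, so the first two checks block.
attempts = {"count": 0}


def log_file_exists() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ran: List[str] = []
analysis = Activity(
    name="hourly-log-analysis",
    run=lambda: ran.append("hourly-log-analysis"),
    preconditions=[Precondition("input-log-present", log_file_exists)],
)
activity_ran = run_when_ready(analysis)
```

In this sketch the activity never starts while its input is missing, which mirrors the dependency guarantee described above; an activity whose precondition never succeeds is simply left unrun.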

To use AWS Data Pipeline, you simply:

  • Use the AWS Management Console, Command Line Interface, or the service APIs to define your data sources, preconditions, activities, the schedule on which you want them to execute, and any optional notification conditions
  • Receive configurable, automatic notifications if your data doesn’t become available when expected or if your activities encounter errors

You can find (and use) a variety of popular AWS Data Pipeline tasks in the AWS Management Console’s template section. These tasks include:

  • Hourly analysis of Amazon S3-based log data
  • Daily replication of Amazon DynamoDB data to Amazon S3
  • Periodic replication of on-premise JDBC database tables into Amazon RDS

For more information, see the AWS Data Pipeline Developer Guide.


Service Highlights

Reliable — AWS Data Pipeline is built on a distributed, highly available infrastructure designed for fault tolerant execution of your activities. If transient failures occur in your activity logic or data sources, AWS Data Pipeline automatically retries the activity a configurable number of times. If the failure persists, AWS Data Pipeline sends you automated failure notifications via Amazon Simple Notification Service (Amazon SNS).
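
The retry-then-notify behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the service's implementation: `run_with_retries`, `max_retries`, and the `notify` callback are hypothetical names, and the callback stands in for a notification channel such as Amazon SNS.

```python
from typing import Callable, List


def run_with_retries(
    activity: Callable[[], None],
    max_retries: int,
    notify: Callable[[str], None],
) -> bool:
    """Run the activity, retrying up to max_retries times on failure.

    If every attempt fails, send one failure notification.
    Returns True on success, False otherwise.
    """
    last_error = None
    for _ in range(1 + max_retries):  # one initial attempt plus the retries
        try:
            activity()
            return True
        except Exception as exc:  # a real system would retry only transient errors
            last_error = exc
    notify(f"activity failed after {1 + max_retries} attempts: {last_error}")
    return False


# Usage: an activity that fails twice, then succeeds on its third attempt.
calls = {"count": 0}


def flaky_activity() -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")


notifications: List[str] = []
succeeded = run_with_retries(flaky_activity, max_retries=3, notify=notifications.append)
```

Here the transient failure is absorbed by the retries and no notification is sent; a notification is produced only when the failure persists through every attempt.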

Simple — Creating a pipeline is quick and easy via our drag-and-drop console. Common preconditions are built into the service, so you don’t need to write any extra logic to use them. For example, you can check for the existence of an Amazon S3 file by simply providing the name of the Amazon S3 bucket and the path of the file that you want to check for, and AWS Data Pipeline does the rest.

In addition to its easy visual pipeline creator, AWS Data Pipeline provides a library of pipeline templates. These templates make it simple to create pipelines for a number of more complex use cases, such as regularly processing your log files, archiving data to Amazon S3, or running periodic SQL queries.

Flexible — AWS Data Pipeline allows you to take advantage of a variety of features such as scheduling, dependency tracking, and error handling. You can use the activities and preconditions that AWS provides, write your own custom ones, or combine both. This means that you can configure a pipeline to run Amazon Elastic MapReduce jobs, execute SQL queries directly against databases, or execute custom applications running on Amazon EC2 or in your own datacenter. This allows you to create powerful custom pipelines to analyze and process your data without having to deal with the complexities of reliably scheduling and executing your application logic.

Scalable — AWS Data Pipeline makes it equally easy to dispatch work to one machine or many, in serial or parallel. With AWS Data Pipeline’s flexible design, processing a million files is as easy as processing a single file.

Transparent — You have full control over the computational resources that execute your business logic, making it easy to enhance or debug your logic. Additionally, full execution logs are automatically delivered to Amazon S3, giving you a persistent, detailed record of what has happened in your pipeline.


Pricing

AWS Data Pipeline is currently available in the US East region. Pay only for what you use – there is no minimum fee.

Free Tier*

As part of AWS’s Free Usage Tier, AWS Data Pipeline offers the following each month to new customers:

  • 3 Low Frequency preconditions running on AWS at no charge
  • 5 Low Frequency activities running on AWS at no charge

Low Frequency activities and preconditions are ones scheduled to run one time a day or less.

AWS Data Pipeline is billed based on how often your activities and preconditions are scheduled to run and where they run (AWS or on-premise). High Frequency activities are ones scheduled to execute more than once a day; for example, an activity scheduled to execute every hour or every 12 hours is High Frequency. Low Frequency activities are ones scheduled to execute one time a day or less.

For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month. If an Amazon EC2 activity were added to this same pipeline to produce a report based on the data in Amazon S3, the total cost of the pipeline would be $1.20 per month (two activities × $0.60 per activity per month). If the pipeline were changed to run every 6 hours, it would cost $2.00 per month, because it would then consist of two High Frequency activities (at $1.00 per month for each activity).
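
The arithmetic in this example can be reproduced with a short Python sketch. It uses only the two rates quoted above for activities running on AWS; the function name `monthly_cost` and its runs-per-day input are illustrative assumptions, and the sketch deliberately ignores the Free Tier, on-premise rates, preconditions, and hourly pro-rating.

```python
from decimal import Decimal
from typing import List

# Monthly per-activity rates for activities running on AWS, as quoted in the example.
LOW_FREQUENCY_RATE = Decimal("0.60")   # scheduled to run once a day or less
HIGH_FREQUENCY_RATE = Decimal("1.00")  # scheduled to run more than once a day


def monthly_cost(runs_per_day_by_activity: List[int]) -> Decimal:
    """Sum the monthly charge for each activity based on its schedule frequency."""
    total = Decimal("0")
    for runs_per_day in runs_per_day_by_activity:
        rate = HIGH_FREQUENCY_RATE if runs_per_day > 1 else LOW_FREQUENCY_RATE
        total += rate
    return total


one_daily_activity = monthly_cost([1])         # daily DynamoDB-to-S3 replication
two_daily_activities = monthly_cost([1, 1])    # plus a daily EC2 reporting activity
two_six_hourly = monthly_cost([4, 4])          # both activities run every 6 hours
```

The three values correspond to the $0.60, $1.20, and $2.00 monthly totals in the example. `Decimal` is used instead of floating point so that currency amounts stay exact.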

Activities or preconditions that are active for part of a month are pro-rated hourly. Your Amazon EC2, Amazon Elastic MapReduce, Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon SNS activities associated with AWS Data Pipeline are billed separately according to those services’ normal prices.

* Your free usage is calculated each month and automatically applied to your bill – free usage does not accumulate; unused free usage cannot be rolled over into the next month.

Inactive pipelines: $1.00 per month

Intended Usage and Restrictions

Your use of this service is subject to the Amazon Web Services Customer Agreement.


©2013, Amazon Web Services, Inc. or its affiliates. All rights reserved.