AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premise data sources at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Elastic MapReduce (EMR).
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. You don’t have to worry about ensuring resource availability, managing inter-task dependencies, retrying transient failures or timeouts in individual tasks, or creating a failure notification system. AWS Data Pipeline also allows you to move and process data that was previously locked up in on-premise data silos.
AWS Free Tier includes 3 Low Frequency Preconditions and 5 Low Frequency Activities with AWS Data Pipeline.
AWS Data Pipeline is a web service that provides a simple management system for data-driven workflows. Using AWS Data Pipeline, you define a pipeline composed of the “data sources” that contain your data, the “activities” or business logic such as EMR jobs or SQL queries, and the “schedule” on which your business logic executes. For example, you could define a job that, every hour, runs an Amazon Elastic MapReduce (Amazon EMR)–based analysis on that hour’s Amazon Simple Storage Service (Amazon S3) log data, loads the results into a relational database for future lookup, and then automatically sends you a daily summary email. AWS Data Pipeline handles
To use AWS Data Pipeline, you simply:
You can find (and use) a variety of popular AWS Data Pipeline tasks in the AWS Management Console’s template section. These tasks include:
For more information, see the AWS Data Pipeline Developer Guide.
Reliable — AWS Data Pipeline is built on a distributed, highly available infrastructure designed for fault tolerant execution of your activities. If transient failures occur in your activity logic or data sources, AWS Data Pipeline automatically retries the activity a configurable number of times. If the failure persists, AWS Data Pipeline sends you automated failure notifications via Amazon Simple Notification Service (Amazon SNS).
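The retry-then-notify behavior described above can be pictured with a small Python sketch. This illustrates the pattern only and is not how the service is implemented. The names `run_with_retries`, `max_retries`, and `notify` are hypothetical, and the `notify` callback stands in for an Amazon SNS notification.

```python
from typing import Callable, Optional


def run_with_retries(
    activity: Callable[[], object],
    max_retries: int,
    notify: Callable[[str], None],
) -> Optional[object]:
    """Run `activity`, retrying on failure up to `max_retries` extra times.

    If every attempt fails, call `notify` once with a failure message
    (a stand-in for a failure notification) and return None.
    """
    attempts = max_retries + 1  # the first try plus the configured retries
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return activity()
        except Exception as exc:  # treat any exception as a possibly transient failure
            last_error = exc
    notify(f"Activity failed after {attempts} attempts: {last_error}")
    return None


# Hypothetical usage: an activity that fails twice, then succeeds.
calls = {"count": 0}


def flaky_activity() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


messages = []
result = run_with_retries(flaky_activity, max_retries=3, notify=messages.append)
```

In this run the third attempt succeeds, so `notify` is never called. An activity that fails on every attempt would produce exactly one notification.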
Simple — Creating a pipeline is quick and easy via our drag-and-drop console. Common preconditions are built into the service, so you don’t need to write any extra logic to use them. For example, you can check for the existence of an Amazon S3 file by simply providing the name of the Amazon S3 bucket and the path of the file that you want to check for, and AWS Data Pipeline does the rest. In addition to its easy visual pipeline creator, AWS Data Pipeline provides a library of pipeline templates. These templates make it simple to create pipelines for a number of more complex use cases, such as regularly processing your log files, archiving data to Amazon S3, or running periodic SQL queries.
Flexible — AWS Data Pipeline allows you to take advantage of a variety of features such as scheduling, dependency tracking, and error handling. You can use the activities and preconditions that AWS provides, write your own custom ones, or both. This means that you can configure a pipeline to run Amazon Elastic MapReduce jobs, execute SQL queries directly against databases, or execute custom applications running on Amazon EC2 or in your own datacenter. This allows you to create powerful custom pipelines to analyze and process your data without having to deal with the complexities of reliably scheduling and executing your application logic.
Scalable — AWS Data Pipeline makes it equally easy to dispatch work to one machine or many, in serial or parallel. With AWS Data Pipeline’s flexible design, processing a million files is as easy as processing a single file.
Transparent — You have full control over the computational resources that execute your business logic, making it easy to enhance or debug your logic. Additionally, full execution logs are automatically delivered to Amazon S3, giving you a persistent, detailed record of what has happened in your pipeline.
AWS Data Pipeline is currently available in the US East region. Pay only for what you use – there is no minimum fee.
AWS Data Pipeline is billed based on how often your activities and preconditions are scheduled to run and where they run (AWS or on-premise). High Frequency activities are ones scheduled to execute more than once a day; for example, an activity scheduled to execute every hour or every 12 hours is High Frequency. Low Frequency activities are ones scheduled to execute one time a day or less.
For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month. If an Amazon EC2 activity were added to this same pipeline to produce a report based on the data in Amazon S3, the total cost of the pipeline would be $1.20 per month (two activities × $0.60 per activity per month). If the pipeline were changed to run every 6 hours, it would cost $2.00 per month, because it would then consist of two High Frequency activities (at $1.00 per month for each activity).
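The arithmetic in this example can be reproduced with a short Python sketch. It uses only the two rates quoted above for activities running on AWS ($0.60 Low Frequency, $1.00 High Frequency). On-premise rates and the Free Tier allowance are not modeled, and the function names are hypothetical.

```python
from typing import Iterable

# Monthly rates in cents for activities running on AWS, as quoted in the example.
LOW_FREQUENCY_CENTS = 60
HIGH_FREQUENCY_CENTS = 100


def is_high_frequency(runs_per_day: float) -> bool:
    """High Frequency means scheduled to execute more than once a day."""
    return runs_per_day > 1


def monthly_cost(runs_per_day_by_activity: Iterable[float]) -> float:
    """Return the monthly charge in dollars for a pipeline's activities.

    Each element is how many times per day one activity (or precondition)
    is scheduled to run. Amounts are summed in cents to avoid rounding drift.
    """
    total_cents = 0
    for runs_per_day in runs_per_day_by_activity:
        if is_high_frequency(runs_per_day):
            total_cents += HIGH_FREQUENCY_CENTS
        else:
            total_cents += LOW_FREQUENCY_CENTS
    return total_cents / 100


print(f"${monthly_cost([1]):.2f}")     # one daily activity: $0.60
print(f"${monthly_cost([1, 1]):.2f}")  # two daily activities: $1.20
print(f"${monthly_cost([4, 4]):.2f}")  # two activities every 6 hours: $2.00
```

Running every 6 hours means four runs per day, which is why both activities move to the High Frequency rate in the last case.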
Activities or preconditions that are active for part of a month are pro-rated hourly. Your Amazon EC2, Amazon Elastic MapReduce, Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon SNS activities associated with AWS Data Pipeline are billed separately according to those services’ normal prices.
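One way to picture hourly pro-rating is the Python sketch below. The text states only that charges are pro-rated hourly. The formula used here (monthly rate × hours active ÷ hours in that calendar month) is an assumption made for illustration, as are the function names and the example month.

```python
import calendar


def hours_in_month(year: int, month: int) -> int:
    """Number of hours in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24


def prorated_charge(monthly_rate: float, hours_active: float, year: int, month: int) -> float:
    """Assumed pro-rating: charge the fraction of the month the activity was active."""
    total_hours = hours_in_month(year, month)
    # Clamp to the valid range so the charge never exceeds the full monthly rate.
    hours = min(max(hours_active, 0), total_hours)
    return monthly_rate * hours / total_hours


# Hypothetical case: a Low Frequency activity ($0.60 per month) that was
# active for 360 of the 720 hours in April 2013 (a 30-day month).
charge = prorated_charge(0.60, 360, 2013, 4)
```

Under this assumption, an activity that is active for half of a 30-day month is charged half of its monthly rate.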
* Your free usage is calculated each month and automatically applied to your bill – free usage does not accumulate; unused free usage cannot be rolled over into the next month.