AWS Data Pipeline is a managed extract-transform-load (ETL) service that helps you reliably and cost-effectively move and process data between your on-premises data stores and AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon Redshift. Common use cases include loading data from on-premises databases and S3 into Amazon RDS or Amazon Redshift, cleaning and analyzing unstructured log data with Amazon Elastic MapReduce (EMR), and scheduling data backups for services such as DynamoDB.
AWS Data Pipeline lets you schedule your pipeline to execute based on your business logic. For example, one pipeline might run every hour to write log data to Amazon S3, while another runs daily to analyze that log data and load the results into Amazon Redshift. You can also schedule a pipeline to run once, or to backfill tasks for time intervals in the past.
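As a sketch, an hourly schedule like this is expressed as a Schedule object in the pipeline definition JSON, which activities then reference. The IDs and start date below are placeholders; a `startDateTime` in the past causes the pipeline to backfill runs for the elapsed intervals:

```json
{
  "objects": [
    {
      "id": "HourlySchedule",
      "name": "HourlySchedule",
      "type": "Schedule",
      "period": "1 hours",
      "startDateTime": "2024-01-01T00:00:00"
    },
    {
      "id": "CopyLogsToS3",
      "type": "CopyActivity",
      "schedule": { "ref": "HourlySchedule" },
      "input": { "ref": "LogSource" },
      "output": { "ref": "LogBucket" },
      "runsOn": { "ref": "WorkerInstance" }
    }
  ]
}
```

Here `LogSource`, `LogBucket`, and `WorkerInstance` stand in for data-node and resource objects that would be defined elsewhere in the same pipeline.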
If transient failures occur in your activity logic or data sources, AWS Data Pipeline automatically retries the activity. If the failure persists, AWS Data Pipeline sends you a failure notification via Amazon Simple Notification Service (Amazon SNS). You can configure notifications for successful runs, delays in planned activities, or failures.
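A failure notification can be wired up as an SnsAlarm object referenced from an activity's `onFail` field (activities similarly support `onSuccess` and `onLateAction`). The topic ARN and message text below are illustrative placeholders:

```json
{
  "objects": [
    {
      "id": "FailureAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
      "subject": "Pipeline activity failed",
      "message": "Activity #{node.name} failed."
    },
    {
      "id": "CopyLogsToS3",
      "type": "CopyActivity",
      "onFail": { "ref": "FailureAlarm" },
      "input": { "ref": "LogSource" },
      "output": { "ref": "LogBucket" },
      "runsOn": { "ref": "WorkerInstance" }
    }
  ]
}
```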
AWS Data Pipeline removes the barriers between your on-premises resources and your AWS data sources. A single pipeline can handle data flows across all your data stores, regardless of their location. Data Pipeline's native connectors for various AWS compute and storage services further make it easy to define how your data is processed and stored. For example, you can configure a pipeline to copy clickstream data from your on-premises source to S3, use a HiveActivity to spin up an EMR cluster and analyze the logs, and then write the output to Redshift, where it can be easily queried.
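The analysis step of that flow might look roughly like the fragment below: a HiveActivity that runs on a pipeline-managed EMR cluster. Instance types, the Hive query, and all IDs are illustrative assumptions, and the S3 data nodes referenced as input and output would be defined elsewhere in the pipeline:

```json
{
  "objects": [
    {
      "id": "LogEmrCluster",
      "type": "EmrCluster",
      "masterInstanceType": "m5.xlarge",
      "coreInstanceType": "m5.xlarge",
      "coreInstanceCount": "2"
    },
    {
      "id": "AnalyzeLogs",
      "type": "HiveActivity",
      "runsOn": { "ref": "LogEmrCluster" },
      "input": { "ref": "RawClickstreamS3" },
      "output": { "ref": "AggregatedS3" },
      "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT page, COUNT(*) FROM ${input1} GROUP BY page;"
    }
  ]
}
```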
AWS Data Pipeline takes care of managing the resources required to execute your pipeline logic. It can create the appropriate resources on your behalf, such as an EC2 instance, and shut them down once the necessary tasks have finished, or it can connect to your existing resources to submit jobs. You can specify resource details, such as instance type, or let AWS Data Pipeline pick the right resource for the job.
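A managed compute resource of this kind is declared as an Ec2Resource object; the instance type and termination window below are placeholder choices. Any activity that names it in `runsOn` causes the instance to be launched for the run and terminated afterward:

```json
{
  "id": "WorkerInstance",
  "type": "Ec2Resource",
  "instanceType": "t3.medium",
  "terminateAfter": "2 Hours"
}
```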
Using AWS Data Pipeline, you can explicitly define data dependencies between activities using preconditions. A pipeline will not execute a task until the conditions you specify have been met, such as the availability of data in an Amazon S3 bucket. This ensures that your business logic is followed and data integrity is maintained.
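For instance, an S3KeyExists precondition can gate an activity on the presence of a specific object. The bucket, key, and IDs below are hypothetical, and the referenced input, output, and resource objects are assumed to be defined elsewhere in the pipeline:

```json
{
  "objects": [
    {
      "id": "InputReady",
      "type": "S3KeyExists",
      "s3Key": "s3://example-bucket/input/ready.flag"
    },
    {
      "id": "LoadData",
      "type": "CopyActivity",
      "precondition": { "ref": "InputReady" },
      "input": { "ref": "InputDataNode" },
      "output": { "ref": "OutputDataNode" },
      "runsOn": { "ref": "WorkerInstance" }
    }
  ]
}
```

The activity stays in a waiting state until the precondition is satisfied, which is how the data-availability guarantee described above is enforced.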