Tag: Data Pipeline

Introducing On-Demand Pipeline Execution in AWS Data Pipeline

by Marc Beitchman

Marc Beitchman is a Software Development Engineer in the AWS Database Services team

You can now trigger activation of pipelines in AWS Data Pipeline using the new on-demand schedule type. You can access this functionality through the existing AWS Data Pipeline activation API. On-demand schedules make it easy to integrate pipelines in AWS Data Pipeline with other AWS services and with on-premises orchestration engines.

For example, you can build AWS Lambda functions that activate an AWS Data Pipeline execution in response to Amazon CloudWatch cron expression events or Amazon S3 event notifications. You can also invoke the AWS Data Pipeline activation API directly from the AWS CLI and the AWS SDKs.
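As a rough sketch of the Lambda integration described above, the following Python handler activates a pipeline each time it is invoked, for example by a scheduled CloudWatch event or an S3 event notification. It assumes boto3 is available in the Lambda runtime and that the pipeline ID is supplied through an environment variable named `PIPELINE_ID`. That variable name and the optional `client` argument are illustrative choices, not part of the service.

```python
import os


def lambda_handler(event, context, client=None):
    """Activate a pipeline each time the function is invoked."""
    if client is None:
        # Imported lazily so the handler can be exercised locally with a stub client.
        import boto3
        client = boto3.client("datapipeline")

    # PIPELINE_ID is a hypothetical environment variable holding the pipeline ID.
    pipeline_id = os.environ["PIPELINE_ID"]

    # Call the Data Pipeline activation API for the configured pipeline.
    client.activate_pipeline(pipelineId=pipeline_id)
    return {"activated": pipeline_id}
```

Passing the client as an optional argument keeps the handler compatible with the two-argument signature Lambda uses while allowing a stub client in local tests.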

To get started, create a new pipeline and, on its Default object, set the property "scheduleType": "ondemand". Setting this property enables on-demand activation of the pipeline.
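The fragment below is a minimal sketch of where that property sits in a pipeline definition file, built as a Python dictionary and printed as JSON. It shows only the Default object. A working pipeline would also need roles, activities, and resources, which are omitted here, and the function name is purely illustrative.

```python
import json


def on_demand_definition():
    """Return a minimal definition whose Default object uses on-demand scheduling."""
    return {
        "objects": [
            {
                "id": "Default",
                "name": "Default",
                # Run only when the pipeline is activated, not on a recurring schedule.
                "scheduleType": "ondemand",
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(on_demand_definition(), indent=2))
```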


Automating Analytic Workflows on AWS

by Wangechi Doble

Wangechi Doble is a Solutions Architect with AWS

Organizations are experiencing a proliferation of data. This data includes logs, sensor data, social media data, and transactional data, and resides in the cloud, on premises, or as high-volume, real-time data feeds. It is increasingly important to analyze this data: stakeholders want information that is timely, accurate, and reliable. This analysis ranges from simple batch processing to complex real-time event processing. Automating workflows can ensure that necessary activities take place when required to drive the analytic processes.

With Amazon Simple Workflow (Amazon SWF), AWS Data Pipeline, and AWS Lambda, you can build analytic solutions that are automated, repeatable, scalable, and reliable. In this post, I show you how to use these services to migrate and scale an on-premises data analytics workload.

Workflow basics

A business process can be represented as a workflow. Applications often incorporate a workflow as steps that must take place in a predefined order, with opportunities to adjust the flow of information based on certain decisions or special cases.
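To make the idea concrete, here is a small Python sketch of a workflow as ordered steps with one decision point. The step names, the sample rows, and the quarantine branch are all invented for illustration and do not correspond to any AWS service API.

```python
def extract():
    """Step 1: pull raw rows from a source (hard-coded sample data here)."""
    return [{"id": 1, "value": "10"}, {"id": 2, "value": "bad"}]


def transform(rows):
    """Step 2: convert valid rows and set aside the ones that fail validation."""
    clean, rejected = [], []
    for row in rows:
        if row["value"].isdigit():
            clean.append({"id": row["id"], "value": int(row["value"])})
        else:
            rejected.append(row)
    return clean, rejected


def run_workflow():
    """Run the steps in a fixed order, branching when rows are rejected."""
    log = []
    rows = extract()
    log.append("extract")
    clean, rejected = transform(rows)
    log.append("transform")
    if rejected:
        # Special case: an extra step runs only when bad rows were found.
        log.append("quarantine")
    log.append("load")
    return log, clean
```

A workflow service adds what this sketch leaves out: durable state between steps, retries, and scheduling.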

The following is an example of an ETL workflow:


Using AWS Data Pipeline’s Parameterized Templates to Build Your Own Library of ETL Use-case Definitions

by Leena Joseph

Leena Joseph is an SDE for AWS Data Pipeline

In an earlier post, we introduced you to ETL processing using AWS Data Pipeline and Amazon EMR. This post shows how to build ETL workflow templates with AWS Data Pipeline and assemble them into a library of recipes for common use cases. It introduces parameterized templates, which serve as proven recipes that you can share as a library of reusable pipeline definitions within your organization or contribute to the larger community.

Data Pipeline allows you to define complex workflows for data movement and transformation. You can create complex data processing workloads, manage inter-task dependencies, and configure retries for transient failures and failure notifications for individual tasks.
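To illustrate what a parameterized template looks like, the sketch below declares two parameters and references them from an activity with the `#{myName}` placeholder syntax. The activity, parameter names, and S3 paths are hypothetical, and required fields such as the resource the activity runs on are omitted. The `render` helper only imitates substitution locally so the effect is visible. In a real pipeline, the service resolves the placeholders from the values you supply.

```python
import re

# Hypothetical template: parameter IDs start with "my" and are referenced as #{myName}.
TEMPLATE = {
    "objects": [
        {
            "id": "CopyLogs",
            "name": "CopyLogs",
            "type": "ShellCommandActivity",
            "command": "aws s3 cp #{myInputLocation} #{myOutputLocation} --recursive",
        }
    ],
    "parameters": [
        {"id": "myInputLocation", "type": "AWS::S3::ObjectKey",
         "description": "Input S3 folder"},
        {"id": "myOutputLocation", "type": "AWS::S3::ObjectKey",
         "description": "Output S3 folder"},
    ],
}

PLACEHOLDER = re.compile(r"#\{(my\w+)\}")


def render(template, values):
    """Return the template's objects with each placeholder replaced by its value."""
    declared = {param["id"] for param in template["parameters"]}
    missing = declared - set(values)
    if missing:
        raise ValueError(f"missing parameter values: {sorted(missing)}")

    def fill(text):
        return PLACEHOLDER.sub(lambda match: values[match.group(1)], text)

    return [
        {key: fill(val) if isinstance(val, str) else val for key, val in obj.items()}
        for obj in template["objects"]
    ]
```

The same template can then be reused for different inputs by changing only the values, for example `render(TEMPLATE, {"myInputLocation": "s3://in/", "myOutputLocation": "s3://out/"})`.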

Creating Pipelines Using Built-in Templates

Data Pipeline supports a library of parameterized templates for common ETL use cases. The Create Pipeline page on the AWS Management Console provides an option to create your pipeline using templates.

Figure: Create Pipeline page on the AWS Management Console


ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce

by Manjeet Chayel

Manjeet Chayel is an AWS Solutions Architect

This blog post shows you how to build an ETL workflow that uses AWS Data Pipeline to schedule an Amazon Elastic MapReduce (Amazon EMR) cluster to clean and process web server logs stored in an Amazon Simple Storage Service (Amazon S3) bucket. AWS Data Pipeline is an ETL service that you can use to automate the movement and transformation of data. It launches an Amazon EMR cluster for each scheduled interval, submits jobs as steps to the cluster, and terminates the cluster after tasks have completed.

In this post, you’ll create the following ETL workflow:

To create the workflow, we’ll use the Pig and Hive examples discussed in the blog post “Ensuring Consistency when Using Amazon S3 and Amazon EMR.” This ETL workflow pushes web server logs to an Amazon S3 bucket, cleans and filters the data using Pig scripts, and then generates analytical reports from this data using Hive scripts. AWS Data Pipeline allows you to run this workflow on a schedule that starts in the future, and lets you backfill data by scheduling a pipeline to run from a start date in the past.
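The following Python sketch shows roughly how such a pipeline could be described: a daily schedule, an EMR cluster that is terminated after a time limit, and an activity that submits a step to it. The object IDs, the two-hour limit, and the step string with its S3 paths are hypothetical placeholders, not the definitions used in this post. The `backfill_runs` helper is only a local approximation of which complete daily intervals already lie in the past when the start date is earlier than today. The service's actual behavior depends on the schedule type.

```python
from datetime import datetime, timedelta


def daily_schedule(start):
    """Build a Schedule object that runs once a day from the given start time."""
    return {
        "id": "DailySchedule",
        "name": "DailySchedule",
        "type": "Schedule",
        "period": "1 day",
        "startDateTime": start.strftime("%Y-%m-%dT%H:%M:%S"),
    }


def etl_objects(start):
    """Return a schedule, an EMR cluster, and an activity that runs on it."""
    schedule_ref = {"ref": "DailySchedule"}
    return [
        daily_schedule(start),
        {
            "id": "LogCluster",
            "name": "LogCluster",
            "type": "EmrCluster",
            "schedule": schedule_ref,
            # Hypothetical time limit after which the cluster is shut down.
            "terminateAfter": "2 Hours",
        },
        {
            "id": "CleanLogs",
            "name": "CleanLogs",
            "type": "EmrActivity",
            "schedule": schedule_ref,
            "runsOn": {"ref": "LogCluster"},
            # Hypothetical step: a jar location followed by its arguments.
            "step": "s3://example-bucket/jars/clean-logs.jar,s3://example-bucket/logs/",
        },
    ]


def backfill_runs(start, now, period=timedelta(days=1)):
    """List the start times of complete intervals that are already in the past."""
    runs = []
    current = start
    while current + period <= now:
        runs.append(current)
        current += period
    return runs
```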