AWS Big Data Blog
Category: AWS Data Pipeline
Analyze data in Amazon DynamoDB using Amazon SageMaker for real-time prediction
In this post, I describe how to read the DynamoDB backup file format in Data Pipeline, convert the objects in S3 to a CSV format that Amazon ML can read, and schedule regular exports and transformations using Data Pipeline.
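As a rough sketch of the conversion step, the snippet below flattens DynamoDB-typed JSON records into a plain CSV. It assumes the export is newline-delimited JSON in which each line maps attribute names to typed values such as {"user_id": {"s": "u1"}, "score": {"n": "0.87"}}; the attribute names and file paths are hypothetical.

```python
# Sketch: flatten a DynamoDB-typed JSON export into a CSV file.
# Attribute names and paths below are hypothetical placeholders.
import csv
import json

FIELDS = ["user_id", "score"]  # hypothetical attribute names

def untype(value):
    # DynamoDB stores each value under a one-letter type key ("s", "n", ...).
    return next(iter(value.values()))

with open("dynamodb-export.json") as src, open("training.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(FIELDS)
    for line in src:
        item = json.loads(line)
        writer.writerow([untype(item[f]) for f in FIELDS])
```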
How Realtor.com Monitors Amazon Athena Usage with AWS CloudTrail and Amazon QuickSight
In this post, I discuss how to build a solution for monitoring Athena usage. The solution relies on AWS CloudTrail, a web service that records AWS API calls for your AWS account and delivers log files to an S3 bucket.
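To give a feel for the kind of analysis this enables, here is a hedged sketch that runs an Athena query over CloudTrail logs to count Athena API calls per user. The cloudtrail_logs table and results bucket are hypothetical; the eventsource and useridentity fields follow the documented CloudTrail record format.

```python
# Sketch: count Athena API calls per user from CloudTrail logs that have
# already been registered as an Athena table. Table and bucket names are
# hypothetical placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT useridentity.username AS user, eventname, count(*) AS calls
FROM cloudtrail_logs                      -- hypothetical table over the CloudTrail S3 prefix
WHERE eventsource = 'athena.amazonaws.com'
GROUP BY useridentity.username, eventname
ORDER BY calls DESC
"""

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
```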
Introducing On-Demand Pipeline Execution in AWS Data Pipeline
Marc Beitchman is a Software Development Engineer on the AWS Database Services team. It is now possible to trigger activation of pipelines in AWS Data Pipeline using the new on-demand schedule type. You can access this functionality through the existing AWS Data Pipeline activation API. On-demand schedules make it easy to integrate pipelines in AWS […]
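A minimal sketch of on-demand activation through the existing API, assuming a pipeline whose definition already uses the ondemand schedule type; the pipeline ID is hypothetical:

```python
# Sketch: trigger an on-demand run of an existing pipeline via the
# ActivatePipeline API. The pipeline ID is a hypothetical placeholder.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")
dp.activate_pipeline(pipelineId="df-0123456789ABCDEF")  # hypothetical ID
```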
Using AWS Lambda for Event-driven Data Processing Pipelines
Vadim Astakhov is a Solutions Architect with AWS. Some big data customers want to analyze new data in response to a specific event, and they might already have well-defined pipelines to perform batch processing, orchestrated by AWS Data Pipeline. One example of event-triggered pipelines is when data analysts must analyze data as soon as it […]
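One common shape for such an event-triggered pipeline is a Lambda function, wired to an S3 event, that activates an on-demand pipeline. A minimal sketch, with a hypothetical pipeline ID:

```python
# Sketch: a Lambda handler that activates a Data Pipeline when a new object
# lands in S3, making an existing batch pipeline event-driven. The pipeline
# ID is hypothetical; the function assumes an S3 event trigger.
import boto3

PIPELINE_ID = "df-0123456789ABCDEF"  # hypothetical on-demand pipeline

def lambda_handler(event, context):
    key = event["Records"][0]["s3"]["object"]["key"]
    print(f"New object {key}; activating pipeline {PIPELINE_ID}")
    boto3.client("datapipeline").activate_pipeline(pipelineId=PIPELINE_ID)
```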
Automating Analytic Workflows on AWS
Wangechi Doble is a Solutions Architect with AWS. Organizations are experiencing a proliferation of data. This data includes logs, sensor data, social media data, and transactional data, and it resides in the cloud, on premises, or arrives as high-volume, real-time data feeds. It is increasingly important to analyze this data: stakeholders want information that is timely, accurate, […]
How Coursera Manages Large-Scale ETL using AWS Data Pipeline and Dataduct
This is a guest post by Sourabh Bajaj, a Software Engineer at Coursera. Coursera, in their own words: “Coursera is an online educational startup with over 14 million learners across the globe. We offer more than 1,000 courses from over 120 top universities.” At Coursera, we use Amazon Redshift as our primary data warehouse because […]
Using AWS Data Pipeline’s Parameterized Templates to Build Your Own Library of ETL Use-case Definitions
Leena Joseph is an SDE for AWS Data Pipeline. In an earlier post, we introduced you to ETL processing using AWS Data Pipeline and Amazon EMR. This post shows how to build ETL workflow templates with AWS Data Pipeline and build a library of recipes to implement common use cases. This is an introduction to […]
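As a hedged illustration of what a parameterized definition looks like through the API, the sketch below registers a template that exposes an input-path parameter and references it from a data node. The pipeline ID, parameter names, and values are hypothetical; the structures follow the PutPipelineDefinition request shape.

```python
# Sketch: register a pipeline definition that exposes a parameter, so one
# template can back multiple ETL use cases. IDs and values are hypothetical.
import boto3

dp = boto3.client("datapipeline")

dp.put_pipeline_definition(
    pipelineId="df-0123456789ABCDEF",  # hypothetical
    pipelineObjects=[
        {"id": "Default", "name": "Default",
         "fields": [{"key": "scheduleType", "stringValue": "ondemand"}]},
        {"id": "InputData", "name": "InputData",
         "fields": [{"key": "type", "stringValue": "S3DataNode"},
                    # "#{myInputPath}" resolves to the parameter value below.
                    {"key": "directoryPath", "stringValue": "#{myInputPath}"}]},
    ],
    parameterObjects=[
        {"id": "myInputPath",
         "attributes": [{"key": "type", "stringValue": "AWS::S3::ObjectKey"}]},
    ],
    parameterValues=[
        {"id": "myInputPath", "stringValue": "s3://my-bucket/input/"},  # hypothetical
    ],
)
```

Each use case in the library can then reuse the same objects and supply different parameterValues at activation time.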
ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce
Manjeet Chayel is an AWS Solutions Architect. This blog post shows you how to build an ETL workflow that uses AWS Data Pipeline to schedule an Amazon Elastic MapReduce (Amazon EMR) cluster to clean and process web server logs stored in an Amazon Simple Storage Service (Amazon S3) bucket. AWS Data Pipeline is an ETL […]
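A rough sketch of the core objects such a pipeline needs: an EmrCluster resource and an EmrActivity that runs on it. The bucket, script, and step command are hypothetical placeholders; EmrCluster and EmrActivity are the documented Data Pipeline object types.

```python
# Sketch: create a pipeline that launches an EMR cluster and runs a
# log-processing step over data in S3. Names and the step command are
# hypothetical; many optional fields (instance types, etc.) are elided.
import boto3

dp = boto3.client("datapipeline")
pipeline = dp.create_pipeline(name="log-etl", uniqueId="log-etl-demo")

dp.put_pipeline_definition(
    pipelineId=pipeline["pipelineId"],
    pipelineObjects=[
        {"id": "Default", "name": "Default",
         "fields": [{"key": "scheduleType", "stringValue": "ondemand"},
                    {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                    {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"}]},
        {"id": "EmrClusterForLogs", "name": "EmrClusterForLogs",
         "fields": [{"key": "type", "stringValue": "EmrCluster"}]},
        {"id": "ProcessLogs", "name": "ProcessLogs",
         "fields": [{"key": "type", "stringValue": "EmrActivity"},
                    {"key": "runsOn", "refValue": "EmrClusterForLogs"},
                    # Hypothetical step that cleans raw logs with a Spark script.
                    {"key": "step", "stringValue":
                     "command-runner.jar,spark-submit,s3://my-bucket/clean_logs.py"}]},
    ],
)
```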