AWS Big Data Blog

AWS re:Invent 2016 Registration is Now Open

by Andy Werth

Register now for the fifth annual AWS re:Invent, the largest gathering of the global cloud computing community. Join us in Las Vegas for opportunities to connect, collaborate, and learn about AWS solutions. There will be many opportunities for developers and data scientists working in big data to sharpen their skills and learn what’s coming next […]

Simplify Management of Amazon Redshift Snapshots using AWS Lambda

Ian Meyers is a Solutions Architecture Senior Manager with AWS. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. A cluster is automatically backed up to Amazon S3 by default, and three automatic snapshots of the cluster […]

How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content

This is a guest post by Takumi Sakamoto, a software engineer at SmartNews. SmartNews in their own words: “SmartNews is a machine learning-based news discovery app that delivers the very best stories on the Web for more than 18 million users worldwide.” Data processing is one of the key technologies for SmartNews. Every team’s workload […]

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Kiuk Chung is a Software Development Engineer with the Amazon Personalization team. In Personalization at Amazon, we use neural networks to generate personalized product recommendations for our customers. Amazon’s product catalog is huge compared to the number of products that a customer has purchased, making our datasets extremely sparse. And with hundreds of millions of […]

Month in Review: June 2016

by Andy Werth

Lots to see on the Big Data Blog in June! Please take a look at the summaries below for something that catches your interest.

Use Sqoop to Transfer Data from Amazon EMR to Amazon RDS: Customers commonly process and transform vast amounts of data with EMR and then transfer and store summaries or aggregates of […]

Use Sqoop to Transfer Data from Amazon EMR to Amazon RDS

Sai Sriparasa is a consultant with AWS Professional Services. Customers commonly process and transform vast amounts of data with Amazon EMR and then transfer and store summaries or aggregates of that data in relational databases such as MySQL or Oracle. This allows the storage footprint in these relational databases to be much smaller, yet retain […]

Analyze Realtime Data from Amazon Kinesis Streams Using Zeppelin and Spark Streaming

Manjeet Chayel is a Solutions Architect with AWS. Streaming data is everywhere. This includes clickstream data, data from sensors, data emitted from billions of IoT devices, and more. Not surprisingly, data scientists want to analyze and explore these data streams in real time. This post shows you how you can use Spark Streaming to process […]

Apache Tez Now Available with Amazon EMR

Moataz Anany is a Solutions Architect with AWS. Amazon EMR has added Apache Tez version 0.8.3 as a supported application in release 4.7.0. Tez is an extensible framework for building batch and interactive data processing applications on top of Hadoop YARN. By processing data flows and computations as Directed Acyclic Graphs (DAGs), Tez provides a more […]

Processing Amazon DynamoDB Streams Using the Amazon Kinesis Client Library

Asmita Barve-Karandikar is an SDE with DynamoDB. Customers often want to process streams on an Amazon DynamoDB table with a significant number of partitions or with a high throughput. AWS Lambda and the DynamoDB Streams Kinesis Adapter are two ways to consume DynamoDB streams in a scalable way. While Lambda lets you run your application […]

Use Apache Oozie Workflows to Automate Apache Spark Jobs (and more!) on Amazon EMR

Mike Grimes is an SDE with Amazon EMR. As a developer or data scientist, you rarely want to run a single serial job on an Apache Spark cluster. More often, to gain insight from your data you need to process it in multiple, possibly tiered steps, and then move the data into another format and […]
