AWS Big Data Blog

Category: Amazon EMR

Integrating IoT Events into Your Analytic Platform

Veronika Megler, Ph.D., is a Senior Consultant with AWS Professional Services. “We have a fleet of vehicles, with GPS and a bunch of other sensors,” said Bob, the VP at a delivery company. “Today they send their update ‘breadcrumbs’ to another IoT service. We’re planning to have them send their breadcrumbs to AWS IoT instead; […]

Processing VPC Flow Logs with Amazon EMR

Michael Wallman is a senior consultant with AWS ProServ. It’s easy to understand network patterns in small AWS deployments where software stacks are well defined and managed. But as teams and usage grow, it gets harder to understand which systems communicate with each other, and on what ports. This often results in overly permissive security […]

Building and Deploying Custom Applications with Apache Bigtop and Amazon EMR

Hernan Vivani is a Hadoop Systems Engineer for Amazon Web Services. When you launch a cluster, Amazon EMR lets you choose applications that will run on your cluster. But what if you want to deploy your own custom application? This post shows you how to build a custom application for EMR for Apache Bigtop-based releases 4.x and greater. EMR […]

Use Spark 2.0, Hive 2.1 on Tez, and the latest from the Hadoop ecosystem on Amazon EMR release 5.0

Jonathan Fritz is a Senior Product Manager for Amazon EMR. We are excited to launch Amazon EMR release 5.0 today, giving customers the latest versions of 16 supported open-source applications in the big data ecosystem, including new major versions of Spark and Hive. Almost exactly a year ago, we shipped release 4.0, which brought significant […]

Installing and Running JobServer for Apache Spark on Amazon EMR

Derek Graeber is a senior consultant in big data analytics for AWS Professional Services. Working with customers who are running Apache Spark on Amazon EMR, I run into the scenario where data loaded into a SparkContext can and should be shared across multiple use cases. They ask a very valid question: “Once I load the […]

How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content

This is a guest post by Takumi Sakamoto, a software engineer at SmartNews. SmartNews in their own words: “SmartNews is a machine learning-based news discovery app that delivers the very best stories on the Web for more than 18 million users worldwide.” Data processing is one of the key technologies for SmartNews. Every team’s workload […]

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Kiuk Chung is a Software Development Engineer with the Amazon Personalization team. In Personalization at Amazon, we use neural networks to generate personalized product recommendations for our customers. Amazon’s product catalog is huge compared to the number of products that a customer has purchased, making our datasets extremely sparse. And with hundreds of millions of […]

Use Sqoop to Transfer Data from Amazon EMR to Amazon RDS

Sai Sriparasa is a consultant with AWS Professional Services. Customers commonly process and transform vast amounts of data with Amazon EMR and then transfer and store summaries or aggregates of that data in relational databases such as MySQL or Oracle. This allows the storage footprint in these relational databases to be much smaller, yet retain […]

Analyze Realtime Data from Amazon Kinesis Streams Using Zeppelin and Spark Streaming

Manjeet Chayel is a Solutions Architect with AWS. Streaming data is everywhere. This includes clickstream data, data from sensors, data emitted from billions of IoT devices, and more. Not surprisingly, data scientists want to analyze and explore these data streams in real time. This post shows you how you can use Spark Streaming to process […]

Apache Tez Now Available with Amazon EMR

Moataz Anany is a Solutions Architect with AWS. Amazon EMR has added Apache Tez version 0.8.3 as a supported application in release 4.7.0. Tez is an extensible framework for building batch and interactive data processing applications on top of Hadoop YARN. By processing data flows and computations as Directed Acyclic Graphs (DAGs), Tez provides a more […]
