AWS Big Data Blog

Category: Amazon EMR*

Real-time Stream Processing Using Apache Spark Streaming and Apache Kafka on AWS

Prasad Alle is a consultant with AWS Professional Services Intuit, a creator of business and financial management solutions, is a leading enterprise customer for AWS. The Intuit Data team (IDEA) at Intuit is responsible for building platforms and products that enable a data-driven personalized experience across Intuit products and services. One dimension of this platform […]

Read More

Amazon EMR-DynamoDB Connector Repository on AWSLabs GitHub

Mike Grimes is a Software Development Engineer with Amazon EMR Amazon Web Services is excited to announce that the Amazon EMR-DynamoDB Connector is now open-source. The EMR-DynamoDB Connector is a set of libraries that lets you access data stored in DynamoDB with Spark, Hadoop MapReduce, and Hive jobs. These libraries are currently shipped with EMR […]

Read More

Encrypt Data At-Rest and In-Flight on Amazon EMR with Security Configurations

Customers running analytics, stream processing, machine learning, and ETL workloads on personally identifiable information, health information, and financial data have strict requirements for encryption of data at-rest and in-transit. The Apache Spark and Hadoop ecosystems lend themselves to these big data use cases, and customers have asked us to provide a quick and easy way […]

Read More

Integrating IoT Events into Your Analytic Platform

Veronika Megler, Ph.D., is a Senior Consultant with AWS Professional Services “We have a fleet of vehicles, with GPS and a bunch of other sensors,” said Bob, the VP at a delivery company. “Today they send their update ‘breadcrumbs’ to another IoT service. We’re planning to have them send their breadcrumbs to AWS IoT instead; […]

Read More

Processing VPC Flow Logs with Amazon EMR

Michael Wallman is a senior consultant with AWS ProServ It’s easy to understand network patterns in small AWS deployments where software stacks are well defined and managed. But as teams and usage grow, its gets harder to understand which systems communicate with each other, and on what ports. This often results in overly permissive security […]

Read More

Building and Deploying Custom Applications with Apache Bigtop and Amazon EMR

Hernan Vivani is an Hadoop Systems Engineer for Amazon Web Services When you launch a cluster, Amazon EMR lets you choose applications that will run on your cluster. But what if you want to deploy your own custom application? This post shows you how to build a custom application for EMR for Apache Bigtop-based releases 4.x and greater. EMR […]

Read More

Use Spark 2.0, Hive 2.1 on Tez, and the latest from the Hadoop ecosystem on Amazon EMR release 5.0

Jonathan Fritz is a Senior Product Manager for Amazon EMR We are excited to launch Amazon EMR release 5.0 today, giving customers the latest versions of 16 supported open-source applications in the big data ecosystem, including new major versions of Spark and Hive. Almost exactly a year ago, we shipped release 4.0, which brought significant […]

Read More

Installing and Running JobServer for Apache Spark on Amazon EMR

Derek Graeber is a senior consultant in big data analytics for AWS Professional Services Working with customers who are running Apache Spark on Amazon EMR, I run into the scenario where data loaded into a SparkContext can and should be shared across multiple use cases. They ask a very valid question: “Once I load the […]

Read More

How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content

This is a guest post by Takumi Sakamoto, a software engineer at SmartNews. SmartNews in their own words: “SmartNews is a machine learning-based news discovery app that delivers the very best stories on the Web for more than 18 million users worldwide.” Data processing is one of the key technologies for SmartNews. Every team’s workload […]

Read More

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Kiuk Chung is a Software Development Engineer with the Amazon Personalization team In Personalization at Amazon, we use neural networks to generate personalized product recommendations for our customers. Amazon’s product catalog is huge compared to the number of products that a customer has purchased, making our datasets extremely sparse. And with hundreds of millions of […]

Read More