AWS Big Data Blog

Encrypt Data At-Rest and In-Flight on Amazon EMR with Security Configurations

Customers running analytics, stream processing, machine learning, and ETL workloads on personally identifiable information, health information, and financial data have strict requirements for encryption of data at-rest and in-transit. The Apache Spark and Hadoop ecosystems lend themselves to these big data use cases, and customers have asked us to provide a quick and easy way to encrypt data at-rest and data in-transit between nodes in each execution framework.


Month in Review: August 2016

Another month of big data solutions on the Big Data Blog. Take a look at our summaries below and learn, comment, and share. Thanks for reading!

Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning

With this post, learn how to apply advanced analytics concepts like pattern analysis and machine learning to do risk […]


Processing VPC Flow Logs with Amazon EMR

In this post, I show you how to gain valuable insight into your network by using Amazon EMR and Amazon VPC Flow Logs. The walkthrough implements a pattern often found in network equipment called ‘Top Talkers’, an ordered list of the heaviest network users, but the model can also be used for many other types of network analysis.
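
To make the 'Top Talkers' idea concrete, here is a minimal single-machine sketch in Python. It is not the EMR implementation from the post. It assumes records in the default VPC Flow Logs format (14 space-separated fields, with the source address in the fourth field and the byte count in the tenth); the sample records and the `top_talkers` function name are made up for illustration.

```python
from collections import Counter

# Made-up sample records in the default VPC Flow Logs field order:
# version account-id interface-id srcaddr dstaddr srcport dstport
# protocol packets bytes start end action log-status
SAMPLE_RECORDS = [
    "2 123456789012 eni-0a1b2c3d 10.0.0.5 10.0.1.9 49152 443 6 10 8400 1472688000 1472688060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.0.7 10.0.1.9 49153 443 6 4 1200 1472688000 1472688060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.0.5 10.0.1.8 49154 80 6 2 600 1472688000 1472688060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.0.9 10.0.1.9 49155 22 6 8 5000 1472688000 1472688060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1472688000 1472688060 - NODATA",
]


def top_talkers(records, n=3):
    """Return the n source addresses that sent the most bytes, heaviest first."""
    totals = Counter()
    for line in records:
        fields = line.split()
        if len(fields) != 14:
            continue  # skip lines that do not match the assumed format
        srcaddr, byte_count = fields[3], fields[9]
        if not byte_count.isdigit():
            continue  # records without traffic carry "-" instead of a number
        totals[srcaddr] += int(byte_count)
    return totals.most_common(n)


if __name__ == "__main__":
    for address, total_bytes in top_talkers(SAMPLE_RECORDS, n=2):
        print(address, total_bytes)
```

The same group-by-and-sort shape carries over to other breakdowns, for example keying on destination port instead of source address.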


Data Lake Ingestion: Automatically Partition Hive External Tables with AWS

In this post, I introduce a simple data ingestion and preparation framework based on AWS Lambda, Amazon DynamoDB, and Apache Hive on EMR for data from different sources landing in S3. This solution lets Hive pick up new partitions as data is loaded into S3 because Hive by itself cannot detect new partitions as data lands.
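
As a rough illustration of what registering a new partition involves, the Python sketch below builds a Hive `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` statement from the key of an object that has landed in S3. It is not the framework from the post and leaves out the Lambda and DynamoDB pieces entirely. It assumes keys follow a Hive-style `name=value` directory layout; the function name `partition_ddl`, the table name, and the bucket name are hypothetical.

```python
def partition_ddl(table, bucket, s3_key):
    """Build the HiveQL that registers the partition implied by an S3 object key.

    Assumes Hive-style key layout, e.g. logs/dt=2016-08-01/hr=13/part-0000.gz.
    Values are inserted as-is, so this sketch is only suitable for trusted keys.
    """
    # Drop the file name; the remaining prefix is the partition directory.
    prefix, _, _filename = s3_key.rpartition("/")

    # Collect (column, value) pairs from every name=value path segment.
    specs = []
    for segment in prefix.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            specs.append((name, value))
    if not specs:
        raise ValueError("no name=value partition segments in key: " + s3_key)

    spec_sql = ", ".join("{}='{}'".format(name, value) for name, value in specs)
    return (
        "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({}) "
        "LOCATION 's3://{}/{}/'".format(table, spec_sql, bucket, prefix)
    )


if __name__ == "__main__":
    print(partition_ddl("weblogs", "example-bucket",
                        "logs/dt=2016-08-01/hr=13/part-0000.gz"))
```

Using `IF NOT EXISTS` keeps the statement safe to run again when several objects land in the same partition.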
