AWS Big Data Blog

Month in Review: April 2016

Lots to see on the Big Data Blog in April! Please take a look at the summaries below for something that catches your interest.

Exploring Geospatial Intelligence using SparkR on Amazon EMR
The number of data sources that use location, such as smartphones and the sensor devices used in IoT (Internet of Things), is expanding rapidly. This growth has increased demand for spatial data analysis. Learn how to build a simple geospatial intelligence (GEOINT) application using SparkR and get a feel for what GEOINT can do.

AWS at Strata+Hadoop 2016: Building a Scalable Architecture on AWS to Process Streaming Data
Last month, Siva Raghupathy and Manjeet Chayel presented “Building a scalable architecture for processing streaming data on AWS” at Strata+Hadoop 2016 in San Jose. This post provides links to their slides and presentation.

Using CombineInputFormat to Combat Hadoop’s Small Files Problem
Many Amazon EMR customers have architectures that track events and streams and store the data in S3, which frequently produces many small files. Hadoop is well known to handle small files poorly, because each file typically becomes its own input split and map task. This post shows how to manage the problem with CombineInputFormat.
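
To make the idea of combining inputs concrete, here is a minimal Python sketch of packing many small files into fewer, larger splits. It is an illustration only, not Hadoop's actual algorithm (which also weighs node and rack locality); the function name `combine_into_splits`, the file names, and the 128 MB limit are assumptions chosen for the example.

```python
from typing import List, Tuple


def combine_into_splits(
    files: List[Tuple[str, int]], max_split_bytes: int
) -> List[List[str]]:
    """Greedily pack (name, size_in_bytes) pairs into splits.

    A new split is started whenever adding the next file would push
    the current split past max_split_bytes. A single file larger than
    the limit still gets a split of its own.
    """
    splits: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical input: 300 event files of 1 MB each.
small_files = [(f"events-{i:03d}.log", 1_000_000) for i in range(300)]
splits = combine_into_splits(small_files, max_split_bytes=128_000_000)
print(f"{len(small_files)} files packed into {len(splits)} splits")
```

Without combining, a job over these files would typically schedule one map task per file; with combining, the number of tasks follows the number of splits instead.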

Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR
This post shows you how to launch an EMR cluster with HBase and restore a table from a snapshot in Amazon S3. The table in the snapshot contains approximately 3.5 million rows. You'll perform look-ups and scans using the HBase shell, run SQL queries over the same dataset using Hive with the Hive query editor in the Hue UI, and see how query performance differs between the two.

Sharpen your Skill Set with Apache Spark on the AWS Big Data Blog
The AWS Big Data Blog has a large community of authors who are passionate about Apache Spark and who regularly publish content that helps customers use Spark to build real-world solutions. You’ll see content on a variety of topics, including deep-dives on Spark’s internals, building Spark Streaming applications, creating machine learning pipelines using MLlib, and ways to apply Spark to various real-world use cases.

Process Encrypted Data in Amazon EMR with Amazon S3 and AWS KMS
In this post, you’ll learn how easy it is to create a master key in KMS, encrypt data either client-side or server-side, upload it to S3, and have EMR seamlessly read and write that encrypted data to and from S3 using the master key that you created.


Powering Gaming Applications with Amazon DynamoDB (July 2014). Learn how to use Amazon DynamoDB to quickly build a reliable and scalable database tier for a mobile game. The post walks through a design example and shows how to power a sizable game for less than the cost of a daily cup of coffee.


Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming Data educational pages.