AWS Big Data Blog

Month in Review: July 2016

July was a busy month of big data solutions on the Big Data Blog. The month started with our most popular story yet, Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE. It was a great post to start a spectacular month. Take a look at our summaries below. Learn, comment, and share. Thank you for reading the AWS Big Data Blog!

Installing and Running JobServer for Apache Spark on Amazon EMR
In this blog post, learn how to install JobServer on EMR using a bootstrap action (BA) derived from the JobServer GitHub repository. Then, run JobServer using a sample dataset.

Process Large DynamoDB Streams Using Multiple Amazon Kinesis Client Library (KCL) Workers
A previous post, described how you can use the Amazon Kinesis Client Library (KCL) and DynamoDB Streams Kinesis Adapter to efficiently process DynamoDB streams. This post focuses on the KCL configurations that are likely to have an impact on the performance of your application when processing a large DynamoDB stream.

Simplify Management of Amazon Redshift Snapshots using AWS Lambda
In this blog post, learn about the new Amazon Redshift Utils module that helps you manage the Snapshots that your cluster creates. You supply a simple configuration, and then AWS Lambda ensures that you have cluster snapshots as frequently as required to meet your RPO.

How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content
In this post, SmartNews shows you how they built their data platform on AWS. Their current system generates tens of GBs of data from multiple data sources, and runs daily aggregation queries or machine learning algorithms on datasets with hundreds of GBs. Some outputs by machine learning algorithms are joined on data streams for gathering user feedback in near real-time (e.g. the last 5 minutes). It lets them adapt their product for users with minimum latency.

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Managing a hybrid cluster of both CPU and GPU instances poses challenges because cluster managers such as Yarn/Mesos do not natively support GPUs. Even if they did have native GPU support, the open source deep learning libraries would have to be re-written to work with the cluster manager API. This post discusses an alternate solution; namely, running separate CPU and GPU clusters, and driving the end-to-end modeling process from Apache Spark.


Will Spark Power the Data behind Precision Medicine? (March 2016)
Spark is already known for being a major player in big data analysis, but it is additionally uniquely capable in advancing genomics algorithms given the complex nature of genomics research. This post introduces gene analysis using Spark on EMR and ADAM, for those new to precision medicine.


Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.