AWS Big Data Blog

Month in Review: October 2016

Another month of big data solutions on the Big Data Blog. Take a look at our summaries below and learn, comment, and share. Thanks for reading!

Building Event-Driven Batch Analytics on AWS
Modern businesses typically collect data from internal and external sources at various frequencies throughout the day. In this post, you learn an elastic and modular approach for how to collect, process, and analyze data for event-driven applications in AWS.

How Eliza Corporation Moved Healthcare Data to the Cloud
Eliza Corporation, a company that focuses on health engagement management, acts on behalf of healthcare organizations such as hospitals, clinics, pharmacies, and insurance companies. This allows them to engage people at the right time, with the right message, and in the right medium. By meeting them where they are in life, Eliza can capture relevant metrics and analyze the overall value provided by healthcare. In this post, you explore some of the practical challenges faced during the implementation of the data lake for Eliza and the corresponding details of the ways NorthBay solved these issues with AWS.

Optimizing Amazon S3 for High Concurrency in Distributed Workloads
This post demonstrates how to optimize Amazon S3 for an architecture commonly used to enable genomic data analyses. Although the focus of this post is on genomic data analyses, the optimization can be used in any discipline that has individual source data that must be analyzed together at scale.

Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR
Sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. Sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API. This short post shows you how to run RStudio and sparklyr on EMR.

Fact or Fiction: Google BigQuery Outperforms Amazon Redshift as an Enterprise Data Warehouse?
One of the great things about the cloud is the transparency that customers have in testing and debunking overstated performance claims and misleading “benchmark” tests. This transparency encourages the best cloud vendors to publish clear and repeatable performance metrics, making it faster and easier for their customers to select the right cloud service for a given workload. To verify Google’s recent performance claims with our own testing, we ran the full TPC-H benchmark, consisting of all 22 queries, using a 10 TB dataset on Amazon Redshift against the latest version of BigQuery.

Using pgpool and Amazon ElastiCache for Query Caching with Amazon Redshift
It is easy to implement a caching solution using pgpool with Amazon Redshift and Amazon Elasticache. This solution significantly improves the end-user experience and alleviate the load on your cluster by orders of magnitude. In this blog post, we’ll use a real customer scenario to show you how to create a caching layer in front of Amazon Redshift using pgpool and Amazon ElastiCache.


Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE (July 2016)
Managing a hybrid cluster of both CPU and GPU instances poses challenges because cluster managers such as Yarn/Mesos do not natively support GPUs. Even if they did have native GPU support, the open source deep learning libraries would have to be re-written to work with the cluster manager API. This post discusses an alternate solution; namely, running separate CPU and GPU clusters, and driving the end-to-end modeling process from Apache Spark.



Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.