AWS Big Data Blog

Month in Review: March 2016

March provided another full slate of big data solutions on the AWS Big Data Blog! Take a look at the summaries below for something that catches your interest and share with anyone who’s interested in big data.

Will Spark Power the Data behind Precision Medicine?
Spark is already known as a major player in big data analysis, and it is also uniquely suited to advancing genomics algorithms, given the complex nature of genomics research. This post introduces gene analysis using Spark on EMR and ADAM for those new to precision medicine.

Crunching Statistics at Scale with SparkR on Amazon EMR
SparkR is an R package that allows you to integrate complex statistical analysis with large datasets. In this post, we introduce you to running R with the Apache SparkR project on Amazon EMR.

AWS Big Data Meetup March 31 in San Francisco: Intro to SparkR and breakout discussions
The guest speaker was Cory Dolphin from Twitter, who talked about Answers, Fabric’s real-time analytics product, which processes billions of events in real time using Twitter’s new stream processing engine, Heron. Chris Crosbie, a Solutions Architect with AWS and a statistician by training, talked about how easy and interactive cloud computing is with SparkR on Amazon EMR.

Anomaly Detection Using PySpark, Hive, and Hue on Amazon EMR
We are surrounded by more and more sensors – some of which we’re not even consciously aware of. As sensors become cheaper and easier to connect, they create an increasing flood of data that’s getting cheaper and easier to store and process. This post walks through the three major steps of anomaly detection: clustering the data, choosing the number of clusters, and detecting probable anomalies.
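To make those three steps concrete, here is a minimal sketch in plain Python. It is not the PySpark code from the post: it assumes k-means as the clustering method, a deterministic farthest-first initialization, a cost curve inspected by eye to choose the number of clusters, and a fixed distance threshold for flagging anomalies. All function names and the threshold are hypothetical and chosen only for illustration.

```python
import math


def init_centroids(points, k):
    # Assumption: farthest-first initialization, so runs are repeatable.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def kmeans(points, k, max_iters=50):
    # Step 1: cluster the data with plain k-means.
    centroids = init_centroids(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def clustering_cost(points, centroids):
    # Sum of squared distances from each point to its nearest centroid.
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


def elbow_costs(points, k_values):
    # Step 2: compute the cost for each candidate k; a common heuristic is
    # to pick the k after which the cost stops dropping sharply.
    return {k: clustering_cost(points, kmeans(points, k)) for k in k_values}


def find_anomalies(points, centroids, threshold):
    # Step 3: flag points that lie farther than `threshold` from every centroid.
    return [p for p in points if min(math.dist(p, c) for c in centroids) > threshold]


if __name__ == "__main__":
    readings = [(0, 0), (0, 1), (1, 0), (1, 1),
                (10, 10), (10, 11), (11, 10), (11, 11),
                (10.5, 14)]
    print(elbow_costs(readings, [1, 2, 3]))
    print(find_anomalies(readings, kmeans(readings, 2), threshold=2.0))
```

One caveat of this sketch: if the number of clusters is set too high, an outlier can end up in a cluster of its own and will no longer look distant from its centroid, so the choice made in step 2 directly affects what step 3 can detect.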

Import Zeppelin notes from GitHub or JSON in Zeppelin 0.5.6 on Amazon EMR
With the latest Zeppelin release (0.5.6) included on Amazon EMR release 4.4.0, you can now import notes using links to S3 JSON files, raw file URLs in GitHub, or local files.

Analyze a Time Series in Real Time with AWS Lambda, Amazon Kinesis and Amazon DynamoDB Streams
As more devices, sensors and web servers continuously collect real-time streaming data, there is a growing need to analyze, understand and react to events as they occur, rather than waiting for a report that is generated the next day. This post explains how to perform time-series analysis on a stream of Amazon Kinesis records, without the need for any servers or clusters, using AWS Lambda, Amazon Kinesis Streams, Amazon DynamoDB and Amazon CloudWatch.

Big Data Website Gets a Big Makeover at AWS
We have completely redesigned the pages and updated them with some of the most common use cases, tutorials, and resources to get you started, along with customer stories and videos so that you can learn from what other organizations are doing.

Analyze Your Data on Amazon DynamoDB with Apache Spark
Every day, tons of customer data is generated, such as website logs, gaming data, advertising data, and streaming videos. Many companies capture this information as it is generated and process it in real time to understand their customers. This post shows you how to use Apache Spark to process customer data in Amazon DynamoDB.


Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming (January 14, 2016)


Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming Data educational pages.