AWS Big Data Blog

Month in Review: September 2016

Another month of big data solutions on the Big Data Blog. Take a look at our summaries below and learn, comment, and share. Thanks for reading!

Processing VPC Flow Logs with Amazon EMR
In this post, learn how to gain valuable insight into your network by using Amazon EMR and Amazon VPC Flow Logs. The walkthrough implements a pattern often found in network equipment called ‘Top Talkers’, an ordered list of the heaviest network users, but the model can also be used for many other types of network analysis.
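To make the 'Top Talkers' idea concrete, here is a minimal sketch in plain Python. It is not the EMR-based implementation from the post; it only illustrates the ranking step on a handful of records. It assumes the default VPC Flow Logs record format (14 space-separated fields, with the source address in the fourth field and the byte count in the tenth). The sample records, the `top_talkers` function name, and its `n` parameter are made up for illustration.

```python
from collections import Counter

# Sample records in the default VPC Flow Logs format (values are invented):
# version account-id interface-id srcaddr dstaddr srcport dstport
# protocol packets bytes start end action log-status
SAMPLE_RECORDS = [
    "2 123456789012 eni-0a1b2c3d 10.0.0.5 10.0.1.9 443 49152 6 10 8000 1472688000 1472688060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.0.7 10.0.1.9 22 49153 6 4 1200 1472688000 1472688060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.0.5 10.0.2.4 443 49154 6 6 5000 1472688060 1472688120 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1472688060 1472688120 - NODATA",
]


def top_talkers(records, n=10):
    """Return the n source addresses that sent the most bytes, heaviest first."""
    bytes_by_source = Counter()
    for record in records:
        fields = record.split()
        # Skip malformed lines and records with no traffic data,
        # which carry "-" in place of the byte count.
        if len(fields) != 14 or not fields[9].isdigit():
            continue
        src_addr, byte_count = fields[3], int(fields[9])
        bytes_by_source[src_addr] += byte_count
    return bytes_by_source.most_common(n)


if __name__ == "__main__":
    for addr, total in top_talkers(SAMPLE_RECORDS, n=2):
        print(addr, total)
```

The same aggregate-then-rank shape carries over to a distributed job: group by source address, sum the bytes, and sort in descending order.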

Integrating IoT Events into Your Analytic Platform
AWS IoT makes it easy to integrate and control your devices from other AWS services for even more powerful IoT applications. In particular, IoT provides tight integration with AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, Amazon DynamoDB, Amazon CloudWatch, and Amazon OpenSearch Service. In this post, you’ll explore two of these integrations: Amazon S3 and Amazon Kinesis Firehose.

Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 2
This is the second of two AWS Big Data posts on Writing SQL on Streaming Data with Amazon Kinesis Analytics. This post introduces you to the different types of windows supported by Amazon Kinesis Analytics, the importance of time as it relates to stream data processing, and best practices for sending your SQL results to a configured destination.

Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics
Analyzing web log traffic to gain insights that drive business decisions has historically been performed using batch processing. While effective, this approach results in delayed responses to emerging trends and user activities. In this post, learn how an analytics pipeline detects anomalies in real time for a web traffic stream using the RANDOM_CUT_FOREST function available in Amazon Kinesis Analytics.

Encrypt Data At-Rest and In-Flight on Amazon EMR with Security Configurations
With the release of security configurations for Amazon EMR release 5.0.0 and 4.8.0, customers can now easily enable encryption for data at-rest in Amazon S3, HDFS, and local disk, and enable encryption for data in-flight in the Apache Spark, Apache Tez, and Apache Hadoop MapReduce frameworks.

Amazon EMR-DynamoDB Connector Repository on AWSLabs GitHub
Amazon Web Services is excited to announce that the Amazon EMR-DynamoDB Connector is now open source. The EMR-DynamoDB Connector is a set of libraries that lets you access data stored in DynamoDB with Spark, Hadoop MapReduce, and Hive jobs. These libraries are currently shipped with EMR releases, but we will now build them from the emr-dynamodb-connector GitHub repository.

Real-time Stream Processing Using Apache Spark Streaming and Apache Kafka on AWS
This post demonstrates how to set up Apache Kafka on EC2, use Spark Streaming on EMR to process data coming in to Apache Kafka topics, and query streaming data using Spark SQL on EMR.


Running R on AWS (July 2015)
In this post, learn five launch steps that impact your R-based analysis environment on AWS. After that, you’ll analyze data located on Amazon S3 and configure Shiny Server. This post uses the AWS public data set CCAFS-Climate Data, a 6 TB data set with high-resolution climate data, to assess the impacts of climate change, primarily on agriculture.


Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming Data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.