AWS Big Data Blog

Category: Analytics

Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics

Chris Marshall is a Solutions Architect for Amazon Web Services. Analyzing web log traffic to gain insights that drive business decisions has historically been performed using batch processing. While effective, this approach results in delayed responses to emerging trends and user activities. There are solutions to deal with processing data in real time using streaming […]

Read More

Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 2

Ryan Nienhuis is a Senior Product Manager for Amazon Kinesis. This is the second of two AWS Big Data posts on Writing SQL on Streaming Data with Amazon Kinesis Analytics. In the last post, I provided an overview of streaming data and key concepts, such as the basics of streaming SQL, and completed a walkthrough […]

Read More

Integrating IoT Events into Your Analytic Platform

Veronika Megler, Ph.D., is a Senior Consultant with AWS Professional Services. “We have a fleet of vehicles, with GPS and a bunch of other sensors,” said Bob, the VP at a delivery company. “Today they send their update ‘breadcrumbs’ to another IoT service. We’re planning to have them send their breadcrumbs to AWS IoT instead; […]

Read More

Processing VPC Flow Logs with Amazon EMR

Michael Wallman is a senior consultant with AWS ProServ. It’s easy to understand network patterns in small AWS deployments where software stacks are well defined and managed. But as teams and usage grow, it gets harder to understand which systems communicate with each other, and on what ports. This often results in overly permissive security […]

Read More

Monitor Your Application for Processing DynamoDB Streams

Asmita Barve-Karandikar is an SDE with DynamoDB. DynamoDB Streams can handle requests at scale, but you risk losing stream records if your processing application lags: DynamoDB Stream records are unavailable after 24 hours. Therefore, when you maintain multiregion read replicas of your DynamoDB table, you might be afraid of losing data. In this post, I […]

Read More

Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1

Ryan Nienhuis is a Senior Product Manager for Amazon Kinesis. This is the first of two AWS Big Data blog posts on Writing SQL on Streaming Data with Amazon Kinesis Analytics. In this post, I provide an overview of streaming data and key concepts like the basics of streaming SQL, and complete a walkthrough using […]

Read More

Building and Deploying Custom Applications with Apache Bigtop and Amazon EMR

Hernan Vivani is a Hadoop Systems Engineer for Amazon Web Services. When you launch a cluster, Amazon EMR lets you choose applications that will run on your cluster. But what if you want to deploy your own custom application? This post shows you how to build a custom application for EMR for Apache Bigtop-based releases 4.x and greater. EMR […]

Read More

Use Spark 2.0, Hive 2.1 on Tez, and the latest from the Hadoop ecosystem on Amazon EMR release 5.0

Jonathan Fritz is a Senior Product Manager for Amazon EMR. We are excited to launch Amazon EMR release 5.0 today, giving customers the latest versions of 16 supported open-source applications in the big data ecosystem, including new major versions of Spark and Hive. Almost exactly a year ago, we shipped release 4.0, which brought significant […]

Read More

Installing and Running JobServer for Apache Spark on Amazon EMR

Derek Graeber is a senior consultant in big data analytics for AWS Professional Services. Working with customers who are running Apache Spark on Amazon EMR, I run into the scenario where data loaded into a SparkContext can and should be shared across multiple use cases. They ask a very valid question: “Once I load the […]

Read More

Process Large DynamoDB Streams Using Multiple Amazon Kinesis Client Library (KCL) Workers

Asmita Barve-Karandikar is an SDE with DynamoDB. Imagine you own a popular mobile health app, with millions of users worldwide, that continuously records new information. It sends over one million updates per second to its master data store and needs the updates to be relayed to various replicas across different regions in real time. […]

Read More