AWS Big Data Blog

Optimizing for Star Schemas and Interleaved Sorting on Amazon Redshift

Chris Keyser is a Solutions Architect for AWS. Many organizations implement star and snowflake schema data warehouse designs, and many BI tools are optimized to work with dimensions, facts, and measure groups. Customers have moved data warehouses of all types to Amazon Redshift with great success. The Amazon Redshift team has released support for interleaved […]


Using AWS Data Pipeline’s Parameterized Templates to Build Your Own Library of ETL Use-case Definitions

Leena Joseph is an SDE for AWS Data Pipeline. In an earlier post, we introduced you to ETL processing using AWS Data Pipeline and Amazon EMR. This post shows how to build ETL workflow templates with AWS Data Pipeline, and build a library of recipes to implement common use cases. This is an introduction to […]


Building a Numeric Regression Model with Amazon Machine Learning

Guy Ernest is a Solutions Architect with AWS. We need to predict future values in our businesses. These predictions are important for better planning of resource allocation and making other business decisions. Often, we settle for a simplified heuristic of average values from the past and some change assumption because more accurate alternatives are too […]


Running a High Performance SAS Grid Manager Cluster on AWS with Intel Cloud Edition for Lustre

Chris Keyser is a Solutions Architect for Amazon Web Services. This post was co-authored by Margaret Crevar, Sr. Manager, Performance Validation at SAS. SAS is an AWS Technology Partner. SAS (www.sas.com) is an integrated environment designed for business and advanced data analytics by enterprise and government organizations. SAS and AWS recently performed testing using the […]


Launching and Running an Amazon EMR Cluster in your VPC – Part 2: Custom DNS

Daniel Garrison is a Big Data Support Engineer for Amazon Web Services. In Part 1, you learned how Amazon EMR uses Amazon VPC DNS hostname and DHCP settings to satisfy the Hadoop requirements. Because it’s common to change the domain name setting in your DHCP options set to a custom internal domain name, this post […]


Nasdaq’s Architecture using Amazon EMR and Amazon S3 for Ad Hoc Access to a Massive Data Set

This is a guest post by Nate Sammons, a Principal Architect for Nasdaq. The Nasdaq group of companies operates financial exchanges around the world and processes large volumes of data every day. We run a wide variety of analytic and surveillance systems, all of which require access to essentially the same data sets. The Nasdaq […]


Processing Amazon Kinesis Stream Data Using Amazon KCL for Node.js

Manan Gosalia is an SDE for Amazon Kinesis. This blog post shows you how to get started with the Amazon Kinesis Client Library (KCL) for Node.js. The Node.js framework uses an event-driven, non-blocking I/O model that makes it lightweight, efficient, and perfect for data-intensive, real-time applications that run across distributed devices. JavaScript is also simple […]


Streaming Analytics with DataTorrent RTS and Amazon EMR

Nick Durkin is a Senior Solution Engineer for DataTorrent. DataTorrent is an AWS Technology Partner. In this blog post, we introduce fast big data and provide context about the DataTorrent RTS streaming analytics platform. In addition, we show you how to implement a real-time, streaming analytics application for capturing social media trends from Twitter using […]


Launching and Running an Amazon EMR Cluster inside a VPC

Daniel Garrison is a Big Data Support Engineer for Amazon Web Services. With Amazon EC2 now firmly in the VPC-by-default model, it’s important to understand the ins and outs of running your Amazon EMR cluster successfully inside the Amazon VPC environment. In this post, we’ll explore the requirements for Hadoop to operate inside the […]


Using Amazon EMR and Hunk for Rapid Response Log Analysis and Review

Patrick Shumate is a Solutions Architect for AWS. It is fairly common to collect access and application logs but never interactively review them. Monitoring dashboards, coupled with well-instrumented applications, allow operators to manage day-to-day operations without ever digging into the flood of logs silently stored in Amazon S3. That works until the monitoring dashboard […]
