AWS Big Data Blog

Category: Analytics

Test drive two big data scenarios from the ‘Building a Big Data Platform on AWS’ bootcamp

Matt Yanchyshyn is a Sr. Manager for AWS Solutions Architecture AWS offers a number of events during the year such as our annual AWS re:Invent conference, the AWS Summit series, the AWS Pop-up Loft, and a variety of roadshows. All of these provide opportunities for AWS customers to attend talks focused on big data and […]

Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch

Hernan Vivani is a Big Data Support Engineer for Amazon Web Services A previous post showed you how to get started with Elasticsearch and Kibana on Amazon EMR. In that post, we installed Elasticsearch and Kibana on an Amazon EMR cluster using bootstrap actions. This post shows you how to build a simple application with […]

Optimizing for Star Schemas and Interleaved Sorting on Amazon Redshift

Chris Keyser is a Solutions Architect for AWS Many organizations implement star and snowflake schema data warehouse designs and many BI tools are optimized to work with dimensions, facts, and measure groups. Customers have moved data warehouses of all types to Amazon Redshift with great success. The Amazon Redshift team has released support for interleaved […]

Using AWS Data Pipeline’s Parameterized Templates to Build Your Own Library of ETL Use-case Definitions

February 2023 Update: Console access to the AWS Data Pipeline service will be removed on April 30, 2023. On this date, you will no longer be able to access AWS Data Pipeline though the console. You will continue to have access to AWS Data Pipeline through the command line interface and API. Please note that […]

Nasdaq’s Architecture using Amazon EMR and Amazon S3 for Ad Hoc Access to a Massive Data Set

This is a guest post by Nate Sammons, a Principal Architect for Nasdaq The Nasdaq group of companies operates financial exchanges around the world and processes large volumes of data every day. We run a wide variety of analytic and surveillance systems, all of which require access to essentially the same data sets. The Nasdaq […]

Processing Amazon Kinesis Stream Data Using Amazon KCL for Node.js

Manan Gosalia is an SDE for Amazon Kinesis This blog post shows you how to get started with the Amazon Kinesis Client Library (KCL) for Node.js. The Node.js framework uses an event-driven, non-blocking I/O model that makes it lightweight, efficient, and perfect for data-intensive, real-time applications that run across distributed devices. JavaScript is also simple […]

Streaming Analytics with DataTorrent RTS and Amazon EMR

Nick Durkin is a Senior Solution Engineer for DataTorrent. DataTorrent is an AWS Technology Partner. In this blog post, we introduce fast big data and provide context about the DataTorrent RTS streaming analytics platform. In addition, we show you how to implement a real-time, streaming analytics application for capturing social media trends from Twitter using […]

Launching and Running an Amazon EMR Cluster inside a VPC

NOTE: This article contains information and instructions only pertinent to older EMR releases (emr-4.6.0 and earlier) and may no longer be applicable.  For latest information please refer to the current user guide. Daniel Garrison is a Big Data Support Engineer for Amazon Web Services Introduction With Amazon EC2 now firmly in the VPC-by-default model, it’s […]

Using Amazon EMR and Hunk for Rapid Response Log Analysis and Review

Patrick Shumate is a Solutions Architect for AWS. Introduction It is fairly common to collect access and application logs but never interactively review them. Monitoring dashboards, coupled with well-instrumented applications, allow operators to manage day-to-day operations without ever digging into the flood of logs silently stored in Amazon S3. That works until the monitoring dashboard […]