Tag: Month in Review


Month in Review: February 2017

by Derek Young

Another month of big data solutions on the Big Data Blog!

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

NEW POSTS

Implement Serverless Log Analytics Using Amazon Kinesis Analytics
In this post, learn how to implement a solution that analyzes streaming Apache access log data from an EC2 instance aggregated over 5 minutes.
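To make the idea of a five-minute aggregation concrete, here is a minimal local sketch in plain Python. It is not the Kinesis Analytics application from the post; it only illustrates a tumbling window over Apache access log lines. The function names are hypothetical, and the lines are assumed to follow the Common Log Format with English month abbreviations.

```python
from collections import Counter
from datetime import datetime

WINDOW_SECONDS = 5 * 60  # five-minute tumbling window


def window_start(ts: datetime) -> datetime:
    """Round an aware timestamp down to the start of its five-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=ts.tzinfo)


def count_by_window(lines):
    """Count requests per (window start, HTTP status) pair.

    Assumes Common Log Format:
    host ident user [time] "request" status bytes
    """
    counts = Counter()
    for line in lines:
        # The timestamp sits between the first '[' and the following ']'.
        time_part = line.split("[", 1)[1].split("]", 1)[0]
        ts = datetime.strptime(time_part, "%d/%b/%Y:%H:%M:%S %z")
        # The status code is the first token after the quoted request.
        status = line.split('"')[2].split()[0]
        counts[(window_start(ts), status)] += 1
    return counts
```

Each window is non-overlapping, so a request logged at 13:04:59 and one logged at 13:05:00 fall into different buckets.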

Migrate External Table Definitions from a Hive Metastore to Amazon Athena
For customers who use Hive external tables on Amazon EMR, or any flavor of Hadoop, a key challenge is how to effectively migrate an existing Hive metastore to Amazon Athena, an interactive query service that directly analyzes data stored in Amazon S3. In this post, learn an approach to migrate an existing Hive metastore to Athena, as well as how to use the Athena JDBC driver to run scripts.

AWS Big Data is Coming to HIMSS!
This year’s HIMSS conference was held at the Orange County Convention Center in Orlando, Florida from February 20 – 23. This blog post lists past AWS Big Data Blog posts to show how AWS technologies are being used to improve healthcare.

Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe
In this post, you will use the tightly coupled integration of Amazon Kinesis Firehose for log delivery, Amazon S3 for log storage, and Amazon Athena with JSONSerDe to run SQL queries against these logs without the need for data transformation or insertion into a database.

Scheduled Refresh for SPICE Data Sets on Amazon QuickSight
QuickSight uses SPICE (Super-fast, Parallel, In-Memory Calculation Engine), a fully managed data store that enables blazing fast visualizations and can ingest data from AWS, on-premises, and cloud sources. Data in SPICE could be refreshed at any time with the click of a button within QuickSight. This post announced the ability to schedule these refreshes!

Harmonize, Search, and Analyze Loosely Coupled Datasets on AWS
You have come up with an exciting hypothesis, and now you are keen to find and analyze as much data as possible to prove (or refute) it. There are many datasets that might be applicable, but they have been created at different times by different people and don’t conform to any common standard. In this blog post, we will describe a sample application that illustrates how to solve these problems. You can install our sample app, which will harmonize and index three disparate datasets to make them searchable, present a data-driven, customizable UI for searching the datasets to do preliminary analysis and to locate relevant datasets, and integrate with Amazon Athena and Amazon QuickSight for custom analysis and visualization.


Month in Review: January 2017

by Derek Young

Another month of big data solutions on the Big Data Blog!

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

NEW POSTS

Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users
Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to get a chance to interact with the users before they’re about to leave? In this post, learn how ironSource worked with gaming studio Upopa to build an efficient, cheap, and accurate way to battle churn and make data-driven decisions using ironSource Atom’s data pipeline and Amazon ML.

Create a Healthcare Data Hub with AWS and Mirth Connect
Healthcare providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient. In this post, learn how to consume different data types as messages, transform the information within the messages, and then use AWS services to take action depending on the message type.

Call for Papers! DEEM: 1st Workshop on Data Management for End-to-End Machine Learning
Amazon and Matroid will hold the first workshop on Data Management for End-to-End Machine Learning (DEEM) on May 14th, 2017 in conjunction with the premier systems conference SIGMOD/PODS 2017 in Raleigh, North Carolina. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss data management issues in ML application scenarios. The workshop is soliciting research papers that describe preliminary and ongoing research results.

Converging Data Silos to Amazon Redshift Using AWS DMS
In this post, learn to use AWS Database Migration Service (AWS DMS) and other AWS services to easily converge multiple heterogeneous data sources to Amazon Redshift. You can then use Amazon QuickSight to visualize the converged dataset and gain additional business insights.

Run Mixed Workloads with Amazon Redshift Workload Management
It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time. Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn’t as valuable as other more business-critical jobs. This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.

Secure Amazon EMR with Encryption
In this post, learn how to set up encryption of data at multiple levels using security configurations with EMR. You’ll walk through the step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration.


Month in Review: December 2016

by Derek Young

Another month of big data solutions on the Big Data Blog.

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. In this post, walk through the steps to enable authorization and audit for Amazon EMR clusters using Apache Ranger.

Amazon Redshift Engineering’s Advanced Table Design Playbook
Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. In practice, the best way to improve query performance by orders of magnitude is by tuning Amazon Redshift tables to better meet your workload requirements. This five-part blog series will guide you through applying distribution styles, sort keys, and compression encodings and configuring tables for data durability and recovery purposes.

Interactive Analysis of Genomic Datasets Using Amazon Athena
In this post, learn to prepare genomic data for analysis with Amazon Athena. We’ll demonstrate how Athena is well-adapted to address common genomics query paradigms using data from the Thousand Genomes project, a seminal genomics study, hosted on Amazon S3. Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

Joining and Enriching Streaming Data on Amazon Kinesis
In this blog post, learn three approaches for joining and enriching streaming data on Amazon Kinesis Streams by using Amazon Kinesis Analytics, AWS Lambda, and Amazon DynamoDB.

Using SaltStack to Run Commands in Parallel on Amazon EMR
SaltStack is an open source project for automation and configuration management. It started as a remote execution engine designed to scale to many machines while delivering high-speed execution. You can now use the new bootstrap action that installs SaltStack on Amazon EMR. It provides a basic configuration that enables selective targeting of the nodes based on instance roles, instance groups, and other parameters.

Building an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakaway
Amazon Game Studios’ new title Breakaway is an online 4v4 team battle sport that delivers fast action, teamwork, and competition. In this post, learn the technical details of how the Breakaway team uses AWS to collect, process, and analyze gameplay telemetry to answer questions about arena design.


Month in Review: November 2016

by Derek Young

Another month of big data solutions on the Big Data Blog.

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

Use Apache Flink on Amazon EMR
It is even easier to run Flink on AWS as it is now natively supported in Amazon EMR 5.1.0. EMR supports running Flink-on-YARN so you can create either a long-running cluster that accepts multiple jobs or a short-running Flink session in a transient cluster that helps reduce your costs by only charging you for the time that you use.

Scale Your Amazon Kinesis Stream Capacity with UpdateShardCount
With the new Amazon Kinesis Streams UpdateShardCount API operation, you can automatically scale your stream shard capacity by using Amazon CloudWatch alarms, Amazon SNS, and AWS Lambda. In this post, walk through an example of how you can automatically scale your shards using a few lines of code.
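As a rough illustration of what such a scaling step might look like, the sketch below computes a new shard count and calls UpdateShardCount through a boto3-style Kinesis client. It is not the code from the post. The helper names and the scaling factor are hypothetical, and the clamping assumes the documented per-call limits of the operation (no more than double, no less than half, the current count).

```python
import math


def target_shard_count(current: int, factor: float) -> int:
    """Scale the current shard count by factor, clamped to [half, double].

    The clamp reflects the assumed per-call limits of UpdateShardCount.
    """
    desired = math.ceil(current * factor)
    lower = max(1, math.ceil(current / 2))
    upper = current * 2
    return min(max(desired, lower), upper)


def scale_stream(client, stream_name: str, factor: float) -> int:
    """Resize a stream and return the shard count that was requested.

    `client` is expected to behave like boto3.client("kinesis").
    """
    summary = client.describe_stream_summary(StreamName=stream_name)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    target = target_shard_count(current, factor)
    if target != current:
        client.update_shard_count(
            StreamName=stream_name,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )
    return target
```

In a setup like the one the post describes, a function of this shape could run inside an AWS Lambda handler triggered by a CloudWatch alarm through Amazon SNS, with the client created by `boto3.client("kinesis")`.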

Build a Community of Analysts with Amazon QuickSight
In this post, learn how Amazon QuickSight can be used to share dashboards, analyses, and stories. Although fictitious, CoffeeCo, like many companies, benefits from distributing information to people who understand its context and can act on the insights that it contains. 


Month in Review: October 2016

by Derek Young

Another month of big data solutions on the Big Data Blog. Take a look at our summaries below and learn, comment, and share. Thanks for reading!

Building Event-Driven Batch Analytics on AWS
Modern businesses typically collect data from internal and external sources at various frequencies throughout the day. In this post, you learn an elastic and modular approach for how to collect, process, and analyze data for event-driven applications in AWS.

How Eliza Corporation Moved Healthcare Data to the Cloud
Eliza Corporation, a company that focuses on health engagement management, acts on behalf of healthcare organizations such as hospitals, clinics, pharmacies, and insurance companies. This allows them to engage people at the right time, with the right message, and in the right medium. By meeting them where they are in life, Eliza can capture relevant metrics and analyze the overall value provided by healthcare. In this post, you explore some of the practical challenges faced during the implementation of the data lake for Eliza and the corresponding details of the ways NorthBay solved these issues with AWS.

Optimizing Amazon S3 for High Concurrency in Distributed Workloads
This post demonstrates how to optimize Amazon S3 for an architecture commonly used to enable genomic data analyses. Although the focus of this post is on genomic data analyses, the optimization can be used in any discipline that has individual source data that must be analyzed together at scale.


Month in Review: September 2016

by Derek Young

Another month of big data solutions on the Big Data Blog. Take a look at our summaries below and learn, comment, and share. Thanks for reading!

Processing VPC Flow Logs with Amazon EMR
In this post, learn how to gain valuable insight into your network by using Amazon EMR and Amazon VPC Flow Logs. The walkthrough implements a pattern often found in network equipment called ‘Top Talkers’, an ordered list of the heaviest network users, but the model can also be used for many other types of network analysis.
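The 'Top Talkers' idea can be sketched locally in a few lines of Python. This is not the EMR implementation from the post; it is a single-process illustration that assumes flow log records in the default space-separated format (version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status). The function name is hypothetical.

```python
from collections import Counter


def top_talkers(lines, n=10):
    """Return the n source addresses that sent the most bytes, heaviest first.

    Assumes the default VPC Flow Logs record layout, where field 3 is the
    source address and field 9 is the byte count.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        # Skip short lines and records with no traffic data ("-" placeholders).
        if len(fields) < 14 or fields[9] == "-":
            continue
        totals[fields[3]] += int(fields[9])
    return totals.most_common(n)
```

Summing by destination address or by port instead of source address only requires changing the field index used as the key.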

Integrating IoT Events into Your Analytic Platform
AWS IoT makes it easy to integrate and control your devices from other AWS services for even more powerful IoT applications. In particular, IoT provides tight integration with AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, Amazon DynamoDB, Amazon CloudWatch, and Amazon Elasticsearch Service. In this post, you’ll explore two of these integrations: Amazon S3 and Amazon Kinesis Firehose.

Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 2
This is the second of two AWS Big Data posts on Writing SQL on Streaming Data with Amazon Kinesis Analytics. This post introduces you to the different types of windows supported by Amazon Kinesis Analytics, the importance of time as it relates to stream data processing, and best practices for sending your SQL results to a configured destination.


Month in Review: August 2016

by Andy Werth

Another month of big data solutions on the Big Data Blog. Take a look at our summaries below and learn, comment, and share. Thanks for reading!

Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning
With this post, learn how to apply advanced analytics concepts like pattern analysis and machine learning to do risk stratification for patient cohorts.

Building and Deploying Custom Applications with Apache Bigtop and Amazon EMR
When you launch a cluster, Amazon EMR lets you choose applications that will run on your cluster. But what if you want to deploy your own custom application? This post shows you how to build a custom application for EMR for Apache Bigtop-based releases 4.x and greater.

Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1
This post introduces you to Amazon Kinesis Analytics and the fundamentals of writing ANSI-Standard SQL over streaming data, and works through a simple example application that continuously generates metrics over time windows.


Month in Review: July 2016

by Derek Young

July was a busy month of big data solutions on the Big Data Blog. The month started with our most popular story yet, Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE. It was a great post to start a spectacular month. Take a look at our summaries below. Learn, comment, and share. Thank you for reading the AWS Big Data Blog!

Installing and Running JobServer for Apache Spark on Amazon EMR
In this blog post, learn how to install JobServer on EMR using a bootstrap action (BA) derived from the JobServer GitHub repository. Then, run JobServer using a sample dataset.

Process Large DynamoDB Streams Using Multiple Amazon Kinesis Client Library (KCL) Workers
A previous post described how you can use the Amazon Kinesis Client Library (KCL) and DynamoDB Streams Kinesis Adapter to efficiently process DynamoDB streams. This post focuses on the KCL configurations that are likely to have an impact on the performance of your application when processing a large DynamoDB stream.

Simplify Management of Amazon Redshift Snapshots using AWS Lambda
In this blog post, learn about the new Amazon Redshift Utils module that helps you manage the snapshots that your cluster creates. You supply a simple configuration, and then AWS Lambda ensures that you have cluster snapshots as frequently as required to meet your recovery point objective (RPO).


Month in Review: June 2016

by Andy Werth

Lots to see on the Big Data Blog in June! Please take a look at the summaries below for something that catches your interest.

Use Sqoop to Transfer Data from Amazon EMR to Amazon RDS
Customers commonly process and transform vast amounts of data with EMR and then transfer and store summaries or aggregates of that data in relational databases such as MySQL or Oracle. In this post, learn how to transfer data using Apache Sqoop, a tool designed to transfer data between Hadoop and relational databases.

Analyze Realtime Data from Amazon Kinesis Streams Using Zeppelin and Spark Streaming
Streaming data is everywhere. This includes clickstream data, data from sensors, data emitted from billions of IoT devices, and more. Not surprisingly, data scientists want to analyze and explore these data streams in real time. This post shows you how you can use Spark Streaming to process data coming from Amazon Kinesis streams, build some graphs using Zeppelin, and then store the Zeppelin notebook in Amazon S3.

Processing Amazon DynamoDB Streams Using the Amazon Kinesis Client Library
This post demystifies the KCL by explaining some of its important configurable properties and estimating its resource consumption.


Month in Review: April 2016

by Andy Werth

Lots to see on the Big Data Blog in April! Please take a look at the summaries below for something that catches your interest.

Exploring Geospatial Intelligence using SparkR on Amazon EMR
The number of data sources that use location, such as smartphones and sensory devices used in IoT (Internet of things), is expanding rapidly. This explosion has increased demand for analyzing spatial data. Learn how to build a simple GEOINT application using SparkR that will allow you to appreciate GEOINT capabilities.

AWS at Strata+Hadoop 2016: Building a Scalable Architecture on AWS to Process Streaming Data
Last month, Siva Raghupathy and Manjeet Chayel presented “Building a scalable architecture for processing streaming data on AWS” at Strata+Hadoop 2016 in San Jose. This post provides several helpful links to their slides and presentation.

Using CombineInputFormat to Combat Hadoop’s Small Files Problem
Many Amazon EMR customers have architectures that track events and streams and store data in S3. This frequently leads to many small files. It’s now well known that Hadoop doesn’t deal well with small files. This post helps you manage this problem.
