AWS Big Data Blog

Fact or Fiction: Google BigQuery Outperforms Amazon Redshift as an Enterprise Data Warehouse?

by Randall Hunt

Randall Hunt is a Technical Evangelist for Amazon Web Services

A few weeks ago, 2nd Watch (a leading cloud-native systems integrator) wrote the Benchmarking Amazon Aurora post, analyzing Google’s benchmark of its Cloud SQL database service against AWS’s Amazon Aurora. In that analysis, 2nd Watch found that Aurora consistently outperforms Cloud SQL and that Google’s benchmarks were incorrect and misleading. 2nd Watch also pointed out peculiar aspects of Google’s approach (for example, artificially constraining database threads, which removes Aurora’s advantage of much better performance at high thread counts, something most Aurora customers take advantage of). 2nd Watch then provided its take on what a fair and reasonable performance benchmark would look like.

On September 29, 2016, Google began presenting a new data warehouse performance benchmark that had many people scratching their heads again. This time, Google compared AWS’s Amazon Redshift with Google BigQuery. As in the previous example, both the approach taken in the tests and the marketing claims derived from the results were misleading.

Publishing misleading performance benchmarks is a classic old guard marketing tactic. It’s not surprising to see old guard companies (like Oracle) doing this, but we were kind of surprised to see Google take this approach, too. So, when Google presented their BigQuery vs. Amazon Redshift benchmark results at a private event in San Francisco on September 29, 2016, it piqued our interest and we decided to dig deeper.

For their tests, Google used the TPC-H benchmark, which measures performance across 22 different queries and is typically used to evaluate data warehouses and other decision support systems. Instead of presenting the results from all the queries, as is standard practice, Google cherry-picked a single query that generated favorable results for BigQuery (Query #22, which happens to be one of the least sophisticated queries—one with simple filters and no joins), and used this data to make the broad claim that BigQuery outperforms Amazon Redshift.

To verify Google’s claim with our own testing, we ran the full TPC-H benchmark, consisting of all 22 queries, using a 10 TB dataset on Amazon Redshift against the latest version of BigQuery. We set up Amazon Redshift with basic configurations that our customers typically put in place, like compression, distribution keys on large tables, and sort keys on commonly filtered columns. We used an 8-node DC1.8XL Amazon Redshift cluster for the tests. Below is a summary of our findings.

Amazon Redshift outperformed BigQuery on 18 of 22 TPC-H benchmark queries by an average of 3.6X

When we ran the entire 22-query benchmark, we found that Amazon Redshift outperformed BigQuery by 3.6X on average on 18 of the 22 TPC-H queries. Looking at relative performance across the entire set of queries, Amazon Redshift outperformed BigQuery by 2X. The chart below summarizes the comparison of elapsed times between the two services for the entire TPC-H benchmark (lower is better).

[Chart: TPC-H benchmark elapsed times for Amazon Redshift and BigQuery]

Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR

by Tom Zeng

Tom Zeng is a Solutions Architect for Amazon EMR

The recently released sparklyr package by RStudio has made processing big data in R a lot easier. sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API.

Amazon EMR is a popular, hosted big data processing service on AWS that provides the latest version of Spark and other Hadoop ecosystem applications, such as Hive, Pig, Tez, and Presto.

Running RStudio and sparklyr on EMR is simple; the following AWS CLI command launches a 6-node (1 master node and 5 worker nodes) EMR 5.0 cluster with Spark, RStudio, and sparklyr pre-installed and ready to use:

aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Hive Name=Pig Name=Tez Name=Ganglia \
--release-label emr-5.0.0 --name "EMR 5.0 RStudio + sparklyr" --service-role EMR_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.2xlarge \
InstanceGroupType=CORE,InstanceCount=5,InstanceType=m3.2xlarge --bootstrap-actions \
Name="Install RStudio" --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=<Your Key> \
--configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"}}]' \
--region us-east-1


Optimizing Amazon S3 for High Concurrency in Distributed Workloads

by Aaron Friedman

Aaron Friedman is a Healthcare and Life Sciences Solution Architect with Amazon Web Services

The healthcare and life sciences landscape is being transformed rapidly by big data. By intersecting petabytes of genomic data with clinical information, AWS customers and partners are already changing healthcare as we know it.

One of the most important things in any type of data analysis is to represent data in a cost-optimized and performance-efficient manner. Before we can derive insights from the genomes of thousands of individuals, genomic data must first be transformed into a queryable format. This information often starts in a raw Variant Call Format (VCF) file stored in an S3 bucket. To make it queryable across many patients at once, the data can be stored as Apache Parquet files in a data lake built in either the same or a different S3 bucket.

Apache Parquet is a columnar storage file format that is designed for querying large amounts of data, regardless of the data processing framework, data model, or programming language. Amazon S3 is a secure, durable, and highly scalable home for Parquet files. When you use computationally intensive algorithms, you can get maximum performance through small renaming optimizations of S3 objects. The extract, transform, load (ETL) processes occur in a write-once, read-many fashion and can produce many S3 objects that are collectively stored and referenced as a Parquet file. Data scientists can then query the Parquet file to identify trends.
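As a rough sketch of this ETL step (the bucket names and input format below are hypothetical, and parsing the raw VCF into tabular form is assumed to happen upstream), a Spark job might write the transformed records to S3 as Parquet like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vcf-to-parquet").getOrCreate()

# Hypothetical input: variant records already extracted from raw VCF files
# into a tabular form (JSON here) in the source bucket.
variants = spark.read.json("s3://my-genomics-source/parsed-vcf/")

# The write produces many part-* objects in S3 that are collectively
# stored and referenced as a single Parquet dataset.
variants.write.mode("append").parquet("s3://my-genomics-datalake/variants/")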

In today’s blog post, I will discuss how to optimize Amazon S3 for an architecture commonly used to enable genomic data analyses. This optimization is important to my work in genomics because, as genome sequencing continues to drop in price, the rate at which data becomes available is accelerating.

Although the focus of this post is on genomic data analyses, the optimization can be used in any discipline that has individual source data that must be analyzed together at scale.


This architecture has no administration costs. In addition to being scalable, elastic, and automatic, it handles errors and has no impact on downstream users who might be querying the data from S3.

S3 is a massively scalable key-based object store that is well-suited for storing and retrieving large datasets. Due to its underlying infrastructure, S3 is excellent for retrieving objects with known keys. S3 maintains an index of object keys in each region and partitions the index based on the key name. For best performance, keys that are often read together should not have sequential prefixes. Keys should be distributed across many partitions rather than on the same partition.

For large datasets like genomics, population-level analyses of these data can require many concurrent S3 reads by many Spark executors. To maximize performance of high-concurrency operations on S3, we need to introduce randomness into each of the Parquet object keys to increase the likelihood that the keys are distributed across many partitions.
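For example, one way to introduce that randomness (a minimal sketch, not the exact scheme from the original post; the hash length and key layout are assumptions) is to rename each output object so that its key begins with a short hash:

import hashlib

import boto3

s3 = boto3.client("s3")

def randomize_key(key):
    """Prefix the key with a short hash so keys spread across S3 index partitions."""
    name = key.rsplit("/", 1)[-1]
    prefix = hashlib.md5(name.encode("utf-8")).hexdigest()[:4]
    return "{}-{}".format(prefix, key)

def rename_object(bucket, key):
    """'Rename' an S3 object by copying it to the randomized key and deleting the original."""
    new_key = randomize_key(key)
    s3.copy_object(Bucket=bucket, Key=new_key,
                   CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)
    return new_key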

The following diagram shows the ETL process, S3 object naming, and error reporting and handling steps for genomic data. This post covers steps 3-5.



How Eliza Corporation Moved Healthcare Data to the Cloud

by NorthBay Solutions

This is a guest post by Laxmikanth Malladi, Chief Architect at NorthBay. NorthBay is an AWS Advanced Consulting Partner and an AWS Big Data Competency Partner.

“Pay-for-performance” in healthcare pays providers more to keep the people under their care healthier. This is a departure from fee-for-service where payments are for each service used. Pay-for-performance arrangements provide financial incentives to hospitals, physicians, and other healthcare providers to carry out improvements and achieve optimal outcomes for patients.

Eliza Corporation, a company that focuses on health engagement management, acts on behalf of healthcare organizations such as hospitals, clinics, pharmacies, and insurance companies. This allows them to engage people at the right time, with the right message, and in the right medium. By meeting them where they are in life, Eliza can capture relevant metrics and analyze the overall value provided by healthcare.

Eliza analyzes more than 200 million such outreaches per year, primarily through outbound phone calls with interactive voice responses (IVR) and other channels. For Eliza, outreach results are the questions and responses that form a decision tree, with each question and response captured as a pair:

<question, response>: <“Did you visit your physician in the last 30 days?”, “Yes”>

This type of data is characteristic of and distinctive to Eliza’s business, and it poses challenges in processing and analysis. For example, you can’t store the data in a table with a fixed set of columns.

The majority of data at Eliza takes the form of outreach results captured as a set of <attribute> and <attribute value> pairs. Other datasets at Eliza include structured data identifying the members to target for outreach. This data is received from various sources, including customer systems, claims data, pharmacy data, electronic medical records (EMR/EHR) data, and enrichment data. The data that Eliza deals with to keep the business running varies considerably in both format and quality.

NorthBay was chosen as the big data partner to architect and implement a data infrastructure to improve the overall performance of Eliza’s process. NorthBay architected a data lake on AWS for Eliza’s use case and implemented the majority of the data lake components by following the best practice recommendations from the AWS white paper “Building a Data Lake on AWS.”

In this post, I discuss some of the practical challenges we faced during the implementation of the data lake for Eliza and the ways we solved those issues with AWS. The challenges involved the variety of the data and the need for a common view of it.

Data transformation

This section highlights some of the transformations done to overcome the challenges related to data obfuscation, cleansing, and mapping.

The following architecture depicts the flow for each of these processes.


  1. An Amazon S3 manifest file or a time-based event triggers an AWS Lambda function.
  2. The Lambda function launches an AWS Data Pipeline orchestration process, passing in the relevant parameters (a sketch of this step follows the list).
  3. The Data Pipeline process creates a transient Amazon EMR cluster and submits the appropriate Hadoop job.
  4. The Hadoop job reads the relevant metadata tables from Amazon DynamoDB and uses AWS KMS for encrypt/decrypt operations.
  5. Using the metadata, the Hadoop job transforms the input data and writes the results to the appropriate S3 location.
  6. When the Hadoop job is complete, an Amazon SNS topic is notified for further processing.
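A minimal sketch of step 2, assuming a pre-defined pipeline (the pipeline ID and parameter IDs below are placeholders, not the ones used in the actual implementation), might look like this Lambda handler:

import boto3

datapipeline = boto3.client("datapipeline")

def lambda_handler(event, context):
    """Triggered by an S3 manifest upload (or a scheduled event); activates
    the Data Pipeline orchestration process with the relevant parameters."""
    record = event["Records"][0]["s3"]
    input_location = "s3://{}/{}".format(record["bucket"]["name"],
                                         record["object"]["key"])

    # Placeholder IDs: the pipeline and its parameters depend on how the
    # orchestration pipeline was defined.
    datapipeline.activate_pipeline(
        pipelineId="df-EXAMPLE1234567",
        parameterValues=[
            {"id": "myInputS3Location", "stringValue": input_location},
        ],
    )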


Building Event-Driven Batch Analytics on AWS

by Karthik Sonti

Karthik Sonti is a Senior Big Data Architect with AWS Professional Services

Modern businesses typically collect data from internal and external sources at various frequencies throughout the day. These data sources could be franchise stores, subsidiaries, or new systems integrated as a result of mergers and acquisitions.

For example, a retail chain might collect point-of-sale (POS) data from all franchise stores three times a day to get insights into sales and to identify the right staffing level at a given time in any given store. Because each franchise functions as an independent business, the format and structure of the data might not be consistent across the board. Depending on the geographical region, each franchise might provide data at a different frequency, and analysis of these datasets must wait until all the required data has arrived from the individual franchises (that is, it is event-driven). In most cases, the data volume received from each individual franchise is small, but the velocity at which the data is generated and the collective volume can be challenging to manage.

In this post, I walk you through an architectural approach as well as a sample implementation on how to collect, process, and analyze data for event-driven applications in AWS.


The architecture diagram below depicts the components and the data flow needed for an event-driven batch analytics system. At a high level, this approach uses Amazon S3 for storing source, intermediate, and final output data; AWS Lambda for intermediate file-level ETL and state management; Amazon RDS as the persistent state store; Amazon EMR for aggregated ETL (the heavy-lifting, consolidated transformation, and loading engine); and Amazon Redshift as the data warehouse hosting the data needed for reporting.

In this architecture, each location on S3 stores data at a certain state of transformation. When new data is placed at a specific location, an S3 event is raised that triggers a Lambda function responsible for the next transformation in the chain. You can use this event-driven approach to create sophisticated ETL processes, and to syndicate data availability at a given point in the chain.
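As an illustration of one link in that chain (a sketch under assumptions: the cluster ID, script location, and state check are placeholders, and the full implementation also tracks ingestion state in Amazon RDS), an S3-triggered Lambda function might submit the next aggregation step to EMR:

import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    """Triggered by an S3 PUT event; kicks off the next transformation in the chain."""
    record = event["Records"][0]["s3"]
    new_object = "s3://{}/{}".format(record["bucket"]["name"],
                                     record["object"]["key"])

    # In the full architecture, the function would first update and check
    # ingestion state in Amazon RDS to decide whether all required inputs
    # for this batch have arrived. That check is elided here.
    emr.add_job_flow_steps(
        JobFlowId="j-EXAMPLECLUSTER",  # placeholder cluster ID
        Steps=[{
            "Name": "Aggregate new POS data",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         "s3://my-etl-bucket/scripts/aggregate_pos.py",  # placeholder script
                         new_object],
            },
        }],
    )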



Month in Review: September 2016

by Derek Young

Another month of big data solutions on the Big Data Blog. Take a look at our summaries below and learn, comment, and share. Thanks for reading!

Processing VPC Flow Logs with Amazon EMR
In this post, learn how to gain valuable insight into your network by using Amazon EMR and Amazon VPC Flow Logs. The walkthrough implements a pattern often found in network equipment called ‘Top Talkers’, an ordered list of the heaviest network users, but the model can also be used for many other types of network analysis.

Integrating IoT Events into Your Analytic Platform
AWS IoT makes it easy to integrate and control your devices from other AWS services for even more powerful IoT applications. In particular, IoT provides tight integration with AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, Amazon DynamoDB, Amazon CloudWatch, and Amazon Elasticsearch Service. In this post, you’ll explore two of these integrations: Amazon S3 and Amazon Kinesis Firehose.

Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 2
This is the second of two AWS Big Data posts on Writing SQL on Streaming Data with Amazon Kinesis Analytics. This post introduces you to the different types of windows supported by Amazon Kinesis Analytics, the importance of time as it relates to stream data processing, and best practices for sending your SQL results to a configured destination.


Real-time Stream Processing Using Apache Spark Streaming and Apache Kafka on AWS

by Prasad Alle

Prasad Alle is a consultant with AWS Professional Services

Intuit, a creator of business and financial management solutions, is a leading enterprise customer for AWS. The Intuit Data team (IDEA) is responsible for building platforms and products that enable a data-driven, personalized experience across Intuit products and services.

One dimension of this platform is the streaming data pipeline, which makes event-based data available for both analytic and real-time applications. These include applications for personalization, product discovery, fraud detection, and more.

The challenge is building a platform that can support and integrate with more than 50 products and services across Intuit, while also accounting for seasonality and the evolution of use cases. Intuit requires a data platform that can scale and abstract away the underlying complexities of a distributed architecture, allowing users to focus on leveraging the data rather than managing ingestion.

Amazon EMR, Amazon Kinesis, and Amazon S3 were among the initial considerations to build out this architecture at scale. Given that Intuit had existing infrastructure leveraging Kafka on AWS, the first version was designed using Apache Kafka on Amazon EC2, EMR, and S3 for persistence. Amazon Kinesis provides an alternative managed solution for streaming, which reduces the amount of administration and monitoring required. For more information about Amazon Kinesis reference architectures, see Amazon Kinesis Streams Product Details.

This post demonstrates how to set up Apache Kafka on EC2, use Spark Streaming on EMR to process data coming into Apache Kafka topics, and query streaming data using Spark SQL on EMR.

Note: This is an example and should not be implemented in a production environment without considering additional operational issues about Apache Kafka and EMR, including monitoring and failure handling.
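To give a flavor of the processing side (a sketch only, not Intuit’s code; the broker address, topic name, and batch interval are assumptions, and the matching spark-streaming-kafka package must be supplied at submit time), a minimal PySpark streaming job that reads from a Kafka topic might look like this:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Assumes the matching Kafka integration package is passed to spark-submit,
# for example: --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0
sc = SparkContext(appName="kafka-clickstream")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["clickstream"],                                # placeholder topic
    kafkaParams={"metadata.broker.list": "broker1:9092"})  # placeholder broker

# Each record is a (key, value) pair; count the events in each micro-batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()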

Intuit’s application architecture

Before detailing Intuit’s implementation, it is helpful to consider the application architecture and the physical architecture in the AWS Cloud. The following application architecture can be launched in either a public subnet or a private subnet.

[Diagram: Intuit application architecture]

Join Us This Week at Strata + Hadoop World in New York City

by Jorge A. Lopez

We’re back in Manhattan for the Strata + Hadoop World conference, which runs from Tuesday, September 27, through Thursday, September 29. Come see the AWS Big Data team at Booth #738, where big data experts will be happy to answer your questions, hear about your specific requirements, and help you with your big data initiatives.

Catch a presentation

Get technical details and best practices from AWS experts. Hear directly from customers and learn from the experience of other organizations that are deploying big data solutions on AWS.

Below is a highlight of some AWS and customer sessions:

FINRA: A unified ecosystem for market data visualization
Janaki Parameswaran, Kishore Ramachandran, FINRA
Wed 2:05pm, Room 1E10/1E11

Big data architectural patterns and best practices on AWS
Siva Raghupathy, AWS
Wed 4:35pm, Room 3D12

Netflix: The Netflix data platform – Now and in the future
Kurt Brown, Netflix
Wed 5:25pm, Room 1E07/1E08

Running Presto and Spark on AWS: From zero to insight in less than five minutes
Jonathan Fritz, AWS
Thu 2:05pm, Room 1E09

Hearst: Life of a click: How Hearst manages clickstream analytics in the cloud
Rick McFarland, Hearst
Thu 2:55pm, Room 3D12

Amazon Kinesis – Real-time streaming data in the AWS cloud
Roy Ben-Alta, AWS
Thu 4:35pm, Room 3D08

Hope to see you there to have fun, share your experiences, and learn more about big data and AWS!

Amazon EMR-DynamoDB Connector Repository on AWSLabs GitHub

by Mike Grimes

Mike Grimes is a Software Development Engineer with Amazon EMR

Amazon Web Services is excited to announce that the Amazon EMR-DynamoDB Connector is now open source. The EMR-DynamoDB Connector is a set of libraries that lets you access data stored in DynamoDB from Spark, Hadoop MapReduce, and Hive jobs. These libraries currently ship with EMR releases, but we will now build them from the emr-dynamodb-connector GitHub repository. The code you see in the repository is exactly what is available on your EMR cluster, making it easier to build applications with this component.
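For example (a sketch only, assuming the connector JARs are on the cluster classpath as they are on EMR; the table name, region, and endpoint below are placeholders), a PySpark job can read a DynamoDB table through the connector’s Hadoop input format:

from pyspark import SparkContext

sc = SparkContext(appName="emr-dynamodb-connector-example")

# Hadoop configuration consumed by the EMR-DynamoDB Connector; the table
# name, region, and endpoint are placeholders.
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "MyTable",
    "dynamodb.regionid": "us-east-1",
    "dynamodb.endpoint": "dynamodb.us-east-1.amazonaws.com",
}

# Each element is a (Text, DynamoDBItemWritable) pair from the connector.
items = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf)

print(items.count())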

Amazon EMR regularly contributes open-source changes to improve applications for developers, data scientists, and analysts in the community. Check out the README to learn how to build, contribute to, and use the Amazon EMR-DynamoDB Connector!

Encrypt Data At-Rest and In-Flight on Amazon EMR with Security Configurations

by Jonathan Fritz

Customers running analytics, stream processing, machine learning, and ETL workloads on personally identifiable information, health information, and financial data have strict requirements for encryption of data at-rest and in-transit. The Apache Spark and Hadoop ecosystems lend themselves to these big data use cases, and customers have asked us to provide a quick and easy way to encrypt data at-rest and data in-transit between nodes in each execution framework.

With the release of security configurations for Amazon EMR release 5.0.0 and 4.8.0, customers can now easily enable encryption for data at-rest in Amazon S3, HDFS, and local disk, and enable encryption for data in-flight in the Apache Spark, Apache Tez, and Apache Hadoop MapReduce frameworks.

Security configurations make it easy to specify the encryption keys and certificates to use, ranging from AWS Key Management Service to supplying your own custom encryption materials provider (for an example of a custom provider, see the Nasdaq post about EMRFS and Amazon S3 client-side encryption). Additionally, you can apply a security configuration to multiple clusters, making it easy to standardize your security settings. For instance, this makes it easy for customers to encrypt data across their HIPAA-compliant Amazon EMR workloads.

The following is an example security configuration specifying SSE-KMS for Amazon S3 encryption (using EMRFS), an AWS KMS key for local disk encryption (which also encrypts HDFS blocks), and a set of TLS certificates stored in Amazon S3 for applications that require them for in-transit encryption:
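The original post showed this configuration in the EMR console; a roughly equivalent sketch created with boto3 (the configuration name, KMS key ARN, and certificate location below are placeholders) might look like the following:

import json

import boto3

emr = boto3.client("emr")

security_configuration = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            # SSE-KMS for data that EMRFS writes to Amazon S3
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
            },
            # AWS KMS key for local disk encryption (also covers HDFS blocks)
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
            },
        },
        "InTransitEncryptionConfiguration": {
            # Zipped PEM certificates in Amazon S3, used for TLS between nodes
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://my-emr-config/certs/my-certs.zip",
            },
        },
    },
}

emr.create_security_configuration(
    Name="my-encryption-config",
    SecurityConfiguration=json.dumps(security_configuration))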

After you create a security configuration, you can specify it when creating a cluster and apply the settings. Security configurations can also be created using the AWS CLI or SDK. For more information, see Encrypting Data with Amazon EMR. If you have any questions or would like to share an interesting use case about encryption on Amazon EMR, please leave a comment below.