AWS Big Data Blog

Dispatches from re:Invent – Day 2

Matt Yanchyshyn is a Principal Solutions Architect at AWS

Today hundreds of AWS customers participated in bootcamps at re:Invent, including three sessions in the big data space: Store, Manage, and Analyze Big Data in the Cloud; Real Time Data Processing and Analysis with Amazon Redshift and Amazon Kinesis; and Building High-Performance Applications on DynamoDB.  Chris Keyser and I led the first of these.  There were a lot of questions about how to leverage AWS Data Pipeline to automate big data pipelines and how to use bootstrap actions to run big data ecosystem applications like Spark, Presto, and Impala, as well as plenty of interest in the new native Hue integration – Hue’s integrated Amazon S3 browser and query visualizer definitely got the crowd’s attention.
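For anyone who wants to try that combination, here is a minimal sketch of launching a cluster with the native Hue integration enabled and a custom bootstrap action attached, using boto3 (the AWS SDK for Python). The cluster name, bucket, bootstrap script, instance types, and release label are all hypothetical placeholders, and recent EMR release labels install applications such as Spark natively where older AMI versions relied on bootstrap actions.

```python
import boto3

# Minimal sketch: launch an EMR cluster with Hue enabled as a native
# application plus a custom bootstrap action. All names and settings
# below are hypothetical placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="bigdata-bootcamp-demo",
    ReleaseLabel="emr-5.36.0",   # assumes a recent release label
    Applications=[
        {"Name": "Hadoop"},
        {"Name": "Hue"},         # the native Hue integration
        {"Name": "Spark"},       # installed natively on newer releases;
                                 # older AMIs used bootstrap actions for this
    ],
    BootstrapActions=[
        {
            "Name": "Install custom tooling",
            "ScriptBootstrapAction": {
                "Path": "s3://example-bucket/bootstrap/install-tools.sh",
                "Args": ["--with-extras"],
            },
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster ID:", response["JobFlowId"])
```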

The best part of the session for me was learning how customers are leveraging AWS big data services. For example, I met one customer who is running HBase on Amazon EMR to store millions of metadata records about genomic sequencing.  They also store additional data in flat files on HDFS and use a Java MapReduce program to do massive joins between those files and the HBase metadata.  We explored other options, such as migrating the flat-file data from HDFS to Amazon S3 so it could be accessed via EMRFS, using Amazon DynamoDB as the metadata store, and whether Presto could become an option once its upcoming HBase connector is released.  It’s great seeing customers leverage Amazon EMR to run multiple applications and think creatively about how to store and join disparate data sets at large scale.
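One way to make the EMRFS option concrete: once the flat files live in Amazon S3, a Hadoop-ecosystem job on the cluster can read them simply by pointing at an s3:// path instead of an hdfs:// path. The sketch below illustrates that swap with PySpark rather than the customer’s Java MapReduce program; the paths and bucket name are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="flat-file-read-sketch")

# Before migration: flat files read from the cluster's local HDFS
# (hypothetical path).
hdfs_records = sc.textFile("hdfs:///data/genomics/flat-files/")

# After migration: the same files read from Amazon S3 via EMRFS,
# with no other code changes (hypothetical bucket name).
s3_records = sc.textFile("s3://example-genomics-bucket/flat-files/")

print(hdfs_records.count(), s3_records.count())
```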

JSON was also on several customers’ minds.  One customer was having performance issues running MapReduce jobs against data stored in a large number of JSON files and was happy to learn how to use S3DistCp to aggregate and compress those files so the jobs could run faster.  Other customers were exploring Amazon DynamoDB’s new Map and List data types, which allow them to write JSON documents directly to their database.  Lastly, we were able to greatly simplify one customer’s workflow by leveraging Amazon Redshift’s COPY from JSON functionality to load nested JSON documents directly into data warehouse tables instead of having to transform them before loading.
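As a rough sketch of the first of those fixes, the snippet below submits an S3DistCp step to an existing cluster with boto3, combining many small JSON files into larger, gzip-compressed objects. The cluster ID, bucket names, grouping pattern, and target size are hypothetical, and on older AMI versions the S3DistCp jar was invoked directly rather than through command-runner.jar.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Sketch: submit an S3DistCp step that aggregates and compresses many
# small JSON files. Cluster ID, buckets, and settings are hypothetical.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "Aggregate and compress JSON files",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src=s3://example-bucket/raw-json/",
                    "--dest=s3://example-bucket/combined-json/",
                    "--groupBy=.*\\.(json)",  # concatenate all .json files
                    "--targetSize=256",       # aim for ~256 MB output files
                    "--outputCodec=gz",       # gzip-compress the output
                ],
            },
        }
    ],
)

print("Step IDs:", response["StepIds"])
```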

We’re all really looking forward to tomorrow’s keynote, big data breakout sessions, and customer interactions at the AWS booth.  Hope to see you there!