AWS Big Data Blog

re:Invent 2016: AWS Big Data & Machine Learning Sessions

Roy Ben-Alta is Sr. Business Development Manager at AWS – Big Data & Machine Learning
Updated December 9, 2016 with links to session videos.


We can’t believe that there are just a couple of weeks left before re:Invent 2016. If you are attending this year, you will want to check out our Big Data sessions! Unlike in previous years, these sessions are covered in multiple tracks, such as Big Data & Analytics, Architecture, Databases, and IoT. We will also have, for the first time, two mini-conferences: Big Data and Machine Learning. These mini-conferences include full-day technical deep dives on a broad variety of topics, including big data, IoT, machine learning, and more.

This year, we have over 40 sessions!

We have great sessions from Netflix, Chick-fil-A, Under Armour, FINRA, Beeswax, GE, Toyota Racing Development, Quantcast, Groupon, Scholastic, Thomson Reuters, DataXu, Sony, EA, and many more. All sessions are recorded and made available on YouTube. Also, all slide decks from the sessions are made available after the conference.

Today, I highlight the sessions to be presented as part of the Big Data & Machine Learning mini-conferences, Big Data analytics, and relevant sessions from other tracks. The following sessions are in this year’s session catalog. Choose any link to learn more or to add a session to your schedule.

We are looking forward to meeting you at re:Invent.

State of the union

BDM205 – Big Data Mini-Con State of the Union –  (Video)
Join us for this general session where AWS big data experts present an in-depth look at the current state of big data. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data announcements, as we kick off the Big Data Mini-Con.

MAC206 – Amazon Machine Learning State of the Union Mini-Con  – (Video)
With the growing number of business cases for artificial intelligence (AI), machine learning and deep learning continue to drive the development of state-of-the-art technology. We see this manifested in computer vision, predictive modeling, natural language understanding, and recommendation engines. During this full day of sessions and workshops, learn how we use some of these technologies within Amazon, and how you can develop your applications to leverage the benefits of these AI services.

Deep dive customer use case sessions

ARC306 – Event Handling at Scale: Designing an Auditable Ingestion and Persistence Architecture for 10K+ events/second – (Video)
How does McGraw-Hill Education use the AWS platform to scale and reliably receive 10,000 learning events per second? How do we provide near-real-time reporting and event-driven analytics for hundreds of thousands of concurrent learners in a reliable, secure, and auditable manner that is cost effective? MHE designed and implemented a robust solution that integrates Amazon API Gateway, AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Elasticsearch Service, Amazon DynamoDB, HDFS, Amazon EMR, Amazon EC2, and other technologies to deliver this cloud-native platform across the US and soon the world. This session describes the challenges we faced, architecture considerations, how we gained confidence for a successful production roll-out, and the behind-the-scenes lessons we learned.

ARC308 – Metering Big Data at AWS: From 0 to 100 Million Records in 1 Second – (Video)
Learn how AWS processes millions of records per second to support accurate metering across AWS and our customers. This session shows how we migrated from traditional frameworks to AWS managed services to support a broad processing pipeline. You gain insights on how we used AWS services to build a reliable, scalable, and fast processing system using Amazon Kinesis, Amazon S3, and Amazon EMR. Along the way, we dive deep into use cases that deal with scaling and accuracy constraints. Attend this session to see AWS’s end-to-end solution that supports metering at AWS.

BDA203 – Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift  – (Video)
GE Power & Water develops advanced technologies to help solve some of the world’s most complex challenges related to water availability and quality. They had amassed billions of rows of data on on-premises databases but decided to migrate some of their core big data projects to the AWS Cloud. When they decided to transform and store it all in Amazon Redshift, they knew they needed an ETL/ELT tool that could handle this enormous amount of data and safely deliver it to its destination.

In this session, Ryan Oates, Enterprise Architect at GE Water, shares his use case, requirements, outcomes, and lessons learned. He also shares the details of his solution stack, including Amazon Redshift and Matillion ETL for Amazon Redshift in AWS Marketplace. You learn best practices on Amazon Redshift ETL supporting enterprise analytics and big data requirements, simply and at scale. You learn how to simplify data loading, transformation, and orchestration onto Amazon Redshift and how to build out a real data pipeline.

BDA204 – Leverage the Power of the Crowd To Work with Amazon Mechanical Turk – (Video)
With Amazon Mechanical Turk (MTurk), you can leverage the power of the crowd for a host of tasks ranging from image moderation and video transcription to data collection and user testing. You simply build a process that submits tasks to the Mechanical Turk marketplace and get results quickly, accurately, and at scale. In this session, Russ, from Rainforest QA, shares best practices and lessons learned from his experience using MTurk. The session covers the key concepts of MTurk, getting started as a Requester, and using MTurk via the API. You learn how to set and manage Worker incentives, achieve great Worker quality, and how to integrate and scale your crowdsourced application. By the end of this session, you have a comprehensive understanding of MTurk and know how to get started harnessing the power of the crowd.

BDA205 – Delighting Customers Through Device Data with Salesforce IoT Cloud and AWS IoT  – (Video)
The Internet of Things (IoT) produces vast quantities of data that promise a deep, always connected view into customer experiences through their devices. In this connected age, the question is no longer how do you gather customer data, but what do you do with all that data. How do you ingest at massive scale and develop meaningful experiences for your customers? In this session, you’ll learn how Salesforce IoT Cloud works in concert with the AWS IoT engine to ingest and transform all of the data generated by every one of your customers, partners, devices, and sensors into meaningful action. You’ll also see how customers are using Salesforce and AWS together to process massive quantities of data, build business rules with simple, intuitive tools, and engage proactively with customers in real time. Session sponsored by Salesforce.

BDM203 – FINRA: Building a Secure Data Science Platform on AWS – (Video)
Data science is a key discipline in a data-driven organization. Through analytics, data scientists can uncover previously unknown relationships in data to help an organization make better decisions. However, data science is often performed from local machines with limited resources and multiple datasets on a variety of databases. Moving to the cloud can help organizations provide scalable compute and storage resources to data scientists, while freeing them from the burden of setting up and managing infrastructure. In this session, FINRA, the Financial Industry Regulatory Authority, shares best practices and lessons learned when building a self-service, curated data science platform on AWS, a project that allowed us to remove the technology middleman and empower users to choose the best compute environment for their workloads. Understand the architecture and underlying data infrastructure services to provide a secure, self-service portal to data scientists, learn how we built consensus for tooling from our data science community, hear about the benefits of increased collaboration among the scientists due to the standardized tools, and learn how you can retain the freedom to experiment with the latest technologies while maintaining information security boundaries within a virtual private cloud (VPC).

BDM204 – Visualizing Big Data Insights with Amazon QuickSight – (Video)
Amazon QuickSight is a fast BI service that makes it easy for you to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight is built to harness the power and scalability of the cloud, so you can easily run analysis on large datasets, and support hundreds of thousands of users. In this session, we’ll demonstrate how you can easily get started with Amazon QuickSight, uploading files, connecting to Amazon S3 and Amazon Redshift and creating analyses from visualizations that are optimized based on the underlying data. After we’ve built our analysis and dashboard, we’ll show you how easy it is to share it with colleagues and stakeholders in just a few seconds.

BDM206 – Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT Analytics Platform on AWS – (Video)
The growing popularity and breadth of use cases for IoT are challenging the traditional thinking of how data is acquired, processed, and analyzed to quickly gain insights and act promptly. Today, the potential of this data remains largely untapped. In this session, we explore architecture patterns for building comprehensive IoT analytics solutions using AWS big data services. We walk through two production-ready implementations. First, we present an end-to-end solution using AWS IoT, Amazon Kinesis, and AWS Lambda. Next, Hello discusses their consumer IoT solution built on top of Amazon Kinesis, Amazon DynamoDB, and Amazon Redshift.

BDM303 – JustGiving: Serverless Data Pipelines, Event-Driven ETL, and Stream Processing  – (Video)
Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), application programming interfaces (API), clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. Building scalable big data pipelines with automated extract-transform-load (ETL) and machine learning processes can address these limitations. JustGiving is the world’s social platform for giving. In this session, we describe how we created several scalable and loosely coupled event-driven ETL and ML pipelines as part of our in-house data science platform called RAVEN. You learn how to leverage AWS Lambda, Amazon S3, Amazon EMR, Amazon Kinesis, and other services to build serverless, event-driven, data and stream processing pipelines in your organization. We review common design patterns, lessons learned, and best practices, with a focus on serverless big data architectures with AWS Lambda.

BDM306 – Netflix: Using Amazon S3 as the fabric of our big data ecosystem – (Video)
Amazon S3 is the central data hub for Netflix’s big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores / engines like Teradata, Amazon Redshift, and Druid, as well as exporting data to reporting tools like Microstrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases, where S3 smoothly addresses an important data infrastructure need. We also provide solutions and methodologies on how you can build your S3 big data hub.

BDM402 – Best Practices for Data Warehousing with Amazon Redshift – (Video)
In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to deliver high throughput and query performance. You learn how to design optimal schemas, load data efficiently, and use workload management.

BDM403 – Building a Real-Time Streaming Data Platform on AWS – (Video)
In this session, we discuss best practices for building an end-to-end streaming data application using Amazon Kinesis. The CTO of Beeswax shares how they built their real-time streaming data platform using Amazon Kinesis streams.

DAT201 – Cross-Region Replication with Amazon DynamoDB Streams – (Video)
Learn how Under Armour implemented cross-region replication with Amazon DynamoDB Streams. Come listen as they share the keys to success.

DAT202 – Migrating Your Data Warehouse to Amazon Redshift – (Video)
Amazon Redshift is a fast, simple, cost-effective data warehousing solution, and in this session, we look at the tools and techniques you can use to migrate your existing data warehouse to Amazon Redshift. We then present a case study on Scholastic’s migration to Amazon Redshift. Scholastic, a large 100-year-old publishing company, was running their business with older, on-premise, data warehousing and analytics solutions, which could not keep up with business needs and were expensive. Scholastic also needed to include new capabilities like streaming data and real-time analytics. Scholastic migrated to Amazon Redshift, and achieved agility and faster time to insight while dramatically reducing costs. In this session, Scholastic discusses how they achieved this, including options considered, technical architecture implemented, results, and lessons learned.

DAT204 – How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to Minutes with MongoDB & AWS  – (Video)
Mass spectrometry is the gold standard for determining chemical compositions, with spectrometers often measuring the mass of a compound down to a single electron. This level of granularity produces an enormous amount of hierarchical data that doesn’t fit well into rows and columns. In this talk, learn how Thermo Fisher is using MongoDB Atlas on AWS to allow their users to get near real-time insights from mass spectrometry experiments—a process that used to take days. We also share how the underlying database service used by Thermo Fisher was built on AWS.

DAT308 – Fireside chat with Groupon, Intuit, and LifeLock on solving Big Data database challenges with Redis – (Video)
Redis Labs’ CMO is hosting a fireside chat with leaders from multiple industries including Groupon (e-commerce), Intuit (Finance), and LifeLock (Identity Protection). This conversation-style session covers the Big Data related challenges faced by these leading companies as they scale their applications, ensure high availability, serve the best user experience at lowest latencies, and optimize between cloud and on-premises operations. The introductory level session can appeal to both developer and DevOps functions. Attendees hear about diverse use cases such as recommendations engine, hybrid transactions and analytics operations, and time-series data analysis. The audience learns how the Redis in-memory database platform addresses the above use cases with its multi-model capability and in a cost effective manner to meet the needs of the next generation applications. Session sponsored by Redis Labs.

DAT309 – How Fulfillment by Amazon (FBA) and Scopely Improved Results and Reduced Costs with a Serverless Architecture – (Video)
In this session, we share an overview of leveraging serverless architectures to support high-performance data intensive applications. Fulfillment by Amazon (FBA) built the Seller Inventory Authority Platform (IAP) using Amazon DynamoDB Streams, AWS Lambda functions, Amazon Elasticsearch Service, and Amazon Redshift to improve results and reduce costs. Scopely shares how they used a flexible logging system built on Amazon Kinesis, Lambda, and Amazon ES to provide high-fidelity reporting on hotkeys in Memcached and DynamoDB, and drastically reduce the incidence of hotkeys. Both of these customers are using managed services and serverless architecture to build scalable systems that can meet the projected business growth without a corresponding increase in operational costs.

DAT310 – Building Real-Time Campaign Analytics Using AWS Services – (Video)
Quantcast provides its advertising clients the ability to run targeted ad campaigns reaching millions of online users. The real-time bidding for campaigns runs on thousands of machines across the world. When Quantcast wanted to collect and analyze campaign metrics in real time, they turned to AWS to rapidly build a scalable, resilient, and extensible framework. Quantcast used Amazon Kinesis streams to stage data, Amazon EC2 instances to shuffle and aggregate the data, and Amazon DynamoDB and Amazon ElastiCache for building scalable time-series databases. With Elastic Load Balancing and Auto Scaling groups, they can set up distributed microservices with minimal operation overhead. This session discusses their use case, how they architected the application with AWS technologies integrated with their existing home-grown stack, and the lessons they learned.

DAT311 – How Toyota Racing Development Makes Racing Decisions in Real Time with AWS – (Video)
In this session, you learn how Toyota Racing Development (TRD) developed a robust and highly performant real-time data analysis tool for professional racing. You learn how we structured a reliable, maintainable, decoupled architecture built around Amazon DynamoDB as both a streaming mechanism and a long-term persistent data store. In racing, milliseconds matter and even moments of downtime can cost a race. You’ll see how we used DynamoDB together with Amazon Kinesis Streams and Amazon Kinesis Firehose to build a real-time streaming data analysis tool for competitive racing.

DAT312 – How DataXu scaled its Attribution System to handle billions of events per day with Amazon DynamoDB – (Video)
“Attribution” is the marketing term of art for allocating full or partial credit to advertisements that eventually lead to purchase, sign up, download, or other desired consumer interaction. DataXu shares how we use DynamoDB at the core of our attribution system to store terabytes of advertising history data. The system is cost effective and dynamically scales from 0 to 300K requests per second on demand with predictable performance and low operational overhead.

DAT313 – 6 Million New Registrations in 30 Days: How the Chick-fil-A One App Scaled with AWS – (Video)
Chris leads the team providing back-end services for the massively popular Chick-fil-A One mobile app that launched in June 2016. Chick-fil-A follows AWS best practices for web services and leverages numerous AWS services, including Elastic Beanstalk, DynamoDB, Lambda, and Amazon S3. This was the largest technology-dependent promotion in Chick-fil-A history. To ensure their architecture would perform at unknown and massive scale, Chris worked with AWS Support through an AWS Infrastructure Event Management (IEM) engagement and leaned on automated operations to enable load testing before launch.

DAT316 – How Telltale Games migrated its story analytics from Apache CouchDB to Amazon DynamoDB – (Video)
Every choice made in Telltale Games titles influences how your character develops and how the world responds to you. With millions of users making thousands of choices in a single episode, Telltale Games tracks this data and leverages it to build more relevant stories in real time as the season is developed. In this session, you’ll learn about Telltale Games’ migration from Apache CouchDB to Amazon DynamoDB, the challenges of adjusting capacity to handle spikes in database activity, and how it streamlined its analytics storage to provide new perspectives on player interaction to improve its games.

DAT318 – Migrating from RDBMS to NoSQL: How Sony Moved from MySQL to Amazon DynamoDB  – (Video)
In this session, you learn the key differences between a relational database management system (RDBMS) and non-relational (NoSQL) databases like Amazon DynamoDB. You learn about suitable and unsuitable use cases for NoSQL databases. You’ll learn strategies for migrating from an RDBMS to DynamoDB through a 5-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.

GAM301 – How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights – (Video)
In November 2015, Capital Games launched a mobile game accompanying a major feature film release. The back end of the game is hosted in AWS and uses big data services like Amazon Kinesis, Amazon EC2, Amazon S3, Amazon Redshift, and AWS Data Pipeline. Capital Games describes some of their challenges with their initial setup and usage of Amazon Redshift and Amazon EMR. They then go over their engagement with AWS Partner 47Lining and talk about specific best practices regarding solution architecture, data transformation pipelines, and system maintenance using AWS big data services. Attendees of this session should expect a candid view of the process of implementing a big data solution, from problem statement identification to visualizing data, with an in-depth look at the technical challenges and hurdles along the way.

LFS303 – How to Build a Big Data Analytics Data Lake – (Video)
For discovery phase research, life sciences companies have to support infrastructure that processes millions to billions of transactions. The data lake is proving to be a stable and productive data platform pattern for this task. We discuss how to build a data lake on AWS, using services and techniques such as AWS CloudFormation, Amazon EC2, Amazon S3, IAM, and AWS Lambda. We also review a reference architecture from Amgen that uses a data lake to aid in their life science research.

SVR301 – Real-time Data Processing Using AWS Lambda, Amazon Kinesis  – (Video)
In this session, you learn from Thomson Reuters how they leverage AWS for their Product Insight service. The service provides insights by collecting usage analytics for Thomson Reuters products. They walk through its architecture and demonstrate how they leverage Amazon Kinesis Streams, Amazon Kinesis Firehose, AWS Lambda, Amazon S3, Amazon Route 53, and AWS KMS for near real-time access to data being collected around the globe. They also outline how applying AWS methodologies benefited their business, such as time-to-market and cross-region ingestion, auto-scaling capabilities, low-latency, security features, and extensibility.

SVR305 – ↑↑↓↓←→←→ BA Lambda Start – (Video)
Ever wished you had a list of cheat codes to unleash the full power of AWS Lambda for your production workload? Come learn how to build a robust, scalable, and highly available serverless application using AWS Lambda. In this session, we discuss hacks and tricks for maximizing your AWS Lambda performance, such as leveraging container reuse, using the 500 MB scratch space and local cache, creating custom metrics for managing operations, aligning upstream and downstream services to scale along with Lambda, and many other workarounds and optimizations across your entire function lifecycle. You also learn how Hearst converted its real-time clickstream analytics data pipeline from a server-based model to a serverless one. The infrastructure of the data pipeline relied on Amazon EC2 instances and cron jobs to shepherd data through the process. In 2016, Hearst converted its data pipeline architecture to a serverless process that is based on event triggers and the power of AWS Lambda. By moving from a time-based process to a trigger-based process, Hearst improved its pipeline latency times by 50%.

SVR308 – Content and Data Platforms at Vevo: Rebuilding and Scaling from Zero in One Year  – (Video)
Vevo has undergone a complete strategic and technical reboot, driven not only by product but also by engineering. Since November 2015, Vevo has been replacing monolithic, legacy content services with a modern, modular, microservices architecture, all while developing new features and functionality. In parallel, Vevo has built its data platform from scratch to power the internal analytics as well as a unique music video consumption experience through a new personalized feed of recommendations, all in less than one year. This has been a monumental effort that was made possible in this short time span largely because of AWS technologies. The content team has been heavily using serverless architectures and AWS Lambda in the form of microservices, taking a similar approach to functional programming, which has helped us speed up the development process and time to market. The data team has been building the data platform by heavily leveraging Amazon Kinesis for data exchange across services, Amazon Aurora for consumer-facing services, Apache Spark on Amazon EMR for ETL and machine learning, as well as Amazon Redshift as the core analytics data store.

Machine learning sessions

MAC201 – Getting to Ground Truth with Amazon Mechanical Turk – (Video)
Jump-start your machine learning project by using the crowd to build your training set. Before you can train your machine learning algorithm, you need to take your raw inputs and label, annotate, or tag them to build your ground truth. Learn how to use the Amazon Mechanical Turk marketplace to perform these tasks. We share Amazon’s best practices, developed while training our own machine learning algorithms and walk you through quickly getting affordable and high-quality training data.

MAC202 – Deep Learning in Alexa  – (Video)
Neural networks have a long and rich history in automatic speech recognition. In this talk, we present a brief primer on the origin of deep learning in spoken language, and then explore today’s world of Alexa. Alexa is the AWS service that understands spoken language and powers Amazon Echo. Alexa relies heavily on machine learning and deep neural networks for speech recognition, text-to-speech, language understanding, and more. We also discuss the Alexa Skills Kit, which lets any developer teach Alexa new skills.

MAC205 – Deep Learning at Cloud Scale: Improving Video Discoverability by Scaling Up Caffe on AWS  – (Video)
Deep learning continues to push the state of the art in domains such as video analytics, computer vision, and speech recognition. Deep networks are powered by amazing levels of representational power, feature learning, and abstraction. This approach comes at the cost of a significant increase in required compute power, which makes the AWS cloud an excellent environment for training. Innovators in this space are applying deep learning to a variety of applications. One such innovator, Vilynx, a startup based in Palo Alto, realized that the current pre-roll advertising-based models for mobile video weren’t returning publishers’ desired levels of engagement. In this session, we explain the algorithmic challenges of scaling across multiple nodes, and what Intel is doing on AWS to overcome them. We describe the benefits of using AWS CloudFormation to set up a distributed training environment for deep networks. We also showcase Vilynx’s contributions to video discoverability and explain how Vilynx uses AWS tools to understand video content.

MAC302 – Leveraging Amazon Machine Learning, Amazon Redshift, and an Amazon Simple Storage Service Data Lake for Strategic Advantage in Real Estate  – (Video)
The Howard Hughes Corporation partnered with 47Lining to develop a managed enterprise data lake based on Amazon S3. The purpose of the managed EDL is to fuse relevant on-premises and third-party data to enable Howard Hughes to answer its most valuable business questions. Their first analysis was a lead-scoring model that uses Amazon Machine Learning (Amazon ML) to predict propensity to purchase high-end real estate. The model is based on a combined set of public and private data sources, including all publicly recorded real estate transactions in the US for the past 35 years. By changing their business process for identifying and qualifying leads to use the results of data-driven analytics from their managed data lake in AWS, Howard Hughes increased the number of identified qualified leads in their pipeline by over 400% and reduced the acquisition cost per lead by more than 10 times. In this session, you see a practical example of how to use Amazon ML to improve business results, how to architect a data lake with Amazon S3 that fuses on-premises, third-party, and public datasets, and how to train and run an Amazon ML model to attain predictions.

MAC303 – Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark  – (Video)
Customers are adopting Apache Spark and its Spark ML library, a set of open-source distributed machine learning algorithms, on Amazon EMR for large-scale machine learning workloads, especially for applications that power customer segmentation and content recommendation. By leveraging Spark ML, customers can quickly build and execute massively parallel machine learning jobs. Additionally, Spark applications can train models in streaming or batch contexts and can access data from Amazon S3, Amazon Kinesis, Apache Kafka, Amazon Elasticsearch Service, Amazon Redshift, and other services. This session explains how to quickly and easily create scalable Spark clusters with Amazon EMR, build and share models using Apache Zeppelin notebooks, and create a sample application using Spark Streaming, which updates models with real-time data.

MAC306 – Using MXNet for Recommendation Modeling at Scale  – (Video)
For many companies, recommendation systems solve important machine learning problems. But as recommendation systems grow to millions of users and millions of items, they pose significant challenges when deployed at scale. The user-item matrix can have trillions of entries (or more), most of which are zero. To make common ML techniques practical, sparse data requires special techniques. Learn how to use MXNet to build neural network models for recommendation systems that can scale efficiently to large sparse datasets.
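To make the sparsity point concrete, here is a minimal sketch in plain Python. It does not use MXNet and is not taken from the session; the function names and sample data are hypothetical. It stores only the nonzero entries of a user-item matrix and reports how small a fraction of the full matrix they occupy.

```python
def build_sparse(interactions):
    """Store only observed (user, item, rating) entries as a dict of dicts."""
    matrix = {}
    for user, item, rating in interactions:
        matrix.setdefault(user, {})[item] = rating
    return matrix


def density(matrix, n_users, n_items):
    """Fraction of the full n_users x n_items matrix that is nonzero."""
    nonzero = sum(len(row) for row in matrix.values())
    return nonzero / (n_users * n_items)


# Hypothetical sample: 3 observed ratings in a 3-user x 4-item catalog.
sample = [("u1", "i1", 5.0), ("u1", "i3", 3.0), ("u2", "i2", 4.0)]
sparse = build_sparse(sample)
sample_density = density(sparse, n_users=3, n_items=4)
```

With millions of users and items, the dense matrix would need trillions of cells, while a structure like this grows only with the number of observed interactions.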

MAC307 – Predicting Customer Churn with Amazon Machine Learning – (Video)
In this session, we take a specific business problem—predicting Telco customer churn—and explore the practical aspects of building and evaluating an Amazon Machine Learning model. We explore considerations ranging from assigning a dollar value to applying the model using the relative cost of false positive and false negative errors. We discuss all aspects of putting Amazon ML to practical use, including how to build multiple models to choose from, put models into production, and update them. We also discuss using Amazon Redshift and Amazon S3 with Amazon ML.
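The idea of weighing false positives against false negatives can be sketched as follows. This is an illustrative example in plain Python, not the Amazon ML API and not the session's code; the scores, labels, and dollar costs are made-up assumptions. It picks the score cut-off that minimizes total misclassification cost.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Sum the cost of every false positive and false negative at a threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_churn = score >= threshold
        if predicted_churn and not label:
            cost += cost_fp  # retention offer sent to a customer who would have stayed
        elif not predicted_churn and label:
            cost += cost_fn  # churner we failed to flag
    return cost


def best_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Hypothetical model scores and true churn labels (1 = churned).
scores = [0.1, 0.4, 0.6, 0.9]
labels = [0, 1, 0, 1]
candidates = [0.2, 0.5, 0.8]

# When a missed churner is 10x as costly as a wasted offer, a low cut-off wins.
chosen = best_threshold(scores, labels, cost_fp=1.0, cost_fn=10.0,
                        candidates=candidates)
```

Swapping the two costs pushes the chosen cut-off upward, which is why the dollar values need to be agreed on before the model is applied.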

Services sessions: Architecture and best practices

BDM201 – Big Data Architectural Patterns and Best Practices on AWS – (Video)
The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.

BDM301 – Best Practices for Apache Spark on Amazon EMR – (Video)
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.

BDM302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana – (Video)
Elasticsearch is a fully featured search engine used for real-time analytics, and Amazon Elasticsearch Service makes it easy to deploy Elasticsearch clusters on AWS. With Amazon ES, you can ingest and process billions of events per day, and explore the data using Kibana to discover patterns. In this session, we use Apache web logs as an example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data into it using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.

BDM304 – Analyzing Streaming Data in Real-time with Amazon Kinesis Analytics – (Video)
As more and more organizations strive to gain real-time insights into their business, streaming data has become ubiquitous. Typical streaming data analytics solutions require specific skills and complex infrastructure. However, with Amazon Kinesis Analytics, you can analyze streaming data in real time with standard SQL—there is no need to learn new programming languages or processing frameworks. In this session, we dive deep into the capabilities of Amazon Kinesis Analytics using real-world examples. We’ll present an end-to-end streaming data solution using Amazon Kinesis Streams for data ingestion, Amazon Kinesis Analytics for real-time processing, and Amazon Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Amazon Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
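
As an assumption about the kind of query such a session might cover, streaming SQL is often used to aggregate records over fixed, non-overlapping (tumbling) time windows. The sketch below imitates that aggregation in plain Python so the logic is easy to follow. It is not Amazon Kinesis Analytics SQL, and the event tuples and the function name are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical stream of (epoch_seconds, ticker) records.
events = [
    (0, "AMZN"), (3, "AMZN"), (9, "MSFT"),
    (10, "AMZN"), (14, "MSFT"), (21, "AMZN"),
]

result = tumbling_window_counts(events, window_seconds=10)
for (window_start, ticker), count in sorted(result.items()):
    print(window_start, ticker, count)
```

A real streaming application also has to decide when a window is complete and how to treat late or out-of-order records, which this batch-style sketch ignores.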

BDM401 – Deep Dive: Amazon EMR Best Practices & Design Patterns – (Video)
Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.

DAT304 – Deep Dive on Amazon DynamoDB – (Video)
Explore Amazon DynamoDB capabilities and benefits in detail and learn how to get the most out of your DynamoDB database. We go over best practices for schema design with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, DynamoDB Streams, and more. We also provide lessons learned from operating DynamoDB at scale, including provisioning DynamoDB for IoT.


BDM202 – Workshop: Building Your First Big Data Application with AWS 
Want to get ramped up on how to use Amazon’s big data web services and launch your first big data application on AWS? Join us in this workshop as we build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS and give you access to a take-home lab so that you can rebuild and customize the application yourself.

IOT306 – IoT Visualizations and Analytics 
In this workshop, we focus on visualizations of IoT data using either the ELK stack (Amazon Elasticsearch Service, Logstash, and Kibana) or Amazon Kinesis. We dive into how these visualizations put your device data in the context of the world around the devices, giving you new capabilities and understanding when you interact with that data.

MAC401 – Scalable Deep Learning Using MXNet  
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding, and recommendation engines. One of the key reasons for this progress is the availability of highly flexible and developer-friendly deep learning frameworks. During this workshop, members of the Amazon Machine Learning team provide a short background on deep learning, focusing on relevant application domains, and an introduction to MXNet, a powerful and scalable deep learning framework. By the end of this workshop, you'll have hands-on experience with a variety of applications, including computer vision and recommendation engines, as well as exposure to preconfigured Deep Learning AMIs and AWS CloudFormation templates that help speed your development.

STG312 – Workshop: Working with AWS Snowball – Accelerating Data Ingest into the Cloud 
This workshop provides customers with the opportunity to work hands-on with the AWS Snowball service, with attendees broken out into small teams to perform various on-premises to cloud data transfer scenarios using actual Snowball devices. These scenarios include migrating backup & archive data to S3-IA and Amazon Glacier, HDFS cluster migration to S3 for use with Amazon EMR and Amazon Redshift, and leveraging the Snowball API & SDK to build AWS Snowball service integration into a custom application. The session opens with an overview of the service, objectives, and guidance on where to find resources. Attendees should bring their own laptops and should have a basic familiarity with AWS storage services (S3 and Amazon Glacier). Prerequisites: Participants should have an AWS account established and available for use during the workshop.