Amazon EMR | AWS Big Data Blog

Analyze data in Amazon DynamoDB using Amazon SageMaker for real-time prediction

I’ll describe how to read the DynamoDB backup file format in Data Pipeline and how to convert the objects in S3 to a CSV format that Amazon ML can read. I’ll also show you how to schedule regular exports and transformations using Data Pipeline.

Getting started: Training resources for Big Data on AWS

Whether you’ve just signed up for your first AWS account or you’ve been with us for some time, there’s always something new to learn as our services evolve to meet the ever-changing needs of our customers. To help ensure you’re set up for success as you build with AWS, we put together this quick reference guide for Big Data training and resources available here on the AWS site.

How to migrate a Hue database from an existing Amazon EMR cluster

This post describes the step-by-step process for migrating the Hue database from an existing EMR cluster.

Easily manage table metadata for Presto running on Amazon EMR using the AWS Glue Data Catalog

In this post, we will explore how the AWS Glue Data Catalog addresses discoverability and manageability for table metadata for Presto on Amazon EMR.

Build a Multi-Tenant Amazon EMR Cluster with Kerberos, Microsoft Active Directory Integration and IAM Roles for EMRFS

In this post, we will discuss what EMRFS authorization is (Amazon S3 storage-level access control) and show how to configure the role mappings with detailed examples.

Dynamically Create Friendly URLs for Your Amazon EMR Web Interfaces

This solution provides a serverless approach to automatically assigning a friendly name to your EMR cluster, giving you easy access to popular notebooks and other web interfaces.

Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory

This post walks you through the process of using AWS CloudFormation to set up a cross-realm trust and extend authentication from an Active Directory network into an Amazon EMR cluster with Kerberos enabled. By establishing a cross-realm trust, Active Directory users can use their Active Directory credentials to access an Amazon EMR cluster and run jobs as themselves.

Custom Log Presto Query Events on Amazon EMR for Auditing and Performance Insights

In this blog post, we demonstrate how to implement and install a Presto event listener for custom logging, debugging, and performance analysis of queries executed on an EMR cluster.

Genomic Analysis with Hail on Amazon EMR and Amazon Athena

In this post, we use Hail, an open source framework for exploring and analyzing genomic data that is built on Apache Spark, and we run it on Amazon EMR. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.

Create Custom AMIs and Push Updates to a Running Amazon EMR Cluster Using Amazon EC2 Systems Manager

In this post, I show how Systems Manager Automation can be used to automate the creation and patching of custom Amazon Linux AMIs for EMR. I also show how you can use Run Command to send commands to all nodes of a running EMR cluster.
