AWS Big Data Blog

Tag: Amazon EMR

Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory

This post walks you through the process of using AWS CloudFormation to set up a cross-realm trust and extend authentication from an Active Directory network into an Amazon EMR cluster with Kerberos enabled. By establishing a cross-realm trust, Active Directory users can use their Active Directory credentials to access an Amazon EMR cluster and run jobs as themselves.

Read More

Genomic Analysis with Hail on Amazon EMR and Amazon Athena

For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.

Read More

Turbocharge your Apache Hive Queries on Amazon EMR using LLAP

Apache Hive is one of the most popular tools for analyzing large datasets stored in a Hadoop cluster using SQL. Data analysts and scientists use Hive to query, summarize, explore, and analyze big data. With the introduction of Hive LLAP (Low Latency Analytical Processing), the notion of Hive being just a batch processing tool has […]

Read More

Run Common Data Science Packages on Anaconda and Oozie with Amazon EMR

In the world of data science, users must often sacrifice cluster set-up time to allow for complex usability scenarios. Amazon EMR allows data scientists to spin up complex cluster configurations easily, and to be up and running with complex queries in a matter of minutes. Data scientists often use scheduling applications such as Oozie to […]

Read More