AWS Big Data Blog

Tag: Amazon EMR

Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory

This post walks you through using AWS CloudFormation to set up a cross-realm trust and extend authentication from an Active Directory network into an Amazon EMR cluster with Kerberos enabled. With a cross-realm trust in place, Active Directory users can use their Active Directory credentials to access an Amazon EMR cluster and run jobs as themselves.

Genomic Analysis with Hail on Amazon EMR and Amazon Athena

This post uses Hail, an open source framework for exploring and analyzing genomic data that is built on Apache Spark. We run Hail on Amazon EMR and walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet-formatted variant dataset and explore it using Amazon Athena.

Turbocharge your Apache Hive Queries on Amazon EMR using LLAP

NOTE: The bootstrap action script in this post may need periodic updates as newer EMR releases become available. Please leave a comment if you have any trouble implementing this solution.

Apache Hive is one of the most popular tools for analyzing large datasets stored in a Hadoop cluster using SQL. Data analysts and scientists […]

Run Common Data Science Packages on Anaconda and Oozie with Amazon EMR

In the world of data science, users often have to spend extra cluster setup time to support complex usage scenarios. Amazon EMR allows data scientists to spin up complex cluster configurations easily and to be running complex queries in a matter of minutes. Data scientists often use scheduling applications such as Oozie to […]

Setting up Read Replica Clusters with HBase on Amazon S3

Many customers have taken advantage of the numerous benefits of running Apache HBase on Amazon S3 for data storage, including lower costs, data durability, and easier scalability. Customers such as FINRA have lowered their costs by 60% by moving to an HBase on S3 architecture along with the numerous operational benefits that come with decoupling […]

Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3

Although it’s common for Amazon EMR customers to process data directly in Amazon S3, there are occasions where you might want to copy data from S3 to the Hadoop Distributed File System (HDFS) on your Amazon EMR cluster. Additionally, you might have a use case that requires moving large amounts of data between buckets or regions. In these use cases, large datasets are too big for a simple copy operation.

Build a Healthcare Data Warehouse Using Amazon EMR, Amazon Redshift, AWS Lambda, and OMOP

In the healthcare field, data comes in all shapes and sizes. Despite efforts to standardize terminology, some concepts (e.g., blood glucose) are still often depicted in different ways. This post demonstrates how to convert an openly available dataset called MIMIC-III, which consists of de-identified medical data for about 40,000 patients, into an open source data […]
