AWS Big Data Blog

Category: Amazon EMR

Automating EMR workloads using AWS Step Functions

Amazon EMR allows you to process vast amounts of data quickly and cost-effectively at scale. Using open-source tools such as Apache Spark, Apache Hive, and Presto, and coupled with the scalable storage of Amazon Simple Storage Service (Amazon S3), Amazon EMR gives analytical teams the engines and elasticity to run petabyte-scale analysis for a fraction […]

Implementing LDAP authentication for Hive on a multi-tenant Amazon EMR cluster

As Amazon EMR continues its widespread adoption, it’s important to enforce separation of duties using role-based access when submitting your Hive jobs on EMR clusters in multi-tenant environments. In this post, we walk through the steps to set up authentication for Hive using Lightweight Directory Access Protocol (LDAP) and Microsoft Active Directory Domain Controller. Solution […]

Amazon EMR supports Apache Hive ACID transactions

December 2022: The best practice of using EMRFS consistent view in this blog post is now obsolete as Amazon S3 has supported strong read-after-write consistency since December 2020. Apache Hive is an open-source data warehouse package that runs on top of an Apache Hadoop cluster. You can use Hive for batch processing and large-scale data analysis. […]

Build a self-service environment for each line of business using Amazon EMR and AWS Service Catalog

Enterprises often want to centralize governance and compliance requirements, and provide a common set of policies on how Amazon EMR instances should be set up. You can use AWS Service Catalog to centrally manage commonly deployed Amazon EMR cluster configurations, and this helps you achieve consistent governance and meet your compliance requirements, while at the […]

Enhancing customer safety by leveraging the scalable, secure, and cost-optimized Toyota Connected Data Lake

February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more. Toyota Motor Corporation (TMC), a global automotive manufacturer, has made “connected cars” a core priority as part of its broader transformation from an auto company to a mobility company. In recent years, […]

Monitor and Optimize Analytic Workloads on Amazon EMR with Prometheus and Grafana

This post discusses installing and configuring Prometheus and Grafana on an Amazon Elastic Compute Cloud (Amazon EC2) instance, configuring an EMR cluster to emit metrics that Prometheus can scrape, and using Grafana dashboards to analyze those metrics for a workload on the EMR cluster and optimize it. We also cover how Prometheus can push alerts to the Alertmanager and how to configure Amazon SNS to send email notifications.

Build a distributed big data reconciliation engine using Amazon EMR and Amazon Athena

This is a guest post by Sara Miller, Head of Data Management and Data Lake, Direct Energy; and Zhouyi Liu, Senior AWS Developer, Direct Energy. Enterprise companies like Direct Energy migrate on-premises data warehouses and services to AWS to achieve fully manageable digital transformation of their organization. Freedom from traditional data warehouse constraints frees up […]

Enable fine-grained data access in Zeppelin Notebook with AWS Lake Formation

This post explores how you can use AWS Lake Formation integration with Amazon EMR (still in beta) to implement fine-grained column-level access controls while using Spark in a Zeppelin Notebook. My previous post Extract Salesforce.com data using AWS Glue and analyzing with Amazon Athena showed you a simple use case for extracting any Salesforce object data using AWS Glue and Apache Spark, saving it to Amazon Simple Storage Service (Amazon S3), cataloging the data using the Data Catalog in Glue, and querying it using Amazon Athena.

Improving RAPIDS XGBoost performance and reducing costs with Amazon EMR running Amazon EC2 G4 instances

This is a guest post by Kong Zhao, Solution Architect at NVIDIA Corporation. This post shares how NVIDIA sped up RAPIDS XGBoost performance by up to 4.5 times and reduced costs by up to 5.4 times by using Amazon EMR running Amazon Elastic Compute Cloud (Amazon EC2) G4 instances. Gradient boosting is a powerful machine […]

Control data access and permissions with AWS Lake Formation and Amazon EMR

What if you could control the access to your data lake centrally? Would it be more convenient to share specific data securely with internal and external customers? With AWS Lake Formation and its integration with Amazon EMR, you can easily perform these administrative tasks. This post goes through a use case and reviews the steps to control the data access and permissions of your existing data lake.