EMR | AWS Big Data Blog

EMR Notebooks: A managed analytics environment based on Jupyter notebooks

Notebooks are increasingly becoming the standard tool for interactively developing big data applications. It’s easy to see why. Their flexible architecture allows you to experiment with data in multiple languages, test code interactively, and visualize large datasets. To help scientists and developers easily access notebook tools, we launched Amazon EMR Notebooks, a managed notebook environment […]

Test data quality at scale with Deequ

In this blog post, we introduce Deequ, an open source tool developed and used at Amazon. Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look.

Optimize Amazon EMR costs with idle checks and automatic resource termination using advanced Amazon CloudWatch metrics and AWS Lambda

Many customers use Amazon EMR to run big data workloads, such as Apache Spark and Apache Hive queries, in their development environment. Data analysts and data scientists frequently use these types of clusters, known as analytics EMR clusters. Users often forget to terminate the clusters after their work is done. This leads to idle running […]

Spark enhancements for elasticity and resiliency on Amazon EMR

This blog post gives an overview of the issues with how open-source Spark handles node loss and of the improvements in Amazon EMR that address them.

Dynamically scale up storage on Amazon EMR clusters

February 2025: The bootstrap action script in this blog post uses IMDS v1 to access EC2 instance metadata. The script does not support IMDS v2 and cannot be used in an AWS account that enforces IMDS v2. Using the script in an IMDS v2 enabled account will cause issues and unexpected […]

Sharpen your Skill Set with Apache Spark on the AWS Big Data Blog

The AWS Big Data Blog has a large community of authors who are passionate about Apache Spark and who regularly publish content that helps customers use Spark to build real-world solutions. You’ll see content on a variety of topics, including deep-dives on Spark’s internals, building Spark Streaming applications, creating machine learning pipelines using MLlib, and ways […]
