AWS Big Data Blog

Optimizing costs and performance with Advanced Managed Scaling on Amazon EMR on EC2

In this post, we discuss the benefits of Advanced Scaling for Amazon EMR on Amazon EC2 and demonstrate how it works through some example scenarios. You’ll learn when to prioritize utilization-optimized settings for cost savings with conservative scaling, balanced approaches for mixed workloads, or performance-optimized configurations for SLA-sensitive jobs requiring aggressive scaling.

Deliver Apache Kafka data to streaming tables for Apache Iceberg with Amazon MSK Express brokers

Announcing delivery to streaming tables on Apache Iceberg for Amazon MSK Express brokers, a fully managed capability that continuously materializes your Kafka streaming data as queryable Iceberg tables on Amazon S3 Tables. No connectors, Flink jobs, or custom consumers to manage, and no code to write.

Lowering AWS KMS decrypt API costs in EMR Spark jobs

Processing encrypted data in Amazon S3 with Amazon EMR and Apache Spark can drive up AWS KMS decrypt API costs as the number of objects grows. This post shows three techniques to reduce those costs without compromising encryption: optimizing file formats (including Apache Iceberg), aggregating data, and using AWS Glue Data Catalog partition indexes.

Accelerate Spark on EMR Serverless with larger workers and shuffle-optimized disks

Amazon EMR Serverless now supports a 32 vCPU / 244 GB worker configuration for the most demanding Spark jobs. Across 126 TPC-DS and TPC-H queries, larger workers delivered an average 29% faster query execution and 29% lower cost, with the biggest gains on shuffle-heavy, multi-table join queries.

Automate Spark Scala migration to 4.x with AWS Spark Upgrade Agent

Learn how to automate Apache Spark 3.x to 4.0 Scala migration on Amazon EMR using the AWS Spark Upgrade Agent. This post covers API deprecations, behavioral changes, build configuration updates, and job validation, turning months of manual effort into hours.

Build a contract compliance search system with Amazon OpenSearch

In this post, you build a contract compliance search system that combines semantic search with semantic highlighting in Amazon OpenSearch Service. You deploy the solution using two AWS CloudFormation stacks, test it with synthetic contract documents, and see how a single query surfaces both the right contracts and the right clauses within them.

Building a scalable personalized recommendation system on AWS: From batch to real-time

Learn how the Everyday Essentials team built a scalable personalized recommendation platform on AWS using a batch-first architecture with Amazon MWAA for orchestration, Amazon SageMaker for training and vector search, and AWS Lake Formation for governed data access, then extended it to real-time with Amazon MemoryDB.

Automate creating AWS Glue Data Catalog views with AWS SDK for data mesh use case

This post shows you how to use the AWS Glue Data Catalog CreateTable() API to programmatically create views with the ATHENA and SPARK dialects using cross-account IAM definer roles, and how to programmatically add the ATHENA dialect to views that were created earlier with only the SPARK dialect.

Efficient log management with Amazon OpenSearch Service data streams

In this post, we show you how to implement data streams with Index State Management (ISM) in Amazon OpenSearch Service. This approach automatically manages your time series data lifecycle and optimizes both performance and costs. Data streams distribute incoming data across multiple backing indices, helping to reduce single-index bottlenecks, while ISM policies automate rollover, retention, and storage tiering to help manage costs.

How Alight Solutions achieved 55% cost savings with Amazon OpenSearch Service

In this post, we share how Alight Solutions migrated from self-managed Elasticsearch to Amazon OpenSearch Service. The migration achieved a 55% cost reduction, eliminated approximately 2,000 hours per year of operational overhead, and gave Alight access to advanced observability features they could not prioritize before.