AWS Big Data Blog

Your guide to AWS Analytics at AWS re:Invent 2025

It’s that time of year again — AWS re:Invent is here! At re:Invent, bold ideas come to life. Get a front-row seat to hear inspiring stories from AWS experts, customers, and leaders as they explore today’s most impactful topics, from data analytics to AI. For all the data enthusiasts and professionals, we’ve curated a comprehensive […]

How Yelp modernized its data infrastructure with a streaming lakehouse on AWS

This is a guest post by Umesh Dangat, Senior Principal Engineer for Distributed Services and Systems at Yelp, and Toby Cole, Principal Engineer for Data Processing at Yelp, in partnership with AWS. Yelp processes massive amounts of user data daily—over 300 million business reviews, 100,000 photo uploads, and countless check-ins. Maintaining sub-minute data freshness with […]

Amazon MSK Express brokers now support Intelligent Rebalancing for 180 times faster operation performance

Effective today, all new Amazon Managed Streaming for Apache Kafka (Amazon MSK) Provisioned clusters with Express brokers will support Intelligent Rebalancing at no additional cost. In this post, we introduce the Intelligent Rebalancing feature and walk through an example of how it improves operation performance.

Analyzing Amazon EC2 Spot instance interruptions by using event-driven architecture

In this post, you’ll learn how to build a comprehensive monitoring solution for Spot Instance interruptions, step by step. You’ll gain practical experience designing an event-driven pipeline, implementing data processing workflows, and creating insightful dashboards that help you track interruption trends, optimize Auto Scaling group (ASG) configurations, and improve the resilience of your Spot Instance workloads.
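An event-driven pipeline like the one described typically starts by capturing the EC2 Spot Instance Interruption Warning events that Amazon EventBridge emits two minutes before reclamation. As a minimal, illustrative sketch (the function names are assumptions, not part of the post; the `source` and `detail-type` values follow the documented EC2 event format), a handler might filter and extract the affected instance ID like this:

```python
# Sketch of the filtering step in an event-driven Spot interruption pipeline.
# Events matching the EventBridge pattern
#   {"source": ["aws.ec2"], "detail-type": ["EC2 Spot Instance Interruption Warning"]}
# are shaped like sample_event below. Function names here are illustrative.

def is_spot_interruption(event: dict) -> bool:
    """Return True if the event is an EC2 Spot Instance Interruption Warning."""
    return (
        event.get("source") == "aws.ec2"
        and event.get("detail-type") == "EC2 Spot Instance Interruption Warning"
    )

def extract_instance_id(event: dict):
    """Pull the interrupted instance ID out of the event detail, if present."""
    return event.get("detail", {}).get("instance-id")

# Example payload in the documented event shape (IDs are placeholders).
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}
```

In a real deployment, an EventBridge rule with that pattern would route matching events to a target (for example, a Lambda function or a Kinesis stream) that records the interruption for later dashboarding.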

Enhanced search with match highlights and explanations in Amazon SageMaker

Amazon SageMaker now enhances search results in Amazon SageMaker Unified Studio with additional context that improves transparency and interpretability. The capability introduces inline highlighting for matched terms and an explanation panel that details where and how each match occurred across metadata fields such as name, description, glossary, and schema. In this post, we demonstrate how to use enhanced search in Amazon SageMaker.

Amazon Kinesis Data Streams launches On-demand Advantage for instant throughput increases and streaming at scale

Today, AWS announced the new Amazon Kinesis Data Streams On-demand Advantage mode, which includes warm throughput capability and an updated pricing structure. With this feature, you can enable instant scaling for traffic surges while optimizing costs for consistent streaming workloads. In this post, we explore this new feature, including key use cases, configuration options, pricing considerations, and best practices for optimal performance.

Scaling data governance with Amazon DataZone: Covestro success story

In this post, we show you how Covestro transformed its data architecture by implementing Amazon DataZone and AWS Serverless Data Lake Framework, transitioning from a centralized data lake to a data mesh architecture. The implementation enabled streamlined data access, better data quality, and stronger governance at scale, achieving a 70% reduction in time-to-market for over 1,000 data pipelines.

Use trusted identity propagation for Apache Spark interactive sessions in Amazon SageMaker Unified Studio

In this post, we provide step-by-step instructions to set up Amazon EMR on EC2, EMR Serverless, and AWS Glue within SageMaker Unified Studio, enabled with trusted identity propagation. We use the setup to illustrate how different IAM Identity Center users can run their Spark sessions on each compute option within the same project in SageMaker Unified Studio. We show how each user sees only the tables, or the portions of tables, that they’re granted access to in Lake Formation.

Amazon Kinesis Data Streams now supports 10x larger record sizes: Simplifying real-time data processing

Today, AWS announced that Amazon Kinesis Data Streams now supports record sizes up to 10 MiB, a tenfold increase from the previous limit. In this post, we explore Amazon Kinesis Data Streams large record support, including key use cases, configuration of maximum record sizes, throttling considerations, and best practices for optimal performance.
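Since producers previously had to reject or split payloads above the old cap, a common pattern is a client-side size check before calling PutRecord. A minimal sketch against the new 10 MiB per-record limit mentioned above (the helper name and constant are assumptions for illustration, not an AWS API):

```python
# Hedged sketch: client-side check against the 10 MiB per-record limit
# described in the announcement. Names here are illustrative, not AWS APIs.

MAX_RECORD_BYTES = 10 * 1024 * 1024  # 10 MiB per-record cap

def fits_in_record(payload: bytes) -> bool:
    """Return True if the payload fits within a single Kinesis record."""
    return len(payload) <= MAX_RECORD_BYTES

small_payload = b"x" * 1024                       # 1 KiB: fine
oversized_payload = b"x" * (MAX_RECORD_BYTES + 1)  # just over the cap: reject
```

Payloads that fail the check would still need to be split or stored out of band (for example, referenced via Amazon S3), as before; per-shard throughput limits also still apply regardless of record size.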