AWS Big Data Blog
AWS analytics at re:Invent 2025: Unifying data, AI, and governance at scale
re:Invent 2025 showcased the bold Amazon Web Services (AWS) vision for the future of analytics, one where data warehouses, data lakes, and AI development converge into a seamless, open, intelligent platform, with Apache Iceberg compatibility at its core. Across more than 18 major announcements spanning three weeks, AWS demonstrated how organizations can break down data silos, accelerate insights with AI, and maintain robust governance without sacrificing agility.
Amazon SageMaker: Your data platform, simplified
AWS introduced a faster, simpler approach to data platform onboarding for Amazon SageMaker Unified Studio. The new one-click onboarding experience eliminates weeks of setup, so teams can start working with existing datasets in minutes using their current AWS Identity and Access Management (IAM) roles and permissions. Accessible directly from the Amazon SageMaker, Amazon Athena, Amazon Redshift, and Amazon S3 Tables consoles, this streamlined experience automatically creates SageMaker Unified Studio projects with existing data permissions intact. At its core is a powerful new serverless notebook that reimagines how data professionals work. This single interface combines SQL queries, Python code, Apache Spark processing, and natural language prompts, backed by Amazon Athena for Apache Spark to scale from interactive exploration to petabyte-scale jobs. Data engineers, analysts, and data scientists no longer need to context-switch between different tools based on workload: they can explore data with SQL, build models with Python, and use AI assistance, all in one place.
The introduction of Amazon SageMaker Data Agent in the new SageMaker notebooks marks a pivotal moment in AI-assisted development for data builders. This built-in agent doesn’t only generate code; it understands your data context, catalog information, and business metadata to create intelligent execution plans from natural language descriptions. When you describe an objective, the agent breaks down complex analytics and machine learning (ML) tasks into manageable steps, generates the required SQL and Python code, and maintains awareness of your notebook environment throughout the entire process. This capability transforms hours of manual coding into minutes of guided development, which means teams can focus on gleaning insights rather than repetitive boilerplate.
Embracing open data with Apache Iceberg
One significant theme across this year’s launches was the widespread adoption of Apache Iceberg across AWS analytics, transforming how organizations manage petabyte-scale data lakes. Catalog federation to remote Iceberg catalogs through the AWS Glue Data Catalog addresses a critical challenge in modern data architectures. You can now query remote Iceberg tables, stored in Amazon Simple Storage Service (Amazon S3) and cataloged in remote Iceberg catalogs, using preferred AWS analytics services such as Amazon Redshift, Amazon EMR, Amazon Athena, AWS Glue, and Amazon SageMaker, without moving or copying tables. Metadata synchronizes in real time, providing query results that reflect the current state. Catalog federation supports both coarse-grained access control and fine-grained access permissions through AWS Lake Formation, enabling cross-account sharing and trusted identity propagation while maintaining consistent security across federated catalogs.
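In practice, a federated remote catalog appears alongside local databases and can be queried in place. The following is a minimal sketch of what such a query might look like from an Athena-style SQL engine; the catalog, database, and table names (`partner_catalog`, `sales_db`, `orders`) are illustrative assumptions, not names from the announcement.

```sql
-- "partner_catalog" stands in for a remote Iceberg catalog federated
-- through the AWS Glue Data Catalog; the table is queried in place,
-- with no data movement or copying.
SELECT region,
       SUM(amount) AS total_sales
FROM "partner_catalog"."sales_db"."orders"
WHERE order_date >= DATE '2025-01-01'
GROUP BY region;
```

Because metadata synchronizes in real time, results reflect the remote table's current state, and Lake Formation permissions apply as they would to a local table.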
Amazon Redshift now writes directly to Apache Iceberg tables, enabling true open lakehouse architectures where analytics seamlessly span data warehouses and lakes. Apache Spark on Amazon EMR 7.12, AWS Glue, Amazon SageMaker notebooks, Amazon S3 Tables, and the AWS Glue Data Catalog now support Iceberg V3 capabilities, including deletion vectors, which mark deleted rows without expensive file rewrites, dramatically reducing pipeline costs and accelerating data modifications, and row lineage, which automatically tracks every record’s history to create audit trails essential for compliance. V3 also adds table-level encryption that helps organizations meet stringent privacy regulations. These innovations mean faster writes, lower storage costs, comprehensive audit trails, and efficient incremental processing across your data architecture.
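To make the deletion-vector behavior concrete, here is a hedged Spark SQL sketch using standard Apache Iceberg table properties (`format-version` and `write.delete.mode` are documented Iceberg settings); the catalog and table names are illustrative.

```sql
-- Create an Iceberg table on format version 3; with merge-on-read
-- deletes, removed rows are recorded in deletion vectors rather than
-- by rewriting data files.
CREATE TABLE glue_catalog.sales_db.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
) USING iceberg
TBLPROPERTIES (
  'format-version'    = '3',
  'write.delete.mode' = 'merge-on-read'
);

-- This delete marks rows in a deletion vector instead of rewriting
-- the affected data files.
DELETE FROM glue_catalog.sales_db.orders WHERE customer_id = 42;
```

The same table metadata is what row lineage builds on, so each record's history remains queryable for audit purposes.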
Governance that scales with your organization
Data governance received substantial attention at re:Invent with major enhancements to Amazon SageMaker Catalog. Organizations can now curate data at the column level with custom metadata forms and rich text descriptions, indexed in real time for immediate discoverability. New metadata enforcement rules require data producers to classify assets with approved business vocabulary before publication, providing consistency across the enterprise. The catalog uses Amazon Bedrock large language models (LLMs) to automatically suggest relevant business glossary terms by analyzing table metadata and schema information, bridging the gap between technical schemas and business language. Perhaps most importantly, SageMaker Catalog now exports its entire asset metadata as queryable Apache Iceberg tables through Amazon S3 Tables. This way, teams can analyze catalog inventory with standard SQL to answer questions like “which assets lack business descriptions?” or “how many confidential datasets were registered last month?” without building custom ETL infrastructure.
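Since the exported catalog metadata lands as ordinary Iceberg tables, questions like the ones above reduce to plain SQL. A minimal sketch follows; the table and column names (`catalog_metadata.assets`, `business_description`, `classification`, `registered_at`) are assumptions for illustration, as the actual exported schema is defined by SageMaker Catalog.

```sql
-- Which assets lack business descriptions?
SELECT asset_name
FROM catalog_metadata.assets
WHERE business_description IS NULL;

-- How many confidential datasets were registered last month?
SELECT COUNT(*) AS confidential_last_month
FROM catalog_metadata.assets
WHERE classification = 'confidential'
  AND registered_at >= DATE '2025-11-01'
  AND registered_at <  DATE '2025-12-01';
```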
As organizations adopt multi-warehouse architectures to scale and isolate workloads, the new Amazon Redshift federated permissions capability eliminates governance complexity. Define data permissions one time from an Amazon Redshift warehouse, and they are automatically enforced across the warehouses in your account. Row-level, column-level, and masking controls apply consistently regardless of which warehouse queries originate from, and new warehouses automatically inherit permission policies. This horizontal scalability means organizations can add warehouses without increasing governance overhead, and analysts immediately see the databases from registered warehouses.
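As a sketch of the kind of policy that federated permissions would propagate, the following uses Amazon Redshift's existing row-level security DDL (`CREATE RLS POLICY`, `ATTACH RLS POLICY`, `ALTER TABLE ... ROW LEVEL SECURITY`); the table, column, and role names are illustrative, and the cross-warehouse enforcement is handled by the new capability rather than anything in the SQL itself.

```sql
-- Restrict a role to one region's rows; defined once, the policy is
-- enforced across registered warehouses in the account.
CREATE RLS POLICY emea_only
WITH (region VARCHAR(32))
USING (region = 'EMEA');

ATTACH RLS POLICY emea_only ON sales TO ROLE emea_analyst;

ALTER TABLE sales ROW LEVEL SECURITY ON;
```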
Accelerating AI innovation with Amazon OpenSearch Service
Amazon OpenSearch Service introduced powerful new capabilities to simplify and accelerate AI application development. With support for OpenSearch 3.3, agentic search enables precise results using natural language inputs without the need for complex queries, making it easier to build intelligent AI agents. The new Apache Calcite-powered PPL engine delivers query optimization and an extensive library of commands for more efficient data processing.
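For readers unfamiliar with PPL, it expresses searches as a pipeline of commands rather than a nested query DSL. The sketch below uses documented PPL commands (`where`, `stats`, `sort`, `head`) against an illustrative index name.

```
source = web_logs
| where status >= 500
| stats count() as errors by host
| sort - errors
| head 10
```

The Calcite-powered engine optimizes pipelines like this one, for example by pushing filters down before aggregation.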
As seen in Matt Garman’s keynote, building large-scale vector databases is now dramatically faster with GPU acceleration and auto-optimization. Previously, creating large-scale vector indexes required days of building time and weeks of manual tuning by experts, which slowed innovation and prevented cost-performance optimizations. The new serverless auto-optimize jobs automatically evaluate index configurations—including k-nearest neighbors (k-NN) algorithms, quantization, and engine settings—based on your specified search latency and recall requirements. Combined with GPU acceleration, you can build optimized indexes up to ten times faster at 25% of the indexing cost, with serverless GPUs that activate dynamically and bill only when providing speed boosts. These advancements simplify scaling AI applications such as semantic search, recommendation engines, and agentic systems, so teams can innovate faster by dramatically reducing the time and effort needed to build large-scale, optimized vector databases.
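The configuration choices that auto-optimize evaluates are the ones you would otherwise tune by hand when creating a k-NN index. A minimal baseline using the standard OpenSearch k-NN mapping is sketched below; the index name, field name, dimension, and method settings are illustrative assumptions, and auto-optimize jobs would select values like these for you based on your latency and recall targets.

```json
PUT /product-embeddings
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2"
        }
      }
    }
  }
}
```

Algorithm choice (such as HNSW parameters), quantization, and engine settings are exactly the knobs the serverless auto-optimize jobs tune, with GPU acceleration speeding up the index build itself.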
Performance and cost optimization
Also announced in the keynote, Amazon EMR Serverless now eliminates local storage provisioning for Apache Spark workloads, introducing serverless storage that reduces data processing costs by up to 20% while preventing job failures from disk capacity constraints. The fully managed, auto scaling storage encrypts data in transit and at rest with job-level isolation, allowing Spark to release workers immediately when idle rather than keeping them active to preserve temporary data. Additionally, AWS Glue introduced materialized views based on Apache Iceberg, storing precomputed query results that automatically refresh as source data changes. Spark engines across Amazon Athena, Amazon EMR, and AWS Glue intelligently rewrite queries to use these views, accelerating performance by up to eight times while reducing compute costs. The service handles refresh schedules, change detection, incremental updates, and infrastructure management automatically.
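To illustrate the materialized view pattern, here is a hedged sketch using conventional `CREATE MATERIALIZED VIEW` syntax; whether AWS Glue's Iceberg-backed views use this exact DDL is an assumption, and the schema names are illustrative.

```sql
-- Precompute a daily revenue rollup; the service keeps it fresh as
-- source data changes.
CREATE MATERIALIZED VIEW sales_db.daily_revenue AS
SELECT order_date, region, SUM(amount) AS revenue
FROM sales_db.orders
GROUP BY order_date, region;

-- A query like this can be transparently rewritten by the Spark
-- engine to read the precomputed view instead of scanning orders.
SELECT region, SUM(amount)
FROM sales_db.orders
WHERE order_date = DATE '2025-12-01'
GROUP BY region;
```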
The new Apache Spark upgrade agent for Amazon EMR transforms version upgrades from months-long projects into week-long initiatives. Using conversational interfaces, engineers express upgrade requirements in natural language while the agent automatically identifies API changes and behavioral modifications across PySpark and Scala applications. Engineers review and approve suggested changes before implementation, maintaining full control while the agent validates functional correctness through data quality checks. Currently supporting upgrades from Spark 2.4 to 3.5, this capability is available through SageMaker Unified Studio, Kiro CLI, or an integrated development environment (IDE) with Model Context Protocol compatibility.
For workflow optimization, AWS introduced a new serverless deployment option for Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which eliminates the operational overhead of managing Apache Airflow environments while optimizing costs through serverless scaling. This new offering addresses key challenges of operational scalability, cost optimization, and access management that data engineers and DevOps teams face when orchestrating workflows. With Amazon MWAA Serverless, data engineers can focus on defining their workflow logic rather than monitoring provisioned capacity. They can now submit their Airflow workflows for execution on a schedule or on demand, paying only for the actual compute time used during each task’s execution.
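Because the offering runs standard Apache Airflow, the workflow logic you submit is an ordinary DAG definition. A minimal sketch follows, using the standard Airflow DAG API; the DAG ID, task names, and callables are illustrative, and only the scheduling and execution infrastructure changes with the serverless option.

```python
# Minimal Airflow DAG sketch; names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling source data")


def load():
    print("writing to the lake")


with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # run on a schedule, or trigger on demand
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```

With the serverless option, compute is billed only for the time each task actually executes, rather than for an always-on environment.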
Looking forward
These launches collectively represent more than incremental improvements. They signal a fundamental shift in how organizations are approaching analytics. By unifying data warehousing, data lakes, and ML under a common framework built on Apache Iceberg, simplifying access through intelligent interfaces powered by AI, and maintaining robust governance that scales effortlessly, AWS is giving organizations the tools to focus on insights rather than infrastructure. The emphasis on automation, from AI-assisted development to self-managing materialized views and serverless storage, reduces operational overhead while improving performance and cost efficiency. As data volumes continue to grow and AI becomes increasingly central to business operations, these capabilities position AWS customers to accelerate their data-driven initiatives with unprecedented simplicity and power. To view the re:Invent 2025 Innovation Talk on analytics, visit Harnessing analytics for humans and AI on YouTube.