AWS Big Data Blog

Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog

This post demonstrates how to implement reliable concurrent write handling in Iceberg tables. We explore Iceberg’s concurrency model, examine common conflict scenarios, and provide practical implementation patterns for building resilient data pipelines, covering both automatic retry mechanisms and situations that require custom conflict resolution logic. We also cover the pattern with automatic compaction through AWS Glue Data Catalog table optimization.
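As a rough illustration of the automatic retry idea mentioned above, the following minimal sketch retries a commit with exponential backoff and jitter. It does not use any Iceberg or AWS Glue API: `CommitConflict`, `commit_with_retry`, and the `commit` callable are hypothetical names introduced here, and the `commit` callable is assumed to refresh table state and reapply the write on every call.

```python
import random
import time


class CommitConflict(Exception):
    """Hypothetical stand-in for a failed commit caused by a concurrent writer."""


def commit_with_retry(commit, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `commit` until it succeeds or `max_attempts` is reached.

    `commit` is an assumed zero-argument callable that re-reads the latest
    table state and attempts the write again each time it is invoked.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return commit()
        except CommitConflict:
            if attempt == max_attempts:
                # Out of retries: surface the conflict so the caller can
                # apply custom conflict resolution logic.
                raise
            # Exponential backoff with jitter so that competing writers
            # do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_commit():
        # Simulated writer that loses the race twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise CommitConflict("table metadata changed")
        return "committed"

    print(commit_with_retry(flaky_commit, base_delay=0.01))
```

When the retries are exhausted, the exception propagates to the caller, which is the point at which a pipeline would typically fall back to its own conflict resolution.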

Enhance Agentforce data security with Private Connect for Salesforce Data Cloud and Amazon Redshift – Part 3

In this post, we discuss how to create AWS endpoint services to improve data security with Private Connect for Salesforce Data Cloud.

Optimize multimodal search using the TwelveLabs Embed API and Amazon OpenSearch Service

In this blog post, we show you how to integrate the TwelveLabs Embed API with OpenSearch Service to create a multimodal search solution. You’ll learn how to generate rich, contextual embeddings from video content and use OpenSearch Service’s vector database capabilities to search that content. By the end of this post, you’ll be equipped with the knowledge to implement a system that can transform the way your organization handles and extracts value from video content.
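To give a sense of what searching over embeddings means, here is a minimal sketch of nearest-neighbor lookup by cosine similarity in plain Python. It does not call the TwelveLabs Embed API or OpenSearch Service; the function names, the clip identifiers, and the two-dimensional vectors are made up for illustration, and a real vector database would use an approximate index rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, index, k=2):
    """Return the ids of the k vectors in `index` most similar to `query`.

    `index` is an assumed in-memory dict of id -> embedding vector.
    """
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]


if __name__ == "__main__":
    # Toy embeddings; real video embeddings have many more dimensions.
    index = {
        "clip-a": [1.0, 0.0],
        "clip-b": [0.0, 1.0],
        "clip-c": [0.9, 0.1],
    }
    print(top_k([1.0, 0.0], index))
```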

Correlate telemetry data with Amazon OpenSearch Service and Amazon Managed Grafana

In this post, we show you how to use Amazon OpenSearch Service and Amazon Managed Grafana to correlate observability signals, which improves root cause analysis and reduces Mean Time to Resolution (MTTR). We also provide a reference solution that can be used at scale for proactive monitoring of enterprise applications to address problems before they occur.

Enhance governance with metadata enforcement rules in Amazon SageMaker

Amazon SageMaker Catalog now supports metadata rules, which allow organizations to enforce metadata standards across data publishing and subscription workflows. In this post, we guide you through two workflows: setting up metadata enforcement rules for a specific domain and publishing an asset or data product in a catalog, and setting up metadata enforcement rules for a specific domain and subscribing to an asset or data product that is owned by a project within that domain.

Build multi-Region resilient Apache Kafka applications with identical topic names using Amazon MSK and Amazon MSK Replicator

This post explains how to use MSK Replicator for cross-cluster data replication and details the failover and failback processes while keeping the same topic name across Regions.

Using Amazon S3 Tables with Amazon Redshift to query Apache Iceberg tables

In this post, we demonstrate how to get started with S3 Tables and Amazon Redshift Serverless for querying data in Iceberg tables. We show how to set up S3 Tables, load data, register them in the unified data lake catalog, set up basic access controls in SageMaker Lakehouse through AWS Lake Formation, and query the data using Amazon Redshift.

Connect, share, and query where your data sits using Amazon SageMaker Unified Studio

In this blog post, we demonstrate how business units can use Amazon SageMaker Unified Studio to discover, subscribe to, and analyze distributed data assets. Through this unified query capability, you can create comprehensive insights into customer transaction patterns and purchase behavior for active products without the traditional barriers of data silos or the need to copy data between systems.

Introducing vector search with UltraWarm in Amazon OpenSearch Service

Amazon OpenSearch Service offers its customers a multi-tiered storage solution in the form of UltraWarm and Cold tiers. In this post, we discuss vector search with UltraWarm and its use cases, and provide a cost-benefit analysis for different scenarios.

Build a data lakehouse in a hybrid environment using Amazon EMR Serverless, Apache DolphinScheduler, and TiDB

This post discusses a decoupled approach to building a serverless data lakehouse using AWS Cloud-centered services, including Amazon EMR Serverless, Amazon Athena, and Amazon Simple Storage Service (Amazon S3), together with Apache DolphinScheduler (an open source data job scheduler) and PingCAP TiDB, a third-party data warehouse product that can be deployed on premises, in the cloud, or as software as a service (SaaS).