AWS Big Data Blog

Category: Analytics

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, […]

Build a pseudonymization service on AWS to protect sensitive data: Part 2

Part 1 of this two-part series described how to build a pseudonymization service that converts plain text data attributes into a pseudonym or vice versa. A centralized pseudonymization service provides a unique and universally recognized architecture for generating pseudonyms. Consequently, an organization can achieve a standard process to handle sensitive data across all platforms. Additionally, […]

Bring your workforce identity to Amazon EMR Studio and Athena

September 2025: This post was updated for accuracy. Customers today may struggle to implement proper access controls and auditing at the user level when multiple applications are involved in data access workflows. The key challenge is to implement proper least-privilege access controls based on user identity when one application accesses data on behalf of the […]

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. However, altering schema and table partitions in traditional data lakes can be a disruptive and time-consuming task, requiring renaming or recreating entire tables and reprocessing large datasets. […]

How BMO improved data security with Amazon Redshift and AWS Lake Formation

This post is cowritten with Amy Tseng, Jack Lin and Regis Chow from BMO. BMO is the 8th largest bank in North America by assets. It provides personal and commercial banking, global markets, and investment banking services to 13 million customers. As they continue to implement their Digital First strategy for speed, scale and the […]

Introducing Terraform support for Amazon OpenSearch Ingestion

Today, we are launching Terraform support for Amazon OpenSearch Ingestion. Terraform is an infrastructure as code (IaC) tool that helps you build, deploy, and manage cloud resources efficiently. OpenSearch Ingestion is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to Amazon OpenSearch Service domains and Amazon OpenSearch Serverless collections. […]

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

Amazon OpenSearch Serverless is an on-demand auto scaling configuration for Amazon OpenSearch Service. Since its release, the interest for OpenSearch Serverless had been steadily growing. Customers prefer to let the service manage its capacity automatically rather than having to manually provision capacity. Until now, customers have had to rely on using custom code or third-party […]

Empowering data-driven excellence: How the Bluestone Data Platform embraced data mesh for success

This post is co-written with Toney Thomas and Ben Vengerovsky from Bluestone. In the ever-evolving world of finance and lending, the need for real-time, reliable, and centralized data has become paramount. Bluestone, a leading financial institution, embarked on a transformative journey to modernize its data infrastructure and transition to a data-driven organization. In this post, […]

Enable advanced search capabilities for Amazon Keyspaces data by integrating with Amazon OpenSearch Service

In this post, we explore the process of integrating Amazon Keyspaces and Amazon OpenSearch Service using AWS Lambda and Amazon OpenSearch Ingestion to enable advanced search capabilities. The content includes a reference architecture, a step-by-step guide on infrastructure setup, sample code for implementing the solution within a use case, and an AWS Cloud Development Kit (AWS CDK) application for deployment.

Simplify data streaming ingestion for analytics using Amazon MSK and Amazon Redshift

Towards the end of 2022, AWS announced the general availability of real-time streaming ingestion to Amazon Redshift for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), eliminating the need to stage streaming data in Amazon Simple Storage Service (Amazon S3) before ingesting it into Amazon Redshift. Streaming ingestion from Amazon […]