AWS Big Data Blog

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, […]

Build a pseudonymization service on AWS to protect sensitive data: Part 2

Part 1 of this two-part series described how to build a pseudonymization service that converts plain text data attributes into a pseudonym or vice versa. A centralized pseudonymization service provides a unique and universally recognized architecture for generating pseudonyms. Consequently, an organization can achieve a standard process to handle sensitive data across all platforms. Additionally, […]

Bring your workforce identity to Amazon EMR Studio and Athena

Customers today may struggle to implement proper access controls and auditing at the user level when multiple applications are involved in data access workflows. The key challenge is to implement proper least-privilege access controls based on user identity when one application accesses data on behalf of the user in another application. It forces you to […]

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. However, altering schema and table partitions in traditional data lakes can be a disruptive and time-consuming task, requiring renaming or recreating entire tables and reprocessing large datasets. […]

How BMO improved data security with Amazon Redshift and AWS Lake Formation

This post is cowritten with Amy Tseng, Jack Lin and Regis Chow from BMO. BMO is the 8th largest bank in North America by assets. It provides personal and commercial banking, global markets, and investment banking services to 13 million customers. As they continue to implement their Digital First strategy for speed, scale and the […]

Data governance steps in data pipelines

Data governance in the age of generative AI

Data is your generative AI differentiator, and a successful generative AI implementation depends on a robust data strategy incorporating a comprehensive data governance approach. Working with large language models (LLMs) for enterprise use cases requires the implementation of quality and privacy considerations to drive responsible AI. However, enterprise data generated from siloed sources combined with […]

Introducing Terraform support for Amazon OpenSearch Ingestion

Today, we are launching Terraform support for Amazon OpenSearch Ingestion. Terraform is an infrastructure as code (IaC) tool that helps you build, deploy, and manage cloud resources efficiently. OpenSearch Ingestion is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to Amazon OpenSearch Service domains and Amazon OpenSearch Serverless collections. […]

Introducing Amazon MWAA support for Apache Airflow version 2.8.1

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it straightforward to set up and operate end-to-end data pipelines in the cloud. Organizations use Amazon MWAA to enhance their business workflows. For example, C2i Genomics uses Amazon MWAA in their data platform to orchestrate the validation […]

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

Amazon OpenSearch Serverless is an on-demand auto scaling configuration for Amazon OpenSearch Service. Since its release, the interest for OpenSearch Serverless had been steadily growing. Customers prefer to let the service manage its capacity automatically rather than having to manually provision capacity. Until now, customers have had to rely on using custom code or third-party […]

Empowering data-driven excellence: How the Bluestone Data Platform embraced data mesh for success

This post is co-written with Toney Thomas and Ben Vengerovsky from Bluestone. In the ever-evolving world of finance and lending, the need for real-time, reliable, and centralized data has become paramount. Bluestone, a leading financial institution, embarked on a transformative journey to modernize its data infrastructure and transition to a data-driven organization. In this post, […]