AWS Big Data Blog
Category: Technical How-to
Perform data parity at scale for data modernization programs using AWS Glue Data Quality
In this post, we show you how to use AWS Glue Data Quality, a feature of AWS Glue, to establish data parity during data modernization and migration programs with minimal configuration and infrastructure setup. AWS Glue Data Quality enables you to automatically measure and monitor the quality of your data in data repositories and AWS Glue ETL pipelines.
Access private code repositories for installing Python dependencies on Amazon MWAA
This post demonstrates a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your virtual private cloud (VPC).
Enrich your serverless data lake with Amazon Bedrock
Organizations are collecting and storing vast amounts of structured and unstructured data like reports, whitepapers, and research documents. By consolidating this information, analysts can discover and integrate data from across the organization, creating valuable data products based on a unified dataset. This post shows how to integrate Amazon Bedrock with the AWS Serverless Data Analytics Pipeline architecture using Amazon EventBridge, AWS Step Functions, and AWS Lambda to automate a wide range of data enrichment tasks in a cost-effective and scalable manner.
How to track Amazon OpenSearch Service domain-level cost
Amazon OpenSearch Service pricing is based on three dimensions: instances, storage, and data transfer. Storage pricing depends on the chosen storage type and storage tier. Visibility into domain-level charges enables accurate budgeting, efficient resource allocation, fair cost attribution across projects, and overall cost transparency. In this post, we show you how to view the OpenSearch Service domain-level cost using AWS Cost Explorer.
Use Batch Processing Gateway to automate job management in multi-cluster Amazon EMR on EKS environments
AWS customers often process petabytes of data using Amazon EMR on EKS. In enterprise environments with diverse workloads or varying operational requirements, customers frequently choose a multi-cluster setup due to the following advantages: Better resiliency and no single point of failure – If one cluster fails, other clusters can continue processing critical workloads, maintaining business […]
Developer guidance on how to do local testing with Amazon MSK Serverless
In this post, I present guidance on how developers can connect to Amazon MSK Serverless from local environments. The connection is made using an Amazon MSK endpoint through an SSH tunnel and a bastion host. This enables developers to experiment and test locally, without needing to set up a separate Kafka cluster.
Integrate sparse and dense vectors to enhance knowledge retrieval in RAG using Amazon OpenSearch Service
In this post, instead of using the BM25 algorithm, we introduce sparse vector retrieval. This approach offers improved term expansion while maintaining interpretability. We walk through the steps of integrating sparse and dense vectors for knowledge retrieval using Amazon OpenSearch Service and run experiments on public datasets to show its advantages.
Integrate Tableau and Microsoft Entra ID with Amazon Redshift using AWS IAM Identity Center
This blog post provides a step-by-step guide to integrating IAM Identity Center with Microsoft Entra ID as the IdP and configuring Amazon Redshift as an AWS managed application. Additionally, you’ll learn how to set up the Amazon Redshift driver in Tableau, enabling SSO directly within Tableau Desktop.
Attribute Amazon EMR on EC2 costs to your end-users
In this post, we share a chargeback model that you can use to track and allocate the costs of Spark workloads running on Amazon EMR on EC2 clusters. We describe an approach that assigns Amazon EMR costs to different jobs, teams, or lines of business. You can use this model to distribute costs across business units and to monitor the return on investment for your Spark-based workloads.
Copy and mask PII between Amazon RDS databases using visual ETL jobs in AWS Glue Studio
In this post, I’ll walk you through how to copy data from one Amazon Relational Database Service (Amazon RDS) for PostgreSQL database to another, while scrubbing PII along the way using AWS Glue. You will learn how to prepare a multi-account environment to access the databases from AWS Glue, and how to model an ETL data flow that automatically masks PII as part of the transfer process, so that no sensitive information will be copied to the target database in its original form.









