AWS Big Data Blog
Category: Compute
Use Batch Processing Gateway to automate job management in multi-cluster Amazon EMR on EKS environments
AWS customers often process petabytes of data using Amazon EMR on EKS. In enterprise environments with diverse workloads or varying operational requirements, customers frequently choose a multi-cluster setup due to the following advantages: Better resiliency and no single point of failure – If one cluster fails, other clusters can continue processing critical workloads, maintaining business […]
How ZS built a clinical knowledge repository for semantic search using Amazon OpenSearch Service and Amazon Neptune
In this blog post, we will highlight how ZS Associates used multiple AWS services to build a highly scalable, highly performant, clinical document search platform. This platform is an advanced information retrieval system engineered to assist healthcare professionals and researchers in navigating vast repositories of medical documents, medical literature, research articles, clinical guidelines, protocol documents, […]
Build a serverless data quality pipeline using Deequ on AWS Lambda
Poor data quality can lead to a variety of problems, including pipeline failures, incorrect reporting, and poor business decisions. For example, if data ingested from one of the systems contains a high number of duplicates, it can result in skewed data in the reporting system. To prevent such issues, data quality checks are integrated into […]
Stream data to Amazon S3 for real-time analytics using the Oracle GoldenGate S3 handler
Modern business applications rely on timely and accurate data with increasing demand for real-time analytics. There is a growing need for efficient and scalable data storage solutions. Data at times is stored in different datasets and needs to be consolidated before meaningful and complete insights can be drawn from the datasets. This is where replication […]
Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch
In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. Data lakes provide a unified repository for organizations to store and use […]
Disaster recovery strategies for Amazon MWAA – Part 2
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed orchestration service that makes it straightforward to run data processing workflows at scale. Amazon MWAA takes care of operating and scaling Apache Airflow so you can focus on developing workflows. However, although Amazon MWAA provides high availability within an AWS Region through features […]
Implement a full stack serverless search application using AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon OpenSearch Serverless
Designing a full stack search application requires addressing numerous challenges to provide a smooth and effective user experience. This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability. Amazon OpenSearch Serverless […]
Introducing Amazon EMR on EKS with Apache Flink: A scalable, reliable, and efficient data processing platform
AWS recently announced that Apache Flink is generally available for Amazon EMR on Amazon Elastic Kubernetes Service (EKS). Apache Flink is a scalable, reliable, and efficient data processing framework that handles real-time streaming and batch workloads (but is most commonly used for real-time streaming). Amazon EMR on EKS is a deployment option for Amazon EMR […]
Enable advanced search capabilities for Amazon Keyspaces data by integrating with Amazon OpenSearch Service
In this post, we explore the process of integrating Amazon Keyspaces and Amazon OpenSearch Service using AWS Lambda and Amazon OpenSearch Ingestion to enable advanced search capabilities. The content includes a reference architecture, a step-by-step guide on infrastructure setup, sample code for implementing the solution within a use case, and an AWS Cloud Development Kit (AWS CDK) application for deployment.
Disaster recovery strategies for Amazon MWAA – Part 1
In the dynamic world of cloud computing, ensuring the resilience and availability of critical applications is paramount. Disaster recovery (DR) is the process by which an organization anticipates and addresses technology-related disasters. For organizations implementing critical workload orchestration using Amazon Managed Workflows for Apache Airflow (Amazon MWAA), it is crucial to have a DR plan […]