AWS Big Data Blog

Category: Advanced (300)

Multicloud data lake analytics with Amazon Athena

Many organizations operate data lakes spanning multiple cloud data stores. This could be for various reasons, such as business expansions, mergers, or specific cloud provider preferences for different business units. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics […]

Build a RAG data ingestion pipeline for large-scale ML workloads

For building any generative AI application, enriching the large language models (LLMs) with new data is imperative. This is where the Retrieval Augmented Generation (RAG) technique comes in. RAG is a machine learning (ML) architecture that uses external documents (like Wikipedia) to augment its knowledge and achieve state-of-the-art results on knowledge-intensive tasks. For ingesting these […]

How the GoDaddy data platform achieved over 60% cost reduction and 50% performance boost by adopting Amazon EMR Serverless

This is a guest post co-written with Brandon Abear, Dinesh Sharma, John Bush, and Ozcan IIikhan from GoDaddy. GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, […]

Real-time cost savings for Amazon Managed Service for Apache Flink

When running Apache Flink applications on Amazon Managed Service for Apache Flink, you have the unique benefit of taking advantage of its serverless nature. This means that cost-optimization exercises can happen at any time—they no longer need to happen in the planning phase. With Managed Service for Apache Flink, you can add and remove compute […]

In-stream anomaly detection with Amazon OpenSearch Ingestion and Amazon OpenSearch Serverless

Unsupervised machine learning analytics has emerged as a powerful tool for anomaly detection in today’s data-rich landscape, especially with the growing volume of machine-generated data. In-stream anomaly detection offers real-time insights into data anomalies, enabling proactive response. Amazon OpenSearch Serverless focuses on delivering seamless scalability and management of search workloads; Amazon OpenSearch Ingestion complements this […]

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, […]

Bring your workforce identity to Amazon EMR Studio and Athena

Customers today may struggle to implement proper access controls and auditing at the user level when multiple applications are involved in data access workflows. The key challenge is to implement proper least-privilege access controls based on user identity when one application accesses data on behalf of the user in another application. It forces you to […]

Enable advanced search capabilities for Amazon Keyspaces data by integrating with Amazon OpenSearch Service

In this post, we explore the process of integrating Amazon Keyspaces and Amazon OpenSearch Service using AWS Lambda and Amazon OpenSearch Ingestion to enable advanced search capabilities. The content includes a reference architecture, a step-by-step guide on infrastructure setup, sample code for implementing the solution within a use case, and an AWS Cloud Development Kit (AWS CDK) application for deployment.

Simplify data streaming ingestion for analytics using Amazon MSK and Amazon Redshift

Towards the end of 2022, AWS announced the general availability of real-time streaming ingestion to Amazon Redshift for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), eliminating the need to stage streaming data in Amazon Simple Storage Service (Amazon S3) before ingesting it into Amazon Redshift. Streaming ingestion from Amazon […]

Combine AWS Glue and Amazon MWAA to build advanced VPC selection and failover strategies

AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue customers often have to meet strict security requirements, which sometimes involve locking down the network connectivity allowed to the job, or running inside […]