AWS Big Data Blog

Optimize write throughput for Amazon Kinesis Data Streams

Many customers use Amazon Kinesis Data Streams to capture, process, and store data streams at any scale. This scale is enabled by dividing each data stream into multiple shards. Each shard in a stream has a write throughput limit of 1 MB per second or 1,000 records per second. Whether your data streaming […]
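As a rough illustration of how the per-shard write limits translate into a shard count, here is a minimal sketch. It assumes a per-shard write limit of 1 MB per second (megabytes, as in the Kinesis Data Streams quotas) and 1,000 records per second, and it ignores uneven partition-key distribution. The function name `shards_needed` and its parameters are hypothetical, not part of any AWS SDK.

```python
import math

# Assumed per-shard write limits (not read from any AWS API).
SHARD_MB_PER_SEC = 1.0
SHARD_RECORDS_PER_SEC = 1000


def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Lower bound on shards for a given aggregate write load.

    Both limits apply at the same time, so the stricter one decides.
    Assumes records are spread evenly across shards.
    """
    by_bytes = math.ceil(mb_per_sec / SHARD_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    return max(by_bytes, by_records, 1)


if __name__ == "__main__":
    # Byte-bound workload: few large records.
    print(shards_needed(mb_per_sec=5.5, records_per_sec=2000))
    # Record-bound workload: many small records.
    print(shards_needed(mb_per_sec=0.5, records_per_sec=4500))
```

In practice a hot partition key can throttle a single shard well before the aggregate numbers suggest, so this figure is only a starting point.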

Integrate Tableau and Okta with Amazon Redshift using AWS IAM Identity Center

This blog post is co-written with Sid Wray and Jake Koskela from Salesforce, and Adiascar Cisneros from Tableau. Amazon Redshift is a fast, scalable cloud data warehouse built to serve workloads at any scale. With Amazon Redshift as your data warehouse, you can run complex queries using sophisticated query optimization to quickly deliver results to […]

Implement a full stack serverless search application using AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon OpenSearch Serverless

Designing a full stack search application requires addressing numerous challenges to provide a smooth and effective user experience. This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability. Amazon OpenSearch Serverless […]

AWS named a Leader in IDC MarketScape: Worldwide Analytic Stream Processing Software 2024 Vendor Assessment

We’re thrilled to announce that AWS has been named a Leader in the IDC MarketScape: Worldwide Analytic Stream Processing Software 2024 Vendor Assessment (doc #US51053123, March 2024). We believe this recognition validates the power and performance of Apache Flink for real-time data processing, and how AWS is leading the way to help customers build and […]

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands […]

Introducing support for Apache Kafka on Raft mode (KRaft) with Amazon MSK clusters

Organizations are adopting Apache Kafka and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to capture and analyze data in real time. Amazon MSK helps you build and run production applications on Apache Kafka without needing Kafka infrastructure management expertise or having to deal with the complex overhead associated with setting up and running Apache […]

Simplify data lake access control for your enterprise users with trusted identity propagation in AWS IAM Identity Center, AWS Lake Formation, and Amazon S3 Access Grants

Many organizations use external identity providers (IdPs) such as Okta or Microsoft Azure Active Directory to manage their enterprise user identities. These users interact with and run analytical queries across AWS analytics services. To enable them to use the AWS services, their identities from the external IdP are mapped to AWS Identity and Access Management […]

Introducing Amazon EMR on EKS with Apache Flink: A scalable, reliable, and efficient data processing platform

AWS recently announced that Apache Flink is generally available for Amazon EMR on Amazon Elastic Kubernetes Service (EKS). Apache Flink is a scalable, reliable, and efficient data processing framework that handles real-time streaming and batch workloads (but is most commonly used for real-time streaming). Amazon EMR on EKS is a deployment option for Amazon EMR […]

Build a decentralized semantic search engine on heterogeneous data stores using autonomous agents

In this post, we show how to build a Q&A bot with Retrieval Augmented Generation (RAG). RAG uses data sources like Amazon Redshift and Amazon OpenSearch Service to retrieve documents that augment the LLM prompt. To get data from Amazon Redshift, we use Anthropic Claude 2.0 on Amazon Bedrock, summarizing the final response based on predefined prompt template libraries from LangChain. To get data from Amazon OpenSearch Service, we chunk the source data and convert the chunks to vectors using the Amazon Titan Text Embeddings model.
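The chunk-then-embed step mentioned above might look roughly like the following sketch. It is not the post's implementation: the fixed-size character chunking, the overlap parameter, and the helper names `chunk_text` and `embed_chunks` are all assumptions, and the `embed` callable is a placeholder standing in for a real embedding model call such as Amazon Titan Text Embeddings.

```python
from typing import Callable, List, Tuple


def chunk_text(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into fixed-size character chunks with optional overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be > 0 and 0 <= overlap < size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk reached the end of the text
        start += size - overlap
    return chunks


def embed_chunks(
    chunks: List[str],
    embed: Callable[[str], List[float]],
) -> List[Tuple[str, List[float]]]:
    """Pair each chunk with its vector; `embed` is supplied by the caller."""
    return [(chunk, embed(chunk)) for chunk in chunks]


if __name__ == "__main__":
    # Toy embedding for demonstration only: a one-dimensional length vector.
    toy_embed = lambda chunk: [float(len(chunk))]
    for chunk, vector in embed_chunks(chunk_text("abcdefghij", 4, 1), toy_embed):
        print(chunk, vector)
```

Passing the embedding function in as an argument keeps the chunking logic testable without network access; a real pipeline would swap in a call to the embedding model and write the resulting vectors to the search index.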

Architectural Patterns for Real Time Analytics using Amazon Kinesis Data Streams, Part 2 – AI Applications

Welcome back to our exciting exploration of architectural patterns for real-time analytics with Amazon Kinesis Data Streams! In this fast-paced world, Kinesis Data Streams stands out as a versatile and robust solution to tackle a wide range of use cases with real-time data, from dashboarding to powering artificial intelligence (AI) applications. In this series, we […]