AWS Big Data Blog

Build and optimize a real-time stream processing pipeline with Amazon Kinesis Data Analytics for Apache Flink, Part 2

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. In Part 1 of this series, you learned how to calibrate an Amazon Kinesis Data Streams stream and an Apache Flink application deployed in Amazon Kinesis Data Analytics for […]

Build and optimize a real-time stream processing pipeline with Amazon Kinesis Data Analytics for Apache Flink, Part 1

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. In real-time stream processing, it becomes critical to collect, process, and analyze high-velocity real-time data to provide timely insights and react quickly to new information. Streaming data […]

Data preparation using Amazon Redshift with AWS Glue DataBrew

July 2023: This post was reviewed for accuracy. With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases. You can choose from over […]

Build a real-time streaming analytics pipeline with the AWS CDK

A recurring business need is the ability to capture data in near-real time and act on significant events close to the moment they happen. For example, you may want to tap into a data stream and monitor for anomalies that need to be addressed immediately rather than during a nightly batch. Building these […]
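
As a rough illustration of the approach, the sketch below uses the AWS CDK v2 Python bindings to define a minimal stack with a Kinesis data stream and an S3 bucket for downstream output. This is a hedged sketch, not the pipeline from the post: the stack and construct names (StreamingAnalyticsStack, InputStream, AnalyticsBucket) and the single-shard sizing are placeholders.

# Minimal AWS CDK v2 (Python) sketch: a Kinesis data stream plus an S3 bucket
# that downstream processing could write results to. Names are placeholders.
from aws_cdk import App, Stack, aws_kinesis as kinesis, aws_s3 as s3
from constructs import Construct

class StreamingAnalyticsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Ingest stream for near-real-time events
        kinesis.Stream(self, "InputStream", shard_count=1)

        # Bucket for processed or archived output
        s3.Bucket(self, "AnalyticsBucket")

app = App()
StreamingAnalyticsStack(app, "StreamingAnalyticsStack")
app.synth()

Running cdk deploy against this app would create both resources; the full post builds out additional pipeline stages beyond this skeleton.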

Effective data lakes using AWS Lake Formation, Part 1: Implementing cell-level and row-level security

July 2023: This post was reviewed for accuracy. We announced the general availability of AWS Lake Formation transactions, cell-level and row-level security, and acceleration at AWS re:Invent 2021. In this post, we focus on cell-level and row-level security and show you how to enforce business needs by restricting access to specific rows. Effective data […]
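
To give a sense of how row-level restrictions can be expressed programmatically, here is a minimal sketch using the boto3 Lake Formation client to create a data cells filter. The account ID, database, table, filter name, filter expression, and column list are hypothetical and are not taken from the post.

# Sketch: define a row-level (and column-level) data cells filter in
# AWS Lake Formation. All names and the filter expression are placeholders.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",  # AWS account that owns the Glue table
        "DatabaseName": "sales_db",
        "TableName": "orders",
        "Name": "us_rows_only",
        # Row-level security: only rows matching this expression are visible
        "RowFilter": {"FilterExpression": "country = 'US'"},
        # Cell-level security: restrict which columns the filter exposes
        "ColumnNames": ["order_id", "order_date", "amount"],
    }
)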

Work with semistructured data using Amazon Redshift SUPER

With the new SUPER data type and the PartiQL language, Amazon Redshift expands data warehouse capabilities to natively ingest, store, transform, and analyze semi-structured data. Semi-structured data (such as weblogs and sensor data) falls under the category of data that doesn’t conform to the rigid schema expected in relational databases. It often contains complex values […]
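
As a quick illustration, the sketch below uses the open-source redshift_connector Python driver to create a table with a SUPER column, load a JSON document with JSON_PARSE, and navigate it with PartiQL-style dot notation. The connection details, table name, and sample payload are placeholders, not values from the post.

# Sketch: store and query semi-structured JSON in Amazon Redshift using the
# SUPER data type and PartiQL navigation. Connection values are placeholders.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="example-password",
)
cur = conn.cursor()

# A SUPER column holds arbitrarily nested JSON alongside relational columns
cur.execute("CREATE TABLE IF NOT EXISTS sensor_events (device_id VARCHAR(32), payload SUPER)")

# JSON_PARSE converts a JSON string into a SUPER value at insert time
cur.execute("""
    INSERT INTO sensor_events
    VALUES ('device-1', JSON_PARSE('{"temp": 21.5, "location": {"room": "lab"}}'))
""")

# PartiQL dot notation navigates into the nested document
cur.execute("SELECT payload.location.room, payload.temp FROM sensor_events WHERE payload.temp > 20")
print(cur.fetchall())

conn.commit()
conn.close()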

Increase Amazon Elasticsearch Service performance by upgrading to Graviton2

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. Amazon OpenSearch Service supports multiple instance types based on your use case. In 2021, AWS announced general-purpose (M6g), compute-optimized (C6g), and memory-optimized (R6g, R6gd) instance types for Amazon OpenSearch Service version 7.9 or later powered by AWS […]
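
If you prefer to script the instance-type change rather than use the console, the hedged sketch below uses the boto3 OpenSearch Service client to move a domain's data nodes to a Graviton2-based instance type. The domain name, instance type, and node count are illustrative assumptions, not values from the post.

# Sketch: switch an Amazon OpenSearch Service domain's data nodes to a
# Graviton2-based instance type via boto3. Domain name and sizing are placeholders.
import boto3

opensearch = boto3.client("opensearch")

opensearch.update_domain_config(
    DomainName="my-search-domain",
    ClusterConfig={
        "InstanceType": "r6g.large.search",  # Graviton2 memory-optimized data nodes
        "InstanceCount": 3,
    },
)

# The configuration change is applied in the background; poll the domain
# until it is no longer processing before relying on the new instance type.
status = opensearch.describe_domain(DomainName="my-search-domain")
print(status["DomainStatus"]["Processing"])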

Design patterns for an enterprise data lake using AWS Lake Formation cross-account access

In this post, we briefly walk through the most common design patterns that enterprises adopt to build lake house solutions that support business agility in a multi-tenant model. These patterns use the AWS Lake Formation cross-account feature to enable a multi-account strategy in which line of business (LOB) accounts produce and consume data from your data […]
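
To give a flavor of the cross-account mechanics, here is a minimal boto3 sketch in which a producer account grants a consumer (LOB) account SELECT on a Glue Data Catalog table through Lake Formation. The account IDs, database, and table names are hypothetical; the post's actual patterns layer more structure on top of grants like this.

# Sketch: producer account shares a Glue Data Catalog table with a consumer
# (LOB) account through AWS Lake Formation permissions. IDs and names are placeholders.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    # The consumer (LOB) account that will query the shared table
    Principal={"DataLakePrincipalIdentifier": "444455556666"},
    Resource={
        "Table": {
            "CatalogId": "111122223333",  # producer account that owns the catalog
            "DatabaseName": "central_db",
            "Name": "customer_events",
        }
    },
    Permissions=["SELECT"],
    # Allow the consumer account to re-grant access to its own principals
    PermissionsWithGrantOption=["SELECT"],
)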

Streaming Amazon DynamoDB data into a centralized data lake

February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more. August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. For organizations moving towards […]

Increase Apache Kafka’s resiliency with a multi-Region deployment and MirrorMaker 2

Customers create business continuity plans and disaster recovery (DR) strategies to maximize resiliency for their applications, because downtime or data loss can result in losing revenue or halting operations. Ultimately, DR planning is all about enabling the business to continue running despite a Regional outage. This post explains how to make Apache Kafka resilient to […]