AWS Big Data Blog

Category: How-To

Simplify data streaming ingestion for analytics using Amazon MSK and Amazon Redshift

Towards the end of 2022, AWS announced the general availability of real-time streaming ingestion to Amazon Redshift for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), eliminating the need to stage streaming data in Amazon Simple Storage Service (Amazon S3) before ingesting it into Amazon Redshift. Streaming ingestion from Amazon […]

Monitor Apache Spark applications on Amazon EMR with Amazon Cloudwatch

To improve a Spark application’s efficiency, it’s essential to monitor its performance and behavior. In this post, we demonstrate how to publish detailed Spark metrics from Amazon EMR to Amazon CloudWatch. This will give you the ability to identify bottlenecks while optimizing resource utilization.

Automate secure access to Amazon MWAA environments using existing OpenID Connect single-sign-on authentication and authorization

Customers use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run Apache Airflow at scale in the cloud. They want to use their existing login solutions developed using OpenID Connect (OIDC) providers with Amazon MWAA; this allows them to provide a uniform authentication and single sign-on (SSO) experience using their adopted identity providers (IdP) […]

Use MSK Connect for managed MirrorMaker 2 deployment with IAM authentication

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. In this post, we show how to use MSK Connect for MirrorMaker 2 deployment with AWS Identity and Access Management (IAM) authentication. We create an MSK Connect […]

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

Many organizations have data sitting in various data sources in a variety of formats. Even though data is a critical component of decision-making, for many organizations this data is spread across multiple public clouds. Organizations are looking for tools that make it easy and cost-effective to copy large datasets across cloud vendors. With Amazon EMR […]