AWS Big Data Blog

Category: Analytics

Stream change data to Amazon Kinesis Data Streams with AWS DMS

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. In this post, we discuss how to use AWS Database Migration Service (AWS DMS) native change data capture (CDC) capabilities to stream changes into Amazon Kinesis Data […]
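
As a rough sketch of the kind of configuration the post covers (not the post's own code), a Kinesis data stream can be registered as an AWS DMS target endpoint through the DMS API; every identifier and ARN below is a hypothetical placeholder:

import boto3

# Create a Kinesis target endpoint for an AWS DMS CDC task.
# All identifiers and ARNs are hypothetical placeholders.
dms = boto3.client("dms")

endpoint = dms.create_endpoint(
    EndpointIdentifier="orders-cdc-kinesis-target",
    EndpointType="target",
    EngineName="kinesis",
    KinesisSettings={
        "StreamArn": "arn:aws:kinesis:us-east-1:123456789012:stream/orders-cdc",
        "MessageFormat": "json",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-kinesis-access",
        "IncludeTableAlterOperations": True,  # also emit DDL change events
    },
)
print(endpoint["Endpoint"]["EndpointArn"])

A replication task with migration type cdc can then use this endpoint as its target so that ongoing changes land in the stream as JSON records.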

Create cross-account, custom Amazon Managed Grafana dashboards for Amazon Redshift

Amazon Managed Grafana recently announced a new data source plugin for Amazon Redshift, enabling you to query, visualize, and alert on your Amazon Redshift data from Amazon Managed Grafana workspaces. With the new Amazon Redshift data source, you can now create dashboards and alerts in your Amazon Managed Grafana workspaces to analyze your structured and […]

Use AWS Glue to read and write Apache Iceberg tables with ACID transactions and perform time travel

September 2023: This post was reviewed and updated for accuracy. Many customers have built data lakes as the core of their data analytics systems. In a typical data lake use case, many concurrent queries run to retrieve consistent snapshots of business insights by aggregating query results. A large volume of data constantly […]
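
As a minimal illustration of the kind of time travel read the post walks through (assuming the Iceberg runtime is already on the Spark classpath, as it is in a Glue 4.0 job with the Iceberg data lake format enabled), the catalog, bucket, table, and timestamp below are hypothetical placeholders rather than the post's values:

from pyspark.sql import SparkSession

# Spark session configured for Iceberg tables registered in the Glue Data Catalog.
# The warehouse bucket, database, and table names are hypothetical.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

# Read the current snapshot of the table
spark.table("glue_catalog.analytics_db.orders").show()

# Time travel: read the table as of an earlier point in time (epoch milliseconds)
(spark.read
      .option("as-of-timestamp", "1690000000000")
      .format("iceberg")
      .load("glue_catalog.analytics_db.orders")
      .show())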

Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and AWS Glue

March 2024: This post was reviewed and updated for accuracy. Most businesses store their critical data in a data lake, where you can bring data from various sources into centralized storage. The data is processed by specialized big data compute engines, such as Amazon Athena for interactive queries, Amazon EMR for Apache Spark applications, […]

Resize Amazon Redshift from DC2 to RA3 with minimal or no downtime

Amazon Redshift is a popular cloud data warehouse that allows you to process exabytes of data across your data warehouse, operational database, and data lake using standard SQL. Amazon Redshift offers different node types like DC2 (dense compute) and RA3, which you can use for your different workloads and use cases. For more information about […]
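
As a hedged sketch of the resize operation the post discusses, an elastic resize from DC2 to RA3 can be triggered through the Redshift API; the cluster identifier, target node type, and node count below are hypothetical:

import boto3

# Elastic resize of a DC2 cluster to RA3 nodes; identifiers are hypothetical.
redshift = boto3.client("redshift")

redshift.resize_cluster(
    ClusterIdentifier="analytics-dc2",
    NodeType="ra3.4xlarge",
    NumberOfNodes=4,
    Classic=False,  # elastic resize keeps the cluster endpoint and minimizes downtime
)

# Block until the cluster is available on the new node type
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="analytics-dc2")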

Implement a CDC-based UPSERT in a data lake using Apache Iceberg and AWS Glue

May 2023: This post was reviewed and updated with code to read and write data to an Iceberg table using the native Iceberg connector, in the Appendix section. As the implementation of data lakes and modern data architecture increases, customers’ expectations around their features also increase, which include ACID transactions, UPSERT, time travel, schema evolution, auto compaction, […]
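
To illustrate the kind of CDC-based UPSERT the post implements, here is a minimal Iceberg MERGE INTO sketch from a PySpark session; the table, columns, and op flag are hypothetical placeholders, not the post's code:

from pyspark.sql import SparkSession

# Assumes a Spark session already configured for Iceberg and the Glue Data Catalog
# (as in the earlier sketch); table and column names are hypothetical.
spark = SparkSession.builder.getOrCreate()

changes = spark.createDataFrame(
    [(101, "alice@example.com", "U"), (102, None, "D"), (103, "cara@example.com", "I")],
    ["customer_id", "email", "op"],  # op: I = insert, U = update, D = delete
)
changes.createOrReplaceTempView("customer_changes")

spark.sql("""
    MERGE INTO glue_catalog.lakehouse.customers AS t
    USING customer_changes AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email)
""")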

How GE Proficy Manufacturing Data Cloud replatformed to improve TCO, data SLA, and performance

This post is co-authored by Jyothin Madari, Madhusudhan Muppagowni, and Ayush Srivastava from GE. GE Proficy Manufacturing Data Cloud (MDC), part of GE Digital’s Manufacturing Execution Systems (MES) suite of solutions, allows GED’s customers to quickly and easily increase the value derived from the MES by reliably bringing enterprise-wide manufacturing data into the […]

Optimize Federated Query Performance using EXPLAIN and EXPLAIN ANALYZE in Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. In 2019, Athena added support for federated queries to run SQL […]
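
As a small sketch of how EXPLAIN ANALYZE can be submitted programmatically (the database, output location, and query below are hypothetical), the execution plan and per-stage runtime statistics come back as ordinary Athena query results:

import time
import boto3

# Submit EXPLAIN ANALYZE through the Athena API; names and locations are hypothetical.
athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString=(
        "EXPLAIN ANALYZE "
        "SELECT c.customer_name, sum(o.amount) AS total "
        "FROM orders o JOIN customers c ON o.customer_id = c.customer_id "
        "GROUP BY c.customer_name"
    ),
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print the plan and per-stage statistics returned as result rows
for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print(row["Data"][0].get("VarCharValue", ""))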

Use an AD FS user and Tableau to securely query data in AWS Lake Formation

Security-conscious customers often adopt a Zero Trust security architecture. Zero Trust is a security model centered on the idea that access to data shouldn’t be based solely on network location; rather, it requires users and systems to prove their identities and trustworthiness, and it enforces fine-grained, identity-based authorization rules before granting access to applications, data, and […]

Simplify and optimize Python package management for AWS Glue PySpark jobs with AWS CodeArtifact

Data engineers use various Python packages to meet their data processing requirements while building data pipelines with AWS Glue PySpark jobs. Languages like Python and Scala are commonly used in data pipeline development. Developers can take advantage of open-source packages or even build their own to make it easier and faster to perform use […]
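
As a rough sketch of wiring a Glue PySpark job to a private CodeArtifact repository (the domain, repository, job name, role, and script location below are hypothetical), the job's --additional-python-modules and --python-modules-installer-option arguments can point pip at an authenticated CodeArtifact index URL:

import boto3

# Build an authenticated pip index URL for a CodeArtifact repository that mirrors PyPI,
# then point a Glue PySpark job at it. All names and ARNs are hypothetical.
codeartifact = boto3.client("codeartifact")
glue = boto3.client("glue")

token = codeartifact.get_authorization_token(domain="my-domain")["authorizationToken"]
endpoint = codeartifact.get_repository_endpoint(
    domain="my-domain", repository="pypi-store", format="pypi"
)["repositoryEndpoint"]
index_url = endpoint.replace("https://", f"https://aws:{token}@") + "simple/"

glue.update_job(
    JobName="sales-etl",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://my-bucket/etl.py"},
        "DefaultArguments": {
            "--additional-python-modules": "pandas==2.0.3,pyarrow",
            "--python-modules-installer-option": f"--index-url {index_url}",
        },
    },
)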