AWS Big Data Blog

Author: Noritaka Sekiyama

Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Synchronize your AWS Glue Studio Visual Jobs to different environments

June 2023: This post was reviewed and updated for accuracy. AWS Glue has become a popular option for integrating data from disparate data sources due to its ability to integrate large volumes of data using distributed data processing frameworks. Many customers use AWS Glue to build data lakes and data warehouses. Data engineers who prefer […]

Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark

October 2024: This post has been updated to cover interactive sessions support for AWS Glue Auto Scaling. June 2023: This post was reviewed and updated for accuracy. The volume of data created in the cloud has been growing rapidly, so scalability is a key factor in distributed data processing. Many customers benefit from the scalability of […]

Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container

Mar 2025: This post was written for AWS Glue 3.0 and 4.0. For AWS Glue 5.0, visit Develop and test AWS Glue 5.0 jobs locally using a Docker container. Apr 2023: This post was reviewed and updated with enhanced support for Glue 4.0 Streaming jobs. Jan 2023: This post was reviewed and updated with enhanced […]

Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3

June 2024: This post was reviewed for accuracy and updated to cover Apache Iceberg. June 2023: This post was reviewed and updated for accuracy. Customers are increasingly building data lakes to store data at massive scale in the cloud. It’s common to use distributed computing engines, cloud-native databases, and data warehouses when you want to […]

Effective data lakes using AWS Lake Formation, Part 2: Creating a governed table for streaming data sources

February 2023: The content of this blog post can now be found in the AWS Lake Formation public documentation. Please refer to it instead. We announced the general availability of AWS Lake Formation transactions, row-level security, and acceleration at AWS re:Invent 2021. In Part 1 of this series, we explained how to set up a […]

Improve Amazon Athena query performance using AWS Glue Data Catalog partition indexes

The AWS Glue Data Catalog provides partition indexes to accelerate queries on highly partitioned tables. In the post Improve query performance using AWS Glue partition indexes, we demonstrated how partition indexes reduce the time it takes to fetch partition information during the planning phase of queries run on Amazon EMR, Amazon Redshift Spectrum, and AWS […]

Stream data from relational databases to Amazon Redshift with upserts using AWS Glue streaming jobs

Traditionally, read replicas of relational databases are often used as a data source for non-online transactions of web applications such as reporting, business analysis, ad hoc queries, operational excellence, and customer services. Due to the exponential growth of data volume, it became common practice to replace such read replicas with data warehouses or data lakes […]

Simplify data integration pipeline development using AWS Glue custom blueprints

June 2023: This post was reviewed and updated for accuracy. August 2021: AWS Glue custom blueprints are now generally available. Please visit https://docs.aws.amazon.com/glue/latest/dg/blueprints-overview.html to learn more. Organizations spend significant time developing and maintaining data integration pipelines that hydrate data warehouses, data lakes, and lake houses. As data volume increases, data engineering teams struggle to keep up with […]

Introducing AWS Glue 3.0 with optimized Apache Spark 3.1 runtime for faster data integration

May 2022: This post was reviewed for accuracy. In August 2020, we announced the availability of AWS Glue 2.0. AWS Glue 2.0 reduced job startup times by 10x, enabling customers to realize an average of 45% cost savings on their extract, transform, and load (ETL) jobs. The fast start time allows customers to easily adopt […]

Effective data lakes using AWS Lake Formation, Part 5: Securing data lakes with row-level access control

Apr 2023: This post was updated with the latest dataset and the updated CloudFormation template. July 2023: This post was reviewed for accuracy. Increasingly, customers are looking at data lakes as a core part of their strategy to democratize data access across the organization. Data lakes enable you to handle petabytes and exabytes of data […]
