AWS Big Data Blog

Category: AWS Glue

Entity resolution and fuzzy matches in AWS Glue using the Zingg open source library

In this post, we explore how to use Zingg’s entity resolution capabilities within an AWS Glue notebook, which you can later run as an extract, transform, and load (ETL) job. By integrating Zingg in your notebooks or ETL jobs, you can effectively address data governance challenges and provide consistent and accurate data across your organization.

Understanding Apache Iceberg on AWS with the new technical guide

We’re excited to announce the launch of the Apache Iceberg on AWS technical guide. Whether you are new to Apache Iceberg on AWS or already running production workloads on AWS, this comprehensive technical guide offers detailed guidance on foundational concepts to advanced optimizations to build your transactional data lake with Apache Iceberg on AWS.

Detect and handle data skew on AWS Glue

October 2024: This post was reviewed and updated for accuracy. AWS Glue is a fully managed, serverless data integration service provided by Amazon Web Services (AWS) that uses Apache Spark as one of its backend processing engines (as of this writing, you can use Python Shell or Spark). Data skew occurs when the data being […]

How Fujitsu implemented a global data mesh architecture and democratized data

This is a guest post co-authored with Kanehito Miyake, Engineer at Fujitsu Japan.  Fujitsu Limited was established in Japan in 1935. Currently, we have approximately 120,000 employees worldwide (as of March 2023), including group companies. We develop business in various regions around the world, starting with Japan, and provide digital services globally. To provide a […]

Introducing Amazon Q data integration in AWS Glue

Today, we’re excited to announce general availability of Amazon Q data integration in AWS Glue. Amazon Q data integration, a new generative AI-powered capability of Amazon Q Developer, enables you to build data integration pipelines using natural language. This reduces the time and effort you need to learn, build, and run data integration jobs using […]

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. […]

Amazon DataZone announces integration with AWS Lake Formation hybrid access mode for the AWS Glue Data Catalog

Last week, we announced the general availability of the integration between Amazon DataZone and AWS Lake Formation hybrid access mode. In this post, we share how this new feature helps you simplify the way you use Amazon DataZone to enable secure and governed sharing of your data in the AWS Glue Data Catalog. We also […]

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. This information empowers end-users to make informed decisions as to whether or not to use specific assets. In this post, we discuss the latest features of Amazon DataZone for data quality, the integration between Amazon DataZone and AWS Glue Data Quality and how you can import data quality scores produced by external systems into Amazon DataZone via API.