AWS Big Data Blog

Category: AWS Glue

Architecture diagram

Gain insights from historical location data using Amazon Location Service and AWS analytics services

Many organizations around the world rely on the use of physical assets, such as vehicles, to deliver a service to their end-customers. By tracking these assets in real time and storing the results, asset owners can derive valuable insights on how their assets are being used to continuously deliver business improvements and plan for future […]

Measure performance of AWS Glue Data Quality for ETL pipelines

In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset. As part of the results, we show how AWS Glue Data Quality provides information about the runtime of extract, transform, and load (ETL) jobs, the resources measured in terms of data processing units (DPUs), and how you can track the cost of running AWS Glue Data Quality for ETL pipelines by defining custom cost reporting in AWS Cost Explorer.

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. However, altering schema and table partitions in traditional data lakes can be a disruptive and time-consuming task, requiring renaming or recreating entire tables and reprocessing large datasets. […]

Empowering data-driven excellence: How the Bluestone Data Platform embraced data mesh for success

This post is co-written with Toney Thomas and Ben Vengerovsky from Bluestone. In the ever-evolving world of finance and lending, the need for real-time, reliable, and centralized data has become paramount. Bluestone, a leading financial institution, embarked on a transformative journey to modernize its data infrastructure and transition to a data-driven organization. In this post, […]

Combine AWS Glue and Amazon MWAA to build advanced VPC selection and failover strategies

AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue customers often have to meet strict security requirements, which sometimes involve locking down the network connectivity allowed to the job, or running inside […]

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 2: Real-time monitoring using Grafana

Monitoring data pipelines in real time is critical for catching issues early and minimizing disruptions. AWS Glue has made this more straightforward with the launch of AWS Glue job observability metrics, which provide valuable insights into your data integration pipelines built on AWS Glue. However, you might need to track key performance indicators across multiple […]

Use multiple bookmark keys in AWS Glue JDBC jobs

AWS Glue is a serverless data integrating service that you can use to catalog data and prepare for analytics. With AWS Glue, you can discover your data, develop scripts to transform sources into targets, and schedule and run extract, transform, and load (ETL) jobs in a serverless environment. AWS Glue jobs are responsible for running […]

os_glue_architecture

Detect, mask, and redact PII data using AWS Glue before loading into Amazon OpenSearch Service

Many organizations, small and large, are working to migrate and modernize their analytics workloads on Amazon Web Services (AWS). There are many reasons for customers to migrate to AWS, but one of the main reasons is the ability to use fully managed services rather than spending time maintaining infrastructure, patching, monitoring, backups, and more. Leadership […]

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyse data. Data is stored from online systems such as the databases, CRMs, and marketing systems to data stores such as data lakes on Amazon Simple Storage Service (Amazon S3), data warehouses […]

New Amazon CloudWatch log class to cost-effectively scale your AWS Glue workloads

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. One of […]