AWS Big Data Blog

Category: AWS Glue

Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started

AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. AWS Glue provides an extensible architecture that enables users with different data processing use cases. A common use case is building data lakes on Amazon Simple Storage Service (Amazon S3) using AWS […]

How SikSin improved customer engagement with AWS Data Lab and Amazon Personalize

This post is co-written with Byungjun Choi and Sangha Yang from SikSin. SikSin is a technology platform connecting customers with restaurant partners serving their multiple needs. Customers use the SikSin platform to search and discover restaurants, read and write reviews, and view photos. From the restaurateurs’ perspective, SikSin enables restaurant partners to engage and acquire […]

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow. BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, […]

How Novo Nordisk built a modern data architecture on AWS

Novo Nordisk is a leading global pharmaceutical company, responsible for producing life-saving medicines that reach more than 34 million patients each day. They do this following their triple bottom line—that they must strive to be environmentally sustainable, socially sustainable, and financially sustainable. The combination of using AWS and data supports all these targets. Data is […]

Create your own reusable visual transforms for AWS Glue Studio

AWS Glue Studio has recently added the possibility of adding custom transforms that you can use to build visual jobs to use them in combination with the AWS Glue Studio components provided out of the box. You can now define custom visual transform by simply dropping a JSON file and a Python script onto Amazon […]

Introducing native Delta Lake table support with AWS Glue crawlers

June 2023: This post was reviewed and updated for accuracy. Delta Lake is an open-source project that helps implement modern data lake architectures commonly built on Amazon S3 or other cloud storages. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud. Delta Lake is […]

Getting started with AWS Glue Data Quality for ETL Pipelines

June 2023: This post was reviewed and updated with the latest release from AWS Glue Data Catalog. Today, hundreds of thousands of customers use data lakes for analytics and machine learning. However, data engineers have to cleanse and prepare this data before it can be used. The underlying data has to be accurate and recent […]

Build an AWS Lake Formation permissions inventory dashboard using AWS Glue and Amazon QuickSight

AWS Lake Formation makes it easier to centrally govern, secure, and share data for analytics with familiar database-style grant features managed through the Glue Data Catalog. Lake Formation provides a single place to define fine-grained access control on catalog resources. These permissions are granted to the principals by a data lake admin, and integrated engines […]

Introducing the Cloud Shuffle Storage Plugin for Apache Spark

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. In AWS Glue, you can use Apache Spark, an open-source, distributed processing system for your data integration tasks and big data workloads. Apache Spark utilizes in-memory caching and optimized […]

Scale AWS SDK for pandas workloads with AWS Glue for Ray

September 2023: This post was reviewed and updated with a new dataset and related code blocks and images. AWS SDK for pandas is an open-source library that extends the popular Python pandas library, enabling you to connect to AWS data and analytics services using pandas data frames. We’ve seen customers use the library in combination […]