AWS Big Data Blog

Category: AWS Glue

Build a semantic search engine for tabular columns with Transformers and Amazon OpenSearch Service

Finding similar columns in a data lake has important applications in data cleaning and annotation, schema matching, data discovery, and analytics across multiple data sources. The inability to accurately find and analyze data from disparate sources represents a potential efficiency killer for everyone from data scientists, medical researchers, and academics to financial and government analysts. Conventional […]

Build a real-time GDPR-aligned Apache Iceberg data lake

Data lakes are a popular choice for today’s organizations to store data about their business activities. As a best practice in data lake design, data should be immutable once stored. But regulations such as the General Data Protection Regulation (GDPR) have created obligations for data operators who must be able to erase or […]

Introducing AWS Glue crawlers using AWS Lake Formation permission management

Data lakes provide a centralized repository that consolidates your data at scale and makes it available for different kinds of analytics. AWS Glue crawlers are a popular way to scan data in a data lake, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. AWS Lake […]

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

This post is co-written with Olivia Michele and Dariswan Janweri P. at Ruparupa. Ruparupa was built by PT. Omni Digitama Internusa with the vision to cultivate synergy and create a seamless digital ecosystem within Kawan Lama Group that touches and enhances the lives of many. Ruparupa is the first digital platform built by Kawan Lama […]

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all their organizational data generated by different teams, across business domains, in many different formats, and even over time. According to a study, the […]

Extract data from SAP ERP using AWS Glue and the SAP SDK

This is a guest post by Siva Manickam and Prahalathan M from Vyaire Medical Inc. Vyaire Medical Inc. is a global company, headquartered in suburban Chicago, focused exclusively on supporting breathing through every stage of life. Established from legacy brands with a 65-year history of pioneering breathing technology, the company’s portfolio of integrated solutions is […]

Automate schema evolution at scale with Apache Hudi in AWS Glue

In the data analytics space, organizations often deal with many tables in different databases and file formats to hold data for different business functions. Business needs often drive table structure, such as schema evolution (the addition of new columns, removal of existing columns, update of column names, and so on) for some of these tables […]

Improve observability across Amazon MWAA tasks

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud at scale. A data pipeline is a set of tasks and processes used to automate the movement and transformation of data between different systems. […]

Introducing MongoDB Atlas metadata collection with AWS Glue crawlers

For data lake customers who need to discover petabytes of data, AWS Glue crawlers are a popular way to discover and catalog data in the background. This allows users to search and find relevant data from multiple data sources. Many customers also have data in managed operational databases such as MongoDB Atlas and need to […]

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

Every day, Amazon devices process and analyze billions of transactions from global shipping, inventory, capacity, supply, sales, marketing, producers, and customer service teams. This data is used to procure device inventory to meet Amazon customers’ demand. With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics […]