AWS Big Data Blog
Category: AWS Glue
Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue
Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all their organizational data generated by different teams, across business domains, from all different formats, and even over history. According to a study, the […]
Extract data from SAP ERP using AWS Glue and the SAP SDK
This is a guest post by Siva Manickam and Prahalathan M from Vyaire Medical Inc. Vyaire Medical Inc. is a global company, headquartered in suburban Chicago, focused exclusively on supporting breathing through every stage of life. Established from legacy brands with a 65-year history of pioneering breathing technology, the company’s portfolio of integrated solutions is […]
Automate schema evolution at scale with Apache Hudi in AWS Glue
In the data analytics space, organizations often deal with many tables in different databases and file formats to hold data for different business functions. Business needs often drive table structure, such as schema evolution (the addition of new columns, removal of existing columns, update of column names, and so on) for some of these tables […]
Improve observability across Amazon MWAA tasks
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud at scale. A data pipeline is a set of tasks and processes used to automate the movement and transformation of data between different systems. […]
Introducing MongoDB Atlas metadata collection with AWS Glue crawlers
For data lake customers who need to discover petabytes of data, AWS Glue crawlers are a popular way to discover and catalog data in the background. This allows users to search and find relevant data from multiple data sources. Many customers also have data in managed operational databases such as MongoDB Atlas and need to […]
How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics
Every day, Amazon devices process and analyze billions of transactions from global shipping, inventory, capacity, supply, sales, marketing, producers, and customer service teams. This data is used in procuring devices’ inventory to meet Amazon customers’ demands. With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics […]
Handle UPSERT data operations using open-source Delta Lake and AWS Glue
September 2024: This post was reviewed and updated for accuracy. Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. Delta Lake framework provides these two capabilities. In this post, we […]
Build a data lake with Apache Flink on Amazon EMR
To build a data-driven business, it is important to democratize enterprise data assets in a data catalog. With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep […]
Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started
AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. AWS Glue provides an extensible architecture that enables users with different data processing use cases. A common use case is building data lakes on Amazon Simple Storage Service (Amazon S3) using AWS […]
How SikSin improved customer engagement with AWS Data Lab and Amazon Personalize
This post is co-written with Byungjun Choi and Sangha Yang from SikSin. SikSin is a technology platform connecting customers with restaurant partners serving their multiple needs. Customers use the SikSin platform to search and discover restaurants, read and write reviews, and view photos. From the restaurateurs’ perspective, SikSin enables restaurant partners to engage and acquire […]









