AWS Big Data Blog

Category: AWS Glue

Get started with Apache Hudi using AWS Glue by implementing key design concepts – Part 1

Many organizations build data lakes on Amazon Simple Storage Service (Amazon S3) using a modern architecture for a scalable and cost-effective solution. Open-source storage formats like Parquet and Avro are commonly used, and data is stored in these formats as immutable files. As the data lake is expanded to additional use cases, there are still […]

Build incremental crawls of data lakes with existing Glue catalog tables

AWS Glue includes crawlers, a capability that make discovering datasets simpler by scanning data in Amazon Simple Storage Service (Amazon S3) and relational databases, extracting their schema, and automatically populating the AWS Glue Data Catalog, which keeps the metadata current. This reduces the time to insight by making newly ingested data quickly available for analysis […]

Code versioning using AWS Glue Studio and GitHub

AWS Glue now offers integration with Git, an open-source version control system widely used across the developer community. Thanks to this integration, you can incorporate your existing DevOps practices on AWS Glue jobs. AWS Glue is a serverless data integration service that helps you create jobs based on Apache Spark or Python to perform extract, […]

Land data from databases to a data lake at scale using AWS Glue blueprints

To build a data lake on AWS, a common data ingestion pattern is to use AWS Glue jobs to perform extract, transform, and load (ETL) data from relational databases to Amazon Simple Storage Service (Amazon S3). A project often involves extracting hundreds of tables from source databases to the data lake raw layer. And for […]

Ingest streaming data to Apache Hudi tables using AWS Glue and Apache Hudi DeltaStreamer

In today’s world with technology modernization, the need for near-real-time streaming use cases has increased exponentially. Many customers are continuously consuming data from different sources, including databases, applications, IoT devices, and sensors. Organizations may need to ingest that streaming data into data lakes built on Amazon Simple Storage Service (Amazon S3). You may also need […]

Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. It’s serverless, so there’s no infrastructure to set up or manage. This post provides a step-by-step guide to build a continuous integration and continuous delivery (CI/CD) pipeline using AWS […]

Automate ETL jobs between Amazon RDS for SQL Server and Azure Managed SQL using AWS Glue Studio

Nowadays many customers are following a multi-cloud strategy. They might choose to use various cloud-managed services, such as Amazon Relational Database Service (Amazon RDS) for SQL Server and Azure SQL Managed Instances, to perform data analytics tasks, but still use traditional extract, transform, and load (ETL) tools to integrate and process the data. However, traditional ETL tools may […]

EMR Hive Metastore Upgrade

Upgrade Amazon EMR Hive Metastore from 5.X to 6.X

If you are currently running Amazon EMR 5.X clusters, consider moving to Amazon EMR 6.X as  it includes new features that helps you improve performance and optimize on cost. For instance, Apache Hive is two times faster with LLAP on Amazon EMR 6.X, and Spark 3 reduces costs by 40%. Additionally, Amazon EMR 6.x releases […]

Enable self-service visual data integration and analysis for fund performance using AWS Glue Studio and Amazon QuickSight

June 2023: This post was reviewed and updated for accuracy. IMM (Institutional Money Market) is a mutual fund that invests in highly liquid instruments, cash, and cash equivalents. IMM funds are large financial intermediaries that are crucial to financial stability in the US. Due to its criticality, IMM funds are highly regulated under the security […]

Detect and process sensitive data using AWS Glue Studio

Data lakes offer the possibility of sharing diverse types of data with different teams and roles to cover numerous use cases. This is very important in order to implement a data democratization strategy and incentivize the collaboration between lines of business. When a data lake is being designed, one of the most important aspects to […]