AWS Big Data Blog

Category: AWS Glue

Design patterns for an enterprise data lake using AWS Lake Formation cross-account access

In this post, we briefly walk through the most common design patterns adopted by enterprises to build lake house solutions that support their business agility in a multi-tenant model, using the AWS Lake Formation cross-account feature to enable a multi-account strategy for line of business (LOB) accounts to produce and consume data from your data […]

Hydrate your data lake with SaaS application data using Amazon AppFlow

Organizations today want to make data-driven decisions. The data could lie in multiple source systems, such as line of business applications, log files, connected devices, social media, and many more. As organizations adopt software as a service (SaaS) applications, data becomes increasingly fragmented and trapped in different “data islands.” To make decision-making easier, organizations are […]

Improve query performance using AWS Glue partition indexes

While creating data lakes on the cloud, the data catalog is crucial to centralize metadata and make the data visible, searchable, and queryable for users. With the recent exponential growth of data volume, it becomes much more important to optimize data layout and maintain the metadata on cloud storage to keep the value of data […]

Build a data quality score card using AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight

Data quality plays an important role while building an extract, transform, and load (ETL) pipeline for sending data to downstream analytical applications and machine learning (ML) models. The adage “garbage in, garbage out” aptly describes why it’s important to filter out bad data before further processing. Continuously monitoring data quality and comparing it […]

Simplify incoming data ingestion with dynamic parameterized datasets in AWS Glue DataBrew

When data analysts and data scientists prepare data for analysis, they often rely on periodically generated data produced by upstream services, such as labeling datasets from Amazon SageMaker Ground Truth or Cost and Usage Reports from AWS Billing and Cost Management. Alternatively, they can regularly upload such data to Amazon Simple Storage Service (Amazon S3) […]

Set up CI/CD pipelines for AWS Glue DataBrew using AWS Developer Tools

An integral part of DevOps is adopting the culture of continuous integration and continuous delivery (CI/CD). This enables teams to securely store and version code, maintain parity between development and production environments, and achieve end-to-end automation of the release cycle, including building, testing, and deploying to production. In essence, development teams follow CI/CD processes to […]

How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform

This is a joint blog post co-authored with Anu Jain, Graham Person, and Paul Conroy from JPMorgan Chase. Most modern organizations recognize that their data benefits their entire enterprise. Data has value to the individual business process that produces it, but data’s additional potential can be realized when it’s shared and combined with other […]

Effective data lakes using AWS Lake Formation, Part 3: Using ACID transactions on governed tables

Data lakes on Amazon Simple Storage Service (Amazon S3) have become the default repository for all enterprise data and serve as a common choice for a large number of users querying from a variety of analytics and ML tools. Oftentimes you want to ingest data continuously into the data lake from multiple sources and query against the […]

Use Grok patterns in AWS Glue to process streaming data into Amazon Elasticsearch Service

Recently, we launched AWS Glue custom connectors for Amazon Elasticsearch Service (Amazon ES), which provide the capability to ingest data into Amazon ES with just a few clicks. You can now use Amazon ES as a data store for your extract, transform, and load (ETL) jobs using AWS Glue and AWS Glue Studio. This integration […]

How OrthoFi delivers better insights for customers with Amazon Redshift and AWS Glue

This is a guest post by Christa Pierson and Jon Fearer at OrthoFi. OrthoFi is an orthodontic industry leader in revenue cycle management (RCM), and has partnered with more than 550 orthodontic practices across the country, delivering an end-to-end platform that enables orthodontists to bring on more patients and run their businesses more effectively. To […]
