AWS Big Data Blog

Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark

June 2023: This post was reviewed and updated for accuracy. Data created in the cloud has been growing fast recently, so scalability is a key factor in distributed data processing. Many customers benefit from the scalability of the AWS Glue serverless Spark runtime. Today, we’re pleased to announce the release of AWS Glue Auto […]

Enhance analytics with Google Trends data using AWS Glue, Amazon Athena, and Amazon QuickSight

In today’s market, business success often lies in the ability to glean accurate insights and predictions from data. However, data scientists and analysts often find that the data they have at their disposal isn’t enough to help them make accurate predictions for their use cases. A variety of factors might alter an outcome and should […]

Scale Amazon Redshift to meet high throughput query requirements

Many enterprise customers have demanding query throughput requirements for their data warehouses. Some may be able to address these requirements by scaling a single cluster horizontally or vertically. Others may need extra capacity only for short periods to handle peaks, which can be addressed through Amazon Redshift concurrency scaling. However, enterprises with consistently […]

Amazon Redshift continues its price-performance leadership

Data is a strategic asset. Getting timely value from data requires high-performance systems that can deliver performance at scale while keeping costs low. Amazon Redshift is the most popular cloud data warehouse, used by tens of thousands of customers to analyze exabytes of data every day. We continue to add new capabilities to […]

Automate notifications on Slack for Amazon Redshift query monitoring rule violations

In this post, we walk you through how to set up automatic notifications of query monitoring rule (QMR) violations in Amazon Redshift to a Slack channel, so that Amazon Redshift users can take timely action. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze your […]

Share data securely across Regions using Amazon Redshift data sharing

July 2023: This post was reviewed for accuracy. Today’s global, data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This requires you to seamlessly share and consume live, consistent data as a single source of truth without copying the data, […]

Write prepared data directly into JDBC-supported destinations using AWS Glue DataBrew

July 2023: This post was reviewed for accuracy. AWS Glue DataBrew offers over 250 pre-built transformations to automate data preparation tasks (such as filtering anomalies, standardizing formats, and correcting invalid values) that would otherwise require days or weeks writing hand-coded transformations. You can now write cleaned and normalized data directly into JDBC-supported databases and data […]

Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container

Apr 2023: This post was reviewed and updated with enhanced support for Glue 4.0 Streaming jobs. Jan 2023: This post was reviewed and updated with enhanced support for Glue 3.0 Streaming jobs, ARM64, and Glue 4.0. AWS Glue is a fully managed serverless service that allows you to process data coming through different data sources […]

Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3

June 2023: This post was reviewed and updated for accuracy. Customers are increasingly building data lakes to store data at massive scale in the cloud. It’s common to use distributed computing engines, cloud-native databases, and data warehouses when you want to process and analyze your data in data lakes. Amazon EMR and AWS Glue are […]

Build a data pipeline to automatically discover and mask PII data with AWS Glue DataBrew

Handling personally identifiable information (PII) is a common requirement when operating a data lake at scale. Businesses often need to mitigate the risk of exposing PII to the data science team without hindering the team's ability to get to the data they need to generate valuable insights. […]