AWS Big Data Blog

Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3

June 2023: This post was reviewed and updated for accuracy. Customers are increasingly building data lakes to store data at massive scale in the cloud. It’s common to use distributed computing engines, cloud-native databases, and data warehouses when you want to process and analyze your data in data lakes. Amazon EMR and AWS Glue are […]

Build a data pipeline to automatically discover and mask PII data with AWS Glue DataBrew

Handling personally identifiable information (PII) is a common requirement when operating a data lake at scale. Businesses often need to mitigate the risk of exposing PII data to the data science team without hindering the team's ability to reach the data it needs to generate valuable insights. […]

Query your data streams interactively using Kinesis Data Analytics Studio and Python

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. Amazon Kinesis Data Analytics Studio makes it easy for customers to analyze streaming data in real time, as well as build stream processing applications powered by Apache […]

Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool

July 2023: This post was reviewed for accuracy. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. […]

Integrate Amazon Redshift native IdP federation with Microsoft Azure AD using a SQL client

June 2023: This post was reviewed and updated for accuracy. Amazon Redshift accelerates your time to insights with fast, easy, and secure cloud data warehousing at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries. The new Amazon Redshift native identity provider authentication simplifies […]

Integrate Amazon Redshift native IdP federation with Microsoft Azure AD and Power BI

June 2023: This post was reviewed and updated for accuracy. Amazon Redshift accelerates your time to insights with fast, easy, and secure cloud data warehousing at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries. As enterprise customers look to build their data warehouse […]

Simplify management of database privileges in Amazon Redshift using role-based access control

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the challenges with security is that enterprises don’t want to have a concentration of superuser privileges amongst a handful of users. […]

Introducing Protocol buffers (protobuf) schema support in AWS Glue Schema Registry

AWS Glue Schema Registry now supports Protocol buffers (protobuf) schemas in addition to JSON and Avro schemas. This allows application teams to use protobuf schemas to govern the evolution of streaming data and centrally control data quality from data streams to the data lake. AWS Glue Schema Registry provides an open-source library that includes Apache-licensed serializers […]

Design patterns: Set up AWS Glue Crawlers using S3 event notifications

The AWS Well-Architected Data Analytics Lens provides a set of guiding principles for analytics applications on AWS. One of the best practices it describes is to build a central Data Catalog to store, share, and track metadata changes. AWS Glue provides a Data Catalog to fulfill this requirement. AWS Glue also provides crawlers that automatically […]

New features from Apache Hudi 0.9.0 on Amazon EMR

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by providing transaction support and record-level insert, update, and delete capabilities on data lakes on Amazon Simple Storage Service (Amazon S3) or Apache HDFS. Apache Hudi is integrated with open-source big data analytics […]