AWS Big Data Blog

A deep dive into high-cardinality anomaly detection in Elasticsearch

In May 2020, we announced the general availability of real-time anomaly detection for Elasticsearch. With that release, we leveraged the Random Cut Forest (RCF) algorithm to identify anomalous behaviors in the multi-dimensional data streams generated by Elasticsearch queries. We focused on aggregation first, to enable our users to quickly and accurately detect anomalies in their […]
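As a rough illustration of what a high-cardinality detector definition can look like, the sketch below creates one through the Open Distro Anomaly Detection REST API. The endpoint path follows the plugin's documented pattern, but the index, field, and category names are placeholders chosen for this example, not values taken from the post:

```python
import requests

# Hypothetical Elasticsearch endpoint -- adjust for your domain.
ES_ENDPOINT = "https://my-es-domain.example.com"

detector = {
    "name": "http-latency-by-host",
    "description": "Per-host latency anomalies (high cardinality via category_field)",
    "time_field": "timestamp",
    "indices": ["http-logs-*"],
    # One anomaly model per distinct value of this field (assumed field name).
    "category_field": ["host"],
    "feature_attributes": [
        {
            "feature_name": "avg_latency",
            "feature_enabled": True,
            "aggregation_query": {"avg_latency": {"avg": {"field": "latency_ms"}}},
        }
    ],
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
}

resp = requests.post(
    f"{ES_ENDPOINT}/_opendistro/_anomaly_detection/detectors",
    json=detector,
    auth=("admin", "admin"),  # placeholder credentials
)
resp.raise_for_status()
print(resp.json())
```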

Optimizing Spark applications with workload partitioning in AWS Glue

AWS Glue provides a serverless environment to prepare (extract and transform) and load large amounts of data from a variety of sources for analytics and data processing with Apache Spark ETL jobs. This post discusses a new AWS Glue Spark runtime optimization that helps developers of Apache Spark applications and ETL jobs, big data architects, […]
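The optimization in question is workload partitioning (bounded execution). A minimal sketch of how it might appear in a Glue Spark job script follows; the database, table, output path, and the bound of 500 files are placeholder values, so treat them as assumptions rather than settings from the post:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Bounded execution: process at most 500 unprocessed files per run instead of
# the whole backlog (placeholder database/table names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_events",
    transformation_ctx="raw_events",  # lets job bookmarks track what was processed
    additional_options={"boundedFiles": "500"},
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/"},
    format="parquet",
)

job.commit()
```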

Data preprocessing for machine learning on Amazon EMR made easy with AWS Glue DataBrew

The machine learning (ML) lifecycle consists of several key phases: data collection, data preparation, feature engineering, model training, model evaluation, and model deployment. The data preparation and feature engineering phases ensure an ML model is given high-quality data that is relevant to the model’s purpose. Because most raw datasets require multiple cleaning steps (such as […]
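As a hedged sketch of the programmatic side of that preparation step, the snippet below starts an existing DataBrew recipe job from Python with boto3 and waits for it to finish before downstream EMR work reads the output; the job name is a placeholder, and the post itself may drive DataBrew from the console instead:

```python
import time

import boto3

databrew = boto3.client("databrew")

# Placeholder name of a DataBrew recipe job that cleans the raw dataset
# and writes the prepared output to S3.
job_name = "clean-raw-events"

run_id = databrew.start_job_run(Name=job_name)["RunId"]

# Poll until the run reaches a terminal state so EMR reads complete output.
while True:
    state = databrew.describe_job_run(Name=job_name, RunId=run_id)["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print(f"DataBrew run {run_id} finished with state {state}")
```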

Accessing and visualizing external tables in an Apache Hive metastore with Amazon Athena and Amazon QuickSight

Many organizations have an Apache Hive metastore that stores the schemas for their data lake. You can use Amazon Athena because of its serverless nature; it makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. You may also want to reliably query the rich datasets in the lake, with their schemas […]
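Once the external Hive metastore is registered with Athena as a data catalog, tables in it can be queried like any other Athena source. A minimal boto3 sketch is below; the catalog, database, table, and results-bucket names are placeholders:

```python
import boto3

athena = boto3.client("athena")

# "hms-catalog" stands in for the data catalog that fronts the external Hive
# metastore; "weblogs"/"page_views" are placeholder database/table names.
response = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page LIMIT 10",
    QueryExecutionContext={"Catalog": "hms-catalog", "Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```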

Orchestrating analytics jobs by running Amazon EMR Notebooks programmatically

Amazon EMR is a big data service offered by AWS to run Apache Spark and other open-source applications on AWS in a cost-effective manner. Amazon EMR Notebooks is a managed environment based on Jupyter Notebook that allows data scientists, analysts, and developers to prepare and visualize data, collaborate with peers, build applications, and perform interactive […]
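Programmatic execution goes through the EMR API. The sketch below starts a notebook run with boto3 and polls for its status; the editor ID, notebook path, cluster ID, and service role are placeholders:

```python
import time

import boto3

emr = boto3.client("emr")

# Placeholder identifiers for an existing EMR Notebook and cluster.
start = emr.start_notebook_execution(
    EditorId="e-ABC123DEF456",
    RelativePath="analysis/daily_report.ipynb",
    ExecutionEngine={"Id": "j-XYZ789"},  # EMR cluster that runs the notebook kernels
    ServiceRole="EMR_Notebooks_DefaultRole",
)
execution_id = start["NotebookExecutionId"]

# Poll until the execution reaches a terminal state.
while True:
    status = emr.describe_notebook_execution(NotebookExecutionId=execution_id)[
        "NotebookExecution"
    ]["Status"]
    if status in ("FINISHED", "FAILED", "STOPPED"):
        break
    time.sleep(30)

print(f"Notebook execution {execution_id} ended with status {status}")
```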

Applying row-level and column-level security on Amazon QuickSight dashboards

Amazon QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people you work with, wherever they are. QuickSight connects to your data in the cloud and combines data from many different sources. On a single data dashboard, QuickSight can include AWS data, third-party data, big data, […]

Using the Amazon Redshift Data API to interact from an Amazon SageMaker Jupyter notebook

The Amazon Redshift Data API makes it easy for any application written in Python, Go, Java, Node.js, PHP, Ruby, and C++ to interact with Amazon Redshift. Traditionally, these applications use JDBC connectors to connect, send a query to run, and retrieve results from the Amazon Redshift cluster. This requires extra steps like managing the cluster […]
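For example, a SageMaker Jupyter notebook can call the Data API through boto3 instead of maintaining a JDBC connection; the cluster, database, user, and query below are placeholders:

```python
import time

import boto3

client = boto3.client("redshift-data")

# Placeholder cluster/database/user identifiers.
stmt = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT venuename, venuecity FROM venue LIMIT 5",
)

# The Data API is asynchronous: wait for the statement to finish, then fetch rows.
while True:
    desc = client.describe_statement(Id=stmt["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = client.get_statement_result(Id=stmt["Id"])
    for row in result["Records"]:
        print([col.get("stringValue") for col in row])
```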

Managing COVID-19 exposure with crowd tracing

This is a guest blog post by AWS partner Aspire Ventures. As we enter winter, with fewer options to be outdoors, our personal choices have an even greater impact on our risk of contracting the COVID-19 virus. A New England Journal of Medicine publication showed real-world examples of the effectiveness of masks and social distancing in mitigating […]

Detect change points in your event data stream using Amazon Kinesis Data Streams, Amazon DynamoDB and AWS Lambda

The success of many modern streaming applications depends on the ability to sequentially detect each change as soon as possible after it occurs, while continuing to monitor the data stream as it evolves. Applications of change point detection range across genomics, marketing, and finance, to name a few. In genomics, change point detection can help […]
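As a rough, hedged sketch of that architecture (not the detector from the post, which is statistically stronger), the Lambda handler below reads Kinesis records, keeps running statistics in a DynamoDB item, and flags a change point when a new value drifts far from the running mean; the table name, partition key, and payload fields are all assumptions:

```python
import base64
import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
# Placeholder table holding one item of running statistics per metric.
table = dynamodb.Table(os.environ.get("STATE_TABLE", "change-point-state"))


def handler(event, context):
    """Triggered by a Kinesis stream; flags simple mean-shift change points."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        metric, value = payload["metric"], float(payload["value"])

        item = table.get_item(Key={"metric": metric}).get("Item")
        if item is None:
            count, mean = 0, 0.0
        else:
            count, mean = int(item["count"]), float(item["mean"])

        # Crude rule for illustration only: after a warm-up period, treat a
        # >50% jump away from the running mean as a suspected change point.
        if count >= 30 and abs(value - mean) > 0.5 * abs(mean):
            print(f"Change point suspected for {metric}: value={value}, mean={mean}")

        # Update the running mean incrementally and persist it (stored as a
        # string because DynamoDB does not accept Python floats directly).
        count += 1
        mean += (value - mean) / count
        table.put_item(Item={"metric": metric, "count": count, "mean": str(mean)})
```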

Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue 2.0 features an upgraded infrastructure for running Apache Spark ETL jobs in AWS Glue with reduced startup times. With a reduced startup delay and a lower minimum billing duration, overall […]
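In practice, installing a packaged wheel comes down to pointing the job at it with the `--additional-python-modules` job parameter. A hedged sketch of creating such a job with boto3 follows; the job name, role, script path, wheel location, and extra pip dependency are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Placeholder role, script, and wheel locations.
glue.create_job(
    Name="etl-with-custom-wheel",
    Role="GlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/etl_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
    DefaultArguments={
        # Install our packaged module (plus any pip-resolvable extras) at job start.
        "--additional-python-modules": "s3://my-bucket/wheels/mymodule-0.1.0-py3-none-any.whl,pandas==1.1.5"
    },
    NumberOfWorkers=10,
    WorkerType="G.1X",
)
```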