AWS Big Data Blog

Integrate Amazon Redshift native IdP federation with Microsoft Azure AD and Power BI

Amazon Redshift accelerates your time to insights with fast, easy, and secure cloud data warehousing at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries. As enterprise customers look to build their data warehouse on Amazon Redshift, they have many integration needs with the […]

Read More

Simplify management of database privileges in Amazon Redshift using role-based access control

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the challenges with security is that enterprises don’t want to have a concentration of superuser privileges amongst a handful of users. […]

Read More

Introducing Protocol buffers (protobuf) schema support in AWS Glue Schema Registry

AWS Glue Schema Registry now supports Protocol buffers (protobuf) schemas in addition to JSON and Avro schemas. This allows application teams to use protobuf schemas to govern the evolution of streaming data and centrally control data quality from data streams to data lake. AWS Glue Schema Registry provides an open-source library that includes Apache-licensed serializers […]

Read More

Design patterns: Set up AWS Glue Crawlers using S3 event notifications

The AWS Well-Architected Data Analytics Lens provides a set of guiding principles for analytics applications on AWS. One of the best practices it talks about is build a central Data Catalog to store, share, and track metadata changes. AWS Glue provides a Data Catalog to fulfill this requirement. AWS Glue also provides crawlers that automatically […]

Read More
BDB-2071-Virtual_key_2

New features from Apache Hudi 0.9.0 on Amazon EMR

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by providing transaction support and record-level insert, update, and delete capabilities on data lakes on Amazon Simple Storage Service (Amazon S3) or Apache HDFS. Apache Hudi is integrated with open-source big data analytics […]

Read More

Understanding the JVMMemoryPressure metric changes in Amazon OpenSearch Service

Amazon OpenSearch Service is a managed service that makes it easy to secure, deploy, and operate OpenSearch and legacy Elasticsearch clusters at scale. In the latest service software release of Amazon OpenSearch Service, we’ve changed the behavior of the JVMMemoryPressure metric. This metric now reports the overall heap usage, including young and old pools, for […]

Read More
Solution Architecture

Build data lineage for data lakes using AWS Glue, Amazon Neptune, and Spline

Data lineage is one of the most critical components of a data governance strategy for data lakes. Data lineage helps ensure that accurate, complete and trustworthy data is being used to drive business decisions. While a data catalog provides metadata management features and search capabilities, data lineage shows the full context of your data by […]

Read More

Persist and analyze metadata in a transient Amazon MWAA environment

Customers can harness sophisticated orchestration capabilities through the open-source tool Apache Airflow. Airflow can be installed on Amazon EC2 instances or can be dockerized and deployed as a container on AWS container services. Alternatively, customers can also opt to leverage Amazon Managed Workflows for Apache Airflow (MWAA). Amazon MWAA is a fully managed service that […]

Read More

Use Amazon CodeGuru Profiler to monitor and optimize performance in Amazon Kinesis Data Analytics applications for Apache Flink

Amazon Kinesis Data Analytics makes it easy to transform and analyze streaming data and gain actionable insights in real time with Apache Flink. Apache Flink is an open-source framework and engine for processing data streams in real time. Kinesis Data Analytics reduces the complexity of building and managing Apache Flink applications using open-source libraries and […]

Read More

Up to 15 times improvement in Hive write performance with the Amazon EMR Hive zero-rename feature

Our customers use Apache Hive on Amazon EMR for large-scale data analytics and extract, transform, and load (ETL) jobs. Amazon EMR Hive uses Apache Tez as the default job execution engine, which creates Directed Acyclic Graphs (DAGs) to process data. Each DAG can contain multiple vertices from which tasks are created to run the application […]

Read More