AWS Big Data Blog

Scale your AWS Glue for Apache Spark jobs with R type, G.12X, and G.16X workers

This post demonstrates how AWS Glue R type, G.12X, and G.16X workers help you scale up your AWS Glue for Apache Spark jobs.

Unifying metadata governance across Amazon SageMaker and Collibra

Amazon Web Services (AWS) and Collibra have built a new integrated solution that demonstrates the integration between the Collibra Platform and the next generation of Amazon SageMaker. In this post, we take a closer look at the integration, describe the use cases it enables, walk through the architecture, and show how to implement the solution in your environment.

Compaction support for Avro and ORC file formats in Apache Iceberg tables in Amazon S3

In this post, we explore how Amazon S3 Tables has expanded its automatic compaction capabilities to include Avro and ORC file formats for Apache Iceberg tables, alongside the previously supported Parquet format. Through performance testing with over 20 billion events, the capability demonstrates significant query performance improvements ranging from 12% to 40% when using compacted tables compared to non-compacted tables across different file formats.

Introducing Jobs in Amazon SageMaker

This post demonstrates how the new jobs experience works in SageMaker Unified Studio.

Orchestrate data processing jobs, querybooks, and notebooks using visual workflow experience in Amazon SageMaker

Today, we are excited to launch a new visual workflows builder in SageMaker Unified Studio. With the new visual workflow experience, you don’t need to code the Python DAGs manually. Instead, you can visually define the orchestration workflow in SageMaker Unified Studio, and the visual definition is automatically converted to a Python DAG definition that is supported in Airflow.This post demonstrates the new visual workflow experience in SageMaker Unified Studio.

Revenue NSW modernises analytics with AWS, enabling unified and scalable data management, processing, and access

Revenue NSW, Australia’s principal revenue management agency, successfully modernized its analytics infrastructure using AWS services. In this blog post, we show how the organization transformed its on-premises data environment into a unified, scalable cloud-based solution using Amazon Redshift, AWS Database Migration Service, Amazon AppFlow, and AWS Glue.

Harnessing the Power of Nested Materialized Views and exploring Cascading Refresh

In this post, we explore how to maximize Amazon Redshift query performance through nested materialized views and implementing cascading refresh strategies. We demonstrate how to create materialized views based on other materialized views, enabling a hierarchical structure of precomputed results that significantly enhances query performance and data processing efficiency, particularly useful for reusing precomputed joins with different aggregate options.

Realizing ocean data democratization: Furuno Electric’s initiatives using Amazon DataZone

In this post, we explore how Furuno Electric built a comprehensive data management foundation using Amazon DataZone and other AWS services to transform from a traditional manufacturing company to a data-driven business.

Professional GIS interface showing Houston metropolitan vaccination clinics with topographic base map, toolbars, and database connectivity

Geospatial data lakes with Amazon Redshift

In this post, we review how to set up Redshift Serverless to use geospatial data contained within a data lake to enhance maps in ArcGIS Pro. This technique helps builders and GIS analysts use available datasets in data lakes and transform it in Amazon Redshift to further enrich the data before presenting it on a map.

Develop and monitor a Spark application using existing data in Amazon S3 with Amazon SageMaker Unified Studio

In this post, we demonstrate how to develop and monitor a Spark application using existing data in Amazon S3 using SageMaker Unified Studio. The solution addresses key challenges organizations face in managing big data analytics workloads through an integrated development environment where data teams can develop, test, and refine Spark applications while leveraging EMR Serverless for dynamic resource allocation and built-in monitoring tools.