Decrease geospatial query latency from minutes to seconds using Zarr on Amazon S3

Geospatial data, including many climate and weather datasets, are often released by government and nonprofit organizations in compressed file formats such as the Network Common Data Form (NetCDF) or GRIdded Binary (GRIB). Many of these formats were designed in a pre-cloud world for users to download whole files for analysis on a local machine. As the complexity and size of geospatial datasets continue to grow, it is more time- and cost-efficient to leave the files in one place, virtually query the data, and download only the subset that is needed locally.

Unlike legacy file formats, the cloud-native Zarr format is designed for virtual and efficient access to compressed chunks of data saved in a central location such as Amazon Simple Storage Service (Amazon S3) from Amazon Web Services (AWS).

In this walkthrough, learn how to convert NetCDF datasets to Zarr using an Amazon SageMaker notebook and an AWS Fargate cluster and query the resulting Zarr store, reducing the time required for time series queries from minutes to seconds.

Legacy geospatial data formats

Geospatial raster data files associate data points with cells in a latitude-longitude grid. At a minimum, this type of data is two-dimensional but grows to N-dimensional if, for example, multiple data variables are associated with each cell (for example, temperature and windspeed) or variables are recorded over time or with an additional dimension such as altitude.

Over the years, a number of file formats have been devised to store this type of data. The GRIB format created in 1985 stores data as a collection of compressed two-dimensional records, with each record corresponding to a single time step. However, GRIB files do not contain metadata that supports accessing records directly; retrieving a time series for a single variable requires sequential access to every single record.

NetCDF was introduced in 1990 and improved on some of the shortcomings of GRIB. NetCDF files contain metadata that supports more efficient data indexing and retrieval, including opening files remotely and virtually browsing their structure without loading everything into memory. NetCDF data can be stored as chunks within a file, enabling parallel reads and writes on different chunks. As of 2002, NetCDF4 uses the Hierarchical Data Format version 5 (HDF5) as a backend, improving the speed of parallel I/O on chunks. However, the query performance of NetCDF4 datasets can still be limited by the fact that multiple chunks are contained within a single file.

Cloud-native geospatial data with Zarr

The Zarr file format organizes N-dimensional data as a hierarchy of groups (datasets) and arrays (variables within a dataset). Each array’s data is stored as a set of compressed, binary chunks, and each chunk is written to its own file.

Using Zarr with Amazon S3 enables chunk access with millisecond latency. Chunk reads and writes can run in parallel and be scaled as needed, either via multiple threads or processes from a single machine or from distributed compute resources such as containers running in Amazon Elastic Container Service (Amazon ECS). Storing Zarr data in Amazon S3 and applying distributed compute not only avoids data download time and costs but is often the only option for analyzing geospatial data that is too large to fit into the RAM or on the disk of a single machine.

Finally, the Zarr format includes accessible metadata that allows users to remotely open and browse datasets and download only the subset of data they need locally, such as the output of an analysis. The figure below shows an example of metadata files and chunks included in a Zarr group containing one array.

Figure 1. Zarr group containing one Zarr array.

Zarr is tightly integrated with other Python libraries in the open-source Pangeo technology stack for analyzing scientific data, such as Dask for coordinating distributed, parallel compute jobs and xarray as a an interface for working with labeled, N-dimensional array data.

Zarr performance benefits

The Registry of Open Data on AWS contains both NetCDF and Zarr versions of many widely-used geospatial datasets, including the ECMWF Reanalysis v5 (ERA5) dataset. ERA5 contains hourly air pressure, windspeed, and air temperature data at multiple altitudes on a 31 km latitude-longitude grid, since 2008. We compared the average time to query one, five, nine, and 13 years of historical data for these three variables, at a given latitude and longitude, from both the NetCDF and Zarr versions of ERA5.

Length of time series	Avg. NetCDF query time (seconds)	Avg. Zarr query time (seconds)
1 year	23.8	1.3
5 years	82.7	2.4
9 years	135.8	3.5
13 years	193.1	4.7

Figure 2. Comparison of the average time to query three ERA5 variables using a Dask cluster with 10 workers (8 vCPU, 16 GM memory.

Note that these tests are not a true one-to-one comparison of file formats since the size of the chunks used by each format differs, and results can also vary depending on network conditions. However, they are in line with benchmark studies that show a significant performance advantage for Zarr compared to NetCDF.

While the Registry of Open Data on AWS contains Zarr versions of many popular geospatial datasets, the Zarr chunking strategy of those datasets may not be optimized for an organization’s data access patterns, or organizations may have substantial internal datasets in NetCDF format. In these cases, they may wish to convert from NetCDF files to Zarr.

Moving to Zarr

There are several options for converting NetCDF files to Zarr: a Pangeo-forge recipe, using the Python rechunker library, and using xarray.

The Pangeo-forge project provides premade recipes that may enable you to convert some GRIB or NetCDF datasets to Zarr without diving into the low-level details of the specific formats.

For more control of the conversion process, you can use the rechunker and xarray libraries. Rechunker enables efficient manipulation of chunked arrays saved in persistent storage. Because rechunker uses an intermediate file store and is designed for parallel execution, it is useful for converting datasets larger than working memory from one file format to another (such as from NetCDF to Zarr), while simultaneously changing the underlying chunk sizes. This makes it ideal for converting large NetCDF datasets to an optimized Zarr format. Xarray is useful for converting smaller datasets to Zarr or appending data from NetCDF files to an existing Zarr store, such as additional data points in a time series (which rechunker currently does not support).

Finally, while technically not an option for converting from NetCDF to Zarr, the Kerchunk python library creates index files that allow a NetCDF dataset to be accessed as if it is a Zarr store, yielding significant performance improvements relative to NetCDF alone without creating a new Zarr dataset.

The rest of this blog post provides step-by-step instructions for using rechunker and xarray to convert a NetCDF dataset to Zarr.

Overview of solution

This solution is deployed within an Amazon Virtual Private Cloud (Amazon VPC) with an internet gateway attached. Within the private subnet, it deploys both a Dask cluster running on AWS Fargate and the Amazon SageMaker notebook instance that will be used to submit jobs to the Dask cluster. In the public subnet, a NAT Gateway provides internet access for connections from the Dask cluster or the notebook instance (for example, when installing libraries). Finally, a Network Load Balancer (NLB) enables access to the Dask dashboard for monitoring the progress of Dask jobs. Resources within the VPC access Amazon S3 via an Amazon S3 VPC Endpoint.

Figure 3. Architecture diagram of the solution.

Figure 3. Architecture diagram of the solution.

Solution walkthrough

The following is a high-level overview of the steps required to perform this walkthrough, which we present in more detail:

Clone the GitHub repository.
Deploy the infrastructure using the AWS Cloud Development Kit (AWS CDK).
(Optional) Enable the Dask dashboard.
Set up the Jupyter notebook on the SageMaker notebook instance.
Use the Jupyter notebook to convert NetCDF files to Zarr.

The first four steps are mostly automated and should take roughly 45 minutes. Once that setup is complete, it should take 30 mins to run the cells in the Jupyter notebook.

Prerequisites

Before starting this walkthrough, you should have the following prerequisites:

An AWS account
AWS Identity and Access Management (IAM) permissions to follow all of the steps in this guide
Python3 and the AWS CDK installed on the computer you will deploy the solution from.

Deploying the solution

Step 1: Clone the GitHub repository

In a terminal, run the following code to clone the GitHub repository containing the solution.

git clone https://github.com/aws-samples/convert-netcdf-to-zarr.git

Change to the root directory of the repository.

cd convert-netcdf-to-zarr

Step 2: Deploy the infrastructure

From the root directory, run the following code to create and activate a virtual environment and install the required libraries. (If running from a Windows machine, you will need to activate the environment differently.)

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Then, run the code below to deploy the infrastructure.

cdk deploy

This step first builds the Docker image that will be used by the Dask cluster. When the Docker image is finished, the CDK creates a ConvertToZarr CloudFormation stack with the infrastructure described above. Enter “y” when asked if you wish to deploy. This process should take 20–30 minutes.

Step 3: (Optional) Enable the Dask dashboard

The Dask dashboard provides real-time visualizations and metrics for Dask distributed computing jobs. The dashboard is not required for the Jupyter notebook to work, but is helpful for visualizing jobs and troubleshooting. The Dask dashboard runs on a Dask-Scheduler task port within the private subnet, so we use an NLB to forward requests from the internet to the dashboard.

First, copy the private IP address of the Dask-Scheduler:

1. Sign in to the Amazon ECS console.

2. Choose the Dask-Cluster.

3. Select the Tasks tab.

4. Choose the task with the Dask-Scheduler task definition.

5. Copy the Private IP address of the task in the Configuration tab of the Configuration section.

Now, add the IP address as the target of the NLB.

1. Sign in to the Amazon EC2 console.

2. In the navigation pane, under Load Balancing, choose Target Groups.

3. Choose the target group that starts with Conver-dasks.

4. Select Register targets.

5. In Step 1, make sure that the Convert-To-Zarr VPC is selected. For Step 2, paste the private IP address of the Dask-Scheduler. The port stays as 8787. Select Include as pending below.

6. Scroll down to the bottom of the page. Select Register pending targets. In a few minutes, the target’s health status should change from Pending to Healthy, indicating that you can access the Dask dashboard from a web browser.

Finally, copy the DNS name of the NLB.

1. Sign in to the Amazon EC2 console.

2. In the navigation pane, under Load Balancing, choose Load Balancers.

3. Find the load balancer that begins with Conve-daskd and copy the DNS name.

4. Paste the DNS name into a web browser to load the (empty) Dask dashboard.

Step 4: Set up the Jupyter notebook

We need to clone the GitHub repository on the SageMaker instance and create the kernel for the Jupyter notebook.

1. Sign in to the SageMaker console.

2. In the navigation pane, under Notebook, choose Notebook instances.

3. Find the Convert-To-Zarr-Notebook and select Open Jupyter.

4. On the Jupyter home page, from the New menu at the top right, select Terminal.

5. In the terminal, run the following commands. First, change to the SageMaker directory:

cd SageMaker

Next, clone the repository:

git clone https://github.com/aws-samples/convert-netcdf-to-zarr.git

Change to the notebooks directory within the repository:

cd convert-netcdf-to-zarr/notebooks

6. Finally, create the Jupyter kernel as a new conda environment.

conda env create --name zarr_py310_nb -f environment.yml

This should take 10-15 minutes.

Step 5: Convert NetCDF dataset to Zarr

On the SageMaker instance, in the convert-netcdf-to-zarr/notebooks folder, open Convert-NetCDF-to-Zarr.ipynb. This Jupyter notebook contains all the steps required to convert hourly data from the NASA MERRA-2 dataset (available from the Registry of Open Data on AWS) from NetCDF to Zarr, using both rechunker and xarray. Let’s look at some key lines of code.

The code below registers the notebook as a client of the Dask cluster.

DASK_SCHEDULER_URL = "Dask-Scheduler.local-dask:8786"
client = Client(DASK_SCHEDULER_URL)

The notebook then builds a list of two months of daily MERRA-2 NetCDF files (nc_files_map) and opens this dataset with array, using the Dask cluster.

ds_nc = xr.open_mfdataset(nc_files_map, engine='h5netcdf', chunks={}, parallel=True)

The chunks parameter tells xarray to store the data in memory as Dask arrays, and parallel=True tells Dask to open the NetCDF files in parallel on the cluster.

The notebook focuses on converting data for one variable in the dataset, T2M (air temperature at 2 meters), from NetCDF to Zarr. The xarray output for T2M shows the NetCDF chunk sizes are (24, 361, 576).

Figure 4. Xarray output for T2M.

Figure 4. Xarray output for T2M.

Querying the NetCDF files for two months of T2M data for a given latitude and longitude takes 1–2 minutes. If you have enabled the Dask dashboard, you can watch the progress of the query on the Dask cluster.

Figure 5. Dask dashboard.

Figure 5. Dask dashboard.

Next, before starting the Zarr conversion the notebook creates a dictionary with new chunk sizes for the Zarr store (new_chunksizes_dict).

# returns [“time”, “lat”, “lon”]
dims = ds_nc.dims.keys()

new_chunksizes = (1080, 90, 90)
new_chunksizes_dict = dict(zip(dims, new_chunksizes))

Relative to NetCDF chunk sizes, the new Zarr chunk sizes (1080, 90, 90) are larger in the time dimension to decrease the number of chunks read during a time series query and smaller in the other two dimensions to keep the size and number of chunks created at appropriate levels.

The code block below shows the call to the rechunk function. The key parameters passed to the function are the NetCDF dataset to convert (ds_nc), the target chunk sizes (new_chunksizes_dict) for each variable to convert (var_name, which in this case equals T2M), the S3 URI of the Zarr store that will be created (zarr_store).

ds_zarr = rechunk(
    ds_nc, 
    target_chunks={
      var_name: new_chunksizes_dict, 
      'time': None,'lat': None,'lon': None
    },
    max_mem='15GB',
    target_store = zarr_store, 
    temp_store = zarr_temp
).execute()

The Jupyter notebook contains a more detailed discussion of both chunking strategy and parameters passed to rechunk. The conversion takes approximately five minutes.

Finally, the notebook code shows how to use xarray to append one week’s worth of additional data, one day at a time, to the Zarr store that was just created. This should take roughly 10 minutes. The file structure of the resulting Zarr store on Amazon S3 is shown below.

Figure 6. Zarr store file structure on Amazon S3.

Figure 6. Zarr store file structure on Amazon S3.

The result? Running the same query to plot two months of T2M Zarr data for the specified latitude and longitude takes less than a second, compared to roughly 1–2 minutes from a NetCDF dataset.

Cleaning up

When you’re finished, you can avoid unwanted charges and remove the AWS resources created with this project by running the following code from the root directory of the repository:

cdk destroy

This will remove all resources created by the project except for the S3 bucket. To delete the bucket, log into the AWS S3 Console. Select the bucket and choose Empty. Once the bucket is empty, select the bucket again and choose Delete.

Conclusion

As geospatial datasets continue to grow in size, the performance benefits of using cloud-native data formats such as Zarr with object storage such as Amazon S3 increase as well. While converting terabytes or more of legacy data to Zarr can be a significant undertaking, this post provides step-by-step instructions to jumpstart the process and unlock the significant performance improvements available from using Zarr.

Additional resources

For more examples of processing array-based geospatial data on AWS, see:

Finally, Geospatial ML with Amazon SageMaker (now in preview) allows customers to efficiently ingest, transform, and train a model on geospatial data, including satellite imagery and location data. It includes built-in map-based visualizations of model predictions, all available within an Amazon SageMaker Studio notebook.

Subscribe to the AWS Public Sector Blog newsletter to get the latest in AWS tools, solutions, and innovations from the public sector delivered to your inbox, or contact us.

Please take a few minutes to share insights regarding your experience with the AWS Public Sector Blog in this survey, and we’ll use feedback from the survey to create more content aligned with the preferences of our readers.