How to discover and use Open Data on AWS Data Exchange
Data is at the center of many processes and products, whether it’s a large-scale dataset used to train machine learning models, a relational database, or an API-based integration. AWS Data Exchange lets you find, subscribe to, and use thousands of datasets delivered via Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and APIs offered by third-party data providers, such as Sinergise.
The AWS Open Data Sponsorship Program makes high-value, cloud-optimized datasets publicly available on Amazon Web Services (AWS). Now, all of those open datasets are discoverable on AWS Data Exchange, in addition to over 3,500 existing data products from more than 250 category-leading data providers across a variety of industries. You can now discover the data you need on AWS Data Exchange, whether it’s open, no-cost, or paid commercial data, in one place, and you can browse the data catalog at no cost, with or without an AWS account. This simplifies the lives of data subscribers, publishers, and IT administrators who must integrate and secure access to multiple third-party datasets, whether commercial or public.
In this blog post, Jeff, Mike, and I will show you how to discover and use no-cost open datasets on AWS Data Exchange. We will also show you how to enrich the open data with a paid dataset, import both datasets into Amazon SageMaker, and analyze them.
Sinergise is a Geospatial Information System (GIS) company building large, turnkey geospatial systems in the fields of cloud GIS, agriculture, and real-estate administration. Sinergise decided to tackle the technical challenge of processing Earth Observation (EO) data in its own way: by creating web services that make it possible for everyone to get data into their favorite GIS applications using the standard Web Map Service (WMS) and Web Coverage Service (WCS) protocols. They call this the Sentinel Hub. You can learn more about how Sinergise uses AWS to power the Sentinel Hub in this blog post.
About ESA WorldCover and Land Parcel Identification System datasets
For this example, we will use the European Space Agency (ESA) WorldCover dataset, which is managed by Vito, and the Land Parcel Identification System (LPIS) dataset published by Sinergise on AWS Data Exchange. ESA WorldCover is a global land cover map containing 11 land cover classes at 10 m spatial resolution, derived from the Sentinel-1 and Sentinel-2 ESA satellite missions. It is a public dataset available at no charge for anyone to use to understand global land cover patterns. LPIS is a commercial dataset provided by Sinergise containing parcel boundaries and crop information for Slovenia, adopted for use by the Ministry of Agriculture, Forestry and Food (MAFF). The ability to access both public and commercial data through a common discovery mechanism is highly advantageous for deriving new insights from colocated geospatial data.
Solution walkthrough: how to discover and use open data and commercial datasets on AWS Data Exchange
Here is how to discover publicly accessible open data on AWS Data Exchange and how you can enrich it with a paid dataset. In this walkthrough, Jeff, Mike, and I will search for and find both an open dataset and a paid dataset through AWS Data Exchange. We will then import these datasets into Amazon SageMaker, do a quick visualization of the data, and show how you can combine these datasets for additional insight.
For this walkthrough, you should have the following prerequisites:
- An AWS account
- Access to subscribe to datasets on AWS Data Exchange
- Access to Amazon SageMaker
- Access to AWS CLI
A. Find and acquire an open data dataset on AWS Data Exchange
- Navigate to the AWS Data Exchange Console. Log in using your AWS credentials.
- In the left sidebar, choose Browse catalog. In the search bar, enter WorldCover. Click on the Search button.
- You should see results similar to the following screenshot, which shows the AWS Data Exchange console with the results of a search for WorldCover. The ESA WorldCover dataset is highlighted in orange.
- To filter by no-cost datasets, in the left sidebar, under Contract types, choose Open Data licenses.
- Select the ESA WorldCover dataset managed by Vito.
- To view additional details on how you can access the data programmatically using the AWS CLI, scroll down to the Resources on AWS tab. To list the bucket contents and access the data directly from a command-line shell environment, copy and run the following command:
aws s3 ls --no-sign-request s3://esa-worldcover/
B. Find and subscribe to a commercial dataset on AWS Data Exchange
- In the search bar, enter Sinergise LPIS. Click on the Search button. You should see results similar to the following screenshot, which shows the AWS Data Exchange console with the results of a search for Sinergise LPIS. The Land Parcel Identification System – LPIS (Slovenia/Europe) dataset is highlighted in orange.
- To view commercial datasets, in the left sidebar, under Contract types, deselect Open Data licenses and select Standard Data Subscription Agreement.
- From the search results, select Land Parcel Identification System – LPIS (Slovenia/Europe). Here you can review the product details, pricing, and terms and conditions. To subscribe to and access this dataset, choose Continue to Subscribe. For details on how to subscribe to commercial datasets on AWS Data Exchange, see AWS Documentation.
- Once you have subscribed, you will see the Land Parcel Identification System – LPIS (Slovenia/Europe) under your entitled datasets.
- You can now export the data to your own S3 bucket. For instructions on how to export data to your own S3 bucket, see AWS Documentation.
C. Open a Jupyter notebook
To create a Jupyter notebook, do the following steps:
- Open a new SageMaker Notebook. From the AWS Management Console, navigate to the Amazon SageMaker service.
- In the left navigation bar, choose the drop-down for Notebook and then select Notebook Instances.
- Choose the orange Create notebook instance button.
- Enter a notebook name, change the instance type to t3.large, and add an IAM role with S3 permissions. Scroll to the bottom and choose the orange Create notebook instance button. You will then be redirected to the notebook overview screen where you will see your new notebook initializing. This process can take a few minutes.
- Once the notebook has initialized, to open it, choose the blue Open Jupyter button on the right side of the instance.
- In the top right corner, choose New. Select conda_python3. This drops you into your notebook so you can begin importing and working with your datasets.
D. Import open data and paid datasets into Amazon SageMaker
You can now download the ESA WorldCover virtual raster file, which has a .vrt file extension. This virtual raster file is essentially an index of all the ESA WorldCover imagery that is stored in the ESA WorldCover S3 bucket. This file is designed to be easily imported into common desktop geospatial information system applications as well as into your Jupyter notebook.
- You can access and download directly from the ESA WorldCover S3 bucket because it is an open S3 bucket that is publicly accessible. To download the .vrt file into your Jupyter notebook, in the notebook, run the following commands (the object key prefix shown is an assumption; confirm it with a bucket listing first):
import boto3
s3 = boto3.resource('s3')
b = s3.Bucket('esa-worldcover')
b.download_file('v100/2020/ESA_WorldCover_10m_2020_v100_Map_AWS.vrt', 'ESA_WorldCover_10m_2020_v100_Map_AWS.vrt')
- Install Rasterio, a popular (BSD 3-Clause) Python library for raster analysis and visualization. To do that, in your Jupyter notebook, run the following command:
!pip install rasterio
- Because this is a large raster, loading the entire .vrt file and plotting it would take 6.79 TiB of memory. A u-24tb1.metal EC2 instance does go up to 24 TiB of memory, but for efficiency, use a window function to show a snippet of the VRT. We chose a random window location, specified by the four comma-separated numbers in the following window parameters: column offset, row offset, width, and height. This code executes quickly because it does not read the entire dataset; the window and the VRT work together to go directly to the image or images in S3 and pull out only the pixels needed for this view. To use this window function, run the following commands:
import rasterio
from rasterio.windows import Window
from rasterio.plot import show
src = rasterio.open("/home/ec2-user/SageMaker/ESA_WorldCover_10m_2020_v100_Map_AWS.vrt")
w = src.read(window=Window(3200000, 620000, 10000, 10000))
show(w)
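To see where the 6.79 TiB figure comes from, here is a quick back-of-the-envelope check. The mosaic dimensions used are assumptions derived from WorldCover's 1/12000-degree pixel size and its 144 degrees of latitude coverage, not values taken from the dataset itself:

```python
# WorldCover stores one byte (a uint8 class code) per 10 m pixel at a
# resolution of 1/12000 degree. Assumed full-mosaic dimensions:
cols = 360 * 12_000   # 4,320,000 columns for 360 degrees of longitude
rows = 144 * 12_000   # 1,728,000 rows for 144 degrees of latitude
full_tib = cols * rows / 2**40
print(f"full mosaic: {full_tib:.2f} TiB")   # ~6.79 TiB

# The 10,000 x 10,000 window, by contrast, is tiny:
window_mib = 10_000 * 10_000 / 2**20
print(f"window: {window_mib:.0f} MiB")      # ~95 MiB
```

This is why the windowed read returns in seconds while a full read would exhaust even the largest instance types' memory.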
The following image depicts the ESA WorldCover dataset with pixel color variations from yellow, blue and green showing the different categories of land classification to represent trees, grassy plains, and snow.
- To work with the vector data, install Geopandas (BSD 3-Clause), a popular Python framework for vector analysis. To install it, run the following command:
!pip install geopandas
- To pull the vector data into your notebook, use the LPIS dataset you subscribed to in step B and grab the name of the S3 bucket you exported the data to in step B.5. Note that it is best practice to confirm the bucket owner when downloading data; you can use the list-buckets API to confirm you are calling the correct bucket. Next, redeclare the b variable from the boto3 S3 resource you set up in step D.1 by running the following command:
b = s3.Bucket('your-bucket-name')
E. Adjust Coordinate Reference System, merge, and visualize your datasets
Now that you have pulled the data down, you can reproject the LPIS vectors to match the coordinate reference system (CRS) of your ESA WorldCover raster and plot them. To adjust the coordinate reference system and merge the LPIS vector dataset with the ESA WorldCover raster dataset, do the following:
- To reproject the LPIS data to the raster's CRS (EPSG:4326) and plot it, in your Jupyter notebook, run the following commands:
import geopandas
GeopandaDataFrame = geopandas.read_file('lpis.shp')
GdfCrs4326 = GeopandaDataFrame.to_crs(4326)
GdfCrs4326.plot()  # preview; save with GdfCrs4326.to_file('Lpis4326.shp') for the merge step
You should receive an output similar to the following screenshot, which shows 14.450 to 14.575 on the x-axis and 45.94 to 46.02 on the y-axis, with blue shaded plots in the main pane.
- You can now combine your datasets for further insights. To make the workflow a bit simpler, download the specific WorldCover tile that overlaps the LPIS data so you can merge your datasets. To do that, run the following commands (the object key prefix shown is an assumption; confirm it with a bucket listing first):
b = s3.Bucket('esa-worldcover')
b.download_file('v100/2020/map/ESA_WorldCover_10m_2020_v100_N45E012_Map.tif', 'ESA_WorldCover_10m_2020_v100_N45E012_Map.tif')
- To merge your datasets and output a new image you can use for analysis, run the following commands:
import fiona
import rasterio
from rasterio.mask import mask

# Read the CRS-adjusted LPIS geometries
with fiona.open("Lpis4326.shp", "r") as shapefile:
    geometry = [feature["geometry"] for feature in shapefile]

# Mask the WorldCover tile with the parcel geometries
with rasterio.open("ESA_WorldCover_10m_2020_v100_N45E012_Map.tif") as raster:
    image, transform = mask(raster, geometry, invert=True)
    meta = raster.meta.copy()

with rasterio.open("Merged.tif", "w", **meta) as final:
    final.write(image)
- To compare the merged image with the original raster side by side, run the following commands:
# Window takes (col_off, row_off, width, height)
col_off = 30000
row_off = 24000
width = 500
height = 500
merged = rasterio.open("Merged.tif")
mergedwindow = merged.read(window=Window(col_off, row_off, width, height))
original = rasterio.open("ESA_WorldCover_10m_2020_v100_N45E012_Map.tif")
originalwindow = original.read(window=Window(col_off, row_off, width, height))
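The side-by-side comparison itself can be rendered with matplotlib. This is a sketch only: random arrays stand in for the mergedwindow and originalwindow arrays read above, so it runs without the raster files present.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

# Stand-ins for the (bands, rows, cols) arrays returned by rasterio's read()
mergedwindow = np.random.randint(0, 101, (1, 500, 500), dtype=np.uint8)
originalwindow = np.random.randint(0, 101, (1, 500, 500), dtype=np.uint8)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.imshow(mergedwindow[0])
ax1.set_title("Merged Layers")
ax2.imshow(originalwindow[0])
ax2.set_title("Original WorldCover")
fig.savefig("comparison.png")
```

In the notebook, substitute the real window arrays for the random stand-ins to reproduce the screenshots described next.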
You should receive an output similar to the following screenshot. It shows two images. The first is titled Merged Layers and shows 0-400 on the x-axis and 400 to 0 on the y-axis with a dark blue land plot highlighted with areas of fluorescent green showing the variations in land features. You can see the CRS-adjusted LPIS dataset overlaid on top of the ESA WorldCover image. The second image titled Original WorldCover has the same axes labels but is primarily shades of lighter blue and green, with the same fluorescent green highlighted areas. There is less variation in the pixel colors and land features because the LPIS dataset is sitting on top of the WorldCover dataset and masking the features.
In the first, merged image, it is easier to discern additional features such as property lines on top of the WorldCover dataset, which shows terrain features. This example shows how easy it is to pull in multiple data sources from AWS Data Exchange and perform analytics with them using AWS services such as SageMaker.
In this blog post, Jeff, Mike, and I showed you how to find and use open and publicly accessible data and enrich it with a paid dataset, both available in AWS Data Exchange. AWS Data Exchange now has over 300 publicly accessible, open datasets available, along with over 3,000 existing commercial, no-cost, or paid data products from more than 250 category-leading data providers across a variety of industries.
Follow these steps to tear down the environment you just created.
- Navigate to the Amazon SageMaker console.
- On the left navigation bar, choose Notebook instances.
- Select the notebook you created.
- In the top right, choose Actions.
- Choose Delete.
If you’re looking for open data on AWS Data Exchange, check out the new AWS Data Exchange for Open Data documentation to learn more about discovering and using open data. If you’re an open data provider, check out the publishing documentation to learn more about publishing open datasets on the AWS Data Exchange catalog.
About the Authors
Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has over 12 years of experience helping organizations derive insights from their data.
Jeff Demuth is a solutions architect who joined Amazon Web Services (AWS) in 2016. He focuses on the geospatial community and is passionate about geographic information systems (GIS) and technology. Outside of work, Jeff enjoys traveling, building Internet of Things (IoT) applications, and tinkering with the latest gadgets.
Mike Jeffe is a Technical Business Development Manager on the Open Data Team at AWS. As the Geospatial lead he supports global initiatives that leverage open geospatial datasets made available through the Open Data Program. He has over 18 years of experience working with customers using geospatial data and technologies.