AWS Machine Learning Blog

Map Earth’s vegetation in under 20 minutes with Amazon SageMaker

In today’s rapidly changing world, monitoring the health of our planet’s vegetation is more critical than ever. Vegetation plays a crucial role in maintaining ecological balance, providing sustenance, and acting as a carbon sink. Traditionally, monitoring vegetation health has been a daunting task. Methods such as field surveys and manual satellite data analysis are time-consuming and require significant resources and domain expertise. The resulting delays in data collection and analysis make it difficult to track and respond swiftly to environmental changes. Furthermore, the high costs associated with these methods limit their accessibility and frequency, hindering comprehensive, ongoing vegetation monitoring at a planetary scale. In light of these challenges, we have developed a solution that streamlines vegetation monitoring on a global scale.

Transitioning from the traditional, labor-intensive methods of monitoring vegetation health, Amazon SageMaker geospatial capabilities offer a streamlined, cost-effective solution. Amazon SageMaker supports geospatial machine learning (ML) capabilities, allowing data scientists and ML engineers to build, train, and deploy ML models using geospatial data. These geospatial capabilities open up a new world of possibilities for environmental monitoring. With SageMaker, users can access a wide array of geospatial datasets, efficiently process and enrich this data, and accelerate their development timelines. Tasks that previously took days or even weeks to accomplish can now be done in a fraction of the time.

In this post, we demonstrate the power of SageMaker geospatial capabilities by mapping the world’s vegetation in under 20 minutes. This example highlights not only the efficiency of SageMaker, but also how geospatial ML can be used to monitor the environment for sustainability and conservation purposes.

Identify areas of interest

We begin by illustrating how SageMaker can be applied to analyze geospatial data at a global scale. To get started, we follow the steps outlined in Getting Started with Amazon SageMaker geospatial capabilities. We start with the specification of the geographical coordinates that define a bounding box covering the areas of interest. This bounding box acts as a filter to select only the relevant satellite images that cover the Earth’s land masses.

import os
import json
import time
import boto3
import geopandas
from shapely.geometry import Polygon
import leafmap.foliumap as leafmap
import sagemaker
import sagemaker_geospatial_map

session = boto3.Session()
execution_role = sagemaker.get_execution_role()
sg_client = session.client(service_name="sagemaker-geospatial")
coordinates = [
    [-179.034845, -55.973798],
    [179.371094, -55.973798],
    [179.371094, 83.780085],
    [-179.034845, 83.780085],
    [-179.034845, -55.973798]
]
polygon = Polygon(coordinates)
world_gdf = geopandas.GeoDataFrame(index=[0], crs='epsg:4326', geometry=[polygon])
m = leafmap.Map(center=[37, -119], zoom=4)
m.add_basemap('Esri.WorldImagery')
m.add_gdf(world_gdf, layer_name="AOI", style={"color": "red"})
m

Sentinel 2 coverage of Earth's land mass
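
The same five-point ring is written out twice in this walkthrough: once for the map preview above and again in the search query. As a minimal sketch, the ring can be derived from the corner coordinates so that both places stay in sync. The helper name `bbox_to_ring` is hypothetical and is not part of SageMaker, Shapely, or GeoPandas.

```python
def bbox_to_ring(min_lon, min_lat, max_lon, max_lat):
    """Return a closed ring of [lon, lat] pairs for a bounding box.

    The ring starts at the south-west corner, visits the remaining three
    corners, and repeats the first point so that the polygon is closed.
    """
    return [
        [min_lon, min_lat],
        [max_lon, min_lat],
        [max_lon, max_lat],
        [min_lon, max_lat],
        [min_lon, min_lat],
    ]


# Corner values taken from the bounding box used in this post.
ring = bbox_to_ring(-179.034845, -55.973798, 179.371094, 83.780085)
```

Under these assumptions, `Polygon(ring)` would reproduce the preview polygon, and `[ring]` matches the nesting used under `PolygonGeometry` and `Coordinates` in the search request.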

Data acquisition

SageMaker geospatial capabilities provide access to a wide range of public geospatial datasets, including Sentinel-2, Landsat 8, Copernicus DEM, and NAIP. For our vegetation mapping project, we’ve selected Sentinel-2 for its global coverage and update frequency. The Sentinel-2 satellites capture images of Earth’s land surface at a resolution of 10 meters every 5 days. We pick the first week of December 2023 in this example. To make sure most of the land surface in each image is visible, we filter for images with less than 10% cloud coverage. This way, our analysis is based on clear and reliable imagery.

search_rdc_args = {
    "Arn": "arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8", # sentinel-2 L2A
    "RasterDataCollectionQuery": {
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {
                "PolygonGeometry": {
                    "Coordinates": [
                        [
                            [-179.034845, -55.973798],
                            [179.371094, -55.973798],
                            [179.371094, 83.780085],
                            [-179.034845, 83.780085],
                            [-179.034845, -55.973798]
                        ]
                    ]
                }
            }
        },
        "TimeRangeFilter": {
            "StartTime": "2023-12-01T00:00:00Z",
            "EndTime": "2023-12-07T23:59:59Z",
        },
        "PropertyFilters": {
            "Properties": [{"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 10}}}],
            "LogicalOperator": "AND",
        },
    }
}

s2_items = []
s2_tile_ids = []
s2_geometries = {
    'id': [],
    'geometry': [],
}
while search_rdc_args.get("NextToken", True):
    search_result = sg_client.search_raster_data_collection(**search_rdc_args)
    for item in search_result["Items"]:
        s2_id = item['Id']
        s2_tile_id = s2_id.split('_')[1]
        # skip images whose tile (and therefore area) has already been collected
        if s2_tile_id not in s2_tile_ids:
            s2_tile_ids.append(s2_tile_id)
            s2_geometries['id'].append(s2_id)
            s2_geometries['geometry'].append(Polygon(item['Geometry']['Coordinates'][0]))
            del item['DateTime']
            s2_items.append(item)  

    search_rdc_args["NextToken"] = search_result.get("NextToken")

print(f"{len(s2_items)} unique Sentinel-2 images found.")
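
The loop above keeps one image per tile by reading the tile ID from the second underscore-separated field of the item ID. The following sketch isolates that step on made-up IDs (they are illustrative, not results from the query). It uses a set for the membership test, which is a swapped-in choice rather than the list used above; the items kept are the same.

```python
def unique_by_tile(item_ids):
    """Keep the first item seen for each tile ID, preserving input order."""
    seen_tiles = set()
    kept = []
    for item_id in item_ids:
        tile_id = item_id.split("_")[1]  # e.g. "10SEG" in "S2B_10SEG_..."
        if tile_id not in seen_tiles:
            seen_tiles.add(tile_id)
            kept.append(item_id)
    return kept


# Hypothetical item IDs, for illustration only.
sample_ids = [
    "S2B_10SEG_20231201_0_L2A",
    "S2A_10SEG_20231206_0_L2A",  # same tile, later acquisition: dropped
    "S2A_11SKA_20231203_0_L2A",
]
kept_ids = unique_by_tile(sample_ids)
```

Because the first image found for a tile wins, which acquisition date ends up in the final set depends on the order in which the search returns items.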

By using the search_raster_data_collection function from SageMaker geospatial, we identified 8,581 unique Sentinel-2 images taken in the first week of December 2023. To validate the accuracy of our selection, we plotted the footprints of these images on a map, confirming that we had the correct images for our analysis.

s2_gdf = geopandas.GeoDataFrame(s2_geometries)
m = leafmap.Map(center=[37, -119], zoom=4)
m.add_basemap('OpenStreetMap')
m.add_gdf(s2_gdf, layer_name="Sentinel-2 Tiles", style={"color": "blue"})
m

Sentinel 2 image footprints

SageMaker geospatial processing jobs

When querying data with SageMaker geospatial capabilities, we received comprehensive details about our target images, including the data footprint, properties of the spectral bands, and hyperlinks for direct access. With these hyperlinks, we can bypass the traditional memory- and storage-intensive approach of first downloading and then processing images locally, a task made even more daunting by the size of our dataset, which spans over 4 TB. Each of the more than 8,000 images has multiple channels and is approximately 500 MB in size. Processing multiple terabytes of data on a single machine would be time-prohibitive. Although setting up a processing cluster is an alternative, it introduces its own set of complexities, from data distribution to infrastructure management. SageMaker geospatial streamlines this with Amazon SageMaker Processing. We use the purpose-built geospatial container with SageMaker Processing jobs for a simplified, managed experience to create and run a cluster. With just a few lines of code, you can scale out your geospatial workloads with SageMaker Processing jobs. You simply specify a script that defines your workload, the location of your geospatial data on Amazon Simple Storage Service (Amazon S3), and the geospatial container. SageMaker Processing provisions cluster resources for you to run city-, country-, or continent-scale geospatial ML workloads.

For our project, we use 25 clusters, each comprising 20 instances, to scale out our geospatial workload. We divide the 8,581 images into 25 batches, one per cluster, so each batch contains approximately 340 images. Each batch is then distributed evenly across the machines in its cluster. All batch manifests are uploaded to Amazon S3, ready for the processing jobs.

def s2_item_to_relative_metadata_url(item):
    parts = item["Assets"]["visual"]["Href"].split("/")
    tile_prefix = parts[4:-1]
    return "{}/{}.json".format("/".join(tile_prefix), item["Id"])


num_jobs = 25
num_instances_per_job = 20 # maximum 20

manifest_list = {}
for idx in range(num_jobs):
    manifest = [{"prefix": "s3://sentinel-cogs/sentinel-s2-l2a-cogs/"}]
    manifest_list[idx] = manifest
# split the manifest for N processing jobs
for idx, item in enumerate(s2_items):
    job_idx = idx%num_jobs
    manifest_list[job_idx].append(s2_item_to_relative_metadata_url(item))
    
# upload the manifest to S3
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()
s3_prefix = 'processing_job_demo'
s3_client = boto3.client("s3")
s3 = boto3.resource("s3")

manifest_dir = "manifests"
os.makedirs(manifest_dir, exist_ok=True)

for job_idx, manifest in manifest_list.items():
    manifest_file = f"{manifest_dir}/manifest{job_idx}.json"
    s3_manifest_key = s3_prefix + "/" + manifest_file
    with open(manifest_file, "w") as f:
        json.dump(manifest, f)

    s3_client.upload_file(manifest_file, s3_bucket_name, s3_manifest_key)
    print("Uploaded {} to {}".format(manifest_file, s3_manifest_key))
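
To make the batch sizes concrete, the following sketch repeats the round-robin assignment (`idx % num_jobs`) on plain integers instead of image records. The per-instance figure assumes each batch is shared out perfectly evenly across its 20 instances, which is an idealization.

```python
def round_robin(items, num_batches):
    """Assign item i to batch i % num_batches, as the manifest loop does."""
    batches = [[] for _ in range(num_batches)]
    for idx, item in enumerate(items):
        batches[idx % num_batches].append(item)
    return batches


num_images = 8581  # image count reported in this post
num_jobs = 25
num_instances_per_job = 20

batches = round_robin(range(num_images), num_jobs)
batch_sizes = sorted({len(b) for b in batches})
# Ceiling division: the most images any single instance would receive if a
# batch were shared out perfectly evenly (an idealization).
max_images_per_instance = -(-max(batch_sizes) // num_instances_per_job)

print(batch_sizes)              # [343, 344]
print(max_images_per_instance)  # 18
```

Round-robin assignment keeps the batches within one image of each other, so no single job becomes a straggler because of an oversized manifest.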

With our input data ready, we now turn to the core analysis that will reveal insights into vegetation health through the Normalized Difference Vegetation Index (NDVI). NDVI is calculated as the difference between the near-infrared (NIR) and red reflectances divided by their sum, yielding values that range from -1 to 1. Higher NDVI values signal dense, healthy vegetation, values near zero indicate little or no vegetation, and negative values usually point to water bodies. This index serves as a critical tool for assessing vegetation health and distribution. The following is an example of what NDVI looks like.

Sentinel 2 true color image and NDVI
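
As a quick per-pixel illustration of the definition, NDVI = (NIR - Red) / (NIR + Red), the sketch below evaluates the index for three surface types. The reflectance values are invented for illustration and are not measurements from this post; returning `None` for a zero denominator is a choice made for this sketch.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    Returns None when both reflectances are zero, because the ratio is
    undefined there.
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


# Illustrative reflectance values, not measurements from this post.
dense_vegetation = ndvi(nir=0.50, red=0.10)  # strongly positive
bare_surface = ndvi(nir=0.20, red=0.20)      # zero
open_water = ndvi(nir=0.02, red=0.05)        # negative
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, which is why the first case lands well above zero while water falls below it.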

%%writefile scripts/compute_vi.py

import os
import rioxarray
import json
import gc
import warnings

warnings.filterwarnings("ignore")

if __name__ == "__main__":
    print("Starting processing")

    input_path = "/opt/ml/processing/input"
    output_path = "/opt/ml/processing/output"
    input_files = []
    items = []
    for current_path, sub_dirs, files in os.walk(input_path):
        for file in files:
            if file.endswith(".json"):
                # current_path returned by os.walk already starts with input_path
                full_file_path = os.path.join(current_path, file)
                input_files.append(full_file_path)
                with open(full_file_path, "r") as f:
                    items.append(json.load(f))

    print("Received {} input files".format(len(input_files)))

    for item in items:
        print("Computing NDVI for {}".format(item["id"]))
        red_band_url = item["assets"]["red"]["href"]
        nir_band_url = item["assets"]["nir"]["href"]
        scl_mask_url = item["assets"]["scl"]["href"]
        red = rioxarray.open_rasterio(red_band_url, masked=True)
        nir = rioxarray.open_rasterio(nir_band_url, masked=True)
        scl = rioxarray.open_rasterio(scl_mask_url, masked=True)
        scl_interp = scl.interp(
            x=red["x"], y=red["y"], method="nearest"
        )  # resample SCL to the Red/NIR grid; nearest neighbor keeps class labels intact

        # mask out cloudy pixels using SCL (https://sentinels.copernicus.eu/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm-overview)
        # class 8: cloud medium probability
        # class 9: cloud high probability
        # class 10: thin cirrus
        red_cloud_masked = red.where((scl_interp != 8) & (scl_interp != 9) & (scl_interp != 10))
        nir_cloud_masked = nir.where((scl_interp != 8) & (scl_interp != 9) & (scl_interp != 10))

        ndvi = (nir_cloud_masked - red_cloud_masked) / (nir_cloud_masked + red_cloud_masked)
        # save the ndvi as geotiff
        s2_tile_id = red_band_url.split("/")[-2]
        file_name = f"{s2_tile_id}_ndvi.tif"
        output_file_path = f"{output_path}/{file_name}"
        ndvi.rio.to_raster(output_file_path)
        print("Written output: {}".format(output_file_path))

        # keep memory usage low
        del red
        del nir
        del scl
        del scl_interp
        del red_cloud_masked
        del nir_cloud_masked
        del ndvi

        gc.collect()
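
The script resamples the scene classification layer (SCL), which is delivered at 20 m, onto the 10 m grid of the red and NIR bands before masking. The sketch below shows the idea on toy lists, assuming an exact factor of 2 and perfectly aligned grids. It uses nearest-neighbor repetition, one common way to resample categorical data, and marks masked pixels with `None`. The helper names are hypothetical.

```python
CLOUD_CLASSES = {8, 9, 10}  # cloud medium probability, cloud high probability, thin cirrus


def upsample_nearest(grid, factor):
    """Repeat every cell `factor` times along both axes."""
    out = []
    for row in grid:
        wide_row = [value for value in row for _ in range(factor)]
        out.extend(list(wide_row) for _ in range(factor))
    return out


def mask_clouds(band, scl):
    """Replace pixels whose SCL class is a cloud class with None."""
    return [
        [None if cls in CLOUD_CLASSES else value
         for value, cls in zip(band_row, scl_row)]
        for band_row, scl_row in zip(band, scl)
    ]


# Toy data: a 1x2 SCL grid at 20 m and a 2x4 red band at 10 m.
scl_20m = [[4, 9]]  # class 4 = vegetation, class 9 = cloud high probability
red_10m = [[0.1, 0.2, 0.3, 0.4],
           [0.5, 0.6, 0.7, 0.8]]

scl_10m = upsample_nearest(scl_20m, 2)
red_masked = mask_clouds(red_10m, scl_10m)
```

Each 20 m SCL cell covers a 2x2 block of 10 m pixels, so the cloudy cell on the right masks the four red pixels beneath it.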

Now that we have the compute logic defined, we’re ready to start the geospatial SageMaker Processing job. This involves a straightforward three-step process: setting up the compute cluster, defining the computation specifics, and organizing the input and output details.

First, to set up the cluster, we decide on the number and type of instances required for the job, making sure they’re well-suited for geospatial data processing. The compute environment itself is prepared by selecting a geospatial image that comes with all commonly used packages for processing geospatial data.

Next, for the input, we use the previously created manifest that lists all image hyperlinks. We also designate an S3 location to save our results.

With these elements configured, we’re able to initiate multiple processing jobs at once, allowing them to operate concurrently for efficiency.

from multiprocessing import Process
import sagemaker
import boto3 
from botocore.config import Config
from sagemaker import get_execution_role
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

role = get_execution_role()
geospatial_image_uri = '081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest'
# use the retry behaviour of boto3 to avoid throttling issue
sm_boto = boto3.client('sagemaker', config=Config(connect_timeout=5, read_timeout=60, retries={'max_attempts': 20}))
sagemaker_session = sagemaker.Session(sagemaker_client = sm_boto)

def run_job(job_idx):
    s3_manifest = f"s3://{s3_bucket_name}/{s3_prefix}/{manifest_dir}/manifest{job_idx}.json"
    s3_output = f"s3://{s3_bucket_name}/{s3_prefix}/output"
    script_processor = ScriptProcessor(
        command=['python3'],
        image_uri=geospatial_image_uri,
        role=role,
        instance_count=num_instances_per_job,
        instance_type='ml.m5.xlarge',
        base_job_name=f'ca-s2-ndvi-{job_idx}',
        sagemaker_session=sagemaker_session,
    )

    script_processor.run(
        code='scripts/compute_vi.py',
        inputs=[
            ProcessingInput(
                source=s3_manifest,
                destination='/opt/ml/processing/input/',
                s3_data_type='ManifestFile',
                s3_data_distribution_type="ShardedByS3Key"
            ),
        ],
        outputs=[
            ProcessingOutput(
                source='/opt/ml/processing/output/',
                destination=s3_output,
                s3_upload_mode='Continuous'
            )
        ],
    )
    time.sleep(2)

processes = []
for idx in range(num_jobs):
    p = Process(target=run_job, args=(idx,))
    processes.append(p)
    p.start()
    
for p in processes:
    p.join()

After you launch the job, SageMaker automatically spins up the required instances and configures the cluster to process the images listed in your input manifest. This entire setup operates seamlessly, without needing your hands-on management. To monitor and manage the processing jobs, you can use the SageMaker console. It offers real-time updates on the status and completion of your processing tasks. In our example, it took under 20 minutes to process all 8,581 images with 500 instances. The scalability of SageMaker allows for faster processing times if needed, simply by increasing the number of instances.

SageMaker processing job portal
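
Besides the console, job progress can also be checked programmatically. The following is a minimal sketch that tallies processing jobs by status with the boto3 `list_processing_jobs` paginator. The function name `count_jobs_by_status` is hypothetical, and the name filter is an assumption that must match the `base_job_name` used when launching the jobs.

```python
from collections import Counter


def count_jobs_by_status(sm_client, name_contains):
    """Tally processing jobs whose names contain `name_contains` by status."""
    paginator = sm_client.get_paginator("list_processing_jobs")
    counts = Counter()
    for page in paginator.paginate(NameContains=name_contains):
        for summary in page["ProcessingJobSummaries"]:
            counts[summary["ProcessingJobStatus"]] += 1
    return counts


# Usage (requires AWS credentials; the filter string is an assumption and
# must match the base_job_name passed to ScriptProcessor):
#   import boto3
#   print(count_jobs_by_status(boto3.client("sagemaker"), "ca-s2-"))
```

Passing the client in as an argument keeps the function easy to exercise with a stub, and lets it reuse a client that was configured with retries to avoid throttling.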

Conclusion

The power and efficiency of SageMaker geospatial capabilities have opened new doors for environmental monitoring, particularly in the realm of vegetation mapping. Through this example, we showcased how to process over 8,500 satellite images in less than 20 minutes. We not only demonstrated the technical feasibility, but also showcased the efficiency gains from using the cloud for environmental analysis. This approach illustrates a significant leap from traditional, resource-intensive methods to a more agile, scalable, and cost-effective approach. The flexibility to scale processing resources up or down as needed, combined with the ease of accessing and analyzing vast datasets, positions SageMaker as a transformative tool in the field of geospatial analysis. By simplifying the complexities associated with large-scale data processing, SageMaker enables scientists, researchers, and business stakeholders to focus more on deriving insights and less on infrastructure and data management.

As we look to the future, the integration of ML and geospatial analytics promises to further enhance our understanding of the planet’s ecological systems. The potential to monitor changes in real time, predict future trends, and respond with more informed decisions can significantly contribute to global conservation efforts. This example of vegetation mapping is just the beginning for running planetary-scale ML. See Amazon SageMaker geospatial capabilities to learn more.


About the Author

Xiong Zhou is a Senior Applied Scientist at AWS. He leads the science team for Amazon SageMaker geospatial capabilities. His current area of research includes LLM evaluation and data generation. In his spare time, he enjoys running, playing basketball and spending time with his family.

Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker geospatial ML team. He holds a Masters in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries and Broadway shows.

Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in domains such as autonomous driving.

Li Erran Li is the applied science manager at human-in-the-loop services, AWS AI, Amazon. His research interests are 3D deep learning, and vision and language representation learning. Previously he was a senior scientist at Alexa AI, the head of machine learning at Scale AI and the chief scientist at Pony.ai. Before that, he was with the perception team at Uber ATG and the machine learning platform team at Uber working on machine learning for autonomous driving, machine learning systems and strategic initiatives of AI. He started his career at Bell Labs and was adjunct professor at Columbia University. He co-taught tutorials at ICML’17 and ICCV’19, and co-organized several workshops at NeurIPS, ICML, CVPR, ICCV on machine learning for autonomous driving, 3D vision and robotics, machine learning systems and adversarial machine learning. He has a PhD in computer science from Cornell University. He is an ACM Fellow and IEEE Fellow.

Amit Modi is the product leader for SageMaker MLOps, ML Governance, and Responsible AI at AWS. With over a decade of B2B experience, he builds scalable products and teams that drive innovation and deliver value to customers globally.

Kris Efland is a visionary technology leader with a successful track record in driving product innovation and growth for over 20 years. Kris has helped create new products including consumer electronics and enterprise software across many industries, at both startups and large companies. In his current role at Amazon Web Services (AWS), Kris leads the Geospatial AI/ML category. He works at the forefront of Amazon’s fastest-growing ML service, Amazon SageMaker, which serves over 100,000 customers worldwide. He recently led the launch of Amazon SageMaker’s new geospatial capabilities, a powerful set of tools that allow data scientists and machine learning engineers to build, train, and deploy ML models using satellite imagery, maps, and location data. Before joining AWS, Kris was the Head of Autonomous Vehicle (AV) Tools and AV Maps for Lyft, where he led the company’s autonomous mapping efforts and toolchain used to build and operate Lyft’s fleet of autonomous vehicles. He also served as the Director of Engineering at HERE Technologies and Nokia and has co-founded several startups.