AWS Storage Blog

Automatically compress and archive satellite imagery for Amazon S3

Satellite imagery often comes as large, high-resolution files, so organizations that work with this data typically face high storage costs. Additionally, large imagery files take significant time and resources to download for use with machine learning (ML), data analytics tools, or manual analyst review.

Standard compression techniques can substantially reduce file sizes with little loss of visual quality. Smaller files reduce storage and data transfer costs and improve data transfer speeds for imagery data.

In this post, we propose a low-cost solution that compresses satellite imagery with the Geospatial Data Abstraction Library (GDAL) and archives the original images using Amazon S3 and the Amazon S3 Glacier storage classes. As a result, engineers get quicker access to data for use cases such as ML model training and inference while maintaining access to the high-resolution original images, all at a lower cost than storing the high-resolution images in S3 Standard alone.

Solution overview

In this solution, we'll create a pipeline for processing imagery data. Users upload images to an input S3 bucket, and an object upload notification triggers an AWS Lambda function that will:

  1. Compress the image.
  2. Upload the compressed image to the output bucket.
  3. Copy the original image to an archive prefix in the output bucket in an S3 Glacier storage class.
  4. Delete the original image from the input bucket.

After the function has run, you can access both the compressed images and the archived originals in the output bucket.

A diagram of the application architecture

Solution walkthrough

To set up the solution, we cover the following steps:

  1. Create S3 buckets for input and output.
  2. Write the application code that will run on AWS Lambda to compress and archive satellite imagery.
  3. Create a Docker image for Lambda that bundles the prerequisite dependencies (e.g., GDAL) for our application code.
  4. Deploy the Lambda function.
  5. Configure S3 bucket permissions and an object upload trigger.
  6. Use the solution.

Prerequisites

The following prerequisites are required before continuing:

When configured with higher amounts of memory, ephemeral storage, and timeout, Lambda can compress images smaller than 10 GB.

1. Create the S3 buckets

Navigate to the Amazon S3 console and create two S3 buckets: one for input and one for output. For each bucket, choose a globally unique name, select your Region of choice, disable Amazon S3 ACLs, and make sure Block all public access is enabled. Encryption is enabled by default with Amazon S3 managed keys (SSE-S3).
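
If you prefer to script this step, the following is a minimal boto3 sketch of the same bucket setup. The bucket names and Region are placeholders, and in us-east-1 the CreateBucketConfiguration argument must be omitted.

# create_buckets.py - a hedged sketch of step 1; bucket names and Region are placeholders
import boto3

REGION = "us-west-2"  # placeholder Region
s3 = boto3.client("s3", region_name=REGION)

for name in ["your-input-bucket-name", "your-output-bucket-name"]:
    # In us-east-1, call create_bucket without CreateBucketConfiguration
    s3.create_bucket(
        Bucket=name,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )
    # Disable ACLs by enforcing bucket-owner ownership
    s3.put_bucket_ownership_controls(
        Bucket=name,
        OwnershipControls={"Rules": [{"ObjectOwnership": "BucketOwnerEnforced"}]},
    )
    # Block all public access, matching the console settings above
    s3.put_public_access_block(
        Bucket=name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )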

2. Write our application code

To create the project files, begin by creating a project directory. Inside the project directory, create a Dockerfile and another directory named "app". Inside the "app" directory, create a "handler.py" file and a "requirements.txt" file. Your project directory should look like this:

project/
├─ app/
│ ├─ handler.py
│ ├─ requirements.txt
├─ Dockerfile

Copy the following code into the handler.py file. We’ll use AWS Lambda Powertools to simplify reading in our S3 create event notifications. Python bindings for GDAL are installed as part of the base image that is specified in the Dockerfile. A few parameters of note:

  • You can select different storage classes as parameters to the upload function. See more details under the --storage-class parameter in the AWS CLI S3 API documentation. Additional details about storage classes can be found in the Amazon S3 user guide.
  • Adapt the gdal.Translate method to fit your use case. For this example, we use the GDAL COG driver and specify JPEG compression, setting the output file extension and quality with the OUTFILE_EXTENSION and QUALITY parameters. A lossless variant is sketched after the handler code.
  • Set the OUTPUT_BUCKET_NAME variable to match the output bucket name that you created in the previous step.
# handler.py
import os
from urllib.parse import unquote_plus
import boto3
from osgeo import gdal
from aws_lambda_powertools.utilities.data_classes import event_source, S3Event
S3_CLIENT = boto3.client("s3")
OUTFILE_EXTENSION = "tif"
QUALITY = "75"
OUTPUT_BUCKET_NAME = "your-output-bucket-name"

def upload(
    local_file: str,
    bucket_name: str,
    prefix: str,
    object_name: str,
    storage_class: str = "STANDARD",
):
    S3_CLIENT.upload_file(
        local_file,
        bucket_name,
        f"{prefix}/{object_name}",
        ExtraArgs={"StorageClass": storage_class},
    )

@event_source(data_class=S3Event)
def handler(event: S3Event, context):
    bucket_name = event.bucket_name

    # Multiple records can be delivered in a single event
    for record in event.records:
        object_key = unquote_plus(record.s3.get_object.key)

        # Download file
        name, input_ext = os.path.splitext(object_key)
        S3_CLIENT.download_file(
            bucket_name, 
            object_key, 
            f"/tmp/input{input_ext}"
        )

        # Compress File
        outfile = f"/tmp/compressed.{OUTFILE_EXTENSION}"
        ds = gdal.Open(f"/tmp/input{input_ext}")
        out_ds = gdal.Translate(
            outfile,
            ds,
            format="COG",
            creationOptions=["COMPRESS=JPEG", f"QUALITY={QUALITY}"]
        )
        # Dereference the datasets so GDAL flushes and closes the files
        out_ds = None
        ds = None

        # Upload Compressed
        upload(
            outfile,
            OUTPUT_BUCKET_NAME,
            "output", 
            f"{name}.{OUTFILE_EXTENSION}",
            "STANDARD" # S3 Storage Tier
        )

        # Archive Original
        upload(
            f"/tmp/input{input_ext}",
            OUTPUT_BUCKET_NAME,
            "archive",
            object_key,
            "GLACIER",
        )

        # Delete original 
        S3_CLIENT.delete_object(Bucket=bucket_name, Key=object_key)
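
The gdal.Translate call above uses lossy JPEG compression. If your downstream analysis needs pixel-exact values, a lossless COG profile is one alternative; the following is a minimal sketch of that variant under assumed placeholder paths, and you would swap its creationOptions into the handler above.

# lossless_sketch.py - a hedged, lossless variant of the Translate call above
# The input and output paths are placeholders; adapt them to the handler's /tmp files
from osgeo import gdal

src = gdal.Open("/tmp/input.ntf")
out = gdal.Translate(
    "/tmp/compressed_lossless.tif",
    src,
    format="COG",
    # DEFLATE is lossless; PREDICTOR=2 and a higher LEVEL trade CPU time for file size
    creationOptions=["COMPRESS=DEFLATE", "PREDICTOR=2", "LEVEL=9"],
)
# Dereference the datasets so GDAL flushes and closes the files
out = None
src = None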

Finally, copy these requirements into requirements.txt.

aws-lambda-powertools~=2.4
awslambdaric~=2.0
boto3~=1.26

3. Create a Docker image for the Lambda function

In a previous post, we demonstrated converting satellite imagery to Cloud Optimized GeoTIFFs (COGs) using a similar architecture to the one we demonstrate here. In that post, we created the Lambda function with the rio-cogeo library pre-installed in a Docker image. Although that library and Docker image could be sufficient for this example, here we demonstrate extending the official GDAL Docker image, which exposes more configuration settings than the rio-cogeo library. This gives us access to the osgeo Python bindings for GDAL within our Lambda function, as well as all of the required GDAL dependencies.

To begin, let’s prepare our Dockerfile by following the AWS documentation for creating images from alternative base images. Note that we’re using osgeo/gdal:ubuntu-small-3.5.3 as the base image, which uses a Python runtime version compatible with Lambda. Copy the following code into the Dockerfile that you created earlier:

# Dockerfile
# Define function directory
ARG FUNCTION_DIR="/function"

FROM osgeo/gdal:ubuntu-small-3.5.3 as build-image

# Install aws-lambda-cpp build dependencies
RUN apt-get update && \
    apt-get install -y \
    g++ \
    make \
    cmake \
    unzip \
    python3-pip \
    libcurl4-openssl-dev

# Include global arg in this stage of the build
ARG FUNCTION_DIR

# Create function directory
RUN mkdir -p ${FUNCTION_DIR}

# Copy function code
COPY app/* ${FUNCTION_DIR}/

# Install the runtime interface client & other requirements
RUN python -m pip install \
    --target ${FUNCTION_DIR} \
    -r ${FUNCTION_DIR}/requirements.txt

# Set working directory to function root directory
WORKDIR ${FUNCTION_DIR}

ENTRYPOINT [ "/usr/bin/python", "-m", "awslambdaric" ]
CMD [ "app.handler" ]

Now we build and push our Docker image to our AWS account. In a terminal, navigate to the project directory and run docker build -t compression-blog:latest . (the trailing dot sets the build context to the current directory). Because the GDAL image doesn't come with the Lambda Runtime Interface Emulator (RIE) built in, you can optionally test the image locally by following the instructions for testing an image without adding RIE to the image.

Next, push the Docker image to Amazon ECR. Start by tagging the local Docker image and logging in to ECR. Create a repository, and then push the Docker image.

  • docker tag compression-blog:latest <AWS_ACCOUNT_NUMBER>.dkr.ecr.<REGION>.amazonaws.com/compression-blog:latest
  • aws ecr get-login-password --region <REGION> | docker login --username AWS --password-stdin <AWS_ACCOUNT_NUMBER>.dkr.ecr.<REGION>.amazonaws.com
  • aws ecr create-repository --repository-name compression-blog --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE
  • docker push <AWS_ACCOUNT_NUMBER>.dkr.ecr.<REGION>.amazonaws.com/compression-blog:latest

4. Deploy the Lambda function

Navigate to the AWS Lambda console and select Create function. Choose Container image, name the function, and select Browse images. Select "compression-blog" from the Amazon ECR image repository dropdown, select the image with the Image tag "latest", and then choose Select image. Keep the remaining default settings and select Create function.

The Memory (10 GB limit), Ephemeral storage (10 GB limit), and Timeout (15 minute limit) can all be increased from the General configuration page under the Configuration tab.

Note that if you make changes to the Docker image, you must redeploy it from the Lambda function's Image tab by selecting Deploy new image and choosing the latest version of your image via Browse images.
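
If you manage the function from a script instead of the console, the following is a hedged boto3 sketch of the same changes. The function name and image URI are placeholders for your own resources.

# update_function.py - a hedged sketch for raising limits and redeploying a new image
import boto3

FUNCTION_NAME = "compression-blog-function"  # placeholder function name
IMAGE_URI = "<AWS_ACCOUNT_NUMBER>.dkr.ecr.<REGION>.amazonaws.com/compression-blog:latest"

lam = boto3.client("lambda")

# Raise memory, ephemeral storage, and timeout toward the Lambda limits
lam.update_function_configuration(
    FunctionName=FUNCTION_NAME,
    MemorySize=10240,                  # MB, up to 10,240
    EphemeralStorage={"Size": 10240},  # MB, up to 10,240
    Timeout=900,                       # seconds, up to 15 minutes
)

# Wait for the configuration update to finish before deploying a new image
lam.get_waiter("function_updated").wait(FunctionName=FUNCTION_NAME)

# Point the function at the newly pushed image
lam.update_function_code(FunctionName=FUNCTION_NAME, ImageUri=IMAGE_URI)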

5. Configure the S3 bucket permissions and trigger

After the function is created, select Add trigger in the Function overview section. Select S3 from the Select a source dropdown, and search for the input bucket that you created earlier. Keep the default values for the remaining fields and acknowledge the Recursive invocation warning.
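
The console wires up the trigger and the invoke permission for you. For completeness, the following is a hedged boto3 sketch of the equivalent setup; the bucket name and function ARN are placeholders, and note that put_bucket_notification_configuration replaces any existing notification configuration on the bucket.

# add_trigger.py - a hedged sketch of configuring the S3 trigger programmatically
import boto3

INPUT_BUCKET = "your-input-bucket-name"  # placeholder bucket name
FUNCTION_ARN = "arn:aws:lambda:<REGION>:<AWS_ACCOUNT_NUMBER>:function:compression-blog-function"

# Allow S3 to invoke the function (the console adds this permission automatically)
boto3.client("lambda").add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="s3-invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{INPUT_BUCKET}",
)

# Send all object-created events from the input bucket to the function
boto3.client("s3").put_bucket_notification_configuration(
    Bucket=INPUT_BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {"LambdaFunctionArn": FUNCTION_ARN, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)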

From the function page, select the Configuration tab and then Permissions. Select the Role name under Execution role, which will open the AWS IAM console. Select Create inline policy under Add permissions. Create a policy that matches the following, replacing “your-input-bucket-name-here” and “your-output-bucket-name-here” with the names of the respective buckets that you created earlier.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::your-input-bucket-name-here/*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::your-output-bucket-name-here/*"
        }
    ]
}
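
If you would rather attach this policy from a script, the following is a hedged boto3 sketch; the role name and policy name are placeholders, and the bucket ARNs mirror the policy above.

# attach_policy.py - a hedged sketch of attaching the inline policy shown above
import json
import boto3

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:DeleteObject", "s3:GetObject"],
            "Resource": "arn:aws:s3:::your-input-bucket-name-here/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::your-output-bucket-name-here/*",
        },
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="your-lambda-execution-role-name",  # the role shown under Execution role
    PolicyName="compression-blog-s3-access",     # placeholder policy name
    PolicyDocument=json.dumps(POLICY),
)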

6. Usage

To test the solution, we upload samples from the RarePlanes dataset, available in the Registry of Open Data on AWS. We use 3-band NITFs from the dataset and upload them to the input bucket. After we upload the sample images and the Lambda invocation finishes, the original image appears in the output bucket under the "archive/" prefix with the Amazon S3 Glacier Flexible Retrieval storage class, and the compressed image appears under the "output/" prefix. The original image is also deleted from the input bucket.
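
You can spot-check the results with a quick boto3 sketch like the following; the bucket and object keys are placeholders that follow the "archive/" and "output/" prefixes described above.

# verify_outputs.py - a hedged sketch for checking storage class and size of the results
import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "your-output-bucket-name"  # placeholder bucket name

archived = s3.head_object(Bucket=OUTPUT_BUCKET, Key="archive/sample_image.ntf")
compressed = s3.head_object(Bucket=OUTPUT_BUCKET, Key="output/sample_image.tif")

# head_object omits StorageClass for STANDARD objects, so the archived copy should
# report GLACIER while the compressed copy falls back to the default
print(archived.get("StorageClass"), archived["ContentLength"])
print(compressed.get("StorageClass", "STANDARD"), compressed["ContentLength"])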

In the following images, we demonstrate the difference in visual quality between the original image and an image compressed with JPEG when the GDAL JPEG quality parameter is set to 75. The original image size was 40.6 MB and the resulting compressed image is 1.4 MB, a 96% reduction. The two images are 95% similar as measured by the structural similarity index measure (SSIM), a common metric of perceived image similarity. Note the minimal compression artifacts in the image on the right.
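
If you want to reproduce the SSIM comparison yourself, the following is a minimal sketch assuming scikit-image 0.19 or later and 8-bit, 3-band imagery; the file names are placeholders for a downloaded original and its compressed counterpart.

# ssim_check.py - a hedged sketch for comparing an original and a compressed image
import numpy as np
from osgeo import gdal
from skimage.metrics import structural_similarity

def read_bands(path):
    ds = gdal.Open(path)
    arr = ds.ReadAsArray()               # shape: (bands, rows, cols)
    return np.transpose(arr, (1, 2, 0))  # shape: (rows, cols, bands)

original = read_bands("original.ntf")      # placeholder file name
compressed = read_bands("compressed.tif")  # placeholder file name

score = structural_similarity(original, compressed, channel_axis=-1, data_range=255)
print(f"SSIM: {score:.3f}")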

Full scale comparison of the original and compressed images

Zoomed comparison of the original and compressed images

Cleaning up

If you followed along and don’t want to maintain the solution set up in this post, delete the resources you created and used to avoid incurring unintended charges.

Delete the input/output S3 buckets

  1. Open the Amazon S3 console.
  2. In the Buckets list, select the option next to the name of the input bucket that you created, and then choose Delete at the top of the page.
  3. If the bucket isn't empty, you must first choose Empty and confirm by entering 'permanently delete' in the text field.
  4. On the Delete bucket page, confirm that you want to delete the bucket by entering the bucket name into the text field, and then choose Delete bucket.
  5. Repeat these instructions for the output bucket.

Delete the Lambda function

  1. Open the AWS Lambda console and select Functions in the navigation sidebar.
  2. In the functions list, select the option next to the name of the function that you created, and choose Actions at the top of the page. Select Delete in the dropdown menu.
  3. Type ‘delete’ in the input box, and select Delete at the bottom.

Delete the ECR repository

  1. Open the Amazon ECR console and select Repositories in the navigation sidebar.
  2. In the Private repositories list, select the repository that you created earlier, “compression-blog”, and choose Delete at the top of the page.
  3. Type ‘delete’ in the input box and select Delete at the bottom.
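
If you prefer to clean up programmatically, the following is a hedged boto3 sketch of the same steps; the bucket, function, and repository names are placeholders for the resources you created.

# cleanup.py - a hedged sketch of deleting the resources created in this post
import boto3

# Empty and delete the input and output buckets
s3 = boto3.resource("s3")
for name in ["your-input-bucket-name", "your-output-bucket-name"]:
    bucket = s3.Bucket(name)
    bucket.objects.all().delete()  # the bucket must be empty before it can be deleted
    bucket.delete()

# Delete the Lambda function
boto3.client("lambda").delete_function(FunctionName="compression-blog-function")

# Delete the ECR repository and any images it contains
boto3.client("ecr").delete_repository(repositoryName="compression-blog", force=True)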

Conclusion

In this post, we demonstrated a solution that enables the automatic compression and archival of satellite imagery for hosting in Amazon S3. We modified a GDAL Docker image to run on Lambda, wrote application code to compress images, and configured an S3 bucket to trigger our Lambda function.

Satellite imagery is high resolution and expensive to store at scale. By compressing the imagery, we demonstrated that storage size in Amazon S3 Standard can be reduced by more than 90%, while the original images are archived in an S3 Glacier storage class. The S3 Glacier storage classes are purpose-built for data archiving, providing you with the highest performance, most retrieval flexibility, and the lowest cost archive storage in the cloud. The reduction in image sizes enables analysts and automated systems, such as ML applications, to download images more quickly without significant reduction in image quality. Additionally, Lambda provides a serverless, low-cost way to compress the imagery.

Next steps

Consider how you could adapt this solution to your architecture, such as using Amazon Elastic Container Service (Amazon ECS) for compute workloads requiring more than 15 minutes of runtime, or using Amazon SageMaker to host an ML object detection model that runs inference on the compressed images. Add your comments below with ideas for how you can apply this post in your applications.

Newel Hirst

Newel Hirst is an AWS Machine Learning Consultant on the National Security Professional Services team, where he builds AI/ML solutions for public sector customers.

Joseph Fahimi

Joseph Fahimi is an AWS Data Scientist on the National Security Professional Services team, where he builds AI/ML solutions for public sector customers.

Justin Downes

Justin Downes is a Computer Vision Practice Manager at AWS on the National Security Professional Services team. He works with national security customers to develop strategies for their AI/ML-focused business problems.