
Guidance for Scaling Geospatial Data Lakes with Earth on AWS

Overview

This Guidance shows how to build scalable geospatial data repositories on AWS, simplifying the design of data pipelines and providing faster access to raw data. By integrating Earth on AWS datasets from the Registry of Open Data on AWS, it eliminates the need to store this data in your own data lake, reducing cost and complexity. This Guidance also integrates with a variety of dissemination mechanisms and supports diverse processing demands, from basic spatial queries to complex analytics, letting you streamline geospatial workflows and improve data accessibility.
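Because Registry of Open Data buckets are publicly readable, a pipeline can read the source imagery in place with an anonymous Amazon S3 client instead of copying it. The following is a minimal boto3 sketch, assuming the public sentinel-cogs bucket that hosts Sentinel-2 L2A cloud-optimized GeoTIFFs; the prefix and object count are illustrative:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Registry of Open Data buckets allow anonymous reads, so an unsigned
    # client works and nothing is copied into your own data lake.
    s3 = boto3.client(
        "s3",
        region_name="us-west-2",
        config=Config(signature_version=UNSIGNED),
    )

    # List a few Sentinel-2 L2A cloud-optimized GeoTIFF objects in place.
    response = s3.list_objects_v2(
        Bucket="sentinel-cogs",
        Prefix="sentinel-s2-l2a-cogs/",
        MaxKeys=5,
    )
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])

The same pattern applies to other Earth on AWS datasets; only the bucket, Region, and prefix change.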

How it works

These technical details feature an architecture diagram that illustrates how to use this solution effectively. The diagram shows the key components and their interactions, providing a step-by-step overview of the architecture's structure and functionality.

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, follow as many of those best practices as possible when adapting this Guidance to your own workload.

Amazon CloudWatch provides comprehensive monitoring and observability for your applications. It captures and analyzes events, logs, and metrics to give you real-time insight into your system's health and performance. By using CloudWatch, you can detect issues proactively, troubleshoot problems more efficiently, and respond to incidents faster. This continuous monitoring helps maintain application reliability and optimal performance across your AWS infrastructure.
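As one concrete example of proactive detection, the sketch below sets a CloudWatch alarm on the Errors metric of a hypothetical ingest function; the function name, SNS topic ARN, and threshold are placeholders for your own resources:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when the (hypothetical) ingest function reports any errors
    # within a 5-minute window; notify an SNS topic for incident response.
    cloudwatch.put_metric_alarm(
        AlarmName="geospatial-ingest-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "geospatial-ingest"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=1.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",  # no invocations is not a failure
        AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],
    )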

Read the Operational Excellence whitepaper 

AWS provides a comprehensive suite of security services and features to protect your data and resources. AWS Identity and Access Management (IAM) enables fine-grained access control, allowing you to set permission policies that restrict who can access and manage AWS resources. Data is protected in several ways: Amazon S3 employs server-side encryption and bucket policies for data at rest, while AWS Key Management Service (AWS KMS) offers customer-managed keys for encrypting data in Amazon S3, Amazon Relational Database Service (Amazon RDS), and DynamoDB.
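For instance, default encryption with a customer-managed KMS key can be enforced on a data lake bucket in a few lines. A minimal sketch; the bucket name and key ARN are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Every new object in the bucket is encrypted at rest with the
    # customer-managed key unless a request explicitly overrides it.
    s3.put_bucket_encryption(
        Bucket="my-geospatial-lake",
        ServerSideEncryptionConfiguration={
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
                    },
                    "BucketKeyEnabled": True,  # fewer KMS calls, lower cost
                }
            ]
        },
    )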

This Guidance enhances network security by using security groups attached to container task network interfaces, protecting virtual private cloud (VPC) resources. It also configures network access control lists for subnet-level access restrictions and utilizes VPC endpoints to keep traffic within the AWS environment, safeguarding data in transit. Furthermore, the use of managed services like Amazon ECS, Lambda, and SageMaker reduces your security maintenance burden under the shared responsibility model.
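A gateway VPC endpoint for Amazon S3, for example, keeps S3 traffic on the AWS network rather than the public internet. A minimal sketch with placeholder VPC and route table IDs:

    import boto3

    ec2 = boto3.client("ec2")

    # Routes S3 traffic from the given route tables through AWS's network;
    # no internet gateway or NAT is involved for this traffic.
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],
    )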

Read the Security whitepaper 

The services selected for this Guidance offer high availability, durability, and scalability for your applications. Lambda enhances reliability by running functions across multiple Availability Zones (AZs) so that event processing continues even if one AZ fails. Aurora PostgreSQL provides robust high-availability options, replicating data six ways across three AZs for improved fault tolerance, even with a single database instance.

For controlled scaling and resilience, Step Functions allows you to manage processing rates, preventing overload on downstream services and avoiding rate limits. It also orchestrates stateless components, which are inherently more scalable, robust, and manageable. Data durability is supported by Amazon S3, which can replicate data across Regions automatically once cross-Region replication is configured, while both DynamoDB and Aurora provide flexible backup capabilities for point-in-time recovery.
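One common way to express this rate control is a Map state with a MaxConcurrency cap, which fans out over work items while limiting how many run at once. The sketch below registers such a state machine; the ARNs and names are placeholders, not the exact workflow this Guidance deploys:

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    # Fan out over the scene list, but never run more than 10 workers at
    # once, so downstream services are not overloaded.
    definition = {
        "StartAt": "ProcessScenes",
        "States": {
            "ProcessScenes": {
                "Type": "Map",
                "ItemsPath": "$.scenes",
                "MaxConcurrency": 10,
                "Iterator": {
                    "StartAt": "ProcessOne",
                    "States": {
                        "ProcessOne": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:process-scene",
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }

    sfn.create_state_machine(
        name="geospatial-batch",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::111122223333:role/StepFunctionsExecutionRole",
    )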

Read the Reliability whitepaper 

This Guidance uses AWS managed services that automatically adjust to workload demands. For example, Lambda scales automatically to handle querying and data processing based on incoming event volume. Step Functions manages workflow orchestration, dynamically adjusting to increased loads by parallelizing or queueing tasks.

For data storage and access, this Guidance utilizes three key services:

  1. Amazon S3 accommodates high throughput and numerous requests without provisioning, optimizing for various access patterns.
  2. DynamoDB offers flexible query capabilities with automatic scaling.
  3. Aurora automatically adjusts compute and memory resources to match workload demands.

Together, these services provide a scalable infrastructure that handles varying workloads efficiently, so your application maintains performance and responsiveness as demand fluctuates, without manual intervention or complex capacity planning.
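As a concrete instance of this no-capacity-planning model, a DynamoDB table created in on-demand mode scales read and write throughput with traffic automatically; the table and key names below are illustrative:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # PAY_PER_REQUEST (on-demand) mode removes capacity planning: the
    # table absorbs traffic spikes and you pay per request.
    dynamodb.create_table(
        TableName="scene-metadata",
        AttributeDefinitions=[
            {"AttributeName": "scene_id", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "scene_id", "KeyType": "HASH"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )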

Read the Performance Efficiency whitepaper 

This Guidance optimizes costs through several strategies:

  1. Storage: Uses datasets from the Registry of Open Data on AWS and recommends downloading raw files to your data lake for multiple processing iterations.
  2. Compute: Uses serverless services for automatic scaling, employs Spot Instances for batch processing with On-Demand failover, and suggests Reserved Instances for Amazon RDS for PostgreSQL.
  3. Data transfer: Uses VPC endpoints, contains processing within a single VPC, and eliminates the need for additional transfer services between components.

These approaches minimize expenses across storage, compute, and networking while maintaining performance for geospatial processing workloads.
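The Spot-with-failover approach from the compute item above can be realized, for example, with an AWS Batch job queue that prefers a Spot compute environment and falls back to an On-Demand one. A sketch under the assumption that both compute environments already exist; all names are placeholders:

    import boto3

    batch = boto3.client("batch")

    # Batch tries the lower-cost Spot environment first (order 1) and
    # falls back to On-Demand capacity (order 2) when Spot is unavailable.
    batch.create_job_queue(
        jobQueueName="geospatial-processing",
        state="ENABLED",
        priority=1,
        computeEnvironmentOrder=[
            {"order": 1, "computeEnvironment": "geospatial-spot"},
            {"order": 2, "computeEnvironment": "geospatial-ondemand"},
        ],
    )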

Read the Cost Optimization whitepaper 

Through the use of managed, serverless services, this Guidance minimizes the environmental impact of backend resources by scaling them up and down to meet demand. Additionally, you can monitor CloudWatch metrics to make sure that the scaled environment is not overprovisioned, further reducing your environmental impact.

Read the Sustainability whitepaper 

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.