AWS Solutions Library

Guidance for Distributed Computing with Cross Regional Dask on AWS

Overview

This Guidance helps customers use the Dask framework to run input/output (I/O)-intensive workloads on high-volume data that is spread across multiple AWS Regions. Instead of replicating data from its source Region to the user’s location, this Guidance uses the AWS global network to deploy a distributed computing architecture that positions Dask workers as close as possible to the applicable dataset. Amazon FSx for Lustre loads the data rapidly and serves it to scientists with high I/O operations per second (IOPS). To decouple the user experience from the underlying infrastructure, the architecture builds a metadata catalog through a self-managed OpenSearch domain using Amazon OpenSearch Service. This gives scientists full visibility into which datasets exist in FSx for Lustre in each of the worker Regions.
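The catalog lookup described above can be pictured with a minimal sketch. It is not the Guidance's implementation: the `DatasetEntry` record, the `CATALOG` contents, the Region names, the paths, and the `plan_by_region` helper are all hypothetical, and an in-memory list stands in for the OpenSearch Service domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog record: where a dataset lives (all values are illustrative)."""
    name: str
    region: str
    lustre_path: str


# Hypothetical catalog contents; a real deployment would query the
# OpenSearch Service domain instead of an in-memory list.
CATALOG = [
    DatasetEntry("dataset-a", "us-east-1", "/fsx/dataset-a"),
    DatasetEntry("dataset-b", "eu-west-2", "/fsx/dataset-b"),
]


def plan_by_region(requested, catalog=CATALOG):
    """Group the requested dataset names by the Region that holds them."""
    index = {entry.name: entry for entry in catalog}
    plan = {}
    for name in requested:
        if name not in index:
            raise KeyError(f"dataset not found in catalog: {name}")
        entry = index[name]
        plan.setdefault(entry.region, []).append(entry.lustre_path)
    return plan
```

With a plan like this, each group of paths could be handed to the Dask workers running in the matching Region, so the data stays where it is and only the results travel back to the user.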

This Guidance helps users interact with and navigate cross regional data. Rather than waiting days or weeks to obtain cross regional data or to work out which Region holds it, users can specify their requested inputs and receive the data in minutes.

How it works

This architecture shows how to perform data-proximate compute on large datasets located across multiple Regions using cross regional Dask clusters on AWS. It also minimizes cross regional traffic and your associated carbon footprint.

Download the architecture diagram

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

All Dask processes, including the scheduler and workers, publish logs to Amazon CloudWatch. These logs show whether a failure has occurred in the system. For a more detailed analysis, users have access to a real-time dashboard application that shows metrics such as current worker connections, in-flight processes, and CPU usage. You can resolve issues by manually adjusting the desired count of scheduler tasks and worker clusters, for example to refresh tasks and unfreeze stalled processes.
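As a rough illustration of adjusting the desired count, the sketch below wraps the Amazon ECS `update_service` call exposed by a boto3 ECS client. The helper name `set_worker_count` and the cluster and service names in the usage note are hypothetical. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def set_worker_count(ecs_client, cluster, service, desired_count):
    """Set the desired task count of an ECS service and return the new value.

    `ecs_client` is expected to behave like boto3.client("ecs").
    """
    if desired_count < 0:
        raise ValueError("desired_count must be non-negative")
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,
    )
    # The ECS API echoes the updated service description back to the caller.
    return response["service"]["desiredCount"]
```

In practice this might be called as `set_worker_count(boto3.client("ecs", region_name="us-east-1"), "dask-workers", "dask-worker-service", 2)`, where the Region, cluster, and service names are placeholders rather than values from the sample code.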

Read the Operational Excellence whitepaper

The principle of least privilege is applied throughout the Guidance, limiting each resource’s access to only what the resource requires to complete its function. External users need only SageMaker notebook access, because the notebook is the primary tool for interacting with Dask. Additionally, all resources are deployed into private subnets. OpenSearch Service is configured with node-to-node encryption and accepts HTTPS connections only. The SageMaker notebook and the FSx for Lustre file system (where the data to query sits) both use AWS Key Management Service (AWS KMS) to encrypt their underlying storage.

Read the Security whitepaper

By deploying private subnets across different Availability Zones, we decrease the chance of disruption to services. If one Availability Zone fails, the Auto Scaling group attached to the Amazon ECS cluster spins up a replacement instance in another Availability Zone. Dependencies are loosely coupled across the Regions: Dask workers connect independently to the scheduler in the client Region and mount an FSx for Lustre file system on launch. Telemetry from the workers notifies the scheduler of failed workers to terminate. Each Dask worker runs a “nanny process” that monitors the worker’s health and restarts it if necessary. If a worker fails to process its request, the scheduler detects the failure and requests that a new worker be spun up in place.
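The restart behavior described above can be pictured with a deliberately simplified sketch. Dask's own nanny supervises a separate worker process; the hypothetical `run_with_restarts` helper below only mimics the idea in a single process by re-invoking a callable after it raises.

```python
def run_with_restarts(start_worker, max_restarts=3):
    """Run a worker callable, restarting it after a crash.

    Returns (result, restarts_used). Re-raises the last error once
    max_restarts restarts have already been spent.
    """
    restarts = 0
    while True:
        try:
            return start_worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1
```

The restart limit is an assumption added for illustration; it keeps a worker that fails on every attempt from looping forever.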

Read the Reliability whitepaper

In this Guidance, we selected Amazon ECS for its ability to run a long-running Dask container and mount an FSx for Lustre file system. FSx for Lustre provides high IOPS for data access, and regional clusters scale compute based on CPU usage. We use Lambda for short-lived processes, such as a daily sync of the FSx for Lustre file system to Amazon S3, and AWS Fargate for the scheduler, which doesn’t require mounting a file system. You can optimize further by scaling on a more relevant metric than CPU usage (such as a metric published by the worker framework that indicates when it should scale out) and building that metric into the Dask framework.
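One way to picture scaling on a metric other than CPU usage is to size the worker pool from the task backlog. The sketch below is an assumption-laden illustration, not part of the sample code: the `desired_workers` helper, the choice of queued tasks as the metric, and the default bounds are all hypothetical.

```python
import math


def desired_workers(queued_tasks, target_tasks_per_worker,
                    min_workers=1, max_workers=100):
    """Size the pool so each worker holds about target_tasks_per_worker tasks."""
    if target_tasks_per_worker <= 0:
        raise ValueError("target_tasks_per_worker must be positive")
    needed = math.ceil(queued_tasks / target_tasks_per_worker)
    # Clamp to the allowed range so the pool never drops to zero or runs away.
    return max(min_workers, min(max_workers, needed))
```

A backlog-based signal reacts to work that is waiting rather than work that is already running, which is the kind of information CPU usage alone does not provide.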

Read the Performance Efficiency whitepaper

Instances follow an on-demand pricing model, which charges on an hourly basis. A scaling metric based on CPU usage is attached to the cluster of Dask workers. This metric streams into CloudWatch, which shows whether the existing supply of workers is sufficient or whether additional workers should be provisioned. This visibility helps you use only the minimum resources required and reduce overall costs.

Read the Cost Optimization whitepaper

The number of Dask workers is kept to a minimum, with 10 workers configured per Amazon ECS task. As the scheduler receives jobs for that pool of workers, the workers’ CPU usage increases. If CPU usage stays above the configured threshold, an alert triggers a scale-out to more workers. The number of workers increases only as demand increases, helping you keep the amount of resources you use to a minimum.
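The scale-out rule described above can be sketched as a check for sustained high CPU usage, plus the arithmetic for converting a worker count into Amazon ECS tasks. Only the figure of 10 workers per task comes from this Guidance; the 75 percent threshold, the three-sample window, and the function names are hypothetical.

```python
import math

WORKERS_PER_TASK = 10  # from the Guidance: 10 Dask workers per Amazon ECS task


def tasks_needed(worker_count):
    """Number of ECS tasks required to host worker_count Dask workers."""
    return math.ceil(worker_count / WORKERS_PER_TASK)


def should_scale_out(cpu_samples, threshold=75.0, periods=3):
    """True when the most recent `periods` CPU samples all exceed threshold.

    The threshold and window length are illustrative defaults.
    """
    if len(cpu_samples) < periods:
        return False
    return all(sample > threshold for sample in cpu_samples[-periods:])
```

Requiring several consecutive samples above the threshold avoids adding workers in response to a single brief spike.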

Read the Sustainability whitepaper

Get started

Implementation Resources

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.

Open sample code on GitHub

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.
