Guidance for Distributed Computing with Cross Regional Dask on AWS
Overview
This Guidance shows how to use the Dask framework to run input/output (I/O)-intensive workloads on high-volume data that is dispersed across multiple AWS Regions. Instead of replicating data from its source Region to the user’s location, this Guidance uses the AWS global network to deploy a distributed computing architecture that positions Dask workers as close as possible to the applicable dataset. Amazon FSx for Lustre loads the data quickly and delivers high I/O operations per second (IOPS) for scientists. To decouple the user experience from the underlying infrastructure, the architecture builds a metadata catalog through a self-managed OpenSearch domain using Amazon OpenSearch Service. This catalog gives scientists full visibility into which datasets exist in FSx for Lustre in each of the worker Regions.
This Guidance helps users interact with and navigate cross regional data. Rather than waiting days or weeks to obtain cross regional data, or to work out which Region holds it, users specify their requested inputs and receive the data in minutes.
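As a rough illustration of the metadata catalog lookup described above, the following Python sketch builds an OpenSearch query DSL body and groups the matching file paths by Region. The field names (`variable`, `start_date`, `end_date`, `region`, `path`) are assumptions made for this example and are not taken from the Guidance's sample code; only the query DSL structure and the `hits.hits[]._source` response shape follow standard OpenSearch conventions.

```python
from __future__ import annotations

from typing import Any


def build_catalog_query(variable: str, start: str, end: str) -> dict[str, Any]:
    """Build a query body for datasets that hold `variable` and overlap [start, end].

    All field names are hypothetical; adapt them to the actual catalog mapping.
    """
    return {
        "size": 100,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"variable": variable}},
                    # Overlap test: the dataset starts on or before the requested
                    # end and ends on or after the requested start.
                    {"range": {"start_date": {"lte": end}}},
                    {"range": {"end_date": {"gte": start}}},
                ]
            }
        },
    }


def datasets_by_region(response: dict[str, Any]) -> dict[str, list[str]]:
    """Group file paths from a search response by the Region that holds them."""
    grouped: dict[str, list[str]] = {}
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        grouped.setdefault(source["region"], []).append(source["path"])
    return grouped
```

With the `opensearch-py` client, a body like this could be passed to `OpenSearch.search(index=..., body=...)`, where the index name depends on how the catalog was created. The grouped result tells the caller which worker Region should process each file.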
How it works
This architecture shows how to perform data-proximate compute on large datasets located across multiple Regions using cross regional Dask clusters on AWS. It also minimizes cross regional traffic and your associated carbon footprint.
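One way to picture data-proximate compute is to pin each task to Dask workers that run in the same Region as the dataset. The sketch below is a minimal illustration under stated assumptions, not the Guidance's own implementation: the `catalog` and `workers_by_region` dictionaries, the dataset names, and the worker addresses are all hypothetical. The only Dask API it relies on is `Client.submit` from `dask.distributed`, which accepts the `workers` and `allow_other_workers` keyword arguments.

```python
from __future__ import annotations

from typing import Any, Callable


def submit_near_data(
    client: Any,
    func: Callable[[str], Any],
    dataset: str,
    catalog: dict[str, dict[str, str]],
    workers_by_region: dict[str, list[str]],
) -> Any:
    """Submit `func(path)` only to workers in the Region that holds `dataset`.

    `client` is expected to be a dask.distributed.Client (or any object with a
    compatible `submit` method). `catalog` maps a dataset name to its Region
    and file path; `workers_by_region` maps a Region to worker addresses.
    Both mappings are hypothetical stand-ins for a real metadata catalog.
    """
    entry = catalog[dataset]
    workers = workers_by_region.get(entry["region"], [])
    if not workers:
        raise LookupError(f"no Dask workers registered in {entry['region']}")
    # allow_other_workers=False keeps the task from running in another Region,
    # so the data is never read across Regions.
    return client.submit(
        func, entry["path"], workers=workers, allow_other_workers=False
    )
```

A call such as `submit_near_data(client, open_and_reduce, "dataset-a", catalog, workers_by_region)` would return a Dask future, where `open_and_reduce` is a user-supplied function. Only the reduced result travels back to the user's Region, which is how this pattern limits cross regional traffic.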
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Get started
Implementation Resources
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.