This Guidance helps customers use a Dask framework to perform input/output (I/O)-intensive workloads on high-volume data that is sparsely located across multiple AWS Regions. Instead of replicating data from its source Region to the user’s location, this Guidance uses the AWS global network to deploy a distributed computing architecture that strategically positions Dask workers as close as possible to the applicable dataset. Amazon FSx for Lustre rapidly loads and performs high I/O per second (IOPS) for scientists. To decouple the user experience from the underlying infrastructure, the architecture builds a metadata catalog through a self-managed OpenSearch domain using Amazon OpenSearch Service. This gives scientists full visibility into which datasets exist in FSx for Lustre in each of the worker Regions.
This Guidance helps users interact with and navigate cross regional data. Rather than waiting days or weeks to get cross regional data or figure out the Region where the data exists, the user can specify their requested inputs and receive the data in minutes.
This architecture contains two personas. Through Elastic Load Balancing (ELB), the system administrator (Persona 1), has access to the Dask dashboard running on the Dask scheduler. Persona 2, the scientist, accesses the Amazon SageMaker notebook through the AWS console.
SageMaker notebook connects to the scheduler, and the user looks up Dask workers closest to the datasets by querying the data catalog on Amazon OpenSearch Service and then initiating the compute request.
AWS Transit Gateway routes traffic between the Dask scheduler and Dask workers running in different AWS Regions.
Dask workers running as Amazon Elastic Container Service (Amazon ECS) tasks perform the requested compute on datasets mounted through Amazon FSx for Lustre. Amazon ECS automatically scales up Dask worker instances based on CPU usage.
Metadata of datasets are synced periodically from public Amazon Simple Storage Service (Amazon S3) buckets that are part of the Open Data on AWS Initiative.
Synced datasets are automatically indexed using an Amazon Elastic Compute Cloud (Amazon EC2) instance that executes a daily cronjob (a command for scheduled tasks) to update the data catalog on OpenSearch Service.
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
All Dask processes, including the scheduler and workers, publish logs to Amazon CloudWatch. These logs will show if a failure has occurred in the system. For a more detailed analysis, users have access to a real-time dashboard application showcasing various metrics such as current worker connections, in-flight processes, and CPU usage. By manually adjusting the desired count of scheduler tasks and worker clusters, you can resolve issues such as a task refresh to unfreeze processes.
The principle of least privilege is applied throughout the Guidance, limiting each resource’s access to only what is required for the resource to complete its function. External users would only need SageMaker notebook access, given this is the primary tool to interact with Dask. Additionally, all resources are deployed into private subnets. OpenSearch Service has configured node-to-node encryption and enforces HTTPS connections only. The SageMaker notebook and FSx for Lustre file system (where the data to query sits) both leverage AWS Key Management Service (AWS KMS) to encrypt the underlying storage of each service.
By deploying private subnets across different Availability Zones, we decrease the chance of disruption to services. If one Availability Zone fails, the autoscaling group attached to the Amazon ECS cluster would spin up a replacement instance in another Availability Zone. This Guidance contains loosely coupled dependencies throughout each of the Regions. Dask workers connect independently into the client Region’s scheduler and mount to an FSx for Lustre file system on launch. Telemetry from the workers notify the scheduler of failed workers to terminate. Each of the Dask workers have a running “nanny process” which monitors the health of the workers and restarts them if necessary. In this process, the scheduler detects and requests a new worker to be spun up in-place should the worker fail to process its request.
In this Guidance, we selected Amazon ECS for its ability to execute a long-running Dask container and mount to an FSx for Lustre file system. FSx for Lustre provides a high IOPS service for data access patterns, and regional clusters scale compute based on CPU usage. We use Lambda for short-term processes, such as a daily sync of the FSx for Lustre file system to Amazon S3 or AWS Fargate for the scheduler that doesn’t require mounting to a file system. You can optimize your data by performing on a more relevant metric to scale than CPU usage (such as a published metric from the worker framework detailing when it should scale out) and building it into the Dask framework.
Instances follow an on-demand pricing model which charge on an hourly basis. There is a scaling metric on CPU usage attached to the cluster of Dask workers. This metric streams into CloudWatch, a service that provides insight as to whether the existing supply of workers is sufficient or whether additional workers should be provisioned. This visibility helps you use only the minimum resources required and reduce overall costs.
SustainabilityDask workers are kept to a minimal amount, configured to 10 workers per Amazon ECS task. As the scheduler receives jobs for that pool of workers, CPU usage of the workers increases. If those workers remain at a high enough CPU threshold, an alert will trigger to scale to more workers. The number of workers only increases as demand increases, helping you keep the amount of resources you use to a minimum.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.