Guidance for Distributed Model Training on AWS
Overview
How it works
These technical details include an architecture diagram that illustrates how to use this solution effectively. The diagram shows the key components and their interactions, providing a step-by-step overview of the architecture's structure and functionality.
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
To support scalable simulation and key performance indicator (KPI) calculation models, use Amazon EKS and Amazon QuickSight.
Security
Resources are stored in a virtual private cloud (VPC), which provides a logically isolated network. Access to these resources is controlled through AWS Identity and Access Management (IAM) roles scoped to least privilege, that is, the minimum permissions required to complete a task.
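As a minimal sketch of this least-privilege approach, the following Python snippet creates a SageMaker execution role whose inline policy allows only read access to a single S3 prefix. The role name, bucket, and prefix are hypothetical placeholders, and a real deployment would typically also scope permissions for CloudWatch Logs, Amazon ECR, and VPC networking.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy: only the SageMaker service can assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Least-privilege inline policy: read-only access to one hypothetical S3 prefix.
data_access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-training-bucket",
            "arn:aws:s3:::example-training-bucket/processed-dataset/*",
        ],
    }],
}

iam.create_role(
    RoleName="DistributedTrainingExecutionRole",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="DistributedTrainingExecutionRole",
    PolicyName="TrainingDataReadOnly",
    PolicyDocument=json.dumps(data_access_policy),
)
```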
Reliability
Kubeflow on AWS supports data pipeline orchestration.
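To illustrate what such orchestration can look like, here is a minimal Kubeflow Pipelines (KFP v2 SDK) sketch that chains a preprocessing step into a training step. The component bodies, base image, and pipeline name are placeholder assumptions, not part of this Guidance's sample code.

```python
from kfp import compiler, dsl


@dsl.component(base_image="python:3.10")
def preprocess(raw_data_uri: str) -> str:
    # Placeholder: transform raw data and return the processed dataset URI.
    print(f"preprocessing {raw_data_uri}")
    return raw_data_uri


@dsl.component(base_image="python:3.10")
def train(dataset_uri: str):
    # Placeholder: launch or monitor the distributed training job.
    print(f"training on {dataset_uri}")


@dsl.pipeline(name="distributed-training-pipeline")
def training_pipeline(raw_data_uri: str):
    prep = preprocess(raw_data_uri=raw_data_uri)
    train(dataset_uri=prep.output)


if __name__ == "__main__":
    # Compile to a pipeline definition that can be uploaded to Kubeflow on AWS.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```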
Performance Efficiency
If you have on-premises restrictions or existing Kubernetes investments, you can use Amazon EKS and Kubeflow on AWS to implement an ML pipeline for distributed training; otherwise, you can use a fully managed SageMaker solution for production-scale training infrastructure. Both options help you scale the training environment to meet workload requirements.
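For the fully managed option, a hedged sketch of a distributed training job using the SageMaker Python SDK might look like the following. The entry point, source directory, role ARN, instance type and count, framework versions, and S3 URI are all assumptions to replace with your own values, and other distribution strategies are available besides the SageMaker data parallel library shown here.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()

estimator = PyTorch(
    entry_point="train.py",    # hypothetical training script
    source_dir="src",          # hypothetical source directory
    role="arn:aws:iam::111122223333:role/DistributedTrainingExecutionRole",  # placeholder
    framework_version="1.13",
    py_version="py39",
    instance_count=4,          # scale out by raising the node count
    instance_type="ml.p4d.24xlarge",
    # SageMaker distributed data parallel library for multi-node, multi-GPU training.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    sagemaker_session=session,
)

# Placeholder S3 URI for the processed training dataset.
estimator.fit({"training": "s3://example-training-bucket/processed-dataset/"})
```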
Cost Optimization
We selected resource sizes and types based on resource characteristics and past workloads, so you pay only for resources matched to your needs.
Sustainability
SageMaker is designed to handle training clusters that scale up as needed and shut down automatically when jobs are complete. SageMaker also reduces the infrastructure and operational overhead typically required to train deep learning models on hundreds of GPUs. Amazon Elastic File System (Amazon EFS) integration with the training clusters and the development environment allows you to share your code and processed training dataset, so you don't have to rebuild the container image or reload large datasets after every code change.
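As a hedged illustration of that EFS integration, the SageMaker Python SDK can pass an EFS file system to a training job as a FileSystemInput. The file system ID and directory path below are assumptions, and the estimator must be configured with the VPC subnets and security groups that can reach the file system.

```python
from sagemaker.inputs import FileSystemInput

# Hypothetical EFS file system holding the processed training dataset.
dataset_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",   # placeholder EFS ID
    file_system_type="EFS",
    directory_path="/processed-dataset",     # placeholder path within the file system
    file_system_access_mode="ro",            # read-only is sufficient for training data
)

# `estimator` refers to a SageMaker estimator such as the one sketched above,
# created with subnets= and security_group_ids= so the job can mount the EFS volume.
# estimator.fit({"training": dataset_input})
```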
Deploy with confidence
Ready to deploy? Review the sample code on GitHub for detailed deployment instructions, and deploy the Guidance as-is or customize it to fit your needs.