Guidance for AI Training for Smart Products on AWS
Overview
How it works
This architecture diagram shows how to build a secure and flexible multi-tenant artificial intelligence (AI) training environment for smart products.
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
This Guidance helps you reduce the operational burden of your platform by using managed services. By integrating a SageMaker software development kit (SDK), you can create a web portal UI that helps your customers submit and manage training jobs, and you won’t need to worry about the underlying training infrastructure. You and your customers can also maintain training algorithms, training datasets, and training containers in managed data lakes in Amazon S3 and Amazon EFS, which provide high reliability and availability. These data lakes will scale out automatically to support your business growth while minimizing maintenance efforts. Additionally, you can use Amazon SQS to orchestrate your overall model training pipelines by integrating data flows and workflows and notifying customers to manage tasks and download models.
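As a rough illustration of the notification step described above, the following sketch builds the body of a message that a portal backend might place on an Amazon SQS queue when a training job finishes. The function name `build_job_notification`, the field names, and the idea of one message per job are assumptions made for this example and are not part of the Guidance; only the status strings follow the terminal values SageMaker reports for a training job.

```python
import json
from datetime import datetime, timezone

# Terminal training job statuses as reported by SageMaker.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def build_job_notification(tenant_id: str, job_name: str, status: str,
                           model_s3_uri: str) -> str:
    """Return a JSON string suitable for use as an SQS message body.

    The message schema here is hypothetical; adapt it to your own portal.
    """
    if status not in TERMINAL_STATUSES:
        raise ValueError(f"status must be one of {sorted(TERMINAL_STATUSES)}")
    event = {
        "tenant_id": tenant_id,
        "job_name": job_name,
        "status": status,
        # Only a completed job has a model artifact to download.
        "model_s3_uri": model_s3_uri if status == "Completed" else None,
        "sent_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event, sort_keys=True)


body = build_job_notification(
    "tenant-a", "job-001", "Completed",
    "s3://example-bucket/tenants/tenant-a/job-001/model.tar.gz",
)
# With boto3 installed and credentials configured, the body could be sent with:
#   boto3.client("sqs").send_message(QueueUrl=queue_url, MessageBody=body)
```

Keeping the message a small, self-describing JSON document lets the consumer that notifies customers stay independent of the training code.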
Security
This Guidance applies a data separation policy to limit users’ access to sensitive data, and all core services are located in private subnets, limiting public internet access. For example, your administrator can access the bastion host and critical training data only through an AWS Virtual Private Network (AWS VPN) connection. Security group rules for Amazon EFS help you limit access to just your platform administrator, so end customers cannot access critical training source data directly. Additionally, by setting an access policy for Amazon S3, you can enable individual customers to upload and download their own data without accessing or impacting other customers’ data.
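One common way to express this kind of per-customer separation in Amazon S3 is an IAM policy scoped to a key prefix. The sketch below assumes a hypothetical layout in which each customer's objects live under `tenants/<tenant_id>/` in a shared bucket; the bucket name, prefix layout, and function name are illustrative assumptions, not details taken from this Guidance.

```python
import json
import re


def build_tenant_s3_policy(bucket: str, tenant_id: str) -> dict:
    """Build an IAM policy document limiting a tenant to its own S3 prefix.

    Assumes a hypothetical key layout of tenants/<tenant_id>/... in one bucket.
    """
    # Reject IDs containing wildcards or slashes that could widen the scope.
    if not re.fullmatch(r"[A-Za-z0-9-]+", tenant_id):
        raise ValueError("tenant_id may contain only letters, digits, and hyphens")
    prefix = f"tenants/{tenant_id}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so restrict it by prefix.
                "Sid": "ListOwnPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Upload and download are object-level actions.
                "Sid": "ReadWriteOwnObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = build_tenant_s3_policy("example-training-bucket", "tenant-a")
print(json.dumps(policy, indent=2))
```

Because every statement names the tenant's prefix explicitly, a customer holding this policy has no permission on objects under another customer's prefix.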
Reliability
Users log in to your web portal through Application Load Balancer, which distributes traffic to target compute instances. When Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling detects an unhealthy instance, it terminates that instance and launches a replacement so that the portal remains available. This Guidance also uses Amazon S3, which is designed for high availability and reliability, to store your customers' model artifacts.
Performance Efficiency
SageMaker helps you train and tune models at scale without the need to manage infrastructure. When your customers submit training jobs from the portal, SageMaker distributes the training workload across the resources you define for the job, such as the instance type and instance count. Choosing appropriate resources can reduce the time and cost needed for customers to finish critical training jobs.
Cost Optimization
Model training resources are provisioned only as needed when your customers submit training jobs. You can also apply Amazon SageMaker Savings Plans to reduce costs for ML training. By defining the SageMaker ResourceConfig for each job to select appropriate ML instances and storage volumes, your customers can adjust their resources as needs change over the model life cycle. Additionally, by using AWS Auto Scaling, your platform can automatically provision additional instances to handle unexpected training portal workloads and scale back down during lower demand so that you don’t need to host idle instances. AWS Auto Scaling can be combined with Amazon EC2 Auto Scaling to scale additional resources.
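To make the role of ResourceConfig concrete, here is a minimal sketch that assembles the request parameters for the SageMaker `CreateTrainingJob` API. The helper name `build_training_job_request`, the default instance type, volume size, and runtime limit, and the example ARN and URIs are all placeholder assumptions for illustration; the Guidance does not prescribe specific values.

```python
def build_training_job_request(job_name: str, image_uri: str, role_arn: str,
                               output_s3_uri: str,
                               instance_type: str = "ml.m5.xlarge",
                               instance_count: int = 1,
                               volume_size_gb: int = 50,
                               max_runtime_seconds: int = 3600) -> dict:
    """Assemble keyword arguments for SageMaker CreateTrainingJob.

    Default sizes are illustrative assumptions; choose them per workload.
    """
    if instance_count < 1 or volume_size_gb < 1:
        raise ValueError("instance_count and volume_size_gb must be positive")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # ResourceConfig is where instance and storage sizing is declared.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_size_gb,
        },
        # Capping runtime bounds the cost of a job that fails to converge.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


request = build_training_job_request(
    job_name="tenant-a-job-001",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-train:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    output_s3_uri="s3://example-training-bucket/tenants/tenant-a/output/",
)
# With boto3 installed and credentials configured, the job could be started with:
#   boto3.client("sagemaker").create_training_job(**request)
```

Because instances are named per job, a small experiment and a large production run can request different sizes, and nothing stays provisioned after the job ends.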
Sustainability
By adopting serverless infrastructure and managed services, you can avoid overprovisioning training and storage resources, reducing your carbon footprint. For example, SageMaker and AWS Auto Scaling will use only the compute resources needed to run training jobs and the training portal, helping you minimize provisioned compute resources. Additionally, you can use both Amazon S3 and Amazon EFS as data lakes for your training data sources. These services offer various storage classes to help you avoid overprovisioning storage capacity.