[SEO Subhead]
This Guidance demonstrates how to build a secure and flexible multi-tenant artificial intelligence (AI) training platform for smart products. Customers can use your platform's training data and machine learning (ML) models in combination with their own data, on shared infrastructure that promotes agility and cost-efficiency. By building a multi-tenant training environment on AWS, you can safeguard your platform's data, algorithms, and services from unauthorized access while enabling customers to securely maintain separate datasets. Your platform can then orchestrate automated model training pipelines that integrate data and workflows, helping your customers achieve faster time-to-market.
Please note: [Disclaimer]
Architecture Diagram

[Architecture diagram description]
Step 1
The platform owner builds a custom model using a “Bring-Your-Own-Model” approach and registers the model's training container image in Amazon Elastic Container Registry (Amazon ECR).
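As a hedged sketch of this step, the following Python snippet creates an Amazon ECR repository with boto3 and prints the URI to push a locally built training image to. The repository name, AWS Region, and image tag are illustrative assumptions, and the Docker build and push commands are assumed to run separately against your own Dockerfile.

# Sketch: create an ECR repository for a "Bring-Your-Own-Model" training image
# and print the URI to tag and push the locally built image to.
# Repository name and Region are illustrative placeholders.
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")

response = ecr.create_repository(
    repositoryName="smart-product-training",          # hypothetical repository name
    imageScanningConfiguration={"scanOnPush": True},   # scan images for vulnerabilities
    imageTagMutability="IMMUTABLE",                    # keep model image versions immutable
)

repository_uri = response["repository"]["repositoryUri"]
# After "docker build" and "docker tag <image> <repository_uri>:v1",
# push the image with "docker push <repository_uri>:v1".
print(f"Push your training image to: {repository_uri}:v1")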
Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
This Guidance helps you reduce the operational burden of your platform by using managed services. By integrating the Amazon SageMaker software development kit (SDK), you can create a web portal UI that helps your customers submit and manage training jobs without worrying about the underlying training infrastructure. You and your customers can also maintain training algorithms, training datasets, and training containers in managed data lakes on Amazon Simple Storage Service (Amazon S3) and Amazon Elastic File System (Amazon EFS), which provide high reliability and availability. These data lakes scale automatically to support your business growth while minimizing maintenance effort. Additionally, you can use Amazon Simple Queue Service (Amazon SQS) to orchestrate your overall model training pipelines by integrating data flows and workflows and notifying customers to manage tasks and download models.
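As a hedged illustration, a portal backend might use the SageMaker Python SDK to launch a training job against the container registered in Amazon ECR. The image URI, IAM role, bucket names, and tenant identifier below are assumptions, not values from this Guidance.

# Sketch: submit a SageMaker training job from a portal backend using the
# SageMaker Python SDK. Image URI, IAM role, and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
tenant_id = "customer-a"  # hypothetical tenant identifier supplied by the portal

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/smart-product-training:v1",
    role="arn:aws:iam::123456789012:role/SageMakerTrainingRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://example-model-artifacts/{tenant_id}/",  # per-tenant output prefix
    sagemaker_session=session,
)

# Launch asynchronously so the portal can return immediately and track status later.
estimator.fit(
    inputs={"training": f"s3://example-tenant-data/{tenant_id}/training/"},
    wait=False,
)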
-
Security
This Guidance applies a data separation policy to limit users’ access to sensitive data, and all core services are located in private subnets, limiting public internet access. For example, your administrator can access the bastion host and critical training data only through an AWS Virtual Private Network (AWS VPN) connection. Security group rules for Amazon EFS help you limit access to just your platform administrator, so end customers cannot access critical training source data directly. Additionally, by setting an access policy for Amazon S3, you can enable individual customers to upload and download their own data without accessing or impacting other customers’ data.
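One of several ways to implement the S3 access policy described above is an IAM policy scoped to a tenant-specific prefix, attached to the role a customer assumes through the portal. The following is a minimal sketch; the bucket name, prefix layout, tenant identifier, and policy name are assumptions.

# Sketch: create an IAM policy that limits a tenant to its own S3 prefix.
# Bucket name, prefix layout, and policy name are illustrative placeholders.
import json
import boto3

iam = boto3.client("iam")
bucket = "example-tenant-data"
tenant_id = "customer-a"

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow listing only within the tenant's own prefix.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {"StringLike": {"s3:prefix": [f"{tenant_id}/*"]}},
        },
        {
            # Allow uploads and downloads only under the tenant's own prefix.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{tenant_id}/*",
        },
    ],
}

iam.create_policy(
    PolicyName=f"{tenant_id}-s3-access",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)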
-
Reliability
Users log in to your web portal through an Application Load Balancer, which distributes traffic to target compute instances. When Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling detects an unhealthy instance, it terminates that instance and launches a new one so that the service continues without interruption. This Guidance also uses Amazon S3, which is designed for high availability and reliability, and customers can use it to store model artifacts.
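For illustration, an EC2 Auto Scaling group for the portal instances that replaces unhealthy targets based on load balancer health checks might be configured as in the sketch below. The launch template name, target group ARN, subnet IDs, and capacity values are placeholders.

# Sketch: create an EC2 Auto Scaling group for the web portal that replaces
# unhealthy instances based on load balancer health checks.
# Launch template, target group ARN, and subnets are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="training-portal-asg",  # hypothetical group name
    LaunchTemplate={"LaunchTemplateName": "training-portal", "Version": "$Latest"},
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0123456789abcdef0,subnet-0fedcba9876543210",
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/portal/abc123"
    ],
    HealthCheckType="ELB",        # use Application Load Balancer health checks
    HealthCheckGracePeriod=300,   # give new instances time to boot before checks
)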
-
Performance Efficiency
SageMaker helps you train and tune models at scale without the need to manage infrastructure. When your customers submit training jobs from the portal, SageMaker distributes the training workload across the training resources defined for each job, which can reduce the time and cost needed for customers to finish critical training jobs.
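As a hedged sketch of tuning at scale, the SageMaker Python SDK's HyperparameterTuner can run several training jobs in parallel for a custom container. The estimator settings, objective metric name, regular expression, and hyperparameter range below are assumptions for illustration only.

# Sketch: tune a custom-container model at scale with SageMaker automatic model
# tuning. Image URI, role, metric definition, and ranges are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/smart-product-training:v1",
    role="arn:aws:iam::123456789012:role/SageMakerTrainingRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-model-artifacts/customer-a/",
    sagemaker_session=sagemaker.Session(),
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-4, 1e-1)},
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "accuracy=([0-9\\.]+)"}
    ],
    max_jobs=10,          # total training jobs to run
    max_parallel_jobs=2,  # jobs run concurrently to shorten tuning time
)

tuner.fit({"training": "s3://example-tenant-data/customer-a/training/"}, wait=False)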
-
Cost Optimization
Model training resources are provisioned only as needed, when your customers submit training jobs. You can also apply Amazon SageMaker Savings Plans to reduce costs for ML training. By defining a SageMaker ResourceConfig that specifies appropriate ML instance types and storage volumes, your customers can manage their resources dynamically across model life cycles. Additionally, by using AWS Auto Scaling, your platform can automatically provision additional instances to handle unexpected training portal workloads and scale back down during lower demand, so you don't need to host idle instances. AWS Auto Scaling can be combined with Amazon EC2 Auto Scaling to scale additional resources.
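The sketch below shows where ResourceConfig fits when a job is submitted through the low-level SageMaker API. The job name, image URI, role, instance type, volume size, runtime cap, and S3 paths are illustrative assumptions.

# Sketch: submit a training job through the low-level SageMaker API with an
# explicit ResourceConfig. Job name, image, role, and S3 paths are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_training_job(
    TrainingJobName="customer-a-smart-product-001",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/smart-product-training:v1",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerTrainingRole",
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",  # right-size the instance for the workload
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,            # only the storage the job actually needs
    },
    OutputDataConfig={"S3OutputPath": "s3://example-model-artifacts/customer-a/"},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},  # cap runtime to cap cost
)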
-
Sustainability
By adopting serverless infrastructure and managed services, you can avoid overprovisioning training and storage resources, reducing your carbon footprint. For example, SageMaker and AWS Auto Scaling will use only the compute resources needed to run training jobs and the training portal, helping you minimize provisioned compute resources. Additionally, you can use both Amazon S3 and Amazon EFS as data lakes for your training data sources. These services offer various storage classes to help you avoid overprovisioning storage capacity.
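As a hedged example of applying storage classes, an S3 lifecycle configuration could transition older training data to lower-cost tiers. The bucket name, prefix, and transition ages below are assumptions.

# Sketch: add a lifecycle rule that moves older training data to lower-cost
# storage classes. Bucket, prefix, and transition ages are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-tenant-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-stale-training-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "customer-a/training/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                    {"Days": 180, "StorageClass": "GLACIER"},     # long-term archive
                ],
            }
        ]
    },
)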
Implementation Resources

A detailed guide is provided for you to experiment with and use within your AWS account. It examines each stage of building the Guidance, including deployment, usage, and cleanup, to prepare it for deployment.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content

[Title]
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.