[SEO Subhead]
This Guidance demonstrates how to build a secure and flexible multi-tenant artificial intelligence (AI) training platform for smart products. Customers can use your platform's training data and machine learning (ML) models in combination with their own data, on shared infrastructure that promotes agility and cost-efficiency. By building a multi-tenant training environment on AWS, you can safeguard your platform's data, algorithms, and services from unauthorized access while enabling customers to securely maintain separate datasets. Your platform can then orchestrate automated model training pipelines that integrate data and workflows, helping your customers achieve faster time-to-market.
Please note: [Disclaimer]
Architecture Diagram

[Architecture diagram description]
Step 1
The platform owner builds a custom model using a “Bring-Your-Own-Model” approach and registers the model's training container image in Amazon Elastic Container Registry (Amazon ECR).
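As a hedged sketch of this step, the following Python snippet creates an Amazon ECR repository with boto3 and prints the URI to push a locally built training image to. The repository name, AWS Region, and image tag are illustrative assumptions, and the Docker build and push commands are assumed to run separately against your own Dockerfile.

# Sketch: create an ECR repository for a "Bring-Your-Own-Model" training image
# and print the URI to tag and push the locally built image to.
# Repository name and Region are illustrative placeholders.
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")

response = ecr.create_repository(
    repositoryName="smart-product-training",          # hypothetical repository name
    imageScanningConfiguration={"scanOnPush": True},   # scan images for vulnerabilities
    imageTagMutability="IMMUTABLE",                    # keep model image versions immutable
)

repository_uri = response["repository"]["repositoryUri"]
# After "docker build" and "docker tag <image> <repository_uri>:v1",
# push the image with "docker push <repository_uri>:v1".
print(f"Push your training image to: {repository_uri}:v1")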
Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
This Guidance helps you reduce the operational burden of your platform by using managed services. By integrating the Amazon SageMaker software development kit (SDK), you can create a web portal UI that helps your customers submit and manage training jobs without worrying about the underlying training infrastructure. You and your customers can also maintain training algorithms, training datasets, and training containers in managed data lakes on Amazon Simple Storage Service (Amazon S3) and Amazon Elastic File System (Amazon EFS), which provide high reliability and availability. These data lakes scale automatically to support your business growth while minimizing maintenance effort. Additionally, you can use Amazon Simple Queue Service (Amazon SQS) to orchestrate your overall model training pipelines by integrating data flows and workflows and notifying customers to manage tasks and download models.
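As a hedged illustration, a portal backend might use the SageMaker Python SDK to launch a training job against the container registered in Amazon ECR. The image URI, IAM role, bucket names, and tenant identifier below are assumptions, not values from this Guidance.

# Sketch: submit a SageMaker training job from a portal backend using the
# SageMaker Python SDK. Image URI, IAM role, and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
tenant_id = "customer-a"  # hypothetical tenant identifier supplied by the portal

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/smart-product-training:v1",
    role="arn:aws:iam::123456789012:role/SageMakerTrainingRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://example-model-artifacts/{tenant_id}/",  # per-tenant output prefix
    sagemaker_session=session,
)

# Launch asynchronously so the portal can return immediately and track status later.
estimator.fit(
    inputs={"training": f"s3://example-tenant-data/{tenant_id}/training/"},
    wait=False,
)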
-
Security
This Guidance applies a data separation policy to limit users’ access to sensitive data, and all core services are located in private subnets, limiting public internet access. For example, your administrator can access the bastion host and critical training data only through an AWS Virtual Private Network (AWS VPN) connection. Security group rules for Amazon EFS help you limit access to just your platform administrator, so end customers cannot access critical training source data directly. Additionally, by setting an access policy for Amazon S3, you can enable individual customers to upload and download their own data without accessing or impacting other customers’ data.
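One of several ways to implement the S3 access policy described above is an IAM policy scoped to a tenant-specific prefix, attached to the role a customer assumes through the portal. The following is a minimal sketch; the bucket name, prefix layout, tenant identifier, and policy name are assumptions.

# Sketch: create an IAM policy that limits a tenant to its own S3 prefix.
# Bucket name, prefix layout, and policy name are illustrative placeholders.
import json
import boto3

iam = boto3.client("iam")
bucket = "example-tenant-data"
tenant_id = "customer-a"

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow listing only within the tenant's own prefix.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {"StringLike": {"s3:prefix": [f"{tenant_id}/*"]}},
        },
        {
            # Allow uploads and downloads only under the tenant's own prefix.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{tenant_id}/*",
        },
    ],
}

iam.create_policy(
    PolicyName=f"{tenant_id}-s3-access",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)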
-
Reliability
Users log in to your web portal through an Application Load Balancer, which distributes traffic to target compute instances. When Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling detects an unhealthy instance, it terminates that instance and launches a new one so that the service continues without interruption. This Guidance also uses Amazon S3, which is designed for high availability and reliability, and customers can use it to store model artifacts.
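For illustration, an EC2 Auto Scaling group for the portal instances that replaces unhealthy targets based on load balancer health checks might be configured as in the sketch below. The launch template name, target group ARN, subnet IDs, and capacity values are placeholders.

# Sketch: create an EC2 Auto Scaling group for the web portal that replaces
# unhealthy instances based on load balancer health checks.
# Launch template, target group ARN, and subnets are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="training-portal-asg",  # hypothetical group name
    LaunchTemplate={"LaunchTemplateName": "training-portal", "Version": "$Latest"},
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0123456789abcdef0,subnet-0fedcba9876543210",
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/portal/abc123"
    ],
    HealthCheckType="ELB",        # use Application Load Balancer health checks
    HealthCheckGracePeriod=300,   # give new instances time to boot before checks
)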
-
Performance Efficiency
SageMaker helps you train and tune models at scale without the need to manage infrastructure. When your customers submit training jobs from the portal, SageMaker distributes the training workload across the training resources defined for each job, which can reduce the time and cost needed for customers to finish critical training jobs.
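As a hedged sketch of tuning at scale, the SageMaker Python SDK's HyperparameterTuner can run several training jobs in parallel for a custom container. The estimator settings, objective metric name, regular expression, and hyperparameter range below are assumptions for illustration only.

# Sketch: tune a custom-container model at scale with SageMaker automatic model
# tuning. Image URI, role, metric definition, and ranges are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/smart-product-training:v1",
    role="arn:aws:iam::123456789012:role/SageMakerTrainingRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-model-artifacts/customer-a/",
    sagemaker_session=sagemaker.Session(),
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-4, 1e-1)},
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "accuracy=([0-9\\.]+)"}
    ],
    max_jobs=10,          # total training jobs to run
    max_parallel_jobs=2,  # jobs run concurrently to shorten tuning time
)

tuner.fit({"training": "s3://example-tenant-data/customer-a/training/"}, wait=False)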
-
Cost Optimization
Model training resources are provisioned only as needed, when your customers submit training jobs. You can also apply Amazon SageMaker Savings Plans to reduce costs for ML training. By defining a SageMaker ResourceConfig that specifies appropriate ML instance types and storage volumes, your customers can manage their resources dynamically across model life cycles. Additionally, by using AWS Auto Scaling, your platform can automatically provision additional instances to handle unexpected training portal workloads and scale back down during lower demand, so you don't need to host idle instances. AWS Auto Scaling can be combined with Amazon EC2 Auto Scaling to scale additional resources.
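The sketch below shows where ResourceConfig fits when a job is submitted through the low-level SageMaker API. The job name, image URI, role, instance type, volume size, runtime cap, and S3 paths are illustrative assumptions.

# Sketch: submit a training job through the low-level SageMaker API with an
# explicit ResourceConfig. Job name, image, role, and S3 paths are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_training_job(
    TrainingJobName="customer-a-smart-product-001",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/smart-product-training:v1",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerTrainingRole",
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",  # right-size the instance for the workload
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,            # only the storage the job actually needs
    },
    OutputDataConfig={"S3OutputPath": "s3://example-model-artifacts/customer-a/"},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},  # cap runtime to cap cost
)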
-
Sustainability
By adopting serverless infrastructure and managed services, you can avoid overprovisioning training and storage resources, reducing your carbon footprint. For example, SageMaker and AWS Auto Scaling will use only the compute resources needed to run training jobs and the training portal, helping you minimize provisioned compute resources. Additionally, you can use both Amazon S3 and Amazon EFS as data lakes for your training data sources. These services offer various storage classes to help you avoid overprovisioning storage capacity.
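As a hedged example of applying storage classes, an S3 lifecycle configuration could transition older training data to lower-cost tiers. The bucket name, prefix, and transition ages below are assumptions.

# Sketch: add a lifecycle rule that moves older training data to lower-cost
# storage classes. Bucket, prefix, and transition ages are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-tenant-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-stale-training-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "customer-a/training/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                    {"Days": 180, "StorageClass": "GLACIER"},     # long-term archive
                ],
            }
        ]
    },
)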
Implementation Resources

A detailed guide is provided for you to experiment with and use within your AWS account. It examines each stage of building the Guidance, including deployment, usage, and cleanup, to prepare it for deployment.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content

[Title]
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.