AWS for Industries
How Volkswagen and AWS built end-to-end MLOps for Digital Production Platform
Background
In 2019, Volkswagen AG (VW) and Amazon Web Services (AWS) formed a strategic collaboration to develop the Digital Production Platform (DPP), designed to enhance VW production and logistics efficiency by up to 30 percent while reducing production costs by the same margin. The DPP streamlines data access from VW shop-floor devices and manufacturing systems, making it simpler to build and deploy new applications up to 2-3x faster, while reducing experimentation costs and facilitating use case sharing across VW. A key solution deployed was the end-to-end MLOps pipeline for machine learning (ML) use cases. This blog outlines the architecture used to streamline the end-to-end ML lifecycle, shares best practices, and shows how to implement a similar MLOps solution in your organization.
Teams across VW plants and brands implemented over 100 use cases, the majority of them ML-based solutions in the areas of predictive maintenance, quality, and process optimization. For example, predictive maintenance for robot welding guns involved sensors detecting mechanical or welding circuit failures; ML models were required to proactively predict faults across thousands of guns at different plants. Multiple data science teams at VW worked on production-ready ML solutions, adhering to VW’s strict security standards. However, this decentralized approach revealed several challenges:
- Inconsistent Development Approaches: Each plant developed solutions independently, resulting in fragmented ML operations and a lack of unified MLOps strategy. This led to a patchwork of solutions rather than a cohesive framework.
- Development Inefficiencies: Redundant efforts arose from VW teams recreating similar infrastructure components, each requiring separate VW security reviews, which increased complexity.
- Time and Resource Impact: Initial deployments required two full-time employees working for two months per workstream. Onboarding new members and completing security reviews also took longer due to unique implementations.
- Process Management Issues: Without standardized processes, VW teams struggled with model lifecycle management, traceability, and version control, impacting transparency and accountability.
- Quality and Maintenance Challenges: Diverse implementations led to inconsistent quality, strained resources, and varying testing standards, complicating the adoption of best practices.
These challenges led to financial implications, delayed time-to-market, increased maintenance costs, and added security risks. Knowledge sharing became difficult, resulting in duplication of efforts, undifferentiated heavy lifting, as well as additional operational and maintenance overhead for custom solutions. To address these challenges, VW collaborated with AWS Professional Services to build a more secure, scalable MLOps solution for industrial ML use cases deployed on the DPP.
MLOps Architecture
The architecture implemented at VW demonstrates how MLOps can automate the steps of the ML lifecycle, creating an efficient framework for managing machine learning from experimentation to production. To explore further, check out this blog post on the MLOps foundation roadmap for enterprises with Amazon SageMaker.
Figure 1 – MLOps multi-account architecture
A multi-account strategy helps manage multiple models. Figure 1 shows the MLOps architecture, which spans six VW AWS accounts. Here’s how each account functions:
- Data account: This account serves as a centralized hub for data management, overseeing all data ingestion from sources such as on-premises systems or other environments to the cloud. Administrators centrally control and restrict access to specific data columns to meet use case requirements, ensuring compliance through anonymization when necessary. For insights on how VW manages data access and governance using Amazon DataZone, refer to the blog post.
- EXP (Experimentation) account: This account provides a dedicated environment for the VW data science team to perform data exploration, model experimentation and training. The EXP account deploys all resources in an isolated VPC with no egress internet access. To enable the use of third-party libraries, an AWS CodeArtifact repository provides secure access to public repositories like PyPI. Data scientists commit code changes to a dedicated AWS CodeCommit repository (or other Git providers like GitLab, as AWS CodeCommit access for new customers has ended). When training or inference requires custom container images, data scientists commit code to a repository. A CI/CD pipeline then scans, tests, and builds the images before publishing them to a central Amazon ECR registry in the RES account.
- RES (Resources) account: This account manages all infrastructure and ML model deployments. It hosts the central CodeCommit repository for Infrastructure as Code (IaC) and the AWS CodePipeline CI/CD pipeline, enabling deployments across RES, EXP, DEV, INT, and PROD accounts. Additionally, this account hosts centralized Amazon ECR repositories where data scientists publish custom Docker images used in inference workflows. Lastly, it creates AWS Service Catalog products in the EXP account to deploy Model Training Pipelines and in the RES account to deploy Model Deployment Pipelines.
- DEV (Development) account: This account serves as the development environment where VW teams initially deploy ML models to Amazon SageMaker endpoints. Here, models undergo end-to-end testing by VW for both model metrics, such as model performance, and infrastructure metrics, such as response times and availability. In the DEV account, administrators manually grant access to data scientists and DevOps teams for inspecting and troubleshooting the deployment if required. Upon successful testing, a manual approval step in the CI/CD pipeline in the RES account advances the deployment to the INT stage.
- INT (Integration/Staging) account: This account functions as a staging environment for deploying the ML model to validate successful infrastructure deployments and integrations before proceeding with a deployment into the PROD environment. Unlike the DEV environment, deployments in the INT account can only be accessed with read-only permissions. After all tests pass, the DevOps team provides manual approval through the CI/CD pipeline in the RES account to deploy the model to production.
- PROD (Production) account: This account hosts the production version of the ML model on an Amazon SageMaker endpoint. In the production environment, you can configure the SageMaker endpoint with auto scaling to automatically scale the endpoint up or down based on demand, as shown in the sketch after this list.
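The following is a minimal sketch of how such endpoint auto scaling can be configured with Application Auto Scaling via boto3. The endpoint and variant names, capacity limits, and target value are hypothetical placeholders, not values from the VW setup.

```python
import boto3

# Hypothetical endpoint and variant names; replace with your own.
ENDPOINT_NAME = "prod-ml-model-endpoint"
VARIANT_NAME = "AllTraffic"
resource_id = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"

autoscaling = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target (1 to 4 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance using a target tracking policy.
autoscaling.put_scaling_policy(
    PolicyName="InvocationsPerInstanceTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```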
MLOps Flow for Data Scientists
The machine learning lifecycle is an iterative process, starting with identifying a business problem and determining whether ML is the appropriate solution. Once confirmed, the process involves framing the ML problem, followed by the data stage, where data engineers collect, explore, prepare, and analyze data through visualization and analysis. Next is feature engineering, where techniques such as encoding, normalization, and handling missing values are applied. This is followed by model development, which includes selecting an appropriate algorithm, training the model, tuning hyperparameters, and evaluating performance using predefined metrics. Once the model meets the desired performance criteria, it is deployed into production. Over time, model performance may degrade, necessitating continuous monitoring, debugging, retraining, and redeployment to maintain model effectiveness. The following steps describe the user flow for a data scientist using the MLOps solution across the various accounts.
- Data Ingestion and Data Preparation – Data engineers create extract, transform, and load (ETL) pipelines combining multiple data sources and prepare the necessary datasets for the ML use cases in the DATA account. The data is cataloged using the AWS Glue Data Catalog and shared with other users and accounts via AWS Lake Formation for governance. Data scientists are granted secure access to specific datasets from the DATA account.
- Data Exploration and Model Development – Each data scientist receives a dedicated Amazon SageMaker Studio user profile with an IAM role and Security Group, to access their SageMaker Studio domain and specific datasets in Amazon S3. In their individual workspaces, data scientists conduct tasks such as data exploration, model training, hyperparameter tuning, data processing, and model evaluation, using Jupyter Notebooks or SageMaker services. This can be extended with Amazon SageMaker Feature Store for feature reuse. For more information, refer to Enable feature reuse across accounts and teams using Amazon SageMaker Feature Store.
- Model Training and Model Retraining – After experimentation, data scientists launch the “Model Building Product” from AWS Service Catalog. This initiates a CloudFormation stack to set up a SageMaker Pipeline for orchestrating tasks like data processing, training, and evaluation (see the pipeline sketch after this list). Successfully trained and evaluated ML models are registered in the SageMaker Model Registry, which maintains version history and deployment metadata, such as container images and artifact locations. For subsequent model retraining, data scientists trigger the Training Pipeline, which registers a new model version in the model registry upon successful execution. When a new model version is registered, an Amazon EventBridge event is triggered and sent to the RES account, initiating the deployment process.
- Model Deployment and Model Redeployment – To create a model deployment pipeline, the DevOps engineer launches the “Model Deployment” product from the Service Catalog, referencing the trained model in the Model Registry (EXP account). This product provisions a CodeCommit repository for IaC, a CodePipeline, and an EventBridge rule that listens for “new model version” events from the EXP account (see the rule sketch after this list). The CI/CD pipeline is triggered by changes in the CodeCommit repository or by incoming events from EventBridge. It queries the Model Registry for the latest version and deploys the model and resources to the DEV stage. After manual approvals, the model progresses through the INT and PROD stages.
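To illustrate the training flow, the following is a minimal sketch of a SageMaker Pipeline that trains a model and registers it in a model package group, using the SageMaker Python SDK. The role ARN, image URI, S3 paths, and model package group name are hypothetical placeholders, not the actual resources created by the VW Service Catalog product.

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.step_collections import RegisterModel

# Hypothetical values; in the VW setup these would come from the
# Service Catalog product's CloudFormation parameters.
role = "arn:aws:iam::111111111111:role/sagemaker-pipeline-role"
image_uri = "111111111111.dkr.ecr.eu-west-1.amazonaws.com/training:latest"
train_s3_uri = "s3://exp-bucket/prepared/train/"

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://exp-bucket/model-artifacts/",
)

# Step 1: train the model on the prepared dataset.
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data=train_s3_uri)},
)

# Step 2: register the trained model as a new version in the model registry.
register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="weld-gun-fault-detection",
    approval_status="PendingManualApproval",
)

pipeline = Pipeline(name="ModelBuildingPipeline", steps=[train_step, register_step])
pipeline.upsert(role_arn=role)
pipeline.start()
```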
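And here is a minimal, single-account sketch of an EventBridge rule that reacts to new model package versions and starts the deployment pipeline. The rule name, model package group, and target ARNs are hypothetical; in the actual solution the rule is provisioned by the “Model Deployment” product and receives events forwarded cross-account from the EXP account.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical model package group monitored by the deployment pipeline.
MODEL_PACKAGE_GROUP = "weld-gun-fault-detection"

# Match state-change events for new model versions in the group.
pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelPackageGroupName": [MODEL_PACKAGE_GROUP]},
}

events.put_rule(
    Name="new-model-version",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Start the CodePipeline deployment pipeline when the rule matches.
events.put_targets(
    Rule="new-model-version",
    Targets=[{
        "Id": "StartDeploymentPipeline",
        "Arn": "arn:aws:codepipeline:eu-west-1:222222222222:model-deployment-pipeline",
        "RoleArn": "arn:aws:iam::222222222222:role/eventbridge-codepipeline-role",
    }],
)
```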
Benefits
The new MLOps pipeline delivers several key benefits:
- Standardization: By replacing multiple custom solutions with a unified framework, VW eliminated redundant development efforts and established consistent practices across ML operations.
- Operational Efficiency: A structured account architecture, spanning different environments, provides clear separation of concerns and streamlines the entire ML lifecycle from experimentation to deployment.
- Security and Governance: Built-in security guardrails, including dedicated IAM roles, isolated VPCs, and encrypted communications, helps ensure that ML operations meet enterprise security standards while maintaining operational flexibility.
- Scalability: The solution currently supports 8 use cases across 5 plants, serving 16 data scientists, with the architecture designed to accommodate future growth and additional use cases.
- Reduced Time-to-Market: A standardized, automated pipeline now accomplishes in days what previously required two full-time employees working for two months per workstream, significantly accelerating model deployment.
Open-Source Repository
The GitHub repository provides a solution template to deploy the MLOps infrastructure discussed in this blog post. This solution deploys a total of 13 configurable AWS CDK stacks across five AWS accounts, enabling you to quickly bootstrap the MLOps platform. The template is flexible and can be customized to meet your specific requirements, such as adding Service Catalog Products to refine Model Training and Deployment workflows.
For deployment, you will require access to the five AWS accounts for the following environments:
- EXP (Experimentation)
- RES (Resources)
- DEV (Development)
- INT (Integration)
- PROD (Production)
For prerequisites and deployment instructions, refer to the README.md file in the repository.
Possible extensions
The following extensions can enhance the MLOps solution to meet specific use case requirements:
Batch Processing
While online inference provides real-time predictions with low latency, batch inference is ideal for scenarios where data arrives in bulk at regular intervals and immediate results are not required, such as periodic scoring jobs. Using Amazon SageMaker, you can perform batch inference by running a batch transform job on a SageMaker Model or by orchestrating a batch inference workflow with AWS Step Functions. Results are automatically stored in Amazon S3.
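As a minimal sketch, a batch transform job can be started with the SageMaker Python SDK as shown below. The model name, S3 paths, and instance settings are hypothetical placeholders.

```python
from sagemaker.transformer import Transformer

# Hypothetical model and bucket names; replace with your own.
transformer = Transformer(
    model_name="weld-gun-fault-detection-v3",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://prod-bucket/batch-predictions/",
    strategy="MultiRecord",
    assemble_with="Line",
)

# Score a day's worth of sensor readings; results are written back
# to the S3 output path automatically when the job completes.
transformer.transform(
    data="s3://prod-bucket/batch-input/2024-01-15/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```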
API Gateway
To serve models through custom API endpoints, use Amazon API Gateway, a fully managed service to create, publish, maintain, monitor, and secure APIs at scale. Requests are routed through API Gateway to AWS Lambda, which invokes the SageMaker endpoint and returns responses to the API Gateway for testing and serving predictions.
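A minimal sketch of such a Lambda handler, assuming a Lambda proxy integration and a hypothetical endpoint name passed through an environment variable, could look like the following.

```python
import os
import boto3

# Hypothetical environment variable configured on the Lambda function.
ENDPOINT_NAME = os.environ.get("SAGEMAKER_ENDPOINT_NAME", "prod-ml-model-endpoint")

runtime = boto3.client("sagemaker-runtime")


def handler(event, context):
    """Proxy an API Gateway request to a SageMaker real-time endpoint."""
    payload = event.get("body", "")

    # Invoke the endpoint with the request body as-is (JSON assumed here).
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")

    # Return the prediction in the Lambda proxy integration response format.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": prediction,
    }
```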
Model Monitoring
Amazon SageMaker Model Monitor enables continuous tracking of model performance post-deployment in production. It captures input data samples and model predictions at scheduled intervals, monitoring metrics such as data quality, model quality, bias, and explainability. If Model Monitor detects any drift or degradation, it generates alerts, enabling you to take corrective actions like collecting new training data, retraining the model, or auditing upstream systems. Learn more about Monitoring in-production ML models at large scale using Amazon SageMaker Model Monitor.
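The following is a minimal data-quality monitoring sketch with the SageMaker Python SDK. It assumes data capture is already enabled on the endpoint; the role ARN, endpoint name, and S3 locations are hypothetical placeholders.

```python
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Hypothetical execution role for the monitoring jobs.
role = "arn:aws:iam::333333333333:role/model-monitor-role"

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Build a baseline from the training data; captured production data is
# later compared against the statistics and constraints produced here.
monitor.suggest_baseline(
    baseline_dataset="s3://prod-bucket/baseline/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://prod-bucket/monitoring/baseline/",
)

# Schedule hourly data-quality checks against the live endpoint.
monitor.create_monitoring_schedule(
    monitor_schedule_name="weld-gun-data-quality",
    endpoint_input="prod-ml-model-endpoint",
    output_s3_uri="s3://prod-bucket/monitoring/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```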
Security Guardrails
The VW MLOps solution follows AWS security best practices from the Security pillar of the AWS Well-Architected Framework and from the AWS whitepaper Build a Secure Enterprise Machine Learning Platform on AWS. In addition to the security features implemented in each of the VW AWS accounts, the following best practices have been applied:
1. Infrastructure protection:
- All resources are isolated within self-managed VPC subnets.
- All AWS services are accessed through dedicated VPC Service Endpoints.
2. Data protection:
- All data at rest is encrypted using customer-managed encryption keys.
- SageMaker Studio environments are encrypted at rest.
3. Identity and access management:
- Dedicated IAM roles are assigned to SageMaker Studio users, pipelines, training jobs, and model endpoints.
- IAM Roles are created using Amazon SageMaker Role Manager personas to enforce least-privilege access.
- Explicit IAM deny policies restrict model training or model deployment outside the VPC or without encryption (see the policy sketch after this list).
4. Detection:
- Regular security audits are conducted to identify and mitigate vulnerabilities.
- AWS Config is used to track configuration changes and ensure compliance with security policies.
- AWS Security Hub aggregates security findings from multiple AWS services for centralized management and remediation. Learn more about how VW secures landing zone with automated remediation of security findings.
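As referenced in the identity and access management item above, the following is a minimal sketch of what such an explicit deny policy might look like, attached here with boto3. The role and policy names are hypothetical, and the exact condition keys and actions used in the VW solution may differ.

```python
import json
import boto3

iam = boto3.client("iam")

# Deny SageMaker training jobs that are not attached to a VPC or that
# do not specify a customer-managed volume encryption key.
deny_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyTrainingOutsideVpc",
            "Effect": "Deny",
            "Action": ["sagemaker:CreateTrainingJob"],
            "Resource": "*",
            "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
        },
        {
            "Sid": "DenyUnencryptedTraining",
            "Effect": "Deny",
            "Action": ["sagemaker:CreateTrainingJob"],
            "Resource": "*",
            "Condition": {"Null": {"sagemaker:VolumeKmsKey": "true"}},
        },
    ],
}

# Hypothetical role name for the data science persona.
iam.put_role_policy(
    RoleName="sagemaker-data-scientist-role",
    PolicyName="sagemaker-guardrails",
    PolicyDocument=json.dumps(deny_policy),
)
```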
Conclusion
The collaboration between VW and AWS successfully transformed a fragmented MLOps landscape into a more standardized, efficient, and secure ML production pipeline. By implementing a comprehensive MLOps solution built on Amazon SageMaker, VW addressed the challenges of decentralized development and established a more streamlined, scalable, and secure ML lifecycle through a multi-account MLOps architecture. This implementation may serve as a blueprint for other enterprises looking to standardize their MLOps practices at scale. If you are interested in exploring similar solutions or need guidance in building your own MLOps solution, visit the AWS for automotive page, or contact your AWS team today.