AWS Machine Learning Infrastructure

High performance, cost-effective, scalable infrastructure for every workload

More machine learning happens on AWS than anywhere else

More customers, across a diverse set of industries, choose AWS over any other cloud to build, train, and deploy their machine learning (ML) applications. AWS delivers the broadest choice of powerful compute, high-speed networking, and scalable, high-performance storage options for any ML project or application.

Every ML project is different, and with AWS, you can customize your infrastructure to fit your performance and budget requirements. From using the ML framework that works best for your team, to selecting the right hardware platform to host your ML models, AWS offers a broad choice of services to meet your needs.

Businesses have found new ways to leverage ML for recommendation engines, object detection, voice assistants, fraud detection, and more. Although the use of ML is gaining traction, training and deploying ML models is expensive, model development time is long, and procuring the right amount of infrastructure to meet changing business conditions can be challenging. AWS ML infrastructure services remove the barriers to adoption of ML by being high performing, cost-effective, and highly flexible.


Choose From a Broad Set of Machine Learning Services

The graphic below illustrates the depth and breadth of services that AWS offers. Workflow services, shown in the top layer, make it easier for you to manage and scale your underlying ML infrastructure. The next layer highlights that AWS ML infrastructure supports all of the leading ML frameworks. The bottom layer shows examples of compute, networking, and storage services that constitute the foundational blocks of ML infrastructure.

[Figure: AWS ML infrastructure stack. Workflow services: Amazon SageMaker, AWS Deep Learning AMIs, AWS Deep Learning Containers, AWS Batch, AWS ParallelCluster, Amazon EKS, Amazon ECS, Amazon EMR. Frameworks: TensorFlow, PyTorch, MXNet. Compute, networking, and storage: EC2 P4, EC2 P3, EC2 G4, EC2 Inf1, Elastic Inference, AWS Outposts, Elastic Fabric Adapter, Amazon S3, Amazon EBS, Amazon FSx, Amazon EFS.]

Machine Learning Infrastructure Services

Traditional ML development is a complex, expensive, and iterative process. First, you prepare example data to train a model. Next, you select the algorithm or framework you will use to build the model. Then you train the model to make predictions and tune it so that it delivers the best possible predictions. Finally, you integrate the model with your application and deploy the application on infrastructure that will scale.

  • Prepare
  • Data scientists often spend a lot of time exploring and preprocessing, or "wrangling," example data before using it for model training. To preprocess data, you typically fetch it into a repository, clean it by filtering and modifying it so that it is easier to explore, transform it into meaningful datasets by removing the parts you don't need, and label it.

    Challenge | AWS Solution | How
    Manual data labeling | Amazon Mechanical Turk | Provides an on-demand, scalable human workforce to complete tasks.
    Manual data labeling | Amazon SageMaker Ground Truth | Automates labeling by training Ground Truth on data labeled by humans so that the service learns to label data independently.
    Manage and scale data processing | Amazon SageMaker Processing | Extends a fully managed experience to data processing workloads. Connect to existing storage or file system data sources, spin up the resources required to run your job, save the output to persistent storage, and examine the logs and metrics.
    Management of large amounts of data needed to train models | Amazon EMR | Processes vast amounts of data quickly and cost-effectively at scale.
    Shared file storage of large amounts of data needed to train models | Amazon S3 | Offers global availability of long-term durable storage of data in a readily accessible get/put access format.
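
The clean, transform, and label steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not an AWS API: the record fields (`text`, `rating`), the sample rows, and the labeling rule are all hypothetical, and in practice labels would come from human annotators or a service such as Amazon SageMaker Ground Truth.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"text": "  Great product ", "rating": "5"},
    {"text": "", "rating": "4"},               # empty text: dropped during cleaning
    {"text": "Stopped working", "rating": "1"},
    {"text": "Okay overall", "rating": None},  # missing rating: dropped during cleaning
]


def clean(records):
    """Drop records with missing fields and normalize whitespace and types."""
    cleaned = []
    for record in records:
        text = (record.get("text") or "").strip()
        rating = record.get("rating")
        if text and rating is not None:
            cleaned.append({"text": text, "rating": int(rating)})
    return cleaned


def label(records, positive_threshold=4):
    """Attach a binary label using a simple, assumed rule (rating >= threshold)."""
    return [
        {**record, "label": int(record["rating"] >= positive_threshold)}
        for record in records
    ]


dataset = label(clean(raw_records))
print(dataset)
```

Only the two complete records survive cleaning, and each one carries a label that a training job could consume.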
  • Build
  • Once you have training data available, you need to choose a machine learning algorithm with a learning style that meets your needs. These algorithms can broadly be classified as supervised learning, unsupervised learning, or reinforcement learning. To help you develop your model, machine learning frameworks such as TensorFlow, PyTorch, and MXNet provide libraries and tools that make development easier.

    Challenge | AWS Solution | How
    Accessing Jupyter Notebooks | Hosted Jupyter Notebooks | Hosted Jupyter Notebooks running on an EC2 instance of your choice.
    Sharing and collaborating in Jupyter Notebooks | Amazon SageMaker Notebooks | Fully managed Jupyter notebooks that you can start working with in seconds and share with a single click. Code dependencies are automatically captured, so you can easily collaborate with others. Peers get the exact same notebook, saved in the same place.
    Algorithm creation | Amazon SageMaker Pre-Built Algorithms | High-performance, scalable machine learning algorithms, optimized for speed and accuracy, that can perform training on petabyte-scale data sets.
    Deep learning framework optimization | Amazon SageMaker | The leading frameworks are automatically configured and optimized for high performance. You don’t need to manually set up frameworks and can use them within the built-in containers.
    Getting started using multiple ML frameworks | AWS Deep Learning AMIs | Enables users to quickly launch Amazon EC2 instances pre-installed with popular deep learning frameworks and interfaces such as TensorFlow, PyTorch, and Apache MXNet.
    Getting started with containers using multiple ML frameworks | AWS Deep Learning Containers | Docker images pre-installed with deep learning frameworks to make it easy to deploy custom machine learning environments quickly.
  • Train
  • After building out your model, you need compute, networking, and storage resources for training your model. Faster model training can enable data scientists and machine learning engineers to iterate faster, train more models, and increase accuracy. After you've trained your model, you evaluate it to determine whether the accuracy of the inferences is acceptable.


    Challenge | AWS Solution | How
    Time sensitive large scale training | Amazon EC2 P4 instances | P4d instances deliver the highest performance machine learning training in the cloud with 8 NVIDIA A100 Tensor Core GPUs, 400 Gbps instance networking, and support for Elastic Fabric Adapter (EFA) with NVIDIA GPUDirect RDMA (remote direct memory access). P4d instances are deployed in hyperscale clusters called EC2 UltraClusters that provide supercomputer-class performance for everyday ML developers, researchers, and data scientists.
    Time sensitive large scale training | Amazon EC2 P3 instances | P3 instances deliver up to one petaflop of mixed-precision performance per instance with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput.
    Cost sensitive small scale training | Amazon EC2 G4 instances | G4 instances deliver up to 65 TFLOPs of FP16 performance and are a compelling solution for small-scale training jobs.

    Orchestration Services

    Challenge | AWS Solution | How
    Multi-node training | Elastic Fabric Adapter | EFA enables customers to run applications requiring high levels of inter-node communications at scale using a custom-built operating system (OS) bypass hardware interface.
    Highly scalable complex container orchestration | Amazon Elastic Container Service (ECS) | ECS is a fully managed container orchestration service.
    Highly scalable Kubernetes orchestration | Amazon Elastic Kubernetes Service (EKS) | You can use Kubeflow with EKS to model your machine learning workflows and efficiently run distributed training jobs.
    Large scale training | AWS Batch | Batch dynamically provisions the optimal quantity and type of compute resources based on the volume and specific resource requirements of the batch jobs submitted.
    Optimizing performance for large scale training | AWS ParallelCluster | AWS ParallelCluster automatically sets up the required compute resources and shared filesystems for large scale ML training projects.


    Challenge | AWS Solution | How
    Scalable storage | Amazon S3 | S3 can easily achieve thousands of transactions per second as the storage tier.
    Throughput and latency of storage access | Amazon FSx for Lustre | FSx for Lustre integrated with S3 delivers shared file storage with high throughput and consistent, low latencies.
    Batch processing on central locations | Amazon Elastic File System (EFS) | EFS provides easy access to large machine learning datasets or shared code, right from a notebook environment, without the need to provision storage or worry about managing the network file system.
    High I/O performance for temporary working storage | Amazon Elastic Block Store (EBS) | EBS enables single-digit millisecond latency for high performance storage needs.

    Fully Managed Services

    Challenge | AWS Solution | How
    Experiment management and tracking | Amazon SageMaker Experiments | Organize and evaluate thousands of training experiments in a scalable way, log experiment artifacts, and visualize models quickly.
    Debug models | Amazon SageMaker Debugger | Provides a visual interface to analyze debug data and watch visual indicators of potential anomalies in the training process.
    Model tuning | Amazon SageMaker Automatic Model Tuning | Automatically tune models by adjusting thousands of different combinations of algorithm parameters to arrive at the most accurate predictions the model is capable of producing.
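
To illustrate what searching over combinations of algorithm parameters means, here is a minimal random-search sketch in plain Python. It is not how Amazon SageMaker Automatic Model Tuning is implemented and it does not use the SageMaker API; the objective function, parameter names, and ranges are hypothetical stand-ins for a real training job.

```python
import random


def validation_score(learning_rate, num_layers):
    """Hypothetical stand-in for 'train a model and return validation accuracy'.

    The score peaks at learning_rate=0.1 and num_layers=4.
    """
    return 1.0 - (learning_rate - 0.1) ** 2 - 0.01 * (num_layers - 4) ** 2


def random_search(num_trials, seed=0):
    """Sample parameter combinations at random and keep the best-scoring one."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    best_params, best_score = None, float("-inf")
    for _ in range(num_trials):
        params = {
            "learning_rate": rng.uniform(0.001, 0.5),
            "num_layers": rng.randint(1, 8),
        }
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(num_trials=200)
print(best_params, round(best_score, 4))
```

Each trial here is a cheap function call. With a real model, each trial is a full training run, which is why a managed service that runs and tracks trials for you is attractive.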
  • Deploy
  • Once you have completed training and optimizing your model to your desired level of accuracy and precision, you put it into production to make predictions. Inference accounts for the vast majority of the cost of machine learning: according to customers, it can represent up to 90% of overall operational costs for running machine learning workloads.


    Challenge | AWS Solution | How
    High cost and low performance | Amazon EC2 Inf1 instances | Inf1 instances feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS.
    Inference for models using NVIDIA’s CUDA, cuDNN, or TensorRT libraries | Amazon EC2 G4 instances | G4 instances are equipped with NVIDIA T4 GPUs which deliver up to 40X better low-latency throughput than CPUs.
    Inference for models that take advantage of Intel AVX-512 Vector Neural Network Instructions (AVX512 VNNI) | Amazon EC2 C5 instances | C5 instances include Intel AVX-512 VNNI which helps speed up typical machine learning operations like convolution, and automatically improves inference performance over a wide range of deep learning workloads.
    Right-sizing inference acceleration for optimal price/performance | Amazon Elastic Inference | Elastic Inference allows you to attach low-cost GPU-powered acceleration to Amazon EC2 instances.
    Low latency inference, local data processing, or storage requirements | AWS Outposts | AWS Outposts is a fully managed service that extends AWS infrastructure, AWS services, APIs, and tools to virtually any datacenter, co-location space, or on-premises facility.

    Scaling Inference

    Challenge | AWS Solution | How
    Complex scaling of your infrastructure | AWS CloudFormation | CloudFormation allows you to use programming languages or a simple text file to model and provision, in an automated and secure manner, all the resources needed for your applications across all regions and accounts.
    Unpredictable scalability of your infrastructure | AWS Auto Scaling | AWS Auto Scaling monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
    Unpredictable use of EC2 instances | Amazon EC2 Fleet | With a single API call, you can provision capacity across EC2 instance types and across purchase models to achieve desired scale, performance, and cost.
    Ensuring model accuracy | Amazon SageMaker Model Monitor | Continuously monitor the quality of machine learning models in production and receive alerts when there are deviations in model quality without building additional tooling.
    Managing inference costs | Amazon SageMaker Multi-Model Endpoints | Deploy multiple models with a single click on a single endpoint and serve them using a single serving container to provide a scalable and cost-effective way to deploy large numbers of models.
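
As a rough illustration of modeling infrastructure in a simple text file, the sketch below builds a small CloudFormation template as a Python dictionary and serializes it to JSON. The template keys and the `AWS::EC2::Instance` resource type are standard CloudFormation; the logical names `InferenceInstance` and `ImageId`, the description, and the choice of `inf1.xlarge` are assumptions made for the example.

```python
import json

# A minimal CloudFormation template expressed as a Python dictionary.
# The logical IDs ("InferenceInstance", "ImageId") are hypothetical names.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Sketch: one EC2 Inf1 instance for model inference",
    "Parameters": {
        "ImageId": {
            "Type": "AWS::EC2::Image::Id",
            "Description": "AMI to launch, for example an AWS Deep Learning AMI",
        }
    },
    "Resources": {
        "InferenceInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": "inf1.xlarge",
                # Ref resolves to the AMI ID supplied when the stack is created.
                "ImageId": {"Ref": "ImageId"},
            },
        }
    },
}

template_body = json.dumps(template, indent=2)
print(template_body)
```

The resulting JSON text is what you would hand to CloudFormation to create a stack. Keeping it in version control makes the same environment reproducible across regions and accounts.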

"The P3 instances helped us reduce our time to train machine learning models from days to hours and we are looking forward to utilizing P4d instances, as the additional GPU memory and more efficient float formats will allow us to train more complex models at an even faster speed."


Intuit is all in on AWS and uses AWS to better serve its customers. Intuit uses Amazon SageMaker to train its machine-learning models quickly and at scale, cutting the time needed to deploy the models by 90 percent.


"With previous GPU clusters, it would take days to train complex AI models, such as Progressive GANs, for simulations and view the results. Using the new P4d instances reduced processing time from days to hours. We saw two- to three-times greater speed on training models."


Capital One turns data into insights through machine learning, allowing the company to innovate quickly on behalf of its customers. Capital One uses AWS services including Amazon S3 to power its machine-learning innovation.


Zillow runs its ML algorithms using Spark on Amazon EMR to quickly create scalable clusters and use distributed-processing capabilities to process large data sets in near real time, create features, and train and score millions of ML models.

By the Numbers


2.5x better deep learning performance for P4d compared to previous generation P3 instances, offering the highest performance in the cloud.

62 minutes is the record-setting time to train BERT with TensorFlow using 256 P3dn.24xlarge instances with 2,048 GPUs.

Low Cost

40% lower cost per inference for Inf1 instances compared to G4 instances, offering the lowest cost per inference in the cloud.

22 worldwide geographic regions with up to 69 Availability Zones available for many AWS machine learning infrastructure services.


  • High-Performance
  • Often, the development efficiency of data scientists and ML engineers is limited by how frequently they can train their deep learning models to incorporate new features, improve prediction accuracy, or adjust for data drift. AWS provides high performance compute, networking, and storage infrastructure, available broadly on a pay-as-you-go basis, enabling development teams to train their models on an as-needed basis and not let infrastructure hold back their innovation.

    Compute: Reduce Training Time to Minutes and Super Charge Your Inference

    AWS provides high performance GPU instances and the industry’s first instances featuring custom built silicon for ML inference.

    Amazon EC2 P4d instances are the highest performance instances in the cloud for machine learning training, delivering up to 60% lower cost to train and 2.5x better deep learning performance than previous generation P3 instances. P4d instances are also deployed in hyperscale clusters, called EC2 UltraClusters, that comprise more than 4,000 NVIDIA A100 GPUs, petabit-scale networking, and scalable, low-latency storage with FSx for Lustre. EC2 UltraClusters democratize access to supercomputing-class performance for everyday developers, researchers, and data scientists with an easy pay-as-you-go usage model, without any setup or maintenance costs.


    For deploying trained models in production, Amazon EC2 Inf1 instances deliver high performance and the lowest cost machine learning inference in the cloud. These instances feature AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. With 1 to 16 AWS Inferentia chips per instance, Inf1 instances can scale in performance up to 2,000 tera operations per second (TOPS).


    Networking: Scalable infrastructure for efficient distributed training or scale-out inference

    Training a large model takes time, and the larger and more complex the model is, the longer the training is going to take. AWS has several networking solutions to help customers scale their multi-node deployments to reduce training time. Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling efficiently. With EFA, machine learning training applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of GPUs. Coupled with up to 400 Gbps per-instance network bandwidth and NVIDIA GPUDirect RDMA (remote direct memory access) for low latency GPU to GPU communication between instances, you get the performance of expensive on-premises GPU clusters with the on-demand elasticity and flexibility of the AWS cloud.
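
In data-parallel training, each GPU computes gradients on its own slice of the batch, and a collective operation such as all-reduce combines them so that every worker applies the same update. The sketch below shows only what an averaging all-reduce computes, in plain Python. It is not the NCCL API and says nothing about how NCCL or EFA move the data; the worker count and gradient values are made up.

```python
def all_reduce_mean(worker_gradients):
    """Return, for every worker, the element-wise mean of all workers' gradients."""
    num_workers = len(worker_gradients)
    length = len(worker_gradients[0])
    if any(len(g) != length for g in worker_gradients):
        raise ValueError("all workers must hold gradients of the same length")
    mean = [
        sum(g[i] for g in worker_gradients) / num_workers
        for i in range(length)
    ]
    # After the collective, each worker holds its own copy of the same result.
    return [list(mean) for _ in range(num_workers)]


# Hypothetical gradients from four workers for a three-parameter model.
gradients = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [0.0, 4.0, 2.0],
    [4.0, 0.0, 2.0],
]
synced = all_reduce_mean(gradients)
print(synced[0])
```

Because this exchange happens on every training step, its cost grows with model size and worker count, which is why inter-instance bandwidth and latency matter for scaling.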


    Storage: Ideal options for creating data lakes or managing labeled data

    Organizations of all sizes, across all industries, are using data lakes to transform data from a cost that must be managed, to a business asset that can be used to derive valuable business insights or to provide enhanced customer experiences with the help of machine learning. Amazon Simple Storage Service (S3) is the largest and most performant object storage service for structured and unstructured data and the storage service of choice to build a data lake. With Amazon S3, you can cost-effectively build and scale a data lake of any size in a secure environment where data is protected by 99.999999999% (11 9s) of durability. For distributed training, if you need faster access to your labeled data, Amazon FSx for Lustre delivers performance that is optimized for sub-millisecond latencies and throughput that scales to hundreds of gigabytes per second. FSx for Lustre integrates with Amazon S3, making it easy to process data sets with the Lustre file system. When linked to an S3 bucket, an FSx for Lustre file system transparently presents S3 objects as files and allows you to write changed data back to S3.
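
To put the eleven nines in perspective, the short calculation below converts the durability figure into an expected object loss rate. It assumes the figure is read as an average annual per-object durability, and the data lake size of 10,000,000 objects is a hypothetical example. `Fraction` is used to avoid floating-point rounding at this many decimal places.

```python
from fractions import Fraction

# 99.999999999% durability corresponds to an expected annual loss rate
# of 1 in 10^11 per object (assuming the figure is an annual average).
annual_loss_rate = 1 - Fraction(99_999_999_999, 100_000_000_000)

stored_objects = 10_000_000  # hypothetical data lake size
expected_losses_per_year = stored_objects * annual_loss_rate
years_per_expected_loss = 1 / expected_losses_per_year

print(float(expected_losses_per_year))  # 0.0001
print(int(years_per_expected_loss))     # 10000
```

Under these assumptions, a data lake of ten million objects would expect to lose a single object about once every 10,000 years.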

  • Cost-Effective
  • Organizations are rapidly adopting deep learning to build never-before-seen applications. Coupled with a rapid increase in model complexity, the cost to build, train, and deploy machine learning applications quickly adds up. As companies move from exploring and experimenting with machine learning to deploying their applications at scale, AWS offers the ideal combination of performance and low-cost infrastructure services across the entire application development lifecycle.

    Lowest Cost in the industry for ML inference

    Machine learning inference can represent up to 90% of the overall operational costs for running machine learning applications in production. Amazon EC2 Inf1 instances deliver high performance and the lowest cost machine learning inference in the cloud. Inf1 instances are built from the ground up to support machine learning inference applications. They feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. Each AWS Inferentia chip supports up to 128 TOPS (trillions of operations per second) of performance at low power to enable high performance efficiency.
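
The per-chip and per-instance figures above combine by simple multiplication. The sketch below works through that arithmetic using the stated 128 TOPS per chip and the 16-chip maximum. The function name and the sample chip counts are illustrative choices, and the result is a theoretical peak, not a measured throughput.

```python
TOPS_PER_INFERENTIA_CHIP = 128  # peak per-chip figure stated above
MAX_CHIPS_PER_INSTANCE = 16     # largest Inf1 configuration stated above


def peak_tops(num_chips):
    """Theoretical peak TOPS for an instance with the given number of chips."""
    if not 1 <= num_chips <= MAX_CHIPS_PER_INSTANCE:
        raise ValueError("Inf1 instances have between 1 and 16 Inferentia chips")
    return num_chips * TOPS_PER_INFERENTIA_CHIP


for chips in (1, 4, 16):  # example chip counts
    print(chips, "chip(s):", peak_tops(chips), "TOPS")
```

At 16 chips the peak is 2,048 TOPS, which matches the roughly 2,000 TOPS quoted for the largest Inf1 instances.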


    For applications that need GPUs for running their models in production, Amazon EC2 G4 instances are the industry’s most cost-effective GPU instances. Featuring NVIDIA T4 GPUs, these instances are available in different sizes with access to one GPU or multiple GPUs with different amounts of vCPU and memory - giving you the flexibility to pick the right instance size for your applications.


    Not all machine learning models are the same, and different models benefit from different levels of hardware acceleration. Intel based Amazon EC2 C5 instances offer the lowest price per vCPU in the Amazon EC2 family and are ideal for running advanced compute-intensive workloads. These instances support Intel Deep Learning Boost and can offer an ideal balance of performance and cost for running ML models in production.


    Amazon Elastic Inference allows you to attach low-cost GPU-powered acceleration to Amazon EC2 instances, Amazon SageMaker instances, or Amazon ECS tasks to reduce the cost of running deep learning inference by up to 75%.


    Broad choice of GPU instances to optimize time and cost-to-train, available at scale

    Depending on the type of machine learning application, customers prefer to optimize their development cycles to either lower the time it takes to train their ML models or lower their total cost to train. In most cases, training costs include not only the cost to train, but also the opportunity cost of idle time that ML engineers and data scientists could have spent optimizing their model.
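
The trade-off described above can be made concrete with a small calculation. Every number below is hypothetical (instance prices, training durations, the fraction of time engineers spend waiting, and their hourly cost), and the two scenarios do not correspond to specific AWS instance types or published prices. The point is only to show how opportunity cost can dominate the instance bill.

```python
def total_cost_to_train(training_hours, instance_price_per_hour,
                        idle_fraction, engineer_cost_per_hour):
    """Instance cost plus the opportunity cost of engineers waiting on the job."""
    compute_cost = training_hours * instance_price_per_hour
    opportunity_cost = training_hours * idle_fraction * engineer_cost_per_hour
    return compute_cost + opportunity_cost


# Hypothetical scenario 1: a cheaper instance that trains slowly.
slow_cheap = total_cost_to_train(
    training_hours=40, instance_price_per_hour=4.0,
    idle_fraction=0.5, engineer_cost_per_hour=100.0,
)

# Hypothetical scenario 2: a pricier instance that trains quickly.
fast_pricey = total_cost_to_train(
    training_hours=5, instance_price_per_hour=30.0,
    idle_fraction=0.5, engineer_cost_per_hour=100.0,
)

print(slow_cheap, fast_pricey)  # 2160.0 400.0
```

In this made-up case the instance bills are nearly identical (160 versus 150), but the slower job costs far more once waiting time is counted. With a lower idle fraction or a less time-sensitive team, the cheaper instance can come out ahead.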

    Amazon EC2 G4 instances deliver the industry’s most cost-effective GPU platform. These instances are optimal for training less complex models and are ideal for businesses or institutions that are less sensitive to time-to-train. G4 instances provide access to up to eight NVIDIA T4 GPUs, each delivering up to 65 TFLOPs of FP16 performance.


    Amazon EC2 P4 instances offer best-in-class single-instance and distributed training performance, allowing engineering teams to significantly cut down their model iteration times, accelerate time to market, and optimize their overall engineering expenses. These instances provide up to 60% lower cost to train compared with previous generation P3 instances and can be deployed via all EC2 pricing options, with up to a 90% discount using Spot. As the performance of GPUs and hardware ML accelerators improves at least 2X every 18 months, using AWS infrastructure on a pay-as-you-go model gives you the ability to take advantage of the best price performance without locking up valuable CapEx for on-premises clusters that have a limited shelf life.


    Amazon EC2 P3 and P3dn instances deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications. These instances deliver up to one petaflop of mixed-precision performance per instance to significantly accelerate machine learning and high performance computing applications. P3 and P3dn instances are available in 4 sizes providing up to 8 GPUs and 96 vCPUs and are available globally across 18 AWS regions.

  • Highly Flexible
  • Support for all major machine learning frameworks

    Frameworks like TensorFlow and PyTorch abstract away many of the implementation details of building ML models, allowing developers to focus on the overall logic and dataflow of their model. Over 70% of companies that are building machine learning applications have stated that their teams use a mix of different ML frameworks. AWS ML infrastructure supports all of the popular deep learning frameworks, allowing your teams to pick the right framework to match their preference and development efficiency.


    Optimizations that plug under the frameworks

    At AWS, we have a strong focus on enabling customers not only to run their ML workloads on AWS, but also to pick the ML framework or infrastructure services that work best for them. Software optimizations for effectively training and deploying models on AWS infrastructure services are integrated with the most popular ML frameworks (TensorFlow, PyTorch, and MXNet), allowing customers to continue using whichever framework they prefer. Operating at the framework level gives customers the freedom to always choose the best solution for their needs, without being tied to a specific framework, hardware architecture, or cloud provider.

    AWS Neuron is a software development kit (SDK) for AWS Inferentia chips and enables developers to run high-performance and low latency inference using AWS Inferentia-based Amazon EC2 Inf1 instances. AWS Neuron is natively integrated with popular frameworks including TensorFlow, PyTorch, and MXNet. Customers can bring their pre-trained models and make only a few lines of code changes from within the framework to accelerate their inference with EC2 Inf1 instances, without writing AWS Inferentia chip specific custom code.


    To support efficient multi-node/distributed training, AWS has integrated Elastic Fabric Adapter (EFA) with NVIDIA Collective Communications Library (NCCL) - a library for communicating between multiple GPUs within a single node or across multiple nodes. Similar to AWS Neuron, customers can continue to use their ML framework of choice to build their models, and leverage under-the-hood optimization for AWS infrastructure.


Pricing Options

Machine learning training and inference workloads can exhibit characteristics that are steady state (such as hourly batch tagging of photos for a large population), spiky (such as kicking off new training jobs or search recommendations during promotional periods), or both. AWS has pricing options and solutions to help you optimize your infrastructure performance and costs.




A - use Spot instances for flexible, fault tolerant workloads such as ML training jobs that are not time-sensitive

B - use On-Demand instances for new or stateful spiky workloads such as short-term ML training jobs

C - use Savings Plans for known/steady state workloads such as stable inference workloads

Use Case | AWS Solution | How
Short-term training jobs | On-Demand Pricing | With On-Demand instances, you pay for compute capacity by the hour or the second depending on which instances you run.
Training jobs that have flexible start-stop times | Spot pricing | Amazon EC2 Spot instances allow you to request spare Amazon EC2 computing capacity for up to 90% off the On-Demand price.
Steady machine learning workloads over different instance types over a long period of time | Savings Plans | Savings Plans offer significant savings over On-Demand prices, in exchange for a commitment to use a specific amount of compute power for a one or three year period.
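
The guidance above can be read as a small decision rule. The sketch below encodes it in plain Python under stated assumptions: the function names, the two boolean inputs, and the order in which the rules are checked are illustrative choices, not an AWS API or an official recommendation engine. The 90% figure is the maximum Spot discount quoted above, and the On-Demand price used in the example is hypothetical.

```python
def recommend_pricing(steady_state, fault_tolerant):
    """Map workload traits to a pricing option, following options A to C above."""
    if steady_state:
        return "Savings Plans"  # C: known, steady state workloads
    if fault_tolerant:
        return "Spot"           # A: flexible, fault tolerant workloads
    return "On-Demand"          # B: new or stateful spiky workloads


def lowest_spot_price(on_demand_price, max_discount_percent=90):
    """Best-case hourly Spot price given the maximum discount off On-Demand."""
    return on_demand_price * (100 - max_discount_percent) / 100


print(recommend_pricing(steady_state=False, fault_tolerant=True))   # Spot
print(recommend_pricing(steady_state=False, fault_tolerant=False))  # On-Demand
print(recommend_pricing(steady_state=True, fault_tolerant=False))   # Savings Plans
print(lowest_spot_price(10.0))  # 1.0, for a hypothetical 10.00 per hour On-Demand price
```

Actual Spot discounts vary with available capacity, so the best-case price is a lower bound and not a quote.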

Additional Resources