AWS Machine Learning Infrastructure
More machine learning happens on AWS than anywhere else
More customers, across a diverse set of industries, choose AWS compared to any other cloud to build, train, and deploy their machine learning (ML) applications. AWS delivers the broadest choice of powerful compute, high speed networking, and scalable high performance storage options for any ML project or application.
Every ML project is different, and with AWS, you can customize your infrastructure to fit your performance and budget requirements. From using the ML framework that works best for your team, to selecting the right hardware platform to host your ML models, AWS offers a broad choice of services to meet your needs.
Businesses have found new ways to leverage ML for recommendation engines, object detection, voice assistants, fraud detection, and more. Although the use of ML is gaining traction, training and deploying ML models is expensive, model development time is long, and procuring the right amount of infrastructure to meet changing business conditions can be challenging. AWS ML infrastructure services remove the barriers to adoption of ML by being high performing, cost-effective, and highly flexible.
Choose from a broad set of machine learning services
The below graphic illustrates the depth and breadth of services that AWS offers. Workflow services, shown in the top layer, make it easier for you to manage and scale your underlying ML infrastructure. The next layer highlights that AWS ML infrastructure supports all of the leading ML frameworks. The bottom layer shows examples of compute, networking, and storage services that constitute the foundational blocks of ML infrastructure.
Machine learning infrastructure services
Traditional ML development is a complex, expensive, and iterative process. First, you need to prepare example data to train a model. Then, you need to select the algorithm or framework you will use to build the model. Next, you train the model to make predictions and tune it so that it delivers the best possible predictions. Finally, you integrate the model with your application and deploy the application on infrastructure that will scale.
Data scientists often spend a lot of time exploring and preprocessing, or "wrangling," example data before using it for model training. To preprocess data, you typically fetch the data into a repository, clean the data by filtering and modifying your data so that it is easier to explore, prepare or transform the data into meaningful datasets by filtering out the parts you don't want or need, and label the data.
|Challenge||AWS Solution||How|
|Manual data labeling||Amazon Mechanical Turk||Provides an on-demand, scalable, human workforce to complete tasks.|
|Manual data labeling||Amazon SageMaker Ground Truth||Automates labeling by training Ground Truth from data labeled by humans so that the service learns to label data independently.|
|Manage and scale data processing||Amazon SageMaker Processing||Extends a fully managed experience to data processing workloads. Connect to existing storage or file system data sources, spin up the resources required to run your job, save the output to persistent storage, and examine the logs and metrics.|
|Management of large amounts of data needed to train models||Amazon EMR||Processes vast amounts of data quickly and cost-effectively at scale.|
|Shared file storage of large amounts of data needed to train models||Amazon S3||Offers global availability of long-term durable storage of data in a readily accessible get/put access format.|
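The get/put access pattern mentioned for Amazon S3 can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the helper names `save_dataset` and `load_dataset`, the bucket and key names, and the in-memory `FakeS3Client` are all hypothetical. The helpers assume a client object that exposes `put_object` and `get_object` with the same keyword arguments as the boto3 S3 client; in real use you would pass `boto3.client("s3")` instead of the stand-in.

```python
import io


def save_dataset(s3_client, bucket: str, key: str, data: bytes) -> None:
    """Store a blob of training data under bucket/key (a "put")."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=data)


def load_dataset(s3_client, bucket: str, key: str) -> bytes:
    """Fetch the blob back (a "get"); the response body is a readable stream."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


class FakeS3Client:
    """Hypothetical in-memory stand-in so the sketch runs without AWS access."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = bytes(Body)
        return {}

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


# Assumption: swap FakeS3Client() for boto3.client("s3") when running against AWS.
client = FakeS3Client()
save_dataset(client, "example-training-data", "labels/train.csv", b"id,label\n1,cat\n")
round_trip = load_dataset(client, "example-training-data", "labels/train.csv")
print(round_trip.decode())
```

Passing the client in as an argument keeps the helpers testable without credentials or network access.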
Once you have training data available, you need to choose a machine learning algorithm with a learning style that meets your needs. These algorithms can broadly be classified as supervised learning, unsupervised learning, or reinforcement learning. To help you develop your model, machine learning frameworks such as TensorFlow, PyTorch, and MXNet provide libraries and tools that make development easier.
|Challenge||AWS Solution||How|
|Accessing Jupyter Notebooks||Hosted Jupyter Notebooks||Hosted Jupyter Notebooks running on an EC2 instance of your choice.|
|Sharing and collaborating in Jupyter Notebooks||Amazon SageMaker Notebooks||Fully managed Jupyter notebooks that you can start working with in seconds and share with a single click. Code dependencies are automatically captured, so you can easily collaborate with others. Peers get the exact same notebook, saved in the same place.|
|Algorithm creation||Amazon SageMaker Pre-Built Algorithms||High-performance, scalable machine learning algorithms, optimized for speed and accuracy, that can perform training on petabyte-scale data sets.|
|Deep learning framework optimization||Amazon SageMaker||The leading frameworks are automatically configured and optimized for high performance. You don’t need to manually set up frameworks and can use them within the built-in containers.|
|Getting started using multiple ML frameworks||AWS Deep Learning AMIs||Enables users to quickly launch Amazon EC2 instances pre-installed with popular deep learning frameworks and interfaces such as TensorFlow, PyTorch, and Apache MXNet.|
|Getting started with containers using multiple ML frameworks||AWS Deep Learning Containers||Docker images pre-installed with deep learning frameworks to make it easy to deploy custom machine learning environments quickly.|
After building out your model, you need compute, networking, and storage resources for training your model. Faster model training can enable data scientists and machine learning engineers to iterate faster, train more models, and increase accuracy. After you've trained your model, you evaluate it to determine whether the accuracy of the inferences is acceptable.
|Challenge||AWS Solution||How|
|Time- and cost-sensitive, large-scale training||EC2 Trn1 instances powered by AWS Trainium||Amazon EC2 Trn1 instances, powered by AWS Trainium chips, are purpose built for high-performance deep learning and deliver the best price performance for training deep learning models in the cloud.|
|Cost-sensitive training||EC2 DL1 instances powered by Habana Gaudi||Amazon EC2 DL1 instances, powered by Gaudi accelerators from Habana Labs, an Intel company, are designed for training deep learning models. They use up to 8 Gaudi accelerators and deliver up to 40% better price performance than current GPU-based EC2 instances for training deep learning models.|
|Time-sensitive, large-scale training||Amazon EC2 P4 instances||P4d instances deliver the highest performance machine learning training in the cloud with 8 NVIDIA A100 Tensor Core GPUs, 400 Gbps instance networking, and support for Elastic Fabric Adapter (EFA) with NVIDIA GPUDirect RDMA (remote direct memory access). P4d instances are deployed in hyperscale clusters called EC2 UltraClusters that provide supercomputer-class performance for everyday ML developers, researchers, and data scientists.|
|Time-sensitive, large-scale training||Amazon EC2 P3 instances||P3 instances deliver up to one petaflop of mixed-precision performance per instance with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput.|
|Cost-sensitive, small-scale training||Amazon EC2 G5 instances||G5 instances deliver up to 3.3x higher performance for machine learning training compared to G4dn instances.|
|Cost-sensitive, small-scale training||Amazon EC2 G4 instances||G4 instances deliver up to 65 TFLOPs of FP16 performance and are a compelling solution for small-scale training jobs.|
|Challenge||AWS Solution||How|
|Multi-node training||Elastic Fabric Adapter||EFA enables customers to run applications requiring high levels of inter-node communications at scale using a custom-built operating system (OS) bypass hardware interface.|
|Highly scalable complex container orchestration||Amazon Elastic Container Service (ECS)||ECS is a fully managed container orchestration service.|
|Highly scalable Kubernetes orchestration||Amazon Elastic Kubernetes Service (EKS)||You can use Kubeflow with EKS to model your machine learning workflows and efficiently run distributed training jobs.|
|Large scale training||AWS Batch||Batch dynamically provisions the optimal quantity and type of compute resources based on the volume and specific resource requirements of the batch jobs submitted.|
|Optimizing performance for large scale training||AWS ParallelCluster||AWS ParallelCluster automatically sets up the required compute resources and shared filesystems for large scale ML training projects.|
|Challenge||AWS Solution||How|
|Scalable storage||Amazon S3||S3 can easily achieve thousands of transactions per second as the storage tier.|
|Throughput and latency of storage access||Amazon FSx for Lustre||FSx for Lustre integrated with S3 delivers shared file storage with high throughput and consistent, low latencies.|
|Batch processing on central locations||Amazon Elastic File System (EFS)||EFS provides easy access to large machine learning datasets or shared code, right from a notebook environment, without the need to provision storage or worry about managing the network file system.|
|High I/O performance for temporary working storage||Amazon Elastic Block Store (EBS)||EBS enables single-digit-millisecond latency for high performance storage needs.|
Fully Managed Services
|Challenge||AWS Solution||How|
|Experiment management and tracking||Amazon SageMaker Experiments||Evaluate and organize training experiments in an easy and scalable manner, organize thousands of training experiments, log experiment artifacts, and visualize models quickly.|
|Debug models||Amazon SageMaker Debugger||A visual interface to analyze the debug data and watch visual indicators about potential anomalies in the training process.|
|Model Tuning||Amazon SageMaker Automatic Model Tuning||Automatically tune models by adjusting thousands of different combinations of algorithm parameters to arrive at the most accurate predictions the model is capable of producing.|
Once you have completed training and optimizing your model to your desired level of accuracy and precision, you put it into production to make predictions. Inference is what actually accounts for the vast majority of machine learning’s cost. According to customers, machine learning inference can represent up to 90% of overall operational costs for running machine learning workloads.
|Challenge||AWS Solution||How|
|High cost and low performance||Amazon EC2 Inf1 instances||Inf1 instances feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS.|
|Inference for models using NVIDIA CUDA, cuDNN, or TensorRT libraries||Amazon EC2 G5 instances||G5 instances feature up to 8 NVIDIA A10G Tensor Core GPUs and deliver up to 3x higher performance for machine learning inference compared to G4dn instances.|
|Inference for models using NVIDIA CUDA, cuDNN, or TensorRT libraries||Amazon EC2 G4 instances||G4 instances are equipped with NVIDIA T4 GPUs which deliver up to 40X better low-latency throughput than CPUs.|
|Inference for models that take advantage of Intel AVX-512 Vector Neural Network Instructions (AVX512 VNNI)||Amazon EC2 C5 instances||C5 instances include Intel AVX-512 VNNI which helps speed up typical machine learning operations like convolution, and automatically improves inference performance over a wide range of deep learning workloads.|
|Right-sizing inference acceleration for optimal price/performance||Amazon Elastic Inference||Elastic Inference allows you to attach low-cost GPU-powered acceleration to Amazon EC2 instances.|
|Low latency inference, local data processing, or storage requirements||AWS Outposts||AWS Outposts is a fully managed service that extends AWS infrastructure, AWS services, APIs, and tools to virtually any datacenter, co-location space, or on-premises facility.|
|Challenge||AWS Solution||How|
|Complex scaling of your infrastructure||AWS CloudFormation||CloudFormation allows you to use programming languages or a simple text file to model and provision, in an automated and secure manner, all the resources needed for your applications across all regions and accounts.|
|Unpredictable scalability of your infrastructure||AWS Auto Scaling||AWS Auto Scaling monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.|
|Unpredictable use of EC2 instances||Amazon EC2 Fleet||With a single API call, you can provision capacity across EC2 instance types and across purchase models to achieve desired scale, performance, and cost.|
|Ensuring model accuracy||Amazon SageMaker Model Monitor||Continuously monitor the quality of machine learning models in production and receive alerts when there are deviations in model quality without building additional tooling.|
|Managing inference costs||Amazon SageMaker Multi-Model Endpoints||Deploy multiple models with a single click on a single endpoint and serve them using a single serving container to provide a scalable and cost effective way to deploy large numbers of models.|
"The P3 instances helped us reduce our time to train machine learning models from days to hours and we are looking forward to utilizing P4d instances, as the additional GPU memory and more efficient float formats will allow us to train more complex models at an even faster speed."
Intuit is all in on AWS and uses AWS to better serve its customers. Intuit uses Amazon SageMaker to train its machine-learning models quickly and at scale, cutting the time needed to deploy the models by 90 percent.
"With previous GPU clusters, it would take days to train complex AI models, such as Progressive GANs, for simulations and view the results. Using the new P4d instances reduced processing time from days to hours. We saw two- to three-times greater speed on training models."
Capital One turns data into insights through machine learning, allowing the company to innovate quickly on behalf of its customers. Capital One uses AWS services including Amazon S3 to power its machine-learning innovation.
Zillow runs its ML algorithms using Spark on Amazon EMR to quickly create scalable clusters and use distributed-processing capabilities to process large data sets in near real time, create features, and train and score millions of ML models.
By the Numbers
deep learning performance for P4d compared to previous generation P3 instances, offering the highest performance in the cloud.
is the record setting time to train BERT with TensorFlow using 256 P3dn.24xlarge instances with 2,048 GPUs.
cost per inference for Inf1 instances compared to G4 instances, offering the lowest cost per inference in the cloud.
geographic regions with up to 69 Availability Zones available for many AWS machine learning infrastructure services.
Often, the development efficiency of data scientists and ML engineers is limited by how frequently they can train their deep learning models to incorporate new features, improve prediction accuracy, or adjust for data drift. AWS provides high-performance compute, networking, and storage infrastructure, available broadly on a pay-as-you-go basis, so that development teams can train their models as needed and infrastructure does not hold back their innovation.
Compute: reduce training time to minutes and super charge your inference
AWS provides the industry’s first instances purpose built for ML training and inference.
Amazon EC2 Trn1 instances, powered by AWS Trainium chips, are purpose built for high-performance, cost-effective deep learning training. These instances deliver industry leading performance while offering up to 50% cost-to-train savings over comparable GPU-based instances. Trn1 instances are powered by up to 16 AWS Trainium chips. Each chip includes two second-generation NeuronCore accelerators that are purpose built for deep learning algorithms. Trn1 instances are the first EC2 instances with up to 800 Gbps of Elastic Fabric Adapter (EFA) network bandwidth. They are deployed in EC2 UltraClusters that enable scaling up to 30,000 Trainium accelerators, which are interconnected with a nonblocking petabit-scale network to provide up to 6.3 exaflops of compute.
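As a rough sanity check on the UltraCluster figures above, the arithmetic below derives the per-accelerator compute they imply. It is only a back-of-the-envelope sketch: it assumes the quoted "up to" peak is spread evenly across all accelerators, and the text does not state which numeric precision the exaflops figure refers to.

```python
EXA = 10**18
TERA = 10**12

cluster_flops = 6.3 * EXA   # "up to 6.3 exaflops" from the text
accelerators = 30_000       # "up to 30,000 Trainium accelerators" from the text

# Assumption: the peak figure divides evenly across accelerators.
per_accelerator_tflops = cluster_flops / accelerators / TERA
print(f"Implied peak per accelerator: {per_accelerator_tflops:.0f} TFLOPS")
```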
For deploying trained models in production, Amazon EC2 Inf1 instances deliver high performance and the lowest-cost deep learning inference in the cloud. These instances feature AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. With 1 to 16 AWS Inferentia chips per instance, Inf1 instances can scale in performance to up to 2000 tera operations per second (TOPS).
Networking: Scalable infrastructure for efficient distributed training or scale-out inference
Training a large model takes time, and the larger and more complex the model is, the longer the training is going to take. AWS has several networking solutions to help customers scale their multi-node deployments to reduce training time. Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling efficiently. With EFA, machine learning training applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of GPUs. Coupled with up to 400 Gbps per-instance network bandwidth and NVIDIA GPUDirect RDMA (remote direct memory access) for low latency GPU to GPU communication between instances, you get the performance of expensive on-premises GPU clusters with the on-demand elasticity and flexibility of the AWS cloud.
Storage: Ideal options for creating data-lakes or managing labeled data
Organizations of all sizes, across all industries, are using data lakes to transform data from a cost that must be managed, to a business asset that can be used to derive valuable business insights or to provide enhanced customer experiences with the help of machine learning. Amazon Simple Storage Service (S3) is the largest and most performant object storage service for structured and unstructured data and the storage service of choice to build a data lake. With Amazon S3, you can cost-effectively build and scale a data lake of any size in a secure environment where data is protected by 99.999999999% (11 9s) of durability. For distributed training, if you need faster access to your labeled data, Amazon FSx for Lustre delivers performance that is optimized for sub-millisecond latencies and throughput that scales to hundreds of gigabytes per second. FSx for Lustre integrates with Amazon S3, making it easy to process data sets with the Lustre file system. When linked to an S3 bucket, an FSx for Lustre file system transparently presents S3 objects as files and allows you to write changed data back to S3.
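To make the 11 nines of durability concrete, the sketch below converts it into an expected loss rate. It assumes the figure is an annual, per-object probability of retaining the object, and the data lake size of 10 million objects is a hypothetical value chosen for illustration.

```python
from fractions import Fraction

durability = Fraction("0.99999999999")    # 11 nines, from the text
annual_loss_probability = 1 - durability  # assumption: per object, per year

stored_objects = 10_000_000               # hypothetical data lake size
expected_losses_per_year = stored_objects * annual_loss_probability
years_per_expected_loss = 1 / expected_losses_per_year

print(f"Expected objects lost per year: {float(expected_losses_per_year):.4f}")
print(f"Roughly one object lost every {int(years_per_expected_loss):,} years")
```

`Fraction` is used instead of floating point because subtracting a value this close to 1 would otherwise introduce rounding error.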
Organizations are rapidly adopting deep learning to build never-before-seen applications. Coupled with a rapid increase in model complexity, the cost to build, train, and deploy machine learning applications quickly adds up. As companies move from exploring and experimenting with machine learning to deploying their applications at scale, AWS offers the ideal combination of performance and low-cost infrastructure services across the entire application development lifecycle.
Lowest Cost in the industry for ML inference
Machine learning inference can represent up to 90% of the overall operational costs for running machine learning applications in production. Amazon EC2 Inf1 instances deliver high performance and the lowest cost machine learning inference in the cloud. Inf1 instances are built from the ground up to support machine learning inference applications. They feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. Each AWS Inferentia chip supports up to 128 TOPS (trillions of operations per second) of performance at low power to enable high performance efficiency.
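The per-chip and per-instance figures for Inf1 can be related with simple arithmetic. The sketch below multiplies the quoted 128 TOPS per chip by a chip count; the function name `peak_tops` is hypothetical, and the result is a theoretical peak that assumes performance scales linearly with the number of chips.

```python
TOPS_PER_CHIP = 128      # "up to 128 TOPS" per AWS Inferentia chip, from the text
MAX_CHIPS_PER_INSTANCE = 16  # "up to 16 AWS Inferentia chips", from the text


def peak_tops(chips: int) -> int:
    """Theoretical peak TOPS for an instance with the given chip count."""
    if not 1 <= chips <= MAX_CHIPS_PER_INSTANCE:
        raise ValueError("chip count must be between 1 and 16")
    return chips * TOPS_PER_CHIP


# A few chip counts within the supported range, chosen for illustration.
for chips in (1, 4, 16):
    print(f"{chips:>2} chip(s): up to {peak_tops(chips)} TOPS")
```

With 16 chips this gives 2048 TOPS, which is consistent with the "up to 2000 TOPS" quoted for the largest Inf1 instances.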
For applications that need GPUs for running their models in production, Amazon EC2 G4 instances are the industry’s most cost-effective GPU instances. Featuring NVIDIA T4 GPUs, these instances are available in different sizes with access to one GPU or multiple GPUs with different amounts of vCPU and memory - giving you the flexibility to pick the right instance size for your applications.
Not all machine learning models are the same, and different models benefit from different levels of hardware acceleration. Intel based Amazon EC2 C5 instances offer the lowest price per vCPU in the Amazon EC2 family and are ideal for running advanced compute-intensive workloads. These instances support Intel Deep Learning Boost and can offer an ideal balance of performance and cost for running ML models in production.
Amazon Elastic Inference allows you to attach low-cost GPU-powered acceleration to Amazon EC2 instances, Amazon SageMaker instances, or Amazon ECS tasks to reduce the cost of running deep learning inference by up to 75%.
Broad choice of GPU instances to optimize time and cost-to-train, available at scale
Depending on the type of machine learning application, customers prefer to optimize their development cycles to either lower the time it takes to train their ML models or lower their total cost to train. In most cases, training costs include not only the cost to train, but also the opportunity cost of idle time that ML engineers and data scientists could have spent optimizing their model.
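The trade-off between compute cost and the opportunity cost of idle time can be sketched with a simple model. Everything numeric here is hypothetical (hourly rates, training durations, and the fraction of training time an engineer spends waiting), and the names `TrainingOption` and `total_cost` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class TrainingOption:
    name: str
    hourly_instance_cost: float  # hypothetical USD per hour
    training_hours: float        # hypothetical wall-clock time to train


def total_cost(option: TrainingOption, idle_fraction: float,
               engineer_hourly_cost: float) -> float:
    """Compute cost plus the cost of engineer time spent waiting on training."""
    compute = option.hourly_instance_cost * option.training_hours
    opportunity = option.training_hours * idle_fraction * engineer_hourly_cost
    return compute + opportunity


slow = TrainingOption("smaller instance", hourly_instance_cost=4.0, training_hours=40.0)
fast = TrainingOption("larger instance", hourly_instance_cost=50.0, training_hours=4.0)

for option in (slow, fast):
    cost = total_cost(option, idle_fraction=0.25, engineer_hourly_cost=100.0)
    print(f"{option.name}: total cost {cost:.2f}")
```

In this made-up scenario the smaller instance has the lower compute bill, but once waiting time is priced in, the faster option comes out cheaper overall. With different assumptions the ranking can flip, which is why the choice depends on the application.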
Amazon EC2 G4 instances deliver the industry’s most cost-effective GPU platform. These instances are optimal for training less complex models and are ideal for businesses or institutions that are less sensitive to time-to-train. G4 instances provide access to up to eight NVIDIA T4 GPUs, each delivering up to 65 TFLOPs of FP16 performance.
Amazon EC2 P4 instances offer best-in-class single-instance and distributed training performance, allowing engineering teams to significantly cut down their model iteration times, accelerate time to market, and optimize their overall engineering expenses. These instances provide up to 60% lower cost over previous generation P3 instances and can be deployed via all EC2 pricing options with up to a 90% discount using Spot. As performance of GPUs and hardware ML accelerators improves at least 2X every 18 months, using AWS infrastructure on a pay-as-you-go model gives you the ability to take advantage of the best price performance without locking up valuable CapEx for on-prem clusters that have a limited shelf life.
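The "2X every 18 months" claim can be turned into a quick estimate of how far fixed hardware falls behind over time. The sketch below treats the stated rate as exact exponential growth, which is an assumption; the text gives it as a lower bound ("at least"), and the function name is hypothetical.

```python
def relative_performance(months: float, doubling_period_months: float = 18.0) -> float:
    """Performance of current hardware relative to hardware bought `months` ago."""
    return 2.0 ** (months / doubling_period_months)


# Ages in months for a hypothetical on-premises cluster.
for age in (0, 18, 36, 54):
    factor = relative_performance(age)
    print(f"after {age:>2} months, new hardware is about {factor:.0f}x faster")
```

Under this assumption, a cluster purchased three years ago would be roughly four times slower than current hardware, which is the "limited shelf life" the paragraph refers to.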
Amazon EC2 P3 and P3dn instances deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications. These instances deliver up to one petaflop of mixed-precision performance per instance to significantly accelerate machine learning and high performance computing applications. P3 and P3dn instances are available in 4 sizes providing up to 8 GPUs and 96 vCPUs and are available globally across 18 AWS regions.
Support for all major machine learning frameworks
Frameworks like TensorFlow and PyTorch abstract away much of the minutiae of implementing ML models, allowing developers to focus on the overall logic and dataflow of their model. Over 70% of companies that are building machine learning applications have stated that their teams use a mix of different ML frameworks. AWS ML infrastructure supports all of the popular deep learning frameworks, allowing your teams to pick the right framework to match their preference and development efficiency.
Optimizations that plug under the frameworks
At AWS, we focus on enabling customers not only to run their ML workloads on AWS, but also to pick the ML framework or infrastructure services that work best for them. Software optimizations for effectively training and deploying models on AWS infrastructure services are integrated with the most popular ML frameworks (TensorFlow, PyTorch, and MXNet), allowing customers to continue using whichever framework they prefer without being constrained to a specific framework or hardware architecture. Operating at the framework level gives customers the freedom to always choose the best solution for their needs, and not be tied to a specific hardware architecture or cloud provider.
AWS Neuron is the SDK for AWS Inferentia and AWS Trainium chips. By using AWS Neuron, you can run high-performance and cost-effective ML training by using AWS Trainium-based Amazon EC2 Trn1 instances. You can also run high-performance and low-latency inference by using AWS Inferentia-based Amazon EC2 Inf1 instances. AWS Neuron is natively integrated with popular frameworks such as TensorFlow, PyTorch, and MXNet. To accelerate your training with EC2 Trn1 instances and inference with EC2 Inf1 instances, you can use your pretrained models and change only a few lines of code from within the framework.
To support efficient multi-node/distributed training, AWS has integrated Elastic Fabric Adapter (EFA) with NVIDIA Collective Communications Library (NCCL) - a library for communicating between multiple GPUs within a single node or across multiple nodes. Similar to AWS Neuron, customers can continue to use their ML framework of choice to build their models, and leverage under-the-hood optimization for AWS infrastructure.
Machine learning training and inference workloads can exhibit characteristics that are steady state (such as hourly batch tagging of photos for a large population), spiky (such as kicking off new training jobs or search recommendations during promotional periods), or both. AWS has pricing options and solutions to help you optimize your infrastructure performance and costs.
A - use Spot instances for flexible, fault tolerant workloads such as ML training jobs that are not time-sensitive
B - use On-Demand instances for new or stateful spiky workloads such as short-term ML training jobs
C - use Savings Plans for known/steady state workloads such as stable inference workloads
|Use Case||AWS Solution||How|
|Short-term training jobs||On-Demand Pricing||With On-Demand instances, you pay for compute capacity by the hour or the second depending on which instances you run.|
|Training jobs that have flexible start-stop times||Spot pricing||Amazon EC2 Spot instances allow you to request spare Amazon EC2 computing capacity for up to 90% off the On-Demand price.|
|Steady machine learning workloads over different instance types over a long period of time||Savings Plans||Savings Plans offer significant savings over On-Demand prices, in exchange for a commitment to use a specific amount of compute power for a one- or three-year period.|
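The pricing options in the table can be compared with a small cost model. This is a sketch under stated assumptions: the On-Demand rate and job length are hypothetical, the Spot figure uses the best-case "up to 90% off" from the table, and the Savings Plans discount is a made-up placeholder because the text only says "significant savings".

```python
def job_cost(hours: float, on_demand_rate: float, discount: float = 0.0) -> float:
    """Cost of running for `hours` at a rate discounted from On-Demand."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return hours * on_demand_rate * (1.0 - discount)


ON_DEMAND_RATE = 10.0  # hypothetical USD per instance-hour
HOURS = 100.0          # hypothetical training job length

on_demand = job_cost(HOURS, ON_DEMAND_RATE)
spot_best_case = job_cost(HOURS, ON_DEMAND_RATE, discount=0.90)  # "up to 90% off"
savings_plan = job_cost(HOURS, ON_DEMAND_RATE, discount=0.30)    # placeholder discount

print(f"On-Demand:        {on_demand:8.2f}")
print(f"Spot (best case): {spot_best_case:8.2f}")
print(f"Savings Plan:     {savings_plan:8.2f}")
```

The model ignores Spot interruptions, which is why Spot is suggested above for flexible, fault-tolerant jobs, and it ignores the one- or three-year commitment that a Savings Plan requires.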
Blog: Amazon Web Services achieves fastest training times for BERT and Mask R-CNN
Aditya Bindal, Kevin Haas, and Indu Thangakrishnan
Blog: Building an interactive and scalable ML research environment using AWS ParallelCluster