
2022

Amazon Search Empowers Teams Across Amazon to Bring Deep Learning to Their ML Applications Using AWS

Search M5, a team within Amazon Search, used various AWS solutions to run deep learning experiments for models with tens of billions of parameters.

Trains thousands of models with 200 million+ parameters each

Runs deep learning experiments across teams

Scales to thousands of training jobs per month

Less than 3 days to expand to new regions

Achieved high-throughput streaming solutions

Overview

The Amazon Search team saw the opportunity to create deep learning technology that would empower teams across Amazon to glean intelligence from their data. Search M5, a team within Amazon Search that builds large models to support machine learning (ML) applications at Amazon, used Amazon Web Services (AWS) to run deep learning experiments for models with tens of billions of parameters. Search M5 used various AWS services to build, train, and deploy large ML models with multiple modalities at scale. Search M5 now consolidates data, making it simpler to create large models that teams across Amazon can use to bring the power of deep learning to their ML applications.


Opportunity | Sharing the Power of Deep Learning Across Amazon

As a multinational technology company employing over 1.6 million people, Amazon consists of many teams with distinct focus areas and priorities. Amazon Search works on products and services to enhance the end-user experience on Amazon.com. “There are benefits to be gained by connecting the dots across Amazon Search and building synergies across product areas,” says Belinda Zeng, head of applied science and engineering at Amazon Search. “By creating pretrained models to help interpret information from the rich datasets, we are able to enrich Amazon’s search functionalities using deep learning.”

Search M5 owns the discovery learning strategy for Amazon and builds large-scale models across modalities: multilingual, multi-entity, and multitask. Much of this work is experimental in nature, and the team needs to be able to scale experimentation and move into production quickly, simultaneously train thousands of models with more than 200 million parameters each, and efficiently scale its infrastructure on AWS. To achieve this, Search M5 decided to use Amazon Elastic Compute Cloud (Amazon EC2), which offers secure and resizable compute capacity for virtually any workload, as part of its infrastructure solution. “We opted for Amazon EC2 because it offered access to the latest hardware at scale, available at the click of a button,” says Rejith Joseph, principal engineer at Amazon Search. In addition, Search M5 needs to store many large datasets that are each hundreds of terabytes. The team chose Amazon Simple Storage Service (Amazon S3)—an object storage service offering industry-leading scalability, data availability, security, and performance—to handle its storage needs.


“By continuing to increase our efficiency using AWS, we can unlock the possibilities of deep learning and artificial intelligence to benefit our customers.”

Rejith Joseph
Principal Engineer, Amazon Search

Solution | Scaling to Thousands of Training Jobs per Month

In the fourth quarter of 2020, Search M5 began using AWS services to build, train, and deploy its ML models. As of 2022, the team uses various AWS services to scale to thousands of training jobs per month, involving petabytes of data on large clusters of GPUs. Along with using Amazon S3 for its data needs, Search M5 uses Amazon FSx, which makes it easy to launch, run, and scale feature-rich and high-performing file systems in the cloud. The team also uses AWS Batch, a service for achieving fully managed batch processing at virtually any scale, to efficiently run its batch computing jobs. “By extensively using a combination of AWS FSx, Amazon EC2, and AWS Batch, we have increased our experimental velocity,” says Roshan Makhijani, engineering manager at Amazon Search. “The flexibility of building on AWS also helps us to expand to new regions, where there is hardware availability, in less than 3 days.”
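As an illustration of this pattern, the following is a minimal sketch of submitting a containerized training job to AWS Batch from Python with boto3. The job name, queue, job definition, and training command are hypothetical placeholders, not values from the Search M5 setup.

```python
# Hedged sketch: submit a GPU training job to AWS Batch using boto3.
# All names below (queue, job definition, script) are hypothetical placeholders.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="example-training-run",          # hypothetical job name
    jobQueue="gpu-training-queue",           # hypothetical GPU-backed job queue
    jobDefinition="pytorch-training:1",      # hypothetical registered job definition
    containerOverrides={
        "command": ["python", "train.py", "--epochs", "3"],
        "resourceRequirements": [
            {"type": "GPU", "value": "8"},   # ask AWS Batch to schedule 8 GPUs
        ],
    },
)

print("Submitted AWS Batch job:", response["jobId"])
```

Submitting jobs this way leaves scheduling and compute provisioning to AWS Batch, which is what makes scaling to thousands of training jobs per month practical.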

From the beginning, Search M5 collaborated with AWS product teams to solve the organization’s unique challenges. For example, cross-region compute was necessary to access the extensive compute resources required for data-intensive training jobs, but previously, no practical solution existed for flexible cross-region computing. “Working closely alongside AWS, we built some new features so that we could achieve cross-region computing and remove that roadblock to progress,” says Zeng. In addition, as the team’s data needs kept growing, it began to push the limits of Amazon FSx. Working alongside AWS, Search M5 resolved all performance issues and laid the groundwork for continued scaling. Because of these enhancements, scaling Search M5’s ML infrastructure now takes just 1–2 weeks.

The team also developed a custom solution using a C++ library to set up Amazon S3 cross-streaming—storing data in one region and streaming that data in another region—without impacting the speed of training jobs. “Using Amazon S3, we achieved the high-throughput streaming solutions that we need,” says Makhijani. Search M5 kept its costs low and optimized performance during ML inference by choosing the best-suited hardware among GPUs, CPUs, and AWS Inferentia, a high-performance ML inference chip custom designed by AWS. “Not every model will deliver the same throughput on different hardware,” says Joseph. “So, the choice of hardware helps us to scale our model architecture and optimize for multiple types of hardware while keeping cost in check.” Additionally, the team deployed Amazon EC2 P4d Instances in EC2 UltraClusters, which comprise high-performing compute, networking, and storage in the cloud, to attain optimal compute and communication throughput. The use of AWS Deep Learning AMIs and AWS Deep Learning Containers, which provide ML practitioners with optimized and secure ML frameworks and tools to accelerate deep learning in the cloud, streamlined the provisioning and deployment of EC2 instances and enabled scaling. As part of this solution, the team also used Elastic Fabric Adapter (EFA), a network interface for Amazon EC2 instances that customers can use to run applications requiring high levels of internode communication at scale on AWS.
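The cross-region streaming approach itself was built as a custom C++ library, but the underlying idea can be sketched in a few lines of Python with boto3: pin the S3 client to the region where the data lives and stream objects in chunks rather than downloading them to disk. The bucket, key, and region below are hypothetical, and this illustrates the pattern rather than the team’s implementation.

```python
# Hedged sketch of cross-region streaming from Amazon S3: the data bucket lives in
# one region while the training host runs in another. Bucket, key, and region names
# are hypothetical placeholders.
import boto3

# Pin the client to the region where the data is stored, regardless of where
# this process runs.
s3 = boto3.client("s3", region_name="us-west-2")

def stream_object(bucket: str, key: str, chunk_size: int = 8 * 1024 * 1024):
    """Yield an S3 object in fixed-size chunks instead of downloading it to disk."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    yield from body.iter_chunks(chunk_size=chunk_size)

total_bytes = 0
for chunk in stream_object("example-training-data", "shards/part-00000.bin"):
    total_bytes += len(chunk)  # a real pipeline would hand chunks to the data loader
print(f"Streamed {total_bytes} bytes across regions")
```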

In addition, Search M5 uses PyTorch on AWS, an open-source deep learning framework that simplifies developing ML models and deploying them into production. Specifically, the team experiments with PyTorch libraries such as distributed data parallel, the Amazon S3 plug-in, and fully sharded data parallel for distributed training, along with tools like PyTorch Profiler. Now that departments across Amazon can harness the power of deep learning, the uses for these capabilities are virtually limitless. For example, Search M5 has developed ML models that improve the search experience by accurately correcting customers’ spelling mistakes during searches. “ML applications help the system to accurately read customers’ true intentions and give them a diverse list of relevant recommendations,” says Zeng. “These capabilities are powered by rich, nuanced information from our pretrained models.”
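To make the distributed-training piece concrete, here is a minimal sketch of wrapping a model in PyTorch’s fully sharded data parallel (FSDP) API, one of the libraries named above. The toy model, hyperparameters, and the assumption of a torchrun launch on NCCL-capable GPUs are illustrative only, not Search M5’s code.

```python
# Hedged sketch: fully sharded data parallel (FSDP) training with PyTorch.
# Assumes launch via torchrun on one or more GPU nodes; the model and data are toys.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across all ranks,
    # which is what lets very large models fit on the cluster.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):  # stand-in for iterating over a real data loader
        batch = torch.randn(32, 1024, device="cuda")
        loss = model(batch).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=8 train_fsdp.py (a hypothetical script name), each process holds only a shard of the model state rather than a full replica.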

Outcome | Continuing to Optimize Efficiency

Amazon Search now has the technology in place to build ML models at scale. The next steps for Search M5 include plans to keep improving its global cluster to enhance productivity and improve usage. The team will also use new Amazon EC2 instances to match different models, both for training and inference. Search M5 will continue to work alongside AWS to optimize the resiliency of its infrastructure, increase productivity, and reduce the overhead costs of training large models. “By continuing to increase our efficiency using AWS, we can unlock the possibilities of deep learning and artificial intelligence to benefit our customers,” says Joseph.

About Amazon

Amazon is an American multinational technology company that focuses on ecommerce, cloud computing, digital streaming, and artificial intelligence.

AWS Services Used

Amazon Simple Storage Service (Amazon S3)

Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can store and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps.


Amazon FSx

Amazon FSx makes it easy and cost effective to launch, run, and scale feature-rich, high-performance file systems in the cloud. It supports a wide range of workloads with its reliability, security, scalability, and broad set of capabilities.


Amazon Elastic Compute Cloud (Amazon EC2)

Amazon Elastic Compute Cloud (Amazon EC2) offers the broadest and deepest compute platform, with over 500 instances and choice of the latest processor, storage, networking, operating system, and purchase model to help you best match the needs of your workload.


AWS Batch

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.




Get Started

Organizations of all sizes across all industries are transforming their businesses and delivering on their missions every day using AWS. Contact our experts and start your own AWS journey today.