AWS Trainium Customers

See how customers are using AWS Trainium to build, train, and fine-tune deep learning models.

Customers

Databricks

More than 10,000 organizations worldwide — including Comcast, Condé Nast, and over 50% of the Fortune 500 — rely on Databricks to unify their data, analytics, and AI.
 
“Thousands of customers have implemented Databricks on AWS, giving them the ability to use MosaicML to pre-train, fine-tune, and serve foundation models for a variety of use cases. AWS Trainium gives us the scale and high performance needed to train our Mosaic MPT models at low cost. As we train our next-generation Mosaic MPT models, Trainium2 will make it possible to build models even faster, allowing us to provide our customers unprecedented scale and performance so they can bring their own generative AI applications to market more rapidly.”

Naveen Rao, VP of Generative AI, Databricks

Stockmark

With the mission of “reinventing the mechanism of value creation and advancing humanity,” Stockmark helps many companies create and build innovative businesses by providing cutting-edge natural language processing technology.

"With 16 nodes of Amazon EC2 Trn1 instances powered by AWS Trainium accelerator, we have developed and released stockmark-13b, a large language model with 13 billion parameters, pre-trained from scratch on a Japanese corpus of 220B tokens. The corpus includes the latest business domain texts up to September 2023. The model achieved the highest JSQuAD score (0.813) on the JGLUE (Japanese General Language Understanding Evaluation) benchmark compared to other equivalent models. It is available at Hugging Face Hub and can be used commercially with the MIT license. Trn1 instances helped us to achieve 20% training cost reduction compared to equivalent GPU instances."

Kosuke Arima, CTO, Stockmark Co., Ltd.

RICOH

RICOH offers workplace solutions and digital transformation services designed to manage and optimize the flow of information across businesses.
 
"The migration to Trn1 instances was quite straightforward. We were able to complete the training of our 13B parameter model in just 8 days. Building on this success, we are looking forward to developing and training our 70B parameter model on Trainium and are excited about the potential of these instances in training our models faster and more cost-effectively."

Yoshiaki Umetsu, Director, Digital Technology Development Center, RICOH

Helixon

“At HeliXon, we build next-generation AI solutions for protein-based therapeutics. We aim to develop AI tools that empower scientists to decipher protein function and interaction, interrogate large-scale genomic datasets for target identification, and design therapeutics such as antibodies and cell therapies. Today, we use training distribution libraries like FSDP to parallelize model training over many GPU-based servers, but this still takes us weeks to train a single model. We are excited to utilize Amazon EC2 Trn1 instances, featuring the highest networking bandwidth (800 Gbps) available in AWS, to improve the performance of our distributed training jobs and reduce our model training times, while also reducing our training costs."

Jian Peng, CEO, Helixon

Money Forward

Money Forward, Inc. serves businesses and individuals with an open and fair financial platform.

“We launched a large-scale AI chatbot service on Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances, while also reducing costs. As we periodically fine-tune tailored NLP models, reducing model training times and costs is also important. Based on our experience from the successful migration of our inference workload to Inf1 instances and our initial work on AWS Trainium-based EC2 Trn1 instances, we expect Trn1 instances will provide additional value in improving end-to-end ML performance and cost.”

Takuya Nakade, CTO, Money Forward, Inc.

Magic

Magic is an integrated product and research company developing AI that feels like a colleague to make the world more productive.

“Training large autoregressive Transformer-based models is an essential component of our work. AWS Trainium-powered Trn1 instances are designed specifically for these workloads, offering near-infinite scalability, fast inter-node networking, and advanced support for 16- and 8-bit data types. Trn1 instances will help us train large models faster, at a lower cost. We are particularly excited about Trainium’s native support for BF16 stochastic rounding, which increases performance while keeping numerical accuracy indistinguishable from full precision.”

Eric Steinberger, Cofounder and CEO, Magic

Cactus

CACTUS has a suite of products and solutions for researchers and organizations that improve how research gets funded, published, communicated, and discovered.

“At Cactus Labs, we harness the power of AI, with research focused on natural language processing, ranking and recommendation, conversational AI, large language models, computer vision, AR/VR, and XAI. In our quest to train machine learning models faster and let our researchers run more experiments while managing infrastructure costs, we were delighted to evaluate AWS Trainium. AWS Trainium’s out-of-the-box features, such as XLA optimization, multi-worker data-parallel training, and graph caching, help us reduce our training times and run more experiments faster and cheaper.”

Nishchay Shah, CTO and Head of Emerging Products, Cactus Communications
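
For readers curious what the multi-worker data-parallel training mentioned above looks like in practice, here is a minimal, hypothetical sketch using the torch-xla primitives that the AWS Neuron SDK builds on; the toy model, dataset, and hyperparameters are placeholders, not Cactus Labs' actual setup:

```python
# Hedged sketch: multi-worker data-parallel training on XLA devices
# (e.g., Trainium NeuronCores via torch-neuronx). All sizes are illustrative.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.distributed.parallel_loader as pl
from torch.utils.data import DataLoader, TensorDataset

def _train(index):
    device = xm.xla_device()  # one XLA device (NeuronCore) per worker
    model = nn.Linear(128, 2).to(device)              # placeholder model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Toy dataset; a DistributedSampler would shard data per worker in real use.
    ds = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
    loader = pl.MpDeviceLoader(DataLoader(ds, batch_size=32), device)

    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        xm.optimizer_step(opt)  # all-reduces gradients across workers, then steps

if __name__ == "__main__":
    xmp.spawn(_train)  # spawns one training process per available XLA device
```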

Watashiha

Watashiha offers an innovative, interactive AI chatbot service, “OGIRI AI,” which incorporates humor to provide funny answers to questions on the spot.

“We use large language models to incorporate humor and offer a more relevant and conversational experience to our customers on our AI services. This requires us to pre-train and fine-tune these models frequently. We pre-trained a GPT-based Japanese model on an EC2 trn1.32xlarge instance, leveraging tensor and data parallelism. The training was completed within 28 days at a 33% cost reduction over our previous GPU-based infrastructure. As our models rapidly grow in complexity, we look forward to Trn1n instances, which have double the network bandwidth of Trn1, to speed up training of larger models.”

Yohei Kobashi, CTO, Watashiha, K.K.
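
As a rough illustration of the tensor-parallel side of such a setup (a hedged sketch, not Watashiha's code), the AWS neuronx-distributed library shards individual layers across NeuronCores; the layer sizes and parallelism degree below are illustrative assumptions:

```python
# Hedged sketch: tensor parallelism with neuronx-distributed on Trainium.
# Intended to be launched with torchrun, one process per NeuronCore.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" backend
from neuronx_distributed.parallel_layers import parallel_state
from neuronx_distributed.parallel_layers.layers import (
    ColumnParallelLinear,
    RowParallelLinear,
)

torch.distributed.init_process_group("xla")
parallel_state.initialize_model_parallel(tensor_model_parallel_size=8)

class ParallelMLP(nn.Module):
    """Transformer-style MLP block with both linears sharded across cores."""
    def __init__(self, hidden=1024, ffn=4096):
        super().__init__()
        # Column-parallel: each core holds a slice of the output features.
        self.up = ColumnParallelLinear(hidden, ffn, gather_output=False)
        # Row-parallel: consumes sliced activations, all-reduces the output.
        self.down = RowParallelLinear(ffn, hidden, input_is_parallel=True)

    def forward(self, x):
        return self.down(torch.nn.functional.gelu(self.up(x)))

model = ParallelMLP().to(xm.xla_device())
```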

Partners

PyTorch

"At PyTorch, we accelerate taking machine learning from research prototyping to production readiness for customers. We have collaborated extensively with the AWS team to provide native PyTorch support for the new AWS Trainium-powered Amazon EC2 Trn1 instances, which are purpose-built for training deep learning models. Developers building PyTorch models can start training on Trn1 instances with minimal code changes. Additionally, we have worked with the OpenXLA community to enable the PyTorch Distributed libraries for easy model migration from GPU-based instances to Trn1 instances. We are excited about the innovation that Trn1 instances bring to the PyTorch community, including more efficient data types, dynamic shapes, custom operators, hardware-optimized stochastic rounding, and eager debug mode. All of this makes Trn1 well suited for wide adoption by PyTorch developers, and we look forward to future joint contributions to PyTorch to further optimize training performance."

Geeta Chauhan, Applied AI, Engineering Manager, PyTorch
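
As a concrete, hypothetical illustration of those "minimal code changes": a standard PyTorch training step mainly needs to target the XLA device that torch-neuronx exposes instead of a CUDA device. The toy model and hyperparameters here are placeholders:

```python
# Minimal sketch, assuming torch-neuronx/torch-xla are installed on a Trn1 host.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the Trainium NeuronCore, where "cuda" would have been

model = nn.Linear(784, 10).to(device)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 784).to(device)        # placeholder batch
    y = torch.randint(0, 10, (32,)).to(device)
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    xm.optimizer_step(opt, barrier=True)  # steps and flushes the lazy XLA graph
```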

Hugging Face

"Hugging Face’s mission is to democratize good ML to help ML developers around the world solve real-world problems. And key to that is ensuring the latest and greatest models run as fast and efficiently as possible on the best ML accelerators in the cloud. We are incredibly excited about the potential for Inferentia2 to become the new standard way to deploy generative AI models at scale. With Inf1, we saw up to 70% lower cost than traditional GPU-based instances, and with Inf2 we have seen up to 8x lower latency for BERT-like transformers compared to Inferentia1. With Inferentia2, our community will be able to easily scale this performance to LLMs at the 100B+ parameter scale, and to the latest diffusion and computer vision models as well."

Amazon services using AWS Trainium

Amazon

Amazon’s product search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world.

“We are training large language models (LLMs) that are multi-modal (text + image), multilingual, multi-locale, pre-trained on multiple tasks, and span multiple entities (products, queries, brands, reviews, etc.) to improve the customer shopping experience. Trn1 instances provide a more sustainable way to train LLMs by delivering the best performance per watt compared to other accelerated machine-learning solutions, and they offer us high performance at the lowest cost. We plan to explore the new configurable FP8 data type and hardware-accelerated stochastic rounding to further increase our training efficiency and development velocity.”

Trishul Chilimbi, VP, Amazon Search