Why Trainium?
AWS Trainium is the machine learning (ML) chip that AWS purpose-built for deep learning (DL) training of 100B+ parameter models. Each Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instance deploys up to 16 Trainium accelerators to deliver a high-performance, low-cost solution for DL training in the cloud. Although the use of DL and generative AI is accelerating, many development teams work within fixed budgets, which limits the scope and frequency of the training they need to improve their models and applications. Trainium-based Amazon EC2 Trn1 instances address this challenge by delivering faster time to train while offering up to 50% cost-to-train savings over comparable EC2 instances. Trainium is optimized for training natural language processing, computer vision, and recommender models used in a broad set of applications, such as text summarization, code generation, question answering, image and video generation, recommendation, and fraud detection.
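To make the cost-to-train comparison concrete, here is a minimal sketch of the arithmetic, assuming cost-to-train is simply the hourly instance price multiplied by training hours and instance count. The function names and every number below are hypothetical placeholders chosen for illustration; they are not AWS prices or benchmark results.

```python
def cost_to_train(hourly_price_usd: float, hours: float, instances: int = 1) -> float:
    """Total cost of a training run: price per instance-hour x hours x instance count."""
    if hourly_price_usd < 0 or hours < 0 or instances < 1:
        raise ValueError("price and hours must be non-negative, instances at least 1")
    return hourly_price_usd * hours * instances


def savings_fraction(baseline_cost: float, candidate_cost: float) -> float:
    """Fraction of the baseline cost saved by the candidate (0.5 means 50% cheaper)."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    return (baseline_cost - candidate_cost) / baseline_cost


# Hypothetical placeholder inputs, NOT real prices or measured training times.
baseline = cost_to_train(hourly_price_usd=30.0, hours=100.0, instances=4)
candidate = cost_to_train(hourly_price_usd=20.0, hours=75.0, instances=4)

print(f"baseline cost:  ${baseline:,.2f}")
print(f"candidate cost: ${candidate:,.2f}")
print(f"savings:        {savings_fraction(baseline, candidate):.0%}")
```

Substituting your own measured training time and current instance pricing shows how a lower hourly price and a shorter time to train both feed into the overall savings.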
The AWS Neuron SDK helps developers train models on Trainium accelerators (and deploy them on AWS Inferentia accelerators). It integrates natively with popular frameworks, such as PyTorch and TensorFlow, so that you can continue to use your existing code and workflows to train on Trainium accelerators.