AWS Machine Learning Blog

Category: AWS Trainium

Develop and train large models cost-efficiently with Metaflow and AWS Trainium

This is a guest post co-authored with Ville Tuulos (Co-founder and CEO) and Eddie Mattia (Data Scientist) of Outerbounds. To build a production-grade AI system today (for example, to do multilingual sentiment analysis of customer support conversations), what are the primary technical challenges? Historically, natural language processing (NLP) would be a primary research and development […]

Revolutionizing large language model training with Arcee and AWS Trainium

This is a guest post by Mark McQuade, Malikeh Ehghaghi, and Shamane Siri from Arcee. In recent years, large language models (LLMs) have gained attention for their effectiveness, leading various industries to adapt general LLMs to their data for improved results, making efficient training and hardware availability crucial. At Arcee, we focus primarily on enhancing […]

Open source observability for AWS Inferentia nodes within Amazon EKS clusters

This post walks you through the Open Source Observability pattern for AWS Inferentia, which shows you how to monitor the performance of ML chips, used in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, with data plane nodes based on Amazon Elastic Compute Cloud (Amazon EC2) instances of type Inf1 and Inf2.