AWS Machine Learning Blog

Tag: distributed training

Scaling Large Language Model (LLM) training with Amazon EC2 Trn1 UltraClusters

Modern model pre-training often calls for larger cluster deployments to reduce training time and cost. At the server level, such training workloads demand faster compute and increased memory allocation. As models grow to hundreds of billions of parameters, they require a distributed training mechanism that spans multiple nodes (instances). In October 2022, we launched Amazon EC2 […]

Best practices for TensorFlow 1.x acceleration training on Amazon SageMaker

Today, many customers use TensorFlow to train deep learning models for clickthrough rate in advertising and personalized recommendations in ecommerce. As the behavior of their clients changes, they can accumulate large amounts of new data every day. Model iteration is one of a data scientist’s daily jobs, but they face the […]