2025

WRITER scales distributed AI model training using Amazon SageMaker HyperPod

Discover how generative AI company WRITER accelerated foundation model development by using AWS purpose-built infrastructure.

Benefits

3x

accelerated model iteration cycles

90%

reduction in training pipeline failures

0

manual intervention in workload distribution

Overview

WRITER offers an all-in-one solution that helps enterprises incorporate generative AI into their workflows. As the company’s foundation models (FMs) grew increasingly complex and computationally demanding, it needed a solution that could handle distributed training at scale without burdening its research team with infrastructure management.
Using Amazon Web Services (AWS) infrastructure, WRITER transformed its approach to training large language models (LLMs). The company migrated to an AWS managed solution that supports seamless multi-node distributed training. This migration empowered WRITER’s research team to focus on model development while improving performance across industry benchmarks.

About WRITER

WRITER is an all-in-one generative AI company that develops state-of-the-art foundation models for enterprise customers. Its offerings include general-purpose (such as Palmyra X5) and domain-specific models for finance, healthcare, and creative use.

Opportunity | Scaling distributed training at the enterprise level

Founded in 2020, WRITER is a generative AI company that develops and maintains its own FMs, called the Palmyra family. This family of enterprise-grade models includes its newest member, Palmyra X5, which offers a massive context window and can process 1 million tokens in about 20 seconds. WRITER also provides three domain-specific models: Palmyra Med (for healthcare), Palmyra Fin (for financial services), and Palmyra Creative (for creative professionals who require diverse responses).

WRITER faced challenges as its FMs grew in size and complexity. Modern LLMs have become too large to fit on single nodes, requiring sophisticated capabilities for multi-node distributed training with high-performance GPU-to-GPU communication. Additionally, hardware failures were inevitable in large-scale training operations, and WRITER’s research team was spending valuable time managing infrastructure issues rather than focusing on model development and innovation.

To overcome these limitations, WRITER used Amazon SageMaker HyperPod, which removes the undifferentiated heavy lifting involved in building generative AI models. The service’s managed approach to distributed training enabled WRITER to minimize infrastructure management overhead while benefiting from automated recovery features and robust multi-node communication capabilities.

Solution | Streamlining model training by using Amazon SageMaker HyperPod

WRITER migrated from its previous infrastructure to a managed solution built on SageMaker HyperPod, which provided the foundation for training the increasingly sophisticated Palmyra models at scale. The implementation centered on Amazon Elastic Compute Cloud (Amazon EC2) P5 Instances, specifically P5en Instances, which are high-performance GPU-based instances for deep learning and high-performance computing (HPC) applications. The instances support Elastic Fabric Adapter (EFA)—which is used to run HPC and machine learning applications at scale—facilitating the high-performance inter-node communication that’s essential for distributed training.

“We rely extensively on SageMaker HyperPod clusters for training our Palmyra models and conducting large-scale distributed-training jobs,” says Waseem Alshikh, cofounder and chief technology officer at WRITER. “The infrastructure has proved exceptionally resilient and high performing, especially with the cluster’s P5en Instances that are equipped with NVIDIA H200 GPUs, which have significantly accelerated our multi-node training workflows.”

The team used the Slurm-based job scheduling system in SageMaker HyperPod to manage training workloads while integrating its existing PyTorch-based training pipelines with open source libraries such as DeepSpeed. WRITER also paired SageMaker HyperPod with Amazon FSx for Lustre, a fully managed service that provides high-performance, cost-effective, and scalable storage. This way, the company achieved the high-throughput file I/O performance that large-scale training datasets require.
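To illustrate the workflow described above, here is a minimal sketch of what a Slurm-managed multi-node PyTorch launch on a HyperPod-style cluster might look like. This is a hypothetical configuration, not WRITER's actual setup: the job name, node count, log path, training script (train.py), and DeepSpeed config file are all illustrative assumptions.

```shell
#!/bin/bash
#SBATCH --job-name=palmyra-train      # illustrative job name (assumption)
#SBATCH --nodes=4                     # number of GPU nodes (assumption)
#SBATCH --ntasks-per-node=1           # one launcher process per node
#SBATCH --gpus-per-node=8             # GPUs per P5en instance
#SBATCH --output=/fsx/logs/%x_%j.out  # logs on an FSx for Lustre mount (assumed path)

# Prefer the EFA libfabric provider for high-performance inter-node communication
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO

# Rendezvous settings for torchrun: the first node in the allocation acts as master
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

# Launch one torchrun process per node; train.py (hypothetical) drives DeepSpeed
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
  train.py --deepspeed ds_config.json
```

Submitted with `sbatch`, this script lets Slurm handle node allocation and process placement while torchrun coordinates the distributed rendezvous across nodes.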

The managed nature of SageMaker HyperPod proved transformative for WRITER’s operations, reducing the infrastructure management burden that had previously consumed the research team’s time. When hardware failures occurred, the automated recovery systems in SageMaker HyperPod maintained training continuity without manual intervention.

“When we encountered infrastructure challenges, the SageMaker HyperPod team responded promptly and provided the necessary support to keep our projects on track,” says Alshikh. “SageMaker HyperPod features, such as robust orchestration, automated health checks, and seamless job recovery, empower us to focus on advancing model development without worrying about cluster management.”

Outcome | Accelerating innovation through robust AI infrastructure

After this implementation, WRITER accelerated its model iteration cycles threefold, reduced training pipeline failures by 90 percent, and eliminated manual intervention in workload distribution. The company’s research team, which grew from 6 to 15 people, can now dedicate its expertise entirely to model innovation rather than infrastructure troubleshooting. Reliable multi-node communication, supported by EFA-backed instances, enhanced the performance of distributed training. These benefits have empowered WRITER to maintain its position at the forefront of FM development and keep its Palmyra models performing well on industry benchmarks and leaderboards.

 

WRITER continues to use its new infrastructure to push the boundaries of what’s possible in enterprise AI. The company maintains close collaboration with technical teams at AWS so that it can quickly adopt emerging technologies and hardware advancements. WRITER has also entered into a new engagement with the AWS team, now offering Palmyra models through Amazon Bedrock, a fully managed service that offers a choice of high-performing FMs from leading AI companies. With this foundation, WRITER can focus on what it does best: building breakthrough AI solutions that transform how businesses operate.

SageMaker HyperPod features, such as robust orchestration, automated health checks, and seamless job recovery, empower us to focus on advancing model development without worrying about cluster management.

Waseem Alshikh

Cofounder and Chief Technology Officer, WRITER
