Amazon SageMaker HyperPod customers
Top AI start-ups and organizations of all sizes are training and deploying foundation models at scale on SageMaker HyperPod
Luma AI
Training frontier visual AI models requires massive compute power and seamless infrastructure. Luma AI trains on 1,000 times more data than the largest LLMs, demanding an advanced, scalable solution. SageMaker HyperPod delivers the reliability, performance, and efficiency needed to keep GPUs, networking, and storage working in perfect unison. With HyperPod, AI developers can train complex models faster, optimize resources, and bring cutting-edge AI to market with confidence.
Amazon Nova
The Amazon AGI team trained Amazon Nova foundation models on SageMaker HyperPod with optimized infrastructure, high-speed storage, and integrated monitoring and observability tools. SageMaker HyperPod enables resilient, efficient, and scalable model development across large, distributed clusters.
Hugging Face
Hugging Face used SageMaker HyperPod to create new open foundation models such as StarCoder, IDEFICS, and Zephyr. SageMaker HyperPod's purpose-built resiliency and performance capabilities have enabled their open science team to focus on innovating and publishing important improvements to the way foundation models are built, rather than managing infrastructure.
Perplexity AI
Perplexity built and fine-tuned the LLMs that power their conversational answer engine, which answers questions and provides references in the form of citations. With SageMaker HyperPod, they perform model training 40% faster and run experiments twice as fast.
Articul8 AI
With HyperPod, Articul8 increased productivity by 35% and scaled up its GenAI operations. With automated task prioritization and resource allocation in SageMaker HyperPod, they have seen a dramatic improvement in GPU utilization, reducing idle time and accelerating model development across tasks ranging from training and fine-tuning to inference. With SageMaker HyperPod observability, they deploy metric collection and visualization systems in a single click, saving teams days of otherwise manual setup and improving cluster observability workflows and insights.
Writer
Writer is pioneering a new era of LLM development. They trained their industry-leading models on HyperPod, achieving faster model training, reduced latency, and optimized AI performance.
Noetik
Noetik is an AI-native biotechnology company leveraging SageMaker HyperPod to discover and develop cancer therapeutics.
Sony Honda Mobility
Sony Honda Mobility is using SageMaker HyperPod for model training within their MLOps pipeline to enhance AFEELA Intelligent Drive. " HyperPod's out-of-the-box observability features provide us with a comprehensive set of metrics across multiple dimensions (cluster, node, task, and so on). We look forward to gaining deeper, preconfigured health and performance insights with task-level aggregation. "
Motoi Kataoka, MLOps Engineer in the Network Service Development Division at Sony Honda Mobility

Thomson Reuters
Thomson Reuters has been at the forefront of AI development for over 30 years, and we are committed to providing meaningful solutions that help our customers deliver results faster, with better access to trusted information. To accelerate our innovation in generative AI, in addition to partnering with LLM providers, we are also exploring training custom models more efficiently with our unique, proprietary content and human expertise. SageMaker HyperPod's distributed training libraries help us improve large-scale model training performance, and its resiliency features save us time in monitoring and managing infrastructure. Training our foundation models on SageMaker HyperPod will increase our speed to market and help us provide quality solutions for our customers at pace.
Joel Hron, Head of AI and Labs, Thomson Reuters and John Duprey, Distinguished Engineer, Thomson Reuters Labs

Stability AI
As the leading open-source generative AI company, our goal is to maximize the accessibility of modern AI. We are building foundation models with tens of billions of parameters, which require infrastructure that can scale optimized training performance. With SageMaker HyperPod's managed infrastructure and optimization libraries, we can reduce training time and costs by over 50%. It makes our model training more resilient and performant, so we can build state-of-the-art models faster.
Emad Mostaque, Founder and CEO, Stability AI

Recursal AI
The whole process was streamlined. Using SageMaker HyperPod, we can take advantage of cluster resiliency features that identify and automatically recover training jobs from the last saved checkpoint in the event of a hardware failure. We run very diverse workloads, spanning application serving, inference, and training, with Kubernetes as the common thread. For us, Amazon EKS with SageMaker HyperPod just works: the nodes just drop into our cluster.
Nathan Wilce, Infrastructure/data lead, Recursal
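The automatic recovery Recursal describes relies on the training script being able to pick up from its last checkpoint after the cluster replaces a faulty node and relaunches the job. Below is a minimal sketch of that resume-from-checkpoint pattern in PyTorch, assuming a shared filesystem path; the path and model are illustrative, not HyperPod internals or Recursal's code.

```python
# Minimal resume-from-checkpoint pattern (illustrative only).
# On relaunch after a failure, the script reloads the latest
# checkpoint so training continues instead of starting over.
import os
import torch
import torch.nn as nn

CKPT_PATH = "/fsx/checkpoints/latest.pt"  # assumed shared filesystem path

model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_epoch = 0

# On (re)start, resume from the last saved checkpoint if one exists.
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 100):
    ...  # one epoch of training
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT_PATH,
    )
```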

Hippocratic AI
Hippocratic AI is an AI company developing the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and its supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. Amazon SageMaker HyperPod flexible training plans made it easier for the company to secure access to Amazon Elastic Compute Cloud (Amazon EC2) P5 instances, and it also uses AWS services such as Grafana to track important GPU utilization metrics. Using Amazon EC2 P5 instances, Hippocratic AI has increased model training speed fourfold and scaled its solution to accommodate hundreds of use cases.

NinjaTech
NinjaTech AI, a generative AI company that provides an all-in-one SuperAgent for unlimited productivity, used Amazon SageMaker HyperPod flexible training plans to accelerate the fine-tuning of various internal models, including the Llama 3.1 405B model, reduce model training costs, and automate the process. The company aims to provide a seamless experience to users who want access to the various AI agents powering its SuperAgent technology. To achieve this, it needed a model that could automatically predict user intention and determine which AI agent would be a good fit, a mechanism that required frequent model updates incorporating customer feedback and new features, with 10M-100M tokens at each round of LoRA fine-tuning. For a startup, acquiring and operating high-performance compute resources is challenging due to steep costs and bandwidth constraints, particularly in multi-node clusters that require fast networking and fast storage in addition to accelerated computing. The training process itself is also time-consuming, involving steps such as model downloading, distributed training, checkpointing, monitoring, auto-remediation, merging, and quantization. HyperPod flexible training plans provided the company with reliable and affordable compute in advance of the training run, matched to its specific compute and timeline requirements, while ensuring efficient model training.
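As a rough illustration of what one such LoRA round involves, here is a minimal fine-tuning sketch using the open-source Hugging Face transformers and peft libraries; the model name, adapter rank, and target modules are illustrative assumptions, not NinjaTech's actual configuration.

```python
# Illustrative LoRA fine-tuning sketch with Hugging Face transformers/peft.
# Model name and hyperparameters are assumptions, not NinjaTech's setup;
# the same pattern scales up to models like Llama 3.1 405B.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA trains small low-rank adapter matrices instead of all weights,
# which is why each round can get by with 10M-100M tokens.
lora = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a small fraction of the base weights

# ...run training with your preferred trainer, then merge the adapters
merged = model.merge_and_unload()   # the "merging" step mentioned above
```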

OpenBabylon
Developers and data scientists at OpenBabylon, an AI company that customizes large language models for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources for large-scale experiments. Using SageMaker HyperPod's multi-node distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was delivered cost-effectively and on schedule, demonstrating SageMaker HyperPod's ability to deliver complex projects on time and on budget.

Salesforce
Researchers at Salesforce were looking for ways to quickly get started with foundation model training and fine-tuning, without having to worry about infrastructure or spend weeks optimizing their training stack for each new model. With Amazon SageMaker HyperPod recipes, researchers at Salesforce can prototype rapidly when customizing FMs. Now, Salesforce's AI Research teams can get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.
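As a sketch of what starting from a recipe can look like, the snippet below uses the SageMaker Python SDK's recipe support; the recipe name, role ARN, and instance settings are assumptions for illustration, and the exact parameters should be verified against the SDK documentation.

```python
# Hedged sketch: launching a pre-built HyperPod recipe via the
# SageMaker Python SDK. Recipe name, role ARN, and instance type
# are illustrative assumptions, not a confirmed configuration.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    training_recipe="fine-tuning/llama/hf_llama3_8b_seq8k_gpu_lora",  # illustrative
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.p5.48xlarge",
)
estimator.fit()  # launches training with the recipe's tuned defaults
```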

H.AI
" With Amazon SageMaker HyperPod, we built and deployed the foundation models behind our agentic AI platform using the same high-performance compute. This seamless transition from training to inference streamlined our workflow, reduced time to production, and ensured consistent performance in live environments. HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency. "
Laurent Sifre, Co-founder & CTO, H.AI

Datology AI
" We are excited to use Amazon SageMaker HyperPod’s one-click observability solution. Our senior staff members needed insights into how we’re utilizing expensive GPU resources. The pre-built Grafana dashboards will give us exactly what we needed, with immediate visibility into critical metrics - from task-specific GPU utilization to file system (FSx for Lustre) performance - without requiring us to maintain any monitoring infrastructure. As someone who appreciates the power of the Prometheus Query Language, I like the fact that I can write my own queries and analyze custom metrics without worrying about infrastructure problems. "
Josh Wills, Member of Technical Staff, Datology AI
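For readers curious what such a custom query looks like, here is a small sketch that runs a PromQL query against a Prometheus HTTP API endpoint; the endpoint URL and the DCGM_FI_DEV_GPU_UTIL metric name (from NVIDIA's DCGM exporter, commonly used for GPU metrics) are assumptions about a typical cluster setup, not details confirmed by the quote.

```python
# Sketch: run a PromQL query against a Prometheus HTTP API endpoint.
# The endpoint URL and metric name are assumptions about a typical
# GPU cluster; adjust both to match your own monitoring stack.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical
query = "avg by (instance) (DCGM_FI_DEV_GPU_UTIL)"  # mean GPU util per node

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": query},
    timeout=10,
)
resp.raise_for_status()

# Prometheus returns an instant vector: one (timestamp, value) per series.
for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "unknown")
    _, value = result["value"]
    print(f"{instance}: {float(value):.1f}% GPU utilization")
```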

Amazon SageMaker HyperPod partners
Drive innovation and unlock greater business value with AWS partners that have deep technical knowledge and proven customer success
Accenture
" We are extending our partnership with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Our collaboration with AWS will allow us to guide customers towards the latest technological breakthroughs while helping to reduce generative AI application costs. By bringing together centralized governance capabilities in SageMaker HyperPod, and our experience in generative AI projects, we can help companies realize the value of generative AI even faster, improving customer experience and increasing return on investment. "
Jennifer Jackson, Global Lead for Accenture AWS Business Group & Senior Managing Director

Slalom
" We are thrilled to collaborate with AWS as a launch partner for Amazon SageMaker HyperPod task governance. Working with AWS, we can now help our customers rapidly adopt the latest technological advancements and reduce the costs of their generative AI applications. By bringing together centralized governance capabilities in SageMaker HyperPod, with Slalom’s extensive AI and cloud experience, we can deliver exceptional customer experiences alongside increased return on investment. "
Jeff Kempiners, Managing Director of Slalom’s Amazon Center of Excellence (CoE)

Rackspace Technology
" We are excited to collaborate with AWS as a launch partner for SageMaker HyperPod task governance. Together, we can help our customers reduce the costs of generative AI applications, while keeping up with the latest technological advancements. By combining SageMaker HyperPod’s centralized governance capabilities with Rackspace’s deep AI and cloud expertise, we can transform customer experiences and improve their return on investment simultaneously. "
Srini Koushik, President, AI, Technology and Sustainability at Rackspace Technology
