Amazon SageMaker HyperPod customers

Top AI startups and organizations of all sizes are training and deploying foundation models at scale on SageMaker HyperPod
  • Hugging Face

    Hugging Face has been using SageMaker HyperPod to create important new open foundation models like StarCoder, IDEFICS, and Zephyr, which have been downloaded millions of times. SageMaker HyperPod’s purpose-built resiliency and performance capabilities have enabled our open science team to focus on innovating and publishing important improvements to the ways foundation models are built, rather than managing infrastructure. We especially liked how SageMaker HyperPod is able to detect ML hardware failures and quickly replace the faulty hardware without disrupting ongoing model training. Because our teams need to innovate quickly, this automated job recovery feature helped us minimize disruption during the foundation model training process, helping us save hundreds of hours of training time in just a year.

    Jeff Boudier, Head of Product at Hugging Face
  • Perplexity AI

    We were looking for the right ML infrastructure to increase productivity and reduce costs in order to build high-performing large language models. After running a few successful experiments, we switched to AWS from other cloud providers in order to use Amazon SageMaker HyperPod. We have been using HyperPod for the last four months to build and fine-tune the LLMs to power the Perplexity conversational answer engine that answers questions along with references provided in the form of citations. Because SageMaker HyperPod automatically monitors cluster health and remediates GPU failures, our developers are able to focus on model building instead of spending time on managing and optimizing the underlying infrastructure. SageMaker HyperPod’s built-in data and model parallel libraries helped us optimize training time on GPUs and double the training throughput. As a result, our training experiments can now run twice as fast, which means our developers can iterate more quickly, accelerating the development of new generative AI experiences for our customers.

    Aravind Srinivas, co-founder and CEO at Perplexity AI
  • Articul8 AI

    Amazon SageMaker HyperPod has helped us tremendously in managing and operating our computational resources more efficiently with minimal downtime. We were early adopters of the Slurm-based HyperPod service and have benefited from its ease of use and resiliency features, resulting in up to 35% productivity improvement and rapid scale-up of our GenAI operations. As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us, as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers, as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.

    Arun Subramaniyan, Founder and CEO of Articul8 AI
  • Thomson Reuters

    We were able to meet our large language model training requirements using Amazon SageMaker HyperPod. Using Amazon EKS on SageMaker HyperPod, we were able to scale up capacity and easily run training jobs, enabling us to unlock the benefits of LLMs in areas such as legal summarisation and classification.

    John Duprey, Distinguished Engineer, Thomson Reuters Labs

    Thomson Reuters has been at the forefront of AI development for over 30 years, and we are committed to providing meaningful solutions that help our customers deliver results faster, with better access to trusted information. To accelerate our innovation in generative AI, in addition to partnering with LLM providers, we are also exploring training custom models more efficiently with our unique and proprietary content and human expertise. SageMaker HyperPod’s distributed training libraries help us improve large-scale model training performance, and its resiliency features save us time spent monitoring and managing infrastructure. Training our foundation models on SageMaker HyperPod will increase our speed to market and help us provide quality solutions for our customers at pace.

    Joel Hron, Head of AI and Labs, Thomson Reuters
  • Stability AI

    As the leading open-source generative AI company, our goal is to maximize the accessibility of modern AI. We are building foundation models with tens of billions of parameters, which require infrastructure that can scale optimized training performance. With SageMaker HyperPod’s managed infrastructure and optimization libraries, we can reduce training time and costs by over 50%. It makes our model training more resilient and performant, so we can build state-of-the-art models faster.

    Emad Mostaque, Founder and CEO, Stability AI
  • Observea

    As a fast-moving startup and AI research company, Amazon EKS support in SageMaker HyperPod has been instrumental in accelerating our time to market. With SageMaker HyperPod, we have been able to launch a stable and secure platform to offer containerized high-performance computing (HPC) applications as a service to our end customers, which include top university AI research programs, AI startups, and traditional enterprises. Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With EKS support in SageMaker HyperPod, we can reduce the time spent on undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.

    Vamsi Pandari, Founder of Observea
  • Recursal AI

    The whole process was streamlined. Using SageMaker HyperPod, we can take advantage of cluster resiliency features that identify and automatically recover training jobs from the last saved checkpoint in the event of a hardware failure. We run very diverse workloads, spanning application, inference, and training, with Kubernetes as the common thread. For us, Amazon EKS with SageMaker HyperPod just works: the nodes just drop into our cluster.

    Nathan Wilce, Infrastructure/data lead, Recursal