Overview
TwelveLabs has two primary models: Marengo and Pegasus. Marengo is a multimodal embedding model that captures the visual, audio, and spatial-temporal context of videos to power any-to-any video search, allowing users to easily search and retrieve video data from massive archives. Pegasus is a video language model that analyzes video content to generate everything from text-based outputs like summaries and analyses to structured outputs like metadata and JSON, powering downstream video workflows and tools. It is designed to address the practical challenges of real-world video understanding and analysis, from fine-grained temporal reasoning to handling content that spans from seconds to hours, and it has been adopted by a broad spectrum of users, from enterprises managing video datasets comprising millions of hours to individuals pursuing creative projects. With these models, TwelveLabs helps users effectively search, classify, and utilize video data across a wide range of applications.

From media giants like MLSE and the Washington Post to enterprises in advertising, manufacturing, and government, TwelveLabs powers video intelligence across industries where visual data drives decisions. The company also partners with leading media asset management providers including Mimir, Iconik, and Adobe.
About TwelveLabs
TwelveLabs is a pioneer in building multimodal, video-native AI models that unlock the full potential of video for the world. TwelveLabs models, available in Amazon Bedrock, empower users to find specific moments in vast video archives using images, video, or natural language search capabilities like “show me the first touchdown of the game” or “find the scene where the main characters first meet.” The models can also power applications that analyze and interpret video content, automatically generating descriptive text such as topics, summaries, chapters, or highlights. Together, these capabilities deliver precise retrieval and rapid insight discovery across even the largest video archives.
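As a rough illustration of how an application might call one of these models through Amazon Bedrock, the sketch below uses the standard boto3 InvokeModel call to ask a Pegasus-style model to summarize a video stored in S3. The model ID and request-body fields are placeholders and assumptions, not confirmed by this case study; consult the Bedrock model documentation for the exact identifiers and schema.

```python
import json

import boto3

# Hypothetical call to a TwelveLabs video-language model on Amazon Bedrock.
# Model ID and request-body fields are placeholders/assumptions.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.invoke_model(
    modelId="twelvelabs.pegasus-example-v1:0",  # placeholder model ID
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        # Assumed request shape: a prompt plus a pointer to a video in S3.
        "inputPrompt": "Summarize the key plays in this game footage.",
        "mediaSource": {
            "s3Location": {"uri": "s3://example-bucket/game.mp4"}
        },
    }),
)

print(json.loads(response["body"].read()))
```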
Challenge | Scaling video AI from research to global production
TwelveLabs has developed breakthrough foundation models that understand video the way humans do—not as sequences of frames or transcripts, but as unified stories across sight, sound, and time. But bringing this research to life required infrastructure that could handle the unique demands of video AI at scale.
Video is the most complex form of unstructured data. For enterprises managing millions of hours—from broadcast archives to security footage—traditional search methods fall apart. The sheer volume and complexity of the data overwhelm them. Finding specific moments, extracting insights, or making video truly searchable at this scale demands infrastructure purpose-built for the challenge.
The TwelveLabs team needed cloud infrastructure that could support the full lifecycle: training cutting-edge multimodal models, running inference on petabytes of video data, and enabling customers to deploy video intelligence globally—all while maintaining the reliability and cost-efficiency required for production workloads.
Opportunity | Building on AWS infrastructure to accelerate deployment
AWS provided the scalable, reliable infrastructure foundation that allowed TwelveLabs to move quickly from research to global production deployment.
The company's flagship models, Marengo and Pegasus, required infrastructure that could handle both intensive training workloads and high-volume inference at scale. Amazon SageMaker HyperPod's built-in resiliency enabled the team to train models across thousands of GPUs without managing infrastructure failures manually.
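For context, HyperPod clusters are provisioned through the SageMaker CreateCluster API. The minimal sketch below shows what such a request might look like; the cluster name, instance type and count, lifecycle script location, and IAM role are illustrative placeholders, not TwelveLabs' actual training configuration.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Minimal sketch of provisioning a SageMaker HyperPod cluster.
# All names, counts, and ARNs below are illustrative placeholders.
sagemaker.create_cluster(
    ClusterName="video-foundation-model-training",
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p5.48xlarge",  # H100-based training nodes
            "InstanceCount": 16,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://example-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        }
    ],
)
```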
But training was only part of the equation. At the heart of TwelveLabs' technology is vector search: Marengo encodes each scene, audio segment, and visual element into multi-dimensional vector embeddings—numerical representations that capture semantic meaning across sight, sound, and time. When a user searches for "show me clips where someone is celebrating," that query gets embedded into the same vector space. The system performs approximate nearest neighbor search, mathematically comparing the query against billions of stored embeddings to find the closest matches. This enables precise retrieval without exact keyword matches or metadata tags.
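Conceptually, the retrieval step can be pictured as the brute-force comparison below: a minimal numpy sketch with made-up corpus size, dimensionality, and random data rather than real embeddings.

```python
import numpy as np

# Toy illustration of embedding-based retrieval: brute-force cosine similarity
# over random vectors. Corpus size and dimensionality are made up.
rng = np.random.default_rng(0)

video_embeddings = rng.standard_normal((10_000, 1024))  # one row per indexed scene or segment
video_embeddings /= np.linalg.norm(video_embeddings, axis=1, keepdims=True)

query_embedding = rng.standard_normal(1024)  # e.g. "show me clips where someone is celebrating"
query_embedding /= np.linalg.norm(query_embedding)

scores = video_embeddings @ query_embedding   # cosine similarity against every stored vector
top_k = np.argsort(scores)[::-1][:10]         # indices of the 10 closest segments
print(top_k, scores[top_k])
```

At billions of vectors this exhaustive scan becomes impractical, which is why approximate nearest neighbor indexes trade a small amount of recall for sub-second latency.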
The scale is massive: a single hour of video generates thousands of embeddings. For customers processing millions of hours, that's billions of vectors requiring storage, indexing, and sub-second search performance.
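A back-of-envelope calculation makes the point; the per-hour embedding count and dimensionality below are assumptions chosen for illustration, not TwelveLabs' published figures.

```python
# Back-of-envelope storage estimate under assumed numbers.
hours_of_video = 1_000_000      # a customer archive of one million hours
embeddings_per_hour = 2_000     # assumed: a few embeddings per scene and audio segment
dims = 1024                     # assumed embedding dimensionality
bytes_per_float = 4             # float32

total_vectors = hours_of_video * embeddings_per_hour
total_bytes = total_vectors * dims * bytes_per_float
print(f"{total_vectors:,} vectors ≈ {total_bytes / 1e12:.1f} TB of raw float32 embeddings")
```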
Solution | TwelveLabs + AWS: A unified platform for video intelligence
Today, TwelveLabs runs its production stack on AWS. Model training, inference, and customer-facing services all operate within a unified environment built for reliability and scale.
The company's core models run on Amazon Elastic Kubernetes Service (EKS) for portability and scalability, supported by Amazon EC2 instances that auto-scale dynamically based on demand—from processing a single query to indexing entire broadcast archives overnight. For inference, TwelveLabs uses Amazon EC2 G6e Instances powered by NVIDIA L40S Tensor Core GPUs for cost-efficient, high-volume workloads, with the flexibility to deploy Amazon EC2 P5 instances powered by NVIDIA H100 Tensor Core GPUs for the most computationally intensive tasks on demand.
Amazon S3 serves as the backbone of TwelveLabs' data infrastructure—storing raw video assets, processed multimodal data, and intermediate outputs with the durability and global reach required for enterprise customers. When customers send requests to the platform's APIs, their video, audio, and text assets flow directly into and out of S3. The platform processes these assets, generates vector embeddings through models hosted on EKS, and returns structured results from unstructured data—all without moving data across systems unnecessarily.
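On the customer side, the asset flow follows the familiar S3 pattern, sketched below with placeholder bucket and object names: an upload for processing, and a time-limited URL for retrieval.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative asset flow; bucket and key names are placeholders.
s3.upload_file("game_footage.mp4", "example-video-assets", "raw/game_footage.mp4")

download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-video-assets", "Key": "raw/game_footage.mp4"},
    ExpiresIn=3600,  # URL valid for one hour
)
print(download_url)
```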
For customers deploying TwelveLabs models through Amazon Bedrock, Amazon S3 Vectors transformed the architecture. Rather than maintaining separate infrastructure for video storage and vector embeddings, S3 Vectors enables customers to store both in the same S3 buckets. Video assets and their searchable embeddings live together, with sub-second semantic search across billions of vectors and the same durability guarantees as standard S3 storage. When a user searches for "show me clips where the team celebrates after scoring," the query gets embedded into the same vector space as the video content, S3 Vectors performs approximate nearest neighbor search across billions of stored embeddings, and results return with precise timestamps and video IDs—all orchestrated through a single, integrated platform.
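A hedged sketch of that query path is shown below. The client, operation, and parameter names follow the S3 Vectors preview API as best understood and may differ across SDK versions; the bucket, index, and embedding values are placeholders.

```python
import boto3

# Hedged sketch of querying an S3 Vectors index; names are placeholders and
# the API surface may differ from the SDK version you have installed.
s3vectors = boto3.client("s3vectors")

query_embedding = [0.0] * 1024  # placeholder: in practice, the embedding of the search phrase

response = s3vectors.query_vectors(
    vectorBucketName="example-vector-bucket",
    indexName="video-segments",
    queryVector={"float32": query_embedding},
    topK=10,
    returnMetadata=True,   # metadata can carry video IDs and timestamps
    returnDistance=True,
)
for match in response["vectors"]:
    print(match["key"], match.get("metadata"), match.get("distance"))
```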
This infrastructure enables TwelveLabs to deliver video intelligence that scales from individual developers to enterprises processing millions of hours of content, with the reliability and performance that production applications demand. The architectural simplification dramatically improves unit economics and makes video AI viable at scales that weren't previously possible.
Outcome | TwelveLabs and AWS are better together
Building breakthrough AI models requires more than just great ideas—it demands infrastructure that can scale from research to global production without compromise. By leveraging AWS's cloud infrastructure, TwelveLabs transformed its foundation models from research prototypes into production systems powering video intelligence at petabyte scale. "Video understanding at scale isn't just about model performance—it's about engineering the entire system," said Lee.
With TwelveLabs models now available in Amazon Bedrock, developers can build sophisticated video AI applications while maintaining complete control over their data. For organizations with massive video archives, S3 Vectors offers a seamless path to making that content searchable and actionable—transforming petabytes of footage from write-only storage into a source of continuous insight and intelligence.
"Nearly 90% of the world's data is unstructured, a majority of it in video, yet most of it is unsearchable," said Lee. "We are now able to address this challenge, surfacing highly contextual videos to bring experiences to life, similar to how humans see, hear, and understand the world around us."
"AWS's infrastructure gave us the foundation to focus on what matters: structuring how our models process temporal sequences and multimodal context. We've built video AI that reasons like humans do: understanding not just what happens, but why it matters and how moments connect across time."
Jae Lee
CEO and co-founder of TwelveLabs