
Breaking the sound barrier: How AudioShake is teaching machines to hear like humans

AI-powered audio separation startup built scalable infrastructure on AWS for training and inference across its audio workflows

Overview

Walk into a packed sports stadium and you hear the roar of the crowd, the play-by-play announcer, stadium music pumping through speakers, and thousands of overlapping conversations. Or listen to your favorite song: the vocals, drums, guitar, and bass are woven together into a single sonic tapestry. Yet, without even thinking about it, the human brain picks apart these layered sounds, focusing on what matters while filtering out what doesn’t.

Machines? Not so much. AudioShake set out to solve this puzzle, and in doing so, they are fundamentally changing how we create, edit, and understand sound. The San Francisco-based AI company has built technology that can “unmix” any audio recording, separating it into individual, pristine components even decades after it was mixed together.

One of the company’s most recent milestones was winning the top prize at the 2024 AWS re:Invent Unicorn Tank competition. AudioShake has also become a key infrastructure partner for major artists like Green Day, companies like Disney Music Group, and builders of the next generation of multimodal AI.

About AudioShake

AudioShake is an AI-powered music and audio technology company that uses advanced machine learning to “unmix” sound, separating any audio recording into its individual components, or stems. The company serves the music, film, sports broadcasting, and AI industries with professional-grade audio separation technology.

Challenge | Machines hear sound as one big, unusable mix

For most of human history, recorded sound has come as a full mix. A song from the 1960s? That’s vocals, drums, guitar, and bass blended into a single track. A sports broadcast? That’s commentary, crowd noise, and often copyrighted music all mixed together.

This creates problems. Artists lose access to their own work when master tapes are lost. A film studio can’t dub content into a new language without re-recording. Sports broadcasters face copyright claims over background music they can’t remove. And AI companies training the next generation of multimodal models often struggle, because machines can’t learn as well from chaotic, overlapping audio.

“Humans are good at understanding nuances in noisy environments. We can talk in a noisy restaurant and still pick up emotional cues and variations,” said Jessica Powell, CEO and Co-Founder of AudioShake. “Machines struggle with these tasks. But machines excel at detecting patterns and editing sounds in ways humans can’t. We’re trying to give each side the other’s superpowers.”

Opportunity | Opening new possibilities across industries

AudioShake is transforming how industries work with sound, using AI-powered audio separation technology to solve challenges across media, technology, and beyond.

In media and entertainment, AudioShake gives editors new ways to interact with and control their audio. Whether it’s isolating dialogue from a complex soundtrack, removing copyrighted music from content, or adapting media for international markets, AudioShake helps professionals separate and manipulate audio elements with remarkable precision. This capability has already helped resurrect classic recordings and enabled new formats for everything from vintage jazz albums to modern streaming content.

Beyond entertainment, AudioShake is reshaping how businesses handle voice communication. Call centers can now extract clear audio from multi-speaker conversations, significantly improving clarity and analysis capabilities. For AI development teams, AudioShake provides a crucial tool for creating more natural-sounding synthetic voices. By separating the nuances of human speech, including laughter, filler words, and natural speech patterns, developers can train AI models that better reflect authentic human communication.

“Think of it as bringing the precision of digital editing to the world of mixed audio,” says Powell. “Whether you’re working in post-production, developing AI models, or solving complex audio challenges, we can provide the tools to separate, analyze, and work with sound in ways that were previously impossible.”

AudioShake’s advances in sound separation technology can also address critical healthcare needs, particularly for patients with voice-affecting diseases. In partnership with nonprofit organizations, AudioShake is supporting ALS patients in recovering their voices after disease progression has impaired their speech. By isolating the patient’s original voice from archival recordings—even when other voices are present—AudioShake makes accurate voice cloning and synthesis possible, allowing patients to communicate using their authentic voice.

Solution | AI gives machines ears—and artists superpowers

AudioShake built proprietary deep learning models that can isolate any component of audio: vocals, instruments, dialogue, sound effects, background noise. That’s where AWS came in. From day one, AWS supported AudioShake’s journey, scaling infrastructure as the company grew and connecting it with go-to-market opportunities. Today, AudioShake runs its production system on AWS.

AudioShake uses a process called source separation to split mixed audio into individual components, called “stems.” Its deep learning models are trained on vast datasets of audio to recognize and isolate distinct sounds, like vocals, drums, or overlapping speakers.
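For readers who want a feel for the mechanics, the snippet below is a minimal, hypothetical Python sketch of the common mask-based approach to source separation, not AudioShake’s actual implementation: the mixture is transformed into a time-frequency representation, a model predicts a soft mask for each stem, and each masked spectrogram is converted back to audio. The `predict_masks` function is a placeholder standing in for a trained deep learning model.

```python
# Minimal, illustrative sketch of mask-based source separation.
# A trained neural network would predict the masks; a placeholder
# function stands in for the model so the example stays self-contained.
import numpy as np
from scipy.signal import stft, istft

def predict_masks(magnitude: np.ndarray, n_stems: int) -> np.ndarray:
    """Placeholder for a trained separation model.

    A real model maps the mixture spectrogram to one soft mask per stem
    (values in [0, 1] for every time-frequency bin). Here we return
    uniform masks purely to keep the sketch runnable.
    """
    return np.full((n_stems, *magnitude.shape), 1.0 / n_stems)

def separate(mixture: np.ndarray, sample_rate: int, n_stems: int = 4):
    # 1. Convert the waveform into a time-frequency representation.
    _, _, spec = stft(mixture, fs=sample_rate, nperseg=2048)

    # 2. Estimate one mask per stem from the mixture's magnitude.
    masks = predict_masks(np.abs(spec), n_stems)

    # 3. Apply each mask to the complex spectrogram and invert back to audio.
    stems = []
    for mask in masks:
        _, audio = istft(mask * spec, fs=sample_rate, nperseg=2048)
        stems.append(audio)
    return stems

# Example: split a 3-second synthetic mixture into four "stems".
sr = 44_100
mix = np.random.randn(sr * 3)
vocals, drums, bass, other = separate(mix, sr)
```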

AudioShake takes a mixed audio file — music, film audio, a podcast, or a live stream — and uses its proprietary AI models to separate the sound into its individual components. Instead of relying on hand-engineered DSP rules, AudioShake’s models learn the structure of real-world audio from large, unique licensed datasets.

When audio is uploaded or streamed through the SDK, the system converts it into a high-resolution representation, analyzes patterns across a wide range of audio aspects, and then isolates each source.

For music, this means extracting individual stems such as vocals, drums, bass, and instruments. For speech, AudioShake can identify and separate individual speakers — even when they’re talking at the same time — while also isolating dialogue from background effects or copyrighted music.
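AudioShake’s actual SDK and API are not documented in this story, so the sketch below is purely illustrative: it uses a hypothetical `audioshake_sdk` client with invented method and stem names to show the shape of the workflow described above, where a mixed file is uploaded, stems are requested, and each separated track comes back as its own file.

```python
# Hypothetical workflow sketch; the client, method names, and stem labels
# below are invented for illustration and are not AudioShake's real SDK.
from audioshake_sdk import Client  # hypothetical module

client = Client(api_key="YOUR_API_KEY")

# Upload a mixed audio file (film audio, a podcast, a live-stream capture, ...).
asset = client.upload("scene_042_mix.wav")

# Request the stems needed: for film audio that might be dialogue, music,
# and effects; for speech content it might be one track per speaker.
job = client.separate(asset.id, stems=["dialogue", "music", "effects"])
job.wait_until_complete()

# Each stem comes back as its own audio file, ready for editing or dubbing.
for stem in job.stems:
    stem.download(f"{stem.name}.wav")
```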

Unlike open-source models, AudioShake’s architecture is optimized for speed, quality, and deployment flexibility, enabling real-time separation, on-prem and edge processing, and enterprise-grade workflows across media, AI training, and live streaming.

AudioShake offers different separation models designed for various use cases:

  • Music: A model trained to separate different musical instruments from one another, enabling remixes, remastering, and sync licensing.
  • Dialogue, music, and effects: A model that can isolate speech from background music or sound effects, which is critical for dubbing, captioning, and post-production editing in film and TV.
  • Multi-speaker separation: A model that can detect and isolate multiple individual voices, even when they are overlapping, for use in podcast editing, voice AI, and accessibility services.

These models are used across a wide range of workflows. For example, AudioShake helped Green Day separate guitar tracks from their 1991 album and remix them for their fans on TikTok. The company also helped the estates of Oscar Peterson and Nina Simone remaster classic live jazz albums. They split The Wizard of Oz into individual audio stems so it could be re-released in immersive Dolby Atmos for the Sphere in Las Vegas. And they’re helping sports broadcasters “vanish” copyrighted background music from live events while keeping all the crowd energy and commentary intact.

Outcome | Building the intelligence layer of sound

This isn’t just about remixing old songs or cleaning up noisy recordings. It’s about fundamentally changing how people interact with sound, making it programmable, searchable, and structured like text or code.

“We’re laying the foundation for the ‘intelligence layer’ of sound,” says Powell. “Just as text became searchable and programmable, we believe sound will follow. By unlocking usable, structured audio data within Amazon S3 and RDS databases, we’re enabling entirely new categories of applications—from training AI that can truly hear like humans, to creating immersive experiences that adapt in real-time, to making sound accessible in ways that were never possible before.”
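As a loose illustration of what “structured audio data” in storage might look like, the sketch below writes a separated stem and a small metadata record to Amazon S3 with boto3. The bucket name, keys, and fields are placeholders rather than AudioShake’s actual schema, and the same record could also be mirrored to an RDS table for richer querying.

```python
# Illustrative sketch (not AudioShake's production code): once audio has
# been separated, each stem can be stored as an S3 object alongside
# structured metadata, making the catalogue easy to index and query.
import json
import boto3

s3 = boto3.client("s3")

stem_record = {
    "recording_id": "rec-0001",  # example identifiers, not real data
    "stem": "vocals",
    "sample_rate_hz": 44100,
    "duration_s": 214.6,
}

# Store the audio itself plus lightweight metadata on the S3 object.
with open("vocals.wav", "rb") as f:
    s3.put_object(
        Bucket="example-stem-bucket",  # placeholder bucket name
        Key="rec-0001/vocals.wav",
        Body=f,
        Metadata={k: str(v) for k, v in stem_record.items()},
    )

# Keep a JSON copy of the record next to the audio; an equivalent row
# could be written to an RDS table for relational queries.
s3.put_object(
    Bucket="example-stem-bucket",
    Key="rec-0001/vocals.json",
    Body=json.dumps(stem_record).encode("utf-8"),
    ContentType="application/json",
)
```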

The applications span industries: dubbing and localization for film and TV, isolating patient and doctor voices in telemedicine, cleaning up audio evidence in legal proceedings, training voice AI that understands overlapping conversation, and enabling the next generation of GenAI tools where users can edit audio outputs at the stem level.

AudioShake is just getting started. With AWS as their infrastructure backbone, they’re moving from unmixing individual tracks to unmixing the entire world’s sound—one recording at a time. “With partners like AWS, their ease of setup and choice of microservices, we’re able to move fast and compete with anyone,” Powell concludes. “When people thought what we were building was too niche or too hard, AWS believed in us—and helped us prove that the impossible is possible.” 
