AWS for M&E Blog

Identifying music in audio files and streams on AWS

In today’s digital age, where music is readily accessible through a myriad of platforms and devices, the ability to accurately identify songs is important for customers across many industries. From entertainment and broadcasting to streaming services and beyond, there is a need for accurate music identification solutions in order to detect unauthorized uses of content, ensure proper royalty payments, enhance user experiences, and enable content discovery.

In this blog post, we delve into the world of song fingerprinting—a technique used to identify songs based on their unique audio characteristics—and explore how leveraging Amazon Web Services (AWS) can empower organizations to implement scalable and efficient music identification systems that work with stored audio files like MP3s, as well as with streaming media.

How sounds are stored

When an audio recording is made, sound waves push against a microphone component called a diaphragm, causing it to move back and forth. That displacement of the diaphragm is measured and the values are stored in a file. Each of those measurements is called a sample, and a typical sound recording stores tens of thousands of samples per second. To play a recording back, those samples are used to send electrical signals to a speaker, which reproduces the sounds by moving a diaphragm back and forth. The movement of the speaker diaphragm duplicates the movement of the microphone diaphragm, which results in playback of the recorded sounds.
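
To make this concrete, the following sketch reads samples from a WAV file using Python's standard wave module and NumPy. It assumes a 16-bit PCM file (the most common format); the file name is just a placeholder.

```python
# A minimal sketch of reading raw samples from a WAV file. Assumes 16-bit
# PCM samples, the most common format; "recording.wav" is a placeholder.
import wave

import numpy as np

with wave.open("recording.wav", "rb") as wav:
    sample_rate = wav.getframerate()        # samples per second, e.g. 44100
    n_channels = wav.getnchannels()         # 1 = mono, 2 = stereo
    raw = wav.readframes(wav.getnframes())  # every sample, as raw bytes

# Interpret the bytes as 16-bit signed integers and give each channel
# (left/right for stereo) its own column.
samples = np.frombuffer(raw, dtype=np.int16).reshape(-1, n_channels)

print(f"{sample_rate} samples/second, {n_channels} channel(s), "
      f"{samples.shape[0] / sample_rate:.1f} seconds of audio")
```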

The following image shows a waveform, a visual representation of the samples from a song (note that there are two graphs since this is a stereo recording, and each channel has its own graph).

Figure 1: A sample waveform

Unfortunately, waveform data is difficult to work with, since it only captures the amount of diaphragm displacement over time. To analyze the audio more deeply, we need to convert it into another form called a spectrogram.

A spectrogram is a visual representation of the frequencies present in an audio signal over time. It provides a detailed analysis of the spectral content of the audio, allowing us to see how the amplitude of different frequencies changes over the duration of the signal. Spectrograms are commonly used in audio processing and analysis for tasks such as speech recognition, music analysis, and sound classification. Following is a depiction of the same audio data as the previous image, displayed in spectrogram form.

Figure 2: A spectrogram based on the waveform in figure 1

Just as with the waveform image, the X axis of the image represents time. However, the Y axis differs from a waveform, as it represents different frequency bands rather than simple amplitude. The color at each point represents the intensity of a frequency at a particular point in time.

In essence, a spectrogram allows us to convert a set of audio samples into a visual representation that encapsulates information about the amplitude of frequencies over time. This enables us to analyze and interpret the spectral characteristics of the audio signal, providing valuable insights into its timbral qualities, rhythmic structure, and harmonic content. Spectrograms serve as powerful tools for tasks such as audio visualization, feature extraction, and pattern recognition, forming the basis for many advanced audio processing techniques, including song fingerprinting.
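
As a rough illustration, here is one common way to compute a spectrogram from the samples read earlier, using SciPy. The window parameters are illustrative defaults, not the values the solution itself uses.

```python
# A minimal sketch of converting samples into a spectrogram with SciPy.
# `samples` and `sample_rate` come from the previous snippet; the window
# parameters here are illustrative, not the solution's actual settings.
import numpy as np
from scipy import signal

mono = samples.mean(axis=1)  # collapse stereo to a single channel

frequencies, times, spec = signal.spectrogram(
    mono,
    fs=sample_rate,
    nperseg=4096,    # samples per FFT window
    noverlap=2048,   # 50% overlap between consecutive windows
)

# spec[i, j] is the intensity of frequencies[i] at times[j] -- exactly the
# data that figures 2 and 3 visualize. Convert to decibels for analysis.
log_spec = 10 * np.log10(spec + 1e-10)
```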

Song fingerprinting

Song fingerprinting is a technique used to identify songs based on their unique audio characteristics, allowing for efficient and accurate matching, even in the presence of noise or distortions. At the core of song fingerprinting lies the concept of extracting distinctive features from audio signals, enabling rapid comparison and retrieval of similar content from a database of known songs.

The process of song fingerprinting involves several steps. First, the audio signal is analyzed to extract relevant features that are robust to variations in volume and audio quality. Here we look at peak frequencies across time, combined with information about the chronological relationships between peaks. These features are then transformed into a compact representation known as a fingerprint, which captures the essential information needed for identification while minimizing storage requirements.

The following diagram illustrates peaks found in the spectrogram, each designated by a blue dot.

Figure 3: Amplitude peaks on top of the spectrogram in figure 2
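
One common way to find such peaks, sketched below, is to run a maximum filter over the spectrogram and keep the points that equal their local maximum and exceed a loudness threshold. The neighborhood size and threshold are assumptions for illustration, not the solution's actual parameters.

```python
# A minimal sketch of locating local amplitude peaks in a spectrogram.
# `log_spec` comes from the previous snippet; NEIGHBORHOOD and MIN_AMPLITUDE
# are illustrative values, not the solution's actual parameters.
import numpy as np
from scipy import ndimage

NEIGHBORHOOD = 20    # a peak must be the maximum within a 20x20 region
MIN_AMPLITUDE = -40  # ignore quiet points (decibels)

is_local_max = ndimage.maximum_filter(log_spec, size=NEIGHBORHOOD) == log_spec
peak_coords = np.argwhere(is_local_max & (log_spec > MIN_AMPLITUDE))

# Each peak is a (frequency bin, time bin) pair -- the blue dots in figure 3.
peaks = [(int(f_bin), int(t_bin)) for f_bin, t_bin in peak_coords]
```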

Once the fingerprints are generated, we store them in a searchable database along with metadata such as song titles, artists, and album information. When a new audio sample needs to be identified, it undergoes the same fingerprinting process to generate a query fingerprint. We then compare this query fingerprint against the fingerprints stored in the database using an efficient matching algorithm to find the closest matches.
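
One common way to form such fingerprints, sketched below, is to pair each peak with a handful of later peaks and hash each pair's two frequencies together with the time gap between them. The fan-out value and hash format here are assumptions for illustration.

```python
# A minimal sketch of turning peaks into compact fingerprint hashes: each
# hash combines two peak frequencies with the time gap between them, which
# makes it robust to volume changes. FAN_OUT and the hash format are
# illustrative assumptions.
import hashlib

FAN_OUT = 5  # pair each peak with up to 5 later peaks

def make_fingerprints(peaks):
    """Yield (hash, anchor_time) pairs from (freq_bin, time_bin) peaks."""
    ordered = sorted(peaks, key=lambda p: p[1])  # order peaks by time
    for i, (f1, t1) in enumerate(ordered):
        for f2, t2 in ordered[i + 1 : i + 1 + FAN_OUT]:
            raw = f"{f1}|{f2}|{t2 - t1}".encode()
            yield hashlib.sha1(raw).hexdigest()[:20], t1
```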

Implementing song fingerprinting on AWS

In the GitHub repository associated with this post, you’ll find code that implements a song fingerprinting solution. The solution allows you to fingerprint your own songs and also check stored media files (like MP3 files) or media streams for the presence of those songs. The solution is serverless and is invoked by adding files to an Amazon Simple Storage Service (Amazon S3) bucket. That bucket has different folders for ingesting and for checking audio files.

The process begins with uploading your sound files into the S3 bucket’s songs_to_index folder. This invokes an AWS Lambda function that retrieves the newly uploaded sound file from the S3 bucket and generates fingerprints for it. The number of fingerprints generated per song varies based on the length of the song and the frequencies used in it, but 10,000 to 25,000 fingerprints per song is a reasonable estimate.
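
A handler for that ingestion Lambda could look roughly like the following. This is a hypothetical sketch, not the repository's code: fingerprint_file stands in for the spectrogram/peak/hash pipeline sketched above, and storage is covered in the next step.

```python
# A hypothetical sketch of the ingestion Lambda handler: invoked by the S3
# upload event, it downloads the new file to /tmp and fingerprints it.
# This is not the repository's actual code.
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")

def fingerprint_file(path):
    """Placeholder for the spectrogram/peak/hash pipeline sketched above."""
    return []

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        local_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, local_path)  # /tmp is Lambda scratch space
        fingerprints = fingerprint_file(local_path)
        print(f"{key}: generated {len(fingerprints)} fingerprints")
```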

Once the fingerprints are created, the Lambda function stores the resulting hash values in an Amazon Aurora Serverless database. Aurora Serverless offers a scalable and cost-effective solution for storing and managing large volumes of fingerprint data. By leveraging the capabilities of Aurora Serverless, the system can efficiently index and query the fingerprint database, enabling fast and accurate song identification.
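
Assuming the Data API is enabled on the Aurora Serverless cluster, the fingerprints could be written roughly as follows. The table layout, database name, and ARNs are illustrative; the repository's actual schema may differ.

```python
# A hypothetical sketch of storing fingerprints in Aurora Serverless via the
# RDS Data API. Table layout, database name, and ARNs are illustrative.
import boto3

rds = boto3.client("rds-data")

CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:fingerprints"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-creds"

# One row per fingerprint: the hash plus where it occurs within the song.
# An index on the hash column is what makes lookups fast at query time.
def store_fingerprints(song_id, fingerprints):
    parameter_sets = [
        [
            {"name": "h", "value": {"stringValue": h}},
            {"name": "sid", "value": {"longValue": song_id}},
            {"name": "off", "value": {"longValue": int(offset)}},
        ]
        for h, offset in fingerprints
    ]
    # In practice, chunk parameter_sets to stay under Data API batch limits.
    rds.batch_execute_statement(
        resourceArn=CLUSTER_ARN,
        secretArn=SECRET_ARN,
        database="fingerprints",
        sql="INSERT INTO fingerprints (hash, song_id, offset) "
            "VALUES (:h, :sid, :off)",
        parameterSets=parameter_sets,
    )
```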

The following architecture diagram shows this process.

Figure 4: Architecture diagram showing how known songs are fingerprinted

Identifying songs in audio files

Once the songs you wish to track have been fingerprinted, you’ll likely want to check for the presence of those songs in either stored audio files or in media streams.

To scan a media file and check for the presence of any known songs, simply place the media file in the songs_to_check folder in the S3 bucket that the solution creates. Once a file is detected in this folder, the processing Lambda function will scan it to determine whether it contains any of the indexed songs. This matching works well even when the original song and the song to check have different volumes, or contain a certain amount of distortion like static. The results of the matching process are written into a file that is stored in the S3 bucket.
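
Conceptually, matching works by looking up the clip's hashes in the database and scoring candidate songs by how many hashes agree on a single time offset between the clip and the original, roughly as sketched below. The helper shapes are assumptions for illustration.

```python
# A minimal sketch of the matching idea: hashes from the true song line up
# at one consistent time offset between the clip and the original, so we
# count "votes" per (song, offset delta) pair. Shapes are illustrative.
from collections import Counter

def best_match(clip_fingerprints, db_rows):
    """
    clip_fingerprints: [(hash, offset_in_clip), ...] from the file to check.
    db_rows: [(hash, song_id, offset_in_song), ...] looked up from Aurora.
    Returns (song_id, score) for the strongest candidate, or None.
    """
    clip_offsets = {}
    for h, off in clip_fingerprints:
        clip_offsets.setdefault(h, []).append(off)

    votes = Counter()
    for h, song_id, song_off in db_rows:
        for clip_off in clip_offsets.get(h, ()):
            votes[(song_id, song_off - clip_off)] += 1

    if not votes:
        return None
    (song_id, _delta), score = votes.most_common(1)[0]
    return song_id, score
```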

The following diagram describes this process.

Figure 5: Architecture diagram showing how fingerprint matching is done for stored media files

Identifying songs in audio streams

AWS Elemental MediaLive provides a powerful solution for processing live media streams in real time, offering features tailored to the needs of broadcasters, content providers, and streaming platforms. One key capability of MediaLive is the ability to save live content periodically by utilizing the Archive output type. This feature allows organizations to capture and store live media streams at regular intervals, facilitating tasks such as content archiving, backup, and compliance recording.

As the live media stream is ingested into the MediaLive channel, the service continuously processes the content and triggers the archival process at predefined intervals, saving snapshots of the live stream to the specified storage destination, which in this case is the S3 bucket the solution uses.

By using this Archive feature, we save every 10 seconds of media received from the stream, each segment in its own file. If the archive output is written to the songs_to_check/streams/ folder of the S3 bucket, the processing Lambda handles it as it does stored media, with one difference: the checking process understands that it is examining a small chunk of a larger stream, and it keeps track of the last detected song per stream. When a new song is found on a stream, the Lambda sends an Amazon Simple Notification Service (SNS) notification, which is then directed to an Amazon Simple Queue Service (SQS) queue for later processing.
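
For reference, a MediaLive channel's Archive output group can be pointed at that folder with a 10-second rollover interval, along these lines. This fragment is illustrative, with hypothetical names and bucket; a working channel also needs inputs, audio/video descriptions, and an IAM role.

```python
# A hypothetical fragment of a MediaLive channel definition: an Archive
# output group that writes a new file every 10 seconds into the folder the
# processing Lambda watches. Names and bucket are illustrative.
archive_output_group = {
    "Name": "fingerprint-archive",
    "OutputGroupSettings": {
        "ArchiveGroupSettings": {
            "Destination": {"DestinationRefId": "archive-dest"},
            "RolloverInterval": 10,  # seconds of media per archive file
        }
    },
    "Outputs": [{
        "OutputName": "chunk",
        "OutputSettings": {
            "ArchiveOutputSettings": {
                "ContainerSettings": {"M2tsSettings": {}},
                "NameModifier": "_chunk",
            }
        },
        "AudioDescriptionNames": ["audio1"],  # defined elsewhere in the channel
    }],
}

destinations = [{
    "Id": "archive-dest",
    "Settings": [
        {"Url": "s3://my-fingerprint-bucket/songs_to_check/streams/stream1"}
    ],
}]
```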

The following diagram describes this process.

Figure 6: Architecture diagram showing how fingerprint matching is done for streaming media

Accuracy of the solution

After testing, we determined that in order to be identified with a good chance of success, an audio recording must be at least 10 seconds long. Shorter recordings result in a drop in accuracy.

Given that minimum 10-second length, testing shows an overall accuracy of about 96%. Tests were run against 1,000 indexed songs (resulting in over a million fingerprints), with test files extracted from the source material at random locations and altered with volume variations and overlaid static.

There are a number of parameters used when fingerprinting, and adjusting those parameters may result in higher accuracy. Such changes should be tested on your own media files.

Conclusion

Audio fingerprinting stands as an important technology in the realm of music identification, offering a robust solution for efficiently and accurately recognizing songs based on their unique audio characteristics. By leveraging spectrograms to extract key features and generate compact fingerprints, audio fingerprinting enables rapid matching and retrieval of music content, facilitating tasks such as content recognition, copyright enforcement, and personalized recommendations.

Through the power of AWS, organizations can harness the scalability, flexibility, and reliability of cloud computing to implement audio fingerprinting systems with ease. AWS services such as Amazon S3, Lambda, and Aurora Serverless provide the infrastructure and tools necessary to process, store, and manage large volumes of audio data efficiently. Services like AWS Elemental MediaLive offer specialized capabilities for processing live media streams and archival, further enhancing the capabilities of audio fingerprinting systems.

By combining the capabilities of audio fingerprinting with the scalability and agility of AWS, organizations can unlock new opportunities in content identification, audience engagement, and content monetization. Whether deployed for music streaming platforms, broadcasting networks, or digital content providers, integrating audio fingerprinting on AWS empowers organizations to deliver seamless, personalized, and compliant music experiences to their audiences, driving innovation and success in the digital age.

Greg Sommerville

Greg Sommerville is a Principal Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team, where he helps AWS customers implement innovative solutions to challenging problems with machine learning, IoT and serverless technologies. He lives in Ann Arbor, Michigan and enjoys practicing yoga, catering to his dogs, and playing poker.