Business Productivity

Label multiple speakers on a call with speaker search

Use Amazon Chime SDK’s call analytics to label speakers in a conference call

Why is it useful to label speakers on conference calls?

In many multi-party voice communications use cases, it is important to identify and label the active speakers in real time, in order to know what was said and by whom. This information is also valuable after the call, for identity attribution in transcripts. In this post, we will demonstrate the use of Amazon Chime SDK call analytics, and its speaker search capability, to attach speaker identity labels to calls in real time, and for post-call insights.

There are many real-world use cases for such a multi-speaker labeling capability. For example:

  • to transcribe executives on an earnings call and label the transcript with “who said what”
  • to identify and label the active speaker in real time on a squawk box communication platform used by traders

What is speaker search?

Speaker search is a new machine-learning-powered capability that is part of Amazon Chime SDK call analytics. It takes a short speech sample from call audio and returns the closest matches from a database of voice embeddings, or voice profiles, for enrolled speakers. This capability is available through speaker search APIs for Amazon Chime SDK Voice Connector.
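As a concrete illustration, here is a minimal sketch of starting a speaker search task on an active Voice Connector call using the AWS SDK for Python (boto3). The Voice Connector ID, transaction ID, and voice profile domain ID are placeholders; see the Amazon Chime SDK API Reference for the full request and response shapes.

```python
import uuid
import boto3

# Placeholder identifiers for this sketch; substitute values from your environment.
VOICE_CONNECTOR_ID = "your-voice-connector-id"
TRANSACTION_ID = "transaction-id-of-the-active-call"
VOICE_PROFILE_DOMAIN_ID = "your-voice-profile-domain-id"

chime_voice = boto3.client("chime-sdk-voice")

# Start a speaker search task on the caller leg of an in-progress call.
response = chime_voice.start_speaker_search_task(
    VoiceConnectorId=VOICE_CONNECTOR_ID,
    TransactionId=TRANSACTION_ID,
    VoiceProfileDomainId=VOICE_PROFILE_DOMAIN_ID,
    CallLeg="Caller",
    ClientRequestToken=str(uuid.uuid4()),  # idempotency token
)
task_id = response["SpeakerSearchTask"]["SpeakerSearchTaskId"]

# Check the task status; match results are delivered asynchronously to the
# notification targets (for example, SNS) configured for the Voice Connector.
task = chime_voice.get_speaker_search_task(
    VoiceConnectorId=VOICE_CONNECTOR_ID,
    SpeakerSearchTaskId=task_id,
)
print(task["SpeakerSearchTask"]["SpeakerSearchTaskStatus"])
```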

Solution overview

In this simple demonstration, we will periodically call speaker search over the duration of a conversation and show how it can be used to label the different speakers active on the call.

For the purpose of this demonstration, we assembled four AWS employees (“Alice”, “Bob”, “Charlie”, and “David”) in a room with a speakerphone dialed into a phone number associated with a Voice Connector. For the callee in this demo, we used an Asterisk server running an automated agent response system. We applied speaker search to the caller leg only (containing the four volunteers). All four volunteers had previously enrolled their voice embeddings by providing a short voice sample. Importantly, all four volunteers consented to the creation and processing of their voice embeddings, as required by applicable privacy and biometric laws and as a condition of using the service (for more information, please see section 53 of the AWS Service Terms).

In the demo, we used a script to trigger speaker search API calls to the voice analytics service approximately every 30 seconds (with a sample of approximately 10 seconds of silence-free speech for each API call), and logged the speaker search results to an SNS notification target configured for this Voice Connector. The speakers spoke sequentially for roughly two minutes each.
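A minimal sketch of such a trigger script is shown below, assuming the identifiers from the demo environment as placeholders. In practice, you would start the loop when the call is established and stop it when the call ends.

```python
import time
import uuid
import boto3

chime_voice = boto3.client("chime-sdk-voice")

def trigger_speaker_search(voice_connector_id, transaction_id, voice_profile_domain_id,
                           interval_seconds=30, duration_seconds=480):
    """Start a speaker search task roughly every 30 seconds for the duration of the call.

    Each task samples the call audio and publishes its matches to the notification
    targets (for example, SNS) configured on the Voice Connector.
    """
    end_time = time.time() + duration_seconds
    while time.time() < end_time:
        chime_voice.start_speaker_search_task(
            VoiceConnectorId=voice_connector_id,
            TransactionId=transaction_id,
            VoiceProfileDomainId=voice_profile_domain_id,
            CallLeg="Caller",                      # analyze the caller leg only
            ClientRequestToken=str(uuid.uuid4()),  # idempotency token
        )
        time.sleep(interval_seconds)
```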

When the speaker search API is called using an inference speech sample, it generates an embedding, which is a vector representation that captures some of the distinguishing attributes of the speaker’s voice. This embedding is compared to the embeddings for all the enrolled speakers in the embeddings database, and a shortlist of up to 10 highest-confidence matches is returned, ranked by confidence score. For the purpose of this demo, we retained only the highest confidence match and used that as our estimate for who is speaking.
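The snippet below sketches this post-processing step on a speaker search result delivered to the SNS notification target. The field names ("results", "voiceProfileId", "confidenceScore") and the 0.7 confidence threshold are illustrative assumptions for this sketch; the actual notification schema is documented in the Amazon Chime SDK Developer Guide.

```python
import json

CONFIDENCE_THRESHOLD = 0.7  # assumed value; tune for your workload

def top_match(sns_message_body, threshold=CONFIDENCE_THRESHOLD):
    """Return the highest-confidence voice profile ID from a speaker search
    notification, or None if no candidate clears the confidence threshold."""
    detail = json.loads(sns_message_body)
    candidates = detail.get("results", [])  # up to 10 ranked matches (assumed field name)
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c["confidenceScore"])
    return best["voiceProfileId"] if best["confidenceScore"] >= threshold else None
```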

Results

The audio recording for the session is shown below along with annotations of the estimated speaker identity as well as the true speaker identity. Comparing the estimate to the ground truth over the duration of the call, we obtain an accuracy score, defined as the percentage of searches for which the estimated identity matched the true speaker identity.

[Figure: audio recording annotated with speaker search recognition points, estimated speaker identities, and true speaker identities]

The accuracy score for this demonstration was 88%. Speaker search accurately identified the speaker in most segments, with a few exceptions where the speaker was not matched with enough confidence to exceed our preset threshold.
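For completeness, the accuracy metric is simply the fraction of speaker search queries whose top match agrees with the ground-truth speaker. A small helper like the following (with hypothetical label lists) computes it.

```python
def labeling_accuracy(estimated_speakers, true_speakers):
    """Percentage of speaker search queries whose estimated identity matches the true speaker."""
    correct = sum(est == true for est, true in zip(estimated_speakers, true_speakers))
    return 100.0 * correct / len(true_speakers)

# Example with hypothetical labels (None = no match above threshold):
# labeling_accuracy(["Alice", "Alice", None, "Bob"], ["Alice", "Alice", "Alice", "Bob"]) -> 75.0
```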

This test demonstrates that even a simple implementation such as this one can help address speaker labeling use cases. There are many ways to improve the labeling accuracy, such as:

  • increasing the frequency of the speaker search API calls
  • tuning the inference speech sample length
  • tuning the match confidence threshold
  • using the Amazon Transcribe integration to extract speaker endpoints that partition the utterances, and running speaker search queries against those partitioned utterances.

Get started with Amazon Chime SDK call analytics and speaker search

  • First, you will need to configure an Amazon Chime SDK Voice Connector with a call analytics configuration that has voice analytics enabled (including creating a voice profile domain and a set of notification targets to receive speaker search results), and you will need to configure a speaker search workflow. See the Amazon Chime SDK Developer Guide for details; a minimal sketch of the voice profile domain step follows this list.
  • To use the speaker search feature, you will need to open a service limit increase (SLI) ticket. See here for how to request a quota increase.
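As one example of the setup steps, the sketch below creates a voice profile domain with boto3. The domain name and KMS key ARN are placeholders; the remaining configuration (call analytics processors and notification targets) is described in the Developer Guide.

```python
import uuid
import boto3

chime_voice = boto3.client("chime-sdk-voice")

# Voice profile domains store enrolled voice embeddings; server-side encryption
# with a KMS key is required. The name and key ARN below are placeholders.
response = chime_voice.create_voice_profile_domain(
    Name="speaker-search-demo-domain",
    Description="Voice profiles for the speaker labeling demo",
    ServerSideEncryptionConfiguration={
        "KmsKeyArn": "arn:aws:kms:us-east-1:111122223333:key/your-key-id"
    },
    ClientRequestToken=str(uuid.uuid4()),
)
print(response["VoiceProfileDomain"]["VoiceProfileDomainId"])
```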

Learn More

To learn more about Amazon Chime SDK voice analytics, review the following resources:

Narasimha Chari

Chari is a Principal Product Manager for the Amazon Chime SDK service team where he focuses on machine learning applications to audio and video communications and analytics. Outside of work, Chari enjoys spending time with his family, and going for runs in the hills.

Mike Goodwin

Mike is an Applied Science Senior Manager for the Amazon Chime SDK. His team focuses on machine learning and signal processing solutions for audio and video workloads. In his spare time he enjoys running, kayaking, and playing guitar.

Zhihai Xiang

Zhihai is a Software Engineer on the AWS Chime SDK science team. He specializes in backend and frontend development, cloud infrastructure, and machine learning.