AWS Machine Learning Blog
Capturing Voice Input in a Browser and Sending it to Amazon Lex
Ever since we released Amazon Lex, customers have asked us how to embed voice into a web application. In this blog post, we show how to build a simple web application that uses the AWS SDK for JavaScript to do that. The example application, which users can access from a browser, records audio, sends the audio to Amazon Lex, and plays the response. Using browser APIs and JavaScript, we show how to request access to a microphone, record audio, downsample the audio, and PCM encode the audio as a WAV file. As a bonus, we show how to implement silence detection and audio visualization, which are essential to building a user-friendly audio control.
Prerequisites
This post assumes you have some familiarity with JavaScript, browser APIs such as the Web Audio API, the AWS SDK for JavaScript, and the Amazon Lex PostContent API.
Don’t want to scroll through the details? You can download the example application here: https://github.com/awslabs/aws-lex-browser-audio-capture
The following sections describe how to accomplish important pieces of the audio capture process. You don’t need to copy and paste them; they are intended as a reference. You can see everything working together in the example application.
Requesting access to a microphone with the MediaDevices API
To capture audio content in a browser, you need to request access to an audio device, in this case, the microphone. To access the microphone, you use the navigator.mediaDevices.getUserMedia method in the MediaDevices API. To process the audio stream, you use the AudioContext interface in the Web Audio API. The code that follows performs these tasks:
- Creates an AudioContext
- Calls the getUserMedia method and requests access to the microphone. The getUserMedia method is supported in Chrome, Firefox, Edge, and Opera. We tested the example code in Chrome and Firefox.
- Creates a media stream source and a Recorder object. More about the Recorder object later.
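Here is a minimal sketch of those steps. The Recorder constructor is the example application’s own wrapper (covered in the next section), so treat the exact names as illustrative:

```javascript
// Create a single AudioContext for the page (webkit prefix for older browsers).
var audioContext = new (window.AudioContext || window.webkitAudioContext)();
var recorder;

// Ask the user for access to the microphone.
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(function (stream) {
    // Turn the microphone stream into a source node in the audio graph.
    var source = audioContext.createMediaStreamSource(stream);
    // Hand the source to the Recorder object (more on this below).
    recorder = new Recorder(source);
  })
  .catch(function (error) {
    // The Promise is rejected if the user denies access to the microphone.
    console.error('Could not access the microphone:', error);
  });
```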
The code snippet illustrates the following important points:
- The user has to grant us access to the microphone. Most browsers request this with a pop-up. If the user denies access to the microphone, the returned Promise is rejected with a PermissionDeniedError.
- In most cases, you need only one AudioContext instance. Browsers set limits on the number of AudioContext instances you can create and throw exceptions if you exceed them.
- We use a few elements and APIs (audio element, createObjectURL, and AudioContext) that require thorough feature detection in a production environment.
So, let’s do a little feature detection and check whether the browser supports the navigator.mediaDevices.getUserMedia method. The following function checks if the method is present, and then requests access to the microphone. If the method isn’t present, or if the user doesn’t give access to the microphone, the function returns false.
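Because getUserMedia is asynchronous, the sketch below resolves to the audio stream on success and to false otherwise; the function name is ours:

```javascript
// Sketch: feature-detect getUserMedia before trying to record.
function requestMicrophone() {
  // Older browsers don't expose navigator.mediaDevices.getUserMedia at all.
  if (!(navigator.mediaDevices && navigator.mediaDevices.getUserMedia)) {
    return Promise.resolve(false);
  }
  return navigator.mediaDevices.getUserMedia({ audio: true })
    .catch(function () {
      // The user denied access (or no microphone is available).
      return false;
    });
}
```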
Recording and exporting audio
Now that you have access to the audio device, you are ready to record and export audio. This section provides examples of recording audio and exporting it in a format that the Amazon Lex PostContent API will recognize.
Recording
In the first step, you accessed an audio media source, created a new media stream source, and passed the source to your Recorder object. The source can be used to create a new script processor node and define an onaudioprocess event handler.
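A sketch of those internals follows; the 4096-sample buffer size and the worker file name are illustrative choices:

```javascript
// Web worker that stores captured buffers (sketched after the next paragraph).
var worker = new Worker('worker.js');

function Recorder(source) {
  var self = this;
  var context = source.context;
  this.recording = false;

  // ScriptProcessorNode: 4096-sample buffer, one input channel, one output channel.
  this.node = context.createScriptProcessor(4096, 1, 1);

  this.node.onaudioprocess = function (event) {
    if (!self.recording) {
      return;
    }
    // Send the current input buffer to the worker for storage.
    worker.postMessage({
      command: 'record',
      buffer: event.inputBuffer.getChannelData(0)
    });
  };

  // The node must be connected to the graph for onaudioprocess to fire.
  source.connect(this.node);
  this.node.connect(context.destination);
}
```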
Important points about the code snippet:
- The createScriptProcessor method creates a ScriptProcessorNode instance that you use for direct audio processing.
- The onaudioprocess event handler is called with an audio process event (AudioProcessingEvent) which you can use to retrieve and store the input buffer while recording.
The record and stop methods set or unset the recording flag to start or stop recording. If you’re recording, the script passes the input buffer to a web worker for storage. In the web worker, the script stuffs the buffer into an array that you can process when you’re done recording.
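In sketch form, assuming the Recorder and worker from above:

```javascript
// record() and stop() just toggle the flag checked in onaudioprocess.
Recorder.prototype.record = function () {
  this.recording = true;
};

Recorder.prototype.stop = function () {
  this.recording = false;
};

// worker.js (illustrative): append each incoming buffer to an array so the
// whole recording can be processed once recording stops.
var recordedBuffers = [];

onmessage = function (event) {
  if (event.data.command === 'record') {
    recordedBuffers.push(event.data.buffer);
  }
  // An 'export' command would run the exportBuffer flow shown in the next section.
};
```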
Preparing to export the recording to Amazon Lex
After recording the audio, you need to manipulate it a bit before you can send it to the Amazon Lex PostContent API.
The exportBuffer function does the following:
- Merges the array of captured audio buffers
- Downsamples the buffer to 16 kHz
- Encodes the buffer as a WAV file
- Returns the encoded audio as a Blob
We’ll go into more detail on each of these steps shortly.
Note: the code in the following sections is a modified version of the Recorderjs plugin.
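At a high level, exportBuffer might look like the following sketch, assuming the recording sample rate was captured from the AudioContext and the three helper functions are defined as in the sections below:

```javascript
// Turn the array of recorded buffers into a 16 kHz, 16-bit PCM WAV Blob.
function exportBuffer(recordedBuffers, recordSampleRate) {
  var merged = mergeBuffers(recordedBuffers);                          // one Float32Array
  var downsampled = downsampleBuffer(merged, recordSampleRate, 16000); // 16 kHz samples
  var encoded = encodeWAV(downsampled, 16000);                         // WAV-wrapped PCM
  return new Blob([encoded], { type: 'audio/wav' });                   // ready for Amazon Lex
}
```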
Merging the buffers
On each invocation of the onaudioprocess event handler, the audio buffers are stored in an array. To export the captured audio, the array of buffers must be merged into a single audio buffer:
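A sketch of that merge step:

```javascript
// Concatenate the stored Float32Array buffers into a single buffer.
function mergeBuffers(buffers) {
  var totalLength = buffers.reduce(function (total, buffer) {
    return total + buffer.length;
  }, 0);
  var merged = new Float32Array(totalLength);
  var offset = 0;
  buffers.forEach(function (buffer) {
    merged.set(buffer, offset);
    offset += buffer.length;
  });
  return merged;
}
```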
Downsampling
You need to make sure that the audio buffer is sampled at 16 kHz (more on this later). In my experience, the AudioContext sample rate in Chrome and Firefox is 44,100 Hz (the same sampling rate used for compact discs), but this can vary based on the recording device. The following function downsamples the audio buffer to 16 kHz.
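A simple decimating version of that function (adequate for speech input) might look like this:

```javascript
// Reduce the sample rate by keeping roughly every Nth sample.
function downsampleBuffer(buffer, recordSampleRate, targetSampleRate) {
  if (targetSampleRate === recordSampleRate) {
    return buffer;
  }
  var ratio = recordSampleRate / targetSampleRate;
  var newLength = Math.round(buffer.length / ratio);
  var result = new Float32Array(newLength);
  for (var i = 0; i < newLength; i++) {
    result[i] = buffer[Math.round(i * ratio)];
  }
  return result;
}
```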
Encoding the audio to PCM
Now convert the audio to WAV format encoded as PCM (pulse-code modulation). The Amazon Lex PostContent API requires user input in PCM or Opus audio format.
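A sketch of the encoding step: write a standard 44-byte WAV header, then the samples as 16-bit little-endian PCM.

```javascript
function encodeWAV(samples, sampleRate) {
  var buffer = new ArrayBuffer(44 + samples.length * 2);
  var view = new DataView(buffer);

  function writeString(offset, text) {
    for (var i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  }

  writeString(0, 'RIFF');                          // RIFF chunk descriptor
  view.setUint32(4, 36 + samples.length * 2, true);
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');                         // format sub-chunk
  view.setUint32(16, 16, true);                    // sub-chunk size
  view.setUint16(20, 1, true);                     // audio format 1 = PCM
  view.setUint16(22, 1, true);                     // one channel (mono)
  view.setUint32(24, sampleRate, true);            // sample rate
  view.setUint32(28, sampleRate * 2, true);        // byte rate
  view.setUint16(32, 2, true);                     // block align
  view.setUint16(34, 16, true);                    // bits per sample
  writeString(36, 'data');                         // data sub-chunk
  view.setUint32(40, samples.length * 2, true);

  // Convert float samples in [-1, 1] to 16-bit signed integers.
  var offset = 44;
  for (var i = 0; i < samples.length; i++, offset += 2) {
    var s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
  return view;
}
```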
Sending the audio to the Amazon Lex PostContent API
Finally, you are ready to send the audio to Amazon Lex. The following example shows how to set up and execute the Amazon Lex PostContent call using voice.
Note – In a production environment, never include your AWS credentials directly in a static script. Check out this post to see how you can use Cognito for in-browser authentication.
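A sketch of the call follows. It assumes the WAV Blob from exportBuffer is available as audioBlob; the region, identity pool ID, bot name, and user ID are placeholders:

```javascript
// Configure credentials through Amazon Cognito rather than static keys.
AWS.config.region = 'us-east-1';                              // placeholder region
AWS.config.credentials = new AWS.CognitoIdentityCredentials({
  IdentityPoolId: 'us-east-1:your-identity-pool-id'           // placeholder pool ID
});

var lexruntime = new AWS.LexRuntime();

var params = {
  botName: 'YourBotName',                            // placeholder bot name
  botAlias: '$LATEST',                               // test against the latest version
  userId: 'some-user-id',                            // placeholder user ID
  contentType: 'audio/l16; rate=16000; channels=1',  // 16 kHz, 16-bit PCM input
  accept: 'audio/mpeg',                              // request an MP3 audio response
  inputStream: audioBlob                             // WAV Blob produced by exportBuffer
};

lexruntime.postContent(params, function (err, data) {
  if (err) {
    console.error(err);
    return;
  }
  // data.audioStream contains the audio response; playback is covered below.
  playAudio(data.audioStream);
});
```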
Important points about the code snippet:
- You can find the AWS SDK for JavaScript here.
- You can make PostContent calls to any alias for a published bot. To test against the $LATEST version of the bot without publishing it, use the special alias $LATEST.
- The MIME content type value for the contentType parameter is audio/l16. This means that audio input to the Amazon Lex runtime must be sampled at 16 kHz.
- You may have noticed that the example is not streaming captured audio to the PostContent API. The Amazon Lex JavaScript SDK has two transport layers: XMLHttpRequest and Node’s HTTP module. When it’s used in a browser, it uses XMLHttpRequest, which does not support streaming. When it’s invoked in Node, it uses the HTTP module, which does support streaming. So, for now, when using the Amazon Lex JavaScript SDK from a browser, you must buffer all the audio before sending it to PostContent.
Play the audio response
The easiest way to play the audio response from the PostContent operation is with an HTML audio element. The following example takes an 8-bit unsigned integer array—you can pass the PostContent response data.audioStream to it directly—and:
- Creates a binary large object (Blob) instance with the MIME type audio/mpeg
- Creates an audio element
- Creates an object URL for the Blob instance and attaches it as the src attribute of the audio element
- Calls the .play() method on the audio element
It also takes an optional callback parameter that is called when audio playback has completed.
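A sketch of such a helper (the playAudio name is ours):

```javascript
// Wrap the PostContent audioStream (a Uint8Array in the browser) in a Blob and
// play it through an HTML audio element.
function playAudio(audioStream, onComplete) {
  var blob = new Blob([audioStream], { type: 'audio/mpeg' });
  var url = URL.createObjectURL(blob);
  var audio = document.createElement('audio');

  audio.src = url;
  audio.addEventListener('ended', function () {
    URL.revokeObjectURL(url);            // release the object URL after playback
    if (typeof onComplete === 'function') {
      onComplete();                      // optional callback when playback finishes
    }
  });
  audio.play();
}
```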
Bonus Features
In the last section, you created a ScriptProcessorNode to process the audio. You can use an AnalyserNode to perform silence detection and simple audio visualizations.
Implementing silence detection
The AnalyserNode provides real-time frequency and time-domain analysis information.
In general, you can link AudioNodes together to build a processing graph. In this example, you use a ScriptProcessorNode to process the audio and an AnalyserNode to provide time-domain information for the silence detection and visualization bonus features.
Create the AnalyserNode:
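One way to do that, using the media stream source from the earlier sketches (the fftSize value is illustrative):

```javascript
// Create an AnalyserNode and feed it the microphone source. The analyser only
// inspects the signal, so its output doesn't need to be connected anywhere.
var analyser = audioContext.createAnalyser();
analyser.fftSize = 2048;      // size of the time-domain sample window (illustrative)
source.connect(analyser);
```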
Now you can use the analyser in the onaudioprocess event handler. The following example uses the getByteTimeDomainData method of the AnalyserNode to copy the current waveform, or time-domain data, into an unsigned byte array. It selects the first index in the array and checks to see if the value is “far” from zero. If the current value remains “close” to zero for 1.5 seconds, “silence” has been detected.
Note – To tune the sensitivity and speed of silence detection for your use case, you can adjust the threshold and duration values or inspect the entire data array rather than a single sample. For our example, a small threshold and a 1.5-second window worked well.
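A sketch of that check, which you might call from the onaudioprocess handler; the threshold here is illustrative, and the 1.5-second duration matches the description above:

```javascript
// Check the analyser's time-domain data for silence.
var SILENCE_THRESHOLD = 0.02;   // distance from zero that still counts as silence (illustrative)
var SILENCE_DURATION = 1500;    // milliseconds of silence before stopping
var silenceStart = Date.now();

function detectSilence() {
  var data = new Uint8Array(analyser.fftSize);
  analyser.getByteTimeDomainData(data);

  // Unsigned bytes are centered on 128; normalize the first sample to roughly [-1, 1].
  var value = (data[0] / 128) - 1.0;

  if (value > SILENCE_THRESHOLD || value < -SILENCE_THRESHOLD) {
    silenceStart = Date.now();   // sound detected, reset the silence timer
  } else if (Date.now() - silenceStart > SILENCE_DURATION) {
    recorder.stop();             // roughly 1.5 seconds of silence: stop recording
  }
}
```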
Visualization
To create a visualization of the audio that you are buffering and processing, you can render the time domain data that’s used for silence detection.
The visualizeAudioBuffer method registers its draw function with window.requestAnimationFrame, which refreshes at about 60 fps (frames per second). Each time the time-domain data is updated, the canvas element shows a visualization of the waveform. For more visualization ideas, see the Mozilla Developer Network Voice-change-O-matic demo.
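A sketch of that draw loop, assuming a canvas element on the page and the analyser created earlier (names and sizes are illustrative):

```javascript
function visualizeAudioBuffer(canvas, analyser) {
  var canvasContext = canvas.getContext('2d');
  var data = new Uint8Array(analyser.fftSize);

  function draw() {
    analyser.getByteTimeDomainData(data);
    canvasContext.clearRect(0, 0, canvas.width, canvas.height);
    canvasContext.beginPath();

    // Plot each time-domain sample left to right across the canvas.
    var sliceWidth = canvas.width / data.length;
    for (var i = 0; i < data.length; i++) {
      var x = i * sliceWidth;
      var y = (data[i] / 255) * canvas.height;
      if (i === 0) {
        canvasContext.moveTo(x, y);
      } else {
        canvasContext.lineTo(x, y);
      }
    }
    canvasContext.stroke();

    // Redraw on the next animation frame (roughly 60 fps).
    window.requestAnimationFrame(draw);
  }

  draw();
}
```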
Conclusion
Hopefully this blog post and example application have made it easier for you to capture, format, and send audio to Amazon Lex. We’d love to hear what you think about the post, answer any questions you have, and hear about the web-based audio projects you put together. You can give us feedback in the comment section below.
See the example application
You can find the complete example application here. It includes:
- Amazon Lex PostContent API integration with the JavaScript SDK
- A stateful audio control
- Examples of using the getUserMedia method and the Web Audio API
- Examples of recording and formatting audio
- Example of audio playback
- Example of simple silence detection
- Example audio visualization
Additional Reading
Learn how to integrate your Amazon Lex bot with an external messaging service.
About the Author
Andrew Lafranchise is a Senior Software Development Engineer with AWS Deep Learning. He works with different technologies to improve the Lex developer experience. In his spare time, he spends time with his family and is working on a Lex bot that can interact with his twin 3-year-old daughters.