Capturing Voice Input in a Browser and Sending it to Amazon Lex

Ever since we released Amazon Lex, customers have asked us how to embed voice into a web application. In this blog post, we show how to build a simple web application that uses the AWS SDK for JavaScript to do that. The example application, which users can access from a browser, records audio, sends the audio to Amazon Lex, and plays the response. Using browser APIs and JavaScript, we show how to request access to a microphone, record audio, downsample the audio, and encode it as PCM in a WAV file. As a bonus, we show how to implement silence detection and audio visualization, which are essential to building a user-friendly audio control.

Prerequisites

This post assumes you have some familiarity with

Don’t want to scroll through the details? You can download the example application here: https://github.com/awslabs/aws-lex-browser-audio-capture

The following sections describe how to accomplish important pieces of the audio capture process. You don’t need to copy/paste them–they are intended as a reference. You can see everything working together in the example application.

Requesting access to a microphone with the MediaDevices API

To capture audio content in a browser, you need to request access to an audio device, in this case, the microphone. To access the microphone, you use the navigator.mediaDevices.getUserMedia method in the MediaDevices API. To process the audio stream, you use the AudioContext interface in the Web Audio API. The code that follows performs these tasks:

- Creates an AudioContext.
- Calls the getUserMedia method and requests access to the microphone. The getUserMedia method is supported in Chrome, Firefox, Edge, and Opera. We tested the example code in Chrome and Firefox.
- Creates a media stream source and a Recorder object. More about the Recorder object later.

  // control.js
 
  /**
   * Audio recorder object. Handles setting up the audio context,
   * accessing the microphone, and creating the Recorder object.
   */
  lexaudio.audioRecorder = function() {
    /**
     * Creates an audio context and calls getUserMedia to request the mic (audio).
     * If the user denies access to the microphone, the returned Promise is rejected
     * with a PermissionDeniedError (named NotAllowedError in current browsers).
     * @returns {Promise} 
     */
    var requestDevice = function() {
 
      if (typeof audio_context === 'undefined') {
        window.AudioContext = window.AudioContext || window.webkitAudioContext;
        audio_context = new AudioContext();
      }
 
      return navigator.mediaDevices.getUserMedia({ audio: true })
        .then(function(stream) {
          audio_stream = stream; 
        });
    };
 
    var createRecorder = function() {
      return recorder(audio_context.createMediaStreamSource(audio_stream));
    };
 
    return {
      requestDevice: requestDevice,
      createRecorder: createRecorder
    };
 
  };

The code snippet illustrates the following important points:

- The user has to grant us access to the microphone. Most browsers request this with a pop-up. If the user denies access to the microphone, the returned Promise is rejected with a PermissionDeniedError (named NotAllowedError in current browsers).
- In most cases, you need only one AudioContext instance. Browsers set limits on the number of AudioContext instances you can create and throw exceptions if you exceed them.
- We use a few elements and APIs (audio element, createObjectURL, and AudioContext) that require thorough feature detection in a production environment.

So, let’s do a little feature detection and check whether the browser supports the navigator.mediaDevices.getUserMedia method. The following function checks if the method is present, and then requests access to the microphone. If the method isn’t present, or if the user doesn’t give access to the microphone, the function passes false to the callback; otherwise it passes true.

    // control.js
 
    /**
     * On audio supported callback: `onAudioSupported`.
     *
     * @callback onAudioSupported
     * @param {boolean} 
     */
 
    /**
     * Checks that getUserMedia is supported and the user has given us access to the mic.
     * @param {onAudioSupported} callback - Called with the result.
     */
    var supportsAudio = function(callback) {
      if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
        audioRecorder = lexaudio.audioRecorder();
        audioRecorder.requestDevice()
          .then(function(stream) { callback(true); })
          .catch(function(error) { callback(false); });
      } else {
        callback(false);
      }
    };
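
The check above covers only getUserMedia. As a rough sketch, the other features mentioned earlier (the audio element, createObjectURL, and AudioContext) could be detected in the same spirit. The detectAudioFeatures helper and its result property names are hypothetical and are not part of the example application. It takes a window-like object so that it can be exercised outside a browser.

```javascript
// Hypothetical helper (not part of the example application): reports which of
// the browser features used in this post exist on a window-like object.
// In a browser, call detectAudioFeatures(window).
function detectAudioFeatures(win) {
  var nav = win.navigator || {};
  var doc = win.document;
  var hasAudioElement = false;

  if (doc && typeof doc.createElement === 'function') {
    var el = doc.createElement('audio');
    // Without <audio> support, the created element has no play() method.
    hasAudioElement = !!el && typeof el.play === 'function';
  }

  return {
    getUserMedia: !!(nav.mediaDevices &&
      typeof nav.mediaDevices.getUserMedia === 'function'),
    audioContext: typeof (win.AudioContext || win.webkitAudioContext) === 'function',
    createObjectURL: !!(win.URL && typeof win.URL.createObjectURL === 'function'),
    audioElement: hasAudioElement
  };
}
```

You could call this once at startup and hide the audio control when any of the flags is false, instead of failing later in the recording or playback code.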

Recording and exporting audio

Now that you have access to the audio device, you are ready to record and export audio. This section provides examples of recording audio and exporting it in a format that the Amazon Lex PostContent API will recognize.

Recording

In the first step, you accessed an audio media source, created a new media stream source, and passed the source to your Recorder object. The source can be used to create a new script processor node and define an onaudioprocess event handler.

    // recorder.js
 
    // Create a ScriptProcessorNode with a bufferSize of 4096 and a single input and output channel
    var recording, node = source.context.createScriptProcessor(4096, 1, 1);
 
    /**
     * The onaudioprocess event handler of the ScriptProcessorNode interface. It is the EventHandler to be 
     * called for the audioprocess event that is dispatched to ScriptProcessorNode node types. 
     * @param {AudioProcessingEvent} audioProcessingEvent - The audio processing event.
     */
    node.onaudioprocess = function(audioProcessingEvent) {
      if (!recording) {
        return;
      }
 
      worker.postMessage({
        command: 'record',
        buffer: [
          audioProcessingEvent.inputBuffer.getChannelData(0),
        ]
      });
    };
 
    /**
     * Sets recording to true.
     */
    var record = function() {
      recording = true;
    };
 
    /**
     * Sets recording to false.
     */
    var stop = function() {
      recording = false;
    };

Important points about the code snippet:

- The createScriptProcessor method creates a ScriptProcessorNode instance that you use for direct audio processing.
- The onaudioprocess event handler is called with an audio process event (AudioProcessingEvent) which you can use to retrieve and store the input buffer while recording.

The record and stop methods set or unset the recording flag to start or stop recording. If you’re recording, the script passes the input buffer to a web worker for storage. In the web worker, the script stuffs the buffer into an array that you can process when you’re done recording.

    // worker.js
 
    var recLength = 0,
        recBuffer = [];
 
    function record(inputBuffer) {
      recBuffer.push(inputBuffer[0]);
      recLength += inputBuffer[0].length;
    }
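
The snippet above does not show how the worker routes incoming messages to the record function. One possible shape is sketched below. The 'record' command matches the message posted by recorder.js, while the 'clear' command, the commands table, and the handleMessage name are assumptions made for illustration.

```javascript
// Sketch of a worker-side command dispatcher. Only the 'record' command comes
// from the post; 'clear', `commands`, and `handleMessage` are illustrative.
var recLength = 0,
    recBuffer = [];

function record(inputBuffer) {
  recBuffer.push(inputBuffer[0]);
  recLength += inputBuffer[0].length;
}

function clear() {
  recLength = 0;
  recBuffer = [];
}

var commands = {
  record: function(message) { record(message.buffer); },
  clear: function() { clear(); }
};

// Returns true if the command was recognized and handled, false otherwise.
function handleMessage(message) {
  if (!Object.prototype.hasOwnProperty.call(commands, message.command)) {
    return false;
  }
  commands[message.command](message);
  return true;
}

// In a web worker, you would wire this up with:
//   self.onmessage = function(event) { handleMessage(event.data); };
```

Keeping the dispatch in a plain function makes it easy to test the buffering logic without starting a worker.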

Preparing to export the recording to Amazon Lex

After recording the audio, you need to manipulate it a bit before you can send it to the Amazon Lex PostContent API.

    // worker.js
 
    function exportBuffer() {
      // Merge
      var mergedBuffers = mergeBuffers(recBuffer, recLength);
      // Downsample
      var downsampledBuffer = downsampleBuffer(mergedBuffers, 16000);
      // Encode as a WAV
      var encodedWav = encodeWAV(downsampledBuffer);                                 
      // Create Blob
      var audioBlob = new Blob([encodedWav], { type: 'application/octet-stream' });
      postMessage(audioBlob);
    }

The exportBuffer function does the following:

- Merges the array of captured audio buffers
- Downsamples the buffer to 16 kHz
- Encodes the buffer as a WAV file
- Returns the encoded audio as a Blob

We’ll go into more detail on each of these steps shortly.

Note: the code in the following sections is a modified version of the Recorderjs plugin.

Merging the buffers

On each invocation of the onaudioprocess event handler, the audio buffers are stored in an array. To export the captured audio, the array of buffers must be merged into a single audio buffer:

    // worker.js
 
    function mergeBuffers(bufferArray, recLength) {
      var result = new Float32Array(recLength);
      var offset = 0;
      for (var i = 0; i < bufferArray.length; i++) {
        result.set(bufferArray[i], offset);
        offset += bufferArray[i].length;
      }
      return result;
    }

Downsampling

You need to make sure that the audio buffer is sampled at 16 kHz (more on this later). In my experience, the AudioContext sample rate in Chrome and Firefox is 44100 Hz (the sampling rate used for compact discs), but this can vary based on the recording device. The following function downsamples the audio buffer to 16 kHz.

    // worker.js
 
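    function downsampleBuffer(buffer, targetSampleRate) {
      // Default to 16 kHz, the rate this post sends to Amazon Lex.
      targetSampleRate = targetSampleRate || 16000;
      // sampleRate is the rate the audio was recorded at (the AudioContext sample rate).
      if (targetSampleRate === sampleRate) {
        return buffer;
      }
      var sampleRateRatio = sampleRate / targetSampleRate;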
      var newLength = Math.round(buffer.length / sampleRateRatio);
      var result = new Float32Array(newLength);
      var offsetResult = 0;
      var offsetBuffer = 0;
      while (offsetResult < result.length) {
        var nextOffsetBuffer = Math.round((offsetResult + 1) * sampleRateRatio);
        var accum = 0,
          count = 0;
        for (var i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
          accum += buffer[i];
          count++;
        }
        result[offsetResult] = accum / count;
        offsetResult++;
        offsetBuffer = nextOffsetBuffer;
      }
      return result;
    }
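
The function above reads the recording rate from a worker-level sampleRate variable. For experimenting, a variant that takes both rates as parameters can be easier to reason about. The following is a sketch of that idea, not the code from the example application. The function name, the parameter names, and the guard against empty averaging windows are additions for illustration.

```javascript
// Variant of the downsampling routine with explicit rates (names are
// illustrative). Each output sample is the average of the input samples
// that fall into its window.
function downsample(buffer, sourceRate, targetRate) {
  if (targetRate >= sourceRate) {
    return buffer; // this sketch never upsamples
  }
  var ratio = sourceRate / targetRate;
  var result = new Float32Array(Math.round(buffer.length / ratio));
  var offsetBuffer = 0;

  for (var i = 0; i < result.length; i++) {
    var nextOffsetBuffer = Math.round((i + 1) * ratio);
    var accum = 0,
        count = 0;
    for (var j = offsetBuffer; j < nextOffsetBuffer && j < buffer.length; j++) {
      accum += buffer[j];
      count++;
    }
    // Guard against an empty window so the output never contains NaN.
    result[i] = count > 0 ? accum / count : 0;
    offsetBuffer = nextOffsetBuffer;
  }
  return result;
}

// One 4096-sample buffer captured at 44.1 kHz shrinks to 1486 samples,
// because 4096 / (44100 / 16000) is about 1486.08.
var out = downsample(new Float32Array(4096), 44100, 16000);
console.log(out.length); // 1486
```

Averaging each window acts as a crude low-pass filter. It is simple and fast, but it is not a substitute for a proper resampling filter if audio quality matters for your use case.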

Encoding the audio to PCM

Now convert the audio to WAV format encoded as PCM (pulse-code modulation). The Amazon Lex PostContent API requires user input in PCM or Opus audio format.

    // worker.js
 
    function encodeWAV(samples) {
      var buffer = new ArrayBuffer(44 + samples.length * 2);
      var view = new DataView(buffer);
 
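      // The samples were downsampled to 16 kHz before encoding, so the header
      // advertises that rate rather than the original recording rate.
      var wavSampleRate = 16000;

      writeString(view, 0, 'RIFF');                     // RIFF chunk ID
      view.setUint32(4, 36 + samples.length * 2, true); // RIFF chunk size: 36 header bytes + data bytes
      writeString(view, 8, 'WAVE');                     // RIFF type
      writeString(view, 12, 'fmt ');                    // format chunk ID
      view.setUint32(16, 16, true);                     // format chunk size
      view.setUint16(20, 1, true);                      // audio format: 1 = PCM
      view.setUint16(22, 1, true);                      // channels: mono
      view.setUint32(24, wavSampleRate, true);          // sample rate
      view.setUint32(28, wavSampleRate * 2, true);      // byte rate: sample rate * block align
      view.setUint16(32, 2, true);                      // block align: channels * bytes per sample
      view.setUint16(34, 16, true);                     // bits per sample
      writeString(view, 36, 'data');                    // data chunk ID
      view.setUint32(40, samples.length * 2, true);     // data chunk size in bytes
      floatTo16BitPCM(view, 44, samples);               // 16-bit PCM samples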
 
      return view;
    }
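
The encodeWAV function calls two helpers, writeString and floatTo16BitPCM, that are not shown in this post. The following sketch is modeled on the Recorderjs helpers of the same names. Treat it as an approximation rather than the exact code in the example application.

```javascript
// Writes an ASCII string into the DataView one byte per character.
function writeString(view, offset, string) {
  for (var i = 0; i < string.length; i++) {
    view.setUint8(offset + i, string.charCodeAt(i));
  }
}

// Converts float samples in the range [-1, 1] to little-endian signed
// 16-bit integers, clamping out-of-range values first.
function floatTo16BitPCM(output, offset, input) {
  for (var i = 0; i < input.length; i++, offset += 2) {
    var s = Math.max(-1, Math.min(1, input[i]));
    // Negative values scale to -32768, positive values to 32767.
    output.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
}
```

Web Audio samples are 32-bit floats, so this conversion is what turns the recorded buffer into the 16-bit linear PCM that the WAV header describes.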

Sending the audio to the Amazon Lex PostContent API

Finally, you are ready to send the audio to Amazon Lex. The following example shows how to set up and execute the Amazon Lex PostContent call using voice.

    // index.html
 
    var lexruntime = new AWS.LexRuntime({
        region: 'us-east-1',
        credentials: new AWS.Credentials('...', '...', null)
    });
 
    var params = {
        botAlias: '$LATEST',
        botName: 'OrderFlowers',
        contentType: 'audio/x-l16; sample-rate=16000',
        userId: 'BlogPostTesting',
        accept: 'audio/mpeg'
    };
 
    params.inputStream = ...;
    lexruntime.postContent(params, function(err, data) {
        if (err) {
            // an error occurred
        } else {
            // success, now let's play the response
        }
    });

Note – In a production environment, never include your AWS credentials directly in a static script. Check out this post to see how you can use Cognito for in-browser authentication.

Important points about the code snippet:

- You can find the AWS JavaScript SDK here.
- You can make PostContent calls to any alias for a published bot. To test against the $LATEST version of the bot without publishing it, use the special alias $LATEST.
- The contentType parameter is set to audio/x-l16; sample-rate=16000, which means 16-bit linear PCM (L16). Audio input to the Amazon Lex runtime must therefore be sampled at 16 kHz.
- You may have noticed that the example does not stream captured audio to the PostContent operation. The Amazon Lex JavaScript SDK has two transport layers: XMLHttpRequest and Node’s HTTP module. When it’s used in a browser, it uses XMLHttpRequest, which does not support streaming. When it’s invoked in Node, it uses the HTTP module, which does support streaming. So, for now, when using the Amazon Lex JavaScript SDK from a browser, you must buffer all the audio before sending it to PostContent.

Playing the audio response

The easiest way to play the audio response from the PostContent operation is with an HTML audio element. The following example takes an 8-bit unsigned integer array (you can pass the PostContent response data.audioStream to it directly) and does the following:

- Creates a binary large object (Blob) instance with the MIME type audio/mpeg
- Creates an audio element
- Creates an object URL for the Blob instance and sets it as the src attribute of the audio element
- Calls the .play() method on the audio element

It also takes an optional callback parameter that is called when audio playback has completed.

    // control.js
    /**
     * On playback complete callback: `onPlaybackComplete`.
     *
     * @callback onPlaybackComplete
     */
 
    /**
     * Plays the audio buffer with an HTML5 audio tag. 
     * @param {Uint8Array} buffer - The audio buffer to play.
     * @param {?onPlaybackComplete} callback - Called when audio playback is complete.
     */
    var play = function(buffer, callback) {
      var myBlob = new Blob([buffer], { type: 'audio/mpeg' });
      var audio = document.createElement('audio');
      var objectUrl = window.URL.createObjectURL(myBlob);
      audio.src = objectUrl;
      audio.addEventListener('ended', function() {
        audio.currentTime = 0;
        if (typeof callback === 'function') {
          callback();
        }
      });
      audio.play();
      recorder.clear();
    };

Bonus Features

Earlier, in the recording section, you created a ScriptProcessorNode to process the audio. You can use an AnalyserNode to perform silence detection and simple audio visualizations.

Implementing silence detection

The AnalyserNode is a node that can provide information about real-time frequency and time-domain analysis.

In general, you can link AudioNodes together to build a processing graph. In this example, you use a ScriptProcessorNode to process the audio and an AnalyserNode to provide time-domain information for the silence detection and visualization bonus features.

Create the AnalyserNode:

    // recorder.js
 
    var analyser = source.context.createAnalyser();
    analyser.minDecibels = -90;
    analyser.maxDecibels = -10;
    analyser.smoothingTimeConstant = 0.85;

Now you can use the analyser in the onaudioprocess event handler. The following example uses the getByteTimeDomainData method of the AnalyserNode to copy the current waveform, or time-domain data, into an unsigned byte array. It selects the first index in the array and checks to see if the value is “far” from zero. If the current value remains “close” to zero for 1.5 seconds, “silence” has been detected.

    // recorder.js
 
    var startSilenceDetection = function() {
      analyser.fftSize = 2048;
      var bufferLength = analyser.fftSize;
      var dataArray = new Uint8Array(bufferLength);
 
      analyser.getByteTimeDomainData(dataArray);
 
      var curr_value_time = (dataArray[0] / 128) - 1.0;
 
      if (curr_value_time > 0.01 || curr_value_time < -0.01) {
        start = Date.now();
      }
      var newtime = Date.now();
      var elapsedTime = newtime - start;
      if (elapsedTime > 1500) {
        onSilence();
      }
    };

Note – To tune the sensitivity and speed of silence detection for your use case, you can change these values or inspect the entire data array. For our example, the values shown above worked well.
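
Inspecting the entire data array, rather than only its first value, might look like the following sketch. The isSilent helper and its threshold parameter are hypothetical. The default threshold of 0.01 mirrors the value used in the silence detection code above.

```javascript
// Hypothetical helper: treats a frame as silent only if every sample in the
// byte time-domain data stays within `threshold` of the midpoint (128).
function isSilent(dataArray, threshold) {
  var limit = typeof threshold === 'number' ? threshold : 0.01;
  for (var i = 0; i < dataArray.length; i++) {
    // Map the unsigned byte range 0..255 to roughly -1..1.
    var value = (dataArray[i] / 128) - 1.0;
    if (Math.abs(value) > limit) {
      return false;
    }
  }
  return true;
}
```

With a helper like this, the detection code would reset its timer whenever isSilent(dataArray) returns false. Checking every sample makes detection less likely to miss sound that happens to cross zero at the first index, at the cost of a loop over the array on each audio callback.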

Visualization

To create a visualization of the audio that you are buffering and processing, you can render the time domain data that’s used for silence detection.

    // renderer.js
 
    /**
     * Clears the canvas and draws the dataArray. 
     * @param {Uint8Array} dataArray - The time domain audio data to visualize.
     * @param {number} bufferLength - The FFT length.
     */
    var visualizeAudioBuffer = function(dataArray, bufferLength) {
      var WIDTH = canvas.width;
      var HEIGHT = canvas.height;
      var animationId;
      canvasCtx.clearRect(0, 0, WIDTH, HEIGHT);
 
      /**
       * Called by requestAnimationFrame, about 60 times per second. If listening, draw the dataArray. 
       */
      function draw() {
        if (!listening) {
          return;
        }
 
        canvasCtx.fillStyle = 'rgb(249,250,252)';
        canvasCtx.fillRect(0, 0, WIDTH, HEIGHT);
        canvasCtx.lineWidth = 1;
        canvasCtx.strokeStyle = 'rgb(0,125,188)';
        canvasCtx.beginPath();
 
        var sliceWidth = WIDTH * 1.0 / bufferLength;
        var x = 0;
 
        for (var i = 0; i < bufferLength; i++) {
          var v = dataArray[i] / 128.0;
          var y = v * HEIGHT / 2;
          if (i === 0) {
            canvasCtx.moveTo(x, y);
          } else {
            canvasCtx.lineTo(x, y);
          }
          x += sliceWidth;
        }
 
        canvasCtx.lineTo(canvas.width, canvas.height / 2);
        canvasCtx.stroke();
      }
 
      // Register our draw function with requestAnimationFrame. 
      if (typeof animationId === 'undefined') {
        animationId = requestAnimationFrame(draw);
      }
    };

Here’s what it looks like:

The visualizeAudioBuffer method registers its draw function with window.requestAnimationFrame, which browsers typically call at about 60 fps (frames per second). When the time domain data is updated, the canvas element shows a visualization of the waveform. For more visualization ideas, see the Mozilla Developer Network Voice-change-O-matic demo.
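
The mapping from time-domain bytes to canvas coordinates can also be separated from the drawing calls, which makes it easier to check. The following is a sketch of that mapping only. The waveformPoints helper is hypothetical and is not part of the example application.

```javascript
// Hypothetical helper: converts byte time-domain data into x/y points for a
// canvas of the given size, using the same math as the draw function above.
function waveformPoints(dataArray, width, height) {
  var points = [];
  var sliceWidth = width / dataArray.length;
  for (var i = 0; i < dataArray.length; i++) {
    // A byte value of 128 (silence) maps to 1.0, the vertical center.
    var v = dataArray[i] / 128.0;
    points.push({ x: i * sliceWidth, y: v * height / 2 });
  }
  return points;
}
```

A draw function could then call moveTo for the first point and lineTo for the rest. A silent signal produces a flat line across the middle of the canvas, and louder input pushes points toward the top and bottom edges.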

Conclusion

Hopefully this blog post and example application have made it easier for you to capture, format, and send audio to Amazon Lex. We’d love to hear what you think about the post, answer any questions you have, and hear about the web-based audio projects you put together. You can give us feedback in the comment section below.

See the example application

You can find the complete example application here: https://github.com/awslabs/aws-lex-browser-audio-capture. It includes:

- Amazon Lex PostContent API integration with the JavaScript SDK
- A stateful audio control
- Examples of using the getUserMedia method and Web Audio API
- Examples of recording and formatting audio
- Example of audio playback
- Example of simple silence detection
- Example audio visualization

Additional Reading

Learn how to integrate your Amazon Lex bot with an external messaging service.

About the Author

Andrew Lafranchise is a Senior Software Development Engineer with AWS Deep Learning. He works with different technologies to improve the Lex developer experience. In his spare time, he spends time with his family and is working on a Lex bot that can interact with his twin 3-year-old daughters.