Real-time streaming transcription with the AWS C++ SDK

Today, I’d like to walk you through how to use the AWS C++ SDK to leverage Amazon Transcribe streaming transcription. This service allows you to do speech-to-text processing in real time. Streaming transcription uses HTTP/2 technology to communicate efficiently with clients.

In this walkthrough, you build a command line application that captures audio from the computer’s microphone, sends it to Amazon Transcribe streaming, and prints out transcribed text as you speak. You use PortAudio (a third-party library) to capture and sample audio. PortAudio is a free, cross-platform library, so you should be able to build this on Windows, macOS, and Linux.

Note: Amazon Transcribe streaming transcription has a separate API from Amazon Transcribe, which also allows you to do speech-to-text, albeit not in real time.

Prerequisites

You must have the following tools installed to build the application:

CMake (preferably a recent version 3.11 or later)
A modern C++ compiler that supports C++11, a minimum of GCC 5.0, Clang 4.0, or Visual Studio 2015
Git
An HTTP/2 client
- On *nix, you must have libcurl with HTTP/2 support installed on the system. To ensure that the version of libcurl you have supports HTTP/2, run the following command:
  - $ curl --version
  - You should see HTTP2 listed as one of the features.
- On Windows, you must be running Windows 10.
An AWS account configured with the CLI

Walkthrough

The first step is to download and install PortAudio from the source. If you’re using Linux or macOS, you can use the system’s package manager to install the library (for example: apt, yum, or Homebrew).

Browse to http://www.portaudio.com/download.html and download the latest stable release.
Unzip the archive to a PortAudio directory.
If you’re running Windows, run the following commands to build and install the library.

$ cd portaudio
$ mkdir build
$ cd build
$ cmake. 
$ cmake --build . --config Release

Those commands should build both a DLL and a static library. PortAudio does not define an install target when building on Windows. In the Release directory, copy the file named portaudio_static_x64.lib and the file named portaudio.h to another temporary directory. You need both files for the subsequent steps.

4. If you’re running on Linux or macOS, run the following commands instead.

$ cd portaudio
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ cmake --build .
$ cmake --build . --target install

For this demonstration, you can safely ignore any warnings you see while PortAudio builds.

The next step is to download and install the Amazon Transcribe Streaming C++ SDK. You can use vcpkg or Homebrew to do that step. But I show you how to build the SDK from source.

$ git clone https://github.com/aws/aws-sdk-cpp.git
$ cd aws-sdk-cpp
$ mkdir build
$ cd build
$ cmake .. -DBUILD_ONLY=”transcribestreaming” -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=OFF
$ cmake --build . --config Release
$ sudo cmake --build . --config Release --target install

Now, you’re ready to write the command line application.

Before diving into the code, I’d like to explain a few things about the structure of the application.

First, you need to tell PortAudio to capture audio from the microphone and write the sampled bits to a stream. Second, you want to simultaneously consume that stream and send the bits you captured to the service. To do those two operations concurrently, the application uses multiple threads. The SDK and PortAudio create and manage the threads. PortAudio is responsible for the audio capturing thread, and the SDK is responsible for the API communication thread.

If you have used the C++ SDK asynchronous APIs before, then you might notice the new pattern introduced with the Amazon Transcribe Streaming API. Specifically, in addition to the callback parameter invoked on request completion, there’s now another callback parameter. The new callback is invoked when the stream is ready to be written to by the application. More on that later.

The application can be a single C++ source file. However, I split the logic into two source files: one file contains the SDK-related logic, and the other file contains the PortAudio specific logic. That way, you can easily swap out the PortAudio specific code and replace it with any other audio source that fits your use case. So, go ahead and create a new directory for the demo application and save the following files into it.

The first file (main.cpp) contains the logic of using the Amazon Transcribe Streaming SDK:

// main.cpp
#include <aws/core/Aws.h>
#include <aws/core/utils/threading/Semaphore.h>
#include <aws/transcribestreaming/TranscribeStreamingServiceClient.h>
#include <aws/transcribestreaming/model/StartStreamTranscriptionHandler.h>
#include <aws/transcribestreaming/model/StartStreamTranscriptionRequest.h>
#include <cstdio>

using namespace Aws;
using namespace Aws::TranscribeStreamingService;
using namespace Aws::TranscribeStreamingService::Model;

int SampleRate = 16000; // 16 Khz
int CaptureAudio(AudioStream& targetStream);

int main()
{
    Aws::SDKOptions options;
    options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Trace;
    Aws::InitAPI(options);
    {
        Aws::Client::ClientConfiguration config;
#ifdef _WIN32
        config.httpLibOverride = Aws::Http::TransferLibType::WIN_INET_CLIENT;
#endif
        TranscribeStreamingServiceClient client(config);
        StartStreamTranscriptionHandler handler;
        handler.SetTranscriptEventCallback([](const TranscriptEvent& ev) {
            for (auto&& r : ev.GetTranscript().GetResults()) {
                if (r.GetIsPartial()) {
                    printf("[partial] ");
                } else {
                    printf("[Final] ");
                }
                for (auto&& alt : r.GetAlternatives()) {
                    printf("%s\n", alt.GetTranscript().c_str());
                }
            }
        });

        StartStreamTranscriptionRequest request;
        request.SetMediaSampleRateHertz(SampleRate);
        request.SetLanguageCode(LanguageCode::en_US);
        request.SetMediaEncoding(MediaEncoding::pcm);
        request.SetEventStreamHandler(handler);

        auto OnStreamReady = [](AudioStream& stream) {
            CaptureAudio(stream);
            stream.flush();
            stream.Close();
        };

        Aws::Utils::Threading::Semaphore signaling(0 /*initialCount*/, 1 /*maxCount*/);
        auto OnResponseCallback = [&signaling](const TranscribeStreamingServiceClient*,
                  const Model::StartStreamTranscriptionRequest&,
                  const Model::StartStreamTranscriptionOutcome&,
                  const std::shared_ptr<const Aws::Client::AsyncCallerContext>&) { signaling.Release(); };

        client.StartStreamTranscriptionAsync(request, OnStreamReady, OnResponseCallback, nullptr /*context*/);
        signaling.WaitOne(); // prevent the application from exiting until we're done
    }

    Aws::ShutdownAPI(options);

    return 0;
}

The second file (audio-capture.cpp) contains the logic related to capturing audio from the microphone.

// audio-capture.cpp
#include <aws/core/utils/memory/stl/AWSVector.h>
#include <aws/core/utils/threading/Semaphore.h>
#include <aws/transcribestreaming/model/AudioStream.h>
#include <csignal>
#include <cstdio>
#include <portaudio.h>

using SampleType = int16_t;
extern int SampleRate;
int Finished = paContinue;
Aws::Utils::Threading::Semaphore pasignal(0 /*initialCount*/, 1 /*maxCount*/);

static int AudioCaptureCallback(const void* inputBuffer, void* outputBuffer, unsigned long framesPerBuffer,
    const PaStreamCallbackTimeInfo* timeInfo, PaStreamCallbackFlags statusFlags, void* userData)
{
    auto stream = static_cast<Aws::TranscribeStreamingService::Model::AudioStream>(userData);
    const auto beg = static_cast<const unsigned char*>(inputBuffer);
    const auto end = beg + framesPerBuffer * sizeof(SampleType);

    (void)outputBuffer; // Prevent unused variable warnings
    (void)timeInfo;
    (void)statusFlags;
 
    Aws::Vector<unsigned char> bits { beg, end };
    Aws::TranscribeStreamingService::Model::AudioEvent event(std::move(bits));
    stream->WriteAudioEvent(event);

    if (Finished == paComplete) {
        pasignal.Release(); // signal the main thread to close the stream and exit
    }

    return Finished;
}

void interruptHandler(int)
{
    Finished = paComplete;
}
 
int CaptureAudio(Aws::TranscribeStreamingService::Model::AudioStream& targetStream)
{
 
    signal(SIGINT, interruptHandler); // handle ctrl-c
    PaStreamParameters inputParameters;
    PaStream* stream;
    PaError err = paNoError;
 
    err = Pa_Initialize();
    if (err != paNoError) {
        fprintf(stderr, "Error: Failed to initialize PortAudio.\n");
        return -1;
    }

    inputParameters.device = Pa_GetDefaultInputDevice(); // default input device
    if (inputParameters.device == paNoDevice) {
        fprintf(stderr, "Error: No default input device.\n");
        Pa_Terminate();
        return -1;
    }

    inputParameters.channelCount = 1;
    inputParameters.sampleFormat = paInt16;
    inputParameters.suggestedLatency = Pa_GetDeviceInfo(inputParameters.device)->defaultHighInputLatency;
    inputParameters.hostApiSpecificStreamInfo = nullptr;

    // start the audio capture
    err = Pa_OpenStream(&stream, &inputParameters, nullptr, /* &outputParameters, */
        SampleRate, paFramesPerBufferUnspecified,
        paClipOff, // you don't output out-of-range samples so don't bother clipping them.
        AudioCaptureCallback, &targetStream);

    if (err != paNoError) {
        fprintf(stderr, "Failed to open stream.\n");        
        goto done;
    }

    err = Pa_StartStream(stream);
    if (err != paNoError) {
        fprintf(stderr, "Failed to start stream.\n");
        goto done;
    }
    printf("=== Now recording!! Speak into the microphone. ===\n");
    fflush(stdout);

    if ((err = Pa_IsStreamActive(stream)) == 1) {
        pasignal.WaitOne();
    }
    if (err < 0) {
        goto done;
    }

    Pa_CloseStream(stream);

done:
    Pa_Terminate();
    return 0;
}

There is one line in the audio-capture.cpp file that is related to the Amazon Transcribe Streaming SDK. That is the line in which you wrap the audio bits in an AudioEvent object and write the event to the stream. That is required regardless of the audio source.

And finally, here’s a simple CMake script to build the application:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.11)
set(CMAKE_CXX_STANDARD 11)
project(demo LANGUAGES CXX)

find_package(AWSSDK COMPONENTS transcribestreaming)

add_executable(${PROJECT_NAME} "main.cpp" "audio-capture.cpp")

target_link_libraries(${PROJECT_NAME} PRIVATE ${AWSSDK_LINK_LIBRARIES})

if(MSVC)
    target_include_directories(${PROJECT_NAME} PRIVATE "portaudio")
    target_link_directories(${PROJECT_NAME} PRIVATE "portaudio")
    target_link_libraries(${PROJECT_NAME} PRIVATE “portaudio_static_x64”) # might have _x86 suffix instead
    target_compile_options(${PROJECT_NAME} PRIVATE "/W4" "/WX")
else()
    target_compile_options(${PROJECT_NAME} PRIVATE "-Wall" "-Wextra" "-Werror")
    target_link_libraries(${PROJECT_NAME} PRIVATE portaudio)
endif()

If you are building on Windows, copy the two files (portaudio.h and portaudio_static_x64.lib) from PortAudio’s build output that you copied earlier, into a directory named “portaudio” under the demo directory as such:

├── CMakeLists.txt

├── audio-capture.cpp

├── build

├── main.cpp

└── portaudio

├── portaudio.h

└── portaudio_static_x64.lib

If you’re building on macOS or Linux, skip this step.

Now, “cd” into that directory and run the following commands:

$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ cmake --build . --config Release

These commands should create the executable ‘demo’ under the build directory. Go ahead and execute it and start speaking into the microphone when prompted. The transcribed words should start appearing printed on the screen.

Amazon Transcribe streaming transcription sends partial results while it is receiving audio input. You might notice that the partial results changing as you speak. If you pause long enough, it recognizes your silence as the end of a statement and sends a final result. As soon as you start speaking again, streaming transcription resumes sending partial results.

Clarifications

If you’ve used the asynchronous APIs in the C++ SDK previously, you might be wondering why the stream is passed through a callback instead of setting it as a property on the request directly.

The reason is that Amazon Transcribe Streaming is one of the first AWS services to use a new binary serialization and deserialization protocol. The protocol uses the notion of an “event” to group application-defined chunks of data. Each event is signed using the signature of the previous event as a seed. The first event is seeded using the request’s signature. Therefore, the stream must be “primed” by seeding it with the initial request’s signature before you can write to it.

Summary

HTTP/2 opens new opportunities for using the web and AWS in real-time scenarios. With the support of the AWS C++ SDK, you can write applications that translate speech to text on the fly.

Amazon Transcribe Streaming follows the same pricing model as Amazon Transcribe. For more details, see Amazon Transcribe Pricing.

I’m excited to see what you build using this new technology. Send AWS your feedback on GitHub and on Twitter (#awssdk).

AWS Developer Tools Blog

Real-time streaming transcription with the AWS C++ SDK

Prerequisites

Walkthrough

Clarifications

Summary

Resources

Follow