What is text-to-voice software?

From reading web pages aloud to handling requests for user data, voice is fast becoming a standard user interface. Customers increasingly expect voice capabilities in every application they interact with. Beyond that, text-to-voice use cases in healthcare, sales, content creation, customer service, and other areas can accelerate automation while enhancing the customer experience. This guide explores text-to-voice features and capabilities and how to get started with them.

Text-to-voice or text-to-speech (TTS) software produces an audio ‘voice’ by synthesizing speech from text. The software is powered by a text-to-speech engine trained on a vast volume of human voice recordings. It converts written words to their spoken form by analyzing sound waveforms in voice data.

Stilted, robot-sounding voices are a result of outdated speech technologies. Modern text-to-speech engines based on generative AI produce output that is nearly indistinguishable from human speech. The generated voice can include natural pauses, various accents, different speeds, and intonations that reflect human emotions.

Types of text-to-speech software

The type of TTS tool you choose depends on your use case. For developers working across multiple apps and environments, an all-in-one, customizable package that integrates easily with existing systems is usually the best choice.

Developers can choose from open-source and commercial TTS software with self-managed deployments, or a fully integrated managed cloud service like Amazon Polly. Amazon Polly enables existing applications to integrate speech as a first-class feature, creating opportunities for entirely new categories of speech-enabled products, from mobile apps and cars to devices and appliances.

Amazon Polly comes with four voice engines based on different AI model architectures, suitable for various use cases. To use an Amazon Polly voice, simply select the engine, voice synthesis operation, and output file format via API in your code. Then provide input text for the engine to synthesize. Amazon Polly will generate the voice output file in the format you requested. These engines can also be trained further for specific voice or brand requirements.
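As a minimal sketch of that flow, the Python below assembles the engine, voice, output format, and input text into one request and then passes it to the SynthesizeSpeech operation through boto3. The helper names build_request and synthesize_to_file are illustrative, the defaults (voice Joanna, Neural engine, MP3) are assumptions, and the format set is only a subset that the operation is known to accept. The API call itself needs boto3, AWS credentials, and a configured Region.

```python
from typing import Any, Dict

# Engine names accepted by SynthesizeSpeech, plus a subset of output formats.
ENGINES = {"standard", "neural", "long-form", "generative"}
OUTPUT_FORMATS = {"mp3", "ogg_vorbis", "pcm", "json"}


def build_request(text: str, voice_id: str = "Joanna",
                  engine: str = "neural",
                  output_format: str = "mp3") -> Dict[str, Any]:
    """Assemble keyword arguments for synthesize_speech (no network access)."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    if not text.strip():
        raise ValueError("input text must not be empty")
    return {"Engine": engine, "VoiceId": voice_id,
            "OutputFormat": output_format, "Text": text}


def synthesize_to_file(text: str, path: str, **options: Any) -> None:
    """Call Amazon Polly and write the returned audio to `path`."""
    import boto3  # imported lazily so build_request works without boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_request(text, **options))
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

Keeping the request builder separate from the network call makes the chosen engine and format easy to validate and test before any characters are sent for synthesis.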

What features should you look for in text-to-voice software?

Amazon Polly includes the following text-to-voice features essential for modern voice development.

Range of voices

Having the option to select different languages, regions, genders, and voices within a region provides a more comprehensive product suite for development. Amazon Polly supports dozens of languages, along with their country-based variations and accents in both male and female formats.
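To illustrate, the sketch below narrows a list of voices by language and gender. The records follow the shape returned by the DescribeVoices operation, but SAMPLE_VOICES is a hand-written, trimmed sample and pick_voices and fetch_voices are illustrative helper names. The live lookup assumes boto3 and AWS credentials, and reads only the first page of results.

```python
from typing import Dict, List, Optional

# Hand-written sample in the shape of DescribeVoices results (fields trimmed).
SAMPLE_VOICES: List[Dict[str, str]] = [
    {"Id": "Joanna", "Gender": "Female", "LanguageCode": "en-US"},
    {"Id": "Matthew", "Gender": "Male", "LanguageCode": "en-US"},
    {"Id": "Amy", "Gender": "Female", "LanguageCode": "en-GB"},
    {"Id": "Brian", "Gender": "Male", "LanguageCode": "en-GB"},
]


def pick_voices(voices: List[Dict[str, str]],
                language_code: Optional[str] = None,
                gender: Optional[str] = None) -> List[str]:
    """Return the IDs of voices matching the given language and gender."""
    matches = []
    for voice in voices:
        if language_code and voice["LanguageCode"] != language_code:
            continue
        if gender and voice["Gender"] != gender:
            continue
        matches.append(voice["Id"])
    return matches


def fetch_voices(language_code: str) -> List[dict]:
    """Ask Amazon Polly for the voices of one language (first page only)."""
    import boto3  # lazy import: only needed for the live call

    polly = boto3.client("polly")
    return polly.describe_voices(LanguageCode=language_code)["Voices"]
```

For example, pick_voices(SAMPLE_VOICES, language_code="en-GB", gender="Male") returns ["Brian"] for the sample data above.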

API-based integration

Check that your TTS software has a fully functional API, with SDKs available in multiple programming languages, for the broadest range of integrations across projects. Amazon Polly provides the Amazon Polly API and various language-specific SDKs. It can also be accessed from the AWS Management Console and the AWS Command Line Interface (CLI). You have complete control over all the capabilities of Amazon Polly, no matter how you use it.

Precise voice control

Speech Synthesis Markup Language (SSML) is an XML-based markup language that allows you to provide more information about how your speech should sound. For example, you can include pauses, interpretation (e.g., dates, acronyms), pitch, rate, volume, emphasis, fade in, and other audio elements to customize the generated voice. SSML allows you to fully control voice outputs and port the customization to other systems.  

Amazon Polly supports both standard SSML tags and Amazon-specific extensions, such as a tag that makes a voice sound like a newscaster. This flexibility helps you create lifelike speech that captures and holds audience attention.
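The sketch below shows one way to assemble SSML in code. The helper names (plain, pause, prosody, speak) are illustrative and not part of any SDK. The tags used are the standard break and prosody elements plus the Amazon-specific amazon:domain element for the newscaster style. Tag support varies by engine and voice, so treat this as a starting point rather than a guaranteed recipe. To have the markup interpreted, the resulting string would be sent with TextType set to "ssml".

```python
from xml.sax.saxutils import escape


def plain(text: str) -> str:
    """Escape literal text so characters like & and < stay valid XML."""
    return escape(text)


def pause(milliseconds: int) -> str:
    """Insert a silent pause of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def prosody(text: str, rate: str = "medium") -> str:
    """Wrap text in a prosody tag that changes the speaking rate."""
    return f'<prosody rate="{rate}">{escape(text)}</prosody>'


def speak(*parts: str, newscaster: bool = False) -> str:
    """Join SSML fragments into one document, optionally in news style."""
    body = "".join(parts)
    if newscaster:
        # Amazon-specific extension; only some voices support it.
        body = f'<amazon:domain name="news">{body}</amazon:domain>'
    return f"<speak>{body}</speak>"


ssml = speak(plain("Q&A starts now."), pause(500),
             prosody("Please speak slowly.", rate="slow"))
```

Escaping the literal text matters because an unescaped ampersand or angle bracket makes the SSML document invalid.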

Metadata hooks for synchronized animation

Some applications, such as gaming and media, require animated characters that follow the audio, including mouth movements or a karaoke-style word-follow-along. Multilingual training videos also benefit from synchronized timing, so that the audio stays aligned with the video in every language.

For such types of applications, developers need metadata to mark which speech elements occur at a given time in a time-stamped format. Amazon Polly allows you to request such additional metadata, or speech marks, alongside your voice file. Speech marks provide information such as the audio file timestamp, visemes (the positions of the face and mouth when speaking a word), and other details that link the written text to the voice output.
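As a rough sketch of how an application might consume speech marks: they are returned as one JSON object per line when the output format is set to json and the desired mark types are listed in SpeechMarkTypes. The sample payload below is hand-written for illustration (its timings are invented), and parse_speech_marks and word_timeline are illustrative helper names.

```python
import json
from typing import List, Tuple

# Hand-written sample of line-delimited speech marks; the timings are invented.
SAMPLE_MARKS = "\n".join([
    '{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}',
    '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
    '{"time":6,"type":"viseme","value":"p"}',
    '{"time":373,"type":"word","start":5,"end":8,"value":"had"}',
])


def parse_speech_marks(payload: str) -> List[dict]:
    """Turn line-delimited JSON speech marks into a list of dictionaries."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def word_timeline(marks: List[dict]) -> List[Tuple[int, str]]:
    """Return (milliseconds, word) pairs for karaoke-style highlighting."""
    return [(mark["time"], mark["value"])
            for mark in marks if mark["type"] == "word"]


def request_speech_marks(text: str, voice_id: str = "Joanna") -> str:
    """Fetch word and viseme marks from Amazon Polly (needs boto3 and credentials)."""
    import boto3  # lazy import: only needed for the live call

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=text, VoiceId=voice_id, OutputFormat="json",
        SpeechMarkTypes=["word", "viseme"])
    return response["AudioStream"].read().decode("utf-8")
```

An animation loop can then walk the timeline and highlight each word, or switch mouth shapes on viseme marks, when playback reaches the recorded time offset.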

Customization

You want your text-to-speech software to be fully customizable for maximum flexibility. For example, audio output should be customizable for different formats and configurations, including file type, file size, and data quality. The software should also be able to handle custom vocabulary that falls outside of its training data.

Amazon Polly supports text-to-voice customization at every stage.

Vocabulary

You can create a custom dictionary with personalized pronunciations for company names, acronyms, foreign words, and neologisms.

Output format

You can request output in multiple audio formats, such as MP3, Ogg Vorbis, and raw PCM. Amazon Polly also supports long-form audio, such as reading documents, in a natural-sounding voice. You can generate continuous audio streams for lower-bandwidth or low-latency connections in real-time use cases.
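To make the vocabulary and output options more concrete, here is a hedged sketch that builds a small pronunciation lexicon in the W3C Pronunciation Lexicon Specification (PLS) format, uploads it with the PutLexicon operation, and references it in a synthesis request with a non-default output format. The helper names, the alias entries, and the choice of Ogg Vorbis and the Joanna voice are assumptions made for illustration.

```python
import re
from typing import Dict
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def is_valid_lexicon_name(name: str) -> bool:
    """Lexicon names are limited to 1-20 alphanumeric characters."""
    return re.fullmatch(r"[0-9A-Za-z]{1,20}", name) is not None


def build_lexicon(aliases: Dict[str, str], language: str = "en-US") -> str:
    """Build a PLS document that maps written forms to spoken aliases."""
    lexemes = "".join(
        f"<lexeme><grapheme>{escape(written)}</grapheme>"
        f"<alias>{escape(spoken)}</alias></lexeme>"
        for written, spoken in aliases.items())
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
            f'alphabet="ipa" xml:lang="{language}">{lexemes}</lexicon>')


def speak_with_lexicon(name: str, aliases: Dict[str, str], text: str) -> bytes:
    """Upload the lexicon, then synthesize `text` with it as Ogg Vorbis audio."""
    import boto3  # lazy import: only needed for the live calls

    if not is_valid_lexicon_name(name):
        raise ValueError(f"invalid lexicon name: {name!r}")
    polly = boto3.client("polly")
    polly.put_lexicon(Name=name, Content=build_lexicon(aliases))
    response = polly.synthesize_speech(
        Text=text, VoiceId="Joanna", OutputFormat="ogg_vorbis",
        LexiconNames=[name])
    return response["AudioStream"].read()
```

A lexicon applies only to voices in the language it declares, so the xml:lang value should match the voice used for synthesis.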

Voice

We also provide Brand Voice, a custom engagement where you work with the Amazon Polly team to build a voice for the exclusive use of your organization. Rather than sounding like other apps, you can create a unique voice-based brand mark that helps you stand out.

How can you get started with text-to-voice software?

Getting started with AWS text-to-voice software is easy. In this guide, we walk through a quick how-to demo of Amazon Polly in the console.

First, sign in to the AWS Management Console and open the Amazon Polly console. Click on Try Polly to get started. This will bring up a Text-to-Speech dialog.

Step 1—Choose an engine

In the Text-to-Speech dialog, you can select which voice engine you want to use. Amazon Polly currently has four different voice engines to choose from.

  • The Standard engine uses concatenative synthesis to generate speech.
  • The Neural engine uses a neural network and vocoder method to produce more natural-sounding speech.
  • The Generative engine uses a billion-parameter model trained on a large variety of voice data for even more natural-sounding speech.
  • The Long-form engine is another generative-AI text-to-speech engine, developed for long, narrative-style speech.

Not all engines are available in all AWS regions.

Step 2—Choose a language

Once you’ve selected a voice engine, choose the Language you’d like to generate speech in and a male or female Voice from the drop-down menus.

Each voice engine supports a different range of languages and AI voices. For example, if you select Neural for Engine, only the languages and voices that support Neural Text-to-Speech (NTTS) are available, and all Standard and Long-form voices are disabled.
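The same compatibility rule can be checked in code before a request is sent. In the sketch below, each voice record carries a SupportedEngines list, as in DescribeVoices results. The voice IDs and their engine lists are made up for illustration, and voices_for_engine and check_engine are illustrative helper names.

```python
from typing import List

# Made-up voice records; real ones come from the DescribeVoices operation.
HYPOTHETICAL_VOICES = [
    {"Id": "VoiceA", "SupportedEngines": ["standard", "neural"]},
    {"Id": "VoiceB", "SupportedEngines": ["standard"]},
    {"Id": "VoiceC", "SupportedEngines": ["long-form"]},
]


def voices_for_engine(voices: List[dict], engine: str) -> List[str]:
    """Return the IDs of voices that can be used with the given engine."""
    return [voice["Id"] for voice in voices
            if engine in voice.get("SupportedEngines", [])]


def check_engine(voice: dict, engine: str) -> None:
    """Raise ValueError if the voice cannot be used with the engine."""
    supported = voice.get("SupportedEngines", [])
    if engine not in supported:
        raise ValueError(
            f"voice {voice['Id']!r} supports {supported}, not {engine!r}")
```

With the sample records, voices_for_engine(HYPOTHETICAL_VOICES, "neural") returns ["VoiceA"], which mirrors how the console filters the voice list once an engine is selected.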

Step 3—Convert text to speech

In the Input text box, change the default text to your own written text input. You can choose the Listen button to hear the output read aloud, the Download button to download the MP3 file, or the Save to S3 button to save the spoken words to Amazon Simple Storage Service.
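The Download and Save to S3 buttons have rough equivalents in code. The sketch below writes a returned audio stream to a local file in chunks and, separately, starts an asynchronous synthesis task that delivers its output to an S3 bucket. The helper names, the bucket name passed by the caller, and the "polly/" key prefix are assumptions for illustration. The S3 call needs boto3, AWS credentials, and a bucket you can write to.

```python
from typing import Any, BinaryIO, Dict


def save_stream(stream: BinaryIO, path: str, chunk_size: int = 8192) -> int:
    """Copy an audio stream to `path` in chunks and return the bytes written."""
    written = 0
    with open(path, "wb") as audio_file:
        while chunk := stream.read(chunk_size):
            audio_file.write(chunk)
            written += len(chunk)
    return written


def build_s3_task(text: str, bucket: str,
                  key_prefix: str = "polly/") -> Dict[str, Any]:
    """Assemble arguments for an asynchronous synthesis task that saves to S3."""
    return {"Engine": "neural", "VoiceId": "Joanna", "OutputFormat": "mp3",
            "Text": text, "OutputS3BucketName": bucket,
            "OutputS3KeyPrefix": key_prefix}


def save_to_s3(text: str, bucket: str) -> str:
    """Start the task and return its ID for later status checks."""
    import boto3  # lazy import: only needed for the live call

    polly = boto3.client("polly")
    task = polly.start_speech_synthesis_task(**build_s3_task(text, bucket))
    return task["SynthesisTask"]["TaskId"]
```

Reading in chunks keeps memory use flat for longer audio, and the asynchronous task suits inputs that take too long to synthesize within a single request.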

Accessing Amazon Polly via the API

You can access Amazon Polly through the console, as above, or via its API in application code. The Amazon Polly API can serve as the speech layer in many kinds of applications, from voicing real-time translations to generating subtitles and bringing video game or other animated characters to life. Try out some of the samples on GitHub for examples of how to use the Amazon Polly API in code.

How can AWS support your text-to-voice software needs?

Text-to-voice allows you to create voice-based audio from text rather than from recorded human speech. It was initially used as an assistive technology for people with visual impairments, but is now becoming a requirement in many applications and customer interactions, ranging from browser extensions to call centers and enterprise applications. Using a managed service like Amazon Polly, developers can easily integrate a modern, lifelike voice engine into applications via text-to-speech API calls. Amazon Polly pricing is based on the engine and the number of characters processed, and includes a free tier for personal use.

Amazon Polly’s spoken audio is just one of the generative AI services that you can leverage in application development. Take a look at the range of AI solutions on AWS to help you build and scale apps faster and stronger.