Building a speech-to-text notification system in different languages with Amazon Transcribe and an IoT device
Have you ever wished that people visiting your home could leave you a message if you’re not there? What if that solution could support your native language? Here is a straightforward and cost-effective solution that you can build yourself, and you only pay for what you use.
This post demonstrates how to build a notification system to detect a person, record audio, transcribe the audio to text, and send the text to your mobile device in your preferred language. The solution uses the following services:
- AWS CloudFormation
- AWS Lambda
- Amazon Polly
- Amazon Simple Notification Service (Amazon SNS)
- Amazon Simple Storage Service (Amazon S3)
- Amazon Transcribe
To complete this walkthrough, you need the following:
- Raspberry Pi 4 device running NOOBS
- Ultrasonic sensor connected to the Raspberry Pi
- Speaker and microphone connected to the Raspberry Pi
- AWS account
Workflow and architecture
When the sensor detects a person within a specified range, the speaker attached to the Raspberry Pi plays the initial greeting and asks the visitor to record a message. The recording is uploaded to Amazon S3, which triggers a Lambda function that transcribes the speech to text using Amazon Transcribe. When the transcription is complete, you receive a text notification containing the transcript from Amazon SNS.
The following diagram illustrates the solution workflow.
Amazon Transcribe uses a deep learning process called automatic speech recognition (ASR) to convert speech to text quickly and accurately in the language of your choice. It automatically adds punctuation and formatting so that the output closely matches the quality of manual transcription. You can configure Amazon Transcribe with a custom vocabulary for more accurate transcriptions (for example, the names of people living in your home). You can also configure it to remove specific words from transcripts (such as profane or offensive words). Amazon Transcribe supports many different languages. For more information, see What Is Amazon Transcribe?
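As a minimal sketch of how a Lambda function might start a transcription job when a recording lands in the S3 bucket, consider the following. It is not the code from the CloudFormation template in this post. The job name prefix, the default language code, and the vocabulary names are hypothetical placeholders; the `Settings` keys correspond to the custom vocabulary and vocabulary filter options described above.

```python
import urllib.parse
import uuid


def build_transcription_request(s3_event, language_code="en-US",
                                vocabulary_name=None,
                                vocabulary_filter_name=None):
    """Turn an S3 object-created event into StartTranscriptionJob arguments."""
    record = s3_event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    request = {
        # Hypothetical naming scheme; job names must be unique per account.
        "TranscriptionJobName": f"visitor-message-{uuid.uuid4()}",
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
    }

    settings = {}
    if vocabulary_name:
        # Custom vocabulary, for example names of people in your home.
        settings["VocabularyName"] = vocabulary_name
    if vocabulary_filter_name:
        # Vocabulary filter that removes unwanted words from the transcript.
        settings["VocabularyFilterName"] = vocabulary_filter_name
        settings["VocabularyFilterMethod"] = "remove"
    if settings:
        request["Settings"] = settings
    return request


def lambda_handler(event, context):
    # boto3 is imported here so the helper above can be tested without it.
    import boto3

    transcribe = boto3.client("transcribe")
    response = transcribe.start_transcription_job(
        **build_transcription_request(event)
    )
    return response["TranscriptionJob"]["TranscriptionJobName"]
```

Keeping the request construction in a separate function makes it possible to check the S3 URI and settings locally before deploying anything.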
Uploading the CloudFormation stack
This post provides a CloudFormation template that creates an input S3 bucket, a Lambda function that the bucket triggers to transcribe audio to text, an SNS notification that sends the text to the user, and the required IAM permissions.
- Download the CloudFormation template.
- On the AWS CloudFormation console, choose Upload a template file.
- Choose the file you downloaded.
- Choose Next.
- For Stack Name, enter the name of your stack.
- Under Parameters, enter values for the following template parameters:
||<Requires input>||A valid mobile number to receive SNS notifications|
||<Requires input>||The language code of your audio file, such as en-US for US English|
||<Requires input>||A unique bucket name|
- Choose Next.
- On the Options page, choose Next.
- On the Review page, review and confirm the settings.
- Select the check box that acknowledges that the template creates IAM resources.
- Choose Create.
You can view the status of the stack on the AWS CloudFormation console. You should see the status CREATE_COMPLETE in approximately 5 minutes.
- Record the RaspberryPiUserName from the stack Outputs.
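Stack creation can fail or SMS delivery can silently break if the parameter values are malformed. The following sketch checks two of the inputs before you enter them. It assumes the mobile number should be in E.164 format (as Amazon SNS expects for SMS) and checks only a subset of the S3 bucket naming rules; the function names are illustrative and not part of the template.

```python
import re

# E.164: a plus sign, then up to 15 digits, with no leading zero.
E164_PATTERN = re.compile(r"^\+[1-9][0-9]{1,14}$")

# Subset of the S3 rules: 3-63 characters; lowercase letters, digits,
# dots, and hyphens; must start and end with a letter or digit.
BUCKET_PATTERN = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def is_e164_phone_number(value):
    """Return True if value looks like an E.164 number such as +14155550100."""
    return bool(E164_PATTERN.fullmatch(value))


def is_plausible_bucket_name(value):
    """Return True if value passes the basic bucket naming checks above."""
    return bool(BUCKET_PATTERN.fullmatch(value)) and ".." not in value
```

A name that passes these checks can still be rejected if another account already owns it, because bucket names are globally unique.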
Downloading the greeting message
To download the greeting message, complete the following steps:
- On the Amazon Polly console, on the Plain text tab, enter your greeting.
- For Language and Region, choose your preferred language.
- Choose Download MP3.
- Rename the file to
- Move the file to the folder on
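The console steps above could also be approximated programmatically. The following is a sketch under stated assumptions, not the method used in this post: it calls the Amazon Polly `synthesize_speech` API through a client you pass in, and the output path, greeting text, and voice are placeholders that you would replace with your own values.

```python
def save_greeting(polly_client, text, voice_id, output_path):
    """Synthesize text to MP3 with Amazon Polly and write it to output_path.

    polly_client is expected to be boto3.client("polly"); it is passed in
    so that the function can be exercised with a stand-in object.
    Returns the number of bytes written.
    """
    response = polly_client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    audio = response["AudioStream"].read()
    with open(output_path, "wb") as audio_file:
        audio_file.write(audio)
    return len(audio)


# Example usage (assumes boto3 is installed and AWS credentials are set up;
# the file name and voice below are hypothetical choices):
#
#   import boto3
#   save_greeting(boto3.client("polly"),
#                 "Hello! Please leave a message after the tone.",
#                 "Joanna", "greeting.mp3")
```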
Setting up AWS IoT credentials provider
Set up the AWS IoT credentials provider so that the device can authenticate securely without hardcoded credentials. For instructions, see How to Eliminate the Need for Hardcoded AWS Credentials in Devices by Using the AWS IoT Credentials Provider. In Step 3 of that post, attach a policy that allows uploading the file to Amazon S3 instead of updating an Amazon DynamoDB table.
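A minimal sketch of what such an upload policy might look like, expressed as a Python helper that prints the policy JSON. This is an assumption about the policy's shape rather than the exact policy from this post: it grants only `s3:PutObject` on objects in one bucket, and the bucket name shown is a placeholder for the unique bucket name you chose for the stack.

```python
import json


def build_upload_policy(bucket_name, prefix=""):
    """Build an IAM policy document that only allows uploads to one bucket."""
    resource = f"arn:aws:s3:::{bucket_name}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [resource],
            }
        ],
    }


# "my-unique-bucket-name" is a hypothetical placeholder.
print(json.dumps(build_upload_policy("my-unique-bucket-name"), indent=2))
```

Limiting the device to `s3:PutObject` means that credentials on the Raspberry Pi cannot read or delete recordings that were already uploaded.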
Setting up Raspberry Pi
To set up Raspberry Pi, complete the following steps:
- On the Raspberry Pi, open the terminal and install the AWS CLI.
- Create a Python file that uses the sensor to detect whether a person is within a specific range (for example, 30 to 200 cm), plays the greeting message, records audio for a specified period (for example, 20 seconds), and uploads the recording to Amazon S3.
- Run the Python file.
The ultrasonic sensor continuously checks for a person approaching your home. When it detects a person, the speaker plays the greeting and asks the guest to start recording. The recording is then sent to Amazon S3.
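The detection loop described above can be sketched roughly as follows. This is an illustrative outline, not the script from this post: the hardware and AWS interactions are passed in as callables (all hypothetical names), so the distance logic can be tried on any machine. It assumes an ultrasonic sensor that reports a round-trip echo time and uses the 30 to 200 cm range and 20-second recording from the example values above.

```python
import time

SPEED_OF_SOUND_CM_PER_S = 34300  # approximate speed of sound in air


def echo_to_distance_cm(echo_seconds):
    """Convert a round-trip echo time into a one-way distance in cm."""
    return echo_seconds * SPEED_OF_SOUND_CM_PER_S / 2


def person_in_range(distance_cm, min_cm=30, max_cm=200):
    """Return True if the measured distance falls inside the trigger range."""
    return min_cm <= distance_cm <= max_cm


def run_once(read_echo_seconds, play_greeting, record_audio, upload,
             record_seconds=20):
    """Take one measurement; greet, record, and upload if someone is near.

    Returns True if a recording was made and uploaded, otherwise False.
    """
    distance = echo_to_distance_cm(read_echo_seconds())
    if not person_in_range(distance):
        return False
    play_greeting()
    recording_path = record_audio(record_seconds)
    upload(recording_path)
    return True


def run_forever(read_echo_seconds, play_greeting, record_audio, upload,
                poll_interval_s=0.5):
    """Poll the sensor until the process is stopped."""
    while True:
        run_once(read_echo_seconds, play_greeting, record_audio, upload)
        time.sleep(poll_interval_s)
```

On the Raspberry Pi, `read_echo_seconds` would wrap the GPIO timing of the sensor's echo pin, `record_audio` could call a recorder such as `arecord`, and `upload` could call the boto3 S3 client's `upload_file` with the bucket created by the stack.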
If your speaker and microphone are connected through different audio devices, such as HDMI and USB, configure the .asoundrc file to select the correct playback and capture devices.
Testing the solution
Place the Raspberry Pi in your house at a location where it can sense the person and record their audio.
When the person appears in front of the Raspberry Pi, they should hear a welcome message. They can record a message and leave. You should receive a text message of the recorded audio.
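When testing, it can help to understand how the text message relates to the Amazon Transcribe output. The following sketch assumes the standard transcript JSON that Amazon Transcribe writes for a completed job and shows one way to turn it into an SMS body. The message wording and the 140-character cap are arbitrary choices for illustration, not the behavior of the Lambda function in this post.

```python
import json


def extract_transcript(transcribe_output_json):
    """Pull the full transcript text out of a Transcribe result document."""
    data = json.loads(transcribe_output_json)
    return data["results"]["transcripts"][0]["transcript"]


def build_sms_text(transcript, max_chars=140):
    """Format the transcript as a short text message, truncating if needed."""
    if transcript.strip():
        text = f"New visitor message: {transcript}"
    else:
        text = "A visitor was detected, but no speech was transcribed."
    if len(text) <= max_chars:
        return text
    return text[:max_chars - 3] + "..."


def notify(sns_client, phone_number, transcript):
    """Send the message with Amazon SNS; sns_client is boto3.client("sns")."""
    return sns_client.publish(
        PhoneNumber=phone_number,
        Message=build_sms_text(transcript),
    )
```

If no message arrives during testing, checking that the transcript field is non-empty is a quick way to tell a recording problem from a notification problem.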
This post demonstrated how to build a secure voice-to-text notification solution using AWS services. You can reuse this solution the next time you need a voice-to-text feature in your application in different languages. If you have questions or comments, please leave your feedback in the comments.
About the Authors
Vikas Shah is an Enterprise Solutions Architect at Amazon Web Services. He is a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His areas of interest are ML, IoT, robotics, and storage. In his spare time, Vikas enjoys building robots, hiking, and traveling.
Anusha Dharmalingam is a Solutions Architect at Amazon Web Services with a passion for Application Development and Big Data solutions. Anusha works with enterprise customers to help them architect, build, and scale applications to achieve their business goals.