Guidance for Image-to-Text and Image-to-Speech on AWS

Overview

This Guidance shows how to convert images to text and speech with machine learning and generative AI services on AWS. The image-to-text workflow uses Amazon Kendra, a search service, to index an image repository and make it searchable. Generative AI then captions the images: it recognizes objects and features and generates a human-readable description based on the extracted visual features. This Guidance also shows how to convert images to speech, and it can be extended to serve content through voice-enabled devices such as Amazon Alexa. The image-to-speech workflow uses the Describe for Me web app, which generates a caption for an image and reads it back in a clear, human-sounding voice in a variety of languages and dialects.

How it works

These technical details feature an architecture diagram to illustrate how to effectively use this solution. The architecture diagram shows the key components and their interactions, providing an overview of the architecture's structure and functionality step-by-step.
Architecture diagram illustrating an AWS workflow for image-to-text processing using Amazon S3, AWS Lambda, Amazon SageMaker, Amazon Textract, and Amazon Kendra.
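
The indexing step in this workflow can be sketched as follows. This is a minimal illustration under stated assumptions, not the code shipped with this Guidance: the function names `build_kendra_document` and `index_caption` are hypothetical, the caption is assumed to have already been produced by the captioning model, and the Kendra client is passed in by the caller (for example, one created with `boto3.client("kendra")`). The sketch shows how a caption could be submitted to an Amazon Kendra index through the `BatchPutDocument` API so that the image becomes searchable by its description.

```python
from typing import Any, Dict


def build_kendra_document(bucket: str, key: str, caption: str) -> Dict[str, Any]:
    """Build one document entry for Kendra's BatchPutDocument API.

    Assumption: the S3 URI of the image is used as the document ID and is
    also stored in the reserved `_source_uri` attribute.
    """
    s3_uri = f"s3://{bucket}/{key}"
    return {
        "Id": s3_uri,
        "Title": key.rsplit("/", 1)[-1],   # file name without its prefix
        "Blob": caption.encode("utf-8"),   # the caption is the indexed text
        "ContentType": "PLAIN_TEXT",
        "Attributes": [
            {"Key": "_source_uri", "Value": {"StringValue": s3_uri}},
        ],
    }


def index_caption(
    kendra_client: Any, index_id: str, bucket: str, key: str, caption: str
) -> Dict[str, Any]:
    """Send a single captioned image to a Kendra index (index_id is hypothetical)."""
    document = build_kendra_document(bucket, key, caption)
    return kendra_client.batch_put_document(IndexId=index_id, Documents=[document])
```

Passing the client as an argument keeps the sketch testable without AWS credentials; in a Lambda function the client would typically be created once outside the handler.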

Architecture diagram illustrating the AWS Image to Speech workflow using AWS services such as Amplify, Cognito, S3, API Gateway, Lambda, Textract, Rekognition, SageMaker, Translate, and Polly within an AWS Step Functions workflow. The process begins with users uploading images, which are processed for extraction, recognition, and translation, and then converted to audio using Amazon Polly, with the resulting audio stored back in S3.
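
The processing stages in this workflow can be sketched as a plain Python pipeline. This is a simplified illustration, not the Guidance's implementation: in the diagram the stages run as separate steps inside a Step Functions workflow, and the caption comes from a SageMaker model, whereas this sketch substitutes a simple text template for the captioning step. The function names, the `clients` dictionary, the `audio/` key prefix, and the default language and voice are all assumptions. The service calls themselves (`detect_document_text`, `detect_labels`, `translate_text`, `synthesize_speech`, `put_object`) follow the boto3 client APIs.

```python
from typing import Any, Dict, List


def extract_text(textract: Any, bucket: str, key: str) -> str:
    """Return the text lines Amazon Textract finds in the image."""
    resp = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [b["Text"] for b in resp.get("Blocks", []) if b.get("BlockType") == "LINE"]
    return " ".join(lines)


def detect_labels(rekognition: Any, bucket: str, key: str,
                  min_confidence: float = 80.0) -> List[str]:
    """Return the label names Amazon Rekognition assigns to the image."""
    resp = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=10,
        MinConfidence=min_confidence,
    )
    return [label["Name"] for label in resp.get("Labels", [])]


def describe_image_aloud(clients: Dict[str, Any], bucket: str, key: str,
                         target_language: str = "es",
                         voice_id: str = "Lucia") -> Dict[str, str]:
    """Extract, recognize, translate, synthesize, and store the audio in S3."""
    text = extract_text(clients["textract"], bucket, key)
    labels = detect_labels(clients["rekognition"], bucket, key)

    # Assumption: a template stands in for the SageMaker captioning model.
    parts = []
    if labels:
        parts.append("The image shows: " + ", ".join(labels) + ".")
    if text:
        parts.append("It contains the text: " + text + ".")
    description = " ".join(parts) or "No description available."

    translated = clients["translate"].translate_text(
        Text=description, SourceLanguageCode="en", TargetLanguageCode=target_language
    )["TranslatedText"]

    speech = clients["polly"].synthesize_speech(
        Text=translated, OutputFormat="mp3", VoiceId=voice_id
    )

    audio_key = f"audio/{key}.mp3"  # hypothetical output location
    clients["s3"].put_object(
        Bucket=bucket, Key=audio_key,
        Body=speech["AudioStream"].read(), ContentType="audio/mpeg",
    )
    return {"description": translated, "audio_key": audio_key}
```

With boto3, the `clients` dictionary could be built as `{name: boto3.client(name) for name in ("textract", "rekognition", "translate", "polly", "s3")}`. Injecting the clients keeps each stage replaceable, which mirrors how the stages map to individual workflow steps.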

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

This Guidance uses AWS services like Lambda and Step Functions to automate tasks, reducing manual work and errors, and Amazon S3 to provide reliable data storage. These services easily adapt to changing workloads and support efficient, consistent operations. Additionally, you can use Amazon CloudWatch to monitor operations and gain insights.

Read the Operational Excellence whitepaper 

Security

This Guidance uses Lambda and Step Functions to automate security-related tasks, reducing the risk of human error in security processes. Additionally, API Gateway enforces secure management of API endpoints, Amazon Cognito enhances user authentication and authorization processes, and AWS Identity and Access Management (IAM) controls access to AWS resources. Finally, CloudWatch helps detect security incidents or anomalous activities in real time, facilitating swift incident responses and threat mitigation.

Read the Security whitepaper 

Reliability

This Guidance uses automation through Lambda and Step Functions to reduce the risk of human errors that might compromise reliability. Additionally, Amazon S3 provides data replication and redundancy features that increase data reliability, and API Gateway grants users consistent and secure access to APIs to maintain workload reliability. CloudWatch monitors operations, aiding in issue detection and resolution. This proactive approach enhances workload reliability by minimizing downtime and disruptions.

Read the Reliability whitepaper 

Performance Efficiency

This Guidance reduces latency and resource inefficiency by using Lambda and Step Functions to automate processes and streamline workflows. Additionally, SageMaker and Amazon Polly facilitate real-time content generation, supporting faster and more efficient workloads, and API Gateway optimizes API management, delivering low latency and consistent access to promote high performance efficiency.

Read the Performance Efficiency whitepaper 

Cost Optimization

This Guidance minimizes operational expenses by using Lambda and Step Functions to use resources efficiently and reduce the need for constant manual intervention, which in turn limits human error and resource waste. Additionally, Amazon Polly reduces the need for costly manual content creation, API Gateway optimizes API management, decreasing compute-related costs, and Amazon Kendra improves search efficiency, reducing the time and resources spent on information retrieval. Finally, Amazon S3 offers scalable and cost-effective storage so that you can store and access data efficiently without incurring unnecessary expenses.

Read the Cost Optimization whitepaper 

Sustainability

This Guidance uses serverless services like Lambda and API Gateway for their energy efficiency, their efficient use of resources, and their incorporation of renewable energy sources. These practices align with sustainability goals, helping you reduce your carbon footprint.

Read the Sustainability whitepaper 

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.