First Place Winner:
ReadToMe
Inspiration
The inspiration behind ReadToMe came from my kids. Specifically, my younger two who are 3 and 5. Both kids love books, but they aren’t reading by themselves yet. I wanted to create something that they could use to still enjoy reading, even when a parent isn’t available to read to them.
What it does
ReadToMe works by using deep learning and computer vision combined with services available in AWS. Just show the AWS DeepLens a page you want it to read, and it reads it out loud.
Created By: Alex Schultz
Learn more about Alex and the ReadToMe project in this AWS Machine Learning blog post.
How I built it
As simple as the workflow sounds, there were many challenges that had to be overcome to get this to work. These are the individual problems I needed to solve:
- Determine when a page that needs to be read is in the camera frame
- Isolate the Text Block and clean up the image using OpenCV
- Perform OCR (Optical Character Recognition)
- Transform text into audio
- Play back the audio through speakers
Each one of the above steps introduced unexpected challenges (some of which made me question whether I would be able to finish this project in time!).
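At a high level, the steps fit together roughly like the sketch below. The helper names are placeholders (not the actual Lambda code) and each one is covered in the sections that follow.

```python
def process_frame(frame):
    """Simplified glue code showing how the pipeline hangs together (hypothetical helpers)."""
    box = detect_text_block(frame)          # deep learning model decides if a page is in view
    if box is None:
        return                              # nothing readable in the frame
    ymin, xmin, ymax, xmax = box
    page = frame[ymin:ymax, xmin:xmax]      # isolate the block of text
    cleaned = clean_for_ocr(page)           # OpenCV cleanup
    text = image_to_text(cleaned)           # Tesseract OCR
    if text.strip():
        speak(text)                         # Amazon Polly + audio playback
```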
Determining when a page that needs to be read is in the camera frame
For the DeepLens to be able to read a page, it needed some way to know whether there was something in the camera frame to read. This is where the deep learning model was used.
I looked online but could not find a pre-trained model that was able to classify a page in a book. As a result, I knew I would need to train a model with new data to classify this type of object.
To get training data, I did a quick search online to try to find images of children’s books. I found tons of images of book covers, but practically nothing showing someone holding a book in the right orientation for me to train with. I needed images of the actual pages with text on them. Fortunately, I have four young kids, and they have hundreds of children’s books. So, one night I grabbed about forty books and started taking lots of pictures of myself holding the different books in different orientations and lighting, with my hand occluding different parts of the page.
During this process, I started to realize that the blocks of text in these children’s books varied greatly. Sometimes the text was white on black, sometimes black on white. Other times it was colored text on different colored backgrounds. Sometimes the text ran along the bottom of the page, and some books had text all over the page with no logical flow. So, I decided to narrow my focus and only capture images of books that had text in a “somewhat normal” position. I figured that if I tried to capture too broad a dataset, I would end up with a model that thought everything was a block of text. Down the road I may experiment with more varied types of data, but for this project I decided to limit the scope.
After I captured the data, I used labelImg to generate the Pascal VOC XML files that would be used in training. At this point, I got stuck trying to figure out how to get my data formatted correctly to be trained with MXNet. I found a few examples using TensorFlow, so that is the road I went down. I figured that if I finished the project with time to spare, I could always go back and get it working with MXNet. I was able to follow one of the examples I found on YouTube and ended up with a working model that I could use to detect text on a page.
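For reference, running a model exported from the TensorFlow Object Detection API looks roughly like the sketch below. The file name, score threshold, and `detect_text_block` helper are illustrative assumptions, not the project's actual code; the tensor names follow the API's standard export conventions.

```python
import numpy as np
import tensorflow as tf

# Load the frozen graph exported by the TensorFlow Object Detection API (TF 1.x)
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:  # assumed file name
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')
sess = tf.Session(graph=detection_graph)

def detect_text_block(frame, score_threshold=0.6):
    """Return (ymin, xmin, ymax, xmax) of the best text-block detection, or None."""
    # Note: exported models expect RGB; a DeepLens/OpenCV frame is BGR and may need converting.
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    boxes_t = detection_graph.get_tensor_by_name('detection_boxes:0')
    scores_t = detection_graph.get_tensor_by_name('detection_scores:0')
    boxes, scores = sess.run([boxes_t, scores_t],
                             feed_dict={image_tensor: np.expand_dims(frame, axis=0)})
    if scores[0][0] < score_threshold:      # detections come back sorted by confidence
        return None
    h, w = frame.shape[:2]
    ymin, xmin, ymax, xmax = boxes[0][0]    # normalized coordinates
    return int(ymin * h), int(xmin * w), int(ymax * h), int(xmax * w)
```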
Perform optical character recognition
I was surprised how easy it was to integrate Tesseract into the project. Just install it on the device, pip install the Python package, and the workflow is really just a single function call: you provide an image to process, and it spits out text. I was originally going to use a separate Lambda function with Tesseract installed to perform the OCR, but I ended up including it in my main Lambda function because it was simpler and cut down on traffic to and from AWS. The OCR itself didn’t seem to take much compute power, and its runtime was comparable to the round trip to AWS. Plus, now I am doing more "at the edge", which is more efficient and costs less.
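Assuming the Python package in question is pytesseract (the most common wrapper; the write-up doesn't name it), that single function call really is a one-liner:

```python
import pytesseract

def image_to_text(image):
    """Run Tesseract on a preprocessed page image (numpy array or PIL Image) and return the text."""
    return pytesseract.image_to_string(image)
```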
There was one gotcha with Tesseract: it is very picky about image quality. I had to spend a considerable amount of effort figuring out how to clean up the images enough to get a clean read. It also wants the text to be almost completely horizontal (which is pretty much impossible considering I want preschool-aged children to be able to use this thing). I used OpenCV for most of this image pre-processing, and after a number of iterations I was able to produce images that were plain black-and-white text with minimal noise. This was key to getting the project to work. The end results were better than I expected; however, there is still room for improvement in this area.
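The general shape of that kind of OpenCV cleanup looks something like the sketch below (not the exact pipeline; the kernel size and thresholding choice are illustrative):

```python
import cv2

def clean_for_ocr(bgr_page):
    """Rough sketch of cleanup ahead of Tesseract: grayscale, light denoise,
    then binarize down to plain black-and-white text."""
    gray = cv2.cvtColor(bgr_page, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)  # knock down sensor noise
    # Otsu picks a global threshold; cv2.adaptiveThreshold is an option for uneven lighting
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```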
Transform text into audio
This step was the easiest of the whole project. Once I could get audio to play on the device, I was able to encapsulate this logic into a single function call that simply calls Amazon Polly to generate the audio. I never write the file to disk; I just feed the byte stream to the audio library and discard it when it’s done playing. I do have a few static MP3 files that are played on startup of the Greengrass service. I use these audio files to speak instructions to the user so they know how to use the device. I figured there was no reason to call Polly to generate speech that never changes, so I generated those files ahead of time and deploy them with the Lambda. (I love how easy it is to integrate existing AWS services into my projects!)
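A minimal sketch of that function is below. The voice and the playback library are assumptions (the write-up doesn't say which audio library is used; pygame is just one option that can play an MP3 stream straight from memory):

```python
import io

import boto3
import pygame  # assumption: any library that can play an MP3 byte stream works here

polly = boto3.client('polly')
pygame.mixer.init()

def speak(text, voice='Joanna'):  # voice choice is an assumption
    """Synthesize speech with Amazon Polly and play it without writing anything to disk."""
    response = polly.synthesize_speech(Text=text, OutputFormat='mp3', VoiceId=voice)
    stream = io.BytesIO(response['AudioStream'].read())  # keep the audio in memory
    pygame.mixer.music.load(stream)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.wait(100)  # block until playback finishes
```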
Play back the audio through speakers
Greengrass requires you to explicitly authorize all the hardware that your code has access to. One way to configure this is through the Group Resources section in the AWS IoT console. Once configured, you deploy these settings to the DeepLens, which results in a JSON file being deployed to the greengrass directory on the device.
To enable audio playback through your Lambda, you need to add two resources. The sound card on the DeepLens lives under “/dev/snd/”. You need to add both “/dev/snd/pcmC0D0p” and “/dev/snd/controlC0” in order to play sound.
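If you prefer the API to the console, the same two device resources can be declared in a Greengrass resource definition; roughly like this boto3 sketch (the Id/Name values are made up):

```python
import boto3

greengrass = boto3.client('greengrass')

# Sketch: declare the two sound-card devices as local device resources so the
# Lambda running on the DeepLens is allowed to open them.
greengrass.create_resource_definition(InitialVersion={
    'Resources': [
        {
            'Id': 'snd-pcm', 'Name': 'snd-pcm',
            'ResourceDataContainer': {
                'LocalDeviceResourceData': {
                    'SourcePath': '/dev/snd/pcmC0D0p',
                    'GroupOwnerSetting': {'AutoAddGroupOwner': True},
                }
            },
        },
        {
            'Id': 'snd-control', 'Name': 'snd-control',
            'ResourceDataContainer': {
                'LocalDeviceResourceData': {
                    'SourcePath': '/dev/snd/controlC0',
                    'GroupOwnerSetting': {'AutoAddGroupOwner': True},
                }
            },
        },
    ]
})
```

Either way, the resources still have to be attached to the Lambda with read-and-write access and included in the group version you deploy.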
As an aside, you will need to re-add these resources every time you deploy a project to the device. At the time of this writing, Greengrass overwrites the resources file with a pre-configured group.json whenever you deploy a project to DeepLens.
Accomplishments that I'm proud of
I was very happy with the end results of this project. In spite of having a full-time job and four kids, I was able to create something really cool and get it to actually work reasonably well in just a couple months. I have more ideas for future projects with this device and am excited to keep learning.
What I learned
Before diving into this project, I had no experience with Deep Learning or AI in general. I have always been interested in the topic, but it always seemed too unapproachable for "normal developers". I have discovered, through this process, that it is possible to create real, useful deep learning projects without a PhD in math, and that with enough effort and patience, anyone with a decent development background can start using it.
What's next for ReadToMe
I have a few ideas for improving the project. One feature I would like to add is the ability to translate the text that is read. (I signed up for early access to Amazon’s new Translate service but haven’t yet been approved.) I also plan to keep improving my model to see if I can increase its accuracy a bit and make it work with a broader range of books.
Lastly, the text-image cleanup function, which feeds directly into Tesseract, can be improved. Specifically, it would be beneficial to be able to rotate and/or warp the image before sending it to Tesseract; that way, when a child isn’t holding the book perfectly, it could still read the text. Motion blur was also a definite issue I had to contend with during image cleanup. If the book isn’t held very still for a few seconds, the image is just too blurry for the OCR to work. I have read about various techniques to solve this problem, like averaging the image over multiple frames or applying different filters to smooth out the pixels. I am sure it's possible to achieve a better and faster outcome, but it's tricky working on a resource-constrained device.
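One standard way to handle the rotation piece is to estimate the skew angle from the binarized text and rotate the page back before OCR. A sketch of that common minAreaRect deskew trick is below (it does not address warping or motion blur, and angle conventions vary between OpenCV versions):

```python
import cv2
import numpy as np

def deskew(binary_page):
    """Sketch: estimate the dominant angle of the dark (text) pixels and rotate
    the page so the lines of text are roughly horizontal."""
    coords = np.column_stack(np.where(binary_page == 0)).astype(np.float32)  # text pixel locations
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle  # map to a small correction angle
    h, w = binary_page.shape[:2]
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(binary_page, M, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
```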
Helpful resources
There were many online resources that helped me along the way, but these links proved to be the most helpful.
(In no particular order)
https://becominghuman.ai/an-introduction-to-the-mxnet-api-part-1-848febdcf8ab
http://gluon.mxnet.io/chapter08_computer-vision/object-detection.html?highlight=ssd
https://github.com/zhreshold/mxnet-ssd
https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/
Also, many thanks to all the participants in the forums answering questions, and especially to the AWS DeepLens team for getting me unstuck numerous times! :)
Built with
opencv
deeplens
python
polly
tesseract-ocr
lambda
mxnet
tensorflow