Serverless Optical Character Recognition in Support of NASA Astronaut Safety
A guest post by Chris Shenton, CTO, V! Studios
The NASA Extravehicular Activity (EVA) Office at NASA Johnson Space Center (JSC) needed to search and make decisions based on a huge volume of spacesuit safety and test documentation, much of which was only available as scans of paper reports. Timeliness was critical, especially in the event of a spacewalk mishap. While the EVA Data Integration (EDI) infrastructure was hosted in AWS GovCloud (US), the load of performing optical character recognition (OCR) on 100,000 pages per month overwhelmed their systems, so they had to suspend their OCR activity.
Our company, V! Studios, had been using AWS Lambda to prototype an OCR product, and we happened to run into our JSC EVA contact at the 2017 AWS Public Sector Summit. After we gave him a brief demo of our prototype, he asked us to develop a solution to the EVA OCR integration bottleneck. There was one catch: it had to be completed in about two months, before the end of the fiscal year.
We knew we could develop a conventional cloud autoscaling solution based on Elastic Load Balancing, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Simple Queue Service (Amazon SQS), but the complexity of building and tuning that infrastructure would eat into our application development time. Instead, we leveraged the event-driven, automatic scaling features of AWS Lambda, driven by Amazon Simple Storage Service (Amazon S3) events, to deliver a solution that was elegant, fast, scalable, and cost-effective. We used IAM Roles and Policies and a separate VPC to integrate it securely with NASA’s existing system.
The EDI system is configured with an IAM Role containing a policy that lets it drop PDF scans into a specific S3 prefix, or “upload folder,” within our S3 bucket. The upload event starts the OCR process by triggering a Lambda function that splits the document, which could be around 500 pages long, into individual pages and writes each page back to S3 under a second prefix. These events trigger a parallel swarm of Lambda functions, which perform OCR on the individual pages and drop the text into separate files under another S3 prefix. Each of those events triggers a Lambda function that checks whether all pages are done; when they are, it assembles a JSON document associating each page of text with its page number and parent document ID and places it in the bucket’s final prefix. This invokes the final Lambda function, which feeds each page sequentially to EDI’s Elasticsearch API. Using S3 events to trigger Lambda functions allowed us to decompose the functionality into small, well-understood pieces with loose coupling, rather than a tightly coupled monolith.
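To make the fan-in step more concrete, here is a minimal Python sketch of the "are all pages done?" check and the JSON assembly. It is an illustration rather than our production code: the prefix names, the key layout, and the function names are all assumptions made for this example. The only part taken from AWS itself is the shape of the S3 event notification (`Records[0].s3.bucket.name` and `Records[0].s3.object.key`).

```python
import json
from urllib.parse import unquote_plus

# Hypothetical prefix names; the post does not name the real ones.
PAGES_PREFIX = "pages/"
TEXT_PREFIX = "text/"
RESULTS_PREFIX = "results/"


def object_location(event):
    """Return (bucket, key) from the first record of an S3 event notification."""
    s3 = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 event notifications.
    return s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def text_key(doc_id, page_number):
    """Key of the OCR text for one page (assumed layout: text/<doc>/<page>.txt)."""
    return f"{TEXT_PREFIX}{doc_id}/{page_number:04d}.txt"


def all_pages_done(doc_id, page_count, existing_keys):
    """True once every expected per-page text object is present."""
    expected = {text_key(doc_id, n) for n in range(1, page_count + 1)}
    return expected.issubset(existing_keys)


def assemble(doc_id, page_texts):
    """Build the JSON document tying each page's text to its number and parent ID.

    page_texts maps page number -> OCR text.
    """
    pages = [{"page": n, "text": page_texts[n]} for n in sorted(page_texts)]
    return json.dumps({"document_id": doc_id, "pages": pages})
```

In a real handler, `existing_keys` would come from listing the text prefix in S3, and the output of `assemble` would be written under the final prefix, which in turn fires the event for the indexing function.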
To minimize operational costs, V! Studios eschewed the use of a database, even Amazon DynamoDB. Instead, we used S3 prefixes to track progress, and propagated the original document’s name, unique ID, and other information with each processed file using S3’s user-defined metadata fields. We used an S3 Lifecycle Configuration that removes content after 24 hours to reduce costs further and strengthen our security posture. All data is encrypted in transit as well as at rest in S3. We deployed the Lambda functions into a VPC separate from the EDI services in order to provide isolation and give the security team the ability to verify that no data is “leaked” from Lambda. It also provided sufficient address space to scale out a fleet of Lambda servers that could easily launch a thousand functions for a single document. All this was deployed in the AWS GovCloud (US) Region due to the sensitivity of the information.
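The database-free bookkeeping can be sketched in a few lines of Python. This is a hedged illustration, not our actual configuration: the rule ID, the metadata key names, and the `page_metadata` helper are assumptions made for this example. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, and S3 expresses expiration in whole days, so one day is the closest match to the 24 hours described above.

```python
# Lifecycle rule in the shape accepted by boto3's
# put_bucket_lifecycle_configuration(LifecycleConfiguration=...).
# The rule ID is a made-up name for this sketch.
LIFECYCLE = {
    "Rules": [
        {
            "ID": "expire-after-one-day",
            "Filter": {"Prefix": ""},  # empty prefix: rule covers every object
            "Status": "Enabled",
            "Expiration": {"Days": 1},  # S3 expiration is specified in days
        }
    ]
}


def page_metadata(parent, page_number, page_count):
    """Copy the parent document's metadata onto a per-page object.

    Returns a new dict suitable for the Metadata argument of an S3
    put_object call; user-defined metadata values must be strings.
    The key names here are hypothetical.
    """
    meta = dict(parent)  # do not mutate the caller's dict
    meta["page-number"] = str(page_number)
    meta["page-count"] = str(page_count)
    return meta
```

For example, `page_metadata({"document-id": "abc", "document-name": "report.pdf"}, 7, 500)` yields the parent's two fields plus `page-number` and `page-count`, so any later function can work out which document a page belongs to, and how many siblings to wait for, without consulting a database.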
The Lambda service allowed us to focus on our problem, deliver on time and within budget, and deploy NASA’s first serverless operation running in AWS GovCloud (US). Operational costs are expected to be a small fraction of what would be incurred with conventional cloud architectures. Cuong Q Nguyen of the JSC/NASA EVA Office had this to say: “The work you’ve accomplished is a big step proving out this new technology for NASA.” We hope to leverage Lambda for other NASA, federal, and commercial solutions; it’s a game-changing technology.