AWS Case Study: Lingit
Dyslexia is a surprisingly common condition. A 2005 study estimated that between 5% and 17% of the U.S. population is affected by dyslexia and unable to read and write adequately. While there is no known cure for the underlying cause of dyslexia, Lingit is trying to help people with dyslexia cope.
Lingit has created software that helps individuals with dyslexia improve their written communication. For a dyslexic person, standard word processing spell checkers are often insufficient. For example, if a word deviates too much from the intended word, a spell checker will not provide alternative suggestions. Or, if a word is spelled correctly but used in the wrong context, the spell checker may not catch it. Lingit understands the deficiencies of most spell checkers and aims to offer a better alternative: a more powerful spell checker with additional writing support, such as word prediction and the ability to read text aloud.
Lingit knew that in order to build a statistically accurate language model, there was one fundamental rule: the more data you have, the better. To build this tool, Lingit turned to Atbrox, a Norwegian company specializing in data mining, data analysis, and cloud-based solutions.
Why Amazon Web Services
Because they were dealing with structured collections of text, or text corpora, of up to a terabyte in size, Atbrox created a solution using Amazon Web Services (AWS). Lingit uploads its data to Amazon Simple Storage Service (Amazon S3), starts the extraction process in parallel on an arbitrary number of compute instances using Amazon Elastic MapReduce (Amazon EMR), and finally downloads the resulting files from Amazon S3 once the Amazon EMR job has finished.
Amund Tveit, founder of Atbrox, describes the solution: “The job is divided into four different phases, each phase having a map and a reduce operation. First, the raw data is tokenized, meaning that the text is split into single words or tokens. Next, certain tokens, such as dates and phone numbers, are normalized into a standard notation according to Lingit’s requirements. The third phase is where the tokens are grouped into sentences based on a set of rules that looks at special tokens such as abbreviations and punctuation. Finally, the n-grams are extracted and written as a set of files to Amazon S3.”
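The four phases Tveit describes can be sketched as plain functions. This is a minimal, single-machine illustration only: in the actual solution each phase runs as a map/reduce job on Amazon EMR over terabytes of text, and the specific tokenization and normalization rules (here, mapping four-digit years and other numbers to placeholder tokens) are illustrative assumptions, not Lingit's actual rules.

```python
import re
from collections import Counter

def tokenize(text):
    # Phase 1: split raw text into word tokens and sentence-ending punctuation.
    return re.findall(r"\w+|[.!?]", text)

def normalize(tokens):
    # Phase 2: map special tokens to a standard notation.
    # The placeholder tokens <YEAR> and <NUM> are hypothetical examples.
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d{4}", tok):
            out.append("<YEAR>")
        elif tok.isdigit():
            out.append("<NUM>")
        else:
            out.append(tok.lower())
    return out

def split_sentences(tokens):
    # Phase 3: group tokens into sentences at end-of-sentence punctuation.
    sentences, current = [], []
    for tok in tokens:
        if tok in {".", "!", "?"}:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)
    return sentences

def extract_ngrams(sentences, n=2):
    # Phase 4: count n-grams within each sentence, never across
    # sentence boundaries.
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts
```

Chaining the phases, `extract_ngrams(split_sentences(normalize(tokenize(corpus))))` yields the n-gram counts that feed the statistical language model; in the EMR version, each function body would become the map step of its phase, with the reduce steps merging per-instance results.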
For Lingit, the cost and time savings of using Elastic MapReduce are substantial. Building their own infrastructure for getting similar results in the same amount of time would require considerable upfront investment. What is more, the purchased hardware would sit idle most of the time, as language model building is a fairly infrequent operation.
“Cloud computing is a perfect solution for a company such as ours,” said Prof. Torbjørn Nordgård, CEO of Lingit. “With Amazon Web Services, we can experiment with different approaches towards statistical language model building and get results in a short amount of time. It doesn’t cost a lot either. Ultimately, this helps us to innovate and to keep improving our products.”