What are natural language processing and machine learning, and how do you use Mechanical Turk?
NLP and machine learning are two fields where the performance of an application is heavily dependent on the quality of the data that is used to build the system. I use Mechanical Turk to create high quality datasets. This helps us learn more about how people process and understand language. We hope that our research will one day enable computers to understand and communicate knowledge as well as people do.
My interest in Mechanical Turk was to explore whether nonexpert annotators could give you the same level of quality as expert annotators. Conducting similar surveys with undergraduate students, graduate students, or researchers could potentially cost many thousands of dollars.
What were your misconceptions about Mechanical Turk and what was your first project?
When I first started my experiments with Mechanical Turk, I wasn’t particularly convinced I would get high quality data. So, my initial skepticism actually led me to conduct a series of experiments to question whether or not the annotators on Mechanical Turk, these anonymous, non-expert annotators, would be able to produce high quality datasets. Having heard from other researchers both ways – that a lot of the data that comes out is just junk, or sometimes it’s really of high quality – I wanted to run my own experiments and measure the quality of annotations myself.
I experimented with a variety of natural language tasks, mostly where various machine learning algorithms have already been applied, but where the typical performance is still nowhere near human-level ability. However, these are still tasks that can be performed adequately by a native English speaker, that is, it wouldn’t require substantial training beyond basic language comprehension. Our goal was to see if we could create datasets for these tasks that we could potentially use to train and improve natural language processing algorithms.
For our first tasks, we chose problems where a number of high-quality datasets had already been assembled by linguists and graduate students. We used this data to compare the Turker annotations to the original expert labels, to see whether nonexpert quality could match that of the experts. What we found was that, in general, if you just asked a single annotator on Turk to label all of the samples, the quality wasn't nearly as good as if you went and asked a linguist or graduate student to do the same thing. However, if you broke the work up among a large group of Turkers and asked them to perform multiple independent annotations per question, you could actually do quite a bit better than the experts. Say we collected 10 separate independent annotations for each question and then aggregated the responses by voting or averaging; we found the resulting data to be better than that of expert annotators.
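The voting scheme described above can be sketched in a few lines. This is a minimal illustration with hypothetical item IDs and labels, not the actual pipeline from the study: each item receives several independent labels, and the aggregated label is whichever one the most annotators chose.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among independent annotations."""
    return Counter(labels).most_common(1)[0][0]

def aggregate(annotations):
    """Collapse per-item annotation lists into single labels.

    annotations: dict mapping item id -> list of labels from
    independent annotators (e.g. 10 Turkers per item).
    """
    return {item: majority_vote(labels) for item, labels in annotations.items()}

# Hypothetical example: three items, each labeled by five annotators.
votes = {
    "q1": ["pos", "pos", "neg", "pos", "pos"],
    "q2": ["neg", "neg", "neg", "pos", "neg"],
    "q3": ["pos", "neg", "pos", "pos", "neg"],
}
print(aggregate(votes))  # {'q1': 'pos', 'q2': 'neg', 'q3': 'pos'}
```

The intuition is the familiar wisdom-of-crowds effect: as long as annotators err independently, the majority label is more reliable than any single annotator's label.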
What other studies have you conducted with Mechanical Turk?
Another group of tasks was to recognize the emotions evoked by a series of headlines. Here we found that we could achieve the same level of quality as expert annotators by using an average of four Turker annotations per expert annotation.
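For numeric judgments like emotion ratings, the aggregation step is averaging rather than voting, and agreement with experts can be measured by correlation. The sketch below uses hypothetical ratings (not data from the study): each headline is scored by several Turkers, the scores are averaged, and the averaged nonexpert scores are correlated with expert scores.

```python
def mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores: each headline rated 0-100 for one emotion by
# four Turkers; averaging yields one nonexpert score per headline.
turker_ratings = [[80, 70, 90, 75], [10, 20, 15, 5], [50, 60, 55, 45]]
expert_scores = [85, 10, 50]
nonexpert_scores = [mean(r) for r in turker_ratings]
print(pearson(nonexpert_scores, expert_scores))
```

A correlation near 1.0 would indicate that the averaged nonexpert ratings track the expert ratings closely.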
These findings are discussed at length in our paper “Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks”, presented at EMNLP (Empirical Methods in Natural Language Processing) this summer. It’s also summarized in a post at the Dolores Labs blog written by my co-author Brendan O’Connor, under the title “AMT is Fast, Cheap, and Good for Machine Learning Data.”
How is Mechanical Turk a game changer in machine learning and NLP?
One of the major revolutions in applied machine learning is that dataset creation is now far more accessible. Mechanical Turk allows anyone to send tasks out to a large number of people and obtain a great number of different responses and opinions on a subject very quickly, and very cheaply. In the past, creating similar datasets through user studies has been expensive and time-consuming.
I think that the academic community is still early in discovering the potential of Mechanical Turk. One of the earliest papers I know of that used AMT for data collection was by Qi Su in the proceedings of WWW 2007, “Internet-Scale Collection of Human-Reviewed Data”, and there has been a slow but steady stream of papers this year on similar topics. Mechanical Turk is not really widespread in academic use now, but I do feel that, within the last year or so, it has started to become much more popular. I expect that it – or something like it – will be much more widespread within the next year, or within a couple of years.
How did you hear about Mechanical Turk?
The first time I heard about Mechanical Turk was in 2007 when Turkers were searching for Jim Gray’s boat. An influential computer scientist was lost at sea, and there was a Mechanical Turk task with satellite images that asked Turkers to identify unusual elements that might be Jim Gray’s boat. It would be ideal if we had a computer vision algorithm that could reliably go through those satellite photos and do the same thing but we’re just not there yet. By being able to break the job into small chunks, you can cover a lot of space in a small amount of time.
I was more formally introduced to Mechanical Turk as a research tool while I was consulting with Powerset, a semantic search company. Building out search algorithms requires a lot of testing and evaluation, and you really want as large a search quality assessment group as possible. Going to Mechanical Turk to annotate the quality of the search results really made a lot of sense.
Why do you think workers put in as much effort as they do?
The best studies I’ve seen on understanding the makeup of the workers on Mechanical Turk have been done by Panagiotis Ipeirotis, a professor at the Stern School of Business at NYU. He’s studied a lot of different aspects of Mechanical Turk and has been conducting surveys asking Turkers: “Where are you from?”, “Why do you do this?”, and collecting general demographics of the workers who complete his tasks. He’s found that there is a very large number of people who are just doing it for fun, who are bored, who are not necessarily relying on Mechanical Turk as their primary means of income, but rather doing it as a form of entertainment, and who are happy to make a little extra money on the side.
So, if you believe the survey results that Ipeirotis has collected, the vast majority of workers on Mechanical Turk aren’t relying on their AMT work as their primary source of income. They’re doing tasks primarily for the entertainment value; making a bit of money and having an intellectual challenge is what makes AMT more appealing than playing a Flash game, watching television, or other ways of spending leisure time.
Is there work that you used to do that you now rely on Mechanical Turk for?
There are some datasets that I used to sort through personally, or would ask student annotators to sort through; now I rely on the Mechanical Turk workforce. For example, when you’re trying to evaluate how well an algorithm is performing, you often need one fixed test set to run multiple variations against, similar to the case of search quality evaluation. If you don’t have a fixed test set, you might need a much larger amount of data in order to manually label the outputs of each version of your system. This can be incredibly time-consuming, especially as you continue to update and revise your algorithm. Now we are able to send a set of system outputs to Mechanical Turk and get several thousand judgments over the course of a day for just a couple of dollars.
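The fixed-test-set workflow described above can be sketched as follows. This is a hypothetical illustration (the item IDs, labels, and system versions are invented): the test set is labeled once, e.g. with aggregated Turker judgments, and every revision of the system is then scored against those same gold labels, so no new labeling is needed per revision.

```python
def accuracy(predictions, gold):
    """Fraction of items where a system's prediction matches the gold label."""
    assert predictions.keys() == gold.keys(), "systems must cover the full test set"
    correct = sum(predictions[item] == gold[item] for item in gold)
    return correct / len(gold)

# Hypothetical fixed test set, labeled once (e.g. "rel"/"irrel"
# relevance judgments aggregated from several Turkers per item).
gold = {"s1": "rel", "s2": "irrel", "s3": "rel", "s4": "rel"}

# Two revisions of the same system, both scored against the same gold set.
v1 = {"s1": "rel", "s2": "rel", "s3": "irrel", "s4": "rel"}
v2 = {"s1": "rel", "s2": "irrel", "s3": "rel", "s4": "irrel"}
print(accuracy(v1, gold), accuracy(v2, gold))  # 0.5 0.75
```

Because the gold labels are fixed, the scores of v1 and v2 are directly comparable, which is the point of labeling one test set up front rather than relabeling each system's outputs.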
What do you think your time and cost savings are?
In our most recent work we were able to get more than 20,000 annotations across five different projects for about $26. So, roughly speaking, we’re looking at on the order of 1,000 annotations per dollar. On average, we collect about 10,000 annotations per day per project using AMT.
In my previous annotation studies, convincing multiple annotators to do a project was difficult and time-consuming. Mechanical Turk provides an always-on, always-available army of workers to do the task. I no longer have to spend the time to personally organize a group of annotators for each experiment and this is an enormous savings.
What advice do you have for other Requesters using Mechanical Turk?
My primary advice for other researchers who are considering using Mechanical Turk is to jump right in. Start running experiments as soon as possible. Mechanical Turk has changed the way I look at problems. This is a direct result of realizing that I now have access to an army of thousands of workers. Mechanical Turk has become a new tool in my toolbelt, allowing me to consider approaches to problems that involve manual processes I might otherwise have tried to avoid. In many cases I wouldn’t have considered these solutions feasible without Mechanical Turk.
What are your future plans with Mechanical Turk?
I’m definitely interested in using Mechanical Turk on a long-term basis. Until we can actually use a computer to understand and translate human communication at the same level as people do, we will still need to learn from people themselves. Human communication is a complicated process, and in many ways we’re still learning about its most basic elements. Mechanical Turk represents a promising method for really getting at this data.
There is a lot of work in natural language processing and other applications of machine learning that relies heavily on being able to get accurate evaluations. If we can use Turkers to perform massive evaluations and improve these processes even just a little bit, then we’ll only continue to get better at designing more accurate, higher-performing systems.
You can reach Rion Snow at email@example.com or learn more about the Stanford AI Lab (SAIL) on their website.