The documents covered Hillary Clinton’s daily activities as First Lady during President Bill Clinton’s two terms in office, from 1993 to 2001, and were being made public under the Freedom of Information Act after multiple requests from journalists and watchdog organizations.
Harkins knew that reporters would be very interested in this data, but it would take hundreds of man-hours to pore through the documents’ low-quality PDF files. So, about 45 minutes after the release, Harkins started working with the data, trying to find a way to convert the images into usable, searchable text and deliver it to the newsroom within the same news cycle.
Harkins first tested various PDF and Optical Character Recognition (OCR) tools to convert the images into machine-readable text. With these software tools, he estimated that it would take about 30 minutes per page to process the sizable document, including reformatting, resizing, and scanning each page.
Working against time, Harkins moved the project to the cloud, specifically Amazon Elastic Compute Cloud (Amazon EC2). With Amazon EC2, he launched 200 server instances to process the images to his specifications. At a processing speed of approximately 60 seconds per page, the project was completed within nine hours, and the results were sent to the eager writers, who began searching the data. Harkins and his team then created a polished web interface and made their searchable database available to the public 26 hours later.
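The fan-out idea behind this step, splitting a large document into pages and handing them to many workers at once, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a local thread pool rather than EC2 instances, and `process_page` and `process_document` are hypothetical names standing in for the per-page reformatting, resizing, and OCR work, which the article does not describe in detail.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_id: int) -> str:
    """Hypothetical stand-in for the per-page work (reformat, resize, OCR)."""
    return f"text of page {page_id}"


def process_document(page_ids, workers=4):
    """Fan pages out across a pool of workers and collect the text by page."""
    page_ids = list(page_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map() preserves input order, so results line up with page_ids.
        texts = pool.map(process_page, page_ids)
        return dict(zip(page_ids, texts))


if __name__ == "__main__":
    results = process_document(range(1, 9), workers=4)
    print(len(results), "pages processed")
```

Because each page is processed independently, adding workers shortens the wall-clock time without changing the per-page work, which is the property that made renting many instances for a few hours attractive here.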
Harkins reflects, “EC2 made it possible for this project to happen at the speed of breaking news. I used 1,407 hours of virtual machine time for a final expense of $144.62. We consider it a successful proof of concept.”
The database of Hillary Clinton’s 1993-2001 Schedule is publicly available at: http://projects.washingtonpost.com/2008/clinton-schedule
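The figures quoted in this account can be sanity-checked with a little arithmetic. The sketch below uses only numbers stated in the text (1,407 machine hours, $144.62, 200 instances, nine hours). The capacity ceiling assumes, purely for illustration, that all 200 instances could have run for the full nine hours, which the article does not state.

```python
# Figures taken from the article.
machine_hours_used = 1407   # hours of virtual machine time
total_cost_usd = 144.62     # final expense
instances = 200             # server instances launched
wall_clock_hours = 9        # project completed within nine hours

# Average price paid per machine hour (roughly ten cents).
cost_per_hour = total_cost_usd / machine_hours_used

# Assumption: every instance available for the whole nine hours.
max_machine_hours = instances * wall_clock_hours

# Share of that ceiling accounted for by the hours actually used.
utilization = machine_hours_used / max_machine_hours

print(f"cost per machine hour: ${cost_per_hour:.4f}")
print(f"ceiling: {max_machine_hours} machine hours")
print(f"share of ceiling used: {utilization:.0%}")
```

The 1,407 hours used fall below the 1,800-hour ceiling, so the quoted numbers are consistent with a fleet of 200 instances finishing inside nine hours.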