Getty Images is a global business-to-business marketplace for digital content. The company provides its customers—who include journalists, media professionals, and corporations—with a suite of content, license models, purchase options and services for licensing still imagery, video and music. Getty Images is an intermediary between content suppliers and a broad set of customers who use digital content in digital and print media, websites, marketing materials, books, publications and video productions. Its products and services offer customers an alternative to commissioning their own content production.
Getty Images, a company renowned for its coverage of the Olympic Games, needed to implement faster relevancy rankings for its content before the Summer 2012 Olympics. With more than 1,000 images a day generated at the Olympics, the Games usually produce high numbers of searches for iconic and trending subjects, and Getty Images needed those searches to return more relevant results.
Getty Images offers robust content search, enabling customers to tailor their results to specific elements so they can find the image they need. The searches are informed by custom ranking algorithms that assign a numerical weight to various elements of a photograph (for instance, the subject, geography or mood). The Getty Images team regularly runs between 15 and 100 relevancy rankings, and each ranking takes an hour to complete. The rankings are performed offline and feed into the online search tool that customers use.
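To make the idea of weighting photograph elements concrete, here is a minimal sketch of a weighted relevancy score. It is an illustration only, not Getty Images' actual algorithm: the element names come from the examples above, while the weight values, the match strengths, and the function names `relevancy_score` and `rank` are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: the case study names subject, geography and mood as
# example elements but does not publish any actual values.
WEIGHTS: Dict[str, float] = {"subject": 0.5, "geography": 0.3, "mood": 0.2}


def relevancy_score(element_matches: Dict[str, float],
                    weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of per-element match strengths (each assumed in [0, 1])."""
    return sum(weights[name] * element_matches.get(name, 0.0)
               for name in weights)


def rank(assets: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (asset_id, score) pairs sorted from highest to lowest score."""
    scored = [(asset_id, relevancy_score(matches))
              for asset_id, matches in assets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Two made-up assets: both match the subject, only one also matches the mood.
example_assets = {
    "asset-a": {"subject": 1.0},
    "asset-b": {"subject": 1.0, "mood": 1.0},
}
ranking = rank(example_assets)
```

In this sketch the asset that matches more weighted elements sorts first; an offline job of this kind would write the scores out for the online search tool to sort on.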
To improve performance before the Olympics, Getty Images needed a solution that would scale. The team needed something that could bolt on to their existing architecture, so that they could minimize risk to their system. And they needed something they could put in place before the Olympics. They had less than a year to build a prototype, get it approved, and put a working version into production.
Before the company adopted the AWS Cloud, the ranking system ran on Windows Server infrastructure. But once the size of the collection metadata went beyond the capacity of the existing in-memory solution, the company decided to migrate to a custom SQL-based solution. However, with gigabytes of data running through the system, scaling SQL tables and stored procedures became an issue.
So the company moved the operation to Apache Hadoop and began investigating custom workflows and cloud solutions.
Getty Images decided to move its relevancy ranking operations to the AWS Cloud to improve performance and save costs. The company’s pre-calculated searches (including searches to find content that is iconic, trending, or both) are processed offline; the output feeds into the search database that customers use every day to find the content they need.
Getty Images’ algorithms weigh input from each asset’s metadata (such as date, subject, and collection) and customer interaction relevancy (CIR) data. The system deploys custom Java classes to Amazon Elastic MapReduce (Amazon EMR) and outputs a score that drives the sort in Getty Images’ search engine. Amazon Elastic Compute Cloud (Amazon EC2) hosts another repository of asset metadata that is used during the calculations, and supports custom workflows as well. Getty Images also uses Amazon Elastic Load Balancing and Amazon Simple Storage Service (Amazon S3), where it stores about 1 gigabyte of CIR data a day. Amazon Elastic Block Store (Amazon EBS) volumes back up the data for those Amazon EC2 instances, and Amazon Simple Workflow Service (Amazon SWF) serves as the orchestration engine.
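The map-and-reduce pattern behind a job like this can be sketched in a few lines. The example below is plain Python rather than Hadoop or Amazon EMR code, and it is not the company's Java implementation: the event types, their weights, and the function names are all hypothetical, chosen only to show how interaction records might be aggregated into a per-asset score.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical interaction weights; the case study does not list event types.
EVENT_WEIGHTS: Dict[str, float] = {"view": 1.0, "download": 5.0}


def map_event(event: Tuple[str, str]) -> Iterator[Tuple[str, float]]:
    """Map step: emit (asset_id, weight) for one (asset_id, event_type) record.

    Records with an unknown event type are dropped.
    """
    asset_id, event_type = event
    if event_type in EVENT_WEIGHTS:
        yield asset_id, EVENT_WEIGHTS[event_type]


def reduce_scores(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group-and-reduce step: sum the emitted weights per asset_id."""
    totals: Dict[str, float] = defaultdict(float)
    for asset_id, weight in pairs:
        totals[asset_id] += weight
    return dict(totals)


def score_events(events: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Run the map step over every event, then reduce to per-asset scores."""
    mapped = (pair for event in events for pair in map_event(event))
    return reduce_scores(mapped)


# Made-up interaction log for two assets.
scores = score_events([
    ("asset-a", "view"),
    ("asset-a", "download"),
    ("asset-b", "view"),
    ("asset-b", "unknown"),
])
```

On a real cluster the map and reduce steps would run in parallel across many nodes, which is what lets this kind of job scale with the volume of input data.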
This was Getty Images’ first venture into the cloud, but AWS proved easy to use right out of the box. “Putting together the proof of concept was easy,” says Michael Faulhaber, Enterprise Architect at Getty Images. “I was already familiar with Hadoop, and the AWS APIs are one of the best selling points of AWS. It’s great to have that capability.”
Once the proof of concept was developed, Getty Images put together a small team to bring it into production. The team needed only 4 months to get the solution up and running in AWS. Had they gone with a more traditional solution, Faulhaber estimates that between purchasing hardware and putting together a simple workflow process, it could have taken Getty Images up to 2 years to work out the kinks. “The flexibility was what initially attracted us,” Faulhaber says. “Amazon EMR seemed like the most flexible way to deploy our application.”
With Amazon EMR, Getty Images reduced the time required to do a ranking from 60 minutes to 15 minutes, a fourfold speedup (300% more rankings in the same amount of time). The company has also avoided expensive infrastructure costs, saving between $500,000 and $1 million. The savings gave Getty Images the opportunity to deploy its IT resources on other projects.
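The arithmetic behind the speedup figure is easy to misread, so here is a short worked sketch. It uses only the 60-minute and 15-minute run times quoted above; the function name `speedup_summary` is an assumption made for illustration.

```python
from typing import Dict


def speedup_summary(before_minutes: float, after_minutes: float) -> Dict[str, float]:
    """Express a run-time change as a factor and as two percentages."""
    speedup = before_minutes / after_minutes
    return {
        # How many times faster one ranking completes.
        "speedup_factor": speedup,
        # How much shorter each run is, relative to the old run time.
        "time_reduction_pct": (before_minutes - after_minutes) / before_minutes * 100,
        # How many more rankings fit into the same amount of time.
        "throughput_gain_pct": (speedup - 1) * 100,
    }


summary = speedup_summary(60, 15)
# summary["speedup_factor"]      -> 4.0
# summary["time_reduction_pct"]  -> 75.0
# summary["throughput_gain_pct"] -> 300.0
```

The same change can therefore be described as a 75% reduction in run time or as a 300% gain in throughput, depending on which quantity is being measured.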
Running on the AWS Cloud rather than on premises has resulted in significant cost savings, Faulhaber says. “To purchase and maintain the servers for the Hadoop cluster and supporting application would have been a large investment,” Faulhaber says. “By utilizing the elastic nature of EMR and EC2, we saw a considerably lower startup cost as well as ongoing maintenance.”
Getty Images has also found the AWS API to be a compelling part of the solution. The company wanted a solution that would complement its existing architecture, rather than requiring extensive integration. “We implemented AWS because it was low-risk in terms of our architecture,” Faulhaber says. “By separating the new application both physically and logically, we did not disrupt our current production applications or environments. With AWS, we were free to scale our application as we were in development rather than having to purchase hardware or implement support services such as Amazon Simple Workflow Service.”
Customers noticed an improvement, too. “We got a lot of strong feedback for the iconic sort, which was new for the 2012 Olympics. Customers have been very happy with the results—and they have AWS to thank.”
All in all, Getty Images has been very pleased with AWS. “It adds flexibility to our implementation and lowers costs significantly,” Faulhaber says. “We’ve had very few problems—we rarely have errors.”
To find out more about how AWS can help you store and process big data, visit our Big Data details page: http://aws.amazon.com/big-data/.