Given the massive volumes of available manufacturer product information (product descriptions, data sheets, and CAD files), Autodesk Seek must use explicit domain knowledge models to analyze the data and prepare it so that searches by Autodesk Seek users return relevant results. With an explicit objective of delivering freshly processed content indexes in drastically reduced timeframes (within eight hours), the Autodesk team built a scalable backend processing network layered on top of Amazon EC2 (Elastic Compute Cloud), Amazon S3 (Simple Storage Service), Amazon SQS (Simple Queue Service), and Amazon SimpleDB.
Autodesk Seek is driven by a compute-intensive, advanced “web-crawl” that gathers, downloads, and processes external data using CAD engines. Designing its architecture with scalability in mind, the Autodesk Seek team developed a massively parallel job infrastructure on top of EC2. “We use Amazon EC2 for almost all processes including pre-crawlers, crawlers, processing engine, and indexing. We run separate EC2 instances for each data source and data set which helps us process content more quickly,” says Mike Haley, Senior Manager for SaaS Technologies at Autodesk. “With a continuously growing data set, the free-form scalability of both the EC2 environment and our parallel architecture is critical to delivering this service in a timely manner.”
Because Autodesk Seek processes and serves large CAD files, a scalable storage system is also a mission-critical requirement. With such large volumes of data, Amazon S3 has become a key element in Autodesk Seek’s data processing pipeline. The Autodesk Seek engineers built an intermediate processing pipeline using “versioned data sharding,” in which product data is broken up into small chunks, moved from Amazon S3 to EC2 for processing, and then moved back to Amazon S3 for long-term storage. In addition to serving and storing Autodesk Seek data, Amazon S3 is also used as a content hosting service for Autodesk’s catalog data providers.
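The case study does not describe how the shards are laid out, so the following is only a minimal sketch of one way “versioned data sharding” could look. The key format, the `make_shards` helper, the dataset name, and the shard size are all assumptions for illustration, and a plain dictionary stands in for an Amazon S3 bucket so the example runs without any AWS account.

```python
import json
from typing import Dict, Iterator, List, Tuple


def make_shards(dataset: str, version: int, records: List[dict],
                shard_size: int) -> Iterator[Tuple[str, bytes]]:
    """Yield (object_key, payload) pairs, one per fixed-size chunk of records."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    for index, start in enumerate(range(0, len(records), shard_size)):
        chunk = records[start:start + shard_size]
        # Hypothetical key layout: the version is part of the key, so a new
        # processing run never overwrites the shards of an earlier run.
        key = f"{dataset}/v{version}/shard-{index:05d}.json"
        yield key, json.dumps(chunk).encode("utf-8")


# A plain dict stands in for an object store such as an S3 bucket.
bucket: Dict[str, bytes] = {}
records = [{"product_id": i} for i in range(5)]
for key, payload in make_shards("acme-catalog", 2, records, shard_size=2):
    bucket[key] = payload
```

Because each shard is a small, independently addressable object, separate workers could fetch, process, and write back different shards at the same time, which is the property the pipeline described above relies on.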
To round out its virtual infrastructure, Autodesk Seek also uses Amazon SimpleDB and Amazon SQS. “We currently use SimpleDB as our central state store that tracks what feeds of information need processing and what current jobs are being processed. Due to the asynchronous nature of our system, the ‘eventually consistent’ model of SimpleDB works well for us,” added Haley.
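As a rough illustration of a central state store of this kind, the sketch below tracks which feeds are waiting and which are being processed. The `StateStore` class, the attribute names (`status`, `worker`), and the feed names are hypothetical, and the class is an in-memory stand-in rather than the SimpleDB API.

```python
from typing import Dict, List


class StateStore:
    """In-memory stand-in for a key/attribute store such as SimpleDB."""

    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, str]] = {}

    def put(self, item_name: str, **attributes: str) -> None:
        # Merge the attributes into the item, creating it if needed.
        self._items.setdefault(item_name, {}).update(attributes)

    def select(self, **where: str) -> List[str]:
        # Return the sorted names of items matching every condition.
        return sorted(
            name for name, attrs in self._items.items()
            if all(attrs.get(key) == value for key, value in where.items())
        )


store = StateStore()
store.put("feed-acme", status="pending")
store.put("feed-globex", status="pending")
# A worker picks up one feed and records that it is now being processed.
store.put("feed-acme", status="processing", worker="i-0001")

pending = store.select(status="pending")
processing = store.select(status="processing")
```

Unlike this in-memory version, an eventually consistent store may briefly return a stale view after a write, so a design along these lines generally needs jobs that are safe to run more than once.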
Amazon SQS is used as the messaging infrastructure to communicate important job information between EC2 instances at each stage of the data processing pipeline. “SQS gives us good scalability in terms of moving messages between EC2. Plus, the persistent nature of SQS also gives us some degree of fault-tolerance at each stage that is critical when processing large volumes of content continuously.”
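The fault tolerance mentioned here can be sketched as a pipeline stage that only discards a message once it has been handled successfully. This is an assumption-laden illustration, not Autodesk's code: Python's `queue.Queue` stands in for an SQS queue, and `run_stage`, `flaky_crawl`, and the feed names are hypothetical.

```python
import queue
from typing import Callable, Dict, List


def run_stage(inbox: "queue.Queue[str]", outbox: "queue.Queue[str]",
              handler: Callable[[str], str], max_attempts: int = 3) -> None:
    """Drain inbox, passing each successfully handled message to outbox."""
    attempts: Dict[str, int] = {}
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return
        try:
            outbox.put(handler(message))
        except Exception:
            # Handling failed: put the message back so it is retried,
            # up to max_attempts deliveries in total.
            attempts[message] = attempts.get(message, 0) + 1
            if attempts[message] < max_attempts:
                inbox.put(message)


failed_once = set()


def flaky_crawl(message: str) -> str:
    # Hypothetical handler that fails the first time it sees "feed-globex".
    if message == "feed-globex" and message not in failed_once:
        failed_once.add(message)
        raise RuntimeError("transient failure")
    return f"crawled:{message}"


crawl_queue: "queue.Queue[str]" = queue.Queue()
process_queue: "queue.Queue[str]" = queue.Queue()
for feed in ("feed-acme", "feed-globex"):
    crawl_queue.put(feed)

run_stage(crawl_queue, process_queue, flaky_crawl)

results: List[str] = []
while not process_queue.empty():
    results.append(process_queue.get())
```

With a real queue service the retry would normally come from the message becoming visible again after a timeout rather than from an explicit re-enqueue, but the effect is the same: a worker that fails mid-job does not lose the message.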
“Together, Autodesk Seek and Amazon Web Services infrastructure provide a completely scalable and low cost solution for building data- and processing-intensive web-based applications,” said Haley. “In previous years this required significant upfront investment in physical infrastructure. Now, both our time-to-market and risk have been significantly reduced.”
For more on Autodesk Seek, go to http://seek.autodesk.com/.