Initially, the team stored both crawler queues and page metadata in a centralized relational database on Amazon EC2. While this was easy and flexible, it soon became a bottleneck because of the volume of data: hundreds of thousands of records per day. “Our distributed crawlers were ready to handle their job, but slowed down anytime they needed a fresh queue or wanted to process a page.”
After a quick examination of their data, the team discovered that: 1) pre- and post-processed records were very small, 2) the schema changed frequently depending on the publisher and the metadata available, and 3) retrieving and processing records did not require complex queries or functionality.
Based on these findings, Kehalim turned to a simpler, more scalable solution: Amazon SimpleDB. “SimpleDB basically simplified our process, solving our locking and performance issues while being used as an endless repository for what is essentially a large key-value database. Crawler machines running on EC2 were quickly tapped using the API and were ‘speaking’ to each other without interfering with (or even touching) our relational database. The ultra-low latency and high throughput achievable by directing requests towards SimpleDB directly from an EC2 instance proved to be a powerful aspect of the architecture,” says Lugassy.
In addition to using Amazon SimpleDB, Kehalim also took advantage of the newly launched Amazon Relational Database Service (Amazon RDS) to store high-performance impressions and related metadata. Kehalim is now actively processing approximately 300 GB of bandwidth, split between Amazon RDS and SimpleDB. “The ability to rapidly and automatically provision new queues and the required associated tables (in RDS) and domains (in SimpleDB) is a huge time saver and allows us to focus on the use of that data, rather than the headaches of administering it.”
Lugassy summarizes, “Because of the different requirements, advantages and limitations, I would recommend that developers who handle different sets of data consider a combination of both SimpleDB and a relational database (notably Amazon RDS). For data that is rather small, constantly schema-changing, easily re-created if needed and can span across different domains – use SimpleDB. For records of a larger size, consistent in nature and data that you wish to examine more closely, perform advanced queries or joins – use RDS.”
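The rule of thumb in this recommendation can be written down as a small decision function. The following Python sketch is only an illustration, not Kehalim's code: the `DatasetProfile` fields, the `choose_store` function, and the 1 KB cutoff for "rather small" are all hypothetical, since the case study gives no concrete thresholds.

```python
from dataclasses import dataclass

# Hypothetical cutoff for "rather small"; the case study gives no number.
SMALL_RECORD_BYTES = 1024


@dataclass
class DatasetProfile:
    """Illustrative traits of a data set, named after the criteria in the quote."""
    typical_record_bytes: int
    schema_changes_often: bool
    easily_recreated: bool
    needs_joins_or_advanced_queries: bool


def choose_store(profile: DatasetProfile) -> str:
    """Return "SimpleDB" or "RDS" following the quoted rule of thumb."""
    # Joins and advanced queries point to the relational side.
    if profile.needs_joins_or_advanced_queries:
        return "RDS"
    # Small, schema-flexible, re-creatable data fits a key-value store.
    if (profile.typical_record_bytes <= SMALL_RECORD_BYTES
            and profile.schema_changes_often
            and profile.easily_recreated):
        return "SimpleDB"
    # Larger or more consistent records default to the relational store.
    return "RDS"


# Example profiles (made-up values, for illustration only).
crawler_queue = DatasetProfile(300, True, True, False)
impressions = DatasetProfile(4096, False, False, True)
print(choose_store(crawler_queue))  # SimpleDB
print(choose_store(impressions))    # RDS
```

In practice the cutoff and the criteria would be tuned to the workload; the point of the sketch is only that the choice is made per data set rather than once for the whole application.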
Kehalim also uses Amazon S3 for product thumbnails of multiple sizes, Amazon CloudFront to push popular items to edge locations, and Amazon EC2 for computing. Additional application, crawler/bots and caching instances are brought up or down based on demand.
To learn more about Kehalim, visit kehalim.com.