Josef Bookert, Head of Public Relations for Rent Jungle, a search engine for apartment listings, explains their use of AWS for their web spidering engine.
Hi Josef, briefly tell us about your business.
Rent Jungle is a search engine for apartment listings. We spider the Internet and aggregate over 5 million apartments and houses for rent across 12,000 different property management and apartment owner websites.
We display our results in both a list view and a full-screen Google Map. Users can search listings using full-text queries (e.g., "lofts with a view of the river"), which is a large improvement over other sites that only allow filtering by number of bedrooms and price.
The core of our technology is our spidering engine, which is hosted entirely on Amazon Web Services (AWS). This technology can, with almost no human intervention, determine what is a rental listing and what is not. It can also accurately parse out the pricing, address, bedrooms, baths, and photos from the listing without ever having seen the web page before. This is much more advanced than just "scraping" and allows us to have 10 times the listings of other sites (including Craigslist).
How have you incorporated Amazon Web Services as part of your architecture? What services are you using and how?
Amazon Elastic Compute Cloud (Amazon EC2): We use multiple Amazon EC2 instances for our spidering engine. We spider and parse over 15 million Web pages each month on Amazon EC2. We also download and resize over 2 million images each month.
Amazon Mechanical Turk: For apartment listings where our AI parsing engine is unsure that it has found the correct bedrooms, price, and image, we auto-generate a Turk HIT to have a human double check the results. We had over 30,000 apartment listings evaluated by Turk workers last month.
Amazon Simple Storage Service (Amazon S3): We store over ten million web pages and four million images in Amazon S3. We store all the historical apartment listings we observe over the course of a year in Amazon S3 (over 15 million records). This historical data is pulled from Amazon S3 and analyzed using Amazon Elastic MapReduce. We also back up all our production MySQL servers to Amazon S3.
Amazon CloudFront: All the apartment images on the site (~1 million) are hosted on Amazon CloudFront.
Amazon Elastic MapReduce: Each month we launch a 20-instance Amazon Elastic MapReduce job to analyze our 20 million rows of historical rent data. We calculate averages, median rents, etc.
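To illustrate the kind of monthly aggregation mentioned for Elastic MapReduce, here is a minimal single-machine sketch in Python. It is not Rent Jungle's actual job: the row layout (city, bedrooms, rent), the function name `rent_stats`, and the sample rows are all made up for illustration. A real MapReduce job would distribute the same group-then-summarize logic across instances.

```python
from collections import defaultdict
from statistics import mean, median


def rent_stats(rows):
    """Group (city, bedrooms, rent) rows and summarize rent per group.

    The row layout is an assumption made for this sketch, not the
    schema used in the interview.
    """
    groups = defaultdict(list)
    for city, bedrooms, rent in rows:
        # "Map" step: key each rent by its (city, bedrooms) group.
        groups[(city, bedrooms)].append(rent)

    # "Reduce" step: compute count, mean, and median for each group.
    return {
        key: {"count": len(rents), "mean": mean(rents), "median": median(rents)}
        for key, rents in groups.items()
    }


# Made-up sample data, purely for demonstration.
sample_rows = [
    ("Sample City", 1, 800),
    ("Sample City", 1, 1000),
    ("Sample City", 1, 900),
    ("Sample City", 2, 1200),
    ("Sample City", 2, 1400),
]

stats = rent_stats(sample_rows)
for key, summary in sorted(stats.items()):
    print(key, summary)
```

The median is worth computing alongside the mean because a handful of very expensive listings can pull the average up without moving the typical rent.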
We started on AWS, and are almost 100% on AWS, except for our web hosting, which we are in the process of also moving to AWS.
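The Mechanical Turk step described above, where listings the parser is unsure about go to a human for a second look, can be sketched as a simple confidence split. This is a hypothetical illustration only: the `ParsedListing` type, the `confidence` score, and the 0.9 threshold are assumptions, and the sketch stops short of creating a HIT.

```python
from dataclasses import dataclass

# Hypothetical cutoff: listings scoring below this go to human review.
REVIEW_THRESHOLD = 0.9


@dataclass
class ParsedListing:
    url: str
    bedrooms: int
    price: int
    confidence: float  # assumed parser score in [0, 1]


def split_for_review(listings, threshold=REVIEW_THRESHOLD):
    """Return (accepted, needs_review) lists based on parser confidence."""
    accepted, needs_review = [], []
    for listing in listings:
        if listing.confidence >= threshold:
            accepted.append(listing)
        else:
            # In a real pipeline this is where a review task would be queued.
            needs_review.append(listing)
    return accepted, needs_review


# Made-up listings for demonstration.
listings = [
    ParsedListing("http://example.com/a", bedrooms=2, price=1200, confidence=0.97),
    ParsedListing("http://example.com/b", bedrooms=1, price=850, confidence=0.62),
    ParsedListing("http://example.com/c", bedrooms=3, price=1900, confidence=0.90),
]

accepted, needs_review = split_for_review(listings)
print(len(accepted), "accepted;", len(needs_review), "sent for human review")
```

Raising the threshold sends more listings to human review, which costs more per listing but catches more parsing mistakes.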
Why did you choose AWS?
As a startup, we needed to scale our IT spend with revenue. In the beginning, we also needed to be modular: we had no money, yet we still needed lots of CPU and memory, but only for a short time (maybe a week out of the month). AWS was the perfect fit, and now we have multiple large instances running 24x7.
Can you share any metrics related to your use of AWS?
How has AWS most helped your business?
Buying the infrastructure up front would have been impossible for a startup like ours that relies on spidering. We needed the scalability AWS offers, with capacity we can turn on and off on demand.
Do you have any future plans to incorporate other AWS solutions?
Yes, we are moving our web hosting over from another provider's cloud sites because we want root-level access to both the database and the server. We also think it may be a tad cheaper.
To learn more, visit http://www.rentjungle.com/.
Added July 1, 2011