With Amazon Redshift and Amazon EMR, we’re not only scaling much more effectively, but we’ve been able to cut query performance from hours to seconds. 
Jason Fennell, Vice President of Engineering

Yelp connects people with great local businesses. Since its launch in 2004, Yelp has grown from serving a single city, its home base of San Francisco, to a multinational presence spanning major metros in more than 30 countries. The company’s performance-based advertising and transactional business model led to revenues of more than $500 million during 2015, a 46 percent increase over the prior year. Yelp has evolved into a mobile-centric company, with more than 70 percent of searches and more than 58 percent of content originating from mobile devices.

Yelp’s global growth has led to extremely large traffic volumes: more than 21 million average monthly unique app users, 69 million average monthly unique mobile web visitors, and 77 million average monthly unique desktop visitors. At the end of Q1 2016, Yelp had more than 102 million local reviews. The shift to mobile has been especially challenging because Yelp mobile app users are more than 10 times as engaged as website users, which puts added stress on the analytics infrastructure.

To better understand market trends and its users’ needs, Yelp has traditionally relied on data warehouses that are used daily by a variety of internal teams, including product management, sales, advertising, and mobile. However, Yelp’s rapid growth and shift toward mobile, coupled with ever-increasing demand for analytics from those teams, created performance and efficiency problems.

“The scale and flexibility of our data warehouse infrastructure was not adequate for our needs,” says Jason Fennell, vice president of engineering for search and data mining activities. “Traditionally, we used MapReduce for search logs, but inexperienced developers were spending at least an hour to pull together and go over just one set of data to answer one question. We also had issues with resource contention. We wanted each group to have all the data they needed while still being able to execute independently.”

Yelp, which had already been using a range of Amazon Web Services (AWS) products, started using Amazon Redshift, a fully managed petabyte-scale data warehouse, and Amazon Elastic MapReduce (Amazon EMR), which provides a managed Hadoop framework that simplifies data processing by distributing data dynamically across scalable Amazon EC2 instances. Yelp uses Amazon Simple Storage Service (Amazon S3) to store daily logs and photos of businesses.

Yelp stores approximately 18 months’ worth of advertising information in Amazon Redshift. Teams use the information to understand how ads are being delivered and to train models that will result in more relevant future ads. The company generates multiple terabytes of logs daily, with that data landing in Amazon S3. The data is transformed using MRjob, a Python package for writing and running Hadoop streaming jobs that works with Amazon EMR, and is then loaded into Amazon Redshift. Outside of log processing, Yelp uses Amazon EMR extensively, and has even used it to efficiently and inexpensively crawl the web and analyze Yelp’s partner feed, which powers integrations with companies like Apple, Microsoft, and Yahoo.
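
To make the transformation step more concrete, the sketch below shows the general shape of a map and reduce pass over ad logs, written in plain Python so it runs without a Hadoop cluster. It is an illustration only and not Yelp's actual job: the log format (JSON lines with `ad_id` and `event` fields) and the function names `mapper`, `reducer`, and `run_job` are assumptions made for this example. In an MRjob job, the same mapper and reducer logic would typically live in methods of a job class.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (ad_id, 1) for each impression event; skip malformed lines.

    The JSON field names here are hypothetical, chosen for illustration.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return
    if not isinstance(record, dict):
        return
    if record.get("event") == "impression" and "ad_id" in record:
        yield str(record["ad_id"]), 1


def reducer(ad_id: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Sum the per-line counts for one ad."""
    return ad_id, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Local stand-in for the shuffle phase: group mapper output by key,
    then reduce each group. On a cluster, Hadoop does this grouping."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())
```

The output of a pass like this, one row per ad with an aggregated count, is the kind of compact, tabular result that is practical to load into a data warehouse for interactive queries.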

Yelp creates Redshift clusters on demand as teams need them for specific analytics tasks. Developers can create a new Redshift or EMR cluster with up to 50 nodes through a simple command-line interface. There is no permission process involved and clusters are not shared, so the resources can remain dedicated to the specific analytics task. All clusters have access to Yelp’s complete data set stored on Amazon S3. This elasticity empowers Yelp developers to answer ad hoc analytics queries immediately, with the best possible tools, whenever the need arises.

To orchestrate interactions between MRjob, Amazon S3, and Amazon Redshift, Yelp built and open-sourced a tool called Mycroft. Mycroft watches for new data as it arrives and then automatically performs ETL tasks without requiring any user action. Mycroft has a web interface that is used to monitor the progress of in-flight data loading jobs, and can pause, resume, cancel, or delete existing jobs. It also notifies users via email when new data is successfully loaded or if any issues arise.
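
The "watch for new data, then load it" pattern described above can be sketched roughly as follows. This is not Mycroft's implementation; the function names, table name, bucket, key layout, and IAM role are all hypothetical, and the file format options (gzipped JSON) are an assumption. The sketch only decides which objects are new and builds the Amazon Redshift `COPY` statements that would load them from Amazon S3. Listing the bucket and executing the SQL are left out.

```python
from typing import Iterable, List, Set


def find_new_keys(current_keys: Iterable[str], loaded_keys: Set[str]) -> List[str]:
    """Return S3 object keys that have not been loaded yet, in sorted order."""
    return sorted(key for key in current_keys if key not in loaded_keys)


def build_copy_statement(table: str, bucket: str, key: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for one gzipped JSON object in S3.

    All arguments are assumed to come from trusted configuration; a real
    tool should validate identifiers rather than interpolate them blindly.
    """
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS JSON 'auto' GZIP;"
    )


def plan_loads(
    current_keys: Iterable[str],
    loaded_keys: Set[str],
    table: str,
    bucket: str,
    iam_role: str,
) -> List[str]:
    """Return one COPY statement per newly arrived object."""
    return [
        build_copy_statement(table, bucket, key, iam_role)
        for key in find_new_keys(current_keys, loaded_keys)
    ]
```

A scheduler would call something like `plan_loads` on each polling cycle, run the resulting statements, and add the loaded keys to its record of completed work so that each object is loaded only once.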

Using Amazon Redshift with Amazon EMR has dramatically improved the ability of Yelp’s internal teams to quickly access and analyze data, allowing easy scaling as the company grows.

“We have dozens of Redshift clusters that are owned by different teams,” says Justin Cunningham, tech lead of the Business Analytics team. He notes that the company decided not to deploy a single, “master” cluster but instead created a flexible infrastructure to meet analytics needs as they arise.

“The ability to dynamically create multiple, dedicated clusters eliminates contention issues,” Cunningham says. “Whenever a team wants to run an in-depth analysis on their data, they can do it without needing to consult with any other team. This also decouples development scaling. If one of our teams wants to start a new cluster and bring in a bunch of data, and maybe take it down for a while to facilitate the analysis, they can do that without interfering with other teams.”

Fennell says the data warehousing and analytics infrastructure built on the Amazon products has dramatically expedited daily jobs. “With Amazon Redshift and Amazon EMR, we’re not only scaling much more effectively, but we’ve been able to cut query performance from hours to mere seconds,” he says. “This results in richer, real-time analytics that benefit all of our business teams.” 

To learn more about how AWS can help enterprise analytics, visit our Big Data details page.