Innovating with ML at scale using Amazon EKS with Booking.com
Learn how online travel platform Booking.com migrated its search ranking system to AWS to increase scalability and innovation.
Overview
With millions of travelers worldwide relying on Booking.com for their travel accommodations, the company wants to deliver the most reliable, responsive, and accurate search experience possible. To achieve this, Booking.com’s machine learning (ML) teams continually experiment with new ML models for the company’s search ranking and ML inference systems. However, the company’s on-premises infrastructure couldn’t scale quickly, limiting the teams’ ability to innovate.
To keep pace with the growing need for resources to test new models, Booking.com migrated its on-premises search ranking system to Amazon Web Services (AWS). Now, the system has increased reliability, responsiveness, and scalability, and the company’s ML team can explore new opportunities and innovations—helping Booking.com deliver highly personalized search experiences for travelers around the globe.
About Booking.com
Founded in 1996 in Amsterdam, the Netherlands, Booking.com is one of the world’s leading travel platforms.
Opportunity | Using AWS to support ML workloads for Booking.com
Booking.com aims to provide its customers with a connected travel experience, where users can access all trip-related reservation needs—such as flights, taxis, and rental cars—in one place.
The company uses ML ranking models to sort hotels so that search results are personalized to each user. Booking.com’s ML team wanted to experiment with new models to improve the search experience but was limited by the resources available for its on-premises workloads.
“There were a lot more models that we wanted to test than there were resources available,” says Alibek Datbayev, engineering manager at Booking.com. “We needed to be mindful of the resources when we did experiments, and that put a lot of constraints on the compute.” What’s more, when a model had a positive impact but required more resources to implement, it could take weeks or months to procure the necessary servers.
The company needed a way to scale compute quickly to handle large datasets without introducing latency or increasing costs. “We wanted to have reasonable performance at a reasonable cost,” says Datbayev. Booking.com had used AWS services in other modernization initiatives, so the company knew that AWS was the right environment for rearchitecting its search ranking system.
Solution | Creating a scalable, flexible architecture to support ML
Using AWS, Booking.com built an internal platform to host the ML infrastructure through which the internal team can self-serve and experiment with ML features. “We had deep discussions with the AWS team on pretty much every step, from estimating costs to technical guidance on the architecture,” says Datbayev.
The container-based architecture is built on Amazon Elastic Kubernetes Service (Amazon EKS), a service to build, run, and scale production-ready Kubernetes applications easily across any environment. By deploying on a dedicated Amazon EKS cluster, the teams gained the isolation and flexibility needed to serve hundreds of thousands of ranking requests per second while maintaining a seamless user experience. “Using Amazon EKS, we have a lot more freedom and flexibility to manage workloads and scale as we need per ML model,” says Datbayev.
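To make "scale as we need per ML model" concrete, here is a minimal Python sketch of a proportional scaling rule, the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler (without its tolerance band). The case study does not say which autoscaling mechanism Booking.com uses, so the function, the model names, and all numbers below are hypothetical and purely illustrative.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=100):
    """Proportional scaling: replicas grow with load relative to a per-replica target.

    current_load and target_load are per-replica averages of the same metric
    (for example, requests per second). All values here are hypothetical.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_replicas * current_load / target_load)
    # Clamp to the configured bounds so one model cannot starve the cluster.
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical models: (current replicas, observed load per replica in req/s).
models = {
    "ranker_baseline": (4, 900.0),
    "ranker_experiment": (10, 100.0),
}

# Each model is sized independently against the same per-replica target.
plan = {
    name: desired_replicas(replicas, load, target_load=600.0)
    for name, (replicas, load) in models.items()
}
```

Because each model is sized on its own, a busy production ranker can scale out while a lightly used experimental model shrinks, which is the kind of per-model flexibility the quote describes.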
The company uses Amazon SageMaker AI—a fully managed service that brings together a broad set of tools to facilitate high-performance, low-cost AI model development—to host large language models and automated moderation models, further enhancing the customer experience.
The migration was broken into three phases to verify that it wouldn’t negatively impact operations. During the first phase, the teams migrated ML inference workloads to AWS to verify that separating the inference from the on-premises backend was feasible. In the second phase, the ranking API and models were migrated to the new architecture. The final phase introduced a hybrid serving model, combining public and private cloud infrastructure to balance cost efficiency with architectural flexibility.
Outcome | Optimizing costs while improving compute performance
By migrating its search ranking system to AWS, Booking.com has increased its ability to experiment with ML models. “In the past, we were always very mindful of scalability limitations,” says Datbayev. “Now, there are no issues when it comes to scalability.” With workloads now running on Amazon EKS, the company benefits from a flexible, isolated compute environment that can adapt quickly to changing demand—for example, when the teams need to test a new model.
AWS offers instance types optimized for specific use cases, such as compute, memory, or storage. This meant that Booking.com could benchmark different instance types to find the best fit for the search ranking system, which requires high performance and low latency. As a result of the extensive testing, Booking.com has identified the optimal price-performance ratio for its workloads. The new architecture delivers approximately 40 ms latency for 99.9 percent of requests, maintaining the responsiveness that’s critical to delivering a seamless search experience for users.
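The benchmarking described above can be sketched in a few lines of Python: compute the 99.9th-percentile latency for each candidate instance type, discard candidates that exceed a latency budget, and pick the cheapest of the rest per million requests. This is an illustrative sketch, not Booking.com's actual method; the field names (`hourly_cost`, `requests_per_second`, `latencies_ms`) and the selection rule are assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round(pct * len(ordered) / 100, 6))
    return ordered[max(rank, 1) - 1]


def best_price_performance(candidates, latency_budget_ms, pct=99.9):
    """Return (cost_per_million_requests, name, tail_latency_ms) for the
    cheapest candidate whose tail latency fits the budget, or None."""
    eligible = []
    for c in candidates:
        tail = percentile(c["latencies_ms"], pct)
        if tail <= latency_budget_ms:
            requests_per_hour = c["requests_per_second"] * 3600
            cost_per_million = c["hourly_cost"] / requests_per_hour * 1_000_000
            eligible.append((cost_per_million, c["name"], tail))
    return min(eligible) if eligible else None
```

With a 40 ms budget, a call such as `best_price_performance(candidates, latency_budget_ms=40)` would return the lowest-cost candidate that keeps 99.9 percent of its measured requests within that budget, given benchmark data collected for each candidate.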
With the new architecture in place, Booking.com can focus on accelerating innovation. The company looks forward to continuing to work alongside the AWS team on new initiatives. “We’re constantly monitoring the latest trends in the industry and how we can integrate those to create even more value for our customers,” says Datbayev.
Architecture Diagram