CloudFront Migration Series (Part 3): OLX Europe, The DevOps Way
Business and scale at OLX Group
At OLX Group, we operate the fastest-growing network of trading platforms globally. Serving 300 million people every month in 30+ countries around the world, OLX Group helps buy and sell cars, find housing, get jobs, buy and sell household goods, and much more. With more than 20 well-loved local brands including Avito, OLX, Otomoto, Otodom, and Property24, our solutions are built to be safe, smart, and convenient for our customers. We are powered by a team of 7,500+ people, working across five continents in offices all around the world.
In this blog series, we are sharing our content delivery network migration story for OLX Europe – which focuses on trading used goods. The platform runs in seven markets and is used by 14 independent websites. Across the entire CDN, the platform is serving over 255,000 requests per second during peak hours, with 114,000 Lambda@Edge invocations per second and over 40 Gbit/s of total throughput.
Within a short period, we successfully rebuilt our entire edge infrastructure to be more flexible, with a fully automated setup. Today, it allows us to integrate products better, gather more granular data from our ecosystem, and provide a better user experience to our customers.
Challenges with existing CDN
Our former CDN solution consisted of a single logical unit which aggregated all rules (SSL termination, path routing and rewriting, caching configuration, legacy URL redirects, security rules, and features) for all of our markets. The same logical unit was used for both static and dynamic content (including mobile and desktop platforms). Changes to our configuration were introduced manually, as no tooling was available to automate the task or manage the configuration as code. While the history of changes was available, it was difficult to compare differences between versions because of the large and concentrated number of rules. As a consequence, we were prone to improper changes during rollouts or rollbacks that impacted all of our markets. To reduce this risk, we spent a significant amount of time verifying and completing changes, which reduced our velocity.
High-level overview of former CDN setup
To solve this problem, as well as others, we decided to evaluate CloudFront as a replacement CDN solution during a proof of concept (PoC). If the PoC was successful, we could quickly roll it out to production. We had our sights on the following benefits:
- As we are already using AWS for our origin, the new CDN setup would become part of our AWS infrastructure. Using the same tooling as our existing infrastructure, we would reduce the operational burden of managing another technology stack. We could also reduce some of our costs thanks to this synergy.
- Automate rollouts using CI/CD pipelines, to deploy changes at a higher velocity (multiple times a day) and with fewer human errors. In our CI/CD pipeline, we aimed to use our version control system to manage configuration changes and perform standard code reviews. We also considered introducing a staging phase before rollouts that maintained a consistent experience with production environments.
- Increase configuration granularity based on markets and traffic type to reduce the blast radius when introducing changes.
- Improve observability (logs, metrics) for both traffic analysis and debugging purposes.
Architecting and preparing
We started with an analysis of the former CDN configuration, and decided to make significant changes to our existing architecture to make it more granular. We also needed to match the existing rules with equivalent rules in CloudFront to maintain a consistent user experience.
The first step was to split our global configuration into smaller logical units to simplify it and reduce the blast radius of an improper change. We did this by using three CloudFront distributions per market (country), one for each type of traffic (desktop, mobile, static).
Each CloudFront distribution was configured with:
- Multiple aliases (domains) and associated SSL certificates. We chose AWS Certificate Manager (ACM), as this would significantly reduce the operational burden of keeping certificates up to date and renewing them. (Note: ACM certificates used with CloudFront are managed in the us-east-1 Region.)
- Multiple origins with specific configuration (timeouts, header injection) per origin. This allows us to serve applications from a combination of smaller reusable services.
- Multiple cache behaviors (path-based routing and caching)
High-level overview of CloudFront setup
With this architecture, it was easier to define rules per market, with some of them reusable across markets.
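As a rough illustration of this layout, the sketch below enumerates one distribution per market and traffic type. The market codes and the naming scheme are hypothetical placeholders, not the names used in the actual setup.

```python
from itertools import product

# Hypothetical market codes, for illustration only.
MARKETS = ["m1", "m2", "m3"]
# The three traffic types described in this post.
TRAFFIC_TYPES = ["desktop", "mobile", "static"]


def distribution_names(markets, traffic_types):
    """Return one distribution name per (market, traffic type) pair."""
    return [f"{market}-{traffic}" for market, traffic in product(markets, traffic_types)]


names = distribution_names(MARKETS, TRAFFIC_TYPES)
print(len(names))  # number of markets multiplied by number of traffic types
print(names)
```

A change to a single entry in this list touches one market and one traffic type only, which is the blast-radius reduction described above.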
As a next step, we identified existing functionalities that were not provided by standard CloudFront components; these needed additional customization using features like Lambda@Edge.
Protecting origins from direct access
To address this requirement, we used a modified version of a solution presented in an AWS Blog, which is based on dynamically updating VPC security groups to allowlist CloudFront IP ranges. We created Regional security groups that could be attached to any AWS-based origin (in our case, an Application Load Balancer (ALB)). Note that Layer 7 based solutions are also available.
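The core of such an approach can be sketched as follows: filter the CloudFront entries out of a document shaped like AWS's published ip-ranges.json, then compute which rules to add to or remove from a security group. This is a minimal sketch, not the solution from the referenced blog post. The sample prefixes are documentation-only addresses rather than real CloudFront ranges, and the function names are hypothetical.

```python
import json

# Trimmed sample in the shape of AWS's published ip-ranges.json.
# The prefixes are documentation-only example addresses (assumption).
SAMPLE_IP_RANGES = json.loads("""
{
  "prefixes": [
    {"ip_prefix": "192.0.2.0/24",    "region": "GLOBAL",    "service": "CLOUDFRONT"},
    {"ip_prefix": "198.51.100.0/24", "region": "eu-west-1", "service": "EC2"},
    {"ip_prefix": "203.0.113.0/24",  "region": "GLOBAL",    "service": "CLOUDFRONT"}
  ]
}
""")


def cloudfront_prefixes(ip_ranges):
    """Return the sorted, de-duplicated IPv4 prefixes tagged as CLOUDFRONT."""
    return sorted(
        {p["ip_prefix"] for p in ip_ranges["prefixes"] if p["service"] == "CLOUDFRONT"}
    )


def diff_rules(current, desired):
    """Return (to_add, to_remove) so the current rules converge on the desired ones."""
    current, desired = set(current), set(desired)
    return sorted(desired - current), sorted(current - desired)


desired = cloudfront_prefixes(SAMPLE_IP_RANGES)
to_add, to_remove = diff_rules(["192.0.2.0/24", "10.0.0.0/8"], desired)
print("add:", to_add)
print("remove:", to_remove)
```

In a real setup, the two lists would drive calls that authorize and revoke ingress rules on the security group attached to the origin.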
We used Lambda@Edge to replicate our old CDN's custom logic. We ran the following sanity checks to validate the usage of Lambda@Edge:
- Whether the supported programming languages cover our needs: the rich libraries in the nodejs10 runtime supported by Lambda@Edge were more than enough.
- Whether our automation tool of choice (Terraform) supports Lambda@Edge: we had an in-house module already in place for it.
- Whether Lambda@Edge can modify both incoming requests and outgoing responses: Lambda@Edge can be executed on four events (viewer request, origin request, origin response, and viewer response) available for each cache behavior, which covers this requirement.
We were able to replace our custom logic with Lambda@Edge, which covered a set of features accumulated over years of use on our former CDN configuration. It consisted of standard website cases, like legacy and vanity URL redirects and URL rewrites based on regular expressions, as well as more advanced features like HTTP header manipulation and device detection (redirects between mobile and desktop site versions depending on device headers).
With the information we collected, we created a repository and CI/CD pipelines for our Terraform and Lambda@Edge code bases and started programming. Our pipeline automated the following steps:
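To make these cases concrete, here is a minimal sketch of a Lambda@Edge handler that applies a regex-based legacy redirect and a mobile redirect. It is written in Python for illustration and is not the code used in this migration. It assumes the function runs on the origin request event and that the Host and CloudFront-Is-Mobile-Viewer headers are forwarded; the hostname and the redirect rule are hypothetical.

```python
import re

# Hypothetical mobile hostname and legacy rule, for illustration only.
MOBILE_HOST = "m.example.com"
LEGACY_REDIRECTS = [
    (re.compile(r"^/old-ads/(\d+)$"), r"/ads/\1"),
]


def redirect(location, status="302", description="Found"):
    """Build a CloudFront-style redirect response (status must be a string)."""
    return {
        "status": status,
        "statusDescription": description,
        "headers": {"location": [{"key": "Location", "value": location}]},
    }


def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]  # keys are lowercase header names
    uri = request["uri"]

    # Legacy URL redirects based on regular expressions.
    for pattern, target in LEGACY_REDIRECTS:
        if pattern.match(uri):
            return redirect(pattern.sub(target, uri), "301", "Moved Permanently")

    # Device detection: send mobile viewers to the mobile site.
    mobile = headers.get("cloudfront-is-mobile-viewer", [{}])[0].get("value")
    host = headers["host"][0]["value"]
    if mobile == "true" and host != MOBILE_HOST:
        return redirect(f"https://{MOBILE_HOST}{uri}")

    # No rule matched: pass the request through unchanged.
    return request
```

Returning a response object short-circuits the request at the edge, while returning the request object lets CloudFront continue to the origin.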
- Creation of ACM certificates, CloudFront distributions, cache behaviors, origin configurations
- Creation and deployment of Lambda@Edge functions
- Simple static checks on the code (syntax, linter)
In parallel, we created dashboards and detectors to monitor our new setup (also using infrastructure as code). With enhanced monitoring enabled on our CloudFront distributions, we gained access to the most important metrics for alerting when CDN or backend issues occur:
- Origin latency to detect backend slowdowns
- Lambda@Edge throttling and execution durations to detect concurrency limits and cold start issues
- 5xx error rate, calculated as a percentage of all requests, to detect errors coming from the backend, either due to performance issues or misconfigurations.
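The 5xx metric in the list above is simple arithmetic. The sketch below shows one way to compute it and turn it into an alert decision. The threshold and the minimum request count are hypothetical values, not the ones used by our detectors.

```python
def error_rate_percent(total_requests, errors_5xx):
    """5xx responses as a percentage of all requests in the same time window."""
    if total_requests == 0:
        return 0.0
    return 100.0 * errors_5xx / total_requests


def should_alert(total_requests, errors_5xx, threshold_percent=1.0, min_requests=100):
    """Alert only when there is enough traffic for the rate to be meaningful.

    threshold_percent and min_requests are illustrative assumptions.
    """
    if total_requests < min_requests:
        return False
    return error_rate_percent(total_requests, errors_5xx) > threshold_percent


print(error_rate_percent(2000, 50))
print(should_alert(2000, 50))
```

Using a rate rather than an absolute count keeps the alert comparable across markets with very different traffic volumes.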
We also monitored other metrics that correlate to our costs:
- Cache hit ratio – a higher ratio reduces Lambda executions and bandwidth, and potentially lets us scale down our origin
- Incoming/Outgoing traffic rate – we are charged for network transfer
- Number of requests – we are charged for number of requests handled by our CloudFront distributions
- Lambda@Edge invocations count and execution time
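To see why the cache hit ratio is a cost driver, recall that Lambda@Edge functions attached to origin request and origin response events only run on cache misses. The sketch below estimates those invocations from the hit ratio. The traffic figures are hypothetical, and the function name is ours.

```python
def origin_facing_invocations(requests_per_second, cache_hit_ratio, functions_per_miss=1):
    """Estimate invocations per second of origin-facing Lambda@Edge functions.

    Only cache misses reach the origin-facing events, so the estimate is
    misses per second multiplied by the number of functions run per miss.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    misses_per_second = requests_per_second * (1.0 - cache_hit_ratio)
    return misses_per_second * functions_per_miss


# Hypothetical numbers: 1,000 requests/s, two origin-facing functions.
print(origin_facing_invocations(1000, 0.75, functions_per_miss=2))
print(origin_facing_invocations(1000, 0.0, functions_per_miss=2))
```

Functions attached to viewer events run on every request regardless of the hit ratio, so this estimate does not apply to them.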
Progressive rollout
With monitoring in place and all CloudFront distributions deployed, we started the last phase of the project, which involved shifting our traffic and fine-tuning the CDN configuration.
Following a well-prepared migration schedule, we started the rollout country by country. In each country, we progressively shifted an increasing percentage of traffic from our existing CDN to CloudFront. We used Route 53's weighted CNAME records with a relatively small TTL (60s) to perform this canary rollout. Specifically, we set an initial small weight on the CloudFront distribution's CNAME versus a large weight on our existing CDN, and progressively inverted the weights until CloudFront received 100% of traffic.
Weighted R53 records example for desktop traffic
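As a rough sketch of the weight shifting described above, the code below builds a change batch in the shape accepted by Route 53's ChangeResourceRecordSets API. Route 53 sends each record a share of traffic equal to its weight divided by the sum of all weights. The domain names, set identifiers, and weight steps are hypothetical; only the 60-second TTL comes from this post.

```python
def weighted_cname(name, set_identifier, target, weight, ttl=60):
    """Build one UPSERT change for a weighted CNAME record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_identifier,
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        },
    }


def canary_change_batch(name, old_target, new_target, new_weight, total=100):
    """Split `total` weight between the former CDN and CloudFront."""
    if not 0 <= new_weight <= total:
        raise ValueError("new_weight must be between 0 and total")
    return {
        "Changes": [
            weighted_cname(name, "former-cdn", old_target, total - new_weight),
            weighted_cname(name, "cloudfront", new_target, new_weight),
        ]
    }


def traffic_share(weight, all_weights):
    """Expected fraction of DNS answers for a record with the given weight."""
    total = sum(all_weights)
    return weight / total if total else 0.0


# Hypothetical rollout steps: percentage of traffic sent to CloudFront.
for step in [1, 5, 25, 50, 100]:
    batch = canary_change_batch(
        "www.example.com.", "former-cdn.example.net.", "d111111abcdef8.cloudfront.net.", step
    )
    weights = [c["ResourceRecordSet"]["Weight"] for c in batch["Changes"]]
    print(step, weights, traffic_share(step, weights))
```

Rolling back is the same operation with the previous weights, and the short TTL bounds how long resolvers keep serving the old answer.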
This technique helped us reduce the blast radius of changes by rolling our traffic out separately on desktop, mobile, and static for each country. It also gave us the possibility to roll back changes if issues were detected, so we could fix them without time pressure. Here are some examples of issues that we resolved smoothly:
- GZIP compression was not happening on CloudFront, even though it was enabled on the relevant cache behavior. It turned out that CloudFront requires a Content-Length header on the response from the origin in order to compress an object, and our origin was not sending it. We fixed the issue on our backend.
- By default, CloudFront relies on caching-related headers returned from the origin to control caching behavior. We faced an issue with inconsistent caching settings across pages due to lack of headers returned by the origin. Due to the complicated nature of the issue, this was fixed on our backend as well.
- The major challenge we faced was traffic throttling because we exceeded our Lambda@Edge concurrency limits when we scaled our traffic to multiple countries. This was due to three factors: a) we executed Lambda@Edge for dynamic non-cacheable requests, which meant an execution for every received request, b) we executed multiple Lambda@Edge functions on different events for each request, and c) we had unbalanced executions across AWS Regions in the EU, which concentrated executions in eu-west-1 in our case. To solve this, we worked with AWS to raise quotas during the preparation phase, and when issues occurred.
In the long term, using Lambda@Edge for dynamic traffic was not cost-efficient in our case. We decided to move all redirect and header modification logic for dynamic content to our origins, which reduced Lambda@Edge executions by 99%. We only used Lambda@Edge for our cacheable content, which didn't require the same level of executions.
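A back-of-the-envelope way to reason about the throttling described above is that required concurrency is roughly invocations per second multiplied by average execution duration. The sketch below applies that estimate before and after a 99% reduction in executions. The invocation rate and the duration are hypothetical, and the result ignores how executions are spread across Regions, which mattered in our case.

```python
def required_concurrency(invocations_per_second, avg_duration_ms):
    """Approximate concurrent executions: rate multiplied by average duration."""
    return invocations_per_second * avg_duration_ms / 1000.0


# Hypothetical figures, for illustration only.
before = required_concurrency(100_000, avg_duration_ms=5)
after = required_concurrency(100_000 * 0.01, avg_duration_ms=5)  # 99% fewer executions
print(before, after)
```

Because concurrency limits apply per Region, an uneven distribution of executions can hit the limit in one Region well before the global estimate suggests.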
Migration results and takeaways
In less than 3 months and with a small team, we were able to refactor a complex configuration from our former CDN and successfully complete the migration to CloudFront. The process involved considerable architectural changes to split our CDN monolith into smaller independent components, which reduced the complexity and facilitated the migration. In other words, a lift-and-shift approach was not going to make it happen for us. We were successful in building the necessary automations for safer and faster change deployments.
We would like to share with you some of our key takeaways:
- You can easily and safely perform a CDN migration using Route 53 weighted records and CNAMEs.
- Reading the CloudFront documentation before working on your solution can save you a lot of time debugging issues and redesigning your CDN to fit the feature model it provides.
- Consider architectural changes before a migration, then look for and follow best practices provided by CloudFront. Differences between CDNs can be significant; what works in one may not be applicable in the other.
- Make sure you understand the scaling model of Lambda@Edge if you will use it, and work with AWS teams if you need to raise your quotas, especially if you are using Lambda@Edge on a large scale.
- Using an infrastructure as code (IaC) approach will greatly improve your operational efficiency overall.
As new blog posts roll out in this series, we will add links here.
A big thanks to the whole European SRE and engineering teams for all the hard work and support, and a special shout out to: Maciej Sobkowiak, Przemysław Iwanek, Ricardo Silveira and Maxim Pisarenko. Kudos to the Security team for the amazing work building an in-house solution based on AWS services to fight against bad bots, as a replacement for a managed security feature provided by our former CDN. This will be the subject of a dedicated future article! Stay tuned.
This post was cowritten by OLX Europe Site Reliability engineers: Ricardo Conceicão, Przemysław Iwanek, Maciej Sobkowiak, and Achraf Souk, Principal Edge Specialist Architect at AWS.