
Choosing serverless: Babbel’s migration story

Who is Babbel?

Babbel is a whole ecosystem of language learning offerings, including the world’s best-selling language learning app. With more than 10 million subscriptions sold and over 60,000 courses for 14 languages, we created the No. 1 destination for language learners globally. We have been running our platform on Amazon Web Services (AWS) since day one, starting with the launch of our product in 2007, and are frequently early adopters of new AWS service offerings. Since the Babbel learning ecosystem is purely digital, it depends heavily on the underlying technology, which is expected to be not only reliable and stable but also scalable at any point in time. This comes with challenges and opportunities along the way, especially as the product offering grows and the service landscape changes.

Babbel has been expanding its learner base constantly, and our traffic increased steadily from 2007 to 2020. During 2020, Babbel’s learner base grew substantially, with a two- to three-fold increase in traffic from the USA and our main European markets. With the various restrictions introduced around the world during the pandemic, many people chose to learn a new language or improve their language skills. This created additional spikes in incoming traffic that hadn’t occurred at this scale before. During all of this, we didn’t question whether our infrastructure would ever be challenged by the changing demands of our users.

However, prior to 2020, the platform we had built to host Babbel’s services wasn’t yet leveraging AWS serverless services. It relied on an old stack running on AWS OpsWorks, which no longer fit our needs. In this article, we describe what led Babbel to consider the change, the options we evaluated, and how we finally migrated our production workloads to Amazon ECS on AWS Fargate and AWS Lambda.

Why change our architecture?

Operating in a constantly growing and dynamic environment, we are motivated to change and improve things. We look for opportunities where improvements translate into an enhanced learning experience for our learners. As you can imagine, prioritizing technical topics does not always translate directly into an improved learner experience, but there are some pillars we take as signposts:

  • Accelerating development velocity and release times
  • Reducing maintenance work
  • Having and maintaining an up-to-date environment
  • Improving feature delivery times

Before starting the project, we were running an old version of OpsWorks, which required us to use an outdated version of Chef to manage the configuration of the OpsWorks EC2 instances. These instances were based on an older instance type and ran an Ubuntu version that was approaching the end of its release life cycle, so action was definitely required. Upgrading the Chef cookbooks to a new Chef version, upgrading the Ubuntu version, and upgrading the old OpsWorks EC2 instances would have taken a substantial amount of time. Additionally, our deployment, rollback, and upgrade times were eating up a lot of developer hours in maintenance work, which we wanted to reduce. During rapid spikes in traffic, we had longer scaling times than we would have liked, and autoscaling wasn’t reliable: in some cases, it took up to 25 minutes to add additional EC2 instances to the OpsWorks cluster. For load balancing, we were limited to the Classic ELB, which didn’t have all the features we wanted to use, such as authentication via Amazon Cognito and advanced request routing. These features were available in Application Load Balancers (ALBs), but OpsWorks didn’t support ALBs at the time. Given these circumstances, we concluded that the ideal solution had to address these pain points, which meant moving away from our OpsWorks EC2 setup.

Considering the options for migration

Before analyzing the potential technical solutions, we discussed what the ideal solution would look like from a feature perspective. We agreed that, ideally, the solution should:

  • Integrate well with our existing AWS architecture and with our Terraform investment and structure
  • Be actively developed and up-to-date with a dedicated service and support team
  • Free up operational and maintenance time to allow us to work on the things that bring more value to the learners or to the Babbel engineering teams

It was clear to us that the solution was to go serverless. We proceeded to look at the available solutions for moving away from OpsWorks and replacing the whole computing and hosting layer. The options we considered were:

  • AWS Lambda
  • Amazon Elastic Container Service (Amazon ECS) 
  • Amazon Elastic Kubernetes Service (Amazon EKS)

We came up with the following conclusions on these options:

AWS Lambda

Ideally, we would be running almost everything on Lambda. Scaling is automated by default with no configuration required, there are no instances to maintain and no OS or security updates to apply ourselves, and deployments and rollbacks are near-instant. For some of our services this was possible, and we decided to use Lambda for them. However, we concluded that Lambda wouldn’t be the right solution for all of our services: we had some multi-purpose services that required Docker, and at the time of our evaluation in early 2020, Lambda support for container images wasn’t available yet.
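For illustration, here is a minimal Terraform sketch of how one of these Lambda-backed services could be declared. The function name, runtime, handler, role, and artifact location are hypothetical placeholders, not our actual configuration:

  # Hypothetical example of a Lambda-backed service managed via Terraform.
  # All names and values below are placeholders.
  resource "aws_lambda_function" "example_service" {
    function_name = "example-service"
    runtime       = "ruby2.7"
    handler       = "handler.call"
    role          = "arn:aws:iam::123456789012:role/example-service-role"

    # A deployment simply points the function at a new package version;
    # there are no instances to patch, reboot, or warm up.
    s3_bucket = "example-deploy-artifacts"
    s3_key    = "example-service/1.0.0.zip"

    memory_size = 512
    timeout     = 30
  }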

Amazon ECS

As Lambda wasn’t an option for these types of services, we had to decide which platform to run our (Docker) containers on. We evaluated Amazon EKS and Amazon ECS and had the following four options to choose from:

  • ECS on EC2
  • ECS on Fargate
  • EKS on EC2
  • EKS on Fargate

Since ECS on Fargate and ECS on EC2 are very similar, we treated them as one alternative for the whole ecosystem versus Kubernetes with EKS (on EC2 or Fargate), and weighed the pros and cons of each tech stack. When we first looked at ECS on Fargate in 2019, it was still missing some features that we needed back then (e.g. cost allocation tags for containers). Our AWS account managers helped us with our feature requests, and these functionalities were subsequently implemented. Once the features were released, there were no blockers to moving all of our freshly Dockerized services to ECS on Fargate. Between EC2 and Fargate, Fargate was the better choice for our architecture, as it takes away the maintenance effort for the underlying EC2 machines. This tech stack was also easy to integrate with the rest of the AWS services we use and with our Terraform codebase, which we already had experience managing.
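To illustrate that integration, here is a simplified Terraform sketch of what an ECS on Fargate service with target-tracking autoscaling might look like. All names, counts, and thresholds are hypothetical and not taken from our actual setup:

  # Hypothetical ECS on Fargate service with target-tracking autoscaling.
  resource "aws_ecs_task_definition" "example" {
    family                   = "example-service"
    requires_compatibilities = ["FARGATE"]
    network_mode             = "awsvpc"
    cpu                      = 512
    memory                   = 1024
    execution_role_arn       = "arn:aws:iam::123456789012:role/example-task-execution"

    container_definitions = jsonencode([{
      name         = "example-service"
      image        = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/example-service:latest"
      essential    = true
      portMappings = [{ containerPort = 8080 }]
    }])
  }

  resource "aws_ecs_service" "example" {
    name            = "example-service"
    cluster         = "example-cluster"
    task_definition = aws_ecs_task_definition.example.arn
    launch_type     = "FARGATE"   # no EC2 instances to provision or patch
    desired_count   = 2

    network_configuration {
      subnets         = ["subnet-0123456789abcdef0"]
      security_groups = ["sg-0123456789abcdef0"]
    }
  }

  # Scaling is handled by Application Auto Scaling instead of adding EC2
  # instances to an OpsWorks cluster.
  resource "aws_appautoscaling_target" "example" {
    service_namespace  = "ecs"
    resource_id        = "service/example-cluster/example-service"
    scalable_dimension = "ecs:service:DesiredCount"
    min_capacity       = 2
    max_capacity       = 20
  }

  resource "aws_appautoscaling_policy" "example" {
    name               = "example-cpu-target-tracking"
    policy_type        = "TargetTrackingScaling"
    service_namespace  = aws_appautoscaling_target.example.service_namespace
    resource_id        = aws_appautoscaling_target.example.resource_id
    scalable_dimension = aws_appautoscaling_target.example.scalable_dimension

    target_tracking_scaling_policy_configuration {
      target_value = 60   # keep average CPU utilization around 60%
      predefined_metric_specification {
        predefined_metric_type = "ECSServiceAverageCPUUtilization"
      }
    }
  }

In practice, each service also sits behind an ALB target group and has its own IAM roles, which are omitted here for brevity.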

Amazon EKS

While weighing the pros and cons of running EKS, we decided that it wasn’t necessary for our use case and infrastructure setup. Our main objective was to have a platform that scales our Docker containers with the least amount of effort and with the smallest number of changes to the rest of our environment and our AWS service integration. Additionally, we wanted to keep the amount of operational work minimal, as it doesn’t bring any value to our learners. With Kubernetes, we felt that we would face a steeper learning curve, need to make more changes to our existing environment, and take on more operational and maintenance work. We also thought that we could achieve a better separation between development and infrastructure with more AWS-centric infrastructure as code, which we manage via Terraform (one example being AWS IAM). In short, we wanted to change our compute and hosting landscape without a larger adaptation of our systems and services, or of how we run deployments, manage our networks, security groups, and so on.

In 2019 and early 2020, EKS was still a relatively new service. At the time, our decision not to adopt EKS (or Kubernetes) came down to concerns around support for the latest Kubernetes features on AWS. While EKS uses the upstream Kubernetes code (with no modifications), our concern was the delta between the latest Kubernetes releases and the releases available on EKS; we were unsure whether we would have immediate access to all of the latest Kubernetes features. No particular feature was a deal breaker in this case, and we decided that we wanted to go with an AWS-first service rather than an AWS-managed open-source service. There are certainly many advantages to using Kubernetes, such as being able to run a hybrid-cloud environment with more fine-grained control, but that was not important to us. In summary, for the reasons mentioned above, we decided to go with ECS instead of EKS (and therefore never compared running EKS on EC2 versus Fargate).

Migrating the workloads 

Since we had previous experience running AWS Lambda, the initial migration of services from AWS OpsWorks to AWS Lambda went quickly and without any unforeseen problems. We had no experience with AWS Fargate, however, so we had to Dockerize all of our remaining services before starting that migration. Besides the technical challenges we had to overcome due to our lack of experience with this type of migration, a lot of inter-team coordination was required, as the migration touched 10+ services, both customer-facing and internal. Naturally, the first few services took the most time, as we had to work out the best way to do deployments, fine-tune our autoscaling, and ensure that the move of the services to Docker was working as intended. We started with internal services that had no product impact, continued with internal services that did have customer impact, and finished with customer-facing services. The final setup varies, because our services have different integrations and environments (for example, sometimes we use Amazon Cognito with ALBs or have CDNs in front of the ALBs). A simplified before/after comparison is illustrated below:

Conclusion

Once we completed the technical changes of our project, it was time to evaluate if we achieved our goals. To summarize, the initial pain points were:

  • High maintenance effort of OpsWorks/Chef/EC2, spending significant development time on maintenance instead of improving the app for customers
  • Unreliable scaling, with warm-up times of 20+ minutes due to the underlying OpsWorks and Chef stack
  • A setup with OpsWorks that wasn’t able to use Application Load Balancers, which had features that we wanted to use 

With the switch to Amazon ECS on AWS Fargate, and AWS Lambda, we gained the following benefits:

  • Faster release and rollback times with reduced maintenance effort, allowing us to focus on building new features for our learners. We went from deployment times of 25-30 minutes per OpsWorks cluster to almost instant deployments and rollbacks with AWS Lambda and Amazon ECS on AWS Fargate.
  • Rapid automated scalability compared to our previous setup. This turned out to be useful when the unexpected rapid increase in traffic in March 2020 produced peaks of demand around the clock and from around the world. 
  • Tighter integration with other AWS services for different purposes, such as security scans as part of our release process using Amazon ECR image scanning, or direct authentication via ALBs (a simplified sketch of both follows this list)
  • Reduced cost as a side effect of running our computing workloads more efficiently. We have described this in detail at https://www.babbel.com/en/magazine/how-to-do-more-with-fewer-servers
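
To make the ECR and ALB integrations more concrete, here is a rough Terraform sketch of enabling image scanning on an ECR repository and authenticating requests with Amazon Cognito at the ALB. The ARNs, pool identifiers, and names are placeholders rather than our real resources:

  # Hypothetical ECR repository with scan-on-push enabled.
  resource "aws_ecr_repository" "example" {
    name = "example-service"

    image_scanning_configuration {
      scan_on_push = true   # every pushed image is scanned for known CVEs
    }
  }

  # Hypothetical ALB listener rule that authenticates via Cognito before forwarding.
  resource "aws_lb_listener_rule" "example_auth" {
    listener_arn = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:listener/app/example/abc/def"
    priority     = 10

    action {
      type = "authenticate-cognito"
      authenticate_cognito {
        user_pool_arn       = "arn:aws:cognito-idp:eu-west-1:123456789012:userpool/eu-west-1_EXAMPLE"
        user_pool_client_id = "example-client-id"
        user_pool_domain    = "example-domain"
      }
    }

    action {
      type             = "forward"
      target_group_arn = "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/example/abc123"
    }

    condition {
      path_pattern {
        values = ["/*"]
      }
    }
  }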

 

About Babbel

Babbel is driven by a mission: Everyone. Learning. Languages. This means building products that help people connect and communicate across cultures. Babbel, Babbel Live and Babbel for Business use languages in real situations, with real people. And it works: Studies with Yale University, City University of New York and Michigan State University prove its effectiveness. The key is a blend of humanity and technology. More than 60,000 lessons across 14 languages are hand-crafted by 150+ linguists, with user behaviours continuously analysed to shape and tweak the learner experience. From headquarters in Berlin and New York, 750 people from more than 60 nationalities represent all the differences that make humans unique. Babbel is the most profitable language learning app worldwide, with more than 10 million subscriptions sold. For more information, visit www.babbel.com or download the apps in the App Store or Play Store.

About the author

Gyorgi Stoykov, MSc., is a Senior Manager on Babbel’s Infrastructure team, currently based in Berlin. He has extensive experience in cloud computing and infrastructure in a variety of environments, ranging from Fortune 500 companies and start-ups to academia. He is deeply passionate about DevOps, AWS, and helping organisations build cloud-native products by applying Agile and DevOps best practices.