Journey to Being Cloud-Native – How and Where Should You Start?
By Ashley Sole, Senior Engineering Manager at Skyscanner
By Rael Winters, CloudOps Product Manager at DevOpsGroup
By Kamal Arora, Sr. Manager, Solution Architecture at AWS
Cloud-native is one of the hottest topics in IT, so naturally it’s a source of much debate.
Amazon Web Services (AWS), DevOpsGroup, and Skyscanner have teamed up to cut through the hype and offer an objective look at “going native” in the context of large-scale cloud adoption.
DevOpsGroup is an AWS Partner Network (APN) Advanced Consulting Partner that offers digital transformation services based on DevOps practices. Skyscanner is a leading global travel service search engine that recently migrated all-in to AWS.
In this post, our goal is to differentiate between applications that justify the full cloud-native treatment upfront and those where a simpler, phased approach might be more appropriate.
Before we dive in, let’s consider what we mean by cloud-native. With such a complex, rapidly evolving concept, a simple definition is too restrictive. It’s more useful to consider cloud-native as a continuum.
The Cloud Native Maturity Model outlined by Kamal Arora et al in Cloud Native Architectures is a good place to start. It positions “cloud-native services”, “application-centric design”, and “automation” as core elements which can evolve over time. Their sophistication shapes the overall maturity of a given application.
Read more about these three elements on the DevOpsGroup Blog.
Figure 1 – The Cloud Native Maturity Model.
What’s the relevance of cloud-native maturity? It brings us back to the rationale behind all-in cloud migration or widespread adoption.
Mounting evidence shows that the way you implement cloud technology matters. The latest State of DevOps report, which considers data from more than 30,000 surveys, highlights that infrastructure-as-code (IaC), platform-as-a-service (PaaS), containers, and cloud-native architectures are predictive of organizational success.
Using these technologies and practices clearly impacts the speed at which the promised performance benefits are realized and translated into tangible commercial advantage.
If you’re migrating to the cloud, moving as-is has limited value in itself. Likewise, if you were born in the cloud, failure to exploit advanced features, services, and automation techniques will hinder long-term agility and growth.
Targeting Maturity Levels
This is why you need to decide upon the level of sophistication required. If you’re building in the cloud, it’s a case of focusing on what each application needs to achieve.
It may not be necessary to aim for the more advanced end of the cloud-native spectrum, but you still need to consider medium and long-term business goals while ensuring the application can accommodate ongoing improvement.
When it comes to migrations, re-platforming and evolving the existing IT estate is usually simpler, with lower costs and risks attached, than a full rewrite. Rael has written about three fundamental paths for cloud migration, which are closely aligned with the AWS six “R”s of cloud migration.
Kamal’s maturity model, as shown in Figure 1, positions these three options—rewrite, re-platform, or re-host—as a larger spectrum that reflects the myriad of potential choices within each path.
But the million-dollar question is, which applications are best suited to which approach?
In the scope of a large-scale cloud migration, it’s likely that only a small percentage of the estate should be earmarked to go fully native during the move. Developers’ time and energy are finite, and they need to be invested in the applications closest to core value creation.
This is a tough decision for many organizations. Most large-scale cloud migrations have hard deadlines, which often means a lot of compromise. Decision-making must consider desired outcomes, as well as technical factors.
DevOpsGroup has been through this process with several organizations migrating to AWS. Multiple factors can have a bearing on the outcome, but those with the greatest significance are:
- Amount of technical debt within a given application. More debt makes the migration a catalyst for much-needed overhaul.
- Application suitability for running in the cloud. Legacy-architected applications benefit the most from a rewrite.
- Proximity of an application to core value creation activities, which tends to go hand-in-hand with more development activity. These applications have a greater need for agility and see amplified gains from even marginal improvements.
Skyscanner’s All-In Migration to AWS
When leading global travel service search engine Skyscanner opted to pursue all-in cloud adoption, its data center hardware refresh cycle was a key driver.
The team faced a choice between hardware reinvestment and an extension of its expensive data center estate, or an aggressive, time-pressured migration. They opted for the latter.
There were five global data centers, each holding a VMware installation with more than 7,000 virtual machines (VMs) hosted in total. Together, the data centers held more than 300 different services, owned by multiple engineering teams within Skyscanner. It was inevitable the migration would be highly complex.
Skyscanner operates a “you build it, you run it” approach with product-aligned teams. There was no central migration team, and each product team was responsible for formulating and executing a plan to migrate its own services.
Project roadmaps were used to define milestones and deadlines, and ongoing communication ensured everyone was clear about expectations and accountability.
Most teams’ Plan A was a cloud-native rewrite. But it soon became apparent this was not feasible or appropriate for some applications. Having the roadmaps in place made it easier for teams to transition to a Plan B—such as a rehost—when necessary.
Here, we outline two Skyscanner applications that occupy different positions on the cloud-native maturity spectrum following their migration.
The rewrite of Skyscanner’s Flight Stack is a sophisticated and impressive example of how cloud-native principles can be strategically developed over time. But it was no mean feat and the process took more than two years.
The re-platforming of Skyscanner’s translation tool, Strings-as-a-Service, was much simpler. Even so, challenges were encountered along the way and in-flight decisions had to be made to ensure it was moved quickly and could operate smoothly post-migration.
Long-Haul Migration: The Flight Stack Rewrite
The Flight Stack application is central to Skyscanner’s ability to fulfill its core customer proposition. Users expect a seamless service and want to access the information they need in a matter of seconds. Effective management of the transition to AWS was critical.
To maximize the benefits of moving to AWS, the stack was rewritten from a .NET, SQL Server-backed monolith to a stateless Java microservices application.
Previously, SQL Server was used for all aspects of data processing and storage. Following the rewrite, Apache Spark on Amazon EMR handles processing, Amazon Simple Storage Service (Amazon S3) is used for storage, and Redis has been deployed for real-time queries.
Figure 2 – Skyscanner’s Flight Stack architecture diagram.
The quote cache at the center of the diagram in Figure 2 runs on Amazon ElastiCache for Memcached, with global replication based on Amazon Simple Notification Service (SNS), which you can read more about in this blog. Redis is used to store search results.
For the Browse/TAPS service outlined in the diagram, a key architectural decision was to store data as immutable objects in Amazon S3. It uses a custom S3-based file system written in-house at Skyscanner and fronted by a Redis cache. A legacy SQL database was also migrated using lift and shift, but this mainly holds historical data.
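To illustrate why immutable objects pair well with a cache, here is a minimal Python sketch. It is not Skyscanner’s in-house file system: both classes are hypothetical in-memory stand-ins for Amazon S3 and Redis, and the key names are invented for the example. The point it shows is that when an object is never overwritten, a cached copy can never go stale, so the cache needs no invalidation logic.

```python
from typing import Dict, Optional


class ImmutableObjectStore:
    """In-memory stand-in for an object store such as Amazon S3 (hypothetical)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Immutable objects: a key is written once and never overwritten.
        if key in self._objects:
            raise ValueError(f"object {key!r} already exists")
        self._objects[key] = data

    def get(self, key: str) -> Optional[bytes]:
        return self._objects.get(key)


class ReadThroughCache:
    """In-memory stand-in for a cache such as Redis in front of the store."""

    def __init__(self, store: ImmutableObjectStore) -> None:
        self._store = store
        self._cache: Dict[str, bytes] = {}
        self.store_reads = 0  # counts how often a read falls through to the store

    def get(self, key: str) -> Optional[bytes]:
        if key in self._cache:
            return self._cache[key]
        self.store_reads += 1
        data = self._store.get(key)
        if data is not None:
            # Safe to cache without invalidation because objects never change.
            self._cache[key] = data
        return data
```

Under this assumption, a new version of a dataset would be published under a new key rather than by overwriting the old one, and readers would simply switch to the new key.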
Statelessness is a critical factor of the rewritten Flight Stack, enabling it to run in Kubernetes on Amazon EC2 Spot Instances, unlocking significant cost benefits.
When AWS wants a Spot Instance back, there is a two-minute window to move the workload. So, Skyscanner developed a procedure to drain services and remove them from the cluster within this timeframe.
The application monitors AWS endpoints every five seconds, thereby allowing at least one minute 55 seconds to drain a node. The node is cordoned to prevent Kubernetes scheduling anything new on it, and it’s deregistered from the Elastic Load Balancer to prevent it receiving traffic.
Finally, the node is drained and all pods are moved to new nodes while waiting for the in-flight connections to terminate.
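The sequence described above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not Skyscanner’s implementation: `interruption_pending` and the functions passed in `steps` are hypothetical callables. In a real cluster they might wrap a check for the Spot interruption notice, `kubectl cordon`, a load balancer deregistration call, and `kubectl drain`. The helper `worst_case_drain_budget` reproduces the arithmetic in the text: a 120-second notice minus a 5-second polling interval leaves 115 seconds.

```python
import time
from typing import Callable, List

NOTICE_WINDOW_S = 120  # two-minute warning described in the text
POLL_INTERVAL_S = 5    # polling interval described in the text


def worst_case_drain_budget(window_s: int = NOTICE_WINDOW_S,
                            poll_s: int = POLL_INTERVAL_S) -> int:
    """Seconds left to drain if the notice arrives just after a poll."""
    return window_s - poll_s


def watch_and_drain(node: str,
                    interruption_pending: Callable[[], bool],
                    steps: List[Callable[[str], None]],
                    sleep: Callable[[float], None] = time.sleep,
                    poll_s: int = POLL_INTERVAL_S) -> None:
    """Poll until an interruption is signalled, then run each step in order."""
    while not interruption_pending():
        sleep(poll_s)
    # Order matters: cordon first, then deregister, then drain.
    for step in steps:
        step(node)
```

Passing the steps in as a list keeps the ordering explicit and makes the procedure easy to exercise in a test without touching a real cluster.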
Rewriting the Flight Stack was a significant undertaking, but the rapid gains in terms of software modernization, resiliency, scalability, and cost optimization made it all worth the effort.
There was no single moment of switchover. Instead, the migration was handled in phases as different elements of functionality moved onto AWS. Additional services had to be implemented to handle traffic-shaping during the process, with flight searches alternating between AWS and the data center for a time.
The process Skyscanner adopted emulates the “strangler pattern” that’s gaining popularity in the cloud-native world for monoliths that cannot feasibly be rewritten in one go. Instead of using a cut-over rewrite, cloud-native functionality is slowly built around the application, progressively strangling it.
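One common way to support this kind of phased cut-over is a router that sends a configurable share of requests to the new stack. The Python sketch below is a generic illustration, not the traffic-shaping service Skyscanner built: the function name, the `"aws"` and `"datacenter"` labels, and the use of a request ID as the routing key are all assumptions made for the example.

```python
import hashlib


def route(request_id: str, migrated_percent: int) -> str:
    """Return 'aws' or 'datacenter' for a request, deterministically."""
    if not 0 <= migrated_percent <= 100:
        raise ValueError("migrated_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the request to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "aws" if bucket < migrated_percent else "datacenter"
```

Because the bucket depends only on the request ID, raising `migrated_percent` moves additional requests to the new stack without sending any already-migrated request back to the old one, and setting it to 0 acts as a rollback switch.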
Today, Skyscanner operates a number of large multi-tenant Kubernetes clusters across multiple regions. These run thousands of pods, serving tens of thousands of requests per second to power the flight search product.
The product team is still scaling the application, and while this is ongoing, a conductor is being used to split traffic, making it easier to maintain stability and reliability.
Short-Haul Migration: Strings-as-a-Service
As a global business operating in more than 30 languages, Skyscanner relies on a complex localization process managed by translation management executives and software engineers.
A proprietary tool called Strings-as-a-Service holds JSON strings in a central repository, and enables new strings to be translated, then pushed to relevant services.
The goal is to deliver up-to-date native experiences, so customers feel they’re being looked after by local teams that understand their needs.
Strings-as-a-Service’s workflow focuses on checking the validity of translations before they’re pushed into production. Extracting this functionality into a microservice has allowed Skyscanner to ensure completed translations are immediately available instead of being bundled into software releases.
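The post does not say which validity checks the workflow runs, so the following Python sketch is purely illustrative. It assumes strings are stored as flat JSON objects and that placeholders use a `{name}` syntax; both are assumptions, as are the function and variable names. It flags missing translations and translations whose placeholders differ from the source string.

```python
import json
import re
from typing import Dict, List

# Assumed placeholder syntax: {name}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def validate_translations(source_json: str, translated_json: str) -> List[str]:
    """Return a list of problems found in a translated JSON string file."""
    source: Dict[str, str] = json.loads(source_json)
    translated: Dict[str, str] = json.loads(translated_json)
    problems = []
    for key, text in source.items():
        if key not in translated:
            problems.append(f"{key}: missing translation")
            continue
        expected = set(PLACEHOLDER.findall(text))
        actual = set(PLACEHOLDER.findall(translated[key]))
        if expected != actual:
            problems.append(f"{key}: placeholders differ")
    return problems
```

A check of this kind could run before translated strings are pushed, so that an empty result is the condition for release.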
Strings-as-a-Service was part of a package of migrations handled by DevOpsGroup. The brief was to evolve the application to deliver rapid short-term benefits and accommodate further modernization after the migration.
Figure 3 – Strings-as-a-Service architecture.
To achieve this goal, the team redeployed the application into Docker containers, introduced GitHub to allow the containers to become stateless, and made functional changes to reduce the number of steps in the workflow.
Use of Docker containers was driven by the need to avoid scope creep, and to achieve the migration quickly. The existing application architecture wasn’t compatible with AWS Lambda and would have required major code changes.
The application had initially been designed with a concept of state and the team looked at various options to introduce statelessness, a fundamental quality for modern applications.
Amazon S3 buckets were considered, and Amazon Elastic File System (Amazon EFS) looked promising but wasn’t available in all required regions at the time. CIFS shares seemed like a good option, but a proof of concept (PoC) to test this couldn’t overcome performance issues.
Ultimately, GitHub Enterprise was selected as the central source of truth for JSON files, which facilitated statelessness and worked well with various sections of the workflow, pushing and pulling translated strings to and from the repository.
The intention was to use multiple Docker containers, with custom shell scripts written to manage configuration and replace Ansible, which Skyscanner had been using for VM configuration management.
Docker Compose was used to test mini clusters before they were deployed into Amazon Elastic Container Service (Amazon ECS) using Skyscanner’s proprietary orchestration tool Slingshot.
However, when the time came to orchestrate the migration, it became apparent that Dockerising Strings-as-a-Service was too unreliable. This was largely due to the way it had originally been implemented.
For instance, the high level of state entrenched in the service (including local management of a Git repository) meant container start-up times were excessive. The file system had to be populated before reaching a state of readiness, and there was a high risk of losing in-progress changes if the container was terminated.
What’s more, the application required periodic maintenance. This involved engineers logging in to conduct operations locally, which would have been problematic in a Dockerised Amazon ECS environment.
Ultimately, the decision was made to shelve the Dockerised approach. Instead, Skyscanner rehosted the application’s existing VMs to AWS using a toolkit that DevOpsGroup had devised.
This toolkit utilized open source tooling, including Troposphere and some custom developed libraries, to automate the creation of AWS CloudFormation templates during a lift and shift. The GET endpoint was rewritten as a Java Dropwizard application backed by Amazon S3.
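To give a sense of what automating AWS CloudFormation template creation for a lift and shift can look like, here is a minimal Python sketch. It is not the DevOpsGroup toolkit, and it builds the template from plain dictionaries using only the standard library rather than Troposphere. The inventory format (`name`, `image_id`, `instance_type`) and the function name are assumptions made for the example.

```python
import json
from typing import Dict, List


def build_template(vms: List[Dict[str, str]]) -> Dict:
    """Build a CloudFormation template dict with one EC2 instance per VM."""
    resources = {}
    for vm in vms:
        # Logical IDs must be alphanumeric, so drop any other characters.
        logical_id = "".join(
            ch for ch in vm["name"] if ch.isascii() and ch.isalnum()
        )
        resources[logical_id] = {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": vm["image_id"],
                "InstanceType": vm["instance_type"],
                "Tags": [{"Key": "Name", "Value": vm["name"]}],
            },
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Rehosted VMs (illustrative sketch)",
        "Resources": resources,
    }


def render(vms: List[Dict[str, str]]) -> str:
    """Serialize the template to JSON, ready to hand to CloudFormation."""
    return json.dumps(build_template(vms), indent=2, sort_keys=True)
```

Generating templates from an inventory in this way keeps each rehosted VM described in code, which makes the result repeatable and reviewable. A library such as Troposphere offers the same idea with typed resource classes.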
This example underlines the scale of the complexities involved in cloud migrations. It reinforces the need for a phased approach, where learnings are harnessed and used to inform the next stage of work.
Strings-as-a-Service was the first project Skyscanner attempted to evolve in this way, and the extent of its incompatibility with Docker was impossible to predict.
When faced with this challenge, the team acted pragmatically and took an alternative approach without undue impact on progress.
Cloud-native is a complex and ever-changing concept. Whether you’re building a new service in the cloud, or migrating existing applications, you need to be aware of this and make decisions accordingly.
When it comes to migrations, it’s important to note that full rewrites are inherently time-consuming. For critical applications close to the core value stream or underpinning competitive differentiation, a cloud-native rewrite may be worth the investment.
For others, approaches such as rehosting and re-platforming offer a perfectly acceptable shortcut and, in some cases, lift and shift is the most feasible option. That’s fine, providing the applications are revisited and modernized later.
There is no one-size-fits-all answer to the cloud-native question. Remembering that cloud-native is a continuum helps maintain perspective and keeps business objectives front-of-mind.
As with any major redevelopment, resources need to be applied intelligently. Establish your vision, and then develop a roadmap to achieve it. This strategy helped Skyscanner keep its ambitious migration on track, enabling teams to swiftly switch from a Plan A rewrite to a Plan B alternative when it looked like targets might be missed.
Whether you’re working towards all-in deployment, large-scale migration, or a gradual shift towards cloud-native principles, it’s important to identify what matters most to your business.
Consider where the greatest commercial benefits can be realized, and where the likely challenges lie. From this vantage point, you can make focused and logical decisions. It may be more appropriate—and beneficial—to address underlying issues like overdue technical debt before undertaking extensive rewrites.
In the digital economy, businesses have to continually evolve and modernize to remain relevant and satisfy customer demands. It’s important that cloud adoption strategies are rooted in this understanding. The ability of IT to adapt and scale has a direct bearing on future business success.
DevOpsGroup – APN Partner Spotlight
DevOpsGroup is an APN Advanced Consulting Partner. They work with global enterprises and offer digital transformation services based on DevOps practices and principles, underpinned by agile software development, to develop high-performing IT teams.