How CoStar reduced compute costs by 90% through modernizing legacy .NET Applications with AWS Serverless
This is a guest post from Mark Osborn, Principal Software Engineer at CoStar Group.
CoStar Group is the leading provider of commercial real estate information in the world. You might know some of our more famous online marketplaces like Apartments.com or LoopNet. Dealing with commercial real estate information means handling a lot of high-fidelity photographs, videos and 3D tours of buildings.
Since 1987, we’ve been on a mission to create efficiency and transparency in commercial real estate. We take pride in empowering industry professionals with knowledge about over 5.9 million commercial real estate properties, 14 million professional photographs and 129 billion square feet tracked.
This blog post tells the story of how we evolved our image distribution API to become cloud-native, first by lifting and shifting it and then by containerizing it, which helped us save 90% on compute costs.
2011 – .NET Framework application hosted on Windows Server on-premises
CoStar’s image distribution API is used to apply on-the-fly watermarks to images and distribute them behind a CDN to all our products. Back in 2011 it was hosted in our data centers. The architecture at this time was an ASP.NET MVC application running on .NET Framework 4 atop IIS 7.5 on Windows Server 2008 R2.
For our storage needs, we had been using a third-party storage system installed in our data centers that was expensive to operate and slow to replicate data between our two data centers. The contract was up for renewal which served as a good impetus to consider something different and cloud-native. We evaluated multiple cloud storage solutions available at the time and Amazon Simple Storage Service (Amazon S3) was the perfect fit due to the cost, scalability, and performance it provides. This was the beginning of CoStar’s cloud journey.
Luckily the image distribution API mentioned above was already using popular coding patterns like dependency inversion and abstract factories. This enabled us to seamlessly transition from reading files from UNC network shares in our data centers to instead reading files from S3 buckets with simple refactoring.
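To illustrate why those patterns made the storage swap cheap, here is a minimal sketch in Python rather than our C# code. Every name in it (`ImageStore`, `ShareImageStore`, `BucketImageStore`, `create_store`, `watermark`) is hypothetical, and both stores are in-memory stand-ins: a real implementation would read from a UNC path or call an S3 client. The point is only that the watermarking code depends on the abstraction, so changing the backing store touches the factory and nothing else.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ImageStore(ABC):
    """Abstraction the API depends on instead of a concrete storage system."""

    @abstractmethod
    def read(self, key: str) -> bytes:
        """Return the raw bytes of the image identified by key."""


class ShareImageStore(ImageStore):
    """Hypothetical stand-in for a network-share implementation (in-memory here)."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = dict(files)

    def read(self, key: str) -> bytes:
        return self._files[key]


class BucketImageStore(ImageStore):
    """Hypothetical stand-in for an object-storage implementation (in-memory here)."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)

    def read(self, key: str) -> bytes:
        return self._objects[key]


def create_store(kind: str, contents: Dict[str, bytes]) -> ImageStore:
    """Factory: the only place that knows which concrete store is in use."""
    stores = {"share": ShareImageStore, "bucket": BucketImageStore}
    if kind not in stores:
        raise ValueError(f"unknown store kind: {kind}")
    return stores[kind](contents)


def watermark(store: ImageStore, key: str, mark: bytes) -> bytes:
    """Business logic: depends only on ImageStore, never on a concrete class."""
    return store.read(key) + mark


if __name__ == "__main__":
    images = {"lobby.jpg": b"<pixels>"}
    for kind in ("share", "bucket"):
        print(kind, watermark(create_store(kind, images), "lobby.jpg", b"<wm>"))
```

Switching `kind` from `"share"` to `"bucket"` is the whole migration as far as `watermark` is concerned.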
We used this opportunity to learn infrastructure as code for the first time, using AWS CloudFormation to create S3 buckets along with bucket policies to ensure that files couldn’t be deleted. We created IAM policies, roles and an IAM user with associated access keys so that the application running in our data centers could authenticate to AWS.
Throughout the next couple of years, we made some more small improvements to the API. We moved it to newer versions of the .NET Framework as they were released. We also switched our controllers and call stacks to be asynchronous where possible using C#’s async and await keywords. This led to some good scalability improvements as most of our API operations were IO-bound calls out to Amazon S3. As a result, we were no longer blocking threads whilst files were streamed from the cloud.
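The effect of awaiting IO instead of blocking on it can be sketched with Python's `asyncio`, whose `async` and `await` keywords play the same role as C#'s. This is an illustration and not our controller code: `fetch_image` is a hypothetical function, and `asyncio.sleep` stands in for the network wait on a storage read.

```python
import asyncio
import time
from typing import List, Tuple


async def fetch_image(key: str, delay: float = 0.05) -> bytes:
    """Hypothetical IO-bound read; the sleep stands in for waiting on the network."""
    await asyncio.sleep(delay)  # yields control instead of blocking a thread
    return f"bytes-of-{key}".encode()


async def handle_requests(keys: List[str]) -> List[bytes]:
    """Serve many requests concurrently on a single thread."""
    return await asyncio.gather(*(fetch_image(key) for key in keys))


def run_demo(keys: List[str]) -> Tuple[List[bytes], float]:
    """Run the batch and report how long it took in seconds."""
    start = time.perf_counter()
    results = asyncio.run(handle_requests(keys))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_demo([f"img-{i}" for i in range(20)])
    print(f"served {len(results)} images in {elapsed:.2f}s")
```

Because the waits overlap, twenty simulated reads finish in roughly the time of one, rather than twenty times as long.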
Still, the architecture was a little suboptimal as we streamed image files from Amazon S3 into our data centers where the application ran, and then back out to our cloud-hosted CDN.
2014 – Lift and Shift to Windows Server on Amazon EC2
With our acquisition of Apartments.com in early 2014, CoStar planned to re-launch a refreshed version of the product powered by CoStar’s unified data platform. This was the first time in our near-30-year history that we would have a product completely open and available to the public internet. The huge spike in expected traffic from having a widely-used public-facing marketplace like Apartments.com was going to require some scaling of our systems – particularly the image distribution API.
We looked for a quick solution to be able to achieve this in a way that didn’t require a lot of rearchitecting of the system. To understand how we did this, let’s take a deeper look at what that API consisted of in 2014.
At this point in the history of the application, it was an ASP.NET MVC website running on .NET Framework 4.7.2 atop IIS 8.5 on Windows Server 2012 R2. It still had the same responsibilities as at the start of our story – watermarking assets on the fly and serving as the origin to our CDN in order to geo-distribute cached assets. Since we desired a quick scalability win, we decided upon a lift-and-shift approach and chose to run the application up on Amazon EC2 using AWS Elastic Beanstalk.
We started with a very manual process of essentially choosing an AMI (Amazon Machine Image) that most closely resembled the spec of our in-house Windows Server VMs. We manually copied the application binaries into an EC2 instance launched from the AMI and then created our own custom image from the instance that could be used to roll out the application to two AWS Regions. We chose two AWS Regions – even with a CDN in front of the system – to allow for automatic failover of the CDN origin if there was ever an issue in one of the AWS Regions. Whilst this deployment process was very manual at first, it served us well for a few months as the application itself remained largely unchanged.
With the application running on Amazon EC2 and scaling elastically under load (based on a CPU trigger), we were able to shift traffic from our data centers to the Amazon EC2 environment by switching the origin of our CDN to the Elastic Beanstalk environment. After monitoring this environment for a few weeks without issue, we decommissioned 24 Windows Server VMs in our data centers.
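A CPU trigger of this kind boils down to a simple threshold rule. The sketch below is a simplified illustration of that idea, not Elastic Beanstalk's actual algorithm, and every number in it (the 70% and 30% thresholds, the instance limits, the step size) is an assumption chosen for the example rather than our production configuration.

```python
def desired_instance_count(
    current: int,
    avg_cpu_percent: float,
    *,
    scale_out_above: float = 70.0,  # assumed upper CPU threshold
    scale_in_below: float = 30.0,   # assumed lower CPU threshold
    minimum: int = 2,               # assumed floor
    maximum: int = 24,              # assumed ceiling
    step: int = 1,                  # assumed instances added or removed per evaluation
) -> int:
    """Return the instance count a threshold-based CPU trigger would ask for."""
    if avg_cpu_percent > scale_out_above:
        target = current + step
    elif avg_cpu_percent < scale_in_below:
        target = current - step
    else:
        target = current
    # Never leave the configured bounds, whatever the CPU reading says.
    return max(minimum, min(maximum, target))


if __name__ == "__main__":
    for cpu in (85.0, 50.0, 10.0):
        print(f"avg CPU {cpu}% -> {desired_instance_count(4, cpu)} instances")
```

The gap between the two thresholds keeps the environment from flapping when load hovers around a single value.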
As we got some new requirements in from product design to add support for overhead street maps to the new Apartments.com, we felt that the deployment process could benefit from some attention and automation. We gained comfort with tools like CloudFormation and a simple deployment pipeline was built by our DevOps team. This helped us automate the deployment of application code and EC2 image creation using our in-house Azure DevOps system (formerly Team Foundation Server).
With development of the new Apartments.com fully underway we started to make some code changes to the application to embrace the fact that it was now running in the cloud. The first simplification we leveraged was the ability to run the application under an IAM role rather than using access keys to run as an IAM user that assumed the role. No more rotation of access keys! Further simplifications were made to remove the file system implementations that talked to UNC file shares now that everything was being read from Amazon S3.
This elastically-scaled EC2 environment worked well for our Apartments.com launch later in 2014 and further expansion of our public-facing brands with ApartmentFinder in 2015, LoopNet in 2017 and ForRent.com in 2018. It wasn’t until 2019 that we decided to go a step further and re-architect the application to be truly cloud-native to improve scalability, elasticity and greatly reduce costs.
2019 – Containerize and run Serverless on AWS
By 2019 CoStar was making great progress on its cloud journey. We’d lifted and shifted a few other workloads to the cloud but were also deploying net-new cloud-native workloads to AWS Lambda and Amazon Elastic Container Service powered by AWS Fargate to leverage the power and simplicity of serverless environments. After realizing the operational benefits of containerized workloads on AWS, the next logical progression for our image distribution API was to containerize it. This would make the unit of elastic scale a lightweight container rather than a whole Windows Server VM.
We made some organization-wide decisions early on that any applications we chose to containerize should use Linux containers rather than Windows containers. This was mainly down to cost savings and a lighter weight footprint on Linux but also to enable us to standardize our support and patching structure to just a single OS. By 2019 .NET Core was starting to show its maturity as a highly-performant and highly-flexible cross-platform framework that CoStar – as a large C# / .NET shop – could start to transition to. With .NET Core being able to run on Linux in containerized environments it seemed like a good solution for us to transition from .NET Framework to .NET Core.
In fact this became our cloud migration strategy as a whole at CoStar. We would containerize any pre-existing applications to run in our Kubernetes clusters, then move those containerized applications up to Amazon ECS or Amazon Elastic Kubernetes Service (Amazon EKS) where appropriate.
Throughout 2018 we’d been hard at work on an internal framework at CoStar that enabled us to define business operations and RESTful endpoints in a way that was abstracted away from any hosting platform specifics. This framework was based upon ASP.NET Core 3.1 and the benefits of our abstraction enabled us to very easily run certain workloads on Windows, Lambda, Amazon ECS Fargate or EKS Fargate just with a few small host-level changes. We call this framework Neo, and it also enables our developers to easily switch between different APIs at CoStar, knowing that they are all written and function in a homogeneous way.
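Neo itself is internal and written in C#, so the following is not its API. It is a small Python sketch of the general idea, with every name (`OperationRequest`, `OperationResponse`, `get_image_info`, `lambda_style_host`, `container_style_host`) invented for illustration: the business operation knows nothing about its host, and each host is a thin adapter that translates its own input and output shapes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple
from urllib.parse import parse_qsl


@dataclass(frozen=True)
class OperationRequest:
    """Host-neutral request passed to every business operation."""
    path: str
    params: Dict[str, str]


@dataclass(frozen=True)
class OperationResponse:
    """Host-neutral response returned by every business operation."""
    status: int
    body: str


Operation = Callable[[OperationRequest], OperationResponse]


def get_image_info(request: OperationRequest) -> OperationResponse:
    """Hypothetical business operation with no knowledge of where it runs."""
    image_id = request.params.get("id")
    if not image_id:
        return OperationResponse(400, "missing id")
    return OperationResponse(200, f"image {image_id}")


def lambda_style_host(operation: Operation) -> Callable[..., dict]:
    """Adapter for a function-style host that passes an event dictionary."""
    def handler(event: dict, context: Optional[object] = None) -> dict:
        params = event.get("queryStringParameters") or {}
        response = operation(OperationRequest(event.get("path", "/"), params))
        return {"statusCode": response.status, "body": response.body}
    return handler


def container_style_host(operation: Operation) -> Callable[[str, str], Tuple[int, str]]:
    """Adapter for a web-server-style host that passes a path and a query string."""
    def handle(path: str, query_string: str) -> Tuple[int, str]:
        params = dict(parse_qsl(query_string))
        response = operation(OperationRequest(path, params))
        return response.status, response.body
    return handle


if __name__ == "__main__":
    print(lambda_style_host(get_image_info)(
        {"path": "/images", "queryStringParameters": {"id": "42"}}))
    print(container_style_host(get_image_info)("/images", "id=42"))
```

Moving the operation to a different host means writing or choosing a different adapter; `get_image_info` is untouched.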
Our image distribution API was a great contender for refactoring to use Neo due to it already being heavily abstracted and having fairly simple dependencies that were already .NET Core-compatible. In the space of about a week of work, we were able to refactor the image distribution API to use our Neo framework and thus change the underlying runtime from .NET Framework to .NET Core. At this point we were still running on Windows VMs via IIS, but instead using .NET Core rather than .NET Framework as the application runtime. We pushed this iteration of the application out to our pre-existing Amazon EC2 production environment.
Our next iteration was to build a Docker container of the application atop an ASP.NET Core base image built on Ubuntu Linux. After achieving this by leveraging our Neo framework, it was time to provision a new production environment in ECS.
Our DevOps team assembled some new Azure DevOps deployment pipelines. A build pipeline compiled the .NET Core service, ran unit tests and then built a Docker image before pushing it to our internal JFrog Artifactory repository. By this point in CoStar’s cloud journey, we had standardized on HashiCorp’s Terraform as our infrastructure tool of choice. So, an Azure DevOps release pipeline was also created that used Terraform to push the container up to Amazon Elastic Container Registry (Amazon ECR). This pipeline also took care of provisioning a cluster with the service, task definition and all the resources needed to run and elastically scale the containerized image distribution API in ECS Fargate.
We were able to efficiently A/B test the two parallel production environments – the old one in Amazon EC2 on Windows Server VMs and the new one in ECS Fargate on Linux containers. This also helped us appropriately size the containerized environment in order to ascertain what number and configuration of containers would be needed to match the equivalent Windows-in-EC2 environment. For sizing, we initially went with the same amount of RAM for each container as the equivalent Windows VM. After monitoring application RAM usage in the containerized environment using Amazon CloudWatch metrics and Datadog, we were able to reduce the RAM required thanks to the inherently leaner stack of .NET Core in a Linux container vs .NET Framework on a Windows Server VM.
2020 – Final traffic switch to a completely serverless implementation
In a short space of time, we were confident to switch our production environment over to the ECS version. Like our lift-and-shift back in 2014, this was as simple as changing the origin of our CDN to point to the new ECS environment. If we needed to roll back for whatever reason we could always point things back to the Amazon EC2 environment. Early one morning in February 2020, we made the switch and observed traffic slowly fall off from the Amazon EC2 environment and slowly ramp up on the ECS environment.
Whilst the application code hadn’t changed in this release (it was the .NET Core version of the application in both environments), we had swapped out the operating system under the application! However, the runtime compatibility guarantees of .NET Core, along with our Neo abstraction framework, meant that the application worked pretty much identically on both stacks.
We observed the ECS environment scale up from an initial 15 container instances to 18 as we hit our peak load of 1,300 requests per second for the day, maintaining a back-end response time of around 80ms. This was around 20ms more responsive than the .NET Framework version of the code.
Some final tweaks were made to the application environment in March of 2020. We enabled Fargate Spot for 50% of the instances to further reduce compute costs and also enabled S3 VPC endpoints so that our data egress costs were reduced too.
I’d like to close out our story with a comparison of costs. We’ll focus on just the compute costs to run the image distribution API, as that’s where we saw the most significant savings. AWS Cost Explorer was used to look at our Amazon EC2 compute costs from December 2019 and January 2020, which was just before we switched our production environment over to the ECS version.
|Timeframe|Tech Stack|Average Monthly Compute Cost|
|---|---|---|
|2011 – 2014|.NET Framework on Windows Server in CoStar data centers|N/A|
|2014 – 2019|.NET Framework on Windows Server in Amazon EC2|$14,000|
|2020 – Present|.NET Core on Linux in AWS ECS Fargate Spot|$1,100|
As can be seen above, the containerized version of the application is over 90% cheaper to run than its Amazon EC2 counterpart, and we saw increased performance as a result of the architectural shift, too! While it’s impossible to put numbers on the monthly cost to run the API when it was hosted in CoStar data centers (due to that being a sunk cost), there was an observable human cost in the maintenance and patching of the virtual machines that ran the system – something that is no longer a concern for us with Amazon ECS.
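For anyone who wants to check the headline figure, the arithmetic behind it is short. The two monthly costs below are the ones in the table; the function name `percent_saving` is just a label chosen for this sketch.

```python
def percent_saving(before: float, after: float) -> float:
    """Percentage reduction going from the 'before' cost to the 'after' cost."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


EC2_MONTHLY_USD = 14_000      # average monthly compute cost on Amazon EC2
FARGATE_MONTHLY_USD = 1_100   # average monthly compute cost on ECS Fargate Spot

saving = percent_saving(EC2_MONTHLY_USD, FARGATE_MONTHLY_USD)
print(f"{saving:.1f}% lower monthly compute cost")  # prints "92.1% lower monthly compute cost"
```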
“Over the years our image assets have gone from being measured from hundreds of megabytes to over a half of petabyte. Historically to support that type of growth resulted in the four dreaded “Rs” – re-platform, repurchase, rearchitect and retire. It was a costly and time consuming process we went through every 3 or 4 years. With AWS we finally have a platform that can easily grow with us and allow us to evolve our architecture to achieve cost efficiencies, scale and performance that were just not possible with our legacy solutions. One thing that makes AWS so great is as they iterate and introduce new features like Fargate Spot and S3 Intelligent-Tiering with minimal effort we can see massive cost savings without the worry of any sunk capital costs.”
– Andy Ventura, Senior Director & Chief Architect at CoStar Group
The image distribution API remains a core piece of functionality at CoStar, powering all our products with beautiful photography. Whilst its core function has remained largely constant, its compute location, runtime framework and even the underlying operating system have changed drastically over the last 10 years!
Mark is a Principal Software Engineer at CoStar Group where he has spent the last ten years creating APIs, building frameworks and evangelizing platforms and practices. He is a British ex-pat living in San Diego, CA where he enjoys listening to music, sipping a good cup of tea and lots of fun family time.