A serverless architecture for high performance financial modelling

Contributions from Blythe Walker, CTO RenaissanceRe; Colum Thorne, VP Platform Architecture RenaissanceRe; Eoin Shanaghy, CTO fourTheorem; Luciano Mammino, Senior Cloud Architect fourTheorem; Matthew Meckes, Senior Serverless Specialist AWS

RenaissanceRe is one of the world’s leading reinsurance companies, consistently recognized for our innovation, technical excellence, and creative problem-solving. We specialize in matching well-structured risks with efficient sources of capital to help companies and government entities manage the risks of operating in a volatile and uncertain world, including climate change, natural hazards like wildfires and hurricanes, cyber threats, and significant societal upheaval. In 2021 gross premiums were $8B with around $18B of capital under management.

Understanding deal and portfolio risk and capital requirements is a computationally expensive process that requires the execution of multiple financial forecasting models every day and often in real-time. The company invests heavily in developing and refining its modeling expertise and technology to ensure that it can provide world-class services to its customers.

Understanding the risk portfolio requirements

The process of computing market risk exposure is called the portfolio rollup. This process executes Monte Carlo-based risk models and simulations to compute the market exposure across all deals in the portfolio.

The portfolio is a graph that comprises over 7,000 deals with a complex dependency relationship between deals.

RenaissanceRe ran the portfolio rollup as a batch computation processing approximately 45TB of data and producing 600GB of raw analytic output for every run. During typical operations, the requirement is to run around 5 rollup runs per day.

Besides batch-based risk computation, the business requires real-time deal analytics. This mode of operation provides the capability to run risk models on individual deals so that underwriters can price in near real-time, for example, when they are on the phone with a broker. While deal analytics operates on a fraction of the data, it must run at a much higher frequency – thousands of deal analytics runs per day, at peak times.

The on-premises system

RenaissanceRe implemented the previous incarnation of their risk modelling system on-premises.

Figure 1: A Python-based web server distributed jobs to a large Celery cluster through RabbitMQ, using Redis, a Network File Share, and a high-performing SAN for state management and data persistence. The Celery cluster comprised 9 high-specification servers (512GB of RAM and 48 CPU cores).

Because of strong business growth, the number of contracts under risk has grown significantly since this solution was first implemented. This placed considerable stress on the available on-premises compute resources. Specifically:

At current volumes, the on-premises system could run a maximum of only 2-3 runs of portfolio rollups per day, each of which could take up to 10 hours. This limited the ability to re-assess the risk position as the portfolio evolved.
Deal analytics ran in the same cluster, causing resource contention. This required careful coordination among staff and caused frustration, limiting our business agility.
There was a growing requirement to run multiple, isolated, workloads to model additional market exposure scenarios. This could not be done with the on-premises compute and the available resource pool.
The overall total cost of ownership (TCO) was high. Adding capacity to the existing cluster required a lot of upfront capacity planning.

The road to AWS and serverless

In 2020 RenaissanceRe decided to stop investing in the on-premises cluster, and to re-imagine the system on AWS. While it would certainly have been possible to lift and shift this workload to EC2, we instead adopted a ‘serverless first’ approach for the migration based on these six high-level goals:

Reduce overall execution times for portfolio rollup, with a target execution time to be close to 1 hour.
Support higher volume of executions (plan for 15x data volume).
Benefit from scale on-demand and reduce costs through a pay-per-use pricing model.
Reduce undifferentiated heavy lifting and lower the overall TCO.
Adopt Infrastructure as Code (IaC) and CI/CD pipelines for reliable, repeatable automated deployments.
Increase business agility and support rapid innovation.

The solution

Achieving these goals required the team to adopt a design that is radically different from the previous on-premises system. Figure 2 illustrates the architecture.

Figure 2: Serverless architecture for the risk modelling system with batch workflows using AWS Fargate and real time analytics using AWS Lambda.

The two primary flows are the bulk / batch computation (yellow) for portfolio rollup and the high priority real-time (blue) for deal analytics. Both types execute the same modelling codes and are controlled by the same orchestration components.

When a portfolio rollup or a deal analytics execution starts, we submit it to the execution planner, which is implemented in AWS Step Functions. The planner builds the execution plan for the run, stores it in Amazon ElastiCache for Redis, and begins execution of the jobs that can run immediately.

We post jobs that are ready to the Amazon Kinesis ‘Job’ stream and from there, we send the jobs to either AWS Fargate or AWS Lambda for execution. The Fargate containers and Lambda modelling functions run the same modelling code, but the two services offer different scaling and pricing characteristics. Lambda provides rapid on-demand scaling and execution, but at a higher price point, which makes it ideal for real-time deal analytics. At the scale of a full portfolio rollup, Lambda is more expensive and instant scale-out is not required, so we compute these jobs on Fargate. This can significantly lower the cost of bulk modelling, at the expense of some raw throughput lost to container scale-out time.

On job completion, we send an event to the ‘Job State Change’ Kinesis stream, which is picked up by the Lambda scheduler. The scheduler queries and updates the execution plan and submits further jobs that can now be run to the job stream.

We use Amazon Simple Storage Service (Amazon S3) as the primary storage service, creating a computational data lake, supplying input model parameters, exposures, loss and contract data files. Results from both portfolio rollup and deal analytics are written to S3 for downstream consumption by reporting and line-of-business systems.

Execution planner

The execution planner sits at the head of the process. We provide it with a set of deals to compute, which could be as small as a single deal for a deal analytic, or up to seven thousand deals for a full portfolio rollup. There is a complex interdependency between deals: some deal computations take the results of other deals as inputs, so those upstream deals must complete first. The first step in planning the execution is to build a directed acyclic graph (DAG) of deal dependencies (the deal DAG). Once the deal DAG has been constructed, the scheduler then computes an execution plan by walking the DAG.

We consult the execution plan throughout the run, in order to determine which jobs are available for execution. Execution begins with the set of jobs at the start of the chain, i.e., those that have no dependencies.
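
The planning step can be illustrated with a minimal Python sketch. It is not the production implementation: the function names, the shape of the `dependencies` input, and the choice of Kahn's algorithm for walking the DAG are all assumptions made for illustration. The sketch groups deals into "waves", where the first wave contains the deals that have no dependencies.

```python
from collections import defaultdict


def build_deal_dag(dependencies):
    """Build upstream/downstream adjacency maps for the deal DAG.

    `dependencies` (hypothetical input shape) maps a deal id to the
    deal ids whose results it needs as inputs.
    """
    upstream = {deal: set(deps) for deal, deps in dependencies.items()}
    # Deals that only appear as dependencies still need a node in the graph.
    for deps in list(upstream.values()):
        for dep in deps:
            upstream.setdefault(dep, set())
    downstream = defaultdict(set)
    for deal, deps in upstream.items():
        for dep in deps:
            downstream[dep].add(deal)
    return upstream, dict(downstream)


def plan_execution(dependencies):
    """Walk the DAG (Kahn's algorithm) and return deals grouped in waves.

    Every deal in a wave depends only on deals from earlier waves.
    """
    upstream, downstream = build_deal_dag(dependencies)
    remaining = {deal: len(deps) for deal, deps in upstream.items()}
    wave = sorted(deal for deal, count in remaining.items() if count == 0)
    waves, planned = [], 0
    while wave:
        waves.append(wave)
        planned += len(wave)
        next_wave = []
        for deal in wave:
            for child in downstream.get(deal, ()):
                remaining[child] -= 1
                if remaining[child] == 0:
                    next_wave.append(child)
        wave = sorted(next_wave)
    if planned != len(upstream):
        raise ValueError("dependency cycle detected: input is not a DAG")
    return waves
```

For example, with deals B and C depending on A, and D depending on both B and C, `plan_execution` returns three waves: A first, then B and C together, then D.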

Job scheduling, caching and error handling

We give each job an execution priority during the planning phase. Typically, we assign a high priority to deal analytics and a lower priority to portfolio rollups.

We place standard-priority jobs into an Amazon Simple Queue Service (Amazon SQS) queue for execution by a Fargate modelling container. During a full run, there will be up to 3,000 containers executing in parallel. High-priority jobs are submitted to Lambda for execution, which provides immediate scale on demand.
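
The routing rule described above can be sketched in a few lines of Python. The `Job` type, the `run_type` values, and the target names are hypothetical, and the mapping encodes only the typical case (deal analytics as high priority, portfolio rollups as standard priority). No AWS calls are made here.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    HIGH = "high"
    STANDARD = "standard"


@dataclass(frozen=True)
class Job:
    job_id: str
    run_type: str  # hypothetical values: "deal_analytics" or "portfolio_rollup"


def assign_priority(job: Job) -> Priority:
    """Typical assignment: deal analytics high, everything else standard."""
    if job.run_type == "deal_analytics":
        return Priority.HIGH
    return Priority.STANDARD


def route(job: Job) -> str:
    """Return the (hypothetical) name of the execution target for a job."""
    if assign_priority(job) is Priority.HIGH:
        return "lambda"  # immediate scale on demand
    return "fargate_queue"  # queued for a Fargate modelling container
```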

As each job completes, we write results to S3 and update the ‘job state change’ Kinesis Stream. The scheduler Lambda functions take input from the change stream and update the job state cache which is held in ElastiCache for Redis. This dynamically updates the overall execution plan for the run. As the run proceeds, the scheduler consults the execution plan to identify jobs that have satisfied their upstream dependencies. These jobs are submitted for execution.
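
The scheduling loop can be sketched as follows. This is a simplified illustration under stated assumptions: an in-memory dictionary stands in for the job state cache held in ElastiCache for Redis, the state names are invented, and every dependency is assumed to appear as a job in the plan. The class and method names are hypothetical.

```python
class ExecutionPlan:
    """In-memory stand-in for the execution plan and job state cache."""

    def __init__(self, dependencies):
        # dependencies: job id -> ids of the jobs it depends on
        self.upstream = {job: set(deps) for job, deps in dependencies.items()}
        self.state = {job: "PENDING" for job in self.upstream}

    def ready_jobs(self):
        """Pending jobs whose upstream dependencies have all completed."""
        return sorted(
            job
            for job, deps in self.upstream.items()
            if self.state[job] == "PENDING"
            and all(self.state[dep] == "COMPLETED" for dep in deps)
        )

    def submit_ready(self):
        """Mark ready jobs as submitted and return them for execution."""
        ready = self.ready_jobs()
        for job in ready:
            self.state[job] = "SUBMITTED"
        return ready

    def on_job_completed(self, job_id):
        """Handle a job state change event, then release unblocked jobs."""
        self.state[job_id] = "COMPLETED"
        return self.submit_ready()
```

A run would start with `submit_ready()`, which releases the jobs with no dependencies. Each completion event then calls `on_job_completed`, which returns the jobs that have just become runnable.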

ElastiCache is also used to cache results. If a deal has been recently computed with the same input parameters, we use the cached result and update the execution plan as if we had run the job.
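
One way to implement this kind of result reuse is to key the cache on a hash of the deal and its input parameters. The sketch below assumes that approach; the key format, the function names, and the use of a plain dictionary in place of ElastiCache are illustrative assumptions. Cache expiry, which "recently computed" implies, is left out.

```python
import hashlib
import json


def cache_key(deal_id, params):
    """Derive a stable key from the deal id and its input parameters."""
    payload = json.dumps({"deal": deal_id, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compute_or_reuse(deal_id, params, cache, run_model):
    """Return (result, was_cached); run the model only on a cache miss.

    `cache` is any dict-like store and `run_model` is a hypothetical
    callable that takes (deal_id, params) and returns the result.
    """
    key = cache_key(deal_id, params)
    if key in cache:
        return cache[key], True
    result = run_model(deal_id, params)
    cache[key] = result
    return result, False
```

Sorting the JSON keys makes the hash independent of the order in which parameters are supplied, so equivalent inputs map to the same cache entry.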

Occasionally, failures occur during job execution, such as a Fargate container crash. This causes an error event to be written to Amazon EventBridge. A handler function logs the error condition and then places an update into the job state stream. The job scheduler resubmits the failed job for execution up to three times before marking the job as permanently failed. If jobs in a run repeatedly fail to execute, the entire run is halted and an alert is raised.
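
The retry rule can be expressed as a small decision function. This is a sketch only: the state names and the in-memory `attempts` counter are assumptions, and the logic for halting a whole run is not modelled.

```python
MAX_RETRIES = 3  # resubmissions allowed before a job is permanently failed


def handle_failure(job_id, attempts):
    """Decide what to do with a failed job.

    `attempts` maps job id -> number of resubmissions so far (a
    hypothetical in-memory stand-in for scheduler state). Returns
    "RESUBMIT" while retries remain, otherwise "FAILED".
    """
    retries = attempts.get(job_id, 0)
    if retries < MAX_RETRIES:
        attempts[job_id] = retries + 1
        return "RESUBMIT"
    return "FAILED"
```

With this rule a job that keeps failing is resubmitted three times, and its fourth failure marks it as permanently failed.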

Challenges we overcame

Applying a serverless technology stack to financial modelling has proven to be beneficial to the business, but there were several key challenges to be overcome along the way.

AWS Fargate scaling

Fargate’s default scaling rules are well suited for a typical web application workload. However, they are not optimised for the high-scale batch computation characteristics of the portfolio rollup where the system needs to rapidly scale to thousands of containers.

To overcome this challenge, the solution uses a custom scaling algorithm that leverages the Amazon Elastic Container Service (ECS) RunTask API instead of the Fargate service scheduler. This required engagement with the Fargate team to increase limits.
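
The details of the custom scaling algorithm are not described here, so the following is only a sketch of the general idea: decide how many containers to add from the job backlog, cap the total at a limit, and split the launch into batches. The one-container-per-queued-job policy and the function names are assumptions. The batch size of 10 reflects the per-call task count limit of the RunTask API at the time of writing, and the default cap of 3,000 is the parallelism figure quoted earlier. The sketch only computes the batches and makes no AWS calls.

```python
def tasks_to_launch(queued_jobs, running_tasks, max_tasks=3000):
    """How many new containers to start for the current backlog.

    Assumes one container per queued job, capped so that the total
    number of running containers never exceeds `max_tasks`.
    """
    headroom = max(max_tasks - running_tasks, 0)
    return min(max(queued_jobs, 0), headroom)


def run_task_batches(total, batch_size=10):
    """Split a launch of `total` containers into per-call task counts."""
    full, rest = divmod(total, batch_size)
    return [batch_size] * full + ([rest] if rest else [])
```

Each entry in the list returned by `run_task_batches` would correspond to one RunTask request with that task count.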

Amazon S3 partitioning

Input data for a full portfolio rollup is around 20GiB, typically split across 15-20k individual files. During a full run the system will transfer data out of S3 at a rate of 100GiB/s. This high-scale use of S3 required us to structure the key-partitioning scheme carefully to ensure maximum request throughput. This was done with the support of the S3 product team.
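
The actual key scheme is not given here. A common technique for spreading requests across S3 key prefixes is to lead each key with a short hash-derived shard, and the sketch below assumes that technique. The key layout, the function name, and the default of 256 prefixes are all illustrative assumptions.

```python
import hashlib


def partitioned_key(run_id, file_name, prefix_count=256):
    """Build an object key that starts with a hash-derived shard prefix.

    The shard depends only on the file name, so the same file always
    maps to the same prefix, while different files spread across
    `prefix_count` prefixes.
    """
    digest = hashlib.sha256(file_name.encode("utf-8")).hexdigest()
    shard = int(digest[:8], 16) % prefix_count
    return f"{shard:02x}/{run_id}/{file_name}"
```

Because request throughput in S3 scales per prefix, distributing 15-20k input files across many prefixes lets read requests proceed in parallel rather than concentrating on a single prefix.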

Conclusion

The application of serverless technology to this domain has proven to be a great success and has allowed the system to meet a critical business requirement to support a range of computation types from large scale to individual real-time jobs. The ability to switch seamlessly between Lambda and Fargate depending on the execution context aligns costs with the required performance for each specific job.

There have already been some key business results from using the system.

Our full portfolio rollups complete in about 1 hour. The team is confident that they can reduce this further. We can run multiple ad hoc portfolio rollup scenarios (‘what if’ analyses) within known time and cost parameters.
Our deal analytics are faster and more consistent, because we have eliminated resource contention. This means that the business is not wasting time scheduling and prioritising runs.
We have reduced the overall codebase by about 70% – lowering total cost of ownership and removing responsibility for non-differentiating layers of the stack.

Our development teams now focus more on feature development, and less on hardware-specific optimization. That means the business is well positioned to support future portfolio growth, so we can expand our risk modelling capabilities free from on-premises hardware constraints.

AWS HPC Blog