Transforming Credit Risk Analysis with AWS Fargate and AWS Step Functions at Société Générale
This post was co-written with Soufiane Matine, Senior Expert Architect at Société Générale, and Thierry Wilmot, IT Managing Director at Société Générale.
One of the many benefits of using the AWS Cloud is the ability to access resources on demand in seconds. That’s why many customers are using the power and flexibility of the cloud to solve use cases that require high computing power for a potentially short period of time. This blog post gives you a real-world example of how one of Europe’s leading banking institutions, Société Générale, is using AWS to solve the need for flexible, scalable, and cost-effective compute capacity.
About corporate IT at Société Générale
The corporate IT service unit of Société Générale serves business units in the areas of Finance, Risk, Compliance, Human Resources, Communication, Real Estate, and Purchasing.
As part of the transformation program of this central department and five years after the in-house design of the first credit risk calculator, the team reached a new milestone in December 2020 by deploying the first risk computing engine built on AWS services.
At the launch of this initiative, management wanted to build a new generation of computing engines so that as many users as possible could benefit from the experience the team had acquired over the years. To do this, they introduced a new type of cost-effective solution, oriented toward the end customer. The first prototype was built in a few months and was tested at capacities of up to 1,800 nodes and 10 GB of data exchanged per run to meet the needs of Société Générale. The solution was then deployed in production at the end of 2020. Its computational results are comparable with those of previous generations, and it is accessible through an API and open to integration with several development languages and big data solutions.
This offer, intended for Tech Leads and Developers, includes a set of automated deployment and starter kit services for batch or real-time processing. It fully decouples storage from processing to address Société Générale’s new simulation needs. Thanks to its Multi-AZ architecture, it raises resilience to the level of the best market standards. It reduces development times by 20% and operating costs by up to 7x, thanks to a pay-per-use model.
In 2021, two internal customers will benefit from this new solution. Société Générale teams are also planning the complete migration of credit risk calculators and the roll-out of the offer to new business uses (Compliance and ALM).
Société Générale teams are considering an even more generic offer to be rolled out during the next two years, with a generalization of the calculation algorithms, allowing them to onboard new use cases without custom developments.
After a year of development, AWS is able to provide our customers with a computing grid for real-time needs with high call frequency on a low volume, and a batch processing capability for more complex needs.
We anticipate a ramp up of this new solution in the coming months to serve our simulation challenges in the field of credit risk initially, then more broadly in finance, liquidity, and solvency.
Business and Technological Context
First, it was necessary to address the increase in simulation capacities to better support the risk businesses in precisely identifying the credit risk impacts of economic events, for example, the COVID-19 pandemic, Brexit, raw materials availability, and others.
These operations require a large number of processors because the computation is often carried out in parallel. With sufficient computing power, a run may take 30 minutes. Once the run is complete, the results can be stored and no additional IT resources are needed. On-premises scaling for this type of operation is not optimal, as customers must purchase instances to have the capacity to meet peak needs. Recent events have shown that it is very difficult to plan effectively for IT resource requirements.
A second objective was to offer simple access to these new computing capacities, usable by the greatest number of people, and to reduce development time by offering developers and tech leads a set of white-label applications.
The third objective was to offer more power at the same cost.
This is why in 2020, the “Computing Farm” was migrated to the AWS Cloud. The team identified the following requirements:
- Reduced costs: the new IT farm must have minimal running and operating costs
- Minimization of refactoring: Société Générale has already developed calculation engines that should be able to operate in the new architecture with minimal refactoring effort
- Constant execution time: the execution times of each calculation operation must be constant and must not increase when several calculation operations are requested in parallel
- Secure: each execution must be fully auditable
- (Almost) infinite scalability: the limit of maximum parallel executions must be as high as possible
- Flexible design: the “Computing Farm” must allow the deployment of new computing engines in a standard and reproducible solution
The following diagram shows the general architecture of the solution on AWS:
End users interact with the system through a REST API, exposed through a private API Gateway. Internal users will need to upload all of the data needed for the calculation to a private Amazon S3 bucket. They will then invoke the calculation engine of their choice via the private API. There are two types of calculation engines: real time (synchronous) and batch (asynchronous). Compute engines are triggered by an AWS Lambda function.
Real-time processing runs in seconds and uses Amazon ECS to run containers on AWS Fargate. AWS Fargate eliminates the need to provision and manage servers – it lets you specify and pay for resources per application, and improves security by isolating applications by design. Because these requests can be frequent, the containers are always active and are associated with an Auto Scaling Service, which can add and remove containers based on CPU consumption.
The batch processing type takes more time to be completed, and is therefore managed via a job queue, where end users send the calculation request to the API, and in response obtain a job ID. Users can query the application again through the API Gateway to find out the status of the batch job. Job execution is delegated to AWS Step Functions, a serverless function orchestrator that facilitates the orchestration of AWS Lambda functions and multiple AWS services.
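The submit-then-poll contract described above can be sketched as follows. This is a minimal illustration, not Société Générale's actual API: the `JobStatus` values, the `JobStatusResponse` shape, and the function names are hypothetical. It assumes the status endpoint simply translates the status of the underlying Step Functions execution into a simpler status for the caller.

```typescript
// Execution statuses that Step Functions reports for a standard workflow.
type ExecutionStatus = "RUNNING" | "SUCCEEDED" | "FAILED" | "TIMED_OUT" | "ABORTED";

// Hypothetical, simplified statuses exposed to API callers.
type JobStatus = "IN_PROGRESS" | "DONE" | "ERROR";

interface JobStatusResponse {
  jobId: string;
  status: JobStatus;
}

// Translate the execution status into the status returned by the API.
function toJobStatus(executionStatus: ExecutionStatus): JobStatus {
  switch (executionStatus) {
    case "RUNNING":
      return "IN_PROGRESS";
    case "SUCCEEDED":
      return "DONE";
    default:
      // FAILED, TIMED_OUT, and ABORTED all surface as an error to the caller.
      return "ERROR";
  }
}

// Build the body that a status request for a given job ID would return.
function describeJob(jobId: string, executionStatus: ExecutionStatus): JobStatusResponse {
  return { jobId, status: toJobStatus(executionStatus) };
}
```

A client would keep the job ID returned at submission and call the status endpoint until it no longer reports `IN_PROGRESS`.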
Let’s take a closer look at batch processing. With AWS Step Functions, we can describe the workflow and integrate other AWS services as needed. In this case, we are using Amazon ECS tasks as the primary execution engine. We first need to run some preliminary analysis on the input data, which allows us to do some preprocessing and, most importantly, split the input into smaller parts that can be processed independently and in parallel. We use the Map state to run each task in parallel, with a configurable concurrency setting, as shown in the following AWS Step Functions diagram:
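As a rough sketch of such a workflow, the following TypeScript builds an Amazon States Language definition with a prepare step, a Map state, and an aggregation step. The state names, the `$.chunks` path, the ARNs, and the choice of Lambda functions for the prepare and aggregate steps are assumptions made for illustration; the post does not describe the actual definition.

```typescript
// Sketch of a batch workflow in Amazon States Language (ASL), expressed as a
// TypeScript object. State names, "$.chunks", and all ARNs are placeholders.
function buildBatchWorkflow(maxConcurrency: number) {
  return {
    Comment: "Prepare the input, process each chunk in parallel, then aggregate",
    StartAt: "Prepare",
    States: {
      // Assumed to return { "chunks": [...] }, one entry per independent part.
      Prepare: {
        Type: "Task",
        Resource: "arn:aws:lambda:REGION:ACCOUNT_ID:function:prepare-input",
        Next: "ComputeChunks",
      },
      // The Map state runs the inner workflow once per element of "$.chunks".
      ComputeChunks: {
        Type: "Map",
        ItemsPath: "$.chunks",
        MaxConcurrency: maxConcurrency, // 0 means no limit on concurrency
        ItemProcessor: {
          StartAt: "RunEngine",
          States: {
            RunEngine: {
              Type: "Task",
              // ".sync" makes Step Functions wait until the ECS task stops.
              Resource: "arn:aws:states:::ecs:runTask.sync",
              Parameters: {
                LaunchType: "FARGATE",
                Cluster: "arn:aws:ecs:REGION:ACCOUNT_ID:cluster/compute-farm",
                TaskDefinition: "risk-engine",
                // Network configuration omitted for brevity.
              },
              End: true,
            },
          },
        },
        Next: "Aggregate",
      },
      Aggregate: {
        Type: "Task",
        Resource: "arn:aws:lambda:REGION:ACCOUNT_ID:function:aggregate-results",
        End: true,
      },
    },
  };
}

// The JSON string is what would be supplied as the state machine definition.
const batchWorkflowJson = JSON.stringify(buildBatchWorkflow(100), null, 2);
```

Passing the concurrency as a parameter keeps the same definition usable for engines with different parallelism needs.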
Since we are using Amazon ECS tasks, we pass the input parameters to the application as environment variables, which are dynamically set by the prepare step. At this point, each part runs as an AWS Fargate task (this is configurable), and after all the operations are complete, a final state gathers the results of each calculation into a private S3 bucket.
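One way to set environment variables dynamically is through the container overrides of the Amazon ECS task state, where a key ending in `.$` tells Step Functions to read the value from the state input. The sketch below assumes that each item produced by the prepare step carries `inputUri` and `outputUri` fields; those field names, the variable names, the container name `engine`, and the ARNs are all hypothetical.

```typescript
// Build ECS environment overrides whose values are read from the state input.
// Each entry maps an environment variable name to a JSONPath expression.
function envFromInput(mapping: Record<string, string>) {
  return Object.entries(mapping).map(([name, path]) => ({
    Name: name,
    "Value.$": path, // the ".$" suffix marks the value as a path, not a literal
  }));
}

// Task state that runs one engine container on Fargate (placeholder values).
const runEngineState = {
  Type: "Task",
  Resource: "arn:aws:states:::ecs:runTask.sync",
  Parameters: {
    LaunchType: "FARGATE",
    Cluster: "arn:aws:ecs:REGION:ACCOUNT_ID:cluster/compute-farm",
    TaskDefinition: "risk-engine",
    NetworkConfiguration: {
      AwsvpcConfiguration: {
        Subnets: ["subnet-PLACEHOLDER"],
        AssignPublicIp: "DISABLED",
      },
    },
    Overrides: {
      ContainerOverrides: [
        {
          Name: "engine", // must match a container name in the task definition
          Environment: envFromInput({
            INPUT_S3_URI: "$.inputUri",
            OUTPUT_S3_URI: "$.outputUri",
          }),
        },
      ],
    },
  },
  End: true,
};
```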
Transforming Société Générale code to run on AWS Fargate was pretty straightforward. We needed to remove all of the complex workload distribution logic first because it’s now completely replaced by AWS Step Functions. Next, we needed to be able to use Amazon S3 instead of the local file system to access and write the results of the calculations. Finally, we added the ability to use environment variables to configure all I/O for each run.
This setup is completely serverless – no operating system patches, and no instances to maintain and run. Plus, the compute runs are independent of each other, and by using container tags and different stage functions, we can even run different versions of the engines in parallel, test faster, and reduce the chance of regressions. A multitude of metrics are also made available from AWS services, and we can track execution times, failures, number of requests, etc.
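On the container side, reading the I/O configuration from environment variables could look like the sketch below. The post does not say which language the engines are written in, so TypeScript is used purely for illustration, and the variable names `INPUT_S3_URI` and `OUTPUT_S3_URI` are hypothetical.

```typescript
interface RunConfig {
  inputUri: string;
  outputUri: string;
}

// Read the I/O locations for this run from environment variables and fail
// fast if one is missing, so a misconfigured task stops before computing.
function loadRunConfig(env: Record<string, string | undefined>): RunConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) {
      throw new Error(`Missing required environment variable ${name}`);
    }
    return value;
  };
  return {
    inputUri: required("INPUT_S3_URI"),
    outputUri: required("OUTPUT_S3_URI"),
  };
}

// Split an "s3://bucket/key" URI into the bucket and key an S3 client needs.
function parseS3Uri(uri: string): { bucket: string; key: string } {
  const match = /^s3:\/\/([^/]+)\/(.+)$/.exec(uri);
  if (!match) {
    throw new Error(`Not an S3 URI: ${uri}`);
  }
  return { bucket: match[1], key: match[2] };
}
```

In a Node.js container, the entry point would call `loadRunConfig(process.env)` once at startup and pass the parsed locations to the code that reads from and writes to Amazon S3.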
Here are some tips when using Step Functions with Amazon ECS:
- The amount of data that can be passed between states is governed by limits – using Map-type states with Amazon ECS can produce large volumes of output data. Make sure to use ResultPath to control what information should be passed through the state machine
- Use environment variables to pass data between Step Functions states and Amazon ECS tasks – if you need to pass large configurations, use Amazon S3 files as the data bus
- When running many tasks in parallel with AWS Fargate, you may hit the service limits that protect the proper functioning of the Amazon ECS service – you can set a Retry policy so that Step Functions automatically retries failed tasks
- If your tasks are long, you might consider using Heartbeat to have Amazon ECS tasks send a signal during processing to tell Step Functions that the task is still active
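The tips above can be combined in a single task state, sketched here under several assumptions: the numeric values, the ARNs, and the container and variable names are placeholders. Heartbeats are sent with a task token, so this variant uses the `.waitForTaskToken` integration and hands the token to the container, which is then responsible for calling `SendTaskHeartbeat` during processing and `SendTaskSuccess` or `SendTaskFailure` at the end.

```typescript
const resilientTaskState = {
  Type: "Task",
  // Callback pattern: the state waits until the container reports back.
  Resource: "arn:aws:states:::ecs:runTask.waitForTaskToken",
  HeartbeatSeconds: 300, // fail the state if no heartbeat arrives for 5 minutes
  TimeoutSeconds: 3600, // must be larger than HeartbeatSeconds
  Parameters: {
    LaunchType: "FARGATE",
    Cluster: "arn:aws:ecs:REGION:ACCOUNT_ID:cluster/compute-farm",
    TaskDefinition: "risk-engine",
    NetworkConfiguration: {
      AwsvpcConfiguration: { Subnets: ["subnet-PLACEHOLDER"] },
    },
    Overrides: {
      ContainerOverrides: [
        {
          Name: "engine",
          Environment: [
            // The task token comes from the Step Functions context object.
            { Name: "TASK_TOKEN", "Value.$": "$$.Task.Token" },
            // Large configuration stays in S3; only its location is passed.
            { Name: "CONFIG_S3_URI", "Value.$": "$.configUri" },
          ],
        },
      ],
    },
  },
  // Discard the task result so the Map output stays small.
  ResultPath: null,
  Retry: [
    {
      ErrorEquals: ["States.TaskFailed"],
      IntervalSeconds: 10,
      MaxAttempts: 5,
      BackoffRate: 2,
    },
  ],
  End: true,
};

// Delay before each retry attempt: IntervalSeconds * BackoffRate^(attempt - 1).
function retryDelays(intervalSeconds: number, backoffRate: number, maxAttempts: number): number[] {
  return Array.from({ length: maxAttempts }, (_, attempt) => intervalSeconds * Math.pow(backoffRate, attempt));
}
```

With the values above, `retryDelays(10, 2, 5)` gives waits of 10, 20, 40, 80, and 160 seconds, which spreads retries out when many parallel tasks are throttled at once.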
The previous architecture illustrates one possible configuration of a specific risk calculation engine. How can this solution be made easily reproducible and flexible enough to work with different engines at scale? We decided to do it all with the AWS Cloud Development Kit (AWS CDK), an open-source software development framework to model and provision your cloud application resources using familiar programming languages. With AWS CDK, it is possible to create higher-level constructs that can use and configure multiple services in a stack, thereby increasing the reuse of components by different teams.
For example, for some compute engines, we will need to use EC2 instead of Fargate – with Spot Instances whenever possible. We implemented a standard state machine in AWS CDK that creates a Spot Fleet on the fly. The Spot Fleet generator state machine is provisioned and abstracted as an npm module that can be reused between teams and deployed to different AWS accounts – all you need is to import the module and include it in the CDK stack, with a single line of code that looks like this in TypeScript:
All security, configuration of Lambda functions, retry policy in case of error, and tests are already integrated in the module and simplified for the user.
In this article, we shared how Société Générale is leveraging different AWS technologies to create high-performance, configurable compute engines. Going forward, the team plans to test instances based on ARM Graviton2 processors for better price / performance and leverage additional native AWS services to build future platforms to meet their business needs.