
2025

Chaos Engineering helps PayerMax achieve 99.99% system availability and significantly enhance the global merchant payment experience

Learn how PayerMax uses Chaos Engineering on Amazon Web Services, combined with expert resources, to improve service reliability and response time, enabling a more stable and smoother payment experience for global merchants.

70%

The frequency of system failures was reduced by approximately 70% year over year

80%

Failure recovery time was reduced by approximately 80% year over year

99.99%

System availability increased to over 99.99%

50%

The customer complaint rate decreased by more than 50% year over year

Overview

PayerMax is a fintech company rooted in emerging markets with a global business layout. It is dedicated to building professional global online payment solutions that give merchants a safer and more convenient one-stop payment experience. PayerMax currently covers mainstream local payment methods in emerging markets such as Southeast Asia, the Middle East, Latin America, and Central Asia. It enhances the payment experience of global merchants by optimizing its core system and improving the stability and reliability of its payment services, so that they can support continued growth in business scale and transaction volume. PayerMax implements Chaos Engineering on Amazon Web Services to identify risks in its payment system and to verify, after a problem has been addressed, that the underlying fault is actually resolved. The main resources and services used in this project included the Amazon Web Services Resilience Analysis Framework, AWS Fault Injection Service (AWS FIS), Amazon CloudWatch, Amazon EventBridge, and Amazon Simple Notification Service (Amazon SNS).


Opportunity | To cope with the rapid expansion of financial payment transaction volume and scale, PayerMax uses Chaos Engineering to improve system stability

As a global cross-border payment solution provider, PayerMax is deeply involved in emerging overseas markets such as Southeast Asia, the Middle East, Latin America, and Central Asia to provide competitive one-stop financial payment solutions. These solutions cover global acquisition, payment, and payout services in the fields of digital entertainment, gaming, e-commerce, internet finance, distance learning, etc.

As the business grows and transaction scale expands, the demand for stability becomes stronger: the acquisition and payout processes for cross-border payments are long, service invocations span both the order process and the capital process, and the call chains are complex. Meanwhile, to cover payment methods in various overseas regions and give customers a smooth local payment experience, PayerMax iterates quickly, releasing 2 versions per week on average. PayerMax therefore needs to sustain rapid iteration and frequent releases while ensuring the stability of its cross-border payment system as transaction volume and scale continue to grow. PayerMax focuses on improving stability in the following 3 areas:

  • Continuously monitor system behavior to discover weaknesses, reinforce the system in a timely manner, nip problems in the bud, and increase the system's Mean Time To Failure (MTTF);
  • Improve the ability to detect errors in payment services and recover from them quickly, so that payment services are restored promptly when faults occur;
  • Cultivate a technical culture in which engineers design systems for failure, build a comprehensive ability to cope with incidents, and raise the payment success rate in complex environments.
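The first two focus areas correspond to two standard reliability metrics: mean time to failure (the average healthy time before a failure) and mean time to recovery (the average outage duration). The sketch below shows one way these could be computed from an incident log. The `Incident` record and the timestamps are hypothetical illustrations, not PayerMax data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record: when service failed and when it recovered."""
    started: datetime
    resolved: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recovery: the average outage duration."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mttf(incidents: list[Incident], window_start: datetime) -> timedelta:
    """Mean time to failure: the average healthy time preceding each failure."""
    uptime = timedelta()
    healthy_since = window_start
    for inc in sorted(incidents, key=lambda i: i.started):
        uptime += inc.started - healthy_since
        healthy_since = inc.resolved
    return uptime / len(incidents)


def availability(mean_ttf: timedelta, mean_ttr: timedelta) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mean_ttf / (mean_ttf + mean_ttr)


# Made-up example: two 30-minute outages, each after 10 healthy days.
log = [
    Incident(datetime(2024, 1, 11, 0, 0), datetime(2024, 1, 11, 0, 30)),
    Incident(datetime(2024, 1, 21, 0, 30), datetime(2024, 1, 21, 1, 0)),
]
mean_ttf = mttf(log, window_start=datetime(2024, 1, 1))
mean_ttr = mttr(log)
ratio = availability(mean_ttf, mean_ttr)
```

Raising the first metric and lowering the second both push the availability ratio toward 1, which is why the two focus areas are pursued together.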

PayerMax and Amazon Web Services have been working together since 2018. Over these 6 years of collaboration, they have worked through multiple rounds of environment isolation, security cooperation, multi-cluster construction for databases and middleware, hot and cold storage construction, and multi-site deployment. Currently, PayerMax workloads are mainly deployed on Amazon Web Services. After discussing the requirements above with Amazon Web Services, PayerMax decided to improve reliability based on the AWS Well-Architected Framework and to pursue its stability goals through Chaos Engineering.


"By implementing Chaos Engineering, we have drawn on Amazon Web Services' stability best practices, which have been proven with millions of customers around the world, as well as its Resilience Analysis Framework methodology and expert resources. These help us identify potential weak points in the system in advance, reduce possible losses caused by system stability defects, and effectively exercise the team's ability to identify risks, locate faults, and quickly restore system services. We look forward to further cooperation with Amazon Web Services to jointly protect the stability and reliability of the global financial industry payment network!"

Fu Yangbiao
PayerMax CTO

Solution | From a resilient system analysis framework to a fault injection tool, PayerMax identifies and verifies potential system risks in 16 core subsystems based on Amazon Web Services' Chaos Engineering

Chaos Engineering deliberately injects faults into a system to expose potential weak points in advance, check the system's robustness, and reduce possible losses caused by stability defects. Chaos Engineering does not itself solve stability problems. Instead, it finds where the system's stability weaknesses or vulnerabilities lie. It can run fault tests at different levels, including the availability zone level, the database level, and the application level. The ultimate question is: what would bring down the whole business? The failure of a server, an availability zone, a database, or an application? If the answer to any of these is “yes”, then there is a stability risk.
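The question above can be framed as a check over a dependency model: fail one component or one availability zone at a time and see whether the business path still has everything it needs. The sketch below is illustrative only. The replica names, capabilities, and zone layout are hypothetical and do not describe PayerMax's architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str        # hypothetical server or pod name
    capability: str  # what it provides, e.g. "database"
    zone: str        # availability zone it runs in


def survives(replicas, required, failed_names=frozenset(), failed_zones=frozenset()):
    """Return True if every required capability still has a healthy replica."""
    healthy = {
        r.capability
        for r in replicas
        if r.name not in failed_names and r.zone not in failed_zones
    }
    return set(required) <= healthy


def stability_risks(replicas, required):
    """List the single failures (one replica or one zone) that break the business path."""
    risks = []
    for r in replicas:
        if not survives(replicas, required, failed_names={r.name}):
            risks.append(("replica", r.name))
    for zone in sorted({r.zone for r in replicas}):
        if not survives(replicas, required, failed_zones={zone}):
            risks.append(("zone", zone))
    return risks


# Hypothetical layout: the application spans two zones, the database does not.
layout = [
    Replica("app-1", "application", "az-a"),
    Replica("app-2", "application", "az-b"),
    Replica("db-1", "database", "az-a"),
    Replica("db-2", "database", "az-a"),
]
risks = stability_risks(layout, required=["application", "database"])
```

In this made-up layout no single replica failure is fatal, but losing zone `az-a` removes both database replicas, so the model reports a zone-level stability risk. A real experiment tests the same question against the running system instead of a model.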

Although Chaos Engineering is not itself a business system, it bears directly on the robustness of business systems, and it faces several challenges during implementation. First, PayerMax had to align thinking across all employees, starting from the C-level, to actively embrace new technical concepts and support project implementation. Second, risk assessment requires the project owners to understand the business system, sort out risk points and fault injection points, and be familiar with the relevant tools. Poorly defined risk points, injection points, expected behavior, or expected results can easily cause a significant drop in the effectiveness of Chaos Engineering. Third, PayerMax wanted to simulate large-scale cloud infrastructure failures as realistically as possible, for example by testing whether the current system could withstand an Amazon Web Services availability zone-level failure. Finally, selecting the right tools for automation is the key to turning the project from a one-off exercise into a routine practice.

The main deposit and withdrawal link is the essential path for payment system transactions, and its success rate is an important factor in user retention. Ensuring the stability and availability of main link services is therefore a key goal for PayerMax. PayerMax's Chaos Engineering project targeted the two major paths, payment and withdrawal, and covered 100% of the main link applications. It involved PayerMax's 16 core subsystems and covered potential failure scenarios at the level of networks, databases, container platforms, microservice systems, and Amazon Web Services availability zones.

First, PayerMax drew on the resources of Amazon Web Services to address its stability engineering needs. Despite many years of groundwork in system stability, PayerMax had not previously practiced Chaos Engineering systematically. The Reliability pillar of the AWS Well-Architected Framework, together with the Amazon Web Services Resilience Analysis Framework, approaches the problem from the perspective of system hierarchy and system failure types, combined with industry statistics on fault types. This allowed PayerMax to identify and evaluate system risks comprehensively.

Second, based on the risk analysis results, Amazon Web Services and PayerMax jointly carried out the system steady-state analysis, formulated the Chaos Engineering experiment hypotheses (assumptions about how the system will behave after fault injection), selected and deployed the experiment tools, summarized and analyzed the experiment results, and then optimized the system based on that feedback.
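The sequence of steady-state analysis, hypothesis, fault injection, and result analysis can be sketched as a small experiment loop. This is a minimal illustration under stated assumptions, not the tooling used in the project: `FakePaymentPath`, its success rates, and the 99% threshold are all hypothetical stand-ins for a real system and its metrics.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float        # steady-state metric before the fault
    during_fault: float    # same metric while the fault is active
    hypothesis_held: bool  # did the metric stay within the expected bound?


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    min_success_rate: float,
) -> ExperimentResult:
    """Check steady state, inject a fault, observe, and always roll back."""
    baseline = measure()
    if baseline < min_success_rate:
        raise RuntimeError("system is not in steady state; do not inject faults")
    inject()
    try:
        during = measure()
    finally:
        rollback()  # restore the system even if measurement fails
    return ExperimentResult(baseline, during, during >= min_success_rate)


class FakePaymentPath:
    """Hypothetical stand-in for a payment path and its success-rate metric."""

    def __init__(self) -> None:
        self.degraded = False

    def success_rate(self) -> float:
        return 0.95 if self.degraded else 0.999

    def break_dependency(self) -> None:
        self.degraded = True

    def restore(self) -> None:
        self.degraded = False


system = FakePaymentPath()
result = run_experiment(
    measure=system.success_rate,
    inject=system.break_dependency,
    rollback=system.restore,
    min_success_rate=0.99,  # hypothesis: success rate stays at or above 99%
)
```

Here the hypothesis fails because the simulated success rate drops to 95% during the fault. A failed hypothesis is the useful outcome: it marks a weakness to fix and then re-test.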

Then, during the Chaos Engineering experiments, PayerMax teamed up with Amazon Web Services to verify the system's robustness and fault tolerance. AWS FIS was used to inject faults into Amazon Web Services infrastructure, including simulated availability zone-level failures, while open source tools from the industry were used to inject faults into container platforms and microservices. Both parties also verified the effectiveness of the monitoring and alerting platform and the ability to locate and resolve faults, jointly solved the problems revealed during the experiments, and advanced the automation of the PayerMax payment system.

PayerMax's practical process to implement Chaos Engineering experiments based on Amazon Web Services

AWS FIS, part of AWS Resilience Hub, is a service for running fault injection experiments to improve application performance, observability, and resilience. AWS FIS streamlines the process of setting up and running controlled fault injection experiments across a range of Amazon Web Services services, providing PayerMax with the controls and safeguards required for experiments in production, such as automatically rolling back or stopping an experiment when specific conditions are met.
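To make the stop-condition safeguard concrete, the sketch below builds the body of an AWS FIS experiment template as a plain Python dictionary. The field names are believed to follow the FIS `CreateExperimentTemplate` API, but they should be checked against the current FIS documentation before use. The ARNs, tag values, target name, and choice of action are placeholders for illustration, not PayerMax's configuration.

```python
def build_fis_template(role_arn: str, alarm_arn: str, tag_key: str, tag_value: str) -> dict:
    """Build an illustrative FIS experiment template body."""
    return {
        "description": "Stop one tagged EC2 instance and restart it after 5 minutes",
        "roleArn": role_arn,
        # Safeguard: the experiment is stopped if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "paymentInstances": {  # hypothetical target name
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # limit the blast radius to one instance
            }
        },
        "actions": {
            "stopInstance": {  # hypothetical action name
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "paymentInstances"},
            }
        },
        "tags": {"Purpose": "chaos-experiment"},
    }


def has_alarm_stop_condition(template: dict) -> bool:
    """Return True if the template has at least one CloudWatch alarm stop condition."""
    return any(
        cond.get("source") == "aws:cloudwatch:alarm" and cond.get("value")
        for cond in template.get("stopConditions", [])
    )


# Placeholder ARNs for illustration only.
template = build_fis_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-payment-success-rate",
    tag_key="chaos-ready",
    tag_value="true",
)
safe_to_run = has_alarm_stop_condition(template)
```

With boto3 installed and credentials configured, a dictionary like this could presumably be passed to `boto3.client("fis").create_experiment_template(**template)`. Refusing to run any template without an alarm-based stop condition is one simple guardrail for production experiments.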

In addition, in terms of architectural stability, PayerMax verified whether moving its business system from single points to clusters, and deploying it across multiple sites and multiple availability zones, using both Amazon Web Services and self-developed technical solutions, could keep the business at a high level of availability and stability.

PayerMax launched its Chaos Engineering project in March 2024. The implementation took about 6 months and ran from proposing the idea, bringing in expert resources, studying relevant methodologies and cases, formulating implementation plans, evaluating business risk points, selecting appropriate tools, and choosing fault injection points, through full-stack testing and in-depth testing of each of the 16 core subsystems, to finally summarizing the problems and risk points revealed along the way and formulating a comprehensive system upgrade and transformation plan.

Schematic diagram of PayerMax's product architecture based on Amazon Web Services

Business results | The number of system failures decreased by about 70% year over year, while failure recovery time decreased by about 80% year over year, and customer complaint rate decreased by more than 50% year over year

The Chaos Engineering project delivered three results: 1) Faults were eliminated before they occurred. PayerMax identified 34 types of potential risk points, verified and resolved most of them, and reduced the number of system faults by about 70% year over year; 2) The system's ability to identify, locate, and resolve faults was strengthened. Compared with before Chaos Engineering was implemented, the Mean Time To Failure (MTTF) of the PayerMax payment system improved and fault recovery time was shortened by about 80%; 3) System availability rose to over 99.99%, helping PayerMax address all kinds of risks effectively and ensure the continued stability of merchant payment services.
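As a rough sanity check on these figures, the arithmetic below converts an availability target into a yearly downtime budget and shows how the two reductions compound. It relies on a simplifying assumption that is not stated in the source: total downtime is roughly the number of failures multiplied by the mean recovery time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Maximum yearly downtime that still meets the availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def remaining_downtime_fraction(failure_reduction: float, recovery_reduction: float) -> float:
    """Fraction of the original downtime left, assuming downtime = failures x recovery time."""
    return (1.0 - failure_reduction) * (1.0 - recovery_reduction)


budget = downtime_budget_minutes(0.9999)             # about 52.56 minutes per year
remaining = remaining_downtime_fraction(0.70, 0.80)  # about 0.06 of the original downtime
```

Under that assumption, 70% fewer failures combined with 80% faster recovery would leave about 6% of the previous downtime, and a 99.99% target allows a little under an hour of downtime per year.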

Chaos Engineering covered PayerMax's 16 core subsystems and potential failure scenarios at the level of networks, databases, container platforms, microservice systems, and Amazon Web Services availability zones. Its implementation helped PayerMax's core business system further strengthen its market competitiveness and effectively supported the construction of subsequent satellite stations. As a result, PayerMax experienced no major stability failures throughout the year, and the customer complaint rate dropped by more than 50% year over year, winning more trust from clients.

Furthermore, this project introduced the concept of Chaos Engineering to PayerMax. Amazon Web Services trained the PayerMax technical team on building resilient architectures, which deepened understanding and acceptance of Chaos Engineering throughout the company. More importantly, it raised technicians' awareness of designing for failure and gave them a stronger sense of risk prevention. Technicians are now comfortable with failure: they first identify risks and hidden dangers from a pessimistic perspective, then carefully conduct risk assessments, optimistically explore solutions once risks are exposed, and verify whether the risks have been effectively addressed.

Next, PayerMax will continue to advance its Chaos Engineering practice with Amazon Web Services, improve its risk point identification mechanisms, expand the scenarios covered, and increase the randomness and automation of Chaos Engineering exercises. PayerMax will also explore a “global network” based on Amazon Web Services, support business expansion into more regions of the world through a unified cloud infrastructure architecture, make multiple sites around the world complement one another, give equal weight to compliance and efficiency, and further enhance the payment experience for global merchants. Generative AI is another focus for PayerMax. In the future, it will work with Amazon Web Services to explore innovative applications of generative AI in fintech.

About PayerMax

PayerMax is a fintech company rooted in emerging markets with a global business layout. It is committed to providing professional global cross-border payment solutions. Its services cover global acquisition, payment, and payout, and it aims to provide a safer and more convenient one-stop payment experience for global customers.

Amazon Web Services used

AWS FIS

FIS provided the team with the controls and protective mechanisms needed to conduct experiments in production, such as automatic roll back or stopping experiments when specific conditions are met.

Amazon CloudWatch

The Amazon CloudWatch service can monitor applications, respond to changes in performance, optimize resource usage, and provide insight into operations.

Amazon EventBridge

Build event-driven apps at scale, covering AWS, existing systems, or SaaS applications.

Amazon SNS

A fully managed Pub/Sub service for A2A and A2P messaging.

*The specific Amazon Web Services generative AI-related services mentioned above are only available in Amazon Web Services overseas regions. Amazon Web Services China recommends these services only to help you grow your overseas business and/or understand cutting-edge technology choices in the industry.