Analyze AWS Microservices architecture to identify and address performance issues

Amazon Payment Services (APS) is a payment service provider in the Middle East and North Africa. With its secure and seamless payment experience, it empowers businesses to build their online presence.
Amazon Payment Services is built on a broad and complex microservice-based architecture that depends on multiple AWS services, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon CloudWatch, and Amazon Simple Storage Service (Amazon S3). As APS functionality expands, the dependency graph of services keeps growing. The team had trouble tracing which microservices are used in a particular flow and analyzing the performance and health of those microservices. In this blog post, we explain the process of identifying performance issues in the customer request flow within a microservices architecture.

APS System Architecture

The team handles seven merchant integration flows, such as the redirection page, the custom merchant payment page, and the mobile SDK, and manages the integrations with worldwide payment processors and third-party payment service providers. The team is also responsible for addressing performance and latency issues reported by customers, and for ensuring overall system performance improvement, business growth, and scalability.

The APS microservices architecture breaks the payment system down into smaller, loosely coupled components based on payment functionality. This helps the team meet customer needs and address customer problems faster, and it makes the system flexible with less operational effort. The microservices architecture lets the team deliver highly scalable and flexible solutions that scale up or down on demand to ensure optimal performance, maintainability, and reliability for APS customers.

APS’s challenges with a microservice architecture

When a customer request runs, multiple microservices may be invoked; the number depends on the functionality the flow is intended to provide. Given the number of complex use cases that APS addresses, the many flows share dependencies across a vast number of services.
These problems delayed finding the root cause of customer performance issues and slowed the delivery of new features. This delay negatively impacted customer satisfaction. Moreover, it could lead to system outages and affect overall system stability.

Some barriers they faced were as follows:

The team had a hard time measuring the latency metrics of API calls between invoked microservices continuously to detect any performance or latency issues.
They lacked a systematic approach for troubleshooting and assessing latency at the microservice level. As a result, they had to emit metrics manually and add additional logs when addressing performance issues, leading to increased overhead and operational effort.
There were no tools available for analyzing code block performance within each microservice.
Without a holistic tool for tracking the end-to-end flow of a request as it traversed the microservices, the team lacked visibility into the performance of their applications and infrastructure, which affected both operational and business outcomes.

Solution Overview

In exploring options, the team sought a solution that minimized up-front costs and reduced management overhead. Because the application is built natively on AWS, the team preferred a managed service that is reliable, efficient, and affordable, and that provides results without a long integration and setup process. The team selected AWS X-Ray as the ideal solution for supporting its operations and meeting these challenges. AWS X-Ray is a distributed application tracing service consisting of an application programming interface (API), software development kits (SDKs), a dashboard, and integrations with other AWS monitoring tools. It supports multiple programming languages and shows a map of your application’s underlying components. Typically, operations teams use it in conjunction with Amazon CloudWatch ServiceLens and Amazon CloudWatch Synthetics to isolate the cause of slowdowns or outages in AWS-hosted applications.

With AWS X-Ray, the team collected metrics for latency, throttling, error rates, and faults across the microservices architecture in seconds. This helped them identify the root cause of performance issues and misbehaving components, whether the cause was an internal microservice or an external service provider.
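
To make the error, throttle, and fault rates concrete, here is a minimal Java sketch that derives them from a list of HTTP status codes. It assumes the convention AWS X-Ray documents for HTTP segments (4xx responses count as errors, 429 as throttles, 5xx as faults). The class name TraceRates and the sample data are hypothetical, and the sketch only illustrates the definitions, not how the team or the X-Ray service computes its metrics.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: derives error, throttle, and fault rates from HTTP status codes. */
final class TraceRates {
    final double errorRate;    // share of 4xx responses (429 included)
    final double throttleRate; // share of 429 responses
    final double faultRate;    // share of 5xx responses

    private TraceRates(double errorRate, double throttleRate, double faultRate) {
        this.errorRate = errorRate;
        this.throttleRate = throttleRate;
        this.faultRate = faultRate;
    }

    /** Counts each category once per response and divides by the number of responses. */
    static TraceRates fromStatusCodes(List<Integer> statusCodes) {
        if (statusCodes.isEmpty()) {
            throw new IllegalArgumentException("at least one status code is required");
        }
        int errors = 0;
        int throttles = 0;
        int faults = 0;
        for (int code : statusCodes) {
            if (code >= 400 && code < 500) errors++;
            if (code == 429) throttles++;
            if (code >= 500 && code < 600) faults++;
        }
        double total = statusCodes.size();
        return new TraceRates(errors / total, throttles / total, faults / total);
    }

    public static void main(String[] args) {
        // Made-up sample: ten responses from one service node.
        List<Integer> codes = Arrays.asList(200, 200, 404, 429, 500, 200, 200, 503, 200, 200);
        TraceRates rates = fromStatusCodes(codes);
        System.out.println("error=" + rates.errorRate
                + " throttle=" + rates.throttleRate
                + " fault=" + rates.faultRate);
    }
}
```

Keeping the three rates separate matters when triaging: a rising fault rate points at the service itself, while a rising throttle rate usually points at a capacity or quota limit.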

Three days after choosing AWS X-Ray, the team had not only integrated it but also started seeing real benefits from its functionality. Setup was easy: the team used the AWS documentation for AWS X-Ray and followed these steps:

Set up the AWS X-Ray daemon.
Set up AWS X-Ray Amazon CloudWatch metrics.
Configure Spring aspect-oriented programming (AOP) using the AWS X-Ray SDK for Java.

The following diagram shows the high-level AWS X-Ray setup architecture.

How AWS X-Ray solves our challenges and improves end user experiences

As a result of using AWS X-Ray, the team gained deep insights into the underlying systems and identified 10 areas of improvement, 3 unneeded network calls during the request lifecycle, and 4 misbehaving flows. Addressing these reduced request latency by 25 percent, improved the overall customer experience, and helped gain customer trust. Here is how.

HTTP Request Service map: The AWS X-Ray service map provides a detailed end-to-end graphical representation of customer requests. It shows the connections between backend, frontend, and downstream services. As a result, performance bottlenecks, API response times, and the number of network calls for each node can be identified. The service map for a request is obtained by clicking the request trace ID in the Traces tab of the AWS X-Ray console. The following examples show trace lists and a request service map illustrating the connections between the nodes. The map includes telemetry data such as response time, number of requests, and service node name.
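
As a rough illustration of what a service map summarizes, the following Java sketch takes the downstream calls observed in one trace and derives the number of calls per edge and the average response time per called node. The Call shape, the class name ServiceMapSketch, and the service names are hypothetical; this is not the X-Ray segment schema or the way the console builds its map.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ServiceMapSketch {
    /** One downstream call seen in a trace (hypothetical shape for illustration). */
    static final class Call {
        final String caller;
        final String callee;
        final long durationMs;

        Call(String caller, String callee, long durationMs) {
            this.caller = caller;
            this.callee = callee;
            this.durationMs = durationMs;
        }
    }

    /** Number of calls per "caller -> callee" edge, in first-seen order. */
    static Map<String, Integer> callsPerEdge(List<Call> calls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Call c : calls) {
            counts.merge(c.caller + " -> " + c.callee, 1, Integer::sum);
        }
        return counts;
    }

    /** Average response time in milliseconds for each called node. */
    static Map<String, Double> averageLatencyPerNode(List<Call> calls) {
        Map<String, long[]> sumAndCount = new LinkedHashMap<>();
        for (Call c : calls) {
            long[] acc = sumAndCount.computeIfAbsent(c.callee, k -> new long[2]);
            acc[0] += c.durationMs; // running sum
            acc[1]++;               // running count
        }
        Map<String, Double> averages = new LinkedHashMap<>();
        sumAndCount.forEach((node, acc) -> averages.put(node, (double) acc[0] / acc[1]));
        return averages;
    }

    public static void main(String[] args) {
        // Made-up trace: the payment service calls the processor twice.
        List<Call> calls = Arrays.asList(
                new Call("gateway", "payment", 120),
                new Call("payment", "processor", 80),
                new Call("payment", "processor", 100));
        System.out.println(callsPerEdge(calls));
        System.out.println(averageLatencyPerNode(calls));
    }
}
```

An edge with an unexpectedly high call count is the kind of signal that can reveal an unneeded network call within a single request.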

Code Base End-to-End tracing and segmented Timeline: AWS X-Ray can provide a real-time trace for the methods that run during the request life cycle when the intended class is annotated with @XRayEnabled. The AWS X-Ray SDK leverages aspect-oriented programming (AOP) to emit telemetry at the method level, such as response time and HTTP status. Through AWS X-Ray traces, the team was able to identify the methods with high response times and count the number of methods invoked during a request. By quantifying these metrics, the team was able to drive latency and performance improvement initiatives using data.
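
The idea behind method-level tracing can be sketched without the X-Ray SDK. The Java example below uses a JDK dynamic proxy, not Spring AOP and not @XRayEnabled, to intercept each call on an interface and record its name and elapsed time, which is roughly the kind of per-method data a subsegment carries. The names MethodTimingSketch, PaymentService, and authorize are hypothetical and exist only for this illustration.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class MethodTimingSketch {
    /** One recorded call: method name and elapsed time in nanoseconds. */
    static final class Timing {
        final String method;
        final long elapsedNanos;

        Timing(String method, long elapsedNanos) {
            this.method = method;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Hypothetical service interface used only for this illustration. */
    interface PaymentService {
        String authorize(String orderId);
    }

    /** Wraps target so that every call through iface is timed and appended to sink. */
    static <T> T timed(T target, Class<T> iface, List<Timing> sink) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] {iface},
                (self, method, args) -> {
                    long start = System.nanoTime();
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // surface the target's own exception
                    } finally {
                        // Recorded for successful and failed calls alike.
                        sink.add(new Timing(method.getName(), System.nanoTime() - start));
                    }
                });
        return iface.cast(proxy);
    }

    public static void main(String[] args) {
        List<Timing> timings = new ArrayList<>();
        PaymentService real = orderId -> "approved:" + orderId;
        PaymentService traced = timed(real, PaymentService.class, timings);
        traced.authorize("order-1");
        for (Timing t : timings) {
            System.out.println(t.method + " took " + t.elapsedNanos + " ns");
        }
    }
}
```

The caller keeps using the same interface, so timing is added without touching the business logic. That separation is what makes an aspect-based approach attractive for tracing.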

Amazon CloudWatch Dashboards with API performance metrics: With AWS X-Ray, teams can gain insights and metrics across multiple API performance dimensions, including response time, API request count, error/fault rate, and throttle rate. Because these metrics can be accessed through Amazon CloudWatch, the team decided to use its powerful visualization features, such as statistics, time periods, and data bucketing. Bar charts and line charts help to quickly focus on areas of latency or performance concern by showing P50, P90, and P99 latency metrics. This helps identify potential bottlenecks, optimize the user experience, and automatically detect latency issues by using Amazon CloudWatch metric alarms.
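
For readers less familiar with percentile latencies, here is a small Java sketch of what P50, P90, and P99 mean and how a threshold check on one of them behaves. It uses the nearest-rank definition, which is an assumption made for simplicity and may differ from how Amazon CloudWatch computes percentile statistics. The class name LatencyPercentiles, the sample latencies, and the 1000 ms threshold are hypothetical.

```java
import java.util.Arrays;

final class LatencyPercentiles {
    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = latenciesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the chosen percentile exceeds the threshold, the condition an alarm would watch. */
    static boolean breaches(long[] latenciesMs, double p, long thresholdMs) {
        return percentile(latenciesMs, p) > thresholdMs;
    }

    public static void main(String[] args) {
        // Made-up response times in milliseconds, with one slow outlier.
        long[] sample = {120, 95, 110, 400, 105, 98, 130, 102, 99, 2500};
        System.out.println("P50=" + percentile(sample, 50));
        System.out.println("P90=" + percentile(sample, 90));
        System.out.println("P99=" + percentile(sample, 99));
        System.out.println("P99 over 1000 ms: " + breaches(sample, 99, 1000));
    }
}
```

In this sample the median stays low while P99 is dominated by the single slow request, which is why tail percentiles are useful for spotting latency problems that an average would hide.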

Conclusion

To recap, the team was able to identify the root causes of requests experiencing high response times. The team established service level agreements (SLAs) for API latency in microservice APIs. After gathering raw segment data from downstream services and configuring alarms using metrics published through AWS X-Ray, the team created a latency dashboard for the downstream services. Customer request latency was reduced by 25 percent by identifying and addressing areas for improvement.

AWS Cloud Operations & Migrations Blog