Migrating a self-managed message broker to Amazon SQS
Amazon Payment Services is a payment service provider that operates across the Middle East and North Africa (MENA) geographic regions. Our mission is to provide online businesses with an affordable and trusted payment experience. We provide a secure online payment gateway that is straightforward and safe to use.
Amazon Payment Services has regional experts in payment processing technology in eight countries throughout the Gulf Cooperation Council (GCC) and Levant regions. We offer solutions tailored to businesses in international and local currencies. We are continuously improving and scaling our systems to deliver near-real-time processing capabilities. Everything we do is aimed at creating safe, reliable, and rewarding payment networks that connect the Middle East to the rest of the world.
Our use case of message queues
Our business built a high-throughput, resilient queueing system to send messages to our customers. Our implementation relied on a self-managed RabbitMQ cluster and consumers. A consumer is software that subscribes to a topic name in the queue. Once it is subscribed, any message published to the queue and tagged with the same topic name is delivered to the consumer for processing. The cluster and consumers were both deployed on Amazon Elastic Compute Cloud (Amazon EC2) instances. As our business scaled, we faced challenges with our existing architecture.
Challenges with our message queues architecture
Managing a RabbitMQ cluster with its nodes deployed on Amazon EC2 instances came with operational burdens. At payment scale, managing the queues, performance, and availability of our RabbitMQ cluster introduced significant challenges:
- Managing durability with RabbitMQ queues. Messages placed in the queue are persisted and survive server restarts. However, because our setup was self-managed, messages could still be lost during maintenance windows.
- Back-pressure mechanism. Our setup lacked a back-pressure mechanism, which resulted in flooding our customers with a huge number of messages at peak times. All messages published to the queue were sent at the same time.
- Customer business requirements. Many customers have business requirements to delay message delivery for a defined time to serve their flow. Our architecture did not support this delay.
- Retries. We needed to implement a back-off strategy to space out multiple retries for temporarily failed messages.
The previous architecture, shown in Figure 1, was able to process a large load of messages within a reasonable delivery time. However, when the message queue built up due to network failures on the customer side, the latency of the overall flow was affected. This required manually scaling the queues, which added significant human effort, time, and overhead. As our business continued to grow, we needed to maintain a strict delivery time service level agreement (SLA).
Using Amazon SQS as the messaging backbone
The Amazon Payment Services core team designed a solution that uses Amazon Simple Queue Service (Amazon SQS) with AWS Fargate (see Figure 2). Amazon SQS is a fully managed message queuing service that enables customers to decouple and scale microservices, distributed systems, and serverless applications. It is a highly scalable, reliable, and durable message queueing service that decreases the complexity and overhead associated with managing and operating message-oriented middleware.
Amazon SQS offers two types of message queues. SQS standard queues offer maximum throughput, best-effort ordering, and at-least-once delivery. SQS FIFO queues ensure that messages are processed exactly once, in the exact order they are sent. For our application, we used SQS FIFO queues.
In SQS FIFO queues, messages are stored in partitions (a partition is an allocation of storage replicated across multiple Availability Zones within an AWS Region). With message distribution through message group IDs, we were able to achieve better optimization and partition utilization for the Amazon SQS queues. We could offer higher availability, scalability, and throughput to process messages through consumers.
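As a minimal sketch of how a producer might tag messages with a message group ID, the following Python builds the arguments for the SQS `SendMessage` call. The function names, the use of a customer identifier as the group ID, and the body-hash deduplication ID are illustrative assumptions, not details from our implementation. `sqs_client` is assumed to be a boto3 SQS client (or any object exposing `send_message`).

```python
import hashlib
import json


def build_fifo_send_params(queue_url, customer_id, payload):
    """Build keyword arguments for SendMessage on a FIFO queue (illustrative)."""
    body = json.dumps(payload, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Messages that share a group ID are delivered in order.
        # Assumption: one group per customer.
        "MessageGroupId": customer_id,
        # Assumption: the payload carries a unique transaction identifier,
        # so hashing the body is a safe deduplication key.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }


def send_notification(sqs_client, queue_url, customer_id, payload):
    """Send one notification through the supplied SQS client."""
    params = build_fifo_send_params(queue_url, customer_id, payload)
    return sqs_client.send_message(**params)
```

Spreading traffic over many distinct group IDs is what lets a FIFO queue distribute messages across partitions, while ordering is still preserved inside each group.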
This serverless architecture provided better scaling options for our payment processing services. It helps us handle peak events in the MENA region for our customers without provisioning capacity in advance. Serverless architecture also helps us reduce our operational costs, as we only pay when using the services. Our goals in developing this initial architecture were to achieve consistency, scalability, affordability, security, and high performance.
How Amazon SQS addressed our needs
Migrating to Amazon SQS helped us address many of our requirements and led to a more robust service. Some of our main issues included:
Losing messages during maintenance windows
While doing manual upgrades on RabbitMQ and the hosting operating system, we sometimes experienced downtime. With Amazon SQS, the messaging infrastructure is managed for us, which reduced the need for maintenance operations.
Handling back-pressure
Different customers handle messages differently. We needed a way to customize the concurrency by customer. With the SQS message group ID in FIFO queues, we were able to use a tag that groups messages together. Messages that belong to the same message group are always processed one by one, in a strict order relative to the message group. Using this feature and a consistent hashing algorithm, we were able to limit the number of simultaneous messages being sent to the customer.
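One way to cap per-customer concurrency with message group IDs can be sketched as follows. This post does not describe the exact hashing scheme, so the sketch substitutes a plain stable hash modulo a per-customer limit rather than a full consistent-hashing ring. The function name, the `message_key` parameter, and the `customer#slot` format are hypothetical.

```python
import hashlib


def message_group_id(customer_id, message_key, max_concurrency):
    """Map a message to one of max_concurrency group IDs for a customer.

    SQS FIFO does not hand out further messages from a group while earlier
    ones from that group are in flight, so a customer with N possible group
    IDs has at most N groups being processed at the same time.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    # hashlib gives a hash that is stable across processes and restarts,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(f"{customer_id}:{message_key}".encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % max_concurrency
    return f"{customer_id}#{slot}"
```

Choosing `message_key` as something like a transaction identifier keeps all updates for one transaction in the same group, so they stay ordered. Note that changing `max_concurrency` remaps most keys to new slots; a consistent-hashing ring would reduce that reshuffling.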
Message delay and handling retries
By default, messages sent to the queue are pulled immediately and delivered to customers. However, many customers ask to delay their messages for preprocessing work, so we introduced a message delay timer. Some messages fail with transient errors and can be resubmitted. Retries must be spaced out, and they continue until we receive delivery confirmation from our customer or until the retry limit is exceeded. With Amazon SQS, we use the ChangeMessageVisibility operation to adjust delay times.
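A back-off strategy built on ChangeMessageVisibility might look like the sketch below. The base delay, the cap, the retry limit, and the decision to delete a message once retries are exhausted are all assumptions made for illustration (a dead-letter queue is a common alternative). `sqs_client` is assumed to be a boto3 SQS client, and `message` a message returned by `receive_message` with the `ApproximateReceiveCount` attribute requested.

```python
def backoff_seconds(receive_count, base=30, cap=900):
    """Exponential back-off: base, 2*base, 4*base, ... limited to cap seconds.

    The default cap is a hypothetical value; SQS itself allows a visibility
    timeout of up to 12 hours.
    """
    attempt = max(receive_count, 1)
    return min(cap, base * 2 ** (attempt - 1))


def handle_delivery_result(sqs_client, queue_url, message, delivered, max_retries=5):
    """Delete a finished message, or hide a failed one until its next retry."""
    receive_count = int(message["Attributes"]["ApproximateReceiveCount"])
    if delivered or receive_count >= max_retries:
        # Delivered, or retry limit reached (assumption: drop the message).
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        return "deleted"
    # The message stays invisible for the back-off period, after which
    # SQS makes it available to consumers again.
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=backoff_seconds(receive_count),
    )
    return "retry_scheduled"
```

In a FIFO queue, a message that is waiting out its visibility timeout also holds back later messages in the same message group, so a failing destination delays only its own group.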
Scalability and affordability
To save costs, Amazon SQS FIFO queues and Amazon Elastic Container Service (Amazon ECS) tasks on AWS Fargate run only when needed. These services process data in smaller units and run them in parallel. They can scale up efficiently to handle peak traffic loads. This suits most architectures that handle non-uniform traffic, without needing additional application logic.
Security
Our service delivers messages to customers via host-to-host secure channels. To secure this data outside our private network, we use Amazon Simple Notification Service (Amazon SNS) as our delivery mechanism. Amazon SNS provides HTTPS endpoint delivery of messages coming to topics and subscriptions. AWS enables at-rest and in-transit encryption for all architectural components. Amazon SQS also provides AWS Key Management Service (AWS KMS) based encryption or service-managed encryption to encrypt the data at rest.
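The two at-rest encryption options map to queue attributes that can be set when a queue is created. The following sketch, with hypothetical function names and a caller-supplied key identifier, chooses between a KMS key and SQS-managed encryption. `sqs_client` is assumed to be a boto3 SQS client.

```python
def fifo_queue_attributes(kms_key_id=None):
    """Return CreateQueue attributes for an encrypted FIFO queue (illustrative)."""
    attributes = {"FifoQueue": "true"}
    if kms_key_id:
        # Encrypt with a key held in AWS KMS.
        attributes["KmsMasterKeyId"] = kms_key_id
    else:
        # Fall back to encryption keys managed by SQS itself.
        attributes["SqsManagedSseEnabled"] = "true"
    return attributes


def create_encrypted_fifo_queue(sqs_client, name, kms_key_id=None):
    """Create the queue through the supplied SQS client."""
    if not name.endswith(".fifo"):
        raise ValueError("FIFO queue names must end with .fifo")
    return sqs_client.create_queue(
        QueueName=name, Attributes=fifo_queue_attributes(kms_key_id)
    )
```

The two options are mutually exclusive on a queue, which is why the sketch sets only one of them.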
Performance
To quantify our product’s performance, we monitor the message delivery delay. This metric measures the time between Amazon Payment Services sending the message and the customer receiving it. Our goal is to have the message sent to the customer in near-real time once the transaction is processed. The new Amazon SQS and Amazon ECS architecture enabled us to achieve a p99 latency of 200 ms.
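For readers unfamiliar with the metric, a p99 value can be computed from delivery delay samples with the nearest-rank method, sketched below. The function name and the sample values in the usage note are illustrative, not measurements from our system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

For example, `percentile(delays_ms, 99)` over a list of per-message delivery delays in milliseconds returns the delay that 99% of messages did not exceed.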
Conclusion
In this blog post, we have shown how using Amazon SQS helped transform and scale our service. We were able to offer a secure, reliable, and highly available solution for businesses. We use AWS services and technologies to run the Amazon Payment Services payment gateway, and infrastructure automation to deliver excellent customer service. By using Amazon SQS and Amazon ECS on AWS Fargate, Amazon Payment Services can offer secure message delivery at scale to our customers.