Queue Integration with Third-party Services on AWS
Commercial off-the-shelf software and third-party services can present an integration challenge in event-driven workflows when they do not natively support AWS APIs. This is even more impactful when a workflow is subject to unpredicted usage spikes, and you want to increase decoupling and fault tolerance. Given the third-party nature of services, polling an Amazon Simple Queue Service (SQS) queue and having built-in AWS API handling logic may not be an immediate option.
In such cases, AWS Lambda helps out-task the Amazon SQS queue integration and AWS API handling to an additional layer. The success of this depends on how well exception handling is implemented across the different interacting services. In this blog post, we outline issues to consider when adopting this design pattern. We also share a reusable solution.
Design pattern for third-party integration with SQS
With this design pattern, one or more services (producers) asynchronously invoke other third-party downstream consumer services. They publish messages to an Amazon SQS queue, which acts as buffer for requests. Producers provide all commands and other parameters required for consumer service execution with the message.
As messages are written to the queue, the queue is configured to invoke a message broker (implemented as AWS Lambda) for each message. AWS Lambda can interact natively with target AWS services such as Amazon EC2, Amazon Elastic Container Service (ECS), or Amazon Elastic Kubernetes Service (EKS). It can also be configured to use an Amazon Virtual Private Cloud (VPC) interface endpoint to establish a connection to VPC resources without traversing the internet. The message broker assigns the tasks to consumer services by invoking the RunTask API of Amazon ECS and AWS Fargate (see Figure 1.)
The message broker asynchronously invokes the API in ‘fire-and-forget’ mode. Therefore, error handling must be built in to respond to API invocation errors. In an event-driven scenario, an error will be invoked if you asynchronously call the third-party service hundreds or thousands of times and reach Service Quotas. This is a potential issue with RunTask API actions, or a large volume of concurrent tasks running on AWS Fargate. Two mechanisms can help implement troubleshooting API request errors.
- API retries with exponential backoff. The message broker retries for a number of times with configurable sleep intervals and exponential backoff in-between. This enforces progressively longer waits between retries for consecutive error responses. If the RunTask API fails to process the request and initiate the third-party service, the message remains in the queue for a subsequent retry. The AWS General Reference provides further guidance.
- API error handling. Error handling and consequent logging should be implemented at every step. Since there are several services working together in tandem, crucial debugging information from errors may be lost. Additionally, error handling also provides opportunity to define automated corrective actions or notifications on event occurrence. The message broker can publish failure notifications including the root cause to an Amazon Simple Notification Service (SNS) topic.
SNS topic subscription can be configured via different protocols. You can email a distribution group for active monitoring and processing of errors. If persistence is required for messages that failed to process, error handling can be associated directly with SQS by configuring a dead letter queue.
Reference implementation for third-party integration with SQS
We implemented the design pattern in Figure 1, with Broad Institute’s Cell Painting application workflow. This is for morphological profiling from microscopy cell images running on Amazon EC2. It interacts with CellProfiler version 3.0 cell image analysis software as the downstream consumer hosted on ECS/Fargate. Every invocation of CellProfiler required approximately 1,500 tasks for a single processing step.
Resource constraints determined the rate of scale-out. In this case, it was for an Amazon ECS task creation. Address space for Amazon ECS subnets should be large enough to prevent running out of available IPs within your VPC. If Amazon ECS Service Quotas provide further constraints, a quota increase can be requested.
Exceptions must be handled both when validating and initiating requests. As part of the validation workflow, exceptions are captured as follows, also shown in Figure 2.
1. Invalid arguments exception. The message broker validates that the SQS message contains all the needed information to initiate the ECS task. This information includes subnets, security groups and container names required to start the ECS task, and else raises exception.
2. Retry limit exception. On each iteration, the message broker will evaluate whether the SQS retry limit has been reached, before invoking the RunTask API. It will then exit, by sending failure notification to SNS when the retry limit is reached.
As part of the initiation workflow, exceptions are handled as follows, shown in Figure 3:
1. ECS/Fargate API and concurrent execution limitations. The message broker catches API exceptions when calling the API RunTask operation. These exceptions can include:
- When the call to the launch tasks exceeds the maximum allowed API request limit for your AWS account
- When failing to retrieve security group information
- When you have reached the limit on the number of tasks you can run concurrently
With each of the preceding exceptions, the broker will increase the retry count.
2. Networking and IP space limitations. Network interface timeouts received after initiating the ECS task set off an Amazon CloudWatch Events rule, causing the message broker to re-initiate the ECS task.
While we specifically address downstream consumer services running on ECS/Fargate, this solution can be adjusted for third-party services running on Amazon EC2 or EKS. With EC2, the message broker must be adjusted to interact with the RunInstances API, and include troubleshooting API request errors. Integration with downstream consumers on Amazon EKS requires that the AWS Lambda function is associated via the IAM role with a Kubernetes service account. A Python client for Kubernetes can be used to simplify interaction with the Kubernetes REST API and AWS Lambda would invoke the run API.
This pattern is useful when queue polling is not an immediate option. This is typical with event-driven workflows involving third-party services and vendor applications subject to unpredictable, intermittent load spikes. Exception handling is essential for these types of workflows. Offloading AWS API handling to a separate layer orchestrated by AWS Lambda can improve the resiliency of such third-party services on AWS. This pattern represents an incremental optimization until the third party provides native SQS integration. It can be achieved with the initial move to AWS, for example as part of the V1 AWS design strategy for third-party services.
Some limitations should be acknowledged. While the pattern enables graceful failure, it does not prevent the overloading of the ECS RunTask API. By invoking Amazon ECS RunTask API in ‘fire-and-forget’ mode, it does not monitor service execution once a task was successfully invoked. Therefore, it should be adopted when direct queue polling is not an option. In our example, Broad Institute’s CellProfiler application enabled direct queue polling with its subsequent product version of Distributed CellProfiler.
The referenced deployment with consumer services on Amazon ECS can be accessed via AWSLabs.