AWS Architecture Blog
How ERGO built an on-call support solution in a week
ERGO’s Technology & Services S.A. (ET&S) Cloud Solutions Department is a specialist team of cloud engineers who provide technical support for business owners, project managers, and engineering leads. The support team deals with complex issues, such as failed deployments, security vulnerabilities, and environment availability.
When an issue arises, it’s categorized as Priority 1 (P1) or Priority 2 (P2). For urgent P1 incidents, users contact the support team directly via phone. For P2 incidents, the workflow sends an issue description to the support team via SMS.
Originally, the SMS and voice forwarding systems were manually updated every Monday. For SMS, an operator manually updated the phone numbers in the system for the assigned support team members. For voice forwarding, support team members used physical phones, which were handed off from engineer to engineer per the support team roster.
These manual processes were time-consuming and occasionally error-prone. Additionally, with COVID-19 physical distancing measures in place, handing off physical devices was complicated. To keep up with the increasing number of support cases and the growth of their Cloud Solutions Department, ERGO worked with AWS to modernize and automate their manual workflow. We’ll show you how ERGO implemented a production-ready, on-call support solution with SMS and voice features in just one week using Amazon Connect and Amazon Pinpoint.
Automating the SMS on-call system
Let’s look at how we automated the SMS on-call support system, as shown in Figure 1 and summarized as follows:
- We use an open-source orchestration tool, Red Hat Ansible Automation Platform (Ansible), as a frontend to run the template “Assign to On-call SMS”.
- The template’s parameter is set to the subset of support team members who are assigned to support P1/P2 cases. The assignment follows the on-call shift schedule.
- Next, an Ansible playbook subscribes those support team members to the Amazon Simple Notification Service (Amazon SNS) topic’s subscriber list.
Now the support team will receive SMS alerts.
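The subscription step can be illustrated in Python. This is a minimal sketch, not our Ansible playbook: it assumes a boto3-style SNS client is passed in, and the function names and the topic ARN are hypothetical. It reads only the first page of existing subscriptions, which is enough for a small on-call roster.

```python
from typing import Iterable, Set, Tuple


def plan_roster_change(current: Iterable[str], desired: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (numbers to subscribe, numbers to unsubscribe)."""
    current_set, desired_set = set(current), set(desired)
    return desired_set - current_set, current_set - desired_set


def sync_sms_subscribers(sns_client, topic_arn: str, desired_numbers: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Make the topic's SMS subscribers match this week's on-call roster.

    `sns_client` is assumed to behave like boto3.client("sns").
    """
    # Existing SMS subscriptions on the topic (first page only, for brevity).
    response = sns_client.list_subscriptions_by_topic(TopicArn=topic_arn)
    existing = {
        sub["Endpoint"]: sub["SubscriptionArn"]
        for sub in response.get("Subscriptions", [])
        if sub["Protocol"] == "sms"
    }
    to_add, to_remove = plan_roster_change(existing.keys(), desired_numbers)
    # Remove engineers who rotated off the shift, then add the new ones.
    for number in sorted(to_remove):
        sns_client.unsubscribe(SubscriptionArn=existing[number])
    for number in sorted(to_add):
        sns_client.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint=number)
    return to_add, to_remove
```

Called with a real client, for example `sync_sms_subscribers(boto3.client("sns"), topic_arn, roster_numbers)`, the function leaves unchanged numbers alone, so running it twice with the same roster makes no further changes.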
Next, we integrated the SMS workflow with our ZIS IT monitoring tool to capture critical events and forward them via SMS to the support team, as shown in Figure 2:
- The Amazon Pinpoint phone number is set as the SMS destination in our monitoring tool.
- The monitoring tool then sends the SMS to Amazon Pinpoint, where:
  - Amazon Pinpoint wraps the message in a JSON payload and sends it to the Amazon SNS topic “Before Processing Message”. Our AWS Lambda function “Extract messageBody” is subscribed to this topic and extracts the messageBody from the payload.
  - The extracted message is then sent to the Amazon SNS topic “After Processing Message”, which uses the Amazon Pinpoint “Two-way SMS” feature to send the SMS to the support team members who are subscribed to the topic.
Also shown in Figure 2, we track our monthly SMS spending using Amazon CloudWatch. The SMSMonthToDateSpentUSD metric shows the amount spent sending SMS messages during the current month.
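To illustrate how this metric could be checked programmatically, here is a minimal sketch. It assumes the metric is published in the `AWS/SNS` CloudWatch namespace without dimensions, as described in the Amazon SNS documentation. The function names and the budget value are hypothetical, and the datapoints are assumed to have the shape returned by the CloudWatch `get_metric_statistics` call.

```python
from datetime import datetime, timedelta, timezone


def spend_query(now: datetime) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/SNS",  # assumption: namespace used for SMS spend
        "MetricName": "SMSMonthToDateSpentUSD",
        "StartTime": now - timedelta(days=1),
        "EndTime": now,
        "Period": 3600,
        "Statistics": ["Maximum"],
    }


def month_to_date_spend(datapoints: list) -> float:
    """Return the value of the most recent datapoint, or 0.0 if there is none."""
    if not datapoints:
        return 0.0
    latest = max(datapoints, key=lambda point: point["Timestamp"])
    return latest["Maximum"]


def over_budget(datapoints: list, budget_usd: float) -> bool:
    """True when the latest month-to-date spend has reached the budget."""
    return month_to_date_spend(datapoints) >= budget_usd
```

With a boto3 CloudWatch client, the query would be issued as `cloudwatch.get_metric_statistics(**spend_query(datetime.now(timezone.utc)))` and the `Datapoints` list from the response passed to `over_budget`.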
Why extract the messageBody before sending the SMS to the support team?
Amazon Pinpoint captures SMS from the monitoring tool in JSON format, which includes additional information, such as the origin and destination numbers, the message ID, and related data.
The support team only needs the messageBody, and the JSON format makes it difficult to read on a mobile phone. Therefore, we use a Lambda function for the “messageBody” extraction.
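A minimal sketch of such an extraction function is shown below. It is an illustration under stated assumptions, not our production code. The handler factory, the sample values, and the topic ARN argument are hypothetical; the field names other than messageBody follow the Amazon Pinpoint two-way SMS payload as we understand it and should be checked against the current documentation. The SNS client is assumed to behave like `boto3.client("sns")`.

```python
import json

# Illustrative SNS-to-Lambda event; all values are made up.
SAMPLE_EVENT = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps({
                    "originationNumber": "+15555550100",
                    "destinationNumber": "+15555550101",
                    "messageBody": "CRITICAL: environment unavailable",
                    "inboundMessageId": "example-message-id",
                })
            }
        }
    ]
}


def extract_message_body(event: dict) -> list:
    """Return the messageBody of every SNS record in a Lambda event."""
    bodies = []
    for record in event.get("Records", []):
        # SNS delivers the Pinpoint payload as a JSON string.
        payload = json.loads(record["Sns"]["Message"])
        bodies.append(payload["messageBody"])
    return bodies


def make_handler(sns_client, after_topic_arn: str):
    """Build a Lambda handler that forwards only the message text."""
    def handler(event, context):
        bodies = extract_message_body(event)
        for body in bodies:
            sns_client.publish(TopicArn=after_topic_arn, Message=body)
        return {"forwarded": len(bodies)}
    return handler
```

Keeping the extraction in a separate pure function makes it easy to test locally with a sample event before deploying the handler.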
Automating the voice forwarding system
The other half of our on-call solution is voice forwarding. As mentioned in the introduction, we had a physical phone and updated the call forwarding every Monday. This allowed us to forward calls to a single number, but the system had two main problems: it wasn’t scalable, and it was prone to human error.
In our automated system, shown in Figure 3, all calls to the physical phone are forwarded to Amazon Connect, so we do not need to change the number of the phone.
This is how it’s set up:
- The assigned phone numbers in Amazon Connect are attached to the Contact Flow “ERGO On-call Forwarding Voice”, which starts at the “Entry point” rectangle on the left side of the diagram.
- In the next step, “Set logging behavior” captures the calling number. This allows us to see the number to return any missed calls.
- Finally, the “Set working queue” step contains routing profiles (in this case, we use a main line and a secondary line). The main line has support team members who are assigned to address P1 cases. The secondary line is for managers who will take the call if the support team members are not available.
When a customer is in a queue, the Amazon Connect contact flow tries to route the call to a support team member. If there’s no answer, the service re-routes the call to the next available support team member. After 30 seconds, if there is no answer on the first line (and no other support team members have become available), the service tries the secondary line.
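The escalation logic described above can be modeled in a few lines of Python. This is a simplified simulation for illustration only, not how Amazon Connect is implemented or configured. The `Agent` class, the function name, and the per-agent ring time are hypothetical; only the 30-second main-line timeout comes from our setup.

```python
from dataclasses import dataclass
from typing import List, Optional

MAIN_LINE_TIMEOUT_SECONDS = 30  # from the contact flow described above


@dataclass
class Agent:
    name: str
    available: bool
    answers: bool  # whether this agent picks up when the call is offered


def route_call(main_line: List[Agent], secondary_line: List[Agent],
               ring_seconds: int = 10) -> Optional[str]:
    """Return the name of the agent who takes the call, or None if missed."""
    elapsed = 0
    for agent in main_line:
        if elapsed >= MAIN_LINE_TIMEOUT_SECONDS:
            break  # main line timed out, escalate
        if not agent.available:
            continue
        if agent.answers:
            return agent.name
        elapsed += ring_seconds  # no answer, try the next available agent
    # Fall back to the secondary line (managers).
    for agent in secondary_line:
        if agent.available and agent.answers:
            return agent.name
    return None
```

For example, if the only available engineer on the main line does not pick up, the call goes to the first available manager on the secondary line; if nobody answers, the function returns `None`, which corresponds to a missed call.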
To set this up:
- Every support team member requires an Amazon Connect account. You can import their data via CSV to automate provisioning.
- If a support team member is shown as online but does not answer a call, Amazon Connect changes their status to offline. This way, an Amazon Connect admin can see the time and number of the missed call in the Amazon Connect Real-time metrics reports and can return the call when another team member or supervisor is available.
- Figure 3 shows how Amazon Connect and CloudWatch monitor contact center health metrics such as “MissedCalls” and generate alerts through Amazon SNS, which sends email notifications so that calls are returned promptly. For more details on this integration pattern, refer to the Monitor and trigger alerts using Amazon CloudWatch for Amazon Connect blog post.
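As an illustration of the alerting item above, the following sketch builds the parameters for a CloudWatch alarm on missed calls. It is a sketch under assumptions: the alarm name, threshold, and period are hypothetical choices, and the namespace and dimensions (`AWS/Connect`, `InstanceId`, `MetricGroup` set to `VoiceCalls`) reflect our reading of the Amazon Connect CloudWatch metrics documentation and should be verified for your instance.

```python
def missed_calls_alarm(instance_id: str, topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": "on-call-missed-calls",  # hypothetical name
        "Namespace": "AWS/Connect",
        "MetricName": "MissedCalls",
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "MetricGroup", "Value": "VoiceCalls"},
        ],
        "Statistic": "Sum",
        "Period": 60,  # evaluate once per minute
        "EvaluationPeriods": 1,
        "Threshold": 1,  # alert on the first missed call
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no data means no missed calls
        "AlarmActions": [topic_arn],  # SNS topic with email subscribers
    }
```

With a boto3 CloudWatch client, the alarm would be created with `cloudwatch.put_metric_alarm(**missed_calls_alarm(instance_id, topic_arn))`.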
After creating an Amazon Connect instance, we claimed a phone number to place or receive calls. Requesting phone numbers from Amazon Connect to serve different customers in different countries was the most time-intensive part of the setup. Be aware that some countries have regulatory requirements, and this can increase the time and effort required. For example, requesting a German number and a Polish number will require different documents. To save time, we used international toll-free numbers. This allows us to provide support to people in all other countries without the caller incurring additional charges.
To help you with your implementation, you can find the list of ID requirements by country or AWS Region here, and AWS Support can provide more information.
Using managed services like Amazon Connect and Amazon Pinpoint allowed us to implement a scalable and pay-as-you-go on-call solution for technical support. The new automated setup is a huge improvement over the previous manual and error-prone workflow and enables us to easily onboard customers from new countries.
Looking ahead, we plan to explore using the Amazon Connect APIs to automate the management of an agent’s online/offline status, as well as building a skills-based routing workflow to accommodate a multilingual support team. You can read more about AWS Customer Engagement services here.