Field Notes: Orchestrating and Monitoring Complex, Long-running Workflows Using AWS Step Functions
IHS Markit’s Wall Street Office (WSO) offers financial reports to hundreds of clients worldwide. When IHS Markit completed the migration of WSO’s SaaS software to AWS, it unlocked the power and agility to deliver new product features monthly, as opposed to a multi-year release cycle. This migration also presented a great opportunity to further enhance the customer experience by automating the WSO reporting team’s own Continuous Integration and Continuous Deployment (CI/CD) workflow. WSO then offered the same migration workflow to its on-prem clients, who needed the ability to upgrade quickly in order to meet a regulatory LIBOR reporting deadline. This rapid upgrade was enabled by fully automating regression testing of new software versions.
In this blog post, I outline the architectures created in collaboration with WSO to orchestrate and monitor the complex, long-running reconciliation workflows in their environment by leveraging the power of AWS Step Functions. To enable each client’s migration to AWS, WSO needed to ensure that the new, AWS version of the reporting application produced identical outputs to the previous, on-premises version. For a single migration, the process is as follows:
- spin up the old version of the SQL Server and reporting engine on Windows servers,
- run reports,
- repeat the process with the new version,
- compare the outputs and review the differences.
The problem came with scaling this process. IHS Markit provides financial solutions and tools to numerous clients. To enable these clients to transition away from LIBOR, the WSO team was tasked with migrating over 80 instances of the application, and reconciling hundreds of reports for each migration. During upgrades, customers must manually validate custom extracts created in the WSO Reporting application against the current and next versions, which limits upgrade frequency and increases the resourcing cost of these validations. Without automation, upgrading all clients would have required an entirely new Operations team, and cost the firm over 700 developer-hours to meet the regulatory LIBOR cessation deadline.
WSO was able to save over 4,000 developer-hours by making this process repeatable, so that it can be used as an automated regression test as part of the regular Systems Development Lifecycle process. The following diagram shows the reconciliation workflow steps enabled as part of this automated process.
The team quickly realized that a serverless, event-driven solution would be required to make this process manageable. The initial approach was to use AWS Lambda functions to call PowerShell scripts to perform each step in the reconciliation process. They also used Amazon SNS to invoke the next Lambda function when the previous step completed.
The problem came when the Operations team tried to monitor these Lambda functions while multiple reconciliations were running in parallel. The Lambda outputs became mixed together in shared Amazon CloudWatch log groups, and there was no way to quickly see the overall progress of any given reconciliation workflow. It was also difficult to figure out how to recover from errors.
Furthermore, the team found that some steps in this process, such as database restoration, ran longer than the 15-minute Lambda timeout limit. As a result, they were forced to look for alternatives to manage these long-running steps. Following is an architecture diagram showing the serverless component used to automate and scale the process.
Enter Step Functions and AWS Systems Manager (formerly known as SSM) Automation. To address the problem of orchestrating the many sequential and parallel steps in our workflow, AWS Solutions Architects suggested replacing Amazon SNS with AWS Step Functions.
The Step Functions state machine controls the order in which the steps are invoked, including the transitions taken on success and on error. The service is integrated with 14 other AWS services (including Lambda, SSM Automation, and Amazon ECS) and can invoke them, as well as manual actions. These calls can be synchronous or run via steps that wait for an event. A state machine instance is long-lived and can support processes that take up to a year to complete.
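To make the idea of ordered steps with success and error transitions concrete, here is a minimal sketch of a state machine definition in Amazon States Language, built as a Python dictionary. The state names, Lambda function names, account ID, and retry settings are all hypothetical and are not taken from WSO's actual workflow.

```python
import json

# Hypothetical ARN prefix for illustration; substitute real Lambda function ARNs.
LAMBDA_ARN_PREFIX = "arn:aws:lambda:us-east-1:123456789012:function:"


def task(function_name, next_state=None):
    """Build a Task state that retries once, then routes any remaining error to a Fail state."""
    state = {
        "Type": "Task",
        "Resource": LAMBDA_ARN_PREFIX + function_name,
        "Retry": [
            {"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 30, "MaxAttempts": 1}
        ],
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReconciliationFailed"}],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


# Sequential workflow mirroring the manual process: old version, new version, compare.
definition = {
    "Comment": "Illustrative reconciliation workflow (all names are assumptions)",
    "StartAt": "RunOldVersion",
    "States": {
        "RunOldVersion": task("run-old-version", "RunNewVersion"),
        "RunNewVersion": task("run-new-version", "CompareOutputs"),
        "CompareOutputs": task("compare-outputs"),
        "ReconciliationFailed": {
            "Type": "Fail",
            "Error": "ReconciliationError",
            "Cause": "A step failed after its retry was exhausted",
        },
    },
}


def transition_targets(defn):
    """Collect every state name referenced by StartAt, Next, or Catch."""
    targets = {defn["StartAt"]}
    for state in defn["States"].values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return targets


# Sanity check: every transition points at a state that exists.
missing = transition_targets(definition) - set(definition["States"])
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string is the kind of document you would supply as the state machine definition. Keeping the success and error paths in one declarative document is what lets the console render the workflow as a diagram.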
This immediately gave the Development team a holistic, visual way to design our workflow, and offered Operations a graphical user interface (UI) to monitor ongoing reconciliations in real time. The Step Functions console lists out all running and past reconciliations, including their status, and allows the operator to drill down into the detailed state diagram of any given reconciliation. The operator can then see how far it’s progressed or where it encountered an error.
The UI also provides Amazon CloudWatch links for any given step, isolating the logs of that particular Lambda execution and eliminating the need to search through the CloudWatch log group manually. The screenshot below illustrates what an in-progress Step Function looks like to an operator, with each step listed out with its own status and a link to its log.
The team also used the Step Function state machine as a container for metadata about each particular reconciliation process instance (like the environment ID and the database and Amazon EC2 instances associated with that environment), reducing the need to pass this data between Lambda functions.
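One way to keep such metadata in the state machine is to pass it in as the execution input and have each Task attach its result under its own key using ResultPath, so the original fields travel along untouched. The post does not say which mechanism the team used, so treat the following as a sketch under that assumption. The field names and IDs are made up, and apply_result_path is a simplified local model of how ResultPath behaves for single-level paths, not part of any AWS library.

```python
import copy

# Task state that stores its result under "$.restore" instead of replacing the input.
restore_state = {
    "Type": "Task",
    # Hypothetical function ARN, for illustration only.
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:restore-database",
    "ResultPath": "$.restore",
    "Next": "RunReports",
}


def apply_result_path(state_input, result, result_path):
    """Simplified local model of ResultPath for paths of the form "$.key"."""
    if not result_path.startswith("$.") or "." in result_path[2:]:
        raise ValueError("this sketch only handles paths of the form $.key")
    merged = copy.deepcopy(state_input)
    merged[result_path[2:]] = result
    return merged


# Hypothetical execution input carrying the reconciliation's metadata.
execution_input = {
    "environmentId": "env-042",
    "databaseInstance": "db-old-042",
    "ec2InstanceId": "i-0123456789abcdef0",
}

# What the next state would receive: metadata plus the restore result.
after_restore = apply_result_path(
    execution_input, {"status": "RESTORED"}, restore_state["ResultPath"]
)
```

With this arrangement, every later step can read the environment ID and instance IDs from its own input rather than looking them up or receiving them from the previous Lambda function.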
To solve the problem of long-running PowerShell scripts, AWS Solutions Architects suggested using SSM Automation. Unlike Lambda functions, SSM Automation is meant to run operational scripts, with no maximum time limit. It also has native PowerShell integration, which you can use to call the existing scripts and capture their output.
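As a rough illustration, an Automation runbook that calls an existing PowerShell script on a Windows instance could look like the sketch below, expressed as a Python dictionary that serializes to the runbook's JSON form. It uses the aws:runCommand action with the AWS-RunPowerShellScript document. The step name, script path, and timeout value are assumptions for illustration, not details from WSO's environment.

```python
import json

# Illustrative Automation runbook; names, paths, and timeout are hypothetical.
restore_runbook = {
    "schemaVersion": "0.3",
    "description": "Run an existing PowerShell restore script on one instance",
    "parameters": {
        "InstanceId": {"type": "String", "description": "Target Windows instance"}
    },
    "mainSteps": [
        {
            "name": "restoreDatabase",
            "action": "aws:runCommand",
            # Step-level timeout, set well beyond 15 minutes (illustrative value).
            "timeoutSeconds": 7200,
            "inputs": {
                "DocumentName": "AWS-RunPowerShellScript",
                "InstanceIds": ["{{ InstanceId }}"],
                "Parameters": {
                    # Hypothetical path to a script that already exists on the server.
                    "commands": ["C:\\scripts\\Restore-Database.ps1"]
                },
            },
        }
    ],
}

runbook_json = json.dumps(restore_runbook, indent=2)
```

The appeal of this shape is that the PowerShell script itself does not change. The runbook only wraps it, so the long-running work moves out of Lambda without being rewritten.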
To save time running hundreds of reports, the team looked into the ‘Map State’ feature of Step Functions. Map takes an array of input data, then creates an instance of the step (in this case a Lambda call) for each item in this array. It waits for them all to complete before proceeding.
Map is a convenient way to implement a fan-out pattern with almost no orchestration code. The Map State step also gives Operations users the option to limit the level of parallelism, in this case letting only 5 reports run simultaneously. This prevents overloading our reporting applications and databases.
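A Map state with a concurrency limit of 5 might be declared as in the sketch below. The input field name (reports), the state names, and the Lambda ARN are assumptions. The definition uses the Iterator field. Newer definitions express the same thing with ItemProcessor. The thread pool at the end is only a local analogy for what the service does, included to show the fan-out and wait-for-all behavior.

```python
from concurrent.futures import ThreadPoolExecutor

# Map state: one RunReport task per item in $.reports, at most 5 at a time.
run_reports_state = {
    "Type": "Map",
    "ItemsPath": "$.reports",
    "MaxConcurrency": 5,
    "Iterator": {
        "StartAt": "RunReport",
        "States": {
            "RunReport": {
                "Type": "Task",
                # Hypothetical function ARN, for illustration only.
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-report",
                "End": True,
            }
        },
    },
    "ResultPath": "$.reportResults",
    "Next": "CompareOutputs",
}


def run_report(report_name):
    """Stand-in for the per-report Lambda call."""
    return {"report": report_name, "status": "DONE"}


# Local analogy: fan out over the list with bounded parallelism, then collect
# all results in input order before moving on.
reports = [f"report-{i}" for i in range(12)]
with ThreadPoolExecutor(max_workers=run_reports_state["MaxConcurrency"]) as pool:
    results = list(pool.map(run_report, reports))
```

Changing the limit is a one-line edit to MaxConcurrency, which is why throttling the load on the reporting databases needs no extra orchestration code.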
To deal with errors in any of the workflow steps, the Development team introduced a manual review step, which you can model in Step Functions. The manual step notifies a mailing list of the error, then waits for a reply to tell it whether to retry or abort the workflow.
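A manual review step of this kind can be modeled with the callback pattern: a Task publishes a message containing a task token and pauses until someone returns that token through the SendTaskSuccess API. The sketch below assumes the mailing list is reached through an Amazon SNS topic and that the reply carries a decision field. The topic ARN, state names, and field name are hypothetical.

```python
# States for a human review loop (all names and ARNs are assumptions).
manual_review_states = {
    "NotifyAndWait": {
        "Type": "Task",
        # Publish to SNS, then pause until the task token is returned.
        "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
        "Parameters": {
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:reconciliation-errors",
            "Message": {"Input.$": "$", "TaskToken.$": "$$.Task.Token"},
        },
        # Keep the reviewer's reply alongside the existing state data.
        "ResultPath": "$.review",
        "Next": "RetryOrAbort",
    },
    "RetryOrAbort": {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": "$.review.decision",
                "StringEquals": "retry",
                "Next": "RunOldVersion",
            }
        ],
        "Default": "Aborted",
    },
    "Aborted": {
        "Type": "Fail",
        "Error": "AbortedByOperator",
        "Cause": "Reviewer chose not to retry",
    },
}


def next_after_review(decision):
    """Local model of the Choice state: pick the next state for a reviewer decision."""
    choice = manual_review_states["RetryOrAbort"]
    for rule in choice["Choices"]:
        if rule["StringEquals"] == decision:
            return rule["Next"]
    return choice["Default"]
```

For example, next_after_review("retry") selects the retry branch, and any other reply falls through to the Aborted state. While the workflow waits for the reply, the execution simply shows the review step as in progress.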
The only challenge the Development team found was the mechanism for re-running an individual failed step. At this time, any failure needs to have an explicit state transition within the Step Function’s state diagram. While the Step Function can auto-retry a step, the team wanted to insert a wait-for-human-investigation step before retrying the more expensive and complex steps.
This presented two options:
- add wait-and-loop-back steps around every step we may want to retry,
- route all failures to a single wait-for-investigation step.
The former added significant complexity to the state machine, so AWS Solutions Architects raised this as a product feature that should be added to the Step Functions UI.
The proposed enhancement would allow any failed step to be manually rerun or skipped via UI controls, without adding explicit steps to each state machine to model this. In the meantime, the Dev team went with the latter approach, and had the human error review step loop back to the top of the state machine to retry the entire workflow. To avoid re-running long steps, they created a check within the step Lambda function to query the Step Functions API and determine whether that step had already succeeded before the loop-back, and complete it instantly if it had.
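The already-succeeded check can be sketched as a pure function over execution history events. The post does not say which API call the team used. This sketch assumes the events come from GetExecutionHistory and relies on that API's event types (TaskStateEntered, TaskSucceeded, LambdaFunctionSucceeded). The sample events are made up and trimmed to the fields the function reads, and the approach assumes a sequential workflow in which events from different steps do not interleave.

```python
# Event types that signal a task finished successfully.
SUCCESS_EVENT_TYPES = {"TaskSucceeded", "LambdaFunctionSucceeded"}


def succeeded_states(events):
    """Return the names of Task states that have at least one successful run.

    Tracks the most recently entered Task state and credits it with any
    success event that follows. Assumes steps run one at a time.
    """
    current = None
    done = set()
    for event in sorted(events, key=lambda e: e["id"]):
        if event["type"] == "TaskStateEntered":
            current = event["stateEnteredEventDetails"]["name"]
        elif event["type"] in SUCCESS_EVENT_TYPES and current is not None:
            done.add(current)
    return done


def should_skip(step_name, events):
    """True when the step already succeeded earlier in this execution."""
    return step_name in succeeded_states(events)


# Made-up history: the restore succeeded, then the report run failed.
sample_events = [
    {"id": 1, "type": "ExecutionStarted"},
    {"id": 2, "type": "TaskStateEntered",
     "stateEnteredEventDetails": {"name": "RestoreDatabase"}},
    {"id": 3, "type": "LambdaFunctionSucceeded"},
    {"id": 4, "type": "TaskStateExited",
     "stateExitedEventDetails": {"name": "RestoreDatabase"}},
    {"id": 5, "type": "TaskStateEntered",
     "stateEnteredEventDetails": {"name": "RunReports"}},
    {"id": 6, "type": "LambdaFunctionFailed"},
]

skip_restore = should_skip("RestoreDatabase", sample_events)
skip_reports = should_skip("RunReports", sample_events)
```

With this history, the restore step would return immediately after the loop-back while the report step would run again. The function looks for success events rather than state-exit events because a state can also exit through an error handler.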
Within 6 weeks, WSO was able to run the first reconciliations and begin the LIBOR migration on time. The Step Function Designer instantly gave the Development team an operator UI and a workflow orchestration engine. Normally, this would have required the creation of an entire 3-tier stack, plus scheduler and logging infrastructure.
Instead, using Step Functions allowed the developers to spend their time on the reconciliation logic that makes their application unique. The report compare tool developed by the WSO team provides clients with automated artifacts confirming that customer report data remained identical between the current and next versions of WSO. The new testing artifacts provide clients with robust and comprehensive testing of critical data extracts.
We hope that this blog post provided useful insights to help you determine whether AWS Step Functions is a good fit for you.
Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.