How to Migrate Mainframe Batch to Cloud Microservices with AWS Blu Age
By Alexis Henry, Chief Technology Officer at AWS Blu Age
While modernizing customer mainframes, our team at AWS Blu Age discovered that Batch can be a very complex aspect of a mainframe migration to Amazon Web Services (AWS). It often dictates whether a mainframe migration is successful or not.
To succeed in a transition to microservices, it is critical to design your AWS architecture to account for Batch's stringent performance requirements, such as intensive I/O, large datasets, and short execution windows.
In this post, I will describe how to migrate mainframe Batch to AWS microservices using AWS Blu Age automated transformation technology.
Customers choose microservices aiming for more agility, innovation, quality, scalability, and availability. Despite these advantages, a microservices approach introduces operational complexity. AWS has a number of offerings that address important challenges of microservices architectures: managed services, service orientation, polyglot programming, on-demand resources, infrastructure as code, and continuous delivery, among others.
Experience with microservices is still growing. Much of the existing technical literature describes microservices in the context of new applications or of peeling monoliths, such as transitioning Java or client/server applications. However, most of the world's existing IT relies on mainframe monoliths. Many corporations and public agencies are looking for strategies to migrate their mainframes to cloud microservices while minimizing project risk, duration, and cost.
Batch processing usually involves bulk processing of data that could not be handled in real time because of the limited capabilities of transactional engines at the time of their initial design and implementation. Batch software design was predicated on the constraints and assumptions of the mainframe environment, such as high CPU power for single-threaded applications, locking of I/O to data storage (which prevents concurrent processing of Batch and transactions), and higher Total Cost of Ownership (TCO) for provisioning nightly CPU peaks. Those constraints still directly influence Million Instructions Per Second (MIPS) estimation, cost, and the operational model.
A more efficient and cost optimized architecture is now available with AWS. It can be achieved by transforming legacy Batch processes to real-time microservices, leveraging Amazon Kinesis for data streaming, Amazon API Gateway for service invocation, and AWS Lambda and serverless computing for compute and storage.
In the following example, I will explain how to transition a typical retail banking mainframe application from the dual Online/Batch model toward real-time microservices combining AWS services and AWS Blu Age modernization technology.
Example: Mainframe Legacy Batch Architecture
In Figure 1, we use an example mainframe Batch architecture that we will transform into a microservices architecture in later sections. This typical scenario shows a mainframe legacy system using z/OS CICS, JCL, COBOL, DB2, VSAM files, GDG files, and tapes.
The Batch programs have been designed to avoid locking multiple users and making them wait for transaction responses. During the day, loan request transactions append transaction data to a temporary file, with one file per physical office. At night, CICS transactions that share data with Batches are switched off to avoid concurrency, locks, and consistency issues.
There are three Batch programs:
- Every five minutes, the Upload Batch program sends all temporary files to the Batch region via message queuing.
- Every night, the Loan Batch program is triggered by the scheduler. It executes the following logic:
- All files are merged into one
- The merged file is then sorted to improve performance and prepare the next processing steps
- Each record in the sorted file is processed for enrichment (personal information about credit history and other loans is collected from DB2 and injected into the enriched record)
- Each record is enriched a second time by injecting risk assessment information from a VSAM file, resulting in an enriched file with all the information required to perform risk analysis and grant or reject the loan request
- Each record in the previous output file is processed, and COBOL programs create three outputs: a copy of unprocessed/rejected records that need further processing (parsing errors, missing elements, failures); an update of a DB2 table for each customer requesting a loan, with the current status (rejected, approved, pending) and, for approved requests only, a loan proposal (rates, duration, etc.); and audit information (who, what, when, where) traced into mainframe Generation Data Group (GDG) files.
- Every week, the Archiving Batch is triggered to save some GDG data to tape devices (for legal reasons) and to prune the GDG (removing files already sent to tape).
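The nightly Loan Batch logic above can be sketched in Java to make the data flow concrete. This is an illustrative sketch only: the record fields, the lookup maps, and the approval threshold are assumptions standing in for the real COBOL copybooks, DB2 queries, and VSAM lookups.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the nightly Loan Batch: merge per-office files,
// sort, enrich twice, then decide approval status for each request.
class LoanBatchSketch {
    record LoanRequest(String officeId, String customerId, double amount) {}
    record EnrichedRequest(LoanRequest req, int creditScore, double riskScore) {}

    // Steps 1+2: merge all per-office files into one, then sort to prepare
    // the next processing steps (here: by customer id).
    static List<LoanRequest> mergeAndSort(List<List<LoanRequest>> officeFiles) {
        return officeFiles.stream()
                .flatMap(List::stream)
                .sorted(Comparator.comparing(LoanRequest::customerId))
                .collect(Collectors.toList());
    }

    // Steps 3+4: enrich each record with credit history (stand-in for the
    // DB2 lookup) and risk assessment (stand-in for the VSAM file lookup).
    static List<EnrichedRequest> enrich(List<LoanRequest> sorted,
                                        Map<String, Integer> creditScores,
                                        Map<String, Double> riskScores) {
        List<EnrichedRequest> out = new ArrayList<>();
        for (LoanRequest r : sorted) {
            Integer credit = creditScores.get(r.customerId());
            Double risk = riskScores.get(r.customerId());
            if (credit == null || risk == null) continue; // would go to the reject file
            out.add(new EnrichedRequest(r, credit, risk));
        }
        return out;
    }

    // Step 5: grant or reject; the threshold values are illustrative.
    static Map<String, String> decide(List<EnrichedRequest> enriched) {
        Map<String, String> status = new LinkedHashMap<>();
        for (EnrichedRequest e : enriched) {
            status.put(e.req().customerId(),
                    e.creditScore() > 600 && e.riskScore() < 0.5 ? "approved" : "rejected");
        }
        return status;
    }
}
```

In the legacy system each step reads and writes intermediate files on the mainframe; the in-memory lists here stand in for those files.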
Transforming Batch Logic to Microservices with Blu Age Velocity
Blu Age technology accelerates legacy application modernization with automation for both reverse-engineering of the legacy procedural mainframe applications (code + data) as well as forward-engineering to new microservice-ready object-oriented applications.
When modernizing from mainframe monoliths toward AWS, both the transformation and the definition of the target architecture are automated and standardized for AWS by Blu Age Velocity transformation technology. This execution environment is available off-the-shelf and relies upon two components:
- Blu Age Velocity Framework brings the utilities and services needed to eliminate legacy system specificities and anti-patterns: GO TO removal, the data memory model, the execution model, the data access model, sort and file management utilities, and more.
- BluSam Server can be seen as a full-stack microservice container. Any number of containers may be deployed, with each being the execution unit for locally deployed services and data access services. Each former Batch program becomes an autonomous Spring Boot executable. Microservice containers are distributed. Programs are freely deployed. Data is freely deployed. An in-memory read/write cache may be enabled on demand or at start-up. All services are exposed as REST APIs and automatically registered in a service directory.
Our recommendation for a successful mainframe to microservices project is to separate the technical stack transformation phase from the business split transformation phase in order to keep the microservices transformation complexity manageable for each phase and minimize project risks.
With such an approach, the first technical stack transformation phase focuses on the application code and data isofunctional migration, keeping the same application model with mostly infrastructure teams from both the mainframe and the AWS sides. The later business split transformation phase focuses on creating domain model boundaries for each microservice and will not involve a mainframe team. It does require participants from the Line of Business with an understanding of the business functions and processes.
For the technical stack microservice transformation phase, the mainframe Batch architecture is automatically refactored with Blu Age Velocity in the following way:
- REST APIs: Each service has its REST APIs available and deployed. This enables remote call capability for both business logic services and data access services. The typical integration strategy uses Kinesis, Lambda, and API Gateway.
- Java Programs: All former programs and scripts (COBOL programs, JCLs, CICS transactions, BMS maps) are transformed into standalone executables. BMS maps become Angular single-page applications, while server-side services become Java Spring Boot applications. Each may be freely deployed to the BluSam server of your choice. They appear in the Java Information Control System (JICS) layer of the preceding picture.
- Cache: Persisted data may be loaded into the in-memory cache to optimize performance. The cache supports write-behind and relies on Amazon ElastiCache, which increases both read and write performance, including in bulk mode. Native write-through is designed for read access but causes delays when refreshing data into the database, while write-behind allows optimal performance in all scenarios.
- Persistence Data Layer: Persisted data is managed by BluSam I/O. Any former data storage (VSAM, GDG, DB2 z/OS tables, etc.) is now kept in a persistence data store. Any prior data access mode (sequential, indexed sequential, hierarchical, relational) is refactored to fit the new database.
- Persistence Data Store: Typically, as detailed later, Amazon Aurora is the relational data store of choice for data persistence. Each BluSam Server now has the flexibility to operate its own database choice (relational, key-value store, NoSQL, graph database).
- Service Directory: All deployed services are published into a central directory for location lookup and integration across microservices.
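As a concrete illustration of the last point, a service directory can be reduced to a name-to-endpoint registry. This is a minimal sketch under assumed semantics, not the actual BluSam directory implementation; the service name and endpoint URL are hypothetical.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a service directory: each deployed service registers its
// REST endpoint under a logical name, and callers look it up by that name
// for integration across microservices.
class ServiceDirectory {
    private final Map<String, String> endpoints = new ConcurrentHashMap<>();

    // Called when a service is deployed (BluSam registers services automatically).
    void register(String serviceName, String restEndpoint) {
        endpoints.put(serviceName, restEndpoint);
    }

    // Location lookup: resolves a logical service name to its REST endpoint.
    Optional<String> lookup(String serviceName) {
        return Optional.ofNullable(endpoints.get(serviceName));
    }
}
```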
For the business split transformation phase, a Domain-Driven Design approach is recommended to identify each microservice's scope with a Bounded Context. Blu Age Analyzer automates domain discovery by analyzing data and call dependencies between legacy programs. Domain decomposition refactoring using functional input is supported as well. The decomposition strategy produced by Blu Age Analyzer then drives the modernization transformations.
To learn more about this approach, see Blu Age Analyzer, Martin Fowler's writings on Domain-Driven Design and Bounded Context, and the Wikipedia entry on Domain-Driven Design. Once the microservices scope and Bounded Contexts have been defined, AWS Blu Age automation can quickly refactor the application code to separate and create the new microservices application packages.
Example: Resulting Real-Time Microservices
Getting back to our mainframe legacy Batch example, application owners decide to modernize the mainframe with two main goals in mind:
- Enhance customer experience and satisfaction with answers for loan applications in minutes, rather than the following day once the nightly Batch is complete. Enable self-service and loan notifications to mobile application users.
- Increase agility and the ability to introduce new features or changes with better time to market by refactoring all business logic.
For this purpose, executives decide to transform their mainframe Batch leveraging AWS Blu Age technology as described in the preceding section. We now detail the resulting real-time microservices architecture on AWS. This example microservices Bounded Contexts split is as follows:
- Retail Banking SPA Portal Microservice: This is a distributed UI system which is localized per region/country (languages, legal specifics).
- Loan Risk Assessment Microservice: This service is in charge of assessing the risk of granting loans and sending a rate and duration proposal based on customer profile, credit history, and risk assessment rules.
- Transactional Retail Microservice: This service handles checks, credit card, and all former desk-facing simple operations.
- Long-term Data Storage Microservice: This becomes a service of its own, which other microservices do not have to be aware of (i.e., they do not need to trigger it or compose services with it).
We now describe components of this real-time microservices architecture.
Angular Front End
Angular enriches the user experience while ensuring users of the legacy mainframe application do not require retraining to use the modernized system. All surface behaviors of the former mainframe system are mimicked, with user-experience-related processes and screens transformed into a portal.
This architecture automatically distributes user connections to different containers and Regions for better availability and reliability. It also distributes the local versions of the retail application across countries and offices (to accommodate different languages and legislation). As soon as a user submits a request, the Angular single-page application emits data into Kinesis to route and process the request in real time.
Amazon Kinesis
Kinesis is the unified data hub for the real-time system, also called the data transport layer. Requests are pushed into Kinesis, which acts as a central, fast, and robust streaming hub. In addition to basic message queuing, Kinesis continuously sends data in the stream and allows data replay and broadcasting.
Kinesis is fully managed, which means you do not have to manage the infrastructure nor the configuration to adapt to burst or capacity variations. Even though data will be processed on the fly, Kinesis provides a buffer if needed. This is beneficial for mainframe Batch when there is a need to replay or reject prior data processing. Furthermore, Kinesis is used to support the Database per Service design pattern as per Martin Fowler’s microservices description.
Amazon API Gateway
API Gateway identifies which API or service to map to the incoming Kinesis data. It is the perfect fit as the central hub to access all your microservices, whatever their underlying technology and locations. Moreover, API Gateway serves as a service locator to call appropriate microservices, and enables service versioning and Canary deployment strategy. This allows reducing the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.
Another reason for using the API Gateway Canary strategy arises when different business versions must coexist, because large banks typically operate in many countries with different regulations. Using microservices through API Gateway solves both multichannel and multi-regulation issues.
AWS Lambda
API Gateway uses Lambda as a proxy to call microservices. Lambda initiates the context and the parameter values injected into the remote service. In the legacy system, JCL receives parameter values from a scheduler or from programmatically defined variables. In that context, the parameters set up the Batch runtime environment: the dataset name, the version of the deployed programs, and whether the Batch executes in the production or test partition.
In the new paradigm, Lambda preserves this runtime parameter capability, which the client should not handle for decoupling reasons, and triggers the appropriate Groovy script (which replaces z/OS JCL after transformation with AWS Blu Age). Groovy is preferred to Java because it can be modified without compilation while sharing the same JVM as the Java classes being run.
Therefore, mainframe Batch job steps or run units can be reproduced, and modifications can be made to the Batch setup without compilation. Groovy and Java classes are called via REST, either synchronously or asynchronously. Asynchronous is preferred for long processing times: Lambda lifetime must be kept below 500 seconds, and in such cases detached services are the right pattern.
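The runtime-parameter role that JCL played can be sketched as follows. This is illustrative only: the parameter names DATASET, VERSION, and PARTITION, the default values, and the script-naming convention are assumptions, not actual Blu Age conventions.

```java
import java.util.*;

// Sketch of the Lambda-side responsibility: take runtime parameters that a
// scheduler (or JCL symbolics) would have supplied, build the Batch runtime
// context, and select the Groovy script (the transformed JCL) to trigger.
class BatchLauncherSketch {
    record RunContext(String dataset, String version, boolean production) {}

    // Defaults mirror a test partition; callers override via parameters.
    static RunContext buildContext(Map<String, String> params) {
        return new RunContext(
                params.getOrDefault("DATASET", "LOAN.REQUESTS.TEST"),
                params.getOrDefault("VERSION", "latest"),
                "PROD".equals(params.getOrDefault("PARTITION", "TEST")));
    }

    // The Groovy script would be invoked via REST with this context; here we
    // only compute its (hypothetical) name to keep the sketch self-contained.
    static String scriptFor(String jobName, RunContext ctx) {
        return jobName + (ctx.production() ? "-prod" : "-test") + ".groovy";
    }
}
```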
Amazon ElastiCache
One specificity of mainframe systems is the I/O capabilities provided both by the underlying file system and by non-relational databases built on top of it. Among those, VSAM relies on an indexed sequential data store for which modern databases (RDBMS, graph databases, column databases) do not provide equivalent indexing, at least not while preserving performance for all features. Blu Age BluSam uses ElastiCache (a Redis implementation) to deliver equivalent performance while supporting the necessary I/O capabilities:
- In Memory Indexes: VSAM indexes are stored in ElastiCache to support fast and full featured indexed sequential logic.
- Index Persistence: Indexes are saved in real time to an underlying database to meet availability requirements. The default configuration stores them in Aurora; BluSam allows any RDBMS or key-value store as well.
- Record Caching: Mainframe dataset records may be uploaded to cache, either at BluSam Server startup using bulk cache insert or on the fly as requests hit the cache.
- Write-behind: In addition to write-through, BluSam adds persistence-specific services to support write-behind. Write-through induces a delay (database acknowledgement), which may cause a performance slowdown during bulk processing. For this reason, write-behind has been added to manage transactions at the cache level only. The cache manages persistence to the underlying database with a first-of strategy (whichever comes first: N records changed, or elapsed time since the last cache save). This write-behind feature is available for both indexes and individual records.
- Managed Service: ElastiCache is a fully managed service that enables transparent scaling. Even large mainframe Batch systems requiring processing of terabytes of business records per day are handled by ElastiCache with no need to manage capacity or scaling. Data may be uploaded into the cache in burst mode to reliably process any bulk data (warming the cache and processing in memory is typically a good strategy for bulk processing).
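The write-behind first-of strategy described above can be sketched in plain Java. This is a simplified sketch: the flush thresholds and the in-memory stand-in for the database are illustrative, not BluSam defaults.

```java
import java.util.*;

// Sketch of a write-behind cache: writes are acknowledged at the cache level,
// and a flush to the underlying database is triggered by whichever comes
// first: N changed records, or a maximum elapsed time since the last flush.
class WriteBehindSketch {
    private final Map<String, String> dirty = new LinkedHashMap<>();
    private final int maxDirty;          // flush after this many changed records...
    private final long maxElapsedMillis; // ...or after this much elapsed time
    private long lastFlush;
    // Stand-in for the underlying database: each entry is one batched flush.
    private final List<Map<String, String>> flushed = new ArrayList<>();

    WriteBehindSketch(int maxDirty, long maxElapsedMillis, long now) {
        this.maxDirty = maxDirty;
        this.maxElapsedMillis = maxElapsedMillis;
        this.lastFlush = now;
    }

    // The write is acknowledged immediately; persistence may happen later.
    void put(String key, String value, long now) {
        dirty.put(key, value);
        maybeFlush(now);
    }

    private void maybeFlush(long now) {
        if (dirty.size() >= maxDirty || now - lastFlush >= maxElapsedMillis) {
            flushed.add(new LinkedHashMap<>(dirty)); // one bulk write to the database
            dirty.clear();
            lastFlush = now;
        }
    }

    int flushCount()   { return flushed.size(); }
    int pendingCount() { return dirty.size(); }
}
```

Passing the clock in as a parameter keeps the sketch deterministic; a real implementation would use the system clock and a background flush timer.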
Amazon Aurora
Aurora is the preferred target AWS database for mainframe modernization because of its performance and functional equivalence. Legacy mainframe applications rely mostly on VSAM and SQL I/O capabilities, and changing to another data access type, such as put/get of a document, would be risky and time-consuming because it involves a major rewrite of the application logic, disconnecting it from its data access APIs. Using Aurora with BluSam and ElastiCache, however, preserves transformation automation and performance with no need for refactoring, for both VSAM-like indexes (permanent storage in Aurora; live indexes in ElastiCache) and native SQL support.
Scalability is also important when modernizing mainframes because legacy application usage varies over time. For example, payroll or tax payment systems have a peak of activity every month or quarter. Aurora is a managed database with storage that can automatically grow to 64 TB per instance.
One challenge with microservices is the Database per Service design pattern. Because data is distributed, this pattern creates a need for data synchronization within the constraints of the CAP theorem. Specific patterns exist to handle the trade-offs between eventual consistency, rollback mechanisms, and transactional delay, but they were all designed for online transactions. Each of them suffers from network and transactional delays that do not fit mainframe Batch latency requirements: Batch expects commit times within one millisecond, whereas patterns such as Saga or API Composition introduce up to 100 milliseconds of delay.
Aurora brings a simple yet effective capability for data synchronization through native Lambda integration. Whenever a record is modified, a Lambda function is triggered. The Lambda streams the change into Kinesis, which delivers it to API Gateway. Kinesis allows multiple subscribers to propagate the data change to their local data stores, while API Gateway provides API management for each domain. All domains are thus synchronized simultaneously, while each implements its private synchronization APIs based on its own choice of languages and databases.
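The synchronization chain (Aurora record change, Lambda trigger, Kinesis stream, subscribing domains) can be sketched as a simple publish/subscribe fan-out. This is a pure-Java stand-in with no actual AWS calls; the event shape and store contents are assumptions for illustration.

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of the data-synchronization fan-out: a record change event is
// published once (as Lambda would stream it into Kinesis), and every
// subscribed domain applies it to its own private data store.
class ChangeFanOutSketch {
    record ChangeEvent(String table, String key, String newValue) {}

    private final List<Consumer<ChangeEvent>> subscribers = new ArrayList<>();

    // Each domain registers a handler that updates its private store.
    void subscribe(Consumer<ChangeEvent> domainHandler) {
        subscribers.add(domainHandler);
    }

    // Stand-in for the Lambda-to-Kinesis publication of an Aurora change.
    void publish(ChangeEvent event) {
        for (Consumer<ChangeEvent> s : subscribers) s.accept(event);
    }
}
```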
Aurora Serverless opens new strategies for achieving high performance. It behaves like the regular Aurora service but automatically scales up or down based on your application's needs while preserving ACID transactions. Because of this rapid scaling, Aurora Serverless is a cost-effective solution for Batch, burst, and bulk data transfers, data consolidation, and reducing elapsed time for long processes such as payroll Batches.
Data Storage Microservice
In the mainframe system, long-term data storage was handled at the application level, with several Batch jobs responsible for archiving, backup, and pruning. Furthermore, mainframe archival is costly and complex because it relies on generational GDG files on the mainframe itself, and on tapes shipped to off-site storage.
Preserving the overall architecture, Kinesis and API Gateway remain the central data hub. AWS Batch is used to schedule data storage actions through a Lambda function. The Lambda function copies Aurora data into Amazon S3 with a statement similar to SELECT <columns> FROM <table> INTO OUTFILE S3 '<s3-uri>', where the SELECT clause picks the data to be stored externally to the application database.
Amazon S3 functionality replaces z/OS GDG features: local copies of data, audit trails, medium-term archiving and pruning of data, and extra backups. For long-term archival and to satisfy regulatory requirements, the data is later moved from Amazon S3 into Amazon Glacier based on a lifecycle rule.
Summary
Blu Age Velocity is a ready-to-use solution that accelerates the migration of mainframe Batch to AWS microservices. Because it is a packaged solution, it minimizes project risk and cost. Savings come from transitioning from a mainframe MIPS cost structure to pay-as-you-go AWS Managed Services; such savings typically finance the modernization project within a short time period and allow for a quicker return on investment.
The target architecture uses AWS Managed Services and serverless technology, so each microservice is elastic and minimizes system administration tasks. Each service adapts to client demand, automatically ensuring availability and performance while you pay only for what you use.
From a design perspective, mainframe Batch applications are migrated to real-time, which improves customer experience and satisfaction. Batch applications are also transformed into microservices that benefit from more flexibility, increased agility, independent business domains, deployment automation, and safe deployment strategy. In short: better, faster, safer capability to deliver and implement new features.
Learn More About Blu Age Velocity
Blu Age Velocity can be used by any customer in any industry, for any mainframe executing languages such as COBOL (including most of its various flavors), JCL, and subsystems such as CICS, IMS, and VSAM. Blu Age Velocity accelerates both the code automated modernization and the target architecture definition.
AWS Blu Age also facilitates the necessary activities, from legacy code base inventory and analysis to like-for-like business logic testing and compliance with the latest development standards. AWS Blu Age recommends performing a Proof of Concept with the most complex Batch jobs. This proves the technology's robustness and minimizes risk for the remaining jobs and programs.