Automate High Availability Tests for SAP HANA

Introduction

The software development and operations industry has been modernizing and increasingly applying DevOps as the standard approach to its processes. However, SAP installation and operations still tend to be largely manual. To help move toward an automated approach, our first blog post demonstrated how to provision the infrastructure for SAP applications using Terraform and native AWS tools such as AWS Launch Wizard. The second blog post showed how to automate SAP software installation using AWS Systems Manager. Finally, the third blog post went deeper into automation and performed an end-to-end installation of a full SAP landscape operating in High Availability (HA).

This blog post focuses more on the operations of an SAP landscape. High availability testing is required to understand the resiliency of your applications and to ensure that the deployed applications meet the Recovery Time Objective. Many organizations have to test their high availability configuration from time to time to stay compliant with auditing processes as well.

The solution detailed in this blog has come from the work the AWS SAP Professional Services team has undertaken with a number of customers to automate the deployment and testing of High Availability clusters for SAP HANA workloads, which we are providing as open source (links below) for customers to use and adapt as required.

With this blog post we’re introducing an industry-wide practice known as chaos engineering into the SAP world. It will give you more confidence and more predictability on how your SAP landscape behaves, and also how you can configure it to self-heal after an outage or critical errors.

Applying chaos engineering can improve the resiliency of your SAP landscape by identifying potential problems before they cause an issue. In our experience, however, manually running the required set of test scenarios can take up to two months of elapsed time and requires a highly skilled professional in most organizations. In this blog post, we share sample code that you can use as a starting point to automate these test scenarios, save significant amounts of time for your team, and stay compliant with auditing processes.

Compared to the typical manual approach to testing an HA configuration, this solution brings the following benefits:

Speed: from about 2 months to hours.
Reliability: covers 12 HA testing scenarios (described below) in a repeatable process.
Fewer human errors: because the HA test becomes a repeatable process, tests are less likely to fail because of human error.
Audit asset: the final HTML report generated by this solution has all the common information required in audit assets.
Improvement asset: when there’s an error, the final HTML report also highlights what specifically went wrong during the test and gives a full picture of the system state before and after the error. The report can then be handed to an SAP BASIS professional, who applies the fixes to the high availability configuration.

The solution presented here can be run in a few different ways, but all of them generate an HTML report like the example shown in the section “The report”.

To use this solution as a starting point, there’s a public GitHub repository with all the code required. To understand how to run it, go to the section “How to run the code”. A quick lab environment can be spun up using this guide. This sample code is intended as a starting point to reduce the effort required to automate HA testing at your company. It was built and tested on SAP HANA 1909 on Red Hat OS (Operating System).

Prerequisites

You need the following prerequisites before starting this guide and running this code:

Install Ansible on your controller instance. The controller instance can be:
1. Your own instance/laptop/workstation
2. Your CI/CD tool to automate the run of this solution
3. Ansible Tower
SSH access and connectivity established between the controller instance and the SAP landscape instances that you want to run the HA tests on.
One SAP landscape comprised of:
1. Two HANA instances with HA previously configured. Refer to the AWS documentation on configuring Red Hat Enterprise Linux clusters for SAP on AWS for more information.
2. One ASCS instance
3. One PAS instance
One AWS IAM Role with the following permissions, configured in the AWS CLI (Command Line Interface) on the controller instance. Ansible will use it to interact with your SAP landscape during the tests. For more information, see how to configure an additional profile for your AWS CLI:
1. ec2:StartInstances on all instances
2. ec2:RebootInstances on all instances
3. ec2:StopInstances on all instances
Create an AMI and capture snapshots of each of the EBS volumes from the instances you’re involving in this scenario as a backup.
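
As an illustration, the three EC2 permissions listed above could be expressed as an IAM policy document along the following lines. This is a minimal sketch, not the policy shipped with the repository: the function name and the statement ID are hypothetical, and using a wildcard instance ARN to mean “on all instances” is an assumption that you may want to narrow to the instances of your SAP landscape.

```python
import json


def build_ha_test_policy(region="*", account_id="*"):
    """Return an IAM policy document (as a dict) that allows the three
    EC2 actions listed in the prerequisites."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Hypothetical statement ID, chosen for this example only.
                "Sid": "AllowHaTestInstanceControl",
                "Effect": "Allow",
                "Action": [
                    "ec2:StartInstances",
                    "ec2:RebootInstances",
                    "ec2:StopInstances",
                ],
                # Assumption: "all instances" is written as a wildcard ARN.
                "Resource": f"arn:aws:ec2:{region}:{account_id}:instance/*",
            }
        ],
    }


if __name__ == "__main__":
    # The account ID below is a placeholder.
    print(json.dumps(build_ha_test_policy("us-east-1", "123456789012"), indent=2))
```

The printed JSON can be pasted into the IAM console or saved to a file and attached to the role that your AWS CLI profile assumes.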

Covered high availability test scenarios

Each test scenario is designed to be usable as a standalone test, in case you don’t want to run all the tests at once. To accomplish this, certain tasks are executed before and after each test scenario to confirm that the environment has a valid configuration and that the test has completed successfully. These common steps are as follows, in order of execution:

Were all the required input parameters provided?
Is there connectivity between the nodes and the controller node?
Which node is Primary and which is Secondary?
Is a minimum high availability configuration in place?
Is the replication mode set on HANA the same before and after failover?
What is the ASCS enqueue number before the test starts? Does it match the number after failover?
Is PAS connected to the right database instance, both before and after failover?

The test scenarios available in the GitHub repository are as follows:

“HDB Stop” on Primary database.
“HDB Stop” on new Primary database (after failover).
“HDB Stop” on Secondary database.
“PCS node standby” on Primary database.
“PCS node standby” on new Primary database (after failover).
“kill -9 pid” on Primary database.
Crash instance with “echo ‘b’ > /proc/sysrq-trigger” on Primary database.
Crash instance with “echo ‘b’ > /proc/sysrq-trigger” on new Primary database (after failover).
“HDB kill -9” on Primary database.
“HDB kill -9” on new Primary database (after failover).
Reboot Primary database.
Reboot new Primary database (after failover).
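
If you want to reason about the scenarios programmatically, for example to split them into smaller runs, the list above can be modeled as plain data. The sketch below only restates the 12 scenarios from this post; the type, the constants and the helper function are hypothetical and do not exist in the repository.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    action: str  # label of the failure being injected, as listed above
    target: str  # node on which the failure is injected


PRIMARY = "primary"
NEW_PRIMARY = "new primary (after failover)"
SECONDARY = "secondary"

# The 12 scenarios, in the order they are listed in this post.
SCENARIOS = [
    Scenario("HDB Stop", PRIMARY),
    Scenario("HDB Stop", NEW_PRIMARY),
    Scenario("HDB Stop", SECONDARY),
    Scenario("PCS node standby", PRIMARY),
    Scenario("PCS node standby", NEW_PRIMARY),
    Scenario("kill -9 pid", PRIMARY),
    Scenario("Crash instance", PRIMARY),
    Scenario("Crash instance", NEW_PRIMARY),
    Scenario("HDB kill -9", PRIMARY),
    Scenario("HDB kill -9", NEW_PRIMARY),
    Scenario("Reboot", PRIMARY),
    Scenario("Reboot", NEW_PRIMARY),
]


def scenarios_for(target):
    """Return the scenarios that inject a failure on the given node role."""
    return [scenario for scenario in SCENARIOS if scenario.target == target]


if __name__ == "__main__":
    for role in (PRIMARY, NEW_PRIMARY, SECONDARY):
        print(role, len(scenarios_for(role)))
```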

The report

Image 1 – Example of HTML report generated with a failure on the last step

The HTML report shown in image 1 covers two scenarios, plus step #1, which is an overall check of the system configuration before the actual tests run, and step #7, which contains the post-actions required after the system crash in step #6. The HDB Stop scenario (steps #3, #4 and #5) succeeded, while the System Crash scenario (#6) failed, as detected in the post-actions step (#7).

How to run the code

Single button alternative: If you’re using the solution described in the third blog post, you can reload the project (1 – vagrant destroy, 2 – git pull, and 3 – vagrant up again), fill in all the parameters under Credentials again, find the new job listed as “Run HANA HA test” under the folder “SAP Hana+ASCS+PAS 3 Instances”, and hit “build now”. Image 2 shows the result you’ll get once the test is finished:

Image 2 – Final output of running HA test using the Jenkins pipeline

Once that’s done, you can find the HTML report as follows:

Click the build number (24 in this case).
Navigate to Workspaces.
Select the first item listed.
Select the Reports folder.
Right-click “report-<current date and time>.html” and save it to your local desktop. Remember to save it locally; otherwise the report will not be properly formatted.

Image 3 – Saving the final HTML report locally

Main procedure

Have access to a terminal on a Linux or Mac computer.
Have the AWS CLI configured locally.
Clone the GitHub repo.
For each of your servers (HANA Primary, HANA Secondary, ASCS and PAS), update the following information in the file hosts.yaml:
1. ansible_host
2. ansible_user
3. ansible_ssh_private_key_file

Open the var_file.yaml and fill in the required information:

#	Field	Default value	Comments
Information for HANA
1	INPUT_HANA_SID	AD0	Your HANA SID
2	INPUT_HANA_INSTANCE_NUMBER	0	Your HANA instance number
3	INPUT_SYSTEM_USER	SYSTEM	Username for the SYSTEM default user. This will be used to check if a backup is available before starting the tests
4	INPUT_SYSTEM_PASSWORD	P@ssw0rd	Password for the SYSTEM user. This will be used to check if a backup is available before starting the tests
5	INPUT_HANA_SYNC_MODE	SYNC	HANA replication mode
Information for ASCS
6	INPUT_ASCS_SID	AD0	Your ASCS SID
7	INPUT_ASCS_INSTANCE_NUMBER	0	Your ASCS instance number
Information for PAS
8	INPUT_PAS_SID	AD0	Your PAS SID
9	INPUT_PAS_INSTANCE_NUMBER	0	Your PAS instance number
10	INPUT_CHECK_R3_TRANS	TRUE	Whether to check the R3trans command on PAS after database failovers or not
Information for AWS CLI
11	INPUT_AWS_REGION	us-east-1	The region where your instances are
12	INPUT_AWS_CLI_PROFILE	default	The profile you configured for your AWS CLI in the Prerequisites section
13	INPUT_PRIVATE_SSH_KEY	/my/path/to/pemFile.pem	Path to the SSH key for Ansible to SSH into your instances
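
Before starting a run, it can help to sanity-check the values you entered. The sketch below is not part of the repository: it takes the field names and defaults from the table above and applies a few simple rules. The function name is hypothetical, and the SID and instance number formats are assumptions you should adapt to your own naming rules.

```python
import re

# Field names and default values copied from the table above.
DEFAULT_VARS = {
    "INPUT_HANA_SID": "AD0",
    "INPUT_HANA_INSTANCE_NUMBER": "0",
    "INPUT_SYSTEM_USER": "SYSTEM",
    "INPUT_SYSTEM_PASSWORD": "P@ssw0rd",
    "INPUT_HANA_SYNC_MODE": "SYNC",
    "INPUT_ASCS_SID": "AD0",
    "INPUT_ASCS_INSTANCE_NUMBER": "0",
    "INPUT_PAS_SID": "AD0",
    "INPUT_PAS_INSTANCE_NUMBER": "0",
    "INPUT_CHECK_R3_TRANS": "TRUE",
    "INPUT_AWS_REGION": "us-east-1",
    "INPUT_AWS_CLI_PROFILE": "default",
    "INPUT_PRIVATE_SSH_KEY": "/my/path/to/pemFile.pem",
}

# Assumption: a SID is three uppercase alphanumeric characters, letter first.
SID_PATTERN = re.compile(r"[A-Z][A-Z0-9]{2}")
# Assumption: an instance number is written with one or two digits.
INSTANCE_PATTERN = re.compile(r"[0-9]{1,2}")


def validate_vars(values):
    """Return a list of error messages; an empty list means all fields look valid."""
    errors = []
    for name in DEFAULT_VARS:
        raw = values.get(name)
        value = "" if raw is None else str(raw).strip()
        if not value:
            errors.append(f"{name} is missing or empty")
        elif name.endswith("_SID") and not SID_PATTERN.fullmatch(value):
            errors.append(f"{name} does not look like a SID: {value!r}")
        elif name.endswith("_INSTANCE_NUMBER") and not INSTANCE_PATTERN.fullmatch(value):
            errors.append(f"{name} does not look like an instance number: {value!r}")
    return errors


if __name__ == "__main__":
    for message in validate_vars(DEFAULT_VARS):
        print(message)
```

You would load your own var_file.yaml into a dictionary with the YAML parser of your choice and pass it to `validate_vars` instead of the defaults.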

Run the file “how_to_run.sh”

This is the quickest path for you to run the solution. If you’re comfortable with Bash, I encourage you to take a look at the file “how_to_run.sh” to understand its behavior and find your own automation opportunities for your scenario.

The run time depends on the size and configuration of your HANA system. As a benchmark, a full run of the 12 scenarios on an empty SAP landscape installation takes roughly 45 minutes to 1 hour.

If you have a big SAP landscape, I encourage you to read the section “Customizing the amount and sequence of test scenarios” carefully to customize how you run the solution and split the scenarios into different runs. This way you can review the outputs in a shorter time frame and understand possible errors in your installation before running the full set of tests.
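
If you would rather start the playbook from your own tooling (a CI job, for instance), one possible shape is sketched below. This is an assumption about how the files named in this post fit together, not a transcript of “how_to_run.sh”: it only builds an `ansible-playbook` command line from the inventory (hosts.yaml), the variables file (var_file.yaml) and the playbook (main.yaml), using the standard `-i` and `-e @file` options. The function name is hypothetical.

```python
import shlex


def build_playbook_command(
    inventory="hosts.yaml",
    var_file="var_file.yaml",
    playbook="main.yaml",
):
    """Build the argument list for an ansible-playbook run.

    Assumption: the command is run from the repository folder that
    contains the three files.
    """
    return [
        "ansible-playbook",
        "-i", inventory,        # inventory with the four SAP hosts
        "-e", f"@{var_file}",   # load extra variables from a file
        playbook,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before it is executed.
    print(shlex.join(build_playbook_command()))
```

Returning a list rather than a single string lets you pass the result directly to `subprocess.run` without worrying about shell quoting.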

Analyzing the final report

Once the tests complete, an HTML report is generated with the results of each test. In case of a failure, this report provides the information an SAP BASIS professional needs to design a remediation plan: (1) which test went wrong, (2) which specific command failed, and (3) how the server was configured right before that command failed.

To help with item 3 above, the report includes snapshots of the output of the most common SAP HANA troubleshooting commands. Examples of commands whose output appears in the report:

crm_mon -A1
HDB proc
hdbnsutil -sr_state
hdbuserstore list
python systemReplicationStatus.py
sapcontrol (…) -function GetProcessList
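
The idea of capturing a snapshot of these commands can be sketched as follows. This is a simplified illustration, not how the repository collects the data (it does so through Ansible on the remote nodes). The function names are hypothetical, and the sapcontrol call is left out because its arguments are not spelled out in this post.

```python
import subprocess

# Commands taken from the list above.
DIAGNOSTIC_COMMANDS = [
    "crm_mon -A1",
    "HDB proc",
    "hdbnsutil -sr_state",
    "hdbuserstore list",
    "python systemReplicationStatus.py",
]


def run_locally(command):
    """Run a command in the local shell and return (return code, output)."""
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=False
    )
    return completed.returncode, completed.stdout + completed.stderr


def capture_snapshot(commands, runner=run_locally):
    """Run each command through `runner` and collect the results by command."""
    snapshot = {}
    for command in commands:
        returncode, output = runner(command)
        snapshot[command] = {"returncode": returncode, "output": output}
    return snapshot
```

The `runner` parameter is there so that you can swap in a function that executes the command over SSH on a HANA node, typically as the HANA administration user, instead of running it on the controller.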

Following the example report shown in image 1, once you click “FAIL” on step 7, the screen scrolls to the details of that step, “Run POST ACTIONS for Scenario CRASH_NODE_PROC_PRE”:

Image 4 – Details of errored step

Steps in blue before the ones in red show actions executed on the system before the error happened. You can use them to understand the state of the system before the error.

The error itself is presented in red, with additional information about what happened. Image 5 shows an example where it was detected that, after a failover event, PAS was still connected to the old node and didn’t switch to the new primary one.

Image 5 – Error highlighted

Customizing the amount and sequence of test scenarios

Although the solution comes pre-configured to run all 12 scenarios, if you feel comfortable with Ansible and want to change how it runs, you can reduce the number of scenarios to fit your purposes. To do so, open the file aws-sap-hana-ha-test/main.yaml and comment out the blocks you don’t want to run. The first and second blocks (“Check Inputs” and “Check current HA installation (prep / bridge tasks)”) are mandatory and must always run.

Let’s say you want to run just the simple “HDB Stop” scenarios to get to know the solution. Then you can comment out all the lines from line 66 to the end of the file.

Almost all the scenarios are standalone and can be run in any order, so you can customize them freely. The only exception is the “Crash instance” scenario, which must be followed by its “post actions” scenario. So if you’re running the crash scenarios at lines 93 or 111, keep the scenarios right below them (lines 102 and 120 respectively).
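
The two rules above (the mandatory first blocks, and the post actions that must follow a crash scenario) can be checked mechanically before you run a customized sequence. The sketch below works on a plain list of block names; the two mandatory names are taken from this post, while the crash and post-actions labels and the function name are hypothetical placeholders for the names used in your main.yaml.

```python
# Names of the two mandatory blocks, as given in this post.
MANDATORY_PREFIX = [
    "Check Inputs",
    "Check current HA installation (prep / bridge tasks)",
]

# Hypothetical labels; replace them with the block names in your main.yaml.
CRASH_BLOCK = "Crash instance"
POST_ACTIONS_BLOCK = "Post actions"


def validate_sequence(blocks):
    """Return a list of rule violations for a customized block sequence."""
    errors = []

    # Rule 1: the two check blocks must always run first.
    if blocks[:2] != MANDATORY_PREFIX:
        errors.append("The sequence must start with the two mandatory check blocks")

    # Rule 2: every crash scenario must be followed by its post actions.
    for index, name in enumerate(blocks):
        if name == CRASH_BLOCK:
            follower = blocks[index + 1] if index + 1 < len(blocks) else None
            if follower != POST_ACTIONS_BLOCK:
                errors.append(
                    f"Block {index + 1} ({CRASH_BLOCK}) is not followed by {POST_ACTIONS_BLOCK}"
                )
    return errors
```

An empty result means the sequence respects both rules; each message otherwise points to the position of the offending block.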

Next steps

Ready to get started? Head straight to the installation automation repo and start testing on your environment.

Once your tests are finished, you are welcome to customize the repo to meet your specific needs. The repo’s folders have READMEs with more instructions about how each of them works, how the pieces fit together, and how to have SAP running in the end.

If you are looking for expert guidance and project support as you move your SAP systems to a DevOps model, the AWS Professional Services Global SAP Specialty Practice can help. Increasingly, SAP on AWS customers—including Companies House Services and Phillips 66—are investing in engagements with our team to accelerate their SAP transformation. Please contact our AWS Professional Services team if you would like to learn more about how we can help.

To learn why thousands of customers choose AWS for SAP, visit our SAP page.

Join the SAP on AWS Discussion

In addition to your customer account team and AWS Support channels, you can connect with us through re:Post – A Reimagined Q&A Experience for the AWS Community. Our SAP on AWS Solution Architecture team regularly monitors the SAP on AWS topic for discussions and questions that could be answered to assist our customers and partners. If your question is not support-related, consider joining the discussion over at re:Post and adding to the community knowledge base.
