How to run a Multi-AZ stateful application on EKS with Amazon FSx for NetApp ONTAP
Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. Organizations often run a mix of stateless and stateful applications on a Kubernetes cluster. For stateful applications, there is often a trade-off between performance and availability for the external storage. Organizations want their applications to be highly available (available in multiple Availability Zones) and, at the same time, to have sub-millisecond latency and high IOPS.
In this blog post, we look into Amazon FSx for NetApp ONTAP and explore its read/write latency and IOPS as a persistent storage layer for workloads on Amazon EKS. We will demonstrate a sample stateful application on Amazon EKS by using NetApp’s Trident Container Storage Interface (CSI) driver. The CSI driver allows Amazon EKS clusters to manage the lifecycle of storage volumes powered by NetApp ONTAP file systems.
Solution overview
The infrastructure for this solution comprises an Amazon EKS cluster with three EC2 worker nodes and an FSxONTAP file system that spans multiple Availability Zones. The three worker nodes and the FSxONTAP file system sit in the private subnets in the VPC. We will walk through how to use NetApp’s Trident Container Storage Interface (CSI) to create storage volumes powered by FSxONTAP for a MySQL database running on an Amazon EKS cluster. The following high-level architecture diagram illustrates the environment:
What is Amazon FSx for NetApp ONTAP?
Amazon FSx for NetApp ONTAP is a fully managed service that provides highly reliable, scalable, performant, and feature-rich file storage built on NetApp’s popular ONTAP file system. It provides the familiar features, performance, capabilities, multi-protocol access (iSCSI/NFS/SMB), and APIs of NetApp file systems with the agility, scalability, and simplicity of a fully managed AWS service.
For details on how Amazon FSx for NetApp ONTAP works, see: https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/how-it-works-fsx-ontap.html
Solution walkthrough
Here are the major steps to complete the deployment:
- Clone the code from the GitHub repo.
- Create a VPC environment in your AWS account using AWS CloudFormation (optional).
- Create the FSxONTAP file system using AWS CloudFormation.
- Use eksctl to create an Amazon EKS cluster.
- Create FSxONTAP volumes as the storage layer for a sample application.
- Test pod-level failover.
- Use FIO for running performance tests on FSxONTAP from within a K8S pod on Amazon EKS.
- Test FSxONTAP file system-level failover and failback across Availability Zones.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account with necessary permissions to create and manage Amazon VPC, Amazon EKS cluster, Amazon FSx for NetApp ONTAP file system, and CloudFormation stack.
- eksctl, kubectl, and Helm 3 installed on your laptop.
- The AWS Command Line Interface (AWS CLI) version 2.7.3 or newer, configured in your working environment. For information about installing and configuring the AWS CLI, see Installing or updating the latest version of the AWS CLI.
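As a quick sanity check for the AWS CLI prerequisite, a version comparison could be sketched as follows. The helper names version_ge and aws_cli_version are hypothetical (they are not part of any of the tools listed above), and the comparison assumes a sort that supports -V, as GNU coreutils does.

```shell
# Hypothetical helpers for checking the AWS CLI version prerequisite.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes "sort -V" (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Reads a line such as "aws-cli/2.7.3 Python/3.9.11 ..." on stdin
# and prints only the version number ("2.7.3").
aws_cli_version() {
  sed -n 's|^aws-cli/\([0-9.]*\).*|\1|p'
}

# Usage (requires the AWS CLI on PATH):
#   installed="$(aws --version 2>&1 | aws_cli_version)"
#   version_ge "$installed" 2.7.3 || echo "AWS CLI $installed is older than 2.7.3"
```

The same version_ge helper could be reused for eksctl, kubectl, or Helm if you want to pin minimum versions for those tools as well.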
1. Clone the GitHub repository
You can find the CloudFormation template and relevant code in this GitHub repo. Run the following command to clone the repository to your local workstation.
git clone https://github.com/aws-samples/mltiaz-fsxontap-eks.git
There are two folders that you need to reference in the following steps: the “eks” folder contains all manifest files related to the EKS cluster resources, and the “FSxONTAP” folder contains the CloudFormation templates for spinning up the VPC environment and the FSxONTAP file system.
2. Create a VPC environment for Amazon EKS and FSxONTAP (Optional)
Create a new VPC with two private subnets and two public subnets using CloudFormation. This step is optional, and an existing VPC can be reused for the Amazon EKS cluster and the FSxONTAP file system.
Launch the CloudFormation stack to set up the network environment for both FSxONTAP and EKS cluster:
$ cd mltiaz-fsxontap-eks/FSxONTAP
$ aws cloudformation create-stack --stack-name EKS-FSXONTAP-VPC --template-body file://./vpc-subnets.yaml --region <region-name>
Once the stack has been deployed successfully, take note of the IDs for PrivateSubnet1, PrivateSubnet2, VPCId, and PrivateRouteTable1, as we will need them in the following steps when creating both the EKS cluster and FSx ONTAP file system.
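Instead of copying the IDs from the console, the stack outputs could be read with the AWS CLI. The following is a minimal sketch: stack_output is a hypothetical helper, and it assumes that the template exposes outputs whose keys match the names mentioned above (PrivateSubnet1, PrivateSubnet2, VPCId, and PrivateRouteTable1).

```shell
# Hypothetical helper: print one output value of a CloudFormation stack.
# Arguments: STACK_NAME OUTPUT_KEY [REGION]
# Assumption: the output keys match the names used in this walkthrough.
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
    --output text \
    ${3:+--region "$3"}
}

# Usage (requires AWS credentials and the deployed stack):
#   VPC_ID="$(stack_output EKS-FSXONTAP-VPC VPCId <region-name>)"
#   SUBNET1="$(stack_output EKS-FSXONTAP-VPC PrivateSubnet1 <region-name>)"
```

If a key name differs in your copy of the template, running describe-stacks without the --query option lists every output so you can adjust the key accordingly.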
3. Create an Amazon FSx for NetApp ONTAP file system
Run the following CLI command to create the Amazon FSx for NetApp ONTAP file system. (Note that you need to modify the parameters based on the VPC environment created in step 2.)
This CloudFormation stack will take some time to complete; feel free to move to step 4 while waiting for the file system to be deployed.
After the completion of the deployment, we can verify in the following screenshot that the FSx NetApp ONTAP file system and Storage Virtual Machine (SVM) are created.
Take a look at the details of the FSxONTAP file system; we can see that the file system has a primary subnet and a standby subnet.
The SVM is also created.
4. Create an Amazon EKS cluster
In this walkthrough, we are going to create the EKS cluster with a managed node group that contains three worker nodes residing across the two private subnets created in step 2. In the cluster.yaml file, substitute the VPC ID and subnet IDs based on the output of the CloudFormation stack launched in step 2.
Create the EKS cluster by running the following command:
$ cd ../eks
$ eksctl create cluster -f ./cluster.yaml
# Edit svm_secret.yaml in the repo and substitute "SVMPassword"
$ kubectl apply -f svm_secret.yaml
secret/backend-fsx-ontap-nas-secret created
$ kubectl get secrets -n trident |grep backend-fsx-ontap-nas
backend-fsx-ontap-nas-secret Opaque 2 30s
(2) Create the Trident backend
Change directory to the eks folder of your cloned repo and note the backend-ontap-nas.yaml file. Replace the managementLIF and dataLIF with the correct details and save the file. (Refer to the Trident documentation for more details when considering which one to use based on your application.)
Note: ManagementLIF can be found using the Amazon FSx console, as demonstrated in the following image, highlighted as Management DNS name.
Parameter | Description | Remarks |
backendName | Custom name for the storage backend | |
managementLIF | IP address or FQDN of a cluster or SVM management LIF | |
dataLIF | IP address of protocol LIF | Can be omitted when the ontap-san driver is chosen for the backend. |
svm | Storage virtual machine to use | |
Make sure that the status of the Trident backend configuration deployed is “Success.”
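The status check can also be scripted. The sketch below assumes the backend was created as a TridentBackendConfig resource in the trident namespace and that its status carries a lastOperationStatus field, which is our reading of the Trident custom resource. The helper name backend_status and the resource name in the usage comment are placeholders.

```shell
# Hypothetical helper: print the last operation status of a Trident backend.
# Arguments: TBC_NAME [NAMESPACE]  (namespace defaults to "trident")
# Assumption: the backend is a TridentBackendConfig custom resource.
backend_status() {
  kubectl get tridentbackendconfig "$1" \
    -n "${2:-trident}" \
    -o jsonpath='{.status.lastOperationStatus}'
}

# Usage (replace the placeholder with the metadata.name from backend-ontap-nas.yaml):
#   [ "$(backend_status <backend-config-name>)" = "Success" ] && echo "backend is ready"
```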
(3) Create storage class
The storage class YAML manifest is located in storage-class-csi-nas.yaml.
(4) Create persistent volume claim
The persistent volume claim manifest is located in pvc-trident.yaml.
Verify that the persistent volume is created successfully and the PersistentVolumeClaim status is “Bound.”
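Provisioning can take a few seconds, so a small polling loop is one way to wait for the “Bound” status. This is a sketch only: pvc_phase and wait_for_bound are hypothetical helpers, and the claim name and namespace must be taken from pvc-trident.yaml.

```shell
# Hypothetical helper: print the phase of a PersistentVolumeClaim.
# Arguments: CLAIM_NAME NAMESPACE
pvc_phase() {
  kubectl get pvc "$1" -n "$2" -o jsonpath='{.status.phase}'
}

# Hypothetical helper: poll every 2 seconds until the claim is Bound.
# Arguments: CLAIM_NAME NAMESPACE [MAX_TRIES]  (defaults to 30 tries)
# Returns 0 once the claim is Bound, 1 if it never reaches that phase.
wait_for_bound() {
  name="$1"; ns="$2"; tries="${3:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$(pvc_phase "$name" "$ns")" = "Bound" ]; then
      return 0
    fi
    sleep 2
    i=$((i + 1))
  done
  return 1
}

# Usage (placeholders for the values in pvc-trident.yaml):
#   wait_for_bound <claim-name> <namespace> || echo "claim is not Bound yet"
```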
Navigate back to the FSxONTAP console, select Volumes for your file system, and confirm that the corresponding volume has been created:
Now we have finished configuring the Trident Operator and verified that it enables us to provision a Kubernetes PersistentVolumeClaim successfully. In the next section, we will deploy a stateful application that runs on Amazon EKS and has its PersistentVolume provisioned by Trident.
7. Deploy the stateful application
We now deploy a highly available MySQL cluster onto the Kubernetes cluster using a Kubernetes StatefulSet. A StatefulSet ensures that the original PersistentVolume is mounted on the same pod identity when the pod is rescheduled, which retains data integrity and consistency.
Here, we use a Kubernetes ConfigMap to decouple configuration from the pods. In this example, we apply a ConfigMap named mysql. When the primary and secondary pods are deployed, they read the corresponding configurations.
# Create a namespace where MySQL runs
kubectl create namespace mysql
# Create the ConfigMap for MySQL
kubectl create -f mysql-configmap.yaml -n mysql
Kubernetes Service defines a logical set of pods and a policy by which to access them. StatefulSet currently requires a headless service to control the domain of its pods, directly reaching each pod with stable DNS entries. By specifying “None” for the clusterIP, you can create a headless service.
#Create mysql headless service
kubectl create -f ./mysql/mysql-services.yaml
#Verify mysql headless service is created successfully.
kubectl get service -n mysql
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mysql ClusterIP None <none> 3306/TCP 7h48m
Next, we need to deploy the StatefulSet for MySQL. The mysql pod contains two init containers (init-mysql and clone-mysql) and two app containers (mysql and xtrabackup), and the pod is bound to the persistent volume provided by FSxONTAP via the Trident CSI driver through its PersistentVolumeClaim.
We can confirm that “data-mysql-0” and “data-mysql-1” have persistent volumes mounted.
Note the mapping between the pod’s claim and its PersistentVolume:
data-mysql-0 → pvc-fca2676b-8024-474c-9d8f-041e3f5f307b
8. Failing over MySQL pod on Kubernetes
In this step, we demonstrate how the same pod name gets rescheduled onto another K8S worker node, is recreated, and has the original persistent volume mounted to ensure data consistency.
Populating sample data
Let’s quickly populate the database with some sample data. For that, we will spin up a container that connects with the MySQL primary node to insert the data:
kubectl -n mysql run mysql-client --image=mysql:5.7 -i --rm --restart=Never -- \
mysql -h mysql-0.mysql <<EOF
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello, from mysql-client');
EOF
And we can run the following to test that the follower node mysql-1 received the data successfully.
kubectl -n mysql run mysql-client --image=mysql:5.7 -it --rm --restart=Never -- mysql -h mysql-1.mysql -e "SELECT * FROM test.messages"
+--------------------------+
| message |
+--------------------------+
| hello, from mysql-client |
+--------------------------+
pod "mysql-client" deleted
Simulating node failure
Now, let’s simulate the node failure by cordoning off the node on which MySQL is running.
# Check the pods distribution in worker nodes.
kubectl get pod -n mysql -o wide -l app=mysql
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mysql-0 2/2 Running 0 56m 10.0.1.177 ip-10-0-1-110.ap-southeast-2.compute.internal <none> <none>
mysql-1 2/2 Running 0 58m 10.0.0.152 ip-10-0-0-187.ap-southeast-2.compute.internal <none> <none>
# Cordon the worker node that the mysql-0 pod runs on
kubectl cordon ip-10-0-1-110.ap-southeast-2.compute.internal
# Check node status
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-187.ap-southeast-2.compute.internal Ready <none> 26h v1.21.5-eks-bc4871b
ip-10-0-1-110.ap-southeast-2.compute.internal Ready,SchedulingDisabled <none> 26h v1.21.5-eks-bc4871b
ip-10-0-1-131.ap-southeast-2.compute.internal Ready <none> 26h v1.21.5-eks-bc4871b
Next, let’s go ahead and delete the MySQL pod.
kubectl delete pod mysql-0 -n mysql
pod "mysql-0" deleted
To maintain the number of replicas of the StatefulSet, Kubernetes reschedules the pod onto another EKS worker node, which can reside in another Availability Zone. We can verify that the mysql-0 pod has been rescheduled to another worker node.
kubectl get pods -n mysql -l app=mysql -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mysql-0 2/2 Running 0 42s 10.0.1.56 ip-10-0-1-131.ap-southeast-2.compute.internal <none> <none>
mysql-1 2/2 Running 0 15m 10.0.0.152 ip-10-0-0-187.ap-southeast-2.compute.internal <none> <none>
We can also confirm that the original PersistentVolume (pvc-fca2676b-8024-474c-9d8f-041e3f5f307b) has been remounted to the mysql-0 pod:
sh-4.2# kubectl get pvc -n mysql
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-mysql-0 Bound pvc-87166a6d-0395-4c88-8818-ec27b7f444cb 5Gi RWO basic-csi 24h
data-mysql-1 Bound pvc-648238a3-0b96-4859-98da-7f09f3af254a 5Gi RWO basic-csi 24h
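One way to make this check explicit is to record the volume name bound to the claim before the failover and compare it afterwards. The sketch below uses a hypothetical helper, pv_for_claim, that reads the claim’s spec.volumeName field.

```shell
# Hypothetical helper: print the PersistentVolume name bound to a claim.
# Arguments: CLAIM_NAME NAMESPACE
pv_for_claim() {
  kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.volumeName}'
}

# Usage:
#   before="$(pv_for_claim data-mysql-0 mysql)"   # run before deleting the pod
#   ... cordon the node and delete the mysql-0 pod ...
#   after="$(pv_for_claim data-mysql-0 mysql)"    # run after the pod is Running again
#   [ "$before" = "$after" ] && echo "mysql-0 kept PersistentVolume $after"
```

Because the claim data-mysql-0 belongs to the pod identity mysql-0 rather than to a node, the two values are expected to match regardless of which worker node the pod lands on.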
Finally, let’s verify that the data we created in the database still persists after the pod is rescheduled onto another worker node.
kubectl -n mysql run mysql-client --image=mysql:5.7 -it --rm --restart=Never -- mysql -h mysql-0.mysql -e "SELECT * FROM test.messages"
+--------------------------+
| message |
+--------------------------+
| hello, from mysql-client |
+--------------------------+
pod "mysql-client" deleted
9. Performance test with FIO and IOping
In this section, we look at two of the most important storage performance parameters, IOPS and latency, to measure the performance of the FSx NetApp ONTAP file system provisioned by the Trident CSI driver. We use FIO (Flexible I/O), a popular storage benchmarking tool, and IOping, a tool to monitor I/O latency in real time, to test the performance of the FSx NetApp ONTAP volume from the EKS pod.
9.1 EKS pod and FSx NetApp ONTAP in the same Availability Zone.
As the FSx NetApp ONTAP file system’s preferred subnet is in ap-southeast-2a, in this test we deploy the EKS pod in the same Availability Zone to check the performance data.
In step 6, the storage class trident-csi was created.
$ kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
trident-csi csi.trident.netapp.io Retain Immediate true 11d
gp2 (default) kubernetes.io/aws-ebs Delete WaitForFirstConsumer false 47d
(1) Change to the directory where pod_performance_same_AZ.yaml resides
cd mltiaz-fsxontap-eks/eks
(2) Deploy the yaml file to provision the pod and the 10 GB storage on FSx NetApp ONTAP
$ kubectl apply -f pod_performance_same_AZ.yaml
(3) Log in to the container and perform FIO and IOping testing
# Log in to the testing container
$ kubectl exec -it task-pv-pod -- /bin/bash
# Install FIO and IOping
root@task-pv-pod:/# apt-get update
root@task-pv-pod:/# apt-get install fio ioping -y
# Go to the mounted storage on /usr/share/trident-nas/
root@task-pv-pod:/# cd /usr/share/trident-nas/
# Run FIO command, writing 8GB
root@task-pv-pod:/usr/share/trident-nas/# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
Starting 1 process
.....
# Use IOping to test the latency
root@task-pv-pod:/usr/share/trident-nas# ioping -c 100 .
4 KiB <<< . (nfs4 svm-0518719800dfbb809.fs-04a7fcc8e8b5ac4f2.fsx.ap-southeast-2.amazonaws.com:/trident_pvc_f776bbf1_b0af_450b_95e7_b295a9337fd2 10 GiB): request=1 time=185.4 us (warmup)
....
# Exit the container and delete the Pod
root@task-pv-pod:/usr/share/trident-nas/# exit
$ kubectl delete -f pod_performance_same_AZ.yaml
9.2 EKS pod and FSx NetApp ONTAP in different Availability Zones
Next, run the test with the pod in a different Availability Zone from the storage.
(1) Deploy the yaml file to provision the pod and the 10 GB storage on FSx NetApp ONTAP
$ kubectl apply -f pod_performance_different_AZ.yaml
(2) Log in to the container and perform FIO and IOping testing
9.3 Performance Summary
The specific amount of throughput and IOPS that your workload can drive on your FSxONTAP file system depends on the throughput capacity, storage capacity configuration of your file system, and the nature of your workload. In this example, we provisioned 1,024 GB of storage capacity and 512 MB/s of throughput capacity.
The results for the same-AZ and cross-AZ scenarios are as follows:
Scenario | Average IOPS (read) | Average IOPS (write) | Average throughput (read) | Average throughput (write) | Average latency |
Same Availability Zone | 37.5 K | 12.5 K | 154 MB/s | 51.3 MB/s | 483.8 us |
Different Availability Zone | 33.4 K | 11.1 K | 137 MB/s | 45.6 MB/s | 1.03 ms |
As the table above indicates, IOPS is roughly 11% lower when the pod and the storage are in different Availability Zones, while the average latency is around 0.5 ms in the same Availability Zone and about 1 ms across Availability Zones. In both scenarios, the results show that Amazon FSx for NetApp ONTAP on EKS can support low-latency applications, with latency of about 1 ms or less, over 30 K read IOPS, and over 10 K write IOPS.
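The throughput figures in the table can be cross-checked against the IOPS figures, because the FIO job uses a fixed 4 KiB block size (--bs=4k): throughput is approximately IOPS multiplied by 4,096 bytes. The sketch below does that arithmetic with a hypothetical helper, throughput_mb. Small differences from the table are expected because the IOPS values are rounded.

```shell
# Hypothetical helper: convert IOPS at a given block size to MB/s
# (decimal megabytes, 10^6 bytes), printed with one decimal place.
# Arguments: IOPS BLOCK_SIZE_BYTES
throughput_mb() {
  awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.1f\n", iops * bs / 1000000 }'
}

throughput_mb 37500 4096   # same AZ, read       -> 153.6 (table: 154 MB/s)
throughput_mb 12500 4096   # same AZ, write      -> 51.2  (table: 51.3 MB/s)
throughput_mb 33400 4096   # different AZ, read  -> 136.8 (table: 137 MB/s)
throughput_mb 11100 4096   # different AZ, write -> 45.5  (table: 45.6 MB/s)
```

With 4 KiB random I/O, the measured throughput stays well below the 512 MB/s throughput capacity provisioned in this example.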
10. File system failover and failback across Availability Zones
Business-critical production workloads normally require high availability across Availability Zones. Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities.
In this section, you will simulate an AZ failure while continuously running an FIO write operation for 45 minutes against the Multi-AZ FSx for NetApp ONTAP file system. During the process, you can monitor whether the write requests are interrupted during the Availability Zone failover.
In the previous section, you deployed a pod named task-pv-pod. We now log in to that container.
(1) Log in to the container and use FIO to write to the FSx for NetApp ONTAP volume for 45 minutes
kubectl exec -it task-pv-pod -- /bin/bash
Use the FIO command to write continuously for 45 minutes.
cd /usr/share/trident-nas/
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75 --time_based=1 --runtime=2700s
(2) Navigate back to the FSx for NetApp ONTAP console and take note of the network interfaces for both the preferred subnet and the standby subnet
(3) Click the route table that’s associated with the FSx for NetApp ONTAP file system and confirm that traffic is being routed to the network interface for the preferred subnet
(4) Modify the FSx for NetApp ONTAP file system’s throughput capacity to 2048 MB/s to trigger a failover
(5) Check the route table again; traffic is now routed to the network interface for the standby subnet
(6) Review the FIO command output after the Availability Zone failover and failback.
You can see that the FIO tool keeps writing to the file system without any errors. Failover and failback are transparent to the file system write test.
root@task-pv-pod:/usr/share/trident-nas/# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75 --time_based=1 --runtime=2700s
Starting 1 process
fiotest: Laying out IO file (1 file / 8192MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=158MiB/s,w=53.1MiB/s][r=40.4k,w=13.6k IOPS][eta 00m:00s]
fiotest: (groupid=0,jobs=1): err=0: pid=730: Sun Nov 14 15:44:28 2022
.....
(7) Summary
You simulated an FSx for NetApp ONTAP file system failover and failback, and verified that the failover did not cause any I/O errors. This showcases the FSx for NetApp ONTAP Multi-AZ capabilities on Amazon EKS.
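The console steps (3) to (5) above could also be approximated from the AWS CLI. The following is a sketch under stated assumptions: fsx_routes and set_throughput are hypothetical helpers, the route table and file system IDs are placeholders, and the ThroughputCapacity shorthand reflects our understanding of the update-file-system command.

```shell
# Hypothetical helper: list routes that target a network interface,
# printing the destination CIDR and the network interface ID.
# Argument: ROUTE_TABLE_ID
fsx_routes() {
  aws ec2 describe-route-tables \
    --route-table-ids "$1" \
    --query 'RouteTables[0].Routes[?NetworkInterfaceId].[DestinationCidrBlock,NetworkInterfaceId]' \
    --output text
}

# Hypothetical helper: change the throughput capacity (in MB/s) of an
# FSx for NetApp ONTAP file system, which triggers the failover in step (4).
# Arguments: FILE_SYSTEM_ID THROUGHPUT_MBPS
set_throughput() {
  aws fsx update-file-system \
    --file-system-id "$1" \
    --ontap-configuration "ThroughputCapacity=$2"
}

# Usage (placeholders; run fsx_routes before and after the change):
#   fsx_routes <route-table-id>
#   set_throughput <file-system-id> 2048
#   fsx_routes <route-table-id>
```

Comparing the network interface ID printed before and after the change shows whether traffic has moved from the preferred subnet’s interface to the standby subnet’s interface.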
Cleaning up
To avoid unnecessary cost, make sure you clean up the resources that we just created for this demo.
Delete the EKS cluster:
Delete the FSxONTAP file system:
Delete the VPC CloudFormation stack:
Conclusion
This blog post presented a brief introduction to the Amazon FSx for NetApp ONTAP service and illustrated how to use NetApp Trident CSI to provision persistent volumes that span multiple Availability Zones. As demonstrated by the performance test, the MySQL pod failover, and the FSxONTAP file system AZ-level failover in the demo, Amazon FSx for NetApp ONTAP provides high storage performance with sub-millisecond file operation latencies on solid state drive (SSD) storage, along with multi-AZ availability, which makes it a good fit for use cases where AWS customers need to run business-critical stateful applications on Amazon EKS.