How to run a Multi-AZ stateful application on Amazon EKS with Amazon FSx for NetApp ONTAP

Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed service that makes it easy to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. Organizations often run a mix of stateless and stateful applications on a Kubernetes cluster. For stateful applications, there is often a trade-off between performance and availability of the external storage. Organizations want their applications to be highly available (available in multiple Availability Zones) and at the same time have sub-millisecond latency and high IOPS.

In this blog post, we look at Amazon FSx for NetApp ONTAP and explore its read/write latency and IOPS as a persistence layer for workloads on Amazon EKS. We demonstrate a sample stateful application on Amazon EKS using NetApp’s Trident Container Storage Interface (CSI) driver. The CSI driver allows Amazon EKS clusters to manage the lifecycle of storage volumes powered by NetApp ONTAP file systems.

Solution overview

The infrastructure for this solution comprises an Amazon EKS cluster with three EC2 worker nodes and an FSxONTAP file system that spans multiple Availability Zones. The three worker nodes and the FSxONTAP file system sit in the private subnets of the VPC. We walk through how to use NetApp’s Trident Container Storage Interface (CSI) driver to create storage volumes powered by FSxONTAP for a MySQL database running on an Amazon EKS cluster. The following high-level architecture diagram illustrates the environment:

[Figure: high-level architecture diagram]

What is Amazon FSx for NetApp ONTAP?

Amazon FSx for NetApp ONTAP is a fully managed service that provides highly reliable, scalable, performant, and feature-rich file storage built on NetApp’s popular ONTAP file system. It provides the familiar features, performance, capabilities, multi-protocol access (iSCSI/NFS/SMB), and APIs of NetApp file systems with the agility, scalability, and simplicity of a fully managed AWS service.

For details on how Amazon FSx for NetApp ONTAP works, refer to: https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/how-it-works-fsx-ontap.html

Solution walkthrough

Here are the major steps to complete the deployment:

  1. Clone the code from the GitHub repo.
  2. Create a VPC environment in your AWS account using AWS CloudFormation (optional).
  3. Create the FSxONTAP file system using AWS CloudFormation.
  4. Use eksctl to create an Amazon EKS cluster.
  5. Create FSxONTAP volumes as the storage layer for a sample application.
  6. Test FSxONTAP failover.
  7. Use FIO for running performance tests on FSxONTAP from within a K8S pod on Amazon EKS.

Prerequisites

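This walkthrough uses the following tools, which should be installed and configured on your workstation along with access to an AWS account: git, the AWS CLI, eksctl, kubectl, and Helm.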

1. Clone the GitHub repository

You can find the CloudFormation templates and relevant code in this GitHub repo. Run the following command to clone the repository to your local workstation.

git clone https://github.com/aws-samples/mltiaz-fsxontap-eks.git

The following steps reference two folders: the “eks” folder contains all manifest files related to the EKS cluster resources, and the “FSxONTAP” folder contains the CloudFormation templates for spinning up the VPC environment and the FSxONTAP file system.

2. Create a VPC environment for Amazon EKS and FSxONTAP (Optional)

Create a new VPC with two private subnets and two public subnets using CloudFormation. This step is optional, and an existing VPC can be reused for the Amazon EKS cluster and the FSxONTAP file system.

Launch the CloudFormation stack to set up the network environment for both the FSxONTAP file system and the EKS cluster:

$ cd mltiaz-fsxontap-eks/FSxONTAP
$ aws cloudformation create-stack --stack-name EKS-FSXONTAP-VPC --template-body file://./vpc-subnets.yaml --region <region-name>

[Screenshot: CloudFormation stack Outputs tab with columns labeled Key, Value, Description, and Export name]

Once the stack has been deployed successfully, take note of the IDs for PrivateSubnet1, PrivateSubnet2, VPCId, and PrivateRouteTable1, as we will need them in the following steps when creating both the EKS cluster and FSx ONTAP file system.

3. Create an Amazon FSx for NetApp ONTAP file system

Run the following CLI command to create the Amazon FSx for NetApp ONTAP file system. (Note that you need to modify the parameters to match the VPC environment created in step 2.)

$ aws cloudformation create-stack \
  --stack-name EKS-FSXONTAP \
  --template-body file://./FSxONTAP.yaml \
  --region <region-name> \
  --parameters \
  ParameterKey=Subnet1ID,ParameterValue=[your_preferred_subnet1] \
  ParameterKey=Subnet2ID,ParameterValue=[your_preferred_subnet2] \
  ParameterKey=myVpc,ParameterValue=[your_VPC] \
  ParameterKey=FSxONTAPRouteTable,ParameterValue=[your_routetable] \
  ParameterKey=FileSystemName,ParameterValue=EKS-myFSxONTAP \
  ParameterKey=ThroughputCapacity,ParameterValue=512 \
  ParameterKey=FSxAllowedCIDR,ParameterValue=[your_allowed_CIDR] \
  ParameterKey=FsxAdminPassword,ParameterValue=[Define password] \
  ParameterKey=SvmAdminPassword,ParameterValue=[Define password] \
  --capabilities CAPABILITY_NAMED_IAM  

This CloudFormation stack will take some time to complete; feel free to move to step 4 while waiting for the file system to be deployed.
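If you would rather have the terminal block until the stack is ready, the AWS CLI provides a waiter for this. The following is a minimal sketch, assuming the stack name EKS-FSXONTAP used above; the function name wait_for_stack is a hypothetical helper and is not part of the repo.

```shell
# Hypothetical helper (not part of the repo): block until a CloudFormation
# stack has been created, then print its outputs as a table.
wait_for_stack() {
  stack_name="$1"
  region="$2"
  # 'wait stack-create-complete' polls the stack status and returns
  # non-zero if creation fails or the waiter times out.
  aws cloudformation wait stack-create-complete \
    --stack-name "$stack_name" --region "$region" || return 1
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" --region "$region" \
    --query "Stacks[0].Outputs" --output table
}

# Example: wait_for_stack EKS-FSXONTAP <region-name>
```

The same helper can be pointed at the EKS-FSXONTAP-VPC stack from step 2 to list the subnet, VPC, and route table IDs.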

After the deployment completes, verify in the Amazon FSx console that the FSx for NetApp ONTAP file system and the Storage Virtual Machine (SVM) have been created.

In the details of the FSxONTAP file system, we can see that the file system has a primary subnet and a standby subnet.

The SVM is also created.

4. Create an Amazon EKS cluster

In this walkthrough, we are going to create the EKS cluster with a managed node group that contains three worker nodes residing across the two private subnets created in step 2. In the cluster.yaml file, substitute the VPC ID and subnet IDs based on the output of the CloudFormation stack launched in step 2.

Create the EKS cluster by running the following command:

$ cd ../eks
$ eksctl create cluster -f ./cluster.yaml
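The repo's cluster.yaml is not reproduced in this post. As a rough illustration of what an eksctl configuration for this topology can look like, here is a minimal sketch. The file name cluster-sketch.yaml, the node group name, the instance type, and the second Availability Zone (ap-southeast-2b) are assumptions; the cluster name and region are taken from the clean-up commands later in this post, and the IDs are placeholders.

```shell
# Write an illustrative eksctl config. All IDs are placeholders to replace
# with the outputs of the VPC stack from step 2.
cat > cluster-sketch.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: FSxONTAP-eks              # cluster name used in the clean-up step
  region: ap-southeast-2
vpc:
  id: "vpc-REPLACE_ME"            # VPCId
  subnets:
    private:
      ap-southeast-2a:
        id: "subnet-REPLACE_ME_1" # PrivateSubnet1
      ap-southeast-2b:            # assumption: AZ of the second subnet
        id: "subnet-REPLACE_ME_2" # PrivateSubnet2
managedNodeGroups:
  - name: workers                 # hypothetical node group name
    instanceType: m5.large        # assumption: pick a type that suits you
    desiredCapacity: 3            # three worker nodes, as in this post
    privateNetworking: true       # keep the nodes in the private subnets
EOF
```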

5. Deploy the Trident Operator

There are three ways to deploy the Trident CSI driver: Helm, the Trident Operator, and tridentctl. In this blog post, we use Helm to deploy the driver (these instructions were tested with version 21.07; the commands below download version 21.10.1).

(1) Create a namespace called “trident”

$ kubectl create ns trident
namespace/trident created

(2) Download the installer bundle

Download the Trident installer bundle from the Trident GitHub page. The commands below use version 21.10.1; substitute the latest version if you prefer. The installer bundle includes the Helm chart in the /helm directory.

# Download the file
$ curl -L -o trident-installer-21.10.1.tar.gz https://github.com/NetApp/trident/releases/download/v21.10.1/trident-installer-21.10.1.tar.gz

# Extract the tar file downloaded
$ tar -xvf ./trident-installer-21.10.1.tar.gz

(3) Use the Helm install command and specify a name for your deployment

# Go to the helm folder of the installer extracted in the previous step
$ cd trident-installer/helm

# Install via the helm command to the trident namespace
$ helm install trident -n trident trident-operator-21.10.1.tgz

(4)  Check the status of the Trident Operator

$ helm status trident -n trident
...
NAMESPACE: trident
STATUS: deployed
...

6.  Provision Trident volumes and storage class

(1) Create Kubernetes Secret to store the SVM username and password

Edit svm_secret.yaml from the repo, enter the SVM username and admin password, and save it.

Note: The SVM username and its admin password were created in step 3. If you do not recall them, you can retrieve them from AWS Secrets Manager.

# Edit svm_secret.yaml in the repo and substitute "SVMPassword"
$ kubectl apply -f svm_secret.yaml
secret/backend-fsx-ontap-nas-secret created

$ kubectl get secrets -n trident |grep backend-fsx-ontap-nas
backend-fsx-ontap-nas-secret Opaque 2 30s
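For orientation, a Trident backend credential is an ordinary Kubernetes Secret with a username and a password key, which matches the two data items shown in the output above. The following is a minimal sketch rather than the repo's file: the file name svm_secret-sketch.yaml is hypothetical, and vsadmin is assumed to be the SVM admin user.

```shell
# Write an illustrative Secret manifest for the Trident backend credentials.
cat > svm_secret-sketch.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: backend-fsx-ontap-nas-secret
  namespace: trident
type: Opaque
stringData:
  username: vsadmin       # assumption: the SVM admin user
  password: SVMPassword   # placeholder, replace with the password from step 3
EOF
```

Using stringData lets you write the values in plain text; Kubernetes stores them base64-encoded under data.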

(2) Create the Trident backend

Change directory to the eks folder of your cloned repo and open the backend-ontap-nas.yaml file. Replace the managementLIF and dataLIF values with the details of your file system and save the file. (Refer to the Trident documentation for more details when deciding which driver to use for your application.)

Note: The managementLIF value can be found in the Amazon FSx console, where it is shown as the Management DNS name.

| Parameter | Description | Remarks |
|---|---|---|
| backendName | Custom name for the storage backend | |
| managementLIF | IP address or FQDN of a cluster or SVM management LIF | |
| dataLIF | IP address of protocol LIF | Can be omitted when using the ontap-san driver for the backend. |
| svm | Storage virtual machine to use | |

Make sure that the status of the Trident backend configuration deployed is “Success.”

$ cd eks
$ kubectl apply -f backend-ontap-nas.yaml
tridentbackendconfig.trident.netapp.io/backend-fsx-ontap-nas created
$ kubectl get tbc -n trident
NAME                   BACKEND NAME  BACKEND UUID                          PHASE  STATUS
backend-fsx-ontap-nas  fsx-ontap     6329459a-55e9-4606-881d-f83e34f558db  Bound  Success
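To show how the parameters in the table fit together, here is a minimal sketch of a TridentBackendConfig for the ontap-nas driver. It is not the repo's backend-ontap-nas.yaml: the file name backend-ontap-nas-sketch.yaml is hypothetical, the names are taken from the output above, and the three REPLACE_WITH values are placeholders.

```shell
# Write an illustrative TridentBackendConfig manifest.
cat > backend-ontap-nas-sketch.yaml <<'EOF'
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: backend-fsx-ontap-nas
  namespace: trident
spec:
  version: 1
  backendName: fsx-ontap
  storageDriverName: ontap-nas
  managementLIF: REPLACE_WITH_MANAGEMENT_DNS_NAME
  dataLIF: REPLACE_WITH_NFS_DNS_NAME
  svm: REPLACE_WITH_SVM_NAME
  credentials:
    name: backend-fsx-ontap-nas-secret   # the Secret created in (1)
EOF
```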

(3) Create storage class

The storage class YAML manifest is storage-class-csi-nas.yaml:

$ kubectl apply -f storage-class-csi-nas.yaml
storageclass.storage.k8s.io/trident-csi created

# Check the status of the storage class
$ kubectl get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
trident-csi     csi.trident.netapp.io   Retain          Immediate              true                   42s
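A storage class that would produce the listing above can be sketched as follows. This is an illustration under assumptions, not the repo's manifest: the file name storage-class-sketch.yaml is hypothetical, and the backendType parameter is assumed to select the ontap-nas backend created in the previous step.

```shell
# Write an illustrative StorageClass manifest for the Trident CSI provisioner.
cat > storage-class-sketch.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: trident-csi
provisioner: csi.trident.netapp.io
parameters:
  backendType: "ontap-nas"      # assumption: match the NAS backend
reclaimPolicy: Retain           # keep the volume after the claim is deleted
volumeBindingMode: Immediate
allowVolumeExpansion: true
EOF
```

Note that with reclaimPolicy set to Retain, deleting a claim leaves the underlying volume in place, so volumes have to be removed separately when cleaning up.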

(4) Create persistent volume claim

The persistent volume claim manifest is pvc-trident.yaml.

Verify that the persistent volume is created successfully and the PersistentVolumeClaim status is “Bound.”

$ kubectl create -f pvc-trident.yaml
persistentvolumeclaim/basic created

# Check the status of the persistent volume created
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM           STORAGECLASS   REASON   AGE
pvc-23d6b9f3-17ce-4845-a433-56b27a996435   10Gi       RWX            Retain           Bound    default/basic   trident-csi             25s
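The claim behind this volume can be sketched as below, with the name, size, access mode, and storage class read off the listing above. The file name pvc-sketch.yaml is hypothetical and the manifest is an approximation of what pvc-trident.yaml might contain.

```shell
# Write an illustrative PersistentVolumeClaim manifest.
cat > pvc-sketch.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: basic
spec:
  accessModes:
    - ReadWriteMany             # shown as RWX in 'kubectl get pv'
  resources:
    requests:
      storage: 10Gi
  storageClassName: trident-csi
EOF
```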

In the Amazon FSx console, select the Volumes tab of your file system and confirm that the corresponding volume has been created.

We have now finished configuring the Trident Operator and verified that it can provision a Kubernetes persistent volume claim successfully. In the next section, we deploy a stateful application that runs on Amazon EKS and has its PersistentVolume provisioned by Trident.

7. Deploy the stateful application

We now deploy a highly available MySQL cluster onto the Kubernetes cluster using a Kubernetes StatefulSet. A StatefulSet ensures that the original PersistentVolume is mounted on the same pod identity when the pod is rescheduled, which retains data integrity and consistency.

Here, we use a Kubernetes ConfigMap to separate configuration from pods. In this example, we apply a ConfigMap named mysql. When the primary and secondary pods are deployed, they read the corresponding configurations.

# Create a namespace where MySQL runs
kubectl create namespace mysql

# Create the ConfigMap for MySQL
kubectl create -f mysql-configmap.yaml -n mysql

A Kubernetes Service defines a logical set of pods and a policy by which to access them. A StatefulSet currently requires a headless service to control the domain of its pods, so that each pod can be reached directly through a stable DNS entry. You create a headless service by specifying “None” for the clusterIP.

#Create mysql headless service
kubectl create -f ./mysql/mysql-services.yaml

#Verify mysql headless service is created successfully.
kubectl get service -n mysql
NAME    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
mysql   ClusterIP   None         <none>        3306/TCP   7h48m
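The key setting for a headless service is clusterIP: None. A minimal sketch of such a manifest follows, using the service name, namespace, port, and the app=mysql label seen in this post; the file name mysql-headless-sketch.yaml is hypothetical and the repo's mysql-services.yaml may define more than this.

```shell
# Write an illustrative headless Service manifest for the MySQL StatefulSet.
cat > mysql-headless-sketch.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: mysql
  labels:
    app: mysql
spec:
  clusterIP: None        # headless: no virtual IP, DNS resolves to the pods
  selector:
    app: mysql
  ports:
    - name: mysql
      port: 3306
EOF
```

With this in place, each StatefulSet pod gets a stable DNS name of the form pod-name.service-name, such as mysql-0.mysql, which is how the client commands later in this post reach a specific replica.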

Next, we deploy the StatefulSet for MySQL. Note that the mysql pod contains two init containers (init-mysql and clone-mysql) and two app containers (mysql and xtrabackup), and that the pod is bound through its PersistentVolumeClaim to a persistent volume provided by FSxONTAP via the Trident CSI driver.

# Create the mysql StatefulSet
kubectl create -f ./mysql/mysql-statefulset.yaml

#Verify all the pods spin up successfully. 
kubectl get pod -l app=mysql -n mysql
NAME      READY   STATUS    RESTARTS   AGE
mysql-0   2/2     Running   0          7m11s
mysql-1   2/2     Running   0          6m21s

We can confirm that the claims “data-mysql-0” and “data-mysql-1” are bound to persistent volumes.

$ kubectl get pv -n mysql
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                STORAGECLASS   REASON   AGE
pvc-d71cb057-6ccb-4769-95a0-15301d6db363   30Gi       RWX            Retain           Bound    mysql/data-mysql-1   trident-csi             7m26s
pvc-fca2676b-8024-474c-9d8f-041e3f5f307b   30Gi       RWX            Retain           Bound    mysql/data-mysql-0   trident-csi             8m15s

Note the mapping between the PersistentVolumeClaim of the mysql-0 pod and its PersistentVolume:

data-mysql-0 → pvc-fca2676b-8024-474c-9d8f-041e3f5f307b

8. Failing over MySQL pod on Kubernetes

In this step, we demonstrate how a pod with the same name is rescheduled onto another Kubernetes worker node, recreated, and given the original persistent volume again to ensure data consistency.

Populating sample data

Let’s quickly populate the database with some sample data. For that, we will spin up a container that connects with the MySQL primary node to insert the data:

kubectl -n mysql run mysql-client --image=mysql:5.7 -i --rm --restart=Never -- \
mysql -h mysql-0.mysql <<EOF
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello, from mysql-client');
EOF

And we can run the following to test that the follower node mysql-1 received the data successfully.

kubectl -n mysql run mysql-client --image=mysql:5.7 -it --rm --restart=Never -- mysql -h mysql-1.mysql -e "SELECT * FROM test.messages"

+--------------------------+
| message                  |
+--------------------------+
| hello, from mysql-client |
+--------------------------+
pod "mysql-client" deleted

Simulating node failure

Now, let’s simulate the node failure by cordoning off the node on which MySQL is running.

# Check the pods distribution in worker nodes.
kubectl get pod -n mysql -o wide -l app=mysql
NAME      READY   STATUS    RESTARTS   AGE   IP           NODE                                            NOMINATED NODE   READINESS GATES
mysql-0   2/2     Running   0          56m   10.0.1.177   ip-10-0-1-110.ap-southeast-2.compute.internal   <none>           <none>
mysql-1   2/2     Running   0          58m   10.0.0.152   ip-10-0-0-187.ap-southeast-2.compute.internal   <none>           <none>

# Cordon the worker node where mysql-0 pod runs on
kubectl cordon ip-10-0-1-110.ap-southeast-2.compute.internal

# Check node status
kubectl get nodes
NAME                                            STATUS                     ROLES    AGE   VERSION
ip-10-0-0-187.ap-southeast-2.compute.internal   Ready                      <none>   26h   v1.21.5-eks-bc4871b
ip-10-0-1-110.ap-southeast-2.compute.internal   Ready,SchedulingDisabled   <none>   26h   v1.21.5-eks-bc4871b
ip-10-0-1-131.ap-southeast-2.compute.internal   Ready                      <none>   26h   v1.21.5-eks-bc4871b
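Instead of copying the node name by hand, the lookup and the cordon can be combined. The following is a minimal sketch; cordon_node_of_pod is a hypothetical helper name, and it relies only on the pod's .spec.nodeName field and the kubectl cordon command used above.

```shell
# Hypothetical helper: look up the node that hosts a pod and cordon it.
cordon_node_of_pod() {
  pod="$1"
  namespace="$2"
  # .spec.nodeName holds the name of the node the pod was scheduled onto.
  node="$(kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.spec.nodeName}')" || return 1
  [ -n "$node" ] || return 1
  kubectl cordon "$node"
}

# Example: cordon_node_of_pod mysql-0 mysql
```

Once the failover test is finished, running kubectl uncordon with the node name makes the node schedulable again.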

Next, let’s go ahead and delete the MySQL pod.

kubectl delete pod mysql-0 -n mysql
pod "mysql-0" deleted

To maintain the number of replicas of the StatefulSet, Kubernetes reschedules the pod onto another EKS worker node, which can reside in another Availability Zone. We can verify that the mysql-0 pod has been rescheduled to another worker node.

kubectl get pods -n mysql -l app=mysql -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP           NODE                                            NOMINATED NODE   READINESS GATES
mysql-0   2/2     Running   0          42s   10.0.1.56    ip-10-0-1-131.ap-southeast-2.compute.internal   <none>           <none>
mysql-1   2/2     Running   0          15m   10.0.0.152   ip-10-0-0-187.ap-southeast-2.compute.internal   <none>           <none>

We can also check that the PersistentVolumeClaims of the StatefulSet are still bound after the mysql-0 pod is rescheduled, which means the pod has its original PersistentVolume remounted:

sh-4.2# kubectl get pvc -n mysql
NAME          STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS  AGE
data-mysql-0  Bound   pvc-87166a6d-0395-4c88-8818-ec27b7f444cb  5Gi       RWO           basic-csi     24h
data-mysql-1  Bound   pvc-648238a3-0b96-4859-98da-7f09f3af254a  5Gi       RWO           basic-csi     24h

Finally, let’s verify the data in the database created still persists after the pod is rescheduled onto another worker node.

kubectl -n mysql run mysql-client --image=mysql:5.7 -it --rm --restart=Never -- mysql -h mysql-0.mysql -e "SELECT * FROM test.messages"

+--------------------------+
| message                  |
+--------------------------+
| hello, from mysql-client |
+--------------------------+
pod "mysql-client" deleted

9. Performance test with FIO and IOping

In this section, we look at two of the most important storage performance parameters, IOPS and latency, to measure the performance of the FSx for NetApp ONTAP file system provisioned by the Trident CSI driver. We use FIO (Flexible I/O), a popular storage benchmarking tool, and IOping, a tool to monitor I/O latency in real time, to test the performance of an FSx for NetApp ONTAP volume from an EKS pod.

9.1 EKS pod and FSx for NetApp ONTAP in the same Availability Zone

Because the preferred subnet of the FSx for NetApp ONTAP file system is in ap-southeast-2a, in this test we deploy the EKS pod in the same Availability Zone and collect the performance data.

In step 6, the storage class trident-csi has been created.

$ kubectl get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
trident-csi     csi.trident.netapp.io   Retain          Immediate              true                   11d
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  47d

(1) Change to the directory where pod_performance_same_AZ.yaml resides

cd mltiaz-fsxontap-eks/eks

(2) Deploy the yaml file to provision the pod and the 10 GB storage on FSx NetApp ONTAP

$ kubectl apply -f pod_performance_same_AZ.yaml
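The repo's pod_performance_same_AZ.yaml is not reproduced in this post. One common way to pin a pod to an Availability Zone is a nodeSelector on the well-known topology.kubernetes.io/zone node label. The following is a minimal sketch of that approach; the pod name and mount path come from the commands below, while the file name pod-same-az-sketch.yaml, the claim name, the container name, and the image are assumptions.

```shell
# Write an illustrative manifest: a 10Gi claim on the trident-csi storage
# class and a pod pinned to the Availability Zone of the preferred subnet.
cat > pod-same-az-sketch.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim             # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: trident-csi
---
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  nodeSelector:
    topology.kubernetes.io/zone: ap-southeast-2a   # same AZ as the storage
  containers:
    - name: task-pv-container     # hypothetical container name
      image: nginx                # assumption: a Debian-based image with apt-get
      volumeMounts:
        - name: task-pv-storage
          mountPath: /usr/share/trident-nas
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: task-pv-claim
EOF
```

For the cross-AZ test in section 9.2, the same sketch would apply with the zone value changed to the other Availability Zone.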

(3) Log in to the container and perform FIO and IOping testing


# Log in to the testing container
$ kubectl exec -it task-pv-pod -- /bin/bash

# Install FIO and IOping
root@task-pv-pod:/# apt-get update
root@task-pv-pod:/# apt-get install fio ioping -y
# Go to the mounted storage on /usr/share/trident-nas/
root@task-pv-pod:/# cd /usr/share/trident-nas/

# Run FIO command, writing 8GB
root@task-pv-pod:/usr/share/trident-nas/# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
Starting 1 process
.....

# Use IOping to test the latency
root@task-pv-pod:/usr/share/trident-nas# ioping -c 100 .
4 KiB <<< . (nfs4 svm-0518719800dfbb809.fs-04a7fcc8e8b5ac4f2.fsx.ap-southeast-2.amazonaws.com:/trident_pvc_f776bbf1_b0af_450b_95e7_b295a9337fd2 10 GiB): request=1 time=185.4 us (warmup)
....

# Exit the container and delete the Pod
root@task-pv-pod:/usr/share/trident-nas/# exit
$ kubectl delete -f pod_performance_same_AZ.yaml
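The long FIO command line can also be kept in a job file, which makes it easier to rerun the identical workload in both Availability Zone scenarios. The following sketch writes a job file with the same options as the command above; the file name fiotest.fio is an assumption.

```shell
# Write a fio job file equivalent to the command-line options used above:
# 4k random I/O, 75% reads and 25% writes, queue depth 64, on an 8G file.
cat > fiotest.fio <<'EOF'
[fiotest]
randrepeat=1
ioengine=libaio
direct=1
gtod_reduce=1
filename=testfio
bs=4k
iodepth=64
size=8G
readwrite=randrw
rwmixread=75
EOF

# Inside the pod, from /usr/share/trident-nas/, run it with: fio fiotest.fio
```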

9.2 EKS pod and FSx for NetApp ONTAP in different Availability Zones

Now let’s run the same test with the pod in a different Availability Zone from the storage.

(1) Deploy the yaml file to provision the pod and the 10 GB storage on FSx NetApp ONTAP

$ kubectl apply -f pod_performance_different_AZ.yaml

(2) Log in to the container and perform FIO and IOping testing

# Log in to the testing container
$ kubectl exec -it task-pv-pod -- /bin/bash
# Install FIO and IOping
root@task-pv-pod:/# apt-get update
root@task-pv-pod:/# apt-get install fio ioping -y

# Go to the mounted storage on /usr/share/trident-nas/
root@task-pv-pod:/# cd /usr/share/trident-nas/

# Run FIO command, writing 8GB
root@task-pv-pod:/usr/share/trident-nas# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=fiotest --filename=testfio --bs=4k --iodepth=64 --size=8G --readwrite=randrw --rwmixread=75
fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.25
....
# Use IOping to test the latency
root@task-pv-pod:/usr/share/trident-nas# ioping -c 100 .
4 KiB <<< . (nfs4 svm-0518719800dfbb809.fs-04a7fcc8e8b5ac4f2.fsx.ap-southeast-2.amazonaws.com:/trident_pvc_2ee2111b_15ec_4862_87f9_e457e6f03aa0 10 GiB): request=1 time=732.1 us (warmup)
....

# Exit the container and delete the Pod
root@task-pv-pod:/usr/share/trident-nas/# exit
$ kubectl delete -f pod_performance_different_AZ.yaml

9.3 Performance Summary
The specific amount of throughput and IOPS that your workload can drive on your FSxONTAP file system depends on the throughput capacity and storage capacity configuration of your file system and on the nature of your workload. In this example, we provisioned 1024 GB of storage capacity and 512 MB/s of throughput capacity.

The results for the same-Availability-Zone and different-Availability-Zone scenarios are as follows:

| Scenario | Average IOPS (read) | Average IOPS (write) | Average throughput (read) | Average throughput (write) | Average latency |
|---|---|---|---|---|---|
| Same Availability Zone | 37.5K | 12.5K | 154 MB/s | 51.3 MB/s | 483.8 us |
| Different Availability Zone | 33.4K | 11.1K | 137 MB/s | 45.6 MB/s | 1.03 ms |

As the table above indicates, IOPS is similar in both scenarios, while the average latency is around 0.5 ms when the pod and the storage are in the same Availability Zone and around 1 ms when they are not. In both scenarios, the performance data shows that Amazon FSx for NetApp ONTAP on EKS can support low-latency applications, delivering around 1 ms latency or better with over 30K read IOPS and 10K write IOPS.
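The throughput columns follow from the IOPS columns and the 4 KiB block size used in the FIO test, since throughput is IOPS multiplied by block size. The following sketch reproduces that arithmetic; iops_to_mbps is a hypothetical helper name, and MB/s is taken to mean 10^6 bytes per second.

```shell
# Convert an IOPS figure at a given block size in bytes to MB/s (10^6 bytes/s).
iops_to_mbps() {
  awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.1f\n", iops * bs / 1000000 }'
}

iops_to_mbps 37500 4096   # same AZ, read:       153.6
iops_to_mbps 12500 4096   # same AZ, write:      51.2
iops_to_mbps 33400 4096   # different AZ, read:  136.8
iops_to_mbps 11100 4096   # different AZ, write: 45.5
```

These values agree with the table to within about 0.1 MB/s; the small differences are presumably due to the IOPS figures being rounded.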

Cleaning up

To avoid unnecessary cost, make sure you clean up the resources that we just created for this demo.

Delete the EKS cluster: 

eksctl delete cluster --name=FSxONTAP-eks --region ap-southeast-2

Delete the FSxONTAP file system:

aws cloudformation delete-stack --stack-name EKS-FSXONTAP --region ap-southeast-2

Delete the VPC CloudFormation stack: 

aws cloudformation delete-stack --stack-name EKS-FSXONTAP-VPC --region ap-southeast-2

Conclusion

This blog post presented a brief introduction to Amazon FSx for NetApp ONTAP and illustrated how to use the NetApp Trident CSI driver to provision persistent volumes that span multiple Availability Zones. As demonstrated by the performance test and the MySQL pod failover in the demo, Amazon FSx for NetApp ONTAP provides high storage performance, with sub-millisecond file operation latencies on solid state drive (SSD) storage, together with multi-AZ availability. This makes it a good fit for use cases where AWS customers need to run business-critical stateful applications on Amazon EKS.

Benson Kwong

Benson Kwong is an Enterprise Solutions Architect based in AWS Hong Kong. With 10 years of IT experience and 5 years as a cloud professional, Benson has been working with companies of all sizes and verticals to architect their workloads on AWS, and aspires to help customers adopt new AWS services that bring tangible business benefits. In his spare time, he enjoys spending time with his wife and two kids, playing basketball and snooker.

Haofei Feng

Haofei is a Cloud Architect at AWS with 15+ years of experience in containers, DevOps, and IT infrastructure. He enjoys helping customers with their cloud journey. He is also keen to help his customers design and build scalable, secure, and optimized container workloads on AWS. In his spare time, he spends time with his family and his lovely Border Collie. Haofei is based in Sydney, Australia.