AWS Storage Blog

Using Quality of Service in Amazon FSx for NetApp ONTAP

When building file systems in the cloud, one major advantage is that administrators are no longer constrained by a traditional large storage controller. The entry point is much lower, so they can provision an individual file system for each workload. However, customers may occasionally need to combine workloads onto a central file system, for example to simplify administration, reduce cost, or let multiple smaller workloads use the performance of a larger file system for bursting purposes.

In this case, you’ll want to take care that no workload takes more than its fair share of performance and degrades the other workloads on the file system, a problem often referred to as the “noisy neighbor.” Another example is cloned workloads on the same file system, where the production workload is cloned to development and test environments. How do we make sure that developers don’t impact the performance of the production workload on which their clone is based?

Amazon FSx for NetApp ONTAP is a storage service that allows customers to launch and run fully managed ONTAP file systems in the cloud. FSx for ONTAP has built-in Quality of Service (QoS) features that let you set performance limits for objects sharing the file system, making sure that each workload gets the performance it needs. QoS lets customers easily create policies that avoid the noisy neighbor issue while still providing a multi-workload, or even multi-tenant, file system. It can also be used to make sure that a workload never consumes more than a certain amount of resources, even when there is no contention. QoS also offers consistent and predictable performance for applications, restraining unpredictable IO patterns and enabling service providers to offer SLAs around performance.

In this blog post, I walk through three specific use cases for QoS. In each use case I examine the workload, the impact of the workload on file systems, and the effect of QoS when applied to that workload. I also provide you with the commands to implement QoS to alleviate performance impacts, and discuss using them with your own workloads.

Use case 1: The noisy neighbor

Using a QoS policy, you can define a “ceiling,” in terms of either IOPS or throughput, on how fast a given workload can run. You can also enable behavior similar to the burst credit mechanisms of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Block Store (Amazon EBS): a workload builds credits while operating below its limit, which it can then spend to operate above that limit temporarily. You can define performance ceilings on volumes, files, LUNs, SVMs, or qtrees, and multiple workloads can be added to a single policy. Let’s examine the effect of the ceiling in the following figures.

Figure 1 shows an example of three workloads, with Workload 2 monopolizing the available performance of the entire file system. This could be due to the workload receiving priority for being the first one to start, or because it is experiencing high demand. Workload 2’s performance needs result in insufficient performance for Workloads 1 and 3.

Figure 1: Workloads without QoS

In Figure 2, a QoS ceiling is instituted for Workload 2, preventing it from consuming all available file system performance. The QoS ceiling makes sure that even though Workload 2 is taking additional resources, Workloads 1 and 3 don’t suffer.

Figure 2: Workloads with QoS

Let’s look at our test setup on FSx for ONTAP without any QoS policies. We have three EC2 instances, each mounting its own volume on an Amazon FSx for ONTAP file system.

Figure 3: Amazon EC2 instances
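
Each instance mounts its volume over NFS before we generate any load. The following is a minimal sketch of that step, assuming an NFS volume with a junction path of /vol1; the SVM DNS name shown is a placeholder for your file system’s actual endpoint:

# Create the mount point and mount the volume over NFS (placeholder DNS name)
sudo mkdir -p /mnt/fsx/fs1
sudo mount -t nfs svm-example.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com:/vol1 /mnt/fsx/fs1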

Here we will demonstrate the situation in Figure 1 by generating load from each of the three servers against its own mounted 100-GB volume, running the following command:

sudo fio --name=fio --filename=/mnt/fsx/fs1/file1 --size=16GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120000 --numjobs=2 --time_based --group_reporting --name=IOPS-test-job --eta-newline=1

This will create two workers on each server doing random reads/writes with a 4-KB block size.

Figure 4: Before noisy neighbor

As each workload is generating the same level of requests, each workload gets approximately 7K IOPS, as illustrated in the graph above; no workload is taking an unfair share. However, we can change server 1, which is attached to Vol1, to create 32 threads. This replicates the noisy neighbor situation from Figure 1, and Vol1 will monopolize the file system by issuing 16x the IO requests.

sudo fio --name=fio --filename=/mnt/fsx/fs1/file1 --size=16GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=256 --runtime=120000 --numjobs=32 --time_based --group_reporting --name=IOPS-test-job --eta-newline=1

Figure 5 shows how we changed server 1 to create 32 threads, while leaving servers 2 and 3 the same. After the initial ramp up, we see the performance of both servers 2 and 3 drop significantly around 20:30. Although their workload profile has remained the same, they are now experiencing the noisy neighbor effect from server 1, which is effectively taking all of the file system’s performance and leaving servers 2 and 3 with performance issues.

Figure 5: After noisy neighbor

Let’s apply ceiling QoS to these workloads using this ONTAP CLI command:

qos policy-group create policy1 -vserver svm01 -max-throughput 10000IOPS,300MB/s -is-shared false

In this example, we created a non-shared policy group using:

-is-shared false

This means that any object we attach to that policy is limited individually. In contrast, when using a shared group, all of the workloads placed into the policy share the throughput specified by the policy (that is, together they can’t exceed the ceiling).

We can also see a summary of policies with the following command:

qos policy-group show

::> qos policy-group show -policy-group policy1

              Policy Group Name: policy1
                        Vserver: svm01
                           Uuid: 63ab2b0a-bc09-11ec-81f5-abba8ceb7752
             Policy Group Class: user-defined
                Policy Group ID: 1664
             Maximum Throughput: 10000IOPS,300MB/s
             Minimum Throughput: 0
            Number of Workloads: 0
              Throughput Policy: 0-10000IOPS,300MB/s
                      Is Shared: false
       Is Policy Auto Generated: -

The preceding command creates the policy inside svm01 with a maximum throughput of 10000 IOPS and 300 MB/s, not shared. We have yet to apply this policy to any objects. To do this, we run the following command to set it on a volume; the policy can also be applied to a file, a LUN, or a qtree:

volume modify -volume <volumename> -qos-policy-group policy1
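
For example, to attach the same policy to a LUN instead of a volume, a command along the following lines should work (a sketch; the LUN path shown is hypothetical):

lun modify -vserver svm01 -path /vol/vol1/lun1 -qos-policy-group policy1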

To remove this policy, we simply set the policy back to none using this command:

volume modify -volume <volumename> -qos-policy-group none

We can even adjust the IOPS and throughput in a policy that is currently attached:

qos policy-group modify policy1 -max-throughput 1000IOPS,200MB/s

Let’s attach this to Workload 1 to limit the IOPS to 1000:

volume modify -volume vol1 -vserver svm01 -qos-policy-group policy1

View the QoS statistics with the following command:

qos statistics volume performance show
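
If you want to watch latency alongside IOPS and throughput, ONTAP also provides a per-volume latency view:

qos statistics volume latency show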

In Figure 6, you can see Vol1 has now been limited to 1000 IOPS. We originally set policy1 to 10000 IOPS, but then used the modify command to lower it to 1000 IOPS. Vol2 and Vol3 returned to around 7000 IOPS, as originally seen in Figure 4 before the noisy neighbor.

Figure 6: QoS statistics

Let’s adjust the policy while it’s attached with the modify command:

qos policy-group modify policy1 -max-throughput 2000IOPS,200MB/s

Within a few seconds, we can see the impact:

Figure 7: QoS statistics after increase

By applying a QoS policy to the “noisy neighbor” volume, we stopped it from impacting the other workloads on the file system. This makes sure that they continue to get the performance they need.

Use case 2: Development or clone workloads

A useful feature of FSx for ONTAP is workload cloning. NetApp FlexClone lets us give developers a space-efficient copy of the production workload, one that initially consumes no additional storage, for them to test against. However, when you have 30 developers each creating a new clone (clones don’t inherit limits set on the parent), how do you make sure that their combined testing doesn’t impact production? Or limit the total amount of IOPS available to a specific group of objects?
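
As an illustration, a developer clone of the production volume can be created with a command like the following (a sketch; the clone name is hypothetical):

volume clone create -vserver svm01 -flexclone vol1_dev01 -parent-volume vol1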

Let’s take our original non-shared policy and apply it to Vol2:

volume modify -volume vol2 -vserver svm01 -qos-policy-group policy1

Figure 8: QoS statistics after applying non-shared policy to Vol2

We can see that both Vol1 and Vol2 can now consume 2000 IOPS each. Going back to the example, we may have 30 developers and 30 volumes that together could overtake our other workloads.

We must create a new policy, as the shared value can’t be modified on an existing policy. However, we can then change the volumes to point to this new policy.

qos policy-group create policy1-shared -vserver svm01 -max-throughput 3000IOPS,200MB/s -is-shared true
volume modify -volume vol1 -vserver svm01 -qos-policy-group policy1-shared
volume modify -volume vol2 -vserver svm01 -qos-policy-group policy1-shared

Figure 9: QoS statistics after applying the shared policy to Vol1 and Vol2

This has now limited Vol1 and Vol2 to 3000 IOPS shared between the two volumes. If we placed another workload into the policy, then the IOPS would be shared with that workload, too. This lets us create a group of volumes that can burst, within the group, up to the group’s set limit as long as the other workloads in the group aren’t using those resources, and it keeps the group of workloads from affecting other workloads. A bulk way to attach many clones is shown below.
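
If the developer clones follow a naming convention, the ONTAP CLI’s wildcard matching means you don’t have to modify them one at a time. Something like the following should place them all into the shared policy in one command (a sketch, assuming clone names such as vol1_dev01 through vol1_dev30):

volume modify -volume vol1_dev* -vserver svm01 -qos-policy-group policy1-shared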

Use case 3: Storage performance tiers

Traditional storage arrays often have multiple storage performance tiers, and we can use QoS to create additional performance tiers on FSx for ONTAP. The question you may be asking is “Why would we want to?” When an application’s storage requirements change, its performance requirements often change with them. Furthermore, central service providers usually offer tiers of storage defined by an IOPS-per-TB figure. You may also think, “Why would I want to limit performance even when there is no contention?” or “If the resource is free, why not use it?” First, consider a central storage provider who sells or cross-charges specific classes of storage with specific performance profiles; limits make sure that each tier delivers exactly what is specified. Another example is a new deployment. Without limits on the performance of specific tiers, the first workload becomes used to having the entire performance of the file system. As additional workloads are added, contention occurs and the first workload drops below the baseline it had when the system was empty.

Up to this point, we have been looking at static QoS: if the workload’s size increases or decreases, the QoS value remains the same. Adaptive QoS, however, adjusts the value based on the size of the object to which it is applied, calculated from either the allocated or the used space. With adaptive QoS, we also set a block size so that we can limit throughput if necessary. Adaptive QoS can’t be applied at the SVM level, but it can be applied at the volume, file, LUN, or FlexGroup level.

In the following adaptive QoS command, we set a baseline (expected) IOPS and a peak IOPS that can be reached if the volume has generated credits by running below its baseline. You can also define a block size, which, in combination with an IOPS limit, caps throughput. For example, even if a client is using a 64-KB block size, a policy block size of 4 KB combined with a limit of 1,000 IOPS per TB yields 1,000 IOPS * 4 KB = 4 MB/s of throughput per TB. Finally, we can base the IOPS on either the allocated or the used space.

qos adaptive-policy-group create -policy-group adaptive-policy2 -vserver svm01 -expected-IOPS 1000 -block-size 4k -expected-IOPS-allocation allocated-space -peak-IOPS 1500 -peak-IOPS-allocation allocated-space
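
You can review the adaptive policy groups on the file system with the following command; Figure 10 shows the resulting policies:

qos adaptive-policy-group show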

Figure 10: Adaptive QoS policies

We can assign an adaptive QoS policy similarly to the static QoS policies:

volume modify -volume <volumename> -qos-adaptive-policy-group adaptive-policy2

To remove these, we simply set the policy back to none:

volume modify -volume <volumename> -qos-adaptive-policy-group none

Now that we have an adaptive QoS policy, let’s assign it to Vol2 using the following command:

vol modify -vserver svm01 -volume vol2 -qos-adaptive-policy-group adaptive-policy2

Note that it can take up to five minutes for adaptive QoS policies to take effect. As the volume is 100 GB in size and the peak IOPS is 1500 per TB, the volume will be limited to 150 IOPS:

Figure 11: Adaptive QoS policy on Vol2 at 100GB

Changing Vol2 to 10 TB would increase the maximum to a 15000 IOPS peak and a 10000 IOPS baseline. This is above what the file system can accommodate (as seen in Figure 11), due to the load being generated by Vol1 and Vol3. However, we can see that Vol2 still never exceeds 15000 IOPS (unlike Vol3, which has no QoS policy attached).

Figure 12: Adaptive QoS policy on Vol2 at 10TB

In summary, we have examined both the static and adaptive QoS features of FSx for ONTAP, and we have applied ceilings, based on either IOPS or throughput, to workloads. Adaptive QoS adjusts dynamically with the size of the workload, whereas static QoS is fixed. Which is most suitable for you will depend on the situation: adaptive QoS fits the storage class model, while static QoS is useful for making sure that specific workloads don’t monopolize the file system.

Conclusion

In this post, I covered the benefits of using QoS with FSx for NetApp ONTAP. By using QoS, you can deploy and manage multiple workloads on a single file system while still providing a consistent experience for your workloads. Although there are many use cases for QoS, we specifically covered three examples, namely:

  1. Where an individual workload is causing performance issues for other workloads.
  2. Where a group of workloads should have the performance they can request limited so that they do not impact others, for example test/development environments or consolidated workloads.
  3. Creation of different performance tiers that can be used by service providers or central teams to cross-charge specific classes of storage.

For more information, refer to the FSx for ONTAP User Guide.

Thank you for reading this post. For further information about FSx for NetApp ONTAP, visit the product page.

James Kuhlke

James Kuhlke is a Principal Solutions Architect in Strategic Accounts at AWS. Based in the UK, he has over 20 years’ experience helping customers transform their IT operations. Outside of work, James enjoys motorsport and traveling.