Enhance your upstream workloads with Amazon FSx for NetApp ONTAP

Geological and Geophysical (G&G) workloads in Upstream Energy have different workflows associated with them, including Reservoir Simulation, Subsurface Interpretation, and Drilling and Completions. Due to the diverse performance and client requirements of these workflows, organizations often face a heavy operational burden of copying their data to multiple solutions for different protocols. Until recently, they faced the challenge of developing their own solutions to achieve multi-protocol access, which provides the ability to access the same file at the same time, from either Windows or Linux. This process was not only time-consuming but also added complexity and maintenance overhead.

Now, organizations can take advantage of Amazon FSx for NetApp ONTAP. This fully managed service provides multi-protocol access out of the box, allowing organizations to focus on their core business activities instead of managing complex infrastructure.

In this post, I examine FSx for ONTAP as a fully managed service and look at additional capabilities that are available. These include advanced performance techniques and active monitoring methods. By exploring these aspects, readers will gain deeper insight into leveraging the full potential of FSx for ONTAP.

Solution overview

As seen in Figure 1, an Amazon Elastic Compute Cloud (Amazon EC2) Network-Attached Storage (NAS) instance is a common deployment as organizations begin their journey to AWS. Two pillars of the AWS Well-Architected Framework are Operational Excellence and Reliability. When running EC2 NAS instances, an organization must patch its instances, architect them for resiliency when failover is required, and self-manage its services. All of this introduces operational overhead.

Figure 1: EC2 NAS instance design

FSx for ONTAP, as seen in Figure 2, provides a fully managed file system offering multi-protocol access. You do not need to provision individual instances, architect for failover, or manage upgrades. NetApp ONTAP has had a presence in the Upstream Energy industry for decades. Data is automatically mirrored between the active and standby file servers in the FSx for ONTAP file system. In a Single Availability Zone configuration, the two file servers run as an active-standby pair to provide highly available data access.

Figure 2: Single Availability Zone design

Along with the ability to deploy a Single-AZ solution, Multi-Availability Zone (Multi-AZ) file systems, as seen in Figure 3, are available and provide cross-AZ resiliency. Classic ONTAP technologies, such as SnapMirror and SnapVault, are available for additional disaster recovery and business continuity requirements. In addition to Amazon CloudWatch and AWS CloudTrail for monitoring, tools from the NetApp ecosystem, such as BlueXP and NAbox, can provide operational and telemetry data for the FSx for ONTAP file system.

The FSx for ONTAP service is available with various throughput and IOPS configurations. FSx for ONTAP can be used for an individual project, so that organizations pay only for what they need, or as a persistent file system.

Figure 3: Multi-Availability Zone Design

Optimizing FSx for ONTAP for Upstream Workloads

The I/O pattern of upstream workloads often consists of low-concurrency, sequential reads. For workloads with these behaviors, FSx for ONTAP includes an NVMe read cache to accelerate reads. In addition to the NVMe read cache, you can create FlexGroups.

Creation of a FlexGroup

A FlexGroup is a volume made up of multiple constituent volumes. ONTAP automatically distributes load across the constituents, which benefits some workloads and improves scalability. One advantage of FlexGroups is that they can grow much larger, and contain many more files, than a regular FlexVol. A standard FlexVol has a 100TB limit, whereas FlexGroups have virtually no limit (NetApp recommends a maximum of 20 PB):

FSxId0::> vol create -vserver svm0 -volume flexg -aggr-list aggr1 -size 400T
Notice: The FlexGroup volume "flexg" will be created with the following number of constituents of size 100TB: 4.
Do you want to continue? {y|n}: y

In the preceding example, ONTAP creates four 100TB constituent volumes for the FlexGroup. Four constituent volumes is the ONTAP default, which can be tuned based on workload requirements.
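The sizing in this example can be checked with simple arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that constituents are equally sized and that each one is bound by the 100TB FlexVol limit quoted earlier.

```python
import math

# FlexVol limit quoted in this post, treated here as the per-constituent
# cap. That mapping is an assumption of this sketch.
FLEXVOL_LIMIT_TB = 100


def constituent_size_tb(total_tb: float, constituents: int = 4) -> float:
    """Size of each constituent when the FlexGroup is split evenly."""
    if constituents < 1:
        raise ValueError("a FlexGroup needs at least one constituent")
    return total_tb / constituents


def min_constituents(total_tb: float, limit_tb: float = FLEXVOL_LIMIT_TB) -> int:
    """Fewest equal constituents that keep each one at or under the limit."""
    return max(1, math.ceil(total_tb / limit_tb))


print(constituent_size_tb(400))  # 100.0
print(min_constituents(400))     # 4
```

For the 400T request shown above, both numbers agree with the notice that ONTAP printed: four constituents of 100TB each.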

The Flexible IO Tester (FIO) is a common tool to evaluate and profile storage solutions. Clients exist for Linux and Windows operating systems. To explore the benefits of FlexGroups, I created a file system with 512MB/sec of throughput. Each row in the table in Figure 4 is the average of eight FIO runs. The FG/VOL and NFS Version columns show the volume type and protocol version, respectively.

In G&G workloads, applications have varied block sizes and numerous clients. Furthermore, most ISV applications still require older operating systems, which limit new capabilities such as nconnect for NFS. There are nuances to every workload, and variations should be profiled based on the application and supporting operating system. For Figure 4, I ran a common block size of 64KB, with both random and sequential I/O, from a single client to show these differences. The workload is mixed: 70% reads with 30% concurrent writes. The following table shows the impact of FlexGroups, where nconnect can assist, and where protocol versions introduce varied overhead.

Figure 4: FlexGroup table
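A profile like the one described above (64KB blocks, 70% read and 30% write, random or sequential) can be expressed as an FIO command line. The following Python sketch only builds and prints the command; it does not run FIO. The mount path `/mnt/fsx`, the job name, the I/O depth, the file size, the runtime, and the `libaio` engine (Linux only) are assumptions for illustration and are not the exact settings used for Figure 4.

```python
import shlex


def build_fio_args(directory: str, sequential: bool, block_size: str = "64k",
                   read_pct: int = 70, iodepth: int = 16,
                   runtime_s: int = 60) -> list[str]:
    """Build an fio argument list for a mixed read/write profile."""
    # "rw" is fio's mixed sequential mode; "randrw" is mixed random.
    rw_mode = "rw" if sequential else "randrw"
    return [
        "fio",
        "--name=gg-mixed",            # hypothetical job name
        f"--directory={directory}",   # hypothetical NFS mount of the volume
        f"--rw={rw_mode}",
        f"--rwmixread={read_pct}",    # remainder of the mix is writes
        f"--bs={block_size}",
        f"--iodepth={iodepth}",       # assumed value, tune per workload
        "--ioengine=libaio",          # assumed: Linux client
        "--direct=1",                 # bypass the client page cache
        "--size=1g",                  # assumed test file size
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
    ]


for sequential in (False, True):
    print(shlex.join(build_fio_args("/mnt/fsx", sequential)))
```

Running the same profile against a FlexVol mount and a FlexGroup mount, and with different NFS mount options, gives the kind of side-by-side comparison that the table summarizes.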

One current limitation of FlexGroups is that they are not supported by FSx Backups or AWS Backup. Other techniques, such as SnapMirror, can be used for backup and disaster recovery instead.

SMB-specific optimizations for performance in interpretation applications

Many subsurface applications on Windows read and write large amounts of data for interpretation, analysis, and visualization. To explore some tunings, I started with a Robocopy read of SEGY data from an FSx for ONTAP file system with 2GB/sec of throughput and default SSD IOPS to a g4ad.4xlarge instance. The g4ad.4xlarge provides up to 10Gb networking with burst benefits, which are visible in the following charts. Afterward, I continued with Robocopy writes of SEGY data to understand the implications of I/O direction.

Two options I chose to explore are the SMB Maximum Transmission Unit (MTU) setting on the FSx for ONTAP side, which defaults to large MTU enabled, and client-side bandwidth throttling, which is also enabled by default on the client. As seen in Figure 5, client-side bandwidth throttling, whether enabled or disabled, made no measurable difference to performance in these profiles.

Although this is the simple movement of a specific data type, there were enough performance differences to dive deeper into the impact of SMB MTU size.

Figure 5: SMB Throttle performance table

Large MTU allows SMB to transfer blocks of up to 1MB, and disabling this feature results in a smaller maximum block size of 64KB. Whether to disable it is application specific and should be evaluated based on organizational needs. Interpretation applications often mix small and large file sizes yet have low concurrency for data. Large SMB MTU can impact performance regardless of file size. To explore these block sizes, I used an FIO profile of 70% read and 30% write, with a 64KB block size and an I/O depth of 16, along with a corresponding profile that uses the large MTU block size of 1MB. This was done to have more than a single process running against the file system. As load was added to the file system, the results also favored disabling large MTU for sequential workloads, regardless of block size. For random I/O, the results were mixed. The results are displayed in Figure 6.

For applications that stream large files into memory to visualize results, it is worth testing whether disabling SMB large MTU yields additional performance gains in your application and environment.

Figure 6: SMB Large MTU comparison performance table

Commands to disable SMB large MTU

To disable large MTU, follow these steps:

1. Go into Advanced mode:

FSxId::> set -privilege advanced

Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y

2. Verify that large MTU is currently set to its default of true:

FSxId::*> cifs options show -vserver svm0 -fields is-large-mtu-enabled
vserver is-large-mtu-enabled
------- --------------------
svm0 true

3. Set large MTU to false:

FSxId::*> cifs options modify -vserver svm0 -is-large-mtu-enabled false

4. Verify that large MTU is set to false:

FSxId::*> cifs options show -vserver svm0 -fields is-large-mtu-enabled
vserver is-large-mtu-enabled
------- --------------------
svm0 false

For the changes to take effect, disconnect and reconnect your SMB shares on the Windows hosts.

Active monitoring

Active monitoring means running a workload while watching its real-time statistics. It is often used to understand immediately what is occurring on a file system. Although quality of service (QoS) is primarily a technique to throttle storage objects, in FSx for ONTAP the QoS statistics commands also provide performance monitoring of volumes. The following QoS commands operate at the file system level and are only available with the fsxadmin file system administrator account, not the vsadmin SVM account.

FSxId::> qos statistics workload latency show

The statistics command is useful for active benchmarking and latency analysis. It breaks latency down into components such as Network, Data, and Disk (see the example in Figure 7), which can help pinpoint a potential latency problem. Moreover, it is possible to send that data into a time-series database for graphing purposes.

Figure 7: QOS statistics workload latency output example

Enable refresh display with:

FSxId::> qos statistics workload latency show -refresh-display true

For throughput and aggregate latency, the following command produces results like those in Figure 8:

FSxId::> qos statistics workload performance show

Figure 8: QOS statistics workload performance output example

IOPS, throughput, and latency are all important for assessing overall performance. IOPS is the number of input/output operations per second and is driven by application concurrency. Throughput is the number of bytes transmitted per second and is the product of IOPS and block size, where block size is determined by the individual application. Latency is the amount of time it takes to fulfill a request. It is influenced by the caching and infrastructure associated with the service, and it has a direct impact on the number of IOPS that can be driven. All of these are worth examining closely when exploring performance bottlenecks.
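These relationships can be written down as a short worked example. The following Python sketch is an approximation, not a model of FSx for ONTAP itself: the function names are hypothetical, and the IOPS ceiling uses Little's law (outstanding I/Os divided by latency), which assumes a steady state with a fixed number of requests in flight.

```python
KIB = 1024


def throughput_bytes_per_sec(iops: float, block_size_bytes: int) -> float:
    """Throughput is IOPS multiplied by the block size."""
    return iops * block_size_bytes


def iops_ceiling(outstanding_ios: int, latency_us: float) -> float:
    """Approximate IOPS ceiling from Little's law.

    With a fixed number of requests in flight, each taking latency_us
    microseconds, at most outstanding_ios / latency operations can
    complete per second.
    """
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return outstanding_ios * 1_000_000 / latency_us


# 1,000 IOPS at a 64 KiB block size
print(throughput_bytes_per_sec(1_000, 64 * KIB))  # 65536000

# 16 outstanding I/Os at 500 microseconds each
print(iops_ceiling(16, 500))  # 32000.0
```

The second function shows why latency matters so much for low-concurrency workloads: with few requests in flight, halving latency roughly doubles the IOPS a client can drive.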

With SVM-level credentials, the number of commands at your disposal is more limited. SVMs are isolated virtual file servers, so this restriction limits the level of access each SVM has to the overall file system. Available commands include the statistics command. The performance statistics you can measure from an SVM administrator account are shown in the following example. For this command, you specify -iterations (the number of times the statistics are reported) and -interval (the time between reports).

svm0::> statistics volume show -iterations 10 -interval 5

svm0 : 10/25/2022 14:52:35
                  *Total Read Write Other     Read Write Latency
   Volume Vserver    Ops  Ops   Ops   Ops    (Bps) (Bps)    (us)
--------- ------- ------ ---- ----- ----- -------- ----- -------
     vol5    svm0   1575 1575     0     0 95835136     0     524
     vol7    svm0   1545 1545     0     0 93981696     0     518
     vol8    svm0   1516 1516     0     0 92125184     0     543
     vol1    svm0   1506 1506     0     0 91600896     0     548
     vol6    svm0   1487 1487     0     0 90232832     0     485
     vol4    svm0   1435 1435     0     0 87359488     0     546
     vol2    svm0   1412 1412     0     0 85709824     0     512
     vol3    svm0   1353 1353     0     0 82245632     0     514
svm0_root    svm0      0    0     0     0        0     0       0
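Output in this shape can be parsed into records before it is graphed or forwarded to another system. The following Python sketch is one possible approach, not an official tool. It assumes the nine-column layout shown above, treats a line that holds only a volume name as a wrapped row that continues on the next line, and uses hypothetical names (`VolumeStats`, `parse_volume_stats`). The sample text is a subset of the output above.

```python
from dataclasses import dataclass


@dataclass
class VolumeStats:
    volume: str
    vserver: str
    total_ops: int
    read_ops: int
    write_ops: int
    other_ops: int
    read_bps: int
    write_bps: int
    latency_us: int


def parse_volume_stats(text: str) -> list[VolumeStats]:
    """Parse data rows, skipping timestamps, headers, and separators."""
    rows = []
    pending_name = None  # volume name wrapped onto its own line
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if len(tokens) == 1:
            pending_name = tokens[0]
            continue
        if pending_name is not None and len(tokens) == 8:
            tokens = [pending_name] + tokens
        pending_name = None
        if len(tokens) != 9:
            continue  # timestamp line or first header line
        try:
            numbers = [int(token) for token in tokens[2:]]
        except ValueError:
            continue  # second header line or the dashed separator
        rows.append(VolumeStats(tokens[0], tokens[1], *numbers))
    return rows


SAMPLE = """\
svm0 : 10/25/2022 14:52:35
               *Total Read Write Other     Read Write Latency
Volume Vserver    Ops  Ops   Ops   Ops    (Bps) (Bps)    (us)
------ ------- ------ ---- ----- ----- -------- ----- -------
   vol5    svm0   1575 1575     0     0 95835136     0     524
   vol7    svm0   1545 1545     0     0 93981696     0     518
svm0_root
           svm0     0    0      0     0        0     0       0
"""

stats = parse_volume_stats(SAMPLE)
print(sum(row.total_ops for row in stats))  # 3120
print(sum(row.read_bps for row in stats))   # 189816832
```

Summing per-volume rows this way gives an SVM-wide view of operations and read throughput for each reporting interval.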

NetApp Harvest

Although CloudWatch provides high-level metrics, the deeper performance metrics shown earlier in the ONTAP Command Line Interface are not exposed in CloudWatch. However, tools such as NetApp Harvest with Grafana, and NetApp Cloud Insights, offer many more metrics. See the Monitoring file systems section of the FSx for ONTAP user guide for more information.

Monitoring FSx for ONTAP file systems with Harvest and Grafana

NetApp Harvest Github

Cleaning up

To avoid incurring unwanted AWS costs after performing these steps, delete any AWS resources you created, such as Amazon EC2 instances and FSx for ONTAP resources.

Conclusion

In this post, I showed the durability and resiliency benefits of Amazon FSx for NetApp ONTAP over existing Amazon EC2 NAS implementations, along with the advanced features of FSx for ONTAP. With its multi-protocol capabilities, feature-rich tool set, and the simplicity of a managed service, FSx for ONTAP enables existing applications and workloads to run seamlessly in AWS.

Upstream, midstream, and downstream workloads with multi-protocol requirements can access these capabilities today in AWS with FSx for ONTAP while gaining the benefits of a fully managed service. This post covered techniques using FlexGroups and SMB tunings for specific SMB workloads. Monitoring techniques, both at the Command Line Interface and through alternative methods from the NetApp ecosystem, provide robust telemetry data. FSx for ONTAP can be highly tuned to deliver the best performance for your workloads.

Amazon FSx for NetApp ONTAP is available in most AWS Regions. If you have any comments or questions, don’t hesitate to leave them in the comments section.