AWS Storage Blog

Best practices for monitoring Amazon FSx for Lustre clients and file systems

Lustre is a high-performance parallel file system commonly used in workloads requiring throughput up to hundreds of GB/s and sub-millisecond per-operation latencies, such as machine learning (ML), high performance computing (HPC), video processing, and financial modelling. Amazon FSx for Lustre provides fully managed shared storage with the scalability and performance of the popular Lustre file system. It is designed to store and manage large amounts of data across multiple servers and clients.

As with any complex distributed system, it is important to monitor the metrics, logs, and limits that can affect its performance. Because FSx for Lustre is a managed file system, AWS monitors the file system and takes proactive actions to keep it healthy. AWS also performs routine software patching for the Lustre software that it manages.

In this post, we demonstrate the additional monitoring you can implement on FSx for Lustre clients to monitor and manage your workload’s performance. We cover how to gather logs from your cluster, extract the relevant information, and interpret the resulting metrics. We also explain how to gather metrics from the Lustre client and your ENA driver, and how to turn error messages into actionable metrics.

Why consider workload monitoring?

Lustre’s network communication protocol, LNet, is a low-level networking layer used by the Lustre distributed file system, including the Lustre client. AWS monitors the file system and takes the necessary actions to keep the system healthy and secure, as FSx for Lustre is a managed service. Under the shared responsibility model, customers are responsible for the performance tuning and monitoring of Lustre clients. Monitoring a Lustre file system on the client side can help detect and diagnose issues that may be affecting the Lustre file system.

The following are some of the reasons why monitoring a Lustre file system on the client side is important:

  • Improving performance: Monitoring metrics such as storage capacity, dropped packets, read/write throughput, and latency can help identify issues that could affect your applications if left unaddressed. This information can be used to identify bottlenecks and take action accordingly. Example actions include scaling up the client, increasing the storage capacity, or rebalancing your file system before your end users start experiencing issues.
  • Identifying errors and failures: Monitoring the Lustre client for errors and failures, such as slow replies, timeouts, and server connection failures, can help you identify issues, diagnose the root cause, and take appropriate action to resolve them, such as gracefully unmounting the file system.
  • Identifying and managing client node failures: Monitoring and maintaining the availability of the Lustre file system is AWS’s responsibility. However, client-side issues (such as resource exhaustion) might arise that prevent clients from communicating with a Lustre file system (or a subset of the network). When these issues are detected, it is important to diagnose them and remove the affected nodes from your cluster to prevent impact on other clients.

What you should know about Lustre monitoring

In Lustre, an Object Storage Device (OSD) is a storage device that holds data and metadata objects. One of the key features of Lustre is its ability to distribute and, depending on configuration, “stripe” data across multiple Object Storage Targets (OSTs). This is to improve the overall performance of file reads and writes.

Metadata is stored on a Metadata Target (MDT). As shown in the following diagram, a Metadata Server (MDS) backs an MDT, while an Object Storage Server (OSS) backs one or more OSTs.

Finally, an Object-Based Device (OBD) connection describes the connection between clients and servers for communication and data exchange.

FSx for Lustre components

To observe the performance of a distributed system, you must collect logs and metrics from the components that make up the complete system, and parse and extract fields from those logs. The FSx for Lustre service publishes relevant metrics and logs to Amazon CloudWatch natively, without requiring any additional setup. On the clients, we recommend collecting kernel logs as well as publishing certain custom client metrics.

Lustre client logs typically include identifiers such as Client, Target (Object Storage Target or MetaData Target), LNet node identifier, and protocol. By parsing the logs, you can extract identifiers. This lets you build a map (as shown in the following) to identify where errors are occurring and which connections have high latency; or find requests that were unsuccessful due to time-outs.

Lustre OBD client server mapping

Getting a cluster wide view using Lustre client logs

In this section, we cover getting a cluster wide view using Lustre client logs. First, we set up log collection. Then, we set up Amazon CloudWatch Logs metric filters.

1. Setting up log collection

First and foremost, you must collect and aggregate logs from the Lustre clients. The Lustre client kernel module (kmod-lustre) generates log messages. Collecting and aggregating these log messages across all your clients helps you answer questions such as:

  • Which clients are impacted?
  • Is it a cluster wide issue? Or is it a client side issue?
  • Does it affect the entire file system or only a subset of OSTs?

As the Lustre client runs in the kernel, these logs are typically written to /var/log/messages, and you can use the CloudWatch agent to ingest them. An example configuration for the CloudWatch agent is shown in the following:

{
  "agent": {
    "metrics_collection_interval": 60,
    "logfile": "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log",
    "debug": false
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/messages",
            "log_group_name": "/ec2/linux/var/log/messages",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

Example configuration of the CloudWatch agent for log aggregation
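
To apply this configuration, you can load it into a running agent with the amazon-cloudwatch-agent-ctl helper. The following is a minimal sketch; the configuration file path is an assumption and should be adjusted to wherever you store the file:

# Load the configuration above into the CloudWatch agent and (re)start it
# (the file path below is an example location)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json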

2. Setting up Amazon CloudWatch Logs metric filters

After the logs are ingested into CloudWatch, you can convert certain log events to metrics using a filter pattern. One simple pattern to get you started is “LustreError”.
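
If you prefer to create the metric filter programmatically, the following AWS CLI sketch shows the general shape. The filter name, metric name, and namespace are illustrative; the log group name matches the agent configuration shown earlier:

# Turn every "LustreError" log event into a count metric in a custom namespace
aws logs put-metric-filter \
  --log-group-name "/ec2/linux/var/log/messages" \
  --filter-name "LustreErrorCount" \
  --filter-pattern "LustreError" \
  --metric-transformations \
      metricName=LustreErrorCount,metricNamespace=FSxCustom,metricValue=1,defaultValue=0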

Note that the Lustre client is verbose, and a number of these messages per client is to be expected. Short spikes also appear during maintenance or recovery events. These are not a cause for concern.

However, extracting these metrics is useful for anomaly detection. For example, in the following graph you can clearly see the number of errors increases as of 12/28.

Count of messages per second with ‘LustreError’ prefix from Lustre client logs

From just this graph you can’t draw any conclusions. An alarm should trigger further investigation into the logs. In the previous example, we found that clients were unable to reach a certain filesystem due to a network configuration error on the client side.

The next sections explain how you can automatically extract more relevant data points from the logs.

Parsing the Lustre client logs through CloudWatch Logs Insights

Amazon CloudWatch Logs Insights queries offer the ability to parse log messages and extract relevant information.

Before we dive in, let’s go over a few example log entries that you could find in Lustre client logs.

Lustre: 11569:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1640237618/real 1640237618] req@ffff9e8d4dd51e00
x1719858836278496/t0(0) o103->dzfevbmv-MDT0000-mdc-ffff9efe38280000@10.59.142.118
@tcp:17/18 lens 328/224 e 0 to 1 dl 1640237625 ref 1 fl Rpc:X/0/ffffffff rc 0/-1

Lustre: dzfevbmv-MDT0000-mdc-ffff9efe38280000: Connection to dzfevbmv-MDT0000
(at 10.59.142.118@tcp) was lost; in progress operations using this service will
wait for recovery to complete

LustreError: 11-0: dzfevbmv-MDT0000-mdc-ffff96cc000c4800: operation
ldlm_enqueue to node 10.30.11.90@tcp failed

LustreError: 11-0: dzfevbmv-MDT0000-mdc-ffff9fc0f2a00000: operation
obd_ping to node 10.30.11.221@tcp failed: rc = -107

LustreError: 6503:0:(import.c:379:ptlrpc_invalidate_import())
dzfevbmv-OST003b_UUID: Unregistering RPCs found (1).
Network is sluggish? Waiting them to error out.

Lustre: 2404:0:(client.c:2116:ptlrpc_expire_one_request()) @@@
Request sent has failed due to network error:

Lustre: Evicted from MGS (at 198.19.49.19@tcp1)

Example messages from Lustre client logs (non-exhaustive list)

Looking at the common examples provided above, you can see a few patterns. As all messages start with a Lustre or LustreError prefix, it’s easy to extract only Lustre messages.

Note that the logs may include a reference to Lustre targets. FSx for Lustre uses the following pattern for Lustre identifiers:

  • Mount name identifier (e.g., dzfevbmv). A combination of lowercase letters and numbers, eight in total, which can be extracted using the following regular expression: [a-z0-9]{8}
  • An OST or MDT identifier (e.g., OST003b). A prefix (OST or MDT) followed by four hexadecimal characters, which can be extracted using the following regular expression: (OST|MDT)[0-9a-f]{4} (see the example after this list)
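
As a quick check on a single client, you can combine both patterns to count how often each target appears in the local kernel log. This is a minimal sketch; the log path matches the agent configuration above:

# Count occurrences of each FSx for Lustre target identifier (e.g., dzfevbmv-OST003b)
sudo grep -oE '[a-z0-9]{8}-(OST|MDT)[0-9a-f]{4}' /var/log/messages | sort | uniq -c | sort -rn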

Now that you have a better understanding of the information that is logged, let’s dive in and understand how you can use CloudWatch Logs Metric Filters and CloudWatch Logs Insights to parse and extract fields from log messages.

1. Extracting client identifiers

To better analyze your system, one interesting insight is the number of log messages per Lustre client. This info helps you identify which subset of your system is potentially affected and act accordingly.

Although the log messages don’t include the instance ID or hostname, you can easily store and extract that information when using the CloudWatch agent for log ingestion. The CloudWatch agent lets you configure {instance_id}, {hostname}, {local_hostname}, and {ip_address} as variables within the name of the log stream (log_stream_name in the configuration file).

Using a CloudWatch Logs Insights query, you can calculate aggregate statistics for the logStream values. The following example query extracts certain messages from the client logs and calculates how frequently they occur per client.

fields @timestamp, @message, @logStream
| filter @message like /Lustre.*(Error|slow reply|Connection.*lost|operation.*failed)/
| stats count(*) as count by @logStream
| sort count desc

Example query showcasing the count of specific Lustre messages in the logs

These queries yield a table as a result. The table shows you how frequently a message is written to each client’s kernel log. Note that Lustre logging is verbose; these messages are expected to appear, and spikes can occur during maintenance windows.

If you see a single instance that stands out in your cluster, then that instance is likely to have a client-side issue that must be addressed.

Query output showcasing client and number of specific Lustre messages in the logs

2. Parsing the CloudWatch Log group to extract FSx Lustre service identifiers

You can also extract service-side identifiers, such as the OST or MDT ID. For example, you can use the query shown in the following example to count the requests sent that have timed out for slow reply.

fields @timestamp, @message
| filter @message like /Request sent has timed out for slow reply/
| parse @message /(?<@filesystemTarget>\w{8}\-(OST|MDT)[0-9a-f]{4})/
| stats count(*) as count by @filesystemTarget
| sort count desc

The output from these queries can be visualized as follows within the CloudWatch dashboards.

Count of requests sent that have timed out for slow reply

A large, short (~10 minute) spike on a small set of OSTs or an MDT, as seen in the previous example, typically indicates that file system maintenance has occurred and is expected behavior.

A continuous stream of “slow reply” messages on a small set of OSTs can be an indicator of hotspotting.

To extract just the target from all LustreError messages, consider using the following example query:

fields @timestamp, @message
| filter @message like /LustreError/
| parse @message /(?<@filesystemTarget>\w{8}\-(OST|MDT)[0-9a-f]{4})/
| stats count(*) as count by @filesystemTarget
| sort count desc

Monitoring key custom metrics through Amazon CloudWatch

In this section, we cover monitoring key custom metrics through Amazon CloudWatch.

1. OBD Device Connections

You must monitor the number of concurrent OBD device connections, as there is a maximum number of OBD devices per node. Today, that limit is 8192 OBD devices, as defined by MAX_OBD_DEVICES. Exceeding that limit results in failures to mount file systems, or in partially mounted file systems, which prevents your applications from accessing objects stored on a subset of OSTs. This shows up in your kernel logs as follows:

LustreError: 5745:0:(lov_ea.c:227:lsme_unpack()) dzfevbmv-clilov_UUID:
OST index 57 more than OST count 38

To prevent the above failure from happening, you should actively monitor the OBD connections and take preventive action before the limit is breached. You can easily count the number of active connections and publish the count as a custom metric to CloudWatch.

The following is an example of the code which can publish the number of active Lustre connections:

# Run as cron (example: every 5 minutes)
# Get concurrent connection count
CONN=$(lctl get_param -n devices | wc -l)
# Get hostname and instance ID as dimensions
INSTANCE_ID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)
HOSTNAME=$(hostname)
# Publish to CloudWatch
aws cloudwatch put-metric-data --metric-name ActiveLustreConnections --namespace FSxCustom --unit Count --value $CONN --dimensions InstanceId=$INSTANCE_ID,Hostname=$HOSTNAME

Number of active Lustre connections

With the above metric collected, you can set up an Amazon CloudWatch alarm at a level that works for you (e.g., ~6200 active connections). At that point, you should prevent your host from accepting new work that would mount additional file systems.
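
The following AWS CLI sketch shows what such an alarm could look like. The alarm name, threshold, and SNS topic ARN are illustrative placeholders; the namespace, metric name, and dimensions match the script above:

# Alarm when a single client approaches the OBD device limit
aws cloudwatch put-metric-alarm \
  --alarm-name "lustre-obd-connections-$INSTANCE_ID" \
  --namespace FSxCustom \
  --metric-name ActiveLustreConnections \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID Name=Hostname,Value=$HOSTNAME \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 6200 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:lustre-alerts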

2. OST Monitoring

Lustre has two allocation methods to distribute objects over OSTs. By default, the (faster) round-robin allocator is used. A weighted allocator is used when any two OSTs are imbalanced by more than 17%. The weighted allocator fills the emptier OSTs faster, but because it uses a weighted random algorithm, the OST with the most free space is not necessarily chosen each time.

Your file system may still become unbalanced. For example, when you scale the storage capacity of a file system, new (and therefore empty) OSTs are added to it. Similarly, appends to existing files might cause one OST to grow faster than others. As a result, some of the OSTs for your file system can run full while other OSTs remain relatively empty. This can cause performance bottlenecks, or even “ENOSPC: no space left on device” errors if an OST runs full.

To avoid these issues, you should monitor the usage of the OSTs for your FSx for Lustre file system. The usage can be easily shown by running the following command:

lfs df -h /fsx-mount-point

This provides the usage of the OSTs and MDT for the file system, as well as the storage capacity of the file system itself, as shown in the following:

lfs df -h /fsx-mount-point/
UUID                       bytes        Used   Available Use% Mounted on
dzfevbmv-MDT0000_UUID        2.9T      464.1G        2.5T  16% /fsx-mount-point[MDT:0]
dzfevbmv-OST0000_UUID        1.1T      318.0G      796.2G  29% /fsx-mount-point[OST:0]
dzfevbmv-OST0001_UUID        1.1T      263.7G      850.5G  24% /fsx-mount-point[OST:1]
dzfevbmv-OST0002_UUID        1.1T      270.2G      844.0G  24% /fsx-mount-point[OST:2]
dzfevbmv-OST0003_UUID        1.1T      272.1G      842.1G  24% /fsx-mount-point[OST:3]
dzfevbmv-OST0004_UUID        1.1T      322.9G      791.3G  29% /fsx-mount-point[OST:4]
dzfevbmv-OST0005_UUID        1.1T      333.2G      781.0G  30% /fsx-mount-point[OST:5]

filesystem_summary:         9.5T       2.2T        7.3T  24% /fsx-mount-point

You can periodically publish this output as custom metrics to CloudWatch using the CloudWatch agent or AWS Command Line Interface (AWS CLI) commands. This lets you monitor the trends in the OST and MDT usage for the file system, and enables you to set up a CloudWatch alarm in case the usage goes above a certain threshold (e.g., above 80%).
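
The following is a minimal sketch of what such a publishing script could look like, assuming the AWS CLI approach. The mount point, namespace, and metric name are illustrative; run it periodically (for example, from cron), similar to the connection-count script shown earlier:

# Publish per-target usage (Use%) from 'lfs df' as custom CloudWatch metrics
MOUNT_POINT=/fsx-mount-point
lfs df -h "$MOUNT_POINT" | awk '/_UUID/ {gsub("%", "", $5); print $1, $5}' |
while read -r TARGET USE_PCT; do
  aws cloudwatch put-metric-data \
    --namespace FSxCustom \
    --metric-name TargetUsedPercent \
    --unit Percent \
    --value "$USE_PCT" \
    --dimensions Target=$TARGET
done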

Trend of OSTs usage

When the OST usage exceeds the defined threshold, the CloudWatch alarm triggers, and you can check the usage of the impacted OSTs for the file system. You can follow the steps listed in the documentation to troubleshoot unbalanced storage on OSTs.
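
As an illustration of the kind of rebalancing action those steps describe, the following sketch restripes large files that have objects on an over-utilized OST so their data is spread across all OSTs. The mount point, OST index, and size threshold are placeholder assumptions; test this on a small directory first and run it during a quiet period, as migration generates additional I/O:

# Find large files with objects on OST index 5 (the over-utilized OST in this example)
lfs find /fsx-mount-point --ost 5 --size +100G --type f |
while read -r FILE; do
  # Rewrite the file with a stripe count of -1 (use all available OSTs)
  sudo lfs migrate -c -1 "$FILE"
done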

3. Monitoring Elastic Network Adapter metrics

Another important component to collect metrics from is the Elastic Network Adapter (ENA) on the Lustre clients, as dropped packets or exceeding the instance bandwidth limits can significantly affect the performance of your workload.

Therefore, monitoring the ENA metrics such as `pps_allowance_exceeded`, `bw_out_allowance_exceeded` and `bw_in_allowance_exceeded` is recommended. These metrics show when packets are queued and/or dropped because your workload is exceeding the capabilities of the selected Amazon Elastic Compute Cloud (Amazon EC2) instance type.
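
These counters are exposed by the ENA driver through ethtool. The following is a minimal sketch that publishes them as custom CloudWatch metrics; the interface name, namespace, and dimensions are assumptions to adjust for your environment (the CloudWatch agent can also collect these counters natively through its ethtool section):

# Publish ENA allowance-exceeded counters as custom CloudWatch metrics
IFACE=eth0
INSTANCE_ID=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)
for METRIC in bw_in_allowance_exceeded bw_out_allowance_exceeded pps_allowance_exceeded; do
  VALUE=$(ethtool -S "$IFACE" | awk -v m="$METRIC" '$1 == m":" {print $2}')
  aws cloudwatch put-metric-data \
    --namespace FSxCustom \
    --metric-name "$METRIC" \
    --unit Count \
    --value "${VALUE:-0}" \
    --dimensions InstanceId=$INSTANCE_ID,Interface=$IFACE
done

Note that these ethtool statistics are cumulative since the driver was loaded, so it is usually more meaningful to alarm on their rate of change than on the absolute value.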

If your workload is exceeding these limits, then you must take action and scale up the instance type. You can read more about monitoring network performance for EC2 instances in the AWS documentation.

What you should know about hotspotting

In the previous sections, we explained the core concepts of Lustre and gave you some pointers to extract relevant data from logs and metrics. Now that you have the data, you can avoid common performance issues such as hotspotting.

In Lustre, hotspotting refers to a situation where a specific part of the system becomes a bottleneck due to a high concentration of input/output (I/O) requests. This can lead to degraded performance, as the affected storage device becomes overwhelmed with I/O requests.

When multiple clients or processes access the same area of the filesystem concurrently or repeatedly, it can create a hotspot, resulting in increased latency.

Lustre hotspot illustration (OST0001)

To avoid hotspotting, take the following actions:

  • Monitor OST metrics: Prevent OSTs from running full, and make sure that the utilization of the lowest and highest utilized OSTs stays within 20% of each other.
  • Monitor and analyze logs for relevant Lustre messages, such as `timed out for slow reply`. Extract identifiers for the instance ID, file system ID, and OST/MDT IDs to isolate the source of the event.
  • Rebalance data when needed: Distributing data evenly across OSTs avoids overloading any single storage device.
  • Adjust striping configurations: Properly configure striping to distribute data across multiple OSTs, which can help spread the I/O load and alleviate hotspots (see the example following this list).
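
As a brief illustration of the last point, the following sketch shows how you could inspect and adjust the striping layout of a heavily accessed directory. The directory path is a placeholder, and a stripe count of -1 (use all OSTs) is just one possible choice; the right layout depends on your file sizes and access pattern:

# Inspect the current striping layout of a hot directory
lfs getstripe /fsx-mount-point/shared-dataset

# Stripe new files created in this directory across all available OSTs (-c -1),
# so large-file I/O is spread out instead of concentrating on a single OST
sudo lfs setstripe -c -1 /fsx-mount-point/shared-dataset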

Operational excellence

Achieving operational excellence involves more than simply identifying what to monitor, how to monitor it, and which tools to use. It is crucial to align these activities with the appropriate operational processes through careful process design, and to take the necessary steps to enhance operations accordingly.

To do this effectively, clearly define roles and responsibilities within the team. Begin with first incident responders, define necessary actions, and create playbooks.

Operational teams must also develop processes to address observed events, incidents requiring intervention, and recurring or currently unresolvable problems. Make sure that each alert-triggering event has a well-defined response (runbook or playbook) and a designated owner (individual, team, or role) accountable for execution.

For example, implement an automated response that rebalances the file system when OSTs are imbalanced by more than 17%, and automate a storage scaling event when a file system exceeds a certain maximum utilization threshold (e.g., 75%). This makes sure that storage balance and capacity don’t become a bottleneck for end-user activities.

This is an iterative process. By implementing monitoring, alarms, and increasing automation around remediation for observed events, teams can reduce response and resolution times. Gradually expand the scope of automation to encompass additional scenarios, ultimately streamlining operational activities surrounding observability.

Conclusion

In summary, monitoring Lustre file systems from the client’s perspective is instrumental for optimizing performance, boosting client availability, and evenly distributing load over a file system.

You can harness the insights gained from monitoring to optimize your file systems’ performance by tuning your workload through rebalancing, right-sizing, or scaling up your EC2 instances. Taking measures based on monitoring insights makes sure that your Lustre file systems operate at peak performance levels.

Thanks for reading this blog post! If you have any comments or questions, don’t hesitate to leave them in the comments section.

Javy de Koning

Javy de Koning is a Senior Solutions Architect for AWS. He has 15 years of experience designing enterprise scale distributed systems. His passion is designing microservice architectures that can take full advantage of the cloud and DevOps toolchains.

Ajinkya Farsole

Ajinkya Farsole is a Cloud Infrastructure Architect at AWS Professional Services. He helps customers achieve their goals by architecting secure and scalable solutions on the AWS Cloud. When he is not working, Ajinkya enjoys watching physics and history documentaries.