AWS Cloud Operations Blog

Deliver Java JMX statistics to Amazon CloudWatch using the CloudWatch Agent and CollectD

A common problem customers face is alerting when their Java-based workloads experience performance issues, such as heap constraints. In this post, I’ll illustrate how relevant metrics from the Java Virtual Machine (JVM) can be collected and sent to Amazon CloudWatch, where customers can define alerts that fire when workloads are in jeopardy.

Overview

Let’s consider a situation where a critical Java application suddenly uses up too much memory and risks running into a java.lang.OutOfMemoryError exception. It would be great if customers could receive a CloudWatch alert before this memory threshold is reached, so that they can take action in time.

Luckily, the JVM exposes runtime metrics, such as heap memory usage, thread count, and loaded classes, through a standard API called Java Management Extensions (JMX). Although it’s possible to collect metrics from the JVM and send them to CloudWatch using the AWS Java SDK (an approach that works great for debugging activities and development), the metrics can’t be namespaced. Without the ability to namespace metrics, the statistics collected from multiple Java applications would overwrite each other’s data points in CloudWatch. As a consequence, this wouldn’t scale well outside of a development setting.

The approach I present in this post leverages CollectD, an open-source system statistics collection daemon, which can collect metrics from JMX and send them to the Amazon CloudWatch agent, and which allows metrics to be namespaced per application. The CloudWatch agent then forwards the metrics to CloudWatch and records them as custom metrics.

The approach described here can be applied to Java workloads running on the AWS cloud in Amazon Elastic Compute Cloud (Amazon EC2) instances, on hosts and virtual machines that run on-premises, or elsewhere (for container-based workloads, the Prometheus JMX Exporter can be used instead).

Solution Architecture

The following figure illustrates the overall architecture and the interactions between CollectD, the CloudWatch agent, and CloudWatch. The Java application, CollectD and CloudWatch agent are running on the same host (for example on an EC2 instance on AWS or a virtual machine running on-premises).

Figure 1: Services used and architecture

Process flow:

  1. CollectD interrogates the JVM to collect JMX statistics (using the CollectD Java and GenericJMX plugins)
  2. CollectD sends JMX statistics to the CloudWatch agent (using the CollectD Network plugin)
  3. Custom metrics are recorded in CloudWatch

The JVM exposes metrics through the JMX API. The JMX API provides a simple, standard way of managing and monitoring Java applications and services. JMX metrics include information like heap memory usage, number of threads, and CPU usage.
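To make this concrete, here is a minimal sketch of reading a few of these values from inside a JVM through the platform MXBeans. It is only an illustration of what JMX exposes, not part of the CollectD setup; the class name JvmStatsSnapshot is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper class, used here only to illustrate the JMX platform MXBeans.
class JvmStatsSnapshot {

    // Heap usage as reported by the java.lang:type=Memory MXBean.
    static MemoryUsage heapUsage() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    // Live thread count as reported by the java.lang:type=Threading MXBean.
    static int threadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    // Currently loaded classes as reported by the java.lang:type=ClassLoading MXBean.
    static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        MemoryUsage heap = heapUsage();
        System.out.println("heap used (bytes):      " + heap.getUsed());
        System.out.println("heap committed (bytes): " + heap.getCommitted());
        // getMax() returns -1 when the maximum heap size is undefined.
        System.out.println("heap max (bytes):       " + heap.getMax());
        System.out.println("thread count:           " + threadCount());
        System.out.println("loaded classes:         " + loadedClassCount());
    }
}
```

These are the same MBeans (Memory, Threading, ClassLoading) that the CollectD configuration later in this post queries from outside the process.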

CollectD is an open-source system statistics collection daemon that can gather metrics from various sources and external devices, and then store this information or make it available over the network. CollectD features a comprehensive list of plugins – the focus of this post will be on the Java and GenericJMX plugins, which will query relevant metrics from the JVM and send them to the CloudWatch agent.

The CloudWatch agent can be deployed to EC2 instances and also on-premises hosts and virtual machines. The agent can collect internal system-level metrics and also receive metrics from the CollectD statistics collection daemons.

Step 1: Configure CollectD to push data to the CloudWatch agent

The following steps describe the installation and configuration for Amazon Linux 2. Package names, location of configuration files, and commands may differ on other Linux distributions and systems.

A note on notation: shell commands are prefixed with a $ to represent the terminal prompt. Enter the commands into the terminal without the $.

Install CollectD on your system. On Amazon Linux 2, the following packages must be installed:

$ sudo yum -y install collectd collectd-java collectd-generic-jmx

In the CollectD configuration file (/etc/collectd/collectd.conf), enable and configure the network and Java plugins as follows:

LoadPlugin network
<Plugin network>
    Server "127.0.0.1" "25826"
</Plugin>

LoadPlugin java
<Plugin java>
# JVMArg "-verbose:jni"
  JVMArg "-Djava.class.path=/usr/share/collectd/java/collectd-api.jar:/usr/share/collectd/java/generic-jmx.jar"
  LoadPlugin "org.collectd.java.GenericJMX"

  <Plugin "GenericJMX">
    <MBean "Memory">
      ObjectName "java.lang:type=Memory"
      InstancePrefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_memory_heapmemoryusage_"
        Table true
        Attribute "HeapMemoryUsage"
      </Value>
    </MBean>

    <MBean "OperatingSystem">
      ObjectName "java.lang:type=OperatingSystem"
      InstancePrefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_operatingsystem_maxfiledescriptorcount"
        Attribute "MaxFileDescriptorCount"
      </Value>
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_operatingsystem_openfiledescriptorcount"
        Attribute "OpenFileDescriptorCount"
      </Value>
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_operatingsystem_freephysicalmemorysize"
        Attribute "FreePhysicalMemorySize"
      </Value>
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_operatingsystem_freeswapspacesize"
        Attribute "FreeSwapSpaceSize"
      </Value>
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_operatingsystem_committedvirtualmemorysize"
        Attribute "CommittedVirtualMemorySize"
      </Value>
    </MBean>

    <MBean "Threading">
      ObjectName "java.lang:type=Threading"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_threading_threadcount"
        Attribute "ThreadCount"
      </Value>
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_threading_daemonthreadcount"
        Attribute "DaemonThreadCount"
      </Value>
    </MBean>

    <MBean "ClassLoading">
      ObjectName "java.lang:type=ClassLoading"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_classloading_loadedclasscount"
        Attribute "LoadedClassCount"
      </Value>
    </MBean>

    <MBean "GCCollectiontimeCopy">
      ObjectName "java.lang:name=Copy,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_copy"
        Attribute "CollectionTime"
      </Value>
    </MBean>

    <MBean "GCCollectiontimePSScavenge">
      ObjectName "java.lang:name=PS Scavenge,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_ps_scavenge"
        Attribute "CollectionTime"
      </Value>
    </MBean>

    <MBean "GCCollectiontimeParNew">
      ObjectName "java.lang:name=ParNew,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_parnew"
        Attribute "CollectionTime"
      </Value>
    </MBean>

    <MBean "GCCollectiontimeMarkSweepCompact">
      ObjectName "java.lang:name=MarkSweepCompact,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_marksweepcompact"
        Attribute "CollectionTime"
      </Value>
    </MBean>

    <MBean "GCCollectiontimePSMarkSweep">
      ObjectName "java.lang:name=PS MarkSweep,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_ps_marksweep"
        Attribute "CollectionTime"
      </Value>
    </MBean>

    <MBean "GCCollectiontimeConcurrentMarkSweep">
      ObjectName "java.lang:name=ConcurrentMarkSweep,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_concurrentmarksweep"
        Attribute "CollectionTime"
      </Value>
    </MBean>
 
    <MBean "GCCollectiontimeG1Young">
      ObjectName "java.lang:name=G1 Young Generation,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_g1_young_generation"
        Attribute "CollectionTime"
      </Value>
    </MBean>
 
    <MBean "GCCollectiontimeG1Old">
      ObjectName "java.lang:name=G1 Old Generation,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_g1_old_generation"
        Attribute "CollectionTime"
      </Value>
    </MBean>

    <MBean "GCCollectiontimeG1Mixed">
      ObjectName "java.lang:name=G1 Mixed Generation,type=GarbageCollector"
      Instanceprefix "java"
      <Value>
        Type "gauge"
        InstancePrefix "java_lang_garbagecollector_collectiontime_g1_mixed_generation"
        Attribute "CollectionTime"
      </Value>
    </MBean>

    <Connection>
      Host "localhost"
      ServiceURL "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"
      Collect "Memory"
      Collect "OperatingSystem"
      Collect "Threading"
      Collect "ClassLoading"
      Collect "GCCollectiontimeCopy"
      Collect "GCCollectiontimePSScavenge"
      Collect "GCCollectiontimeParNew"
      Collect "GCCollectiontimeMarkSweepCompact"
      Collect "GCCollectiontimePSMarkSweep"
      Collect "GCCollectiontimeConcurrentMarkSweep"
      Collect "GCCollectiontimeG1Young"
      Collect "GCCollectiontimeG1Old"
      Collect "GCCollectiontimeG1Mixed"
    </Connection>
  </Plugin>
</Plugin>
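The configuration above lists several GarbageCollector MBeans because the registered names depend on the garbage collector the JVM is running with, and it reads HeapMemoryUsage with Table true because that attribute is a composite value rather than a single number. The following sketch, which is my own illustration and not part of the CollectD setup, prints both for the JVM it runs in. The class name MBeanInspector is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Hypothetical helper class for inspecting the MBeans referenced in the configuration.
class MBeanInspector {

    // Names of the GarbageCollector MBeans registered in the JVM behind `conn`.
    // Only GC <MBean> blocks whose ObjectName matches one of these can return data.
    static Set<String> garbageCollectorNames(MBeanServerConnection conn) throws Exception {
        Set<String> names = new TreeSet<>();
        ObjectName pattern = new ObjectName("java.lang:type=GarbageCollector,*");
        for (ObjectName objectName : conn.queryNames(pattern, null)) {
            names.add(objectName.getKeyProperty("name"));
        }
        return names;
    }

    // Keys of the composite HeapMemoryUsage attribute (committed, init, max, used).
    static Set<String> heapUsageKeys(MBeanServerConnection conn) throws Exception {
        CompositeData data = (CompositeData) conn.getAttribute(
                new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
        return new TreeSet<>(data.getCompositeType().keySet());
    }

    public static void main(String[] args) throws Exception {
        // Inspects the current JVM; a remote connection could be passed in instead.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        System.out.println("GC MBeans:            " + garbageCollectorNames(conn));
        System.out.println("HeapMemoryUsage keys: " + heapUsageKeys(conn));
    }
}
```

Run it with the same JVM options as your application to see which of the GC blocks are relevant. Blocks for collectors that are not in use can likely be removed from the configuration.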

Restart the service to reload the configuration:

$ sudo service collectd restart
Stopping collectd:                                         [  OK  ]
Starting collectd:                                         [  OK  ]

And review the log to verify that there are no errors:

$ sudo tail /var/log/messages
...
May  10 14:32:05 collectd[1195]: plugin_load: plugin "network" successfully loaded.
May  10 14:32:05 collectd[1195]: plugin_load: plugin "java" successfully loaded.
May  10 14:32:05 collectd[1195]: Initialization complete, entering read-loop.

If there’s an issue with opening libjvm.so, follow the instructions in this issue report to soft link the expected shared library to the right location:

$ ldd /usr/lib64/collectd/java.so 
        linux-vdso.so.1 =>  (0x00007ffdec5f7000)
        libjvm.so => not found
        libc.so.6 => /lib64/libc.so.6 (0x00007f335fda8000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f3360380000)
$ find /usr/lib -name 'libjvm.so'
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.261.x86_64/jre/lib/amd64/server/libjvm.so
/usr/lib/jvm/java-11-amazon-corretto/lib/server/libjvm.so
$ sudo ln -s /usr/lib/jvm/java-11-amazon-corretto/lib/server/libjvm.so /usr/lib64/libjvm.so

Verify that libjvm.so can now be resolved:

$ ldd /usr/lib64/collectd/java.so
        linux-vdso.so.1 =>  (0x00007fff8e1e5000)
        libjvm.so => /usr/lib64/libjvm.so (0x00007fba63728000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fba6335a000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fba63156000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fba62f3a000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fba62c38000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fba64efb000)

Step 2: Run the Java application with JMX enabled

To enable JMX, run the application with the following additional arguments:

-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

Review your application’s startup scripts and configuration files to find the best place to add these arguments. When simply running a .jar from the command line, the invocation could look like this:

$ java -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -jar pet-search.jar
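Before wiring up CollectD, it can help to confirm that the JMX endpoint is reachable. The following is a minimal sketch of a standalone check that connects to the same service URL used in the CollectD Connection block and reads ThreadCount. The class name JmxRemoteCheck is hypothetical, and the sketch assumes the application was started with the arguments shown above (no authentication, no SSL, port 9999).

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical connectivity check; not part of CollectD or the CloudWatch agent.
class JmxRemoteCheck {

    // Same URL as the ServiceURL in the CollectD <Connection> block.
    static final String SERVICE_URL = "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi";

    // Reads the ThreadCount attribute of the Threading MBean over the given connection.
    static int readThreadCount(MBeanServerConnection conn) throws Exception {
        Object value = conn.getAttribute(new ObjectName("java.lang:type=Threading"), "ThreadCount");
        return ((Number) value).intValue();
    }

    public static void main(String[] args) throws Exception {
        // An alternative service URL may be passed as the first argument.
        String url = args.length > 0 ? args[0] : SERVICE_URL;
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            int threads = readThreadCount(connector.getMBeanServerConnection());
            System.out.println("ThreadCount: " + threads);
        }
    }
}
```

If the connection fails here, CollectD will not be able to collect from this endpoint either, so check the port and the JVM arguments first.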

Step 3: Configure the CloudWatch agent

Install and set up the CloudWatch agent if it isn’t yet available on your host.

Add the collectd block to the metrics_collected section of the CloudWatch agent configuration and define the desired namespace as name_prefix. The name_prefix appears in CloudWatch as a prefix to each custom metric and can be used to distinguish these JMX metrics from other custom metrics.

{
    "agent": {
        "metrics_collection_interval": 60,
        "run_as_user": "root"
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                            "file_path": "/var/log/messages",
                            "log_group_name": "messages",
                            "log_stream_name": "{instance_id}"
                    }
                ]
            }
        }
    },
    "metrics": {
        "aggregation_dimensions": [
            [
                "InstanceId"
            ]
        ],
        "append_dimensions": {
            "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            "ImageId": "${aws:ImageId}",
            "InstanceId": "${aws:InstanceId}",
            "InstanceType": "${aws:InstanceType}"
        },
        "metrics_collected": {
            "collectd": {
                "metrics_aggregation_interval": 60,
                "name_prefix": "petsearch_",
                "collectd_security_level": "none"
            },
            "disk": {
                "measurement": [
                    "used_percent"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ]
            },
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ],
                "metrics_collection_interval": 60
            }
        }
    }
}

Assuming that you have stored the CloudWatch agent configuration in the AWS Systems Manager Parameter Store, pass the configuration to the agent by running:

$ sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c ssm:AmazonCloudWatch-linux
...
Configuration validation succeeded
amazon-cloudwatch-agent stop/waiting
amazon-cloudwatch-agent start/running, process 6700

Verify that in the generated configuration file (amazon-cloudwatch-agent.toml), the following [[inputs.socket_listener]] is defined in the [inputs] section.

  [[inputs.socket_listener]]
    collectd_auth_file = "/etc/collectd/auth_file"
    collectd_security_level = "none"
    collectd_typesdb = ["/usr/share/collectd/types.db"]
    data_format = "collectd"
    name_prefix = "petsearch_"
    service_address = "udp://127.0.0.1:25826"
    [inputs.socket_listener.tags]
      "aws:AggregationInterval" = "60s"
      metricPath = "metrics"

Furthermore, verify that the CloudWatch agent is listening on UDP 127.0.0.1:25826 (last column, the program name, should mention amazon-cloudwatch):

$ sudo netstat -lpun                                                                     
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
udp        0      0 0.0.0.0:894                 0.0.0.0:*                               1991/rpcbind        
udp        0      0 127.0.0.1:916               0.0.0.0:*                               2012/rpc.statd      
udp        0      0 127.0.0.1:25826             0.0.0.0:*                               1541/amazon-cloudwa 
udp        0      0 0.0.0.0:40560               0.0.0.0:*                               2012/rpc.statd      
udp        0      0 0.0.0.0:68                  0.0.0.0:*                               1783/dhclient       
udp        0      0 0.0.0.0:111                 0.0.0.0:*                               1991/rpcbind        
udp        0      0 10.10.10.99:123             0.0.0.0:*                               2272/ntpd           
udp        0      0 127.0.0.1:123               0.0.0.0:*                               2272/ntpd           
udp        0      0 0.0.0.0:123                 0.0.0.0:*                               2272/ntpd           
udp        0      0 fe80::39:ebff:fee2:96b2:546 :::*                                    1874/dhclient       
udp        0      0 :::894                      :::*                                    1991/rpcbind        
udp        0      0 :::111                      :::*                                    1991/rpcbind        
udp        0      0 :::55580                    :::*                                    2012/rpc.statd 

Step 4: Review JMX custom metrics in CloudWatch

Navigate to CloudWatch in the AWS Console, then choose Metrics and All metrics. On the Browse tab, select CWAgent and then ImageId, InstanceId, InstanceType, instance, type, type_instance. Metrics from CollectD/GenericJMX can be identified by the prefix petsearch_GenericJMX_value in the Metric name column. Select the metrics, and in the drop-down menu at the top, choose Number as the graphing method.

Figure 2: JMX statistics as custom metrics in CloudWatch

Step 5: Create alarms for your JMX statistics

Steps:

  1. In CloudWatch, navigate to Alarms, then Create alarm. Alternatively, from your All Metrics > Graphed Metrics view, select the alarm bell next to the relevant metric.
  2. Select a relevant metric and select Next.
  3. Enter a threshold that should trigger the alert, and select Next.
  4. Define a notification method, for example, sending a notification to an Amazon SNS topic. Confirm with Next.
  5. Set a name and description for the alarm. Confirm with Next.
  6. Confirm with Create alarm.
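When choosing the threshold in step 3 for a heap usage alarm, one option is a fixed fraction of the JVM’s maximum heap size. The sketch below computes such a value in bytes. The 80% fraction and the class name HeapAlarmThreshold are my own assumptions for illustration; pick a fraction that matches how your workload behaves.

```java
// Hypothetical helper for deriving a static alarm threshold from the maximum heap size.
class HeapAlarmThreshold {

    // Returns the byte value at `fraction` (0 < fraction <= 1) of the maximum heap size.
    static long thresholdBytes(long maxHeapBytes, double fraction) {
        if (maxHeapBytes <= 0) {
            throw new IllegalArgumentException("maxHeapBytes must be positive");
        }
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        return (long) (maxHeapBytes * fraction);
    }

    public static void main(String[] args) {
        // maxMemory() reflects this JVM's heap limit (for example as set via -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("max heap (bytes):      " + maxHeap);
        // 0.8 is an arbitrary example fraction, not a recommendation from this post.
        System.out.println("80% threshold (bytes): " + thresholdBytes(maxHeap, 0.8));
    }
}
```

Run it with the same -Xmx setting as the monitored application so that the computed value matches that application’s heap limit.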

Conclusion

In this post, I showed how JMX statistics can be sent to CloudWatch by implementing an approach using CollectD. Using the same approach, given CollectD’s vast set of plugins, many more data sources can be integrated. The architecture isn’t limited to the cloud and can be extended to on-premises workloads (where both CollectD and the CloudWatch agent can be run). As an example, using the CollectD SNMP plugin, metrics from network devices could be sent to CloudWatch and alerts could be created to alert customers early of potential issues. There are many possibilities, so have a look at the CollectD plugin page for more inspiration.

About the author:

Daniel Lorch

Daniel Lorch is a DevOps Architect at AWS Professional Services, supporting customers with their challenges in DevOps by leveraging automation, Continuous Integration and Continuous Delivery, and engineering best practices.