Monitor real-time Amazon RDS OS metrics with flexible granularity using Enhanced Monitoring

Amazon Relational Database Service (Amazon RDS) provides access to real-time metrics for the operating system (OS) that your DB instance runs on, so you can monitor how different processes or threads use RDS resources. You can choose which metrics to monitor for each instance on the Amazon RDS console. Amazon CloudWatch, the native monitoring service of AWS, is widely used as the primary monitoring tool for AWS services and workloads.

In addition to CloudWatch, RDS offers an add-on feature called Enhanced Monitoring. It provides an additional layer of OS-level telemetry, which is useful during investigations that require highly granular monitoring data, such as short-lived spikes in CPU or disk I/O. Troubleshooting approaches that rely solely on CloudWatch metrics may miss these short-lived indicators. Using Enhanced Monitoring's granular metrics together with CloudWatch metrics therefore gives a more complete and precise picture of an instance's performance, which supports effective decisions when resolving issues. Enhanced Monitoring is disabled by default when you create an RDS instance, but you can enable it when you create the instance or later by modifying it. In this post, we explain the features that Enhanced Monitoring offers.

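Enhanced Monitoring is available for the following database engines:

Amazon Aurora MySQL-Compatible Edition
Amazon Aurora PostgreSQL-Compatible Edition
Amazon RDS for MariaDB
Amazon RDS for MySQL
Amazon RDS for Oracle
Amazon RDS for PostgreSQL
Amazon RDS for SQL Server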

Enhanced Monitoring is available for all DB instance classes except for the db.m1.small instance class. In this post, we demonstrate the functionalities of Enhanced Monitoring using Amazon RDS for MySQL.

Overview of Enhanced Monitoring

Enhanced Monitoring provides the following benefits when monitoring RDS instances:

It adds another layer of monitoring to use alongside CloudWatch metrics and Amazon RDS Performance Insights

Note: Here, CloudWatch metrics refers to the default, standard metrics the core RDS service publishes to CloudWatch at no additional cost and at 1-minute granularity

It provides subcomponent-level metric graphs such as CPU Nice, Free Memory, Active Memory, loadAverageMinute (for 1, 5, and 15 minutes), Swaps In, Swaps Out, and so on
It offers finer granularity than the standard 1-minute CloudWatch metrics, collecting data every 1, 5, 10, 15, 30, or 60 seconds, which helps when investigating and troubleshooting performance and resource-related issues
It provides physical device graphs that show metrics for each disk that makes up the DB instance's data storage volume
It delivers its metrics as logs to Amazon CloudWatch Logs in your account, where you can view and examine metrics for cpuUtilization, memory, diskIO, network, physicalDeviceIO, and more
It provides an operating system process list with details for the processes running on your DB instance

In this post, we discuss the following:

How to configure Enhanced Monitoring with 1-second granularity on Amazon RDS for MySQL
How to introduce some test load to an RDS for MySQL instance
How to analyze metrics using the CloudWatch console and RDS Enhanced Monitoring
Additional features of Enhanced Monitoring
Viewing Enhanced Monitoring logs from the CloudWatch log group RDSOSMetrics
Viewing the OS process list for monitoring

We run a workload on the RDS instance and monitor both CloudWatch metrics and Enhanced Monitoring metrics. Note that Enhanced Monitoring metrics are stored in CloudWatch Logs, not in CloudWatch metrics. The cost of Enhanced Monitoring depends on factors such as the amount of data transferred from your RDS instance, the Enhanced Monitoring granularity, and the log retention policy. Refer to Cost of Enhanced Monitoring for more information.
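Granularity affects cost mainly through the number of samples written to CloudWatch Logs. The following Python sketch counts the samples one instance produces per day at each supported interval. It assumes that each sample is delivered as one log event. The per-event size passed to approx_daily_log_bytes is a value you would measure from your own RDSOSMetrics events; it is not a published figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(interval_seconds: int) -> int:
    """Number of OS metric samples one instance produces per day at an interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return SECONDS_PER_DAY // interval_seconds


def approx_daily_log_bytes(interval_seconds: int, bytes_per_sample: int) -> int:
    """Rough daily log volume, given an average event size you measure yourself."""
    return samples_per_day(interval_seconds) * bytes_per_sample


# Print the sample count for every interval Enhanced Monitoring supports.
for interval in (1, 5, 10, 15, 30, 60):
    print(f"{interval:>2}s -> {samples_per_day(interval):>6} samples/day")
```

At 1-second granularity, a single instance produces 86,400 samples a day. That is 60 times as many as at 60-second granularity, so pick the finest interval only where you actually need it.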

Prerequisites

We used the following instance and storage configurations in our tests:

Amazon RDS for MySQL engine version – 8.0.32
Instance class – db.t2.medium
Storage type – gp2
Storage size – 20 GiB
Client – MySQL Workbench (version 8.0) tool hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance in the same VPC as the RDS instance

Configure Enhanced Monitoring in Amazon RDS

Enhanced Monitoring requires an AWS Identity and Access Management (IAM) role to send OS metric information to CloudWatch Logs. You can either create this role when you enable Enhanced Monitoring or create it beforehand. To enable or disable Enhanced Monitoring for an RDS instance, complete the following steps:

If the instance already exists, select it on the Amazon RDS console and choose Modify.

If you're creating a new DB instance, the following steps are the same.

Expand Additional configuration.
In the Monitoring section, select Enable Enhanced monitoring for your DB instance.

For Monitoring Role, choose the IAM role that you created to permit Amazon RDS to communicate with CloudWatch Logs for you, or choose default to have Amazon RDS create a role for you named rds-monitoring-role.

For Granularity, choose an interval in seconds between the points when metrics are collected for your DB instance or read replica. You can set this parameter to 1, 5, 10, 15, 30, or 60 seconds.
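You can create the monitoring role ahead of time instead of letting the console create rds-monitoring-role. The following Python sketch shows the two pieces such a role needs. The first is a trust policy that lets the Enhanced Monitoring service principal (monitoring.rds.amazonaws.com) assume the role. The second is the AWS managed policy AmazonRDSEnhancedMonitoringRole. The sketch assumes boto3 is installed and AWS credentials are configured, and any role name you pass in is your own choice.

```python
import json

# Service principal that Enhanced Monitoring uses to assume the role.
MONITORING_PRINCIPAL = "monitoring.rds.amazonaws.com"
# AWS managed policy that allows publishing OS metrics to CloudWatch Logs.
MANAGED_POLICY_ARN = (
    "arn:aws:iam::aws:policy/service-role/AmazonRDSEnhancedMonitoringRole"
)


def build_trust_policy() -> dict:
    """Return a trust policy that lets Enhanced Monitoring assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": MONITORING_PRINCIPAL},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_monitoring_role(role_name: str) -> str:
    """Create the role, attach the managed policy, and return the role ARN."""
    import boto3  # imported lazily so build_trust_policy works without boto3

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy()),
        Description="Lets Amazon RDS Enhanced Monitoring publish OS metrics",
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=MANAGED_POLICY_ARN)
    return role["Role"]["Arn"]
```

For example, create_monitoring_role("my-rds-monitoring-role") returns the ARN that you would then choose for Monitoring Role. The role name here is hypothetical.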

Alternatively, you can enable Enhanced Monitoring using AWS Command Line Interface (AWS CLI) commands. For more information, refer to Turning Enhanced Monitoring on and off.
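The same change can be scripted. The following Python sketch calls the boto3 RDS client's modify_db_instance with the MonitoringInterval and MonitoringRoleArn parameters. It assumes boto3 is installed and credentials are configured. The instance identifier and role ARN in the usage note are placeholders. Parameter validation is kept in a separate pure helper so it can be checked without calling AWS.

```python
from typing import Optional

# Intervals (seconds) that Enhanced Monitoring accepts; 0 turns it off.
VALID_INTERVALS = (0, 1, 5, 10, 15, 30, 60)


def build_modify_params(
    instance_id: str, interval: int, role_arn: Optional[str] = None
) -> dict:
    """Build keyword arguments for modify_db_instance.

    interval=0 disables Enhanced Monitoring; any other valid interval
    enables it and requires the ARN of the monitoring role.
    """
    if interval not in VALID_INTERVALS:
        raise ValueError(f"interval must be one of {VALID_INTERVALS}, got {interval}")
    params = {
        "DBInstanceIdentifier": instance_id,
        "MonitoringInterval": interval,
        # Note: this also applies any other pending modifications right away.
        "ApplyImmediately": True,
    }
    if interval > 0:
        if not role_arn:
            raise ValueError("a monitoring role ARN is required when interval > 0")
        params["MonitoringRoleArn"] = role_arn
    return params


def set_enhanced_monitoring(
    instance_id: str, interval: int, role_arn: Optional[str] = None
) -> dict:
    """Apply the change with boto3 and return the DBInstance description."""
    import boto3  # imported lazily so build_modify_params works without boto3

    rds = boto3.client("rds")
    response = rds.modify_db_instance(
        **build_modify_params(instance_id, interval, role_arn)
    )
    return response["DBInstance"]
```

For example, set_enhanced_monitoring("database-1", 1, "arn:aws:iam::123456789012:role/rds-monitoring-role") turns on 1-second collection, and set_enhanced_monitoring("database-1", 0) turns it off. The equivalent AWS CLI options are --monitoring-interval and --monitoring-role-arn on aws rds modify-db-instance.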

Run a workload on the RDS for MySQL instance

For this test, we insert 10,000 records in two batches of 5,000, with a 5-second pause in between. The intention is to simulate a spiky workload on the RDS for MySQL instance and then look for spikes in the Enhanced Monitoring metrics that you may not see in the default metrics that the core RDS service publishes to CloudWatch.

Connect to Amazon RDS for MySQL from a client such as MySQL Workbench.

For instructions on connecting to Amazon RDS for MySQL from MySQL Workbench, see Connecting from MySQL Workbench or Create and Connect to a MySQL Database with Amazon RDS.

Create a database in MySQL Workbench or in another preferred client by running the following command:

Create database testdb;

Create a table:

use testdb;
create table test1 (
id int not null auto_increment primary key,
col1 varchar(10),
col2 varchar(10)
) engine = InnoDB;

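Run the following script. It creates a stored procedure that inserts the requested number of rows, then inserts 5,000 rows, pauses for 5 seconds, and inserts another 5,000 rows: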

-- Stored procedure that inserts num rows into test1, one row per statement
Drop procedure if exists testdata;
delimiter //
create procedure testdata (in num int)
begin
declare i int default 0;
while i < num do
insert into test1 (col1, col2) values ('col1_value', 'col2_value');
set i = i + 1;
end while;
end //
delimiter ;

-- First burst: 5,000 rows
call testdata(5000);

-- Pause for 5 seconds between bursts
DO SLEEP(5);

-- Second burst: another 5,000 rows
call testdata(5000);

Compare metrics

In this section, we inspect some key metrics from the preceding workload, comparing the default metrics published to CloudWatch with the metrics produced by RDS Enhanced Monitoring.

WriteIOPS: CloudWatch

When 10,000 records were inserted into the RDS for MySQL instance in two batches with a 5-second pause, CloudWatch showed WriteIOPS of approximately 530, as shown in the following CloudWatch graph. Amazon RDS sends metric data to CloudWatch at 1-minute intervals, and each data point is an average over that minute, so the graph can't show any sub-minute spikes.

WriteIOPS: Enhanced Monitoring

For the same workload, Enhanced Monitoring shows approximately 1,600 write IOPS. The difference arises because we set Enhanced Monitoring to 1-second granularity for this demo. This is the finest level it offers, so short bursts aren't averaged out over a full minute. The following screenshot shows write IO/s from Enhanced Monitoring, including the sub-minute spikes. When you view the aggregated Disk I/O and File system graphs, the rdsdev device corresponds to the /rdsdbdata file system, where all database files and logs are stored. The filesystem device corresponds to the / (root) file system, where operating system files are stored.

CPUUtilization: CloudWatch

The same trend appears for CPU utilization. For the same workload, CloudWatch reports a maximum of approximately 12% at 1-minute granularity. The following screenshot shows CPUUtilization from CloudWatch.

CPU Total: Enhanced Monitoring

Enhanced Monitoring, on the other hand, reports CPU utilization of around 50%, because it samples every second instead of averaging over a minute. The following screenshot shows CPU Total from Enhanced Monitoring.

CPU Nice: Enhanced Monitoring

Enhanced Monitoring also breaks CPU usage into subcomponents. For example, CPU Nice is the percentage of CPU used by programs running at the lowest priority, and Enhanced Monitoring provides the following graph for it. Similarly, it provides graphs for Active Memory, Free Memory, Free Swap, Swaps In, Swaps Out, and more.

Summary of observations

Enhanced Monitoring is useful for observing granular, sub-minute OS-level metrics. This is especially convenient for spiky workloads, whose sub-minute peaks aren't visible in CloudWatch because Amazon RDS sends metric data to CloudWatch averaged over 1-minute intervals. In our test, for example, the WriteIOPS peak of about 1,600 didn't appear in CloudWatch. We would expect the same if ReadIOPS spiked to similar levels. Using Enhanced Monitoring together with the default CloudWatch metrics therefore supports more comprehensive and detailed troubleshooting.
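To see why a 1-minute average hides short bursts, the following Python sketch builds a hypothetical minute of 1-second write IOPS samples. The minute contains two 15-second bursts at 1,600 IOPS separated by a short pause, and is otherwise near idle. The sketch then averages the samples into 60-second buckets, the way a 1-minute metric would. The sample values are illustrative and were not captured from the test above.

```python
from statistics import mean
from typing import List


def bucket_average(samples: List[float], bucket_size: int = 60) -> List[float]:
    """Average consecutive 1-second samples into fixed-size buckets."""
    return [
        mean(samples[start:start + bucket_size])
        for start in range(0, len(samples), bucket_size)
    ]


# Hypothetical 1-second WriteIOPS samples for one minute:
# 10 s idle, 15 s burst, 5 s pause, 15 s burst, 15 s idle.
one_second = [5.0] * 10 + [1600.0] * 15 + [5.0] * 5 + [1600.0] * 15 + [5.0] * 15

per_minute = bucket_average(one_second)
print(f"1-second peak:  {max(one_second):.0f} IOPS")
print(f"1-minute value: {per_minute[0]:.1f} IOPS")
```

Here the single 1-minute value is 802.5 IOPS, about half of the 1,600 IOPS peak. The shorter the bursts are relative to the minute, the further the average falls below the peak.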

Additionally, as seen in the CPU Nice graph, you can also observe subcomponent graphs for different metrics from Enhanced Monitoring. You choose which graphs to show on your Enhanced Monitoring dashboard by opening the Manage Graphs tab and selecting the subcomponents.

Additional features of Enhanced Monitoring

Sometimes you need to know how many volumes your RDS instance stripes its data across. This information can help you troubleshoot throttling, latency, and other performance-related issues.

In the following screenshot, we can see that four volumes are attached to the RDS instance. Amazon RDS for SQL Server, by contrast, has just one volume attached regardless of its size. Each volume type has its own throughput and IOPS limits, which significantly affect Amazon RDS performance.

The following screenshot shows physical write IO/s from Enhanced Monitoring.
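You can also count the volumes from the logs. Each Enhanced Monitoring log event has a physicalDeviceIO array with one entry per physical device, as in the sample log event later in this post. The following Python sketch collects the distinct device names across events and sums writeIOsPS over the devices in one event. The four-device example and its values are hypothetical.

```python
from typing import Iterable, Set


def striped_devices(events: Iterable[dict]) -> Set[str]:
    """Collect the physical device names reported across log events."""
    devices: Set[str] = set()
    for event in events:
        for dev in event.get("physicalDeviceIO", []):
            devices.add(dev["device"])
    return devices


def total_physical_write_iops(event: dict) -> float:
    """Sum writeIOsPS over every physical device in one log event."""
    return sum(dev.get("writeIOsPS", 0) for dev in event.get("physicalDeviceIO", []))


# Hypothetical event excerpt from an instance striped over four volumes.
example = {
    "physicalDeviceIO": [
        {"device": "nvme1n1", "writeIOsPS": 120},
        {"device": "nvme2n1", "writeIOsPS": 118},
        {"device": "nvme3n1", "writeIOsPS": 121},
        {"device": "nvme4n1", "writeIOsPS": 119},
    ]
}
print(len(striped_devices([example])), "volumes,",
      total_physical_write_iops(example), "write IO/s")
```

Roughly even per-device numbers, as in this example, are what you would expect when writes are spread across striped volumes.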

Log groups in CloudWatch

Enhanced Monitoring stores its logs in the CloudWatch Logs log group RDSOSMetrics. The log events are JSON documents that you can view on the CloudWatch console or with the AWS CLI. Each log event in a stream contains the OS metrics for the RDS instance at one sampling point. To view the logs for an RDS instance, complete the following steps:

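On the CloudWatch console, choose Log groups in the navigation pane.
Choose the RDSOSMetrics log group.
Choose the log stream for your RDS instance. Log streams are named after the instance's resource ID (DbiResourceId, which starts with db-), not the DB identifier. You can find the resource ID on the Configuration tab of the instance on the Amazon RDS console.
Choose the log events to investigate, using the time filter to narrow the range.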
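The same lookup can be scripted. The following Python sketch resolves an instance's DbiResourceId with describe_db_instances, reads the newest events from that stream in RDSOSMetrics with the CloudWatch Logs get_log_events call, and decodes each message as JSON. It assumes boto3 is installed and credentials are configured. The helper names are hypothetical.

```python
import json
from typing import List


def parse_os_metrics(messages: List[str]) -> List[dict]:
    """Decode Enhanced Monitoring log messages (one JSON document each)."""
    return [json.loads(message) for message in messages]


def lookup_resource_id(instance_id: str) -> str:
    """Return the DbiResourceId that names the instance's log stream."""
    import boto3  # imported lazily so parse_os_metrics works without boto3

    rds = boto3.client("rds")
    instances = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    return instances["DBInstances"][0]["DbiResourceId"]


def fetch_os_metrics(instance_id: str, limit: int = 10) -> List[dict]:
    """Read and decode the most recent Enhanced Monitoring events."""
    import boto3

    logs = boto3.client("logs")
    response = logs.get_log_events(
        logGroupName="RDSOSMetrics",
        logStreamName=lookup_resource_id(instance_id),
        limit=limit,
        startFromHead=False,  # return the newest events
    )
    return parse_os_metrics([event["message"] for event in response["events"]])
```

For example, fetch_os_metrics("database-1", limit=5)[-1]["cpuUtilization"]["total"] would give the total CPU percentage from the newest of five events. "database-1" is a placeholder identifier.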

The following example log event contains the values of all the metrics, such as cpuUtilization.nice and memory.buffers:

{
    "engine": "MYSQL",
    "instanceID": "yyyy",
    "instanceResourceID": "xxxx",
    "timestamp": "2023-05-01T20:36:02Z",
    "version": 1,
    "uptime": "00:27:36",
    "numVCPUs": 4,
    "cpuUtilization": {
        "guest": 0,
        "irq": 0,
        "system": 0.3,
        "wait": 0,
        "idle": 99.3,
        "user": 0.3,
        "total": 0.9,
        "steal": 0,
        "nice": 0.3
    },
    "loadAverageMinute": {
        "one": 0.04,
        "five": 0.03,
        "fifteen": 0.04
    },
    "memory": {
        "writeback": 0,
        "hugePagesFree": 0,
        "hugePagesRsvd": 0,
        "hugePagesSurp": 0,
        "cached": 716380,
        "hugePagesSize": 2048,
        "free": 13573744,
        "hugePagesTotal": 0,
        "inactive": 2026324,
        "pageTables": 7744,
        "dirty": 848,
        "mapped": 107644,
        "active": 267504,
        "total": 16069100,
        "slab": 72004,
        "buffers": 140440
    },
    "tasks": {
        "sleeping": 111,
        "zombie": 0,
        "running": 0,
        "stopped": 0,
        "total": 111,
        "blocked": 0
    },
    "swap": {
        "cached": 0,
        "total": 16776188,
        "free": 16776188,
        "in": 0,
        "out": 0
    },
    "network": [
        {
            "interface": "eth0",
            "rx": 910,
            "tx": 12785
        }
    ],
    "diskIO": [
        {
            "writeKbPS": 16,
            "readIOsPS": 0,
            "await": 1,
            "readKbPS": 0,
            "rrqmPS": 0,
            "util": 0.8,
            "avgQueueLen": 0,
            "tps": 4,
            "readKb": 0,
            "device": "rdsdev",
            "writeKb": 16,
            "avgReqSz": 8,
            "wrqmPS": 0,
            "writeIOsPS": 4
        },
        {
            "writeKbPS": 4,
            "readIOsPS": 0,
            "await": 1,
            "readKbPS": 0,
            "rrqmPS": 0,
            "util": 0.4,
            "avgQueueLen": 0,
            "tps": 1,
            "readKb": 0,
            "device": "filesystem",
            "writeKb": 4,
            "avgReqSz": 8,
            "wrqmPS": 0,
            "writeIOsPS": 1
        }
    ],
    "physicalDeviceIO": [
        {
            "writeKbPS": 16,
            "readIOsPS": 0,
            "await": 1,
            "readKbPS": 0,
            "rrqmPS": 0,
            "util": 0.8,
            "avgQueueLen": 0,
            "tps": 3,
            "readKb": 0,
            "device": "nvme1n1",
            "writeKb": 16,
            "avgReqSz": 10.67,
            "wrqmPS": 1,
            "writeIOsPS": 3
        }
    ],
    "fileSys": [
        {
            "used": 372316,
            "name": "",
            "usedFiles": 251,
            "usedFilePercent": 0,
            "maxFiles": 13107200,
            "mountPoint": "/rdsdbdata",
            "total": 205270252,
            "usedPercent": 0.18
        },
        {
            "used": 2233064,
            "name": "",
            "usedFiles": 39420,
            "usedFilePercent": 6.02,
            "maxFiles": 655360,
            "mountPoint": "/",
            "total": 10230600,
            "usedPercent": 21.83
        }
    ]
}

You can then pass the logs to other AWS services or third-party tools that consume log stream data. For example, you can use Amazon Kinesis Data Firehose for processing and Amazon QuickSight for visualization.
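Before forwarding events, you may want to reduce each one to a few headline numbers. The following Python sketch takes one decoded log event and returns total CPU, the percentage of memory that is free (memory values are reported in kilobytes), and write IOPS per logical disk device. On the logical devices, rdsdev corresponds to /rdsdbdata and filesystem corresponds to /. The input values are copied from the sample log event above. The function name and output keys are hypothetical.

```python
from typing import Dict


def summarize(event: dict) -> Dict[str, float]:
    """Pull a few headline numbers out of one Enhanced Monitoring event."""
    mem = event["memory"]  # values are in kilobytes
    summary = {
        "cpu_total_pct": event["cpuUtilization"]["total"],
        "mem_free_pct": round(100 * mem["free"] / mem["total"], 2),
    }
    # diskIO is per logical device: rdsdev is /rdsdbdata, filesystem is /.
    for disk in event.get("diskIO", []):
        summary[f"{disk['device']}_write_iops"] = disk["writeIOsPS"]
    return summary


# Excerpt of the sample log event shown above.
sample = {
    "cpuUtilization": {"total": 0.9},
    "memory": {"free": 13573744, "total": 16069100},
    "diskIO": [
        {"device": "rdsdev", "writeIOsPS": 4},
        {"device": "filesystem", "writeIOsPS": 1},
    ],
}
print(summarize(sample))
```

For the sample event, about 84.5% of memory is free and the rdsdev device accounts for most of the write activity.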

OS process list for monitoring

If you want to see details for the processes running on your DB instance, choose OS process list on the Monitoring tab of the RDS instance on the Amazon RDS console.

The Enhanced Monitoring metrics shown in the process list view are organized as follows: RDS child processes, RDS processes, and OS processes.

You might find differences between the CloudWatch and Enhanced Monitoring measurements. Refer to Differences between CloudWatch and Enhanced Monitoring metrics to learn more.

Enhanced Monitoring provides access to metrics that CloudWatch doesn't offer, such as network, swap, and processList. These metrics give a deeper understanding of how the OS behaved during the troubleshooting window. For the full list of metrics available with Enhanced Monitoring, see OS metrics in Enhanced Monitoring.
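The processList metric can also be read from the log events. The following Python sketch ranks processes by cpuUsedPc or memoryUsedPc. The field names come from the OS metrics reference rather than from the sample log event above, which doesn't include processList, so check them against your own events. The process names and values in the example are hypothetical.

```python
from typing import List


def top_processes(event: dict, n: int = 5, key: str = "cpuUsedPc") -> List[dict]:
    """Return the n processes with the highest value for key.

    key is typically "cpuUsedPc" or "memoryUsedPc"; missing values count as 0.
    """
    processes = event.get("processList", [])
    return sorted(processes, key=lambda p: p.get(key, 0), reverse=True)[:n]


# Hypothetical processList excerpt.
example = {
    "processList": [
        {"name": "mysqld", "id": 1234, "cpuUsedPc": 48.5, "memoryUsedPc": 30.1},
        {"name": "process-b", "id": 2001, "cpuUsedPc": 0.4, "memoryUsedPc": 2.0},
        {"name": "process-c", "id": 2002, "cpuUsedPc": 1.2, "memoryUsedPc": 5.3},
    ]
}
for proc in top_processes(example, n=2):
    print(proc["name"], proc["cpuUsedPc"])
```

Sorting by memoryUsedPc instead (key="memoryUsedPc") gives a quick view of which processes hold the most memory at that sampling point.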

Considerations for Amazon RDS monitoring tools

Amazon RDS offers three primary tools for database performance monitoring and troubleshooting: Enhanced Monitoring, CloudWatch, and Performance Insights. Each has its own use cases when troubleshooting a particular issue with RDS.

Use Enhanced Monitoring to get highly granular metrics about the OS that the database engine runs on. These metrics are broken down into subcomponents (cpuUtilization.nice, memory.free, and more) and delivered as logs to CloudWatch Logs for further analysis and integration.

CloudWatch, on the other hand, receives the default metrics published by the core RDS service at 1-minute granularity and is used extensively to visualize how the instance is behaving. It also provides features such as overlaying graphs, switching axes, metric math, changing the period (1, 5, or 15 minutes), and viewing historical data going back months, as defined by CloudWatch retention policies. These features are very helpful for identifying patterns of resource usage and customer behavior on Amazon RDS.

Enhanced Monitoring and CloudWatch each have their own exclusive metrics. Enhanced Monitoring has physical device metrics in addition to other metrics, whereas CloudWatch has important metrics like BurstBalance, DatabaseConnections, and more that can help in identifying usage patterns.

Performance Insights allows you to assess and optimize the database load. It also provides sub-minute metrics with up to 2 years of retention for 1-second metrics. It allows you to slice and dice using multiple dimensions such as Wait Events or SQL Queries. Performance Insights also enables you to visualize the load that Amazon RDS experiences and helps you identify the types of queries that contribute to it, where the queries are originating from, and which users are causing them.

Clean up

You should delete all the resources used for this demonstration to avoid incurring additional cost.

Delete the RDS instance.
Uninstall the client tool, or terminate the EC2 instance that hosts it.
Delete the log stream related to Enhanced Monitoring for the RDS instance.
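The following Python sketch scripts the RDS and log stream cleanup steps with boto3. It calls delete_db_instance, waits with the db_instance_deleted waiter so that no new OS metrics land in the stream, and then calls delete_log_stream on the instance's stream in RDSOSMetrics. It defaults to a dry run that only returns the planned calls. It assumes boto3 is installed and credentials are configured. Skipping the final snapshot is appropriate only for demo data like this.

```python
from typing import List

LOG_GROUP = "RDSOSMetrics"


def cleanup_demo(instance_id: str, resource_id: str, dry_run: bool = True) -> List[str]:
    """Delete the demo DB instance, then its Enhanced Monitoring log stream.

    resource_id is the instance's DbiResourceId, which names the log stream.
    With dry_run=True (the default) nothing is deleted; the plan is returned.
    """
    planned = [
        f"rds.delete_db_instance({instance_id})",
        f"logs.delete_log_stream({LOG_GROUP}/{resource_id})",
    ]
    if dry_run:
        return planned

    import boto3  # only needed when actually deleting

    rds = boto3.client("rds")
    rds.delete_db_instance(
        DBInstanceIdentifier=instance_id,
        SkipFinalSnapshot=True,  # demo data only; keep a final snapshot otherwise
        DeleteAutomatedBackups=True,
    )
    # Wait until the instance is gone so no new events reach the stream.
    rds.get_waiter("db_instance_deleted").wait(DBInstanceIdentifier=instance_id)

    logs = boto3.client("logs")
    logs.delete_log_stream(logGroupName=LOG_GROUP, logStreamName=resource_id)
    return planned
```

For example, cleanup_demo("database-1", "db-EXAMPLE") prints nothing and deletes nothing; it only returns the two planned calls. Passing dry_run=False performs them. Both identifiers here are placeholders.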

Note that Enhanced Monitoring logs in the RDSOSMetrics log group are retained for 30 days by default.

Conclusion

In this post, we used a demonstration of a workload with spikes in Amazon RDS for MySQL to highlight the importance of using Enhanced Monitoring’s granular metrics to obtain a more comprehensive and accurate understanding of an instance’s performance.

If you have any questions or comments about this post, leave them in the comments section.

About the Authors

Abdul Sarker is a Cloud Support Engineer who has been with AWS for 2 years. With a focus on providing excellent customer experience in the AWS Cloud, he works with external customers to handle a variety of scenarios, such as troubleshooting Amazon RDS infrastructure, assisting with AWS DMS projects, and authoring and improving internal documentation and articles.

Nirupam Datta is a Cloud Support DBA at AWS. He has been with AWS for around 3.5 years. With over 11 years of experience in database engineering and infra-architecture, Nirupam is also a subject matter expert in the Amazon RDS core systems and Amazon RDS for SQL Server. He provides technical assistance to customers, guiding them to migrate, optimize, and navigate their journey in the AWS Cloud.