AWS Database Blog

Monitoring Amazon DynamoDB for operational awareness

Amazon DynamoDB is a serverless database: AWS takes on the undifferentiated heavy lifting of operating and maintaining the infrastructure behind this distributed system. As a customer, you use APIs to capture operational data that you can use to monitor and operate your tables. This post describes a set of metrics to consider when building out your dashboards and alarms to operationalize DynamoDB.

You can use Amazon CloudWatch metrics published by DynamoDB to help you understand the interaction of your evolving workload with DynamoDB in the context of your data model. The metrics fall into the following three categories, based on how you obtain them:

  1. Metrics that are provided out of the box with DynamoDB (noted as “Out of the Box”).
  2. Metrics that require computation via metric math (noted as “Requires metric math”).
  3. Metrics that must be self-published to Amazon CloudWatch using a custom AWS Lambda function (noted as “Requires custom AWS Lambda function”).

As you move toward production, you can also get recommendations on achieving operational excellence with DynamoDB.

To download the code to publish the custom metrics that you need in this example, see the GitHub repo. The Lambda function for publishing the custom CloudWatch metrics accepts a number of environment variables for overriding default settings; check the README for details, and see the sketch after the following list for the core publishing logic. At the time of publication of this post, these are:

  • CLOUDWATCH_CUSTOM_NAMESPACE – By default, the AWS Lambda function publishes metrics to the “Custom_DynamoDB” namespace. If you’d like to change it, set the CLOUDWATCH_CUSTOM_NAMESPACE environment variable.
  • DYNAMODB_ACCOUNT_TABLE_LIMIT – By default, the AWS Lambda function assumes your DynamoDB account table limit is 256. There is no API call to determine your account table limit, so if you’ve asked AWS to increase this limit for your account you must set the DYNAMODB_ACCOUNT_TABLE_LIMIT to that value for the AWS Lambda function to calculate the AccountTableLimitPct custom metric properly.
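
For orientation, the following is a minimal Python (boto3) sketch of the publishing logic, not the repo's exact code: it counts the account's tables in the current Region with the paginated ListTables API and publishes the AccountTableLimitPct custom metric, with defaults mirroring the environment variables above.

  import os
  import boto3

  dynamodb = boto3.client('dynamodb')
  cloudwatch = boto3.client('cloudwatch')

  NAMESPACE = os.environ.get('CLOUDWATCH_CUSTOM_NAMESPACE', 'Custom_DynamoDB')
  TABLE_LIMIT = int(os.environ.get('DYNAMODB_ACCOUNT_TABLE_LIMIT', '256'))

  def lambda_handler(event, context):
      # Count all tables in this account/Region via the paginated ListTables API
      table_count = 0
      for page in dynamodb.get_paginator('list_tables').paginate():
          table_count += len(page['TableNames'])

      # Publish the table count as a percentage of the (assumed) account table limit
      cloudwatch.put_metric_data(
          Namespace=NAMESPACE,
          MetricData=[{
              'MetricName': 'AccountTableLimitPct',
              'Value': table_count / TABLE_LIMIT * 100,
              'Unit': 'Percent',
          }],
      )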

The AWS CloudFormation examples in this post assume that an SNS topic, referred to as DynamoDBMonitoringSNSTopic, exists for alarms to send notifications to. They also assume that the template contains parameters such as DynamoDBProvisionedTableName, DynamoDBOnDemandTableName, DynamoDBGlobalTableName, and DynamoDBGlobalTableReceivingRegion. Additionally, the global secondary indexes (GSIs) are named the same as the table but with -gsi1 appended; for example, dynamodb-monitoring-gsi1.

The alarm thresholds provided in each section are reasonable starting points that you can adjust based on your requirements and workload patterns.

Metrics for each account and Region

There are a few account-level metrics for each AWS Region within an account that you must monitor. These are particularly important if multiple teams deploy DynamoDB tables into the same account: one team's change can, for example, impact another team's ability to auto scale its table, and the account administrator might need to take action to raise the account's limits. The following table summarizes the DynamoDB metrics and recommended alarm configurations for each Region in your AWS account.

| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Percentage of account limit read provisioned capacity allocated | AccountProvisionedReadCapacityUtilization | MAX > 80% | Out of the box |
| Percentage of account limit write provisioned capacity allocated | AccountProvisionedWriteCapacityUtilization | MAX > 80% | Out of the box |
| Percentage of read provisioned capacity used by the highest read provisioned table of an account | MaxProvisionedTableReadCapacityUtilization | MAX > 80% | Out of the box |
| Percentage of write provisioned capacity used by the highest write provisioned table of an account | MaxProvisionedTableWriteCapacityUtilization | MAX > 80% | Out of the box |
| Percentage of table count limit in use | AccountTableLimitPct | > 80% | Requires custom AWS Lambda function |

The following code is an example AWS CloudFormation template for the first metric in the preceding table, which you can modify for the other metrics:

  DynamoDBAccountReadCapAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmName: 'DynamoDBAccountReadCapAlarm'
      AlarmDescription: 'Alarm when account approaches maximum read capacity limit'
      AlarmActions:
        - !Ref DynamoDBMonitoringSNSTopic
      Namespace: 'AWS/DynamoDB'
      MetricName: 'AccountProvisionedReadCapacityUtilization'
      Statistic: 'Maximum'
      Threshold: 80
      ComparisonOperator: 'GreaterThanThreshold'
      Period: 300
      EvaluationPeriods: 1

To use the DynamoDB console to create the alarm, complete the following steps:

  1. On the DynamoDB console, choose Tables.
  2. Within a table, choose Metrics.
  3. Choose Create Alarm.

The following screenshot shows the Create Alarm section. For more information, see Creating CloudWatch Alarms to Monitor DynamoDB.

[Screenshot: creating a DynamoDB CloudWatch alarm]

Metrics for each table and GSI

Some metrics need monitoring and alerts for every table and GSI. For example, sustained heavy throttling might indicate a schema design issue, or a table misconfiguration such as auto scaling being disabled or its limits set too low. Such issues might need intervention and either AWS configuration or application code changes to resolve. Amazon CloudWatch Contributor Insights for DynamoDB can help you explore whether you have frequently accessed items causing sustained throttling.
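
If you suspect hot items, you can turn on Contributor Insights with a single API call. The following is a minimal Python (boto3) sketch; the table name is a hypothetical placeholder.

  import boto3

  dynamodb = boto3.client('dynamodb')

  # Enable CloudWatch Contributor Insights on a table; pass IndexName as well
  # to enable it for a specific GSI instead.
  dynamodb.update_contributor_insights(
      TableName='dynamodb-monitoring',  # hypothetical table name
      ContributorInsightsAction='ENABLE',
  )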

The following table summarizes the DynamoDB metrics and recommended alarm configurations for each DynamoDB table and GSI, regardless of billing mode.

| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Sustained read throttling | SampleCount ReadThrottleEvents / SampleCount ConsumedReadCapacityUnits | > 2% | Requires metric math |
| Sustained write throttling | SampleCount WriteThrottleEvents / SampleCount ConsumedWriteCapacityUnits | > 2% | Requires metric math |
| Sustained significant elevation of system errors | SampleCount SystemErrors / (SampleCount ConsumedReadCapacityUnits + SampleCount ConsumedWriteCapacityUnits) | > 2% | Requires metric math |
| Sustained significant elevation of user errors | SampleCount UserErrors / (SampleCount ConsumedReadCapacityUnits + SampleCount ConsumedWriteCapacityUnits) | > 2% | Requires metric math |
| Sustained significant elevation of condition check errors (optional) | ConditionalCheckFailedRequests | SUM > 100 | Out of the box |
| Sustained significant elevation of transaction conflicts (optional) | TransactionConflict | SUM > 100 | Out of the box |

The following code is an example AWS CloudFormation template for the first metric in the preceding table, which you can modify for the other metrics. This example uses metric math and alarms on read throttling for a GSI rather than the base table, to show how the GSI dimensions work. To scale the ratio of throttled events to total read events to the range [0, 100], the code multiplies it by 100.

  DynamoDBGSIReadThrottlingAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmName: 'DynamoDBGSIReadThrottlingAlarm'
      AlarmDescription: 'Alarm when GSI read throttle requests exceed 2% of total number of read requests'
      AlarmActions:
        - !Ref DynamoDBMonitoringSNSTopic
      Metrics:
        - Id: 'e1'
          Expression: '(m1/m2) * 100'
          Label: GSIReadThrottlesOverTotalReads
        - Id: 'm1'
          MetricStat:
            Metric:
              Namespace: 'AWS/DynamoDB'
              MetricName: 'ReadThrottleEvents'
              Dimensions:
                - Name: 'TableName'
                  Value: !Ref DynamoDBProvisionedTableName
                - Name: 'GlobalSecondaryIndexName'
                  Value: !Join [ '-', [!Ref DynamoDBProvisionedTableName, 'gsi1'] ]
            Period: 60
            Stat: 'SampleCount'
            Unit: 'Count'
          ReturnData: False
        - Id: 'm2'
          MetricStat:
            Metric:
              Namespace: 'AWS/DynamoDB'
              MetricName: 'ConsumedReadCapacityUnits'
              Dimensions:
                - Name: 'TableName'
                  Value: !Ref DynamoDBProvisionedTableName
                - Name: 'GlobalSecondaryIndexName'
                  Value: !Join [ '-', [!Ref DynamoDBProvisionedTableName, 'gsi1'] ]
            Period: 60
            Stat: 'SampleCount'
            Unit: 'Count'
          ReturnData: False
      EvaluationPeriods: 2
      Threshold: 2.0
      ComparisonOperator: 'GreaterThanThreshold'

Metrics for each provisioned throughput table and GSI

As a best practice, you should enable DynamoDB auto scaling on any table using provisioned throughput (PROVISIONED billing mode), for both the base table and all GSIs. Doing so reduces costs by scaling down during periods of low usage and minimizes throttling from under-provisioning during unanticipated load peaks.
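
If you configure this outside of CloudFormation, the following Python (boto3) sketch shows how a table's read dimension might be registered with Application Auto Scaling and given a target tracking policy. The table name, capacity bounds, and 70% target are illustrative assumptions, not recommendations from this post.

  import boto3

  autoscaling = boto3.client('application-autoscaling')

  # Register the table's read capacity as a scalable target (bounds are illustrative)
  autoscaling.register_scalable_target(
      ServiceNamespace='dynamodb',
      ResourceId='table/dynamodb-monitoring',  # hypothetical table name
      ScalableDimension='dynamodb:table:ReadCapacityUnits',
      MinCapacity=5,
      MaxCapacity=1000,
  )

  # Attach a target tracking policy that aims for 70% read utilization
  autoscaling.put_scaling_policy(
      PolicyName='dynamodb-monitoring-read-scaling',
      ServiceNamespace='dynamodb',
      ResourceId='table/dynamodb-monitoring',
      ScalableDimension='dynamodb:table:ReadCapacityUnits',
      PolicyType='TargetTrackingScaling',
      TargetTrackingScalingPolicyConfiguration={
          'TargetValue': 70.0,
          'PredefinedMetricSpecification': {
              'PredefinedMetricType': 'DynamoDBReadCapacityUtilization',
          },
      },
  )

Repeat the same registration for the write dimension and for each GSI (using the 'index' resource type in ResourceId) so every resource scales consistently.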

The following table shows table and GSI metrics that are scaled either as a percentage of the table’s provisioned throughput settings, or as a percentage of the auto scaling maximums. When a table approaches the configured maximum, you receive an alert so you can increase the maximum or investigate the unusual level of application load.

| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Percentage utilization of auto scaling read maximum | ProvisionedReadCapacityAutoScalingPct | > 90% | Requires custom AWS Lambda function |
| Percentage utilization of auto scaling write maximum | ProvisionedWriteCapacityAutoScalingPct | > 90% | Requires custom AWS Lambda function |

The following code is an example AWS CloudFormation template for the first metric in the preceding table, which you can modify for the other metric. This metric is based on custom metrics published from an AWS Lambda function.

  DynamoDBTableASReadAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmName: 'DynamoDBTableASReadAlarm'
      AlarmDescription: 'Alarm when table auto scaling read setting approaches table AS maximum'
      AlarmActions:
        - !Ref DynamoDBMonitoringSNSTopic
      Namespace: !Ref DynamoDBCustomNamespace
      MetricName: 'ProvisionedReadCapacityAutoScalingPct'
      Dimensions:
        - Name: 'TableName'
          Value: !Ref DynamoDBProvisionedTableName
      Statistic: 'Maximum'
      Unit: 'Percent'
      Threshold: 90
      ComparisonOperator: 'GreaterThanThreshold'
      Period: 60
      EvaluationPeriods: 2
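
For context, the following Python (boto3) sketch shows one way such a custom metric could be derived; the repo's actual implementation may differ, and the function name and namespace default are illustrative. It compares the table's currently provisioned read capacity with its Application Auto Scaling maximum and publishes the ratio as a percentage.

  import boto3

  dynamodb = boto3.client('dynamodb')
  autoscaling = boto3.client('application-autoscaling')
  cloudwatch = boto3.client('cloudwatch')

  def publish_read_autoscaling_pct(table_name, namespace='Custom_DynamoDB'):
      # Currently provisioned read capacity on the table
      table = dynamodb.describe_table(TableName=table_name)['Table']
      provisioned_rcu = table['ProvisionedThroughput']['ReadCapacityUnits']

      # The auto scaling maximum registered for the table's read dimension
      targets = autoscaling.describe_scalable_targets(
          ServiceNamespace='dynamodb',
          ResourceIds=[f'table/{table_name}'],
          ScalableDimension='dynamodb:table:ReadCapacityUnits',
      )['ScalableTargets']
      if not targets:
          return  # auto scaling isn't configured for this table

      cloudwatch.put_metric_data(
          Namespace=namespace,
          MetricData=[{
              'MetricName': 'ProvisionedReadCapacityAutoScalingPct',
              'Dimensions': [{'Name': 'TableName', 'Value': table_name}],
              'Value': provisioned_rcu / targets[0]['MaxCapacity'] * 100,
              'Unit': 'Percent',
          }],
      )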

Metrics for each on-demand capacity table and GSI

Tables using on-demand capacity mode (PAY_PER_REQUEST billing mode) have less to monitor because there are no capacity settings to increase or decrease. The primary concern is whether the table is approaching the account's maximum limits for table-level reads and writes. The following table summarizes the DynamoDB metrics and recommended alarm configurations for each DynamoDB table and GSI using the PAY_PER_REQUEST billing mode.

| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Read consumption as a percentage of the table limit | SUM ConsumedReadCapacityUnits / MAXIMUM AccountMaxTableLevelReads | > 90% | Requires metric math |
| Write consumption as a percentage of the table limit | SUM ConsumedWriteCapacityUnits / MAXIMUM AccountMaxTableLevelWrites | > 90% | Requires metric math |

The following code is an example AWS CloudFormation template for the write consumption metric in the preceding table, which you can modify for read consumption. The consumed capacity metric accumulates request units over the 300-second period, whereas AccountMaxTableLevelWrites represents a per-second limit, so the expression divides by 300 to convert consumption to a per-second rate and keep the value in the range [0, 100].

  DynamoDBOnDemandTableWriteLimitAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmName: 'DynamoDBOnDemandTableWriteLimitAlarm'
      AlarmDescription: 'Alarm when consumed table writes approach the account limit'
      AlarmActions:
        - !Ref DynamoDBMonitoringSNSTopic
      Metrics:
        - Id: 'e1'
          Expression: '(((m1 / 300) / m2) * 100)'
          Label: TableWritesOverMaxWriteLimit
        - Id: 'm1'
          MetricStat:
            Metric:
              Namespace: 'AWS/DynamoDB'
              MetricName: 'ConsumedWriteCapacityUnits'
              Dimensions:
                - Name: 'TableName'
                  Value: !Ref DynamoDBOnDemandTableName
            Period: 300
            Stat: 'Sum'
            Unit: 'Count'
          ReturnData: False
        - Id: 'm2'
          MetricStat:
            Metric:
              Namespace: 'AWS/DynamoDB'
              MetricName: 'AccountMaxTableLevelWrites'
            Period: 300
            Stat: 'Maximum'
          ReturnData: False
      EvaluationPeriods: 2
      Threshold: 90
      ComparisonOperator: 'GreaterThanThreshold'
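
As a worked example under these assumptions: if the table consumes 9,000,000 write units during a 300-second period, m1 / 300 is 30,000 write units per second; against the default table-level limit of 40,000 writes per second reported by AccountMaxTableLevelWrites, the expression yields (30,000 / 40,000) × 100 = 75, below the 90% alarm threshold.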

Monitoring for DynamoDB global tables

DynamoDB global tables replicate data between tables in different Regions in a fully managed, multi-master fashion. With global tables using provisioned throughput, you must provision the same WCU settings across all the table replicas. Not doing so may result in a replica in one Region falling behind in replicating changes from another Region, which could cause that replica to diverge from the others. If your tables use auto scaling, all the replica tables should have the same auto scaling settings for a consistent experience. Tables using on-demand throughput don't have this concern.

It’s useful to know the replication latency to each AWS Region and to alert if that latency increases continually. A sustained increase might indicate an accidental misconfiguration in which the global table has different WCU settings in different Regions, leading to failed replicated requests and increased latencies. It could also indicate a Regional disruption. The actual latency depends on which Regions are involved (how geographically dispersed they are) and is subject to some amount of Regional fluctuation. Replication latencies longer than 3 minutes are generally cause for investigation; however, you should pick a number that makes sense for your use case and requirements.

The following table summarizes the DynamoDB global tables metrics and recommended alarm configurations for each of your global tables. You want to configure the alarm and dashboard in each Region participating in the global table.

| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Elevated replication latency between two Regions | ReplicationLatency | AVERAGE > 180,000 milliseconds (3 minutes) | Out of the box |

The following code is an example AWS CloudFormation template for the preceding metric. This alarm requires you to specify the receiving Region (referred to by the DynamoDBGlobalTableReceivingRegion parameter) for which you want to measure replication latency. If your global table has more than two participating Regions, you must set up multiple alarms in each Region.

  DynamoDBGTReplLatencyAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmName: 'DynamoDBGTReplLatencyAlarm'
      AlarmDescription: 'Alarm when global table replication latency exceeds 3 minutes (180k ms)'
      AlarmActions:
        - !Ref DynamoDBMonitoringSNSTopic
      Namespace: 'AWS/DynamoDB'
      MetricName: 'ReplicationLatency'
      Dimensions:
        - Name: 'TableName'
          Value: !Ref DynamoDBGlobalTableName
        - Name: 'ReceivingRegion'
          Value: !Ref DynamoDBGlobalTableReceivingRegion
      Statistic: 'Average'
      Threshold: 180000
      ComparisonOperator: 'GreaterThanThreshold'
      Period: 60
      EvaluationPeriods: 15

AWS Lambda users of DynamoDB Streams

Users who create Lambda functions triggered by changes on a DynamoDB table should alert when records sit too long in the DynamoDB stream without being processed by the Lambda function. This can be evidence of a defect in the function (such as an unhandled exception), or of a function that can't handle events quickly enough and therefore causes an ever-deepening backlog. When a system is optimized and performing well, DynamoDB Streams events should be handled within a few seconds. The following table summarizes the Lambda metrics and recommended alarm configurations for each of your Lambda functions that are triggered by DynamoDB Streams events.

| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Elevated age of events in the DynamoDB stream | IteratorAge | > 30,000 milliseconds (30 seconds) | Out of the box |

The following code is an example AWS CloudFormation template for the preceding metric. This alarm is based on your Lambda function name (referred to by the DynamoDBStreamLambdaFunctionName parameter).

  DynamoStreamLambdaIteratorAgeAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      AlarmName: 'DynamoStreamLambdaIteratorAgeAlarm'
      AlarmDescription: 'Alarm when Lambda iterator age exceeds 30 seconds (30k ms)'
      AlarmActions:
        - !Ref DynamoDBMonitoringSNSTopic
      Namespace: 'AWS/Lambda'
      MetricName: 'IteratorAge'
      Dimensions:
        - Name: 'FunctionName'
          Value: !Ref DynamoDBStreamLambdaFunctionName
        - Name: 'Resource'
          Value: !Ref DynamoDBStreamLambdaFunctionName
      Statistic: 'Average'
      Threshold: 30000
      ComparisonOperator: 'GreaterThanThreshold'
      Period: 60
      EvaluationPeriods: 2

Conclusion

Though there are many more metrics that you could monitor and receive alerts on, this post gives you a good starting point on your path to operationalizing DynamoDB.

Want more Amazon DynamoDB how-to content, news, and feature announcements? Follow us on Twitter.

About the Authors

Chad Tindel is a DynamoDB Specialist Solutions Architect based out of New York City. He works with large enterprises to evaluate, design, and deploy DynamoDB-based solutions. Prior to joining Amazon he held similar roles at Red Hat, Cloudera, MongoDB, and Elastic.

Pete Naylor is a DynamoDB Specialist Solutions Architect based in Seattle. Prior to this, he was a Technical Account Manager supporting Amazon as a customer of AWS, with a focus on database migrations and operational excellence at scale. His career background is systems engineering for high availability in geographically diverse tier 1 workloads.

Pratik Agarwal is a Software Development Engineer for Amazon DynamoDB who works on the resource governance team. He focuses primarily on IOPS management, which includes DynamoDB auto scaling, adaptive capacity, and on-demand capacity mode.

Ankur Kasliwal is a Technical Program Manager for Amazon DynamoDB. He helps innovate, simplify project development, and deliver results effectively and efficiently for customers. In addition, he provides architectural guidance for AWS services to internal and external customers, with a deep focus on solutions using Amazon DynamoDB.