Category: Amazon CloudWatch


Amazon CloudWatch Update – Extended Metrics Retention & User Interface Update

Amazon CloudWatch is a monitoring service for your AWS resources and for the applications that you run on AWS. It collects and tracks metrics, monitors log files, and allows you to set alarms and respond to changes in your AWS resources.

Today we are launching several important enhancements to CloudWatch:

  • Extended Metrics Retention – CloudWatch now stores all metrics for 15 months.
  • Simplified Metric Selection – The CloudWatch Console now makes it easier for you to find and select the metrics of interest.
  • Improved Metric Graphing – Graphing of selected metrics is now easier and more flexible.

Let’s take a look!

Extended Metrics Retention
When we launched CloudWatch back in 2009 (New Features for Amazon EC2: Elastic Load Balancing, Auto Scaling, and Amazon CloudWatch), system metrics were stored for 14 days. Later, when we gave you the ability to publish your own metrics to CloudWatch, the same retention period applied. Many AWS customers would like to access and visualize data over longer periods of time. They want to detect and understand seasonal factors, observe monthly growth trends, and perform year-over-year analysis.

In order to support these use cases (and many others that you’ll undoubtedly dream up), CloudWatch now stores all metrics for 15 months at no extra charge. In order to keep the overall volume of data reasonable, historical data is stored at a lower level of granularity, as follows:

  • One minute data points are available for 15 days.
  • Five minute data points are available for 63 days.
  • One hour data points are available for 455 days (15 months).
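To make the tiers concrete, here is a minimal sketch that looks up the finest granularity still available for a data point of a given age. The function name and the inclusive treatment of the boundary days are assumptions made for illustration; only the three tiers come from the list above.

```python
# Retention tiers from the list above: (period in seconds, retention in days).
RETENTION_TIERS = [
    (60, 15),     # one-minute data points
    (300, 63),    # five-minute data points
    (3600, 455),  # one-hour data points
]


def finest_period_seconds(age_days):
    """Return the smallest period still retained for data of the given age.

    Returns None when the data is older than the longest retention window.
    Boundary days are treated as inclusive (an assumption).
    """
    for period, retention_days in RETENTION_TIERS:
        if age_days <= retention_days:
            return period
    return None


if __name__ == "__main__":
    for age in (10, 30, 200, 500):
        print(age, finest_period_seconds(age))
```

A graph that reaches back 200 days would therefore have to be drawn from one-hour data points.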

In order to allow you to understand the value of this new feature first-hand, you can immediately access three months of retained metrics. Over the course of the next 12 months, metrics will continue to be retained up until the point where a full 15 months of history are stored.

As a first-hand illustration of the value of extended metrics retention, this long-term view of the Duration of my AWS Lambda functions tells me that something unusual is happening every two weeks:

Simplified Metric Selection
In order to make it easier for you to find and select the metrics that you would like to examine and graph, the CloudWatch Console now includes a clutter-free card-style view with a clear demarcation between AWS metrics and custom metrics:

The next step is to search for metrics of interest. For example, I’m interested in metrics with “CPU” in the name:

From there I can drill down, select a metric, and graph it. Here’s the CPU Utilization metric across all of my EC2 instances:

I can also show multiple metrics on the same graph:

This is not as illuminating as I would like; I’ll show you how to fix that in just a minute.

Improved Metric Graphing
With metrics now stored for 15 months and easier to find and to select, the next step is to make them easier to comprehend. This is where the new tabs (Graphed metrics and Graph options) come into play.

The Graphed metrics tab gives me detailed control over each metric:

The Actions column on the right lets me create alarms, duplicate a metric, or delete it from the graph.

I can click on any of the entries in the Statistic column to make a change:

The ability to duplicate a metric and then change a statistic lets me compare, for example, maximum and average CPU Utilization (I also changed the Period to 6 hours):

Now let’s take a look at the Y Axis control on the Graph options tab. In the last section I graphed seven EC2 metrics at the same time. Due to the variation in range, the graph was not as informative as it could be. I can now choose independent ranges for the left and right Y axes, and then assign metrics to the left or to the right:

I can edit the label for a metric by clicking on it:
I can also choose to set ranges for the left and the right Y axes:

I can use the new date picker to choose the time frame of interest to me. I can look at metrics for a desired absolute date range, or relative to the current date. Here’s how I choose an absolute date range:

And here’s how I choose a relative one:

I can rename a graph by clicking on the current name and entering a new one:

After I have fine-tuned the graph to my liking I can add it to an existing dashboard or create a new one:

Here’s one of my existing dashboards with the new graph at the bottom:

Available Now
The extended metric storage and all of the features that I described above are now available in the US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), EU (Frankfurt), South America (São Paulo), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Mumbai), and Asia Pacific (Sydney) Regions.

As I noted before, there is no extra charge for the extended metric storage.

Jeff;

 

New – CloudWatch Plugin for collectd

You have had the power to store your own business, application, and system metrics in Amazon CloudWatch for quite some time (see New – Custom Metrics for Amazon CloudWatch to learn more).  As I wrote way back in 2011 when I introduced this feature, “You can view graphs, set alarms, and initiate automated actions based on these metrics, just as you can for the metrics that CloudWatch already stores for your AWS resources.”

Today we are simplifying the process of collecting statistics from your system and getting them into CloudWatch with the introduction of a new CloudWatch plugin for collectd. By combining collectd’s ability to gather many different types of statistics with the CloudWatch features for storage, display, alerting, and alarming, you can become better informed about the state and performance of your EC2 instances and your on-premises hardware and the applications running on them. The plugin is being released as an open source project and we are looking forward to your pull requests.

The collectd daemon is written in C for performance and portability. It supports over one hundred plugins, allowing you to collect statistics on Apache and Nginx web server performance, memory usage, uptime, and much more.

Installation and Configuration
I installed and configured collectd and the new plugin on an EC2 instance in order to see it in action.

To get started I created an IAM Policy with permission to write metrics data to CloudWatch:

Then I created an IAM Role that allows EC2 (and hence the collectd code running on my instance) to use my Policy:
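The screenshots are not reproduced here, so the following is only a sketch of what such a policy and role could contain, not the exact documents I used. It assumes the smallest useful permission set (the cloudwatch:PutMetricData action) and a trust policy that lets the EC2 service assume the role; the variable names are illustrative.

```python
import json

# Permissions policy: allows the caller to publish metric data to CloudWatch.
put_metric_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["cloudwatch:PutMetricData"],
            "Resource": "*",
        }
    ],
}

# Trust policy for the role: lets EC2 assume the role on behalf of an instance.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

if __name__ == "__main__":
    # Print both documents as JSON, ready to paste into the IAM Console.
    print(json.dumps(put_metric_policy, indent=2))
    print(json.dumps(trust_policy, indent=2))
```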

If I were planning to use the plugin to collect statistics from my on-premises servers or if my EC2 instances were already running, I could have skipped these steps and created an IAM user with the appropriate permissions instead. Had I done this, I would have had to put the user’s credentials on the servers or instances.

With the Policy and the Role in place, I launched an EC2 instance and selected the Role:

I logged in and installed collectd:

$ sudo yum -y install collectd

Then I fetched the plugin and the install script, made the script executable, and ran it:

$ chmod a+x setup.py
$ sudo ./setup.py

I answered a few questions and the setup ran without incident, starting up collectd after configuring it:

Installing dependencies ... OK
Installing python dependencies ... OK
Copying plugin tar file ... OK
Extracting plugin ... OK
Moving to collectd plugins directory ... OK
Copying CloudWatch plugin include file ... OK

Choose AWS region for published metrics:
  1. Automatic [us-east-1]
  2. Custom
Enter choice [1]: 1

Choose hostname for published metrics:
  1. EC2 instance id [i-057d2ed2260c3e251]
  2. Custom
Enter choice [1]: 1

Choose authentication method:
  1. IAM Role [Collectd_PutMetricData]
  2. IAM User
Enter choice [1]: 1

Choose how to install CloudWatch plugin in collectd:
  1. Do not modify existing collectd configuration
  2. Add plugin to the existing configuration
Enter choice [2]: 2
Plugin configuration written successfully.
Stopping collectd process ... NOT OK
Starting collectd process ... OK
$

With collectd running and the plugin installed and configured, the next step was to decide on the statistics of interest and configure the plugin to publish them to CloudWatch (note that there is a per-metric cost so this is an important step).

The file /opt/collectd-plugins/cloudwatch/config/blocked_metrics contains a list of metrics that have been collected but not published to CloudWatch:

$ cat /opt/collectd-plugins/cloudwatch/config/blocked_metrics
# This file is automatically generated - do not modify this file.
# Use this file to find metrics to be added to the whitelist file instead.
cpu-0-cpu-user
cpu-0-cpu-nice
cpu-0-cpu-system
cpu-0-cpu-idle
cpu-0-cpu-wait
cpu-0-cpu-interrupt
cpu-0-cpu-softirq
cpu-0-cpu-steal
interface-lo-if_octets-
interface-lo-if_packets-
interface-lo-if_errors-
interface-eth0-if_octets-
interface-eth0-if_packets-
interface-eth0-if_errors-
memory--memory-used
load--load-
memory--memory-buffered
memory--memory-cached

I was interested in memory consumption so I added one line to /opt/collectd-plugins/cloudwatch/config/whitelist.conf:

memory--memory-.*
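To preview which entries a whitelist line would select, a quick check along these lines can help. It assumes that each whitelist line is a regular expression that must match the entire metric name; that is how the line above reads, but the plugin's exact matching rules are not described here, so treat this as an approximation.

```python
import re

# The single line added to whitelist.conf above.
whitelist_patterns = ["memory--memory-.*"]

# A few of the names from the blocked_metrics listing above.
blocked = [
    "cpu-0-cpu-user",
    "interface-eth0-if_octets-",
    "memory--memory-used",
    "load--load-",
    "memory--memory-buffered",
    "memory--memory-cached",
]


def is_whitelisted(name, patterns):
    """True if any pattern matches the whole metric name (assumed semantics)."""
    return any(re.fullmatch(pattern, name) for pattern in patterns)


published = [name for name in blocked if is_whitelisted(name, whitelist_patterns)]

if __name__ == "__main__":
    print(published)
```

With these inputs only the three memory metrics are selected; the cpu, interface, and load entries stay unpublished.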

The collectd configuration file (/etc/collectd.conf) contains additional settings for collectd and the plugins. I did not need to make any changes to it.

I restarted collectd so that it would pick up the change:

$ sudo service collectd restart

I exercised my instance a bit in order to consume some memory, and then opened up the CloudWatch Console to locate and display my metrics:

This screenshot includes a preview of an upcoming enhancement to the CloudWatch Console; don’t worry if yours doesn’t look as cool (stay tuned for more information on this).

If I had been monitoring a production instance, I could have installed one or more of the collectd plugins. Here’s a list of what’s available on the Amazon Linux AMI:

$ sudo yum list | grep collectd
collectd.x86_64                        5.4.1-1.11.amzn1               @amzn-main
collectd-amqp.x86_64                   5.4.1-1.11.amzn1               amzn-main
collectd-apache.x86_64                 5.4.1-1.11.amzn1               amzn-main
collectd-bind.x86_64                   5.4.1-1.11.amzn1               amzn-main
collectd-curl.x86_64                   5.4.1-1.11.amzn1               amzn-main
collectd-curl_xml.x86_64               5.4.1-1.11.amzn1               amzn-main
collectd-dbi.x86_64                    5.4.1-1.11.amzn1               amzn-main
collectd-dns.x86_64                    5.4.1-1.11.amzn1               amzn-main
collectd-email.x86_64                  5.4.1-1.11.amzn1               amzn-main
collectd-generic-jmx.x86_64            5.4.1-1.11.amzn1               amzn-main
collectd-gmond.x86_64                  5.4.1-1.11.amzn1               amzn-main
collectd-ipmi.x86_64                   5.4.1-1.11.amzn1               amzn-main
collectd-iptables.x86_64               5.4.1-1.11.amzn1               amzn-main
collectd-ipvs.x86_64                   5.4.1-1.11.amzn1               amzn-main
collectd-java.x86_64                   5.4.1-1.11.amzn1               amzn-main
collectd-lvm.x86_64                    5.4.1-1.11.amzn1               amzn-main
collectd-memcachec.x86_64              5.4.1-1.11.amzn1               amzn-main
collectd-mysql.x86_64                  5.4.1-1.11.amzn1               amzn-main
collectd-netlink.x86_64                5.4.1-1.11.amzn1               amzn-main
collectd-nginx.x86_64                  5.4.1-1.11.amzn1               amzn-main
collectd-notify_email.x86_64           5.4.1-1.11.amzn1               amzn-main
collectd-postgresql.x86_64             5.4.1-1.11.amzn1               amzn-main
collectd-rrdcached.x86_64              5.4.1-1.11.amzn1               amzn-main
collectd-rrdtool.x86_64                5.4.1-1.11.amzn1               amzn-main
collectd-snmp.x86_64                   5.4.1-1.11.amzn1               amzn-main
collectd-varnish.x86_64                5.4.1-1.11.amzn1               amzn-main
collectd-web.x86_64                    5.4.1-1.11.amzn1               amzn-main

Things to Know
If you are using version 5.5 or newer of collectd, four metrics are now published by default:

  • df-root-percent_bytes-used – disk utilization
  • memory--percent-used – memory utilization
  • swap--percent-used – swap utilization
  • cpu--percent-active – cpu utilization

You can remove these from the whitelist.conf file if you don’t want them to be published.

The primary repositories for the Amazon Linux AMI, Ubuntu, RHEL, and CentOS currently provide older versions of collectd; please be aware of this change in the default behavior if you install from a custom repo or build from source.

Lots More
There’s quite a bit more than I had time to show you. You can install more plugins and then configure whitelist.conf to publish even more metrics to CloudWatch. You can create CloudWatch Alarms, set up Custom Dashboards, and more.

To get started, visit AWS Labs on GitHub and download the CloudWatch plugin for collectd.

Jeff;

 

Improvements to CloudWatch Logs & Dashboards

Amazon CloudWatch helps you to see, diagnose, react to, and resolve issues that arise in your AWS infrastructure and in the applications that you run on AWS. Today, I would like to talk about several usability and functionality improvements to CloudWatch Logs (Store and Monitor OS & Application Log Files with Amazon CloudWatch) and to CloudWatch Dashboards (CloudWatch Dashboards – Create & Use Customized Metrics Views).

Usability Improvements to CloudWatch Logs
CloudWatch Logs is a highly available, scalable, durable, and secure service to manage your operating system and application log files. It allows you to ingest, store, filter, search, and archive the logs, reducing your operational burden and allowing you to focus on your application and your business.

In order to help you to stay efficient and productive even as the number and size of your logs grows, we have made several usability improvements to the CloudWatch Logs Console:

  • Improved formatting for log data.
  • Simplified access to lengthy log files.
  • Easier searching within a log group.
  • Simplified collaboration around log files.
  • Better searching within a specific time frame.

Prior to today’s launch we also made some improvements to the CloudWatch Dashboards:

  • Full screen mode.
  • Dark theme.
  • Control over range of the Y axis on charts.
  • Simplified renaming of charts.
  • Persistent storage of chart settings.

CloudWatch Logs Console in Action
Let’s take a look at each of these improvements!

Open up the CloudWatch Logs Console, click on a Log Group, and then on a Log Stream within the group. Find the View options menu on the right:

Click on Expand all in order to see the log messages in expanded, multi-line form like this:

You can also Switch to text view in order to see the logs in their unadorned, plain-text form:

We have also improved the display of log data across all streams within a log group. Once you select a Log Group and click Search Events you can see the log data from all streams within that log group. For example, I can easily identify the Billed Duration for multiple invocations of a single Lambda function:

Even better, we have replaced the original paginated view with an infinite scroll bar. You can now scroll to your heart’s content through log files of any length:

You can now refine your search to a specific time frame or to a custom date range with a single click, like this:

If you are working as part of a team, you can now share the URL of your log analysis session. The URL includes the search parameters and filters, and includes a fragment that looks like this:

group=<log_group_name>_log;stream=<log_stream_name>;filter=<filter_parameter>;start=PT<time_frame>
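If you want to generate or inspect such a fragment in a script, a sketch like the following splits it into its fields and puts it back together. The helper names are hypothetical, the values are treated as opaque strings (no URL encoding or decoding is attempted), and the sample value PT1H is only an illustrative guess at the time-frame format.

```python
def build_logs_fragment(group, stream, filter_text, start):
    """Join the four fields in the key=value;key=value layout shown above."""
    parts = [
        ("group", group),
        ("stream", stream),
        ("filter", filter_text),
        ("start", start),
    ]
    return ";".join(f"{key}={value}" for key, value in parts)


def parse_logs_fragment(fragment):
    """Split a fragment back into a dict, splitting each item on its first '='."""
    return dict(item.split("=", 1) for item in fragment.split(";"))


if __name__ == "__main__":
    fragment = build_logs_fragment("my-group", "my-stream", "ERROR", "PT1H")
    print(fragment)
    print(parse_logs_fragment(fragment))
```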

These improvements to the CloudWatch Logs Console are available now and you can start using them today. To learn more, read Getting Started with CloudWatch Logs.

Recent Improvements to the CloudWatch Dashboards
You may have already noticed the improvements that we recently made to the CloudWatch Dashboards. First, there’s a new full screen mode for Dashboards, accessible by clicking on Enter full screen in the Actions menu:

Once you are in full screen mode, you can click on Dark to switch to the new, night-owl-friendly dark theme:

Here’s a simple Redis dashboard in full screen mode using the dark theme:

Sometimes you want to have more control over how a chart is displayed on your dashboard. As an example, outliers in your data may make your chart less readable, and you may want to keep the dashboard focused on a specific Y axis range. Here’s a chart where that’s the case; the outlier masks the trend that happened after the big spike:

To edit the Y axis, click on the tool selector and select Edit:

Choose Graph Options and then edit the values for the Y axis until you are satisfied with the appearance of the chart, then click on Update widget:

Here’s what the chart looks like after that:

Many of our customers wanted to be able to rename a chart without leaving the dashboard. You can now do that with a click (hover your mouse near the name and then click on the pencil):

Finally, CloudWatch now remembers the time range, timezone preference, refresh interval, and auto-refresh setting for each chart!

Amazon CloudWatch Partner Ecosystem
I’d like to wrap things up by sharing some of the great work that our partners are doing. The following partners are building value-added solutions on top of CloudWatch:

  • Datadog provides integrations to key items in your infrastructure, and gives you the ability to collaborate with your team directly when dealing with incidents.
  • Librato provides integrations across elements of your infrastructure, and supports composite metrics and mathematical transformations to time series data.
  • SignalFx helps provide you with instant visibility into your metrics, and focuses on data analytics and on delivering alerts on service-wide patterns.
  • Splunk offers a platform for operational intelligence that enables you to collect machine data and find insights.
  • Sumo Logic is a machine data analytics service for log management and time series metrics that helps you build, run and secure your applications.
  • CloudCheckr integrates regular and custom metrics across your environment to assist in cost, security, and performance management by identifying right-sizing opportunities and alerting for potential security and performance issues.

If you are a partner and offer something that belongs on this list, let me know and I’ll update it ASAP!

Jeff;

 

 

EC2 Run Command Update – Monitor Execution Using Notifications

We launched EC2 Run Command late last year and have enjoyed seeing our customers put it to use in their cloud and on-premises environments. After the launch, we quickly added Support for Linux Instances, the power to Manage & Share Commands, and the ability to do Hybrid & Cross-Cloud Management. Earlier today we made EC2 Run Command available in the China (Beijing) and Asia Pacific (Seoul) Regions.

Our customers are using EC2 Run Command to automate and encapsulate routine system administration tasks. They are creating local users and groups, scanning for and then installing applicable Windows updates, managing services, checking log files, and the like. Because these customers are using EC2 Run Command as a building block, they have told us that they would like to have better visibility into the actual command execution process. They would like to know, quickly and often in detail, when each command and each code block in the command begins executing, when it completes, and how it completed (successfully or unsuccessfully).

In order to support this really important use case, you can now arrange to be notified when the status of a command or a code block within a command changes. In order to provide you with several different integration options, you can receive notifications via CloudWatch Events or via Amazon Simple Notification Service (SNS).

These notifications will allow you to use EC2 Run Command in true building block fashion. You can programmatically invoke commands and then process the results as they arrive. For example, you could create and run a command that captures the contents of important system files and metrics on each instance. When the command is run, EC2 Run Command will save the output in S3. Your notification handler can retrieve the object from S3, scan it for items of interest or concern, and then raise an alert if something appears to be amiss.

Monitoring Execution Using Amazon SNS
Let’s run a command on some EC2 instances and monitor the progress using SNS.

Following the directions (Monitoring Commands), I created an S3 bucket (jbarr-run-output), an SNS topic (command-status), and an IAM role (RunCommandNotifySNS) that allows the on-instance agent to send notifications on my behalf. I also subscribed my email address to the SNS topic, and entered the command:

And specified the bucket, topic, and role (further down on the Run a command page):

I chose All so that I would be notified of every possible status change (In Progress, Success, Timed Out, Cancelled, and Failed) and Invocation so that I would receive notifications as the status of each instance changes. I could have chosen to receive notifications at the command level (representing all of the instances) by selecting Command instead of Invocation.

I clicked on Run and received a sequence of emails as the commands were executed on each of the instances that I selected. Here’s a sample:

In a real-world environment you would receive and process these notifications programmatically.
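As a rough idea of what that processing could look like, here is a sketch of a handler for SNS-delivered notifications, written as if a Lambda function were subscribed to the topic. The Records / Sns / Message envelope is the standard SNS-to-Lambda shape; the field names inside the message (commandId, instanceId, status) and the status spellings are assumptions, so check them against a real notification before relying on them.

```python
import json

# Statuses that probably deserve a closer look (spellings are assumed).
PROBLEM_STATUSES = {"Failed", "TimedOut", "Cancelled"}


def summarize_notifications(sns_event):
    """Return a (command id, instance id, status) tuple for each SNS record."""
    results = []
    for record in sns_event.get("Records", []):
        # The notification itself arrives as a JSON string in the Message field.
        message = json.loads(record["Sns"]["Message"])
        results.append(
            (message.get("commandId"), message.get("instanceId"), message.get("status"))
        )
    return results


def needs_attention(status):
    """True when the status suggests that the command did not complete normally."""
    return status in PROBLEM_STATUSES


if __name__ == "__main__":
    sample = {
        "Records": [
            {"Sns": {"Message": json.dumps(
                {"commandId": "cmd-1", "instanceId": "i-123", "status": "Failed"})}}
        ]
    }
    for command_id, instance_id, status in summarize_notifications(sample):
        print(command_id, instance_id, status, needs_attention(status))
```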

Monitoring Execution Using CloudWatch Events
I can also monitor the execution of my commands using CloudWatch Events. I can send the notifications to an AWS Lambda function, an SQS queue, or an Amazon Kinesis stream.

For illustrative purposes, I used a very simple Lambda function:

I created a rule that would invoke the function for all notifications issued by the Run Command (as you can see below, I could have been more specific if necessary):
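Since the screenshots are not included, here is a sketch of what a catch-all rule pattern and a very simple function might look like. This is not the code shown in the original screenshots: the source value aws.ssm and the detail.status field are assumptions about how Run Command events are labeled.

```python
import json

# Event pattern for the rule: match every event from the assumed source.
RULE_PATTERN = {"source": ["aws.ssm"]}


def lambda_handler(event, context):
    """Log the incoming event and return the reported status, if any."""
    # Anything printed here ends up in the function's CloudWatch log.
    print(json.dumps(event))
    return event.get("detail", {}).get("status")


if __name__ == "__main__":
    print(lambda_handler({"source": "aws.ssm", "detail": {"status": "Success"}}, None))
```

A more specific rule would add further keys to the pattern, narrowing the set of events that reach the function.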

I saved the rule and ran another command, and then checked the CloudWatch metrics a few seconds later:

I also checked the CloudWatch log and inspected the output from my code:

Available Now
This feature is available now and you can start using it today.

Monitoring via SNS is available in all AWS Regions except Asia Pacific (Mumbai) and AWS GovCloud (US). Monitoring via CloudWatch Events is available in all AWS Regions except Asia Pacific (Mumbai), China (Beijing), and AWS GovCloud (US).

Jeff;

 

Amazon Kinesis Update – Amazon Elasticsearch Service Integration, Shard-Level Metrics, Time-Based Iterators

Amazon Kinesis makes streaming data easy in the cloud. The Amazon Kinesis platform consists of three distinct services: Kinesis Streams allows developers to build their own stream processing applications; Kinesis Firehose simplifies the process of loading streaming data into AWS for storage and analytics; Kinesis Analytics supports the analysis of streaming data using standard SQL queries.

Many AWS customers use Kinesis Streams and Kinesis Firehose as a component of their real-time streaming data ingestion and processing systems. They appreciate the ease of use that comes with a fully managed service, and invest their development time in their application instead of spending time managing their own streaming data infrastructure.

Today we are announcing three new features for Amazon Kinesis Streams and Amazon Kinesis Firehose:

  • Elasticsearch Integration – Amazon Kinesis Firehose can now stream data to an Amazon Elasticsearch Service cluster.
  • Enhanced Metrics – Amazon Kinesis now sends shard-level metrics to CloudWatch each minute.
  • Flexibility – Amazon Kinesis now allows you to retrieve records using time-based shard iterators.

Amazon Elasticsearch Service Integration
Elasticsearch is a popular open-source search and analytics engine. Amazon Elasticsearch Service is a managed service that makes it easy for you to deploy, run, and scale Elasticsearch in the AWS Cloud. You can now arrange to deliver your Kinesis Firehose data stream to an Amazon Elasticsearch Cluster. This will allow you to index and analyze server logs, clickstreams, and social media traffic.

The incoming records (Elasticsearch documents) are buffered in Kinesis Firehose according to a configuration that you specify, and then automatically added to the cluster using a bulk request that indexes multiple documents simultaneously. The data must be UTF-8 encoded and flattened into a single JSON object before it is sent to Firehose (see my recent blog post, Amazon Kinesis Agent Update – New Data Preprocessing Feature, to learn more about how to do this).
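If you are producing records from your own code rather than through the Kinesis Agent, the preparation step might look like this sketch. It reads "flattened" as "serialized on a single line", which is an assumption, and the function name is hypothetical.

```python
import json


def to_firehose_record(document):
    """Serialize a document as one-line JSON and encode it as UTF-8 bytes."""
    # Compact separators and no indent keep the output on a single line;
    # ensure_ascii=False keeps non-ASCII characters as real UTF-8.
    text = json.dumps(document, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_firehose_record({"user": "josé", "clicks": 3}))
```

Newlines inside string values are escaped by the JSON encoder, so the resulting bytes never contain a raw line break.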

Here’s how to set this up using the AWS Management Console. I choose the destination (Amazon Elasticsearch Service) and set the delivery stream name, then I choose one of my Elasticsearch domains (livedata in this example), set up the index, and choose the index rotation (none, hourly, daily, weekly, or monthly). I also designate an S3 bucket that will receive a backup of either all documents or failed documents (my choice):

Then I set the buffer size, choose some compression and encryption options for the data that will be sent to my S3 bucket, set up logging (if desired), and pick an appropriate IAM role:

The stream will be ready for use in a minute or so:

I can view the delivery metrics in the Console:

Once the data starts to arrive in Elasticsearch I can explore it visually using Kibana or by writing queries in the Elasticsearch query language.

Putting this all together, this integration greatly simplifies the process of capturing and delivering your streaming data to your Elasticsearch cluster. There’s no need to write any code or to build your own data ingestion tools.

Shard-Level Metrics
Each Kinesis stream is composed of one or more shards, each of which provides a fixed amount of read and write capacity. Each time you add a shard to a stream, you increase the capacity of the stream.

In order to provide you with increased visibility into the performance of each shard, you can now enable a set of shard-level metrics. There are seven metrics per shard, each reported once per minute and charged at the usual per-metric CloudWatch pricing. These metrics will allow you to see if a particular shard is running hotter than the others and to locate and root out any inefficiencies in your end-to-end streaming data delivery pipeline. For example, you can identify the shard(s) that are receiving records at a rate too high to handle and the shard(s) that are being read by applications at lower throughput than expected.

Here are the new metrics:

IncomingBytes – The number of bytes that have been successfully PUT to the shard.

IncomingRecords – The number of records that have been successfully PUT to the shard.

IteratorAgeMilliseconds – The age (in milliseconds) of the last record returned by a GetRecords call against a shard. A value of 0 means that the records being read are completely caught up with the stream.

OutgoingBytes – The number of bytes that have been retrieved from the shard.

OutgoingRecords – The number of records that have been retrieved from the shard.

ReadProvisionedThroughputExceeded – The number of GetRecords calls that have been throttled for exceeding the 5 reads per second or 2 MB per second shard limits.

WriteProvisionedThroughputExceeded – The number of records that have been rejected due to throttling for exceeding the 1000 records per second or 1 MB per second shard limits.

You can enable these metrics by calling the EnableEnhancedMonitoring function. As always, you can use the CloudWatch APIs to aggregate them across any desired time period.
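Here is a sketch of how the call could be made from Python. It assumes a boto3-style Kinesis client whose enable_enhanced_monitoring method takes StreamName and ShardLevelMetrics; the client is passed in as a parameter so that the helper (a hypothetical name) can be exercised without AWS credentials.

```python
# The shard-level metric names listed above.
SHARD_LEVEL_METRICS = [
    "IncomingBytes",
    "IncomingRecords",
    "IteratorAgeMilliseconds",
    "OutgoingBytes",
    "OutgoingRecords",
    "ReadProvisionedThroughputExceeded",
    "WriteProvisionedThroughputExceeded",
]


def enable_shard_metrics(kinesis_client, stream_name, metrics=SHARD_LEVEL_METRICS):
    """Ask Kinesis to start publishing the given shard-level metrics."""
    return kinesis_client.enable_enhanced_monitoring(
        StreamName=stream_name,
        ShardLevelMetrics=list(metrics),
    )


# Example use (requires boto3 and AWS credentials; the stream name is a placeholder):
#   import boto3
#   enable_shard_metrics(boto3.client("kinesis"), "my-stream")
```

Because each metric is billed per shard, passing a shorter list is a simple way to keep costs down on streams with many shards.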

Time-Based Iterators
Your application reads data from a Kinesis stream by creating an iterator on the desired shard using the GetShardIterator function and specifying the desired starting point. In addition to the existing starting point options (at or after a sequence number, oldest record, or newest record) you can now specify a timestamp. The value (specified in Unix epoch format) indicates the timestamp of the oldest record that you would like to read and process.
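A sketch of the request parameters for a timestamp-based iterator follows. It assumes the boto3-style parameter names for GetShardIterator and the AT_TIMESTAMP iterator type; the stream name and shard ID are placeholders.

```python
from datetime import datetime, timezone


def iterator_request(stream_name, shard_id, start_time):
    """Build GetShardIterator parameters that start reading at a timestamp."""
    return {
        "StreamName": stream_name,
        "ShardId": shard_id,
        "ShardIteratorType": "AT_TIMESTAMP",
        "Timestamp": start_time,
    }


# Start reading at midnight UTC on January 1, 2016.
start = datetime(2016, 1, 1, tzinfo=timezone.utc)
params = iterator_request("my-stream", "shardId-000000000000", start)

if __name__ == "__main__":
    # The same instant expressed as seconds since the Unix epoch.
    print(params["ShardIteratorType"], int(start.timestamp()))
```

With boto3 installed and credentials configured, the dictionary can be passed along as boto3.client("kinesis").get_shard_iterator(**params).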

Jeff;

 

Using Enhanced RDS Monitoring with Datadog

Today’s guest post comes from K Young, Director of Strategic Initiatives at Datadog!

Jeff;


AWS recently announced enhanced monitoring for Amazon RDS instances running MySQL, MariaDB, and Aurora. Enhanced monitoring includes over 50 new CPU, memory, file system, and disk I/O metrics which can be collected on a per-instance basis as frequently as once per second.

AWS and Datadog
AWS worked closely with Datadog to help customers send this new high-resolution data to Datadog for monitoring. Datadog is an infrastructure monitoring platform that is very popular with AWS customers—you can see historical trends with full granularity and also visualize and alert on live data from any part of your stack.

With a few minutes of work your enhanced RDS metrics will immediately begin populating a pre-built, customizable dashboard in Datadog:

Connect RDS and Datadog
The first step is to send enhanced RDS metrics to CloudWatch Logs. You can enable the metrics during instance creation, or on an existing RDS instance by selecting it in the RDS Console and then choosing Instance Options, then Modify:

Set Granularity to 1–60 seconds; every 15 seconds is often a good choice. Once enabled, enhanced metrics will be sent to CloudWatch Logs.

The second step is to send the CloudWatch Log data to Datadog. Begin by setting up a Lambda function to process the logs and send the metrics:

  1. Create a role for your Lambda function. Name it something like lambda-datadog-enhanced-rds-collector and select AWS Lambda as the role type.
  2. From the Encryption Keys tab on the IAM Management Console, create a new encryption key. Enter an Alias for the key like lambda-datadog-key. On the next page, add the appropriate administrators for the key. Next you’ll be prompted to add users to the key. Add at least two: yourself (so that you can encrypt the Datadog API key from the AWS CLI in the next step), and the role created above, e.g. lambda-datadog-enhanced-rds-collector (so that it can decrypt the API key and submit metrics to Datadog). Finish creating the key.
  3. Encrypt the token using the AWS Command Line Interface (CLI), providing the Alias of your just-created key (e.g. lambda-datadog-key) as well as your Datadog keys, available here. Use KMS to encrypt your key, like this:
    $ aws kms encrypt --key-id alias/ALIAS_KEY_NAME --plaintext '{"api_key":"DATADOG_API_KEY", "app_key":"DATADOG_APP_KEY"}'

    Save the output of this command; you will need it for the next step.

  4. From the Lambda Management Console, create a new Lambda Function. Filter blueprints by datadog, and select the datadog-process-rds-metrics blueprint.
  5. Choose RDSOSMetrics from the Log Group dropdown, enter the Filter Name of your choice, and go to the next page. If you have not yet enabled enhanced monitoring, you must do so before RDSOSMetrics will be presented as an option (see the instructions under Connect RDS and Datadog above):
  6. Give your function a name like send-enhanced-rds-to-datadog. In the Lambda function code area, replace the string after KMS_ENCRYPTED_KEYS with the ciphertext blob part of the CLI command output above.
  7. Under Lambda function handler and role, choose the role you created in step 1, e.g. lambda-datadog-enhanced-rds-collector. Go to the next page, select the Enable Now radio button, and create your function.
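The blueprint's code is not reproduced here. As a rough illustration of the first thing any Lambda function subscribed to a CloudWatch Logs group has to do, this sketch unpacks the incoming event, which arrives base64-encoded and gzip-compressed under awslogs.data. The assumption that each RDSOSMetrics log message is itself a JSON document, and the helper names, are for illustration only.

```python
import base64
import gzip
import json


def decode_logs_event(event):
    """Decode the base64 + gzip payload that CloudWatch Logs hands to Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def extract_messages(event):
    """Parse each log message as JSON (assumed format for RDSOSMetrics)."""
    payload = decode_logs_event(event)
    return [json.loads(entry["message"]) for entry in payload.get("logEvents", [])]


if __name__ == "__main__":
    # Build a fake event the same way CloudWatch Logs would package it.
    body = {"logGroup": "RDSOSMetrics",
            "logEvents": [{"message": json.dumps({"cpu": 12.5})}]}
    data = base64.b64encode(gzip.compress(json.dumps(body).encode("utf-8")))
    print(extract_messages({"awslogs": {"data": data.decode("ascii")}}))
```

A real collector would go on to decrypt the Datadog keys with KMS and forward the parsed values; those steps are left out of this sketch.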

That’s It
Once you have enabled RDS in Datadog’s AWS integration tile, Datadog will immediately begin displaying your enhanced RDS metrics. Your RDS instances will be individually identifiable in Datadog via automatically-created tags of the form dbinstanceidentifier:YOUR_DB_INSTANCE_NAME, as well as any tags you added through the RDS console.

You can clone the pre-built dashboard and customize it however you want: add RDS metrics that are not displayed by default, or start correlating RDS metrics with the performance of the rest of your stack.

— K Young, Director of Strategic Initiatives

 

New CloudWatch Events – Track and Respond to Changes to Your AWS Resources

When you pull the curtain back on an AWS-powered application, you’ll find that a lot is happening behind the scenes. EC2 instances are launched and terminated by Auto Scaling policies in response to changes in system load, Amazon DynamoDB tables, Amazon SNS topics and Amazon SQS queues are created and deleted, and attributes of existing resources are changed from the AWS Management Console, the AWS APIs, or the AWS Command Line Interface (CLI).

Many of our customers build their own high-level tools to track, monitor, and control the overall state of their AWS environments. Up until now, these tools have worked in a polling fashion. In other words, they periodically call AWS functions such as DescribeInstances, DescribeVolumes, and ListQueues to list the AWS resources of various types (EC2 instances, EBS volumes, and SQS queues here) and to track their state. Once they have these lists, they need to call other APIs to get additional state information for each resource, compare it against historical data to detect changes, and then take action as they see fit. As their systems grow larger and more complex, all of this polling and state tracking can become onerous.
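To make the polling burden concrete, here is a minimal sketch of the change-detection step that such a tool has to run after every poll. The snapshot shape (a resource ID mapped to a state string) and the function name diffSnapshots are hypothetical; a real tool would build the snapshots from calls such as DescribeInstances.

```javascript
'use strict';

// Compare two polled snapshots (resource ID -> state) and report what changed.
// The snapshot shape is an assumption made for this illustration.
function diffSnapshots(previous, current) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  const changes = [];

  for (const id of Object.keys(current)) {
    if (!has(previous, id)) {
      changes.push({ id: id, change: 'created', state: current[id] });
    } else if (previous[id] !== current[id]) {
      changes.push({ id: id, change: 'state-changed', from: previous[id], to: current[id] });
    }
  }

  for (const id of Object.keys(previous)) {
    if (!has(current, id)) {
      changes.push({ id: id, change: 'deleted', state: previous[id] });
    }
  }

  return changes;
}

// Two consecutive polls with made-up instance IDs.
const lastPoll = { 'i-111': 'pending', 'i-222': 'running' };
const thisPoll = { 'i-111': 'running', 'i-333': 'pending' };
console.log(diffSnapshots(lastPoll, thisPoll));
```

Every poller has to keep the previous snapshot around and repeat this comparison for each resource type, which is the overhead that an event stream removes.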

New CloudWatch Events
In order to allow you to track changes to your AWS resources with less overhead and greater efficiency, we are introducing CloudWatch Events today.

CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules that you can set up in a couple of minutes, you can easily route each type of event to one or more targets: AWS Lambda functions, Amazon Kinesis streams, Amazon SNS topics, and built-in targets.

You can think of CloudWatch Events as the central nervous system for your AWS environment. It is wired into every nook and cranny of the supported services, and becomes aware of operational changes as they happen. Then, driven by your rules, it activates functions and sends messages (activating muscles, if you will) to respond to the environment, making changes, capturing state information, or taking corrective action.

We are launching CloudWatch Events with an initial set of AWS services and events today, and plan to support many more over the next year or so.

Diving in to CloudWatch Events
The three main components that you need to know about are events, rules, and targets.

Events (represented as small blobs of JSON) are generated in four ways. First, they arise from within AWS when resources change state. For example, an event is generated when the state of an EC2 instance changes from pending to running or when Auto Scaling launches an instance. Second, events are generated by API calls and console sign-ins that are delivered to Amazon CloudWatch Events via CloudTrail. Third, your own code can generate application-level events and publish them to Amazon CloudWatch Events for processing. Fourth, they can be issued on a scheduled basis, with options for periodic or Cron-style scheduling.

Rules match incoming events and route them to one or more targets for processing. Rules are not processed in any particular order; all of the rules that match an event will be processed (this allows disparate parts of a single organization to independently look for and process events that are of interest).
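The routing behavior can be sketched in a few lines. This is a simplified model, not the service's implementation: the matcher handles only exact values and nested objects, and the rule names, target names, and the helper functions matchesPattern and route are all hypothetical.

```javascript
'use strict';

// Simplified matcher: every key in the pattern must be satisfied by the event.
// An array lists the acceptable values for a field; a nested object is
// matched recursively against the corresponding nested object in the event.
function matchesPattern(pattern, event) {
  return Object.keys(pattern).every((key) => {
    const expected = pattern[key];
    const actual = event[key];
    if (Array.isArray(expected)) {
      return expected.includes(actual);
    }
    return typeof actual === 'object' && actual !== null && matchesPattern(expected, actual);
  });
}

// Hypothetical rules owned by different teams. Their order carries no meaning.
const rules = [
  { name: 'audit-all-ec2', pattern: { source: ['aws.ec2'] }, targets: ['audit-stream'] },
  { name: 'notify-on-stop', pattern: { source: ['aws.ec2'], detail: { state: ['stopped'] } }, targets: ['ops-topic'] },
  { name: 'autoscaling-only', pattern: { source: ['aws.autoscaling'] }, targets: ['scaling-log'] },
];

// Collect the targets of every rule that matches, not just the first one.
function route(event) {
  return rules
    .filter((rule) => matchesPattern(rule.pattern, event))
    .flatMap((rule) => rule.targets);
}

const sampleEvent = { source: 'aws.ec2', detail: { 'instance-id': 'i-abcd1111', state: 'stopped' } };
console.log(route(sampleEvent)); // [ 'audit-stream', 'ops-topic' ]
```

Both EC2 rules fire for the same event, which is how two teams can each process it without coordinating with one another.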

Targets process events and are specified within rules. There are four initial target types: built-in, Lambda functions, Kinesis streams, and SNS topics, with more types on the drawing board. A single rule can specify multiple targets. Each event is passed to each target in JSON form. Each rule has the opportunity to customize the JSON that flows to the target. It can elect to pass the event as-is, to pass only certain keys (and the associated values), or to pass a constant (literal) string.
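The three customization choices can be illustrated with a small sketch. The option shapes and the function name buildTargetInput are invented for this example and do not mirror the service's actual parameters; the point is only to show what each choice hands to the target.

```javascript
'use strict';

// Hypothetical option shapes:
//   { mode: 'as-is' }                            pass the whole event
//   { mode: 'keys', keys: ['source', 'detail'] } pass selected top-level keys
//   { mode: 'constant', value: 'some text' }     pass a literal string
function buildTargetInput(event, option) {
  switch (option.mode) {
    case 'as-is':
      return JSON.stringify(event);
    case 'keys': {
      const subset = {};
      for (const key of option.keys) {
        if (Object.prototype.hasOwnProperty.call(event, key)) {
          subset[key] = event[key];
        }
      }
      return JSON.stringify(subset);
    }
    case 'constant':
      return option.value;
    default:
      throw new Error('Unknown input mode: ' + option.mode);
  }
}

// A made-up event with three top-level keys.
const exampleEvent = { source: 'aws.ec2', region: 'us-east-1', detail: { state: 'running' } };

console.log(buildTargetInput(exampleEvent, { mode: 'as-is' }));
console.log(buildTargetInput(exampleEvent, { mode: 'keys', keys: ['source', 'detail'] }));
console.log(buildTargetInput(exampleEvent, { mode: 'constant', value: 'An instance started' }));
```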

CloudWatch Events in Action
Let’s go ahead and set up a rule or two! I’ll use a simple Lambda function called SomethingHappened. It will simply log the contents of the event:
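A function of this kind might look like the sketch below. It is not the exact code from the console screenshot; it uses the callback form of the Node.js handler, and the sample event passed to it at the end is made up so that the function can be exercised locally.

```javascript
'use strict';

// Log the incoming event in full, then report success through the callback.
function handler(event, context, callback) {
  console.log('Received event:', JSON.stringify(event, null, 2));
  callback(null, 'Logged event from ' + event.source);
}

// Lambda looks up the handler by its exported name.
exports.handler = handler;

// Local invocation with a made-up event, for illustration only.
handler({ source: 'aws.ec2', detail: { state: 'running' } }, {}, (err, result) => {
  console.log(result);
});
```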

Next, I switch to the new CloudWatch Events Console, click on Create rule and choose an event source (here’s the menu with all of the choices):

Just a quick note before going forward. Some of the AWS services fire events directly. Others are fired based on the events logged to CloudTrail; you’ll need to enable CloudTrail for the desired service(s) in order to receive them.

I want to keep tabs on my EC2 instances, so I choose EC2 from the menu. I can choose to create a rule that fires on any state transition, or on a transition to one or more states that are of interest:

I want to know about newly launched instances, so I’ll choose Running. I can make the rule respond to any of my instances in the region, or to specific instances. I’ll go with the first option; here’s my pattern:
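For readers without the screenshot, a pattern for this choice would look roughly like the following. It is written here as a JavaScript object and assumes the documented field names for EC2 state-change events; treat it as a sketch rather than a copy of what the console generates.

```javascript
'use strict';

// Event pattern for "any EC2 instance in the region entering the running state".
const runningInstancePattern = {
  source: ['aws.ec2'],
  'detail-type': ['EC2 Instance State-change Notification'],
  detail: {
    state: ['running'],
  },
};

// The rule stores the pattern as JSON text.
console.log(JSON.stringify(runningInstancePattern, null, 2));
```

Restricting the rule to specific instances would mean adding a list of instance IDs under detail, next to state.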

Now I need to make something happen. I do this by picking a target. Again, here are my choices:

I simply choose Lambda and pick my function:

I’m almost there! I just need to name and describe my rule, and then click on Create rule:

With that, the rule is all set to go:

Now I can test it by launching an EC2 instance. In fact, I’ll launch 5 of them just to exercise my code! After waiting a minute or so for the instances to launch and to initialize, I can check my Lambda metrics to verify that my function was invoked:

This looks good (the earlier invocations were for testing). Then I can visit the CloudWatch logs to view the output from my function:

As you can see, the event contains essential information about the newly launched instance. Your code can call AWS functions in order to learn more about what’s going on. For example, you could call DescribeInstances to access more information about newly launched instances.

Clearly, a “real” function would do something a lot more interesting. It could add some mandatory tags to the instance, update a dynamic visualization, or send me a text message via SNS. If you want to do any (or all) of these things, you would need to have a more permissive IAM role for the function, of course. I could make the rule more general (or create another one) if I wanted to capture some of the other state transitions.

Scheduled Execution of Rules
I can also set up a rule that fires periodically or according to a pattern described in a Cron expression. Here’s how I would do that:
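Schedules are written as short expressions in one of two forms, rate(...) or cron(...). The helpers below are a hypothetical sketch for building them safely; the function names are invented, and the validation covers only the basics (singular versus plural units, and the six cron fields).

```javascript
'use strict';

// Build a rate expression. A value of 1 takes the singular unit
// (rate(1 hour)); any other value takes the plural (rate(5 minutes)).
function rateExpression(value, unit) {
  const units = ['minute', 'hour', 'day'];
  if (!Number.isInteger(value) || value < 1) {
    throw new RangeError('value must be a positive integer');
  }
  if (!units.includes(unit)) {
    throw new RangeError('unit must be one of: ' + units.join(', '));
  }
  return 'rate(' + value + ' ' + unit + (value === 1 ? '' : 's') + ')';
}

// Build a cron expression from its six fields, in this order:
// minutes, hours, day-of-month, month, day-of-week, year.
function cronExpression(fields) {
  if (!Array.isArray(fields) || fields.length !== 6) {
    throw new RangeError('a cron expression needs exactly six fields');
  }
  return 'cron(' + fields.join(' ') + ')';
}

console.log(rateExpression(5, 'minute'));                     // rate(5 minutes)
console.log(cronExpression(['0', '12', '*', '*', '?', '*'])); // cron(0 12 * * ? *)
```

The second expression fires at 12:00 UTC every day; the ? in the day-of-week position means "no specific value".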

You might find it interesting to know that this is the underlying mechanism used to set up scheduled Lambda jobs, as announced at AWS re:Invent.

API Access
As with most AWS services, you can access CloudWatch Events through an API. Here are some of the principal functions:

  • PutRule to create a new rule.
  • PutTargets and RemoveTargets to connect targets to rules, and to disconnect them.
  • ListRules, ListTargetsByRule, and DescribeRule to find out more about existing rules.
  • PutEvents to submit a set of events to CloudWatch Events. You can use this function (or the CLI equivalent) to submit application-level events.
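As an illustration of the last item, here is a sketch that builds the request body for a PutEvents call. The source name, detail type, and payload are made up, and the helper buildPutEventsRequest is hypothetical; actually sending the request would require the AWS SDK, which is left out so that the example stays self-contained.

```javascript
'use strict';

// Build the request body for PutEvents. Note that Detail is a JSON string,
// not an object. The caller supplies its own source and detail-type names.
function buildPutEventsRequest(source, detailType, detail) {
  return {
    Entries: [
      {
        Source: source,
        DetailType: detailType,
        Detail: JSON.stringify(detail),
        Time: new Date(),
      },
    ],
  };
}

// A made-up application-level event.
const putEventsRequest = buildPutEventsRequest(
  'com.example.orders',
  'OrderPlaced',
  { orderId: 'A-1001', total: 42.5 }
);
console.log(JSON.stringify(putEventsRequest, null, 2));
```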

Metrics for Events
CloudWatch Events reports a number of metrics to CloudWatch, all within the AWS/Events namespace. You can use these metrics to verify that your rules are firing as expected, and to track the overall activity level of your rule collection.

The following metrics are reported for the service as a whole:

  • Invocations – The number of times that targets have been invoked.
  • FailedInvocations – The number of times that an invocation of a target failed.
  • MatchedEvents – The number of events that matched one or more rules.
  • TriggeredRules – The number of rules that have been triggered.

The following metrics are reported for each rule:

  • Invocations – The number of times that the rule’s targets have been invoked.
  • TriggeredRules – The number of times that the rule has been triggered.
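To check one rule's activity, you could query these metrics with GetMetricStatistics. The sketch below only builds the request parameters; it assumes that the per-rule dimension is named RuleName, and the helper invocationsQuery and the rule name are hypothetical.

```javascript
'use strict';

// Build GetMetricStatistics parameters for one rule's Invocations metric,
// covering the given number of hours up to "now".
function invocationsQuery(ruleName, hours, now) {
  const end = now || new Date();
  const start = new Date(end.getTime() - hours * 60 * 60 * 1000);
  return {
    Namespace: 'AWS/Events',
    MetricName: 'Invocations',
    Dimensions: [{ Name: 'RuleName', Value: ruleName }], // assumed dimension name
    StartTime: start,
    EndTime: end,
    Period: 300, // five-minute buckets
    Statistics: ['Sum'],
  };
}

console.log(invocationsQuery('my-ec2-running-rule', 3));
```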

In the Works
As with many emerging AWS services, we are launching CloudWatch Events with an initial set of features (and a lot of infrastructure behind the scenes) and some really big plans, including AWS CloudFormation support. We’ll adjust our plans based on your feedback, but you can expect coverage of many more AWS services and access to additional targets over time. I’ll do my best to keep you informed.

Getting Started
We are launching CloudWatch Events in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo) regions. It is available now and you can start using it today!

Jeff;

CloudWatch Dashboards – Create & Use Customized Metrics Views

Amazon CloudWatch monitors your AWS cloud resources and your cloud-powered applications. It tracks the metrics so that you can visualize and review them. You can also set alarms that will fire when a metric goes beyond a limit that you specified. CloudWatch gives you visibility into resource utilization, application performance, and operational health.

New CloudWatch Dashboards
Today we are giving you the power to build customized dashboards for your CloudWatch metrics. Each dashboard can display multiple metrics, and can be accessorized with text and images. You can build multiple dashboards if you’d like, each one focusing on providing a distinct view of your environment. You can even pull data from multiple regions into a single dashboard in order to create a global view.

Let’s build one!

Building a Dashboard
I open up the CloudWatch Console and click on Create dashboard to get started. Then I enter a name:

Then I add my first “Widget” (a graph or some text) to my dashboard. I’ll display some metrics using a line graph:

Now I need to choose the metric. This is a two step process. First I choose by category:

I clicked on EC2 Metrics. Now I can choose one or more metrics and create the widget. I sorted the list by Metric Name, selected all of my EC2 instances, and clicked on the Create widget button (not shown in the screen shot):

As I noted earlier, you can access and make use of metrics drawn from multiple AWS regions; this means that you can create a single global status dashboard for your complex, multi-region applications and deployments.

And here’s my dashboard:

I can resize the graph, and I can also interact with it. For example, I can focus on a single instance with a click (this will also highlight the other metrics from that instance on the dashboard):

I can add several widgets. The vertical line allows me to look for correlation between metrics that are shown in different widgets:

The graphs can be linked or independent with respect to zooming (the Actions menu allows me to choose which option I want). I can click and drag on a desired time-frame and all of the graphs will zoom (if they are linked) when I release the mouse button:

The Actions menu allows me to reset the zoom and to initiate many other operations on my dashboards:

I can also add static text and images to my dashboard by using a text widget. The contents of the widget are specified in GitHub Flavored Markdown:

Here’s my stylish new dashboard:

Text widgets can also include buttons and tables. I can link to help pages, troubleshooting guides, internal and external status pages, phone directories, and so forth.

I can create several dashboards and switch between them with a click:

I can also create a link that takes me from one dashboard to another one:

I can also control the time range for the graphs, and I can enable automatic refresh, with fine-grained control of both:

Dashboard Ownership and Access
The individual dashboards are stored at the AWS account level and can be accessed by IAM users within the account. However, in many cases administrators will want to set up dashboards for use across the organization in a controlled fashion.

In order to support this important scenario, IAM permissions on a pair of CloudWatch functions can be used to control the ability to see metrics and to modify dashboards. Here’s how it works:

  • If an IAM user has permission to call PutMetricData, they can create, edit, and delete dashboards.
  • If an IAM user has permission to call GetMetricStatistics, they can view dashboard content.
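The two cases above could be expressed as IAM policy documents along these lines. This is a sketch built as a JavaScript object: the helper dashboardPolicy is hypothetical, and a real policy would probably grant other CloudWatch actions as well.

```javascript
'use strict';

// Build a policy document for a dashboard viewer or a dashboard editor.
// Viewers get GetMetricStatistics; editors additionally get PutMetricData.
function dashboardPolicy(canEdit) {
  const actions = ['cloudwatch:GetMetricStatistics'];
  if (canEdit) {
    actions.push('cloudwatch:PutMetricData');
  }
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: actions,
        Resource: '*',
      },
    ],
  };
}

console.log(JSON.stringify(dashboardPolicy(false), null, 2)); // view only
console.log(JSON.stringify(dashboardPolicy(true), null, 2));  // view and edit
```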

Available Now
CloudWatch Dashboards are available now and you can start using them today in all AWS regions! You can create up to three dashboards (each with up to 50 metrics) at no charge. After that, each additional dashboard costs $3 per month.
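The pricing works out as in the short sketch below. It covers only the per-dashboard charge described above (three free, then $3 each per month) and ignores the per-dashboard metric limit; the function name is hypothetical.

```javascript
'use strict';

// Monthly dashboard charge in US dollars: the first three are free,
// and each additional dashboard costs $3.
function monthlyDashboardCost(dashboardCount) {
  const FREE_DASHBOARDS = 3;
  const PRICE_PER_EXTRA_DASHBOARD = 3;
  return Math.max(0, dashboardCount - FREE_DASHBOARDS) * PRICE_PER_EXTRA_DASHBOARD;
}

console.log(monthlyDashboardCost(3)); // 0
console.log(monthlyDashboardCost(5)); // 6
```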

Share Your Dashboards
I am looking forward to seeing examples of this feature in action. Take it for a spin and let me know what you come up with!

Jeff;

New Metrics for EC2 Container Service: Clusters & Services

The Amazon EC2 Container Service helps you to build, run, and scale Docker-based applications. As I noted in an earlier post (EC2 Container Service – Latest Features, Customer Successes, and More), you will benefit from easy cluster management, high performance, flexible scheduling, extensibility, portability, and AWS integration while running in an AWS-powered environment that is secure and efficient.

Container-based applications are built from tasks. A task is one or more Docker containers that run together on the same EC2 instance; instances are grouped into a cluster. The instances form a pool of resources that can be used to run tasks.

This model creates some new measuring and monitoring challenges. In order to keep the cluster at an appropriate size (not too big and not too small), you need to watch memory and CPU utilization for the entire cluster rather than for individual instances. This becomes even more challenging when a single cluster contains EC2 instances with varied amounts of compute power and memory.

New Cluster Metrics
In order to allow you to properly measure, monitor, and scale your clusters, we are introducing new metrics that are collected from individual instances, normalized based on the instance size and the container configuration, and then reported to Amazon CloudWatch. You can observe the metrics in the AWS Management Console and you can use them to drive Auto Scaling activities.

The ECS Container Agent runs on each of the instances. It collects the CPU and memory metrics at the instance and task level, and sends them to a telemetry service for normalization. The normalization process creates blended metrics that represent CPU and memory usage for the entire cluster. These metrics give you a picture of overall cluster utilization.

Let’s take a look! My cluster is named default and it has one t2.medium instance:

At this point no tasks are running and the cluster is idle:

I ran two tasks (as a service) with the expectation that they would consume all of the CPU:

I took a short break to water my garden while the tasks burned some CPU and the metrics accumulated! I came back and here’s what the CPU Utilization looked like:

 

Then I launched another t2.medium instance into my cluster, and checked the utilization again. The additional processing power reduced the overall utilization to 50%:
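The arithmetic behind that 50% figure can be sketched as follows. This is a simplified model of the blended metric, not the telemetry service's actual normalization: it assumes 1,024 CPU units per vCPU, two vCPUs per t2.medium, and that utilization is the CPU in use divided by the CPU registered across the cluster.

```javascript
'use strict';

// Assumed convention: 1,024 CPU units per vCPU; a t2.medium has 2 vCPUs.
const T2_MEDIUM_CPU_UNITS = 2 * 1024;

// Blended cluster CPU utilization, as a percentage of all registered CPU units.
function clusterCpuUtilization(instances) {
  const registered = instances.reduce((sum, i) => sum + i.registeredCpuUnits, 0);
  const used = instances.reduce((sum, i) => sum + i.usedCpuUnits, 0);
  return registered === 0 ? 0 : (used / registered) * 100;
}

// One instance, fully busy.
const oneInstance = [
  { registeredCpuUnits: T2_MEDIUM_CPU_UNITS, usedCpuUnits: T2_MEDIUM_CPU_UNITS },
];

// The same load after a second, idle instance joins the cluster.
const twoInstances = oneInstance.concat([
  { registeredCpuUnits: T2_MEDIUM_CPU_UNITS, usedCpuUnits: 0 },
]);

console.log(clusterCpuUtilization(oneInstance));  // 100
console.log(clusterCpuUtilization(twoInstances)); // 50
```

Because the denominator is the whole cluster, the same calculation still works when the instances differ in size.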

 

The new metrics (CPUUtilization and MemoryUtilization) are available via CloudWatch and can also be used to create alarms. Here’s how to find them:

New Service Metrics
Earlier this year we announced that the EC2 Container Service supports long-running applications and load balancing. The Service scheduler allows you to manage long-running applications and services by keeping them healthy and scaled to the desired level. CPU and memory utilization metrics are now collected and processed on a per-service basis, and are visible in the Console:

The new cluster and service metrics are available now and you can start using them today!

Jeff;

CloudWatch Logs Subscription Consumer + Elasticsearch + Kibana Dashboards

Many of the things that I blog about lately seem to involve interesting combinations of two or more AWS services and today’s post is no exception. Before I dig in, I’d like to briefly introduce all of the services that I plan to name-drop later in this post. Some of this will be review material, but I do like to make sure that every one of my posts makes sense to someone who knows little or nothing about AWS.

The last three items above have an important attribute in common — they can each create voluminous streams of event data that must be efficiently stored, indexed, and visualized in order to be of value.

Visualize Event Data
Today I would like to show you how you can use Kinesis and a new CloudWatch Logs Subscription Consumer to do just that. The subscription consumer is a specialized Kinesis stream reader. It comes with built-in connectors for Elasticsearch and S3, and can be extended to support other destinations.

We have created a CloudFormation template that will launch an Elasticsearch cluster on EC2 (inside of a VPC created by the template), set up a log subscription consumer to route the event data into Elasticsearch, and provide a nice set of dashboards powered by the Kibana exploration and visualization tool. We have set up default dashboards for VPC Flow Logs, Lambda, and CloudTrail; you can customize them as needed or create other new ones for your own CloudWatch Logs log groups.

The stack takes about 10 minutes to create all of the needed resources. When it is ready, the Output tab in the CloudFormation Console will show you the URLs for the dashboards and administrative tools:

The stack includes versions 3 and 4 of Kibana, along with sample dashboards for the older version (if you want to use Kibana 4, you’ll need to do a little bit of manual configuration). The first sample dashboard shows the VPC Flow Logs. As you can see, it includes a considerable amount of information:

The next sample displays information about Lambda function invocations, augmented by data generated by the function itself:

The final three columns were produced by the following code in the Lambda function. The function is processing a Kinesis stream, and logs some information about each invocation:

exports.handler = function(event, context) {
    var start = new Date().getTime();
    var bytesRead = 0;

    event.Records.forEach(function(record) {
        // Kinesis data is base64 encoded so decode here
        var payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
        bytesRead += payload.length;

        // log each record
        console.log(JSON.stringify(record, null, 2));
    });

    // collect statistics on the function's activity and performance
    console.log(JSON.stringify({ 
        "recordsProcessed": event.Records.length,
        "processTime": new Date().getTime() - start,
        "bytesRead": bytesRead,
    }, null, 2));

    context.succeed("Successfully processed " + event.Records.length + " records.");
};

There’s a little bit of magic happening behind the scenes here! The subscription consumer noticed that the log entry was a valid JSON object and instructed Elasticsearch to index each of the values. This is cool, simple, and powerful; I’d advise you to take some time to study this design pattern and see if there are ways to use it in your own systems.
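The pattern can be sketched as follows. This is not the subscription consumer's actual code; it is a hypothetical helper that assumes that the log message consists of nothing but the JSON text, and shows the idea of indexing individual fields when the message parses as a JSON object and falling back to a single text field otherwise.

```javascript
'use strict';

// Turn a log message into a document to index. If the message is a JSON
// object, each of its keys becomes a separately indexed field; otherwise
// the raw text is kept under a single "message" field.
function toIndexableDocument(message) {
  try {
    const parsed = JSON.parse(message);
    if (parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed)) {
      return parsed;
    }
  } catch (err) {
    // Not valid JSON; fall through and treat the message as plain text.
  }
  return { message: message };
}

console.log(toIndexableDocument('{"recordsProcessed": 12, "processTime": 85, "bytesRead": 4096}'));
console.log(toIndexableDocument('START RequestId: example'));
```

Logging one JSON object per invocation, as the function above does, is what allows fields such as recordsProcessed to show up as their own columns.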

For more information on configuring and using this neat template, visit the CloudWatch Logs Subscription Consumer home page.

Consume the Consumer
You can use the CloudWatch Logs Subscription Consumer in your own applications. You can extend it to add support for other destinations by adding another connector (use the Elasticsearch and S3 connectors as examples and starting points).

— Jeff;