AWS Partner Network (APN) Blog

How to Simplify AWS Monitoring with Logz.io’s Fully Managed ELK Stack and Grafana

By Charlie Klein, Product Marketing Manager at Logz.io

Metrics and logs that clearly signal an oncoming production crisis are often lost in the sea of telemetry data generated by modern, distributed Amazon Web Services (AWS) environments.

This is especially true in environments powered by technologies like Kubernetes. Each layer in a microservices architecture—such as databases, cloud infrastructures, backend services, and client-side applications—produces different kinds of telemetry data that must be monitored for performance and reliability.

The result is a barrage of logs and metrics that increase in scale and variety as AWS workloads grow. Engineers hoping to identify and investigate production issues must collect, aggregate, index, and visualize this data so it can be analyzed—introducing a serious data analytics challenge.

When addressing this challenge, most engineers turn to the open source community. Popular open source monitoring solutions include the ELK Stack and Grafana, which help teams monitor their logs and metrics, respectively.

While these tools are widely used by DevOps organizations, they can be difficult to build and maintain in growing environments.

Building scalable, resilient, and secure metrics and logging pipelines with the ELK Stack and Grafana requires engineering time and expertise. Additionally, operating multiple tools for observability can create disjointed, siloed workflows that hamper productivity.

The Logz.io Cloud Observability Platform delivers the ELK Stack and Grafana as a fully managed service, so engineers can use the open source monitoring tools they know in a single solution, without the hassle of maintaining them at scale.

On top of the managed open source, Logz.io provides additional advanced analytics capabilities to make the ELK Stack and Grafana faster, more integrated, and easier to use.

In this post, I will discuss how to ship log and metrics data to Logz.io, and explore two strategies for analyzing that data to quickly identify and troubleshoot production issues. Logz.io is an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in both Data & Analytics and DevOps.

Shipping Kubernetes Metrics and Logs to Logz.io

Before exploring two different strategies for monitoring and troubleshooting AWS environments, we’ll need to get the data into Logz.io. If you’re more interested in learning how to analyze log and metrics data in Logz.io, skip to the next section.

In this case, we’ll ship Kubernetes logs and metrics. For information on how to ship additional log and metrics data to Logz.io, see the documentation. This is the only configuration necessary to build log and metrics pipelines; everything else is managed by Logz.io.

To set up metrics and log pipelines according to the instructions below, you’ll need a running Kubernetes cluster and a Logz.io account, which you can get for free.

Shipping Kubernetes Logs to Logz.io

For Kubernetes, a DaemonSet ensures some or all nodes run a copy of a pod. This implementation uses a Fluentd DaemonSet to collect Kubernetes logs. Fluentd is flexible and has the plugins needed to distribute logs to third parties such as Logz.io.

The logzio-k8s image comes pre-configured for Fluentd to gather all logs from the Kubernetes node environment and append the proper metadata to the logs.

Below are the steps needed to ship Kubernetes logs to Logz.io.

Step 1: Store Your Credentials

The code snippet below will save your shipping credentials as a Kubernetes secret.

Replace <<SHIPPING-TOKEN>> with the token of the account you want to ship to.

Then, replace <<LISTENER-HOST>> with your region’s listener host. For more information on finding your account’s region, see Account Region.

kubectl create secret generic logzio-logs-secret \
  --from-literal=logzio-log-shipping-token='<<SHIPPING-TOKEN>>' \
  --from-literal=logzio-log-listener='https://<<LISTENER-HOST>>:8071' \
  -n kube-system
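
Before creating the secret, it can help to confirm that both placeholders were actually replaced. The snippet below is a minimal sketch rather than part of the official instructions: check_placeholder and listener_url are hypothetical helper names, and the host in the usage comment is a made-up example, not a real listener.

```shell
# Hypothetical helper: fail if a value is empty or still contains a
# <<PLACEHOLDER>> marker copied from the instructions.
check_placeholder() {
  case "$1" in
    ''|*'<<'*|*'>>'*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: build the listener URL in the form used by the
# secret above (https://<host>:8071).
listener_url() {
  check_placeholder "$1" || {
    echo "listener host is empty or still a placeholder" >&2
    return 1
  }
  printf 'https://%s:8071\n' "$1"
}

# Usage (the host is a made-up example):
#   listener_url "listener.example.com"
```

Checking the values first avoids storing a secret that still contains the literal placeholder text, which would otherwise only show up later as missing logs.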

Step 2: Deploy the DaemonSet

To deploy the DaemonSet, use the code snippet below. There are two options, depending on whether your cluster uses RBAC.

For an RBAC cluster:

kubectl apply -f

Or, for a non-RBAC cluster:

kubectl apply -f

Now, give your logs some time to get from your system to ours, and then open Kibana. If you still don’t see your logs, see log shipping troubleshooting.

To suppress Fluentd system messages, set the FLUENTD_SYSTEMD_CONF environment variable to “disable” in your Kubernetes environment.

Shipping Kubernetes Metrics to Logz.io

Now that we’ve shipped Kubernetes logs to Logz.io, let’s walk through how to ship Kubernetes metrics to Logz.io. To do this, we are going to use Metricbeat, a lightweight shipper that collects and sends metrics to the desired location.

Follow the steps below to ship Kubernetes metrics to Logz.io using Metricbeat.

Step 1: Check for ‘kube-state-metrics’ in Your Cluster

Run this code snippet to determine whether kube-state-metrics is running in your cluster:

kubectl get pods --all-namespaces | grep kube-state-metrics

If you see a response, that means kube-state-metrics is installed, and you can move on to Step 2. Otherwise, deploy kube-state-metrics to your cluster with the following code snippet:

git clone \

  && kubectl --namespace=kube-system apply -f kube-state-metrics/kubernetes
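
If you want to script the check from this step instead of reading the output by eye, something like the following could work. It is a sketch under assumptions: has_kube_state_metrics is a hypothetical helper name, and it only looks for the string kube-state-metrics in whatever pod listing is piped into it.

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on
# stdin and succeeds if any line mentions kube-state-metrics.
has_kube_state_metrics() {
  grep -q 'kube-state-metrics'
}

# Usage against a live cluster (requires kubectl):
#   if kubectl get pods --all-namespaces | has_kube_state_metrics; then
#     echo "kube-state-metrics found; continue to Step 2"
#   else
#     echo "kube-state-metrics not found; deploy it first"
#   fi
```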

Step 2: Store Your Credentials

Save your shipping credentials as a Kubernetes secret.

Make sure to replace:

  • <<SHIPPING-TOKEN>> with the token of the account you want to ship to.
  • <<LISTENER-HOST>> with your region’s listener host. For more information on finding your account’s region, see Account Region.

kubectl --namespace=kube-system create secret generic logzio-metrics-secret \
  --from-literal=logzio-metrics-shipping-token=<<SHIPPING-TOKEN>> \


Step 3: Store Your Cluster Details

Paste the kube-state-metrics namespace and port in your text editor. You can find them by running this command:

kubectl get service --all-namespaces | grep -E 'kube-state-metrics|NAMESPACE'
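
The namespace and port can also be pulled out of that output with a short script. This is a sketch, not an official step: kube_state_metrics_details is a hypothetical helper name, and it assumes the default kubectl column order (NAMESPACE, NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE) and a service named exactly kube-state-metrics.

```shell
# Hypothetical helper: reads `kubectl get service --all-namespaces` output
# on stdin and prints "<namespace> <port>" for the kube-state-metrics
# service. Assumes the default column order:
#   NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube_state_metrics_details() {
  awk '$2 == "kube-state-metrics" {
    split($6, ports, "/")   # "8080/TCP,8081/TCP" -> first port "8080"
    print $1, ports[1]
    exit
  }'
}

# Usage against a live cluster (requires kubectl):
#   kubectl get service --all-namespaces | kube_state_metrics_details
```

If the service exposes several ports, this prints only the first one listed, so check the full output when in doubt.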

Next, paste the cluster name in your text editor. You can find it by running this command, or if you manage Kubernetes in AWS, you can find it in your admin console.

kubectl cluster-info

Now, replace <<KUBE-STATE-METRICS-NAMESPACE>>, <<KUBE-STATE-METRICS-PORT>>, and <<CLUSTER-NAME>> in this command to save your cluster details as a Kubernetes secret.

kubectl --namespace=kube-system create secret generic cluster-details \
  --from-literal=kube-state-metrics-namespace=<<KUBE-STATE-METRICS-NAMESPACE>> \
  --from-literal=kube-state-metrics-port=<<KUBE-STATE-METRICS-PORT>> \


Step 4: Deploy Metricbeat

Run the following code snippet to deploy Metricbeat:

kubectl --namespace=kube-system create -f

Give your metrics some time to get from your system to ours, and then open Logz.io.

Simplifying AWS Microservices Monitoring and Troubleshooting

Now that we’ve shipped metrics and log data to Logz.io, we can learn how to analyze that data with Infrastructure Monitoring and Log Management, two of the three products offered on the Logz.io Cloud Observability Platform.

If you’ve used Kibana or Grafana, Logz.io should look familiar. Infrastructure Monitoring delivers a fully managed Grafana to analyze incoming metrics and identify production issues. Log Management provides a fully managed ELK Stack to help engineers investigate production issues with their logs.

Using these products, let’s explore two strategies you can use to make monitoring and troubleshooting AWS microservices easier and faster.

Identifying Production Issues with Metrics and Troubleshooting with Logs

After engineers ship their metrics data, Infrastructure Monitoring dashboards will automatically begin populating. You can choose from a list of pre-built dashboards that monitor different metrics depending on your needs.

In the use case below, we’ll open up the “K8S Cluster” dashboard. As you can see, the nginx pod is using up a growing amount of CPU and memory. You can easily filter for a more granular view of specific pods by clicking on them.


Figure 1 – Infrastructure Monitoring (based on Grafana) showing a spike in CPU and memory usage in an nginx pod.

While these metrics provide a clear indicator of a serious production issue, they don’t tell us anything about what’s causing the issue or how to solve it. In other words, metrics help us identify that an issue exists, but we still need to diagnose it.

To diagnose the problem, we’ll need more descriptive information of what’s happening, which is where logs come in. The challenge is finding the logs associated with the issue we see above.

In this Kubernetes cluster, there are thousands of logs coming in every hour; we need to find the logs that describe this specific problem.

To find the logs that describe this spike in CPU and memory metrics, click the “Explore in Kibana” link in the upper left corner of the visualization (see the blue box in Figure 1 above). This takes the filters set in the Grafana dashboard, including the cluster, namespace, pod, and timestamp, and automatically applies them to the logs in Kibana.
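
For comparison, a roughly equivalent filter could be typed into the Kibana search bar by hand. The helper below sketches what such a query might look like. It is illustrative only: kibana_pod_query is a hypothetical name, and the field names kubernetes.namespace_name and kubernetes.pod_name are assumptions based on common Fluentd Kubernetes metadata, so check the field list in your own Kibana before relying on them.

```shell
# Hypothetical helper: build a Lucene-style query string for the Kibana
# search bar from a namespace and a pod name. The field names are
# assumptions, not confirmed by this post.
kibana_pod_query() {
  printf 'kubernetes.namespace_name:"%s" AND kubernetes.pod_name:"%s"\n' "$1" "$2"
}

# Usage (namespace and pod name are made-up examples):
#   kibana_pod_query "default" "nginx-7d9c5b"
```

The time range would still need to be set separately in the Kibana time picker, which is part of what the dashboard link does for you.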


Figure 2 – Log Management (based on Kibana) showing the issue that caused the spike in memory and CPU usage in Figure 1.

As a result, we see the logs that describe the exact issue that caused the spike in CPU and memory consumption. It’s immediately clear that a Java exception, shown with its stack trace, is causing the problem.

From here, you can triage the issue to the appropriate engineer to ensure the problem is addressed.

Finding the Needle in the Haystack with AI-Powered Log Analytics

Logs are typically not used for identifying production issues because there are so many of them and engineers don’t know what to look for. Rather, logs are normally used to troubleshoot issues once they are identified. Logz.io makes logs useful for identifying production issues by helping you find the needle in the haystack. Logz.io’s Cognitive Insights uses artificial intelligence (AI) to cross-reference incoming logs with online discussion forums like GitHub and Stack Overflow to identify critical logged events that could otherwise have gone unnoticed.

If Cognitive Insights finds a log from your environment that was discussed as a problem in a forum, for example, then it could be a problem for you as well.

To illustrate this, let’s look at a customer story showing how Logz.io helps Form3 monitor for critical production issues with Cognitive Insights.

Customer Use Case: Form3

Form3 helps banks and regulated fintechs move money faster by providing a fully managed payment technology service, which combines a powerful AWS Cloud processing platform with multiple payment gateways and workflows. The service is robust and reliable, trusted by the world’s leading financial institutions.

In total, more than 100 Amazon Elastic Compute Cloud (Amazon EC2) instances ship various types of infrastructure and application data to Logz.io, helping the team at Form3 gain insight into how their development, testing, and production environments are performing.

Using the Fluentd Docker logging driver, Form3 ships what are primarily Java and Go JSON-formatted application logs to Logz.io. They also ship Elastic Load Balancer (ELB) and Application Load Balancer (ALB) logs to Logz.io by using an AWS Lambda function for shipping data from Amazon CloudWatch.

Collecting, processing, and analyzing log data generated by multiple layers in microservices architecture is a monitoring challenge that many companies find extremely hard to overcome. The amount of data being generated by different sources obscures visibility and makes it extremely difficult to identify and respond to issues quickly enough to ensure they do not have an impact on business.

Form3 is no exception, but Logz.io helps break down the log data by automatically aggregating, parsing, and indexing their logs so they can be visualized in Kibana dashboards.

Once the logs were easier to understand, Logz.io helped Form3 reduce time-to-resolution. Using Cognitive Insights, Form3 quickly identified a database issue caused by a crawl exception. In Figure 3 below, you can see the PSQLException identified by Cognitive Insights.


Figure 3 – The PSQLException identified by Cognitive Insights.

Once identified, and using the data provided by Logz.io (accessible through a link to the relevant Stack Overflow discussion), the issue was resolved quickly, before it had a critical impact on Form3’s transaction services.

The combination of advanced analysis and visualization features, automated alerting, and machine learning (ML) provides Form3 with a powerful monitoring and troubleshooting solution for their growing distributed architecture.

Sam Owens, a Senior Engineer at Form3, sums things up: “Aggregating logs is an essential part of running a microservices architecture. Logz.io is the perfect partner to get this job done right. We’ve been using the platform for many years and have built up a lot of trust in the platform and the team behind it.”


From startups to the most advanced DevOps organizations in the world, open source monitoring tools like the ELK Stack and Grafana are popular choices to quickly identify and respond to production issues.

However, the gap between expectations and the reality of using open source monitoring tools can be stark. Engineering teams using open source monitoring tools must build and maintain telemetry data pipelines to ship, process, and analyze the data.

This requires monitoring open source performance as data scales, setting up data parsing and mapping, upgrading components to realize benefits of the open source community, and building data queuing systems, among many other tasks.

By delivering the ELK Stack and Grafana as a fully managed service, Logz.io allows teams to enjoy the open source tools they know without the hassle of building and maintaining open source data pipelines at scale.

On top of a fully managed ELK Stack and Grafana, the Cloud Observability Platform provides additional analytical capabilities to make these tools faster, easier to use, and more integrated. This helps teams reduce mean-time-to-remediation for production issues.

Next Steps

You can start a free trial of Logz.io and explore sample data, or ship your own data to the platform (up to 1GB/day for as long as you want). Learn more about shipping your data in our documentation.

If you need to ship more than 1GB/day, subscribe to Logz.io on AWS Marketplace.

To learn more about Logz.io, see how we’re helping hundreds of AWS customers monitor and troubleshoot their environments.

The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.

Logz.io – APN Partner Spotlight

Logz.io is an AWS Competency Partner. Its intelligent log analysis platform combines ELK as a cloud service with machine learning to derive new insights from machine data.
