In Kubernetes, the leading container orchestration system, there are three important types of metrics to monitor: resource metrics, cluster state metrics, and control plane metrics.
Resource metrics track the availability of important resources like CPU, memory, and storage. Accessing these metrics is important for ensuring cluster health and the performance of the applications running on K8s.
In this article, we’ll explore all of the available Kubernetes metrics, with a focus on resource metrics. In addition, we’ll highlight the tools and processes available to engineering teams for tracking these metrics on their own.
Kubernetes Resource Metrics | Pipeline & Overview
Fortunately, Kubernetes, an open-source project, has taken a thoughtful approach to resource metrics. Resources like CPU, memory, and storage are critical to ensuring performance and availability. And accessing and monitoring resource metrics is an important early step for most engineering teams.
The above diagram shows how the Kubernetes resource metrics pipeline operates.
cAdvisor is a daemon, built into the kubelet, that collects container resource metrics. The kubelet makes these per-node metrics available through its `/metrics/resource` endpoint and through the Summary API at `/stats/summary`.
The Metrics Server is an important cluster add-on component that allows you to collect and aggregate resource metrics from Kubelet using the Summary API.
Installing the Metrics Server is an important step because it allows you to use kubectl commands, such as `kubectl top node`, which is discussed later in this post. It also allows you to use a separate monitoring toolset like Prometheus or ContainIQ.
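If the Metrics Server isn't already running in your cluster, one common way to install it is to apply the upstream release manifest from the kubernetes-sigs project (this assumes you have cluster-admin permissions; managed clusters may ship it preinstalled):

```shell
# Install the latest Metrics Server release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify the deployment is ready before relying on kubectl top
kubectl get deployment metrics-server -n kube-system
```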
Tracking resource metrics and availability is important in ensuring that end users can access your applications. CPU, memory, and storage are finite resources, and metrics like CPU utilization and available vs. used memory can be used to determine the load on the servers and whether additional resources must be added to the cluster. These metrics can also indicate when resources are overprovisioned and there could be an opportunity to reduce usage and costs.
Cluster State Metrics | A Quick Overview
In Kubernetes, the API server provides valuable metrics for tracking the count and availability of Kubernetes objects. Often described as cluster state metrics, they're helpful for identifying issues with nodes and pods. Depending on the type of controller that manages these objects, such as a Deployment or a DaemonSet, there are different types of metrics available.
By using Kubectl, or kube-state-metrics, you can retrieve availability and count metrics for all of the objects in your cluster.
Node status is a popular cluster state metric. For example, in Kubernetes, node conditions might include `Ready`, `MemoryPressure`, `DiskPressure`, and more. In kube-state-metrics, the `kube_node_status_condition` metric returns the status of each node condition.
It's also important to track pod availability and the number of unavailable pods. In kube-state-metrics, `kube_deployment_status_replicas_available` shows the number of available pods in a Deployment, or, if you're using a DaemonSet, `kube_daemonset_status_number_available`. The corresponding `kube_deployment_status_replicas_unavailable` and `kube_daemonset_status_number_unavailable` metrics show the number of unavailable pods.
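As an illustration, these kube-state-metrics series can be queried with PromQL. The queries below are sketches, not part of any standard configuration:

```promql
# Nodes whose Ready condition is not "true"
kube_node_status_condition{condition="Ready",status="true"} == 0

# Deployments currently running with unavailable replicas
kube_deployment_status_replicas_unavailable > 0
```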
Cluster state metrics are available inside ContainIQ.
Control Plane Metrics | A Quick Overview
As mentioned previously, Kubernetes provides metrics for the core control plane components, including the API server, controller managers, schedulers, and etcd. These control plane components are critical for ensuring cluster management, so tracking the availability and performance of these components is essential.
Here are a few control plane metrics that you should consider monitoring:
- `etcd_server_has_leader`: A healthy etcd cluster always has a leader, and while one exists this metric returns 1. If it returns 0, the etcd member has no leader and is unable to serve queries.
- `apiserver_request_latencies_count`: This metric is part of the API server's request-latency histogram, which tracks how long user-initiated commands to create, delete, or query resources take to execute. A spike in latency may mean that commands are executing slowly as a result of the API server being overworked. (In newer Kubernetes versions, this metric has been replaced by `apiserver_request_duration_seconds`.)
- `scheduler_schedule_attempts_total`: This metric is important for monitoring the throughput of the Kubernetes scheduler. It counts the scheduler's attempts to schedule pods to nodes, broken out by result (scheduled, unschedulable, or error); scheduling latency is tracked separately by metrics such as `scheduler_e2e_scheduling_duration_seconds`.
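As a sketch, a Prometheus alerting rule built on the etcd leader metric might look like the following. The group name, alert name, and threshold here are illustrative, not part of any standard configuration:

```yaml
groups:
  - name: control-plane.rules   # illustrative group name
    rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "etcd member {{ $labels.instance }} has no leader"
```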
Accessing Resource Metrics with Kubectl
Kubectl is a powerful command-line tool that allows engineers to perform a large number of actions on a Kubernetes cluster, without needing to make API calls directly.
Fortunately, Kubectl offers important functionality for accessing Kubernetes metrics directly from the command line.
Using Kubectl get
After deploying the Metrics Server, you can use the `kubectl get` command to retrieve metrics for pods and nodes.
Using the following command, you can retrieve a pod’s resource metrics:
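One way to do this, assuming the Metrics Server is installed and the pod runs in the `default` namespace, is to query the Metrics API directly through the API server:

```shell
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/<POD_NAME>"
```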
In the example above, you would replace `<POD_NAME>` with the actual name of the pod. You can find the names of all of your pods by using the `kubectl get pods` command.
The returned JSON object will include the CPU and memory usage for the given pod at that specific point in time.
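To make the units concrete, here is a small sketch (not part of kubectl) that parses a response shaped like the Metrics API's PodMetrics object, converting millicore CPU quantities and `Ki`/`Mi`-style memory quantities into plain numbers. The payload below is illustrative, not output captured from a real cluster:

```python
import json

def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity (e.g. '250m' or '1') to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory(quantity: str) -> int:
    """Convert a binary-suffixed memory quantity (e.g. '64Mi') to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # already in bytes

# Illustrative payload shaped like a metrics.k8s.io PodMetrics response
payload = json.loads("""{
  "kind": "PodMetrics",
  "containers": [
    {"name": "app", "usage": {"cpu": "250m", "memory": "64Mi"}}
  ]
}""")

for container in payload["containers"]:
    cores = parse_cpu(container["usage"]["cpu"])
    mem_bytes = parse_memory(container["usage"]["memory"])
    print(container["name"], cores, mem_bytes)
```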
Using Kubectl top
With a properly installed Metrics Server, you can use the `kubectl top` command to pull metrics for pods, nodes, and even individual containers.
To retrieve the metrics for all of your running nodes, use the `kubectl top nodes` command. Below is an example of the output from running this command in a test environment:
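The exact values will vary by cluster; illustrative output (node names and numbers are placeholders) looks like:

```
$ kubectl top nodes
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   250m         12%    1690Mi          44%
node-2   120m         6%     1230Mi          32%
```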
Similarly, by using `kubectl top pods`, you can view the CPU and memory metrics for all of the pods running. If you'd like, you can filter by namespace with the `--namespace` flag, and drill down to specific containers by using the `--containers` flag. We provide examples in this step-by-step guide to using the `kubectl top` command.
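For instance (the namespace here is just an example):

```shell
# CPU and memory usage for pods in a single namespace
kubectl top pods --namespace kube-system

# Per-container usage within those same pods
kubectl top pods --namespace kube-system --containers
```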
Using Kubectl describe
If you’re interested in knowing more about how resources are allocated within nodes, you can use the `kubectl describe node <NODE_NAME>` command to learn more.
The `kubectl describe node` command works with or without the Metrics Server, and it can be helpful for seeing which pods are taking up capacity inside a node. The output includes each pod's CPU and memory requests and limits, expressed both in absolute terms and as a percentage of the node's allocatable capacity.
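An abridged, illustrative excerpt of the `Allocated resources` section of that output (the figures are placeholders):

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource   Requests     Limits
  --------   --------     ------
  cpu        850m (42%)   1 (50%)
  memory     190Mi (2%)   390Mi (5%)
```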
Our tutorial on using Kubectl describe provides additional examples and context.
Accessing the Kubernetes Dashboard
The Kubernetes Dashboard is a web-based UI for viewing Kubernetes metrics, monitoring workloads, and managing your cluster. It can be thought of as an extension of kubectl, in that you can accomplish the same core actions with the Dashboard as you would with kubectl on the command line. However, the Kubernetes Dashboard offers added functionality, like historical views.
For example, using the Kubernetes Dashboard, you can access node and pod metrics, similar to how you would with `kubectl top`. You can view point-in-time metrics for all of your resources, in addition to metadata. You can also visualize recent metrics (the last 15 minutes by default) for all of your nodes, pods, and namespaces. This additional context can be helpful when you're debugging and want a sense of how metrics have recently changed.
Installing the Kubernetes dashboard is relatively easy, and we provide a step-by-step walkthrough in this guide.
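For example, one common way to deploy the Dashboard is to apply the upstream recommended manifest and then open a local proxy. The version in the URL below is one published release; check the project for the current one:

```shell
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml
kubectl proxy
# Then browse to:
# http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
```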
Tools to Track Metrics | ContainIQ or Prometheus
Due in large part to the growth and support of the Cloud Native Computing Foundation (CNCF), there are plenty of open-source and commercially available tools for monitoring metrics in 2022. In this section of the article, we’ll highlight two important tools: one commercial offering and one popular open-source tool.
Introduced in 2021, ContainIQ is a SaaS and on-prem solution for Kubernetes metrics, as well as logs and traces.
ContainIQ offers a number of dashboards and features that are helpful for monitoring metrics, including both pod and node resource metrics, as well as metrics for deployments. As pictured above, ContainIQ’s Node Metrics dashboard provides a visualization of a cluster or multiple clusters’ node metrics. Metrics are displayed for each node and are color-coded based on usage. Users can view CPU and memory usage over periods of time, as well as limits. ContainIQ’s pod metrics dashboard provides similar functionality.
Users are able to set alerts on changes in metrics, including both CPU and memory usage. And they can see the status, conditions, and events associated with each pod and node. Alerts can be sent to Slack, or to a large number of destinations using a webhook.
ContainIQ’s other dashboards, like the Tracer dashboard, which displays all internal and external requests, allow users to correlate individual requests with metrics and logs at points in time. This can be particularly useful when you’re debugging problems or investigating significant changes in latency for a given request, path, or service.
ContainIQ is launching its custom metrics feature set to the public during the summer of 2022. ContainIQ’s custom metrics feature lets users bring in metrics from third-party sources, like Prometheus, and then create custom dashboards.
ContainIQ offers a self-service sign-up experience. Users are billed based on usage at $20 per node per month, and $0.50 per GB of log ingest if they’re using the logging feature set. ContainIQ does not charge based on metric ingest, by seat, or by trace ingest.
Prometheus is a popular and thoughtfully built toolset for metrics and monitoring. This open-source time-series database is used by companies large and small and is very popular in the Kubernetes ecosystem.
Prometheus allows teams to efficiently collect and store metrics using a highly dimensional data model. Prometheus is well known for its PromQL query language and for Alertmanager, which can be used to configure alerts. Users are able to send alerts from Alertmanager to a large number of services, including PagerDuty and Opsgenie.
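As a taste of PromQL, a query like the following (the namespace label value is a placeholder) computes per-pod CPU usage over the last five minutes from the cAdvisor metrics Prometheus scrapes:

```promql
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)
```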
While Prometheus does offer a basic UI, it’s most often used in combination with another tool for visualization, such as Grafana or ContainIQ (as previously mentioned).
Prometheus offers a large number of integrations and client libraries.
Prometheus can be self-hosted, or it can be used as a managed offering through many of the cloud providers, like AWS, GCP, and other third parties.
Kubernetes metrics play an important role in ensuring cluster health and application performance. In this article, we introduced the three fundamental types of Kubernetes metrics: resource metrics, cluster state metrics, and control plane metrics. Depending on your use cases, and the applications running on K8s, you may want to prioritize certain metrics in your monitoring and alerting. Resource metrics in particular are important because they showcase the availability of core resources, like CPU and memory, which are needed to run your workloads.
Identifying the metrics that matter is an important first step. And once you’ve identified the metrics that most matter in your environment, it’s strategically important to put in place a toolset that allows your team to monitor, track, and alert on changes in these metrics. In this article, we highlighted both ContainIQ, a Kubernetes-native solution for metrics, logs, and traces, and Prometheus, the popular open-source time-series database and monitoring tool.
By using ContainIQ, alone or in conjunction with Prometheus, teams are able to get a clear view of metrics in real-time, as well as the historical information they need to troubleshoot issues that have happened or are currently occurring.