Kubernetes Alerting | Best Practices in 2021

November 30, 2021

Alerts should be both symptom-based and actionable. These best practices are a great starting point for most teams to get up and running.

Harshil Patel
Software Developer

It’s easy to see why Kubernetes is one of the most popular container orchestration tools. It isn’t just for batch processing; it can handle real-time data, too. However, running Kubernetes means you’ve got to monitor the health of your cluster closely to make sure everything runs smoothly at all times. Alerting on metrics, Kubernetes events, logs and more is possible and often a requirement for today’s engineering teams.

This usually means you need a Kubernetes monitoring stack designed specifically for containers and microservices architectures. Quick awareness of an issue is essential for keeping your cluster running. For most organizations, this means setting up an alerting solution that sends notifications to the relevant stakeholders when certain conditions are met.

Why Do You Need Alerting?

If you are using Kubernetes to manage your containerized apps, it’s important to keep an eye on the health of your cluster, including your pods and nodes. Alerts are helpful for identifying issues as they happen, like when a pod is evicted or dies. Proper alerting ensures that you don’t lose any data, helps to prevent downtime, and will ultimately improve the health of your engineering team too. Setting up alerts does take some time, but once you have them up and running, your team will likely recoup this time.

Before we dive in, know that this is one of those things that can get very complex very quickly and takes some trial and error. But failure to implement proper alerting can result in a slower response time to downtime, server resets, and other costly consequences that affect the company’s performance. If you aren’t in a position to respond quickly to problems, setting very thoughtful alerts and avoiding alert fatigue becomes even more important.

To make sure you get started with your best foot forward, let’s check out a few best practices to implement when designing alerts for your Kubernetes cluster.

Deciding What Is Important

At the heart of it, all alerts should be both symptom-based and actionable.

For most teams, that involves keeping track of the cluster, nodes, pods, deployments, and services.

Cluster and Node Metrics

Monitoring your cluster will offer you a better understanding of its general health, but it’s important to pay attention to a few specific metrics:

  • How many resources your entire cluster is using
  • How many nodes are present and how many apps are on each node
  • The amount of memory used
  • Network bandwidth

Deployments and Pods

To effectively monitor the health of your pods, keep an eye on pod status, container restarts, and resource usage against requests and limits.

Applications

Application metrics, which are generally supplied by the apps themselves, can help you evaluate the performance and stability of applications operating inside your Kubernetes pods. Metrics vary based on the scope of the application, but pay attention to:

  • Latency
  • Requests Per Second
  • Responsiveness
  • Uptime
  • Reaction times

Implementing Best Practices | Our Suggestions

These best practices are a great starting point for most teams to get up and running.

Create Your Own Alerting System

When first getting started, it is helpful for teams to concentrate on workload performance and availability.

Let’s look at some recommended alerts that most teams will find valuable.

Disk Usage Warning

Many teams find it useful to alert on disk usage. This is a generic alert that triggers when consumption exceeds 80 percent. However, you may want to implement various iterations, such as a second, higher priority alert with a higher threshold, like 90 percent, or different thresholds based on the file system.

If you want to set various criteria for different services or hosts, simply alter the scope of where you want to apply a certain threshold.
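As a concrete sketch, here is how this two-tier disk alert might look as Prometheus alerting rules, assuming a Prometheus stack scraping node-exporter (the metric names and the tmpfs/overlay filter are assumptions about that setup; ContainIQ users would configure the equivalent thresholds in the UI):

```yaml
groups:
  - name: disk-usage
    rules:
      # Warning tier: fires when any real filesystem is more than 80% full
      - alert: DiskUsageWarning
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
      # Higher-priority tier at 90%, as suggested above
      - alert: DiskUsageCritical
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                   / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage above 90% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```

Narrowing the label matchers (for example, adding a mountpoint or instance selector) is how you would apply different thresholds to different file systems or hosts.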

Network Connectivity Issues

Most engineers want to be notified if a server is offline or unavailable. This single alert can be used across your whole infrastructure, and will trigger when the value of a metric crosses a threshold. An example would be mean_response_time exceeding 400ms: if the mean response time stays below 400ms most of the time, the alert shouldn't fire.

Depending on how rapidly you want to receive notifications, you may want to reduce the evaluation window to one or two minutes. However, if the metric frequently fluctuates around the threshold, you risk generating too many notifications.
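In a Prometheus setup, both of these alerts might be sketched as follows; mean_response_time stands in for whichever response-time metric your application actually exports (assumed here to be reported in seconds), and the `for` duration controls how quickly you are notified:

```yaml
groups:
  - name: connectivity
    rules:
      # Fires when a scrape target is offline or unreachable
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
      # Fires when the example mean_response_time metric exceeds 400ms
      - alert: HighMeanResponseTime
        expr: mean_response_time > 0.4
        for: 2m
        labels:
          severity: warning
```

Lengthening the `for` duration reduces flapping when a metric hovers around its threshold.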

Pods That Aren’t Working

If you have resource issues or configuration flaws, Kubernetes will most likely fail to schedule pods into the cluster. If a pod isn’t operating or even scheduled, there may be a problem with the pod, the cluster, or the entire Kubernetes setup.

If pods aren’t operating, you’ll want to know a few things:

  • If any pods are stuck in a restart loop
  • The frequency of failed requests
  • If there are any resource or configuration concerns
  • If a pod has been decommissioned

As previously stated, Kubernetes may be unable to schedule pods if you have resource issues or configuration mistakes. In instances like that, you should examine the status of your deployments and look for configuration faults or resource problems.

Set alerts for your pods' status: notifications should be triggered when a pod's state is Failed, Pending, or Unknown for the time period you select. You can do this by setting alerts on the corresponding Kubernetes events for each status change.
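With kube-state-metrics feeding Prometheus (an assumption about your monitoring stack), a rule covering these three pod phases could look like:

```yaml
groups:
  - name: pod-status
    rules:
      - alert: PodInBadPhase
        # kube_pod_status_phase is 1 for the phase a pod is currently in
        expr: |
          sum by (namespace, pod) (
            kube_pod_status_phase{phase=~"Failed|Pending|Unknown"}
          ) > 0
        for: 5m  # the "time period you select"
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is Failed, Pending, or Unknown"
```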

Node Resource Consumption

In addition to keeping track of the nodes in your cluster, monitor the CPU, memory, and disk usage of Kubernetes nodes to verify that they're all healthy. This helps guarantee that your cluster has enough nodes, that you don't run out of resources (e.g., OOMKilled pods), and that etcd is in good working order.

Create alerts to get notifications when hosts stop reporting or when a node's CPU or memory usage rises above a specified threshold.
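Assuming node-exporter metrics are available in Prometheus, these node-level alerts might be sketched as:

```yaml
groups:
  - name: node-resources
    rules:
      # Host stopped reporting metrics
      - alert: NodeNotReporting
        expr: up{job="node-exporter"} == 0  # the job label is an assumption about your scrape config
        for: 5m
      # Sustained CPU usage above 90% (100% minus average idle time)
      - alert: NodeHighCPU
        expr: |
          100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 10m
      # Less than 10% of memory still available
      - alert: NodeLowMemory
        expr: |
          100 * node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 10
        for: 10m
```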

Missing Pods

Over time, you may notice that pods go missing from your cluster. If engineers did not request enough resources when scheduling a pod, it may fail to start. It's possible that the pod never started, that it's stuck in a restart loop, or that it disappeared due to a configuration issue.

You must confirm the health and availability of pod deployments in order for Kubernetes to do its work successfully. A Deployment specifies, through its replicas field, how many instances of a pod should be running. Even when a replica count is set, Kubernetes may be unable to schedule every instance if cluster resources are scarce. An alert should be generated when the number of available pods for a deployment falls below the number you set when you created the deployment.
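As a sketch against kube-state-metrics, the replica-count check described above could be written as:

```yaml
groups:
  - name: missing-pods
    rules:
      - alert: DeploymentReplicasMismatch
        # Fires when fewer pods are available than the deployment's spec asks for
        expr: |
          kube_deployment_status_replicas_available
            < kube_deployment_spec_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has missing pods"
```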

Container Restarts

Under normal conditions, containers do not restart. Frequent container restarts often indicate that your containers are hitting their memory limits; they can also signal a problem with the container or its host.

Owing to the way Kubernetes schedules containers, identifying container resource issues can be difficult, as Kubernetes will restart or terminate containers when they approach their limits. Simply observing the container restarts will reveal what’s going on:

  • Are any containers trapped in a restart cycle?
  • In a specific time period, how many container restarts occurred?
  • Why are containers restarting?

You can set up alerts based on how many times your Kubernetes containers have restarted. These alerts provide instant, valuable notifications and help you catch restart loops before they escalate.
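One way to express a restart alert with kube-state-metrics (the threshold of three restarts in fifteen minutes is an arbitrary starting point to tune for your workloads):

```yaml
groups:
  - name: container-restarts
    rules:
      - alert: ContainerRestartLoop
        # More than 3 restarts of the same container within 15 minutes
        expr: |
          increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```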

Deploy Alerts on the Monitoring System

Even if you have a good monitoring dashboard, you won’t be staring at it all day. This is where the use of notifications comes into play. Pair your monitoring system with an alerting system or alert manager. Alerts automatically check for conditions you’ve set and notify you if the cluster has a problem.

Some examples of possible conditions to target are:

  • Metrics for host resources
  • Metrics on container resources
  • Metrics for the application
  • Metrics focused on customer-facing services

ContainIQ gives users the ability to alert on most metrics, events, and logs:

Create new monitor

Creating an alert is easy, and should only take a minute or two. Toggling alerts off and on can be done from the Monitors tab, and it is easy to delete alerts with one click. Users can feed alerts to a Slack channel by using the one-click Slack integration, where the desired channel is chosen.

There are a variety of other alerting tools too. A popular open-source tool is Alertmanager by Prometheus.
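For reference, a minimal Alertmanager configuration that routes alerts to a Slack channel might look like this (the channel name and webhook URL are hypothetical placeholders):

```yaml
route:
  receiver: team-slack
  group_by: [alertname, namespace]

receivers:
  - name: team-slack
    slack_configs:
      - channel: "#k8s-alerts"  # hypothetical channel
        api_url: "https://hooks.slack.com/services/REPLACE_ME"  # your Slack incoming webhook
        send_resolved: true
```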

Determine Ownership

Because a wide range of stakeholders are involved in monitoring cluster workloads, you need to know who's in charge of what, both in terms of infrastructure and workloads. You'll want to ensure that the appropriate individuals are notified at the appropriate time. You'll also want to reduce the noise generated by notifications about matters that don't concern specific individuals on the engineering team.

Defining Priority

Impact and priority are two important characteristics to evaluate on a regular basis. It’s essential to be able to evaluate whether an alert is actionable and the number of users or services that would be affected.

Urgency also plays a role. It's important to know when the situation must be dealt with: right now, in the next hour, or the next day? Defining the urgency of a problem makes it easier to address.

Final Thoughts

It’s critical to keep track of how your applications are performing in real-time. When problems arise, the severity of the problem depends on the effect and the number of users or business services affected, but more significant outages can often be avoided if you use the correct monitoring and alerting stack.

It's difficult to decide ahead of time what to watch, so you'll need some context to figure out what's wrong. Generally, it takes a team a few weeks to a month to set alerts that are actionable and avoid alert fatigue. Setting strong alerts requires continual thought and reconsideration over time. Be patient during the process; in the long run it is worth it.

ContainIQ provides multiple features to track and monitor events, as well as CPU, memory, and service latency for pods and nodes. However, there are a number of other awesome open-source tools, like Prometheus, too.

Looking for an out-of-the-box monitoring solution?

With a simple one-line install, ContainIQ allows you to monitor the health of your cluster with pre-built dashboards and easy-to-set alerts.

Article by

Harshil Patel

Software Developer

Harshil is a Software Developer and machine learning enthusiast with experience in designing, developing, maintaining software applications, and writing about various technologies.
