If you’ve been working with Kubernetes for any length of time, you’ve probably come across the <terminal inline>OOMKilled<terminal inline> error. It can be frustrating to debug if you don’t understand how it works. In this article, we’ll take a closer look at the <terminal inline>OOMKilled<terminal inline> error: why it occurs, how to troubleshoot it when it happens, and what steps you can take to help prevent it.
Memory in Kubernetes
Let’s begin by understanding how Kubernetes thinks about memory allocation. When the scheduler is trying to decide how to place pods in the Kubernetes cluster, it looks at the capacity for each node.
You should note that a node with 8 GB of memory won’t necessarily have 8 GB available to run pods. Kubernetes tries to determine how much of the 8 GB the node needs for normal operation and how much is left over to run pods.
You can see a breakdown of allocatable resources by taking a look at the YAML for a node: <terminal inline>kubectl get node my_node -oyaml<terminal inline>. You should see something like this:
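The exact fields depend on your node size and kubelet configuration, but the output includes <terminal inline>capacity<terminal inline> and <terminal inline>allocatable<terminal inline> sections along these lines (the values here are illustrative, not from a real cluster):

```yaml
# Illustrative excerpt of `kubectl get node my_node -oyaml` output.
status:
  capacity:              # total resources on the node
    cpu: "2"
    ephemeral-storage: 100Gi
    memory: 8Gi
    pods: "110"
  allocatable:           # what's left for pods after system reservations
    cpu: 1930m
    ephemeral-storage: 95Gi
    memory: 7540Mi
    pods: "110"
```

Notice that <terminal inline>allocatable<terminal inline> memory is smaller than <terminal inline>capacity<terminal inline>: the difference is what the node reserves for the operating system and Kubernetes system components.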
Based on this resource, the scheduler decides which pods to run where, and it tries to make sure that none of the nodes in the cluster end up running more pods than they can handle.
When you define a container, you can set two different variables for memory. Whether or not you set these variables and what you set them to can have huge repercussions for your pod.
The first is the requests variable. This tells Kubernetes that this particular container needs, at minimum, this much memory. Kubernetes guarantees that much memory is available on whichever node it places the pod. If you don’t set it, Kubernetes assumes the container needs no memory by default, so it won’t guarantee that your pod lands on a node with enough memory.
The next value you can set is the limit, which is the maximum. The container won’t always need this much memory, but if it asks for more, it can’t go above the limit. Limits can be tricky—when Kubernetes places a pod, it only checks the requests variable.
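In a pod spec, both values are set per container under <terminal inline>resources<terminal inline>. A minimal, hypothetical example:

```yaml
# Hypothetical pod spec showing memory requests and limits.
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: 256Mi   # scheduler guarantees at least this much
      limits:
        memory: 512Mi   # container is killed if it exceeds this
```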
We’ll take a look at how this can contribute to an <terminal inline>OOMKilled<terminal inline> error in a bit.
The <terminal inline>request<terminal inline> and the <terminal inline>limit<terminal inline> are important because they play a big role in how Kubernetes decides which pods to kill when it needs to free up resources. Roughly, pods are killed in this order, from first to last:
- Pods that do not have the limit or the request set
- Pods with no set limit
- Pods that are over memory request but under limit
- Pods using less than requested memory
So What Is OOMKilled?
<terminal inline>OOMKilled<terminal inline> is an error that actually has its origins in Linux. The Linux kernel has a mechanism called the OOM (Out of Memory) Killer that tracks memory usage per process. If the system is in danger of running out of available memory, the OOM Killer steps in and starts killing processes to free up memory and prevent a crash. Its goal is to free up as much memory as possible while killing as few processes as possible.
Under the hood, the OOM Killer assigns each running process a score. The greater the score, the greater the chance the process will be killed. The method it uses to calculate this score is beyond the scope of this tutorial, but it’s good to know that Kubernetes adjusts the score to help decide which pods to kill first.
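For the curious, the kubelet nudges each container's <terminal inline>oom_score_adj<terminal inline> based on the pod's quality-of-service (QoS) class. Below is a rough Python sketch of that policy; the exact constants and clamping live in the kubelet source and may differ slightly, so treat this as an approximation rather than the real implementation:

```python
def oom_score_adj(qos_class: str, memory_request: int, memory_capacity: int) -> int:
    """Approximate the kubelet's oom_score_adj policy per QoS class.

    Guaranteed pods get a strongly negative adjustment (rarely killed);
    BestEffort pods get the maximum (killed first); Burstable pods fall
    in between, scaled by how much memory they request.
    """
    if qos_class == "Guaranteed":
        return -997
    if qos_class == "BestEffort":
        return 1000
    # Burstable: the more memory requested, the lower (safer) the score.
    adj = 1000 - (1000 * memory_request) // memory_capacity
    return max(3, min(adj, 999))  # keep strictly between the other two classes

GiB = 1024 ** 3
print(oom_score_adj("Guaranteed", 0, 8 * GiB))       # -997
print(oom_score_adj("BestEffort", 0, 8 * GiB))       # 1000
print(oom_score_adj("Burstable", 2 * GiB, 8 * GiB))  # 750
```

The takeaway: a Burstable pod that requests a large share of the node's memory ends up with a lower score, making it one of the last Burstable pods to be killed.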
The kubelet running on each node monitors memory consumption. If memory on a node becomes scarce, the kubelet starts killing pods. Essentially, the idea is to preserve the health of the node so that all the pods running on it don’t fail. The needs of the many outweigh the needs of the few, and the few get murdered.
There are two main <terminal inline>OOMKilled<terminal inline> errors you’ll see in Kubernetes:
- OOMKilled: Limit Overcommit
- OOMKilled: Container Limit Reached
Let’s take a look at each one.
OOMKilled Because of Limit Overcommit
Remember that limit variable we talked about? Here is where it can get you into trouble.
The <terminal inline>OOMKilled: Limit Overcommit<terminal inline> error can occur when the sum of pod limits is greater than the available memory on the node. For example, on a node with 8 GB of available memory, you might run eight pods that each request 1 GB of memory. However, if even one of those pods is configured with a limit of, say, 1.5 GB, you run the risk of running out of memory. All it takes is for that one pod to see a spike in traffic or hit an unknown memory leak, and Kubernetes will be forced to start killing pods.
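To make the arithmetic concrete, here's a small Python sketch, using the made-up numbers from the example above, that checks whether the sum of pod limits overcommits a node's allocatable memory:

```python
# Hypothetical node: 8 GiB allocatable, eight pods each requesting 1 GiB,
# but one pod's limit is set to 1.5 GiB.
GiB = 1024 ** 3

allocatable = 8 * GiB
requests = [1 * GiB] * 8                    # sum: 8 GiB
limits = [1 * GiB] * 7 + [int(1.5 * GiB)]   # sum: 8.5 GiB

print(sum(requests) <= allocatable)  # True: the scheduler is satisfied
print(sum(limits) <= allocatable)    # False: if every pod hits its limit,
                                     # the node runs out of memory
```

The scheduler only checks requests, so this placement succeeds, yet the node is overcommitted on limits, which is exactly the window where this error occurs.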
You might also want to check the host itself and see if there are any processes running outside of Kubernetes that could be eating up memory, leaving less for the pods.
OOMKilled Because of Container Limit Reached
While the <terminal inline>Limit Overcommit<terminal inline> error is related to the total amount of memory on the node, <terminal inline>Container Limit Reached<terminal inline> is usually confined to a single pod. When Kubernetes detects a pod using more memory than its set limit, it will kill the pod with the error <terminal inline>OOMKilled—Container Limit Reached<terminal inline>.
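You can confirm this from the pod's status. The output varies, but running <terminal inline>kubectl describe pod<terminal inline> on the affected pod typically shows something like the following (exit code 137 corresponds to the container being killed with SIGKILL):

```
$ kubectl describe pod my-pod
...
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
...
```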
When this happens, check the application logs to try to understand why the pod was using more memory than the set limit. It could be for a number of reasons, such as a spike in traffic or a long-running Kubernetes job that caused it to use more memory than usual.
If during your investigation you find that the application is running as expected and that it just requires more memory to run, you might consider increasing the values for request and limit.
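That usually means editing the container's <terminal inline>resources<terminal inline> block and redeploying. For example, with hypothetical values, doubling both numbers:

```yaml
# Hypothetical adjustment after confirming the app genuinely needs more memory.
resources:
  requests:
    memory: 512Mi   # was 256Mi
  limits:
    memory: 1Gi     # was 512Mi
```

When you raise limits, remember the overcommit scenario above: make sure the new limits still fit within what the node can actually provide.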
Using ContainIQ To Debug OOMKilled
Troubleshooting OOMKilled events manually can get quite tricky. However, using ContainIQ, users can debug the issue faster and learn more about the sequence of issues leading up to the OOMKilled error.
ContainIQ, a Kubernetes native monitoring platform, includes Metrics, Events, and Logs dashboards. Together, these three features can be quite effective in determining what led up to pods running out of memory and then subsequently being killed. As a starting point, ContainIQ is quite useful for tracking node and pod memory limits, as well as tracking limits of pods scheduled on a given node and that node's conditions:
By clicking Show Limits on the Metrics dashboard, users can see memory usage for pods and nodes alongside the set limits. This can help you identify pods without limits and set more appropriate ones so that you don’t encounter OOMKilled events in the future. Users can also filter and sort for given pods and nodes by name. You can also set alerts on memory spikes or on OOMKilled-related events so that you can catch spikes early or react appropriately when pods are killed because of memory issues.
Users can also use the Events dashboard to view related events before and after an OOMKilled error occurs. Pod restarts and evictions, for example, are often related to memory issues. And using ContainIQ, users can click directly from the Events dashboard to the Logs dashboard and see the pod-level logs at that point in time—for example, clicking from a pod eviction event to the logs at that given moment. This is particularly helpful for debugging the issues that may have caused the OOMKilled error, among others.
You can sign up for ContainIQ here, or book a demo to learn more.
In this article, we took a closer look at the Kubernetes <terminal inline>OOMKilled<terminal inline> error, an error that has its origins in the Linux OOM Killer, which Kubernetes leverages to manage memory and to decide which pods to kill when resources run low. Don’t forget the two flavors of the <terminal inline>OOMKilled<terminal inline> error: Container Limit Reached and Limit Overcommit. Understanding both can go a long way toward successful troubleshooting and help you minimize running into the error in the future.