Troubleshooting Kubernetes FailedAttachVolume and FailedMount

November 30, 2021

When working with Persistent Volumes in Kubernetes, you might run into the FailedAttachVolume or FailedMount error. In this tutorial, we’ll show you how to troubleshoot these errors, find their root cause, and fix them.

Ricardo Castro
Senior Site Reliability Engineer

Simple Kubernetes workloads can fail and be restarted by the kubelet in a clean state without any problem. Nontrivial workloads (for example, those whose containers need to persist state or share files with other containers) need a way to recover their previous state whenever they restart.

Persistent Volumes provide an API that allows Kubernetes administrators to manage volumes in a safe and abstracted way, without needing to understand the nitty-gritty of different storage providers. They also give Pods a convenient way to store the state they need to perform their tasks.

When working with Persistent Volumes, two common errors are FailedAttachVolume and FailedMount. They generally mean that Kubernetes failed to use the desired volume, which in turn prevents workloads from functioning as intended.

Since an underlying volume can malfunction for many different reasons, you need to dig deeper to find the root cause. In this article, you will learn how to troubleshoot the incident when you see these errors.

Understanding Persistent Volumes

Persistent Volumes are storage resources, created dynamically or statically by administrators, that are managed just like any other Kubernetes resource. Each one has its own life cycle, independent of any individual Pod that uses it. This independence matters because a strict dependency between a Pod and a Persistent Volume would prevent normal workload operation.

Once a Persistent Volume object is created, an underlying disk is also created, which, in turn, is attached to the scheduled node and, consequently, mounted on the desired path. When the workload needs to move somewhere else in the cluster, the reverse process occurs: the volume is unmounted and detached from the node, and then attached and mounted at its new destination.

When working with dynamically provisioned volumes in cloud environments (e.g., AWS, Azure, or Google Cloud Platform), it’s not uncommon for Persistent Volume life cycles to be broken, preventing the underlying disk from being correctly detached and attached. This will prevent correct workload scheduling, potentially causing downtime or data loss.

Troubleshooting the Error

The Persistent Volume life cycle can be broken for a number of reasons:

  • node failure
  • underlying service API call failure
  • network partition
  • incorrect access mode (e.g., a ReadWriteOnce volume requested by Pods on more than one node)
  • new node already has too many disks attached
  • new node does not have enough mount points

These issues usually manifest themselves through Pods failing to start and becoming stuck in an endless waiting loop. To help diagnose the issue, you’ll need to <terminal inline>describe<terminal inline> a Pod and try to understand what’s going on:


sh
~ kubectl describe pod prometheus-ksszs
Name:       prometheus-ksszs
Namespace:  monitoring
Status:     Pending
Containers:
  prometheus:
    State:   Waiting
      Reason:  ContainerCreating
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  prometheus-db:
    Type:  azureDisk
    Name:  prometheus-db
Events:
Type    Reason              Age   From     Message
----    ------              ----  ----     -------
Warning FailedAttachVolume  11m   kubelet  FailedAttachVolume Multi-Attach error for volume "pvc-8f40a2f7-1tr3-22u8-a18a-01244r1333cc" Volume is already exclusively attached to one node and can’t be attached to another
Warning FailedMount         11m   kubelet  Unable to mount volumes for pod "prometheus-ksszs":  timeout expired waiting for volumes to attach/mount for pod "prometheus-ksszs".

Under [Events](https://www.containiq.com/post/kubernetes-events), you’ll find a series of messages related to the Pod’s life cycle that can help you diagnose the issue.

The failures can generally be divided into two main categories. On one side, there are detach failures, where Kubernetes is unable to detach a disk from a specific node. On the other side, there are attach and mount failures, where Kubernetes can’t attach and/or mount a disk on the new node.
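When a Pod has a long event history, it can help to pull out only the warnings. The following is a minimal sketch that assumes kubectl is already configured for the cluster; the function name <terminal inline>pod_warning_events<terminal inline> is hypothetical, and the namespace and Pod name in the usage line are taken from the example output above.

```shell
# Sketch: list only the Warning events recorded for a single Pod.
# pod_warning_events is a hypothetical helper name, not part of kubectl.
pod_warning_events() {
  local namespace="$1"
  local pod="$2"
  # Field selectors narrow the event list to one Pod and to Warning events.
  kubectl get events \
    --namespace "$namespace" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=${pod},type=Warning" \
    --sort-by=.lastTimestamp
}

# Usage (values from the example above):
#   pod_warning_events monitoring prometheus-ksszs
```

Sorting by the last timestamp keeps the most recent attach or mount failure at the bottom of the output, which is usually the one worth reading first.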

FailedAttachVolume

FailedAttachVolume occurs when a volume cannot be detached from its current node and attached to a new one. When Kubernetes performs the detach and attach operation, it first checks whether the volume is safe to detach and aborts the operation if the check fails. Kubernetes also does not force detach any volume. In the example above, the message <terminal inline>Volume is already exclusively attached to one node and can’t be attached to another<terminal inline> confirms that the disk is still attached to its previous node, which usually points to a failure in the underlying storage infrastructure. There can be other causes (for example, too many disks attached to a node), but these will also be shown in the event message.
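To see which node Kubernetes still believes is holding the volume, you can inspect the attached volumes reported in each node's status. This is a rough sketch: the helper name <terminal inline>nodes_with_volume<terminal inline> is hypothetical, and it assumes the Persistent Volume name appears somewhere in the attached volume's identifier, which depends on the storage driver in use.

```shell
# Sketch: show the nodes whose status lists an attached volume matching a pattern.
# nodes_with_volume is a hypothetical helper name, not part of kubectl.
nodes_with_volume() {
  local pattern="$1"
  # .status.volumesAttached holds the volumes the cluster considers attached to each node.
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,ATTACHED:.status.volumesAttached[*].name' |
    awk -v pat="$pattern" 'NR == 1 || index($0, pat) > 0'
}

# Usage (volume name from the example event above):
#   nodes_with_volume pvc-8f40a2f7-1tr3-22u8-a18a-01244r1333cc
```

If the volume is listed on a node where no Pod is using it anymore, the detach step is the part that failed.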

FailedMount

FailedMount means a volume can’t be mounted on a specific path. It can be a consequence of the previous error, since the mount operation happens after attach: because the attach operation fails, the mount timeout expires and the mount never takes place. Other possible causes include an incorrect device path or device mount path.

Recovering from the Failure

Since Kubernetes can’t always recover from the FailedAttachVolume and FailedMount errors on its own, sometimes you have to take manual steps.

Failure to Detach

When Kubernetes fails to detach a disk, you can use the storage provider’s CLI or API to detach it manually. For example, when using Azure, you can detach a disk from a virtual machine by running this code:


powershell
$VirtualMachine = Get-AzVM `
 -ResourceGroupName "myResourceGroup" `
 -Name "myVM"
Remove-AzVMDataDisk `
 -VM $VirtualMachine `
 -Name "myDisk"
Update-AzVM `
 -ResourceGroupName "myResourceGroup" `
 -VM $VirtualMachine

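When using AWS EBS volumes, you can perform the same operation by running the command below. Note that <terminal inline>--force<terminal inline> can lead to data loss or a corrupted file system, so try the command without it first and force the detachment only as a last resort: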


sh
~ aws ec2 detach-volume --volume-id <my-volume-id> --force
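Before and after detaching, it is worth checking what the provider itself reports for the disk. The sketch below does this for an EBS volume, assuming the AWS CLI is installed and configured; the function name <terminal inline>volume_attachments<terminal inline> is hypothetical and the volume ID is a placeholder.

```shell
# Sketch: print the instance, device, and attachment state AWS reports for a volume.
# volume_attachments is a hypothetical helper name, not part of the AWS CLI.
volume_attachments() {
  local volume_id="$1"
  # An empty result means AWS reports no attachments for the volume.
  aws ec2 describe-volumes \
    --volume-ids "$volume_id" \
    --query 'Volumes[].Attachments[].[InstanceId,Device,State]' \
    --output text
}

# Usage:
#   volume_attachments <my-volume-id>
```

An attachment that stays in the detaching state for a long time is a sign that the volume is still in use on the instance.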

Failure to Attach or Mount

There may be situations when Kubernetes can detach the volume but is unable to attach or mount the disk on the scheduled node. In this situation, the easiest way to overcome the issue is to force Kubernetes to schedule the workload on another node. This can be done in a few different ways.

Cordon

Cordon marks a node as unschedulable, which means that the Kubernetes scheduler will not consider it when placing new Pods. Pods already running on the node are not affected. Let’s say you have a Pod scheduled to <terminal inline>node-2<terminal inline>, but it’s unable to start because the node doesn’t have enough mount points available. The node can be cordoned using kubectl:


sh
~ kubectl cordon node-2

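Then delete the Pod so that its controller (for example, a Deployment or StatefulSet) recreates it and the scheduler places the replacement on another node: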


sh
~ kubectl delete pod <my-pod>
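A cordoned node stays unschedulable until it is explicitly uncordoned. The sketch below shows one way to confirm where the replacement Pod landed and then return the node to service once the underlying problem has been resolved. The function name <terminal inline>verify_and_uncordon<terminal inline> and the <terminal inline>app=prometheus<terminal inline> label selector are hypothetical.

```shell
# Sketch: show which node each matching Pod runs on, then make the node schedulable again.
# verify_and_uncordon is a hypothetical helper name, not part of kubectl.
verify_and_uncordon() {
  local node="$1"
  local namespace="$2"
  local selector="$3"
  # The wide output includes a NODE column for each Pod.
  kubectl get pods --namespace "$namespace" --selector "$selector" -o wide
  # Reverse the earlier cordon so the scheduler can use the node again.
  kubectl uncordon "$node"
}

# Usage (the label selector is an assumption about how the Pods are labeled):
#   verify_and_uncordon node-2 monitoring app=prometheus
```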

Node Selectors, Affinity, and Anti-Affinity

Node selectors, affinity, and anti-affinity tell Kubernetes which nodes Pods should or should not be scheduled on. Nodes carry labels, and those labels are referenced in <terminal inline>nodeSelector<terminal inline> as well as in <terminal inline>affinity<terminal inline> and <terminal inline>anti-affinity<terminal inline> rules to force Pods to be scheduled accordingly.

The simplest mechanism is <terminal inline>nodeSelector<terminal inline>, where a node is assigned a label and the Pod is configured with a matching node selector. For example, if you are sure <terminal inline>node-1<terminal inline> can have another disk attached and has enough mount points available, you can run this command:


sh
kubectl label nodes node-1 schedule=nginx

You can then configure the Pod with the <terminal inline>schedule=nginx<terminal inline> node selector:


yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  # Only nodes labeled schedule=nginx are eligible for this Pod.
  nodeSelector:
    schedule: nginx
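After applying the manifest, you can confirm that the label and the selector line up. This is a small sketch using the label and Pod name from the example; the function name <terminal inline>check_node_selector_placement<terminal inline> is hypothetical.

```shell
# Sketch: list the nodes carrying a label, then show which node the Pod was placed on.
# check_node_selector_placement is a hypothetical helper name, not part of kubectl.
check_node_selector_placement() {
  local label="$1"
  local pod="$2"
  # Nodes that match the label are the only candidates for the Pod.
  kubectl get nodes --selector "$label"
  # The wide output includes the NODE column for the Pod.
  kubectl get pod "$pod" -o wide
}

# Usage (values from the example above):
#   check_node_selector_placement schedule=nginx nginx
```

If no node matches the label, the Pod stays in the Pending state, so an empty first listing is the first thing to rule out.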

Final Thoughts

Persistent Volumes provide an abstraction that allows Kubernetes workloads to easily provision persistent storage that can survive restarts and scheduling to different nodes. Sometimes the Persistent Volume life cycle is broken and Kubernetes can’t perform rescheduling on its own. FailedAttachVolume and FailedMount are two common errors in this situation that mean Kubernetes is unable to detach, reattach, and mount a volume.

When this happens, you may need to manually detach a disk or instruct the Kubernetes scheduler to start the Pod on a specific node.

The first step to fixing any issue is to understand it. Unless you are proactively alerted, you’ll have to spend time finding the root cause, and that time adds to the downtime already accumulating or, even worse, increases the risk of data loss.

ContainIQ can give you a hand by monitoring your Kubernetes cluster and alerting on events whenever an error like FailedAttachVolume or FailedMount happens, making it easier to fix and perhaps paving the way for developing automated self-healing capabilities.

Article by

Ricardo Castro is a Senior Site Reliability Engineer at FARFETCH, as well as a Certified Kubernetes Administrator (CKA) and Certified Kubernetes Application Developer (CKAD). He has a M.Sc. in Parallel and Distributed Systems from Universidade do Porto. Ricardo works daily to build high-performance, reliable, and scalable systems. He is also the DevOps Porto Meetup co-organizer and DevOpsDays Portugal co-organizer.
