
CSI volume reconstruction does not work for ephemeral volumes #79980

Closed
jsafrane opened this issue Jul 10, 2019 · 29 comments · Fixed by #108997
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@jsafrane
Member

jsafrane commented Jul 10, 2019

When a pod is marked as deleted while kubelet is down / being restarted, the newly started kubelet does not clean up the pod's CSI filesystem volumes.

The newly started kubelet tries to reconstruct the volume using CSI's ConstructVolumeSpec function. This part appears to work; the CSI volume plugin loads its JSON file.

But then VolumeManager checks whether the volume is still mounted in the /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d/mount directory.

There are two issues:

  1. CSI does not require volumes to be presented as mounts; they can be just directories with files in them. This will be the case for most in-line volumes.

  2. Even if the CSI driver uses a mount, kubelet mounts it into /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d/mount, so checking /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d does not make sense (see the path sketch below).
    Kubelet checks the right directory given by GetPath().
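
For reference, a minimal sketch of the two directories involved, built from the paths quoted above. Treating the "mount" subdirectory as the NodePublish target returned by GetPath() is my assumption for illustration, not something confirmed in this issue.

package main

import (
	"fmt"
	"path/filepath"
)

// csiVolumeDir is the per-pod volume directory referenced in this issue.
func csiVolumeDir(kubeletDir, podUID, specVolID string) string {
	return filepath.Join(kubeletDir, "pods", podUID, "volumes", "kubernetes.io~csi", specVolID)
}

// csiPublishTarget is the "mount" subdirectory, assumed here to be what GetPath() points at.
func csiPublishTarget(kubeletDir, podUID, specVolID string) string {
	return filepath.Join(csiVolumeDir(kubeletDir, podUID, specVolID), "mount")
}

func main() {
	const (
		podUID = "9440e7e5-d454-4555-84b7-d72e43ec4b3a"
		volID  = "pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d"
	)
	fmt.Println("volume directory:", csiVolumeDir("/var/lib/kubelet", podUID, volID))
	fmt.Println("publish target:  ", csiPublishTarget("/var/lib/kubelet", podUID, volID))
}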

@jsafrane jsafrane added the kind/bug Categorizes issue or PR as related to a bug. label Jul 10, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 10, 2019
@jsafrane
Member Author

/sig storage

related to #79896

cc @jingxu97 @vladimirvivien @msau42

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 10, 2019
@jsafrane
Member Author

Maybe a stupid idea... Reconciler sync has found a directory in /var/lib/kubelet/pods/<pod UID>/volumes/<plugin name>/<volume name>. Reconciler has no way to check whether that CSI volume (or any other volume) is really mounted / published there. Depending on DSW:

  • The volume is in DSW -> from the comment here it seems that SetUp (=NodePublish) will be called on the volume:

if volumeInDSW {
	// Some pod needs the volume. And it exists on disk. Some previous
	// kubelet must have created the directory, therefore it must have
	// reported the volume as in use. Mark the volume as in use also in
	// this new kubelet so reconcile() calls SetUp and re-mounts the
	// volume if it's necessary.
	volumeNeedReport = append(volumeNeedReport, reconstructedVolume.volumeName)
	klog.V(4).Infof("Volume exists in desired state (volume.SpecName %s, pod.UID %s), marking as InUse", volume.volumeSpecName, volume.podName)
	continue

  • The volume is not in DSW and it will be added to ASW, so it's unmounted / unpublished in the next sync:

klog.V(2).Infof(
	"Reconciler sync states: could not find pod information in desired state, update it in actual state: %+v",
	reconstructedVolume)
volumesNeedUpdate[reconstructedVolume.volumeName] = reconstructedVolume

In both cases, we can expect that the volume plugin / CSI driver is idempotent and SetUp() won't do anything if the volume is already set up / TearDown() will not do anything if the volume has been torn down already.

Are our volume plugins really idempotent? IMO they are. We perhaps don't need to check for the presence of a mount point anywhere; the presence of the directory /var/lib/kubelet/pods/<pod UID>/volumes/<plugin name>/<volume name> should be enough to reconstruct the volume and set it up / tear it down.
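
A minimal sketch of that idea, only checking that the per-pod volume directory exists; the kubelet root and the placeholder names are assumptions for illustration, not the actual reconciler code.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// volumeDirExists reports whether the per-pod volume directory is present,
// without caring whether anything is actually mounted there.
func volumeDirExists(kubeletDir, podUID, pluginName, volumeName string) (bool, error) {
	dir := filepath.Join(kubeletDir, "pods", podUID, "volumes", pluginName, volumeName)
	info, err := os.Stat(dir)
	if os.IsNotExist(err) {
		return false, nil // nothing to reconstruct
	}
	if err != nil {
		return false, err
	}
	return info.IsDir(), nil
}

func main() {
	ok, err := volumeDirExists("/var/lib/kubelet", "<pod UID>", "kubernetes.io~csi", "<volume name>")
	fmt.Println(ok, err)
}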

Adding @gnufied to the loop.

@jsafrane
Member Author

Filed #79980, PTAL

@msau42
Copy link
Member

msau42 commented Jul 10, 2019

I'm curious how the reconstruction e2e tests passed

@jsafrane
Member Author

I'm curious how the reconstruction e2e tests passed

They start kubelet with a pod already deleted. If the pod still exists, volume reconstruction can see it in DSW and does nothing when reconstructVolume fails:

reconstructedVolume, err := rc.reconstructVolume(volume)
if err != nil {
	if volumeInDSW {
		// Some pod needs the volume, don't clean it up and hope that
		// reconcile() calls SetUp and reconstructs the volume in ASW.
		klog.V(4).Infof("Volume exists in desired state (volume.SpecName %s, pod.UID %s), skip cleaning up mounts", volume.volumeSpecName, volume.podName)
		continue

[Reconstruction fails because it checks for a mount in the wrong directory.]

In this case, kubelet waits for the volume to appear in VolumesInUse before calling SetUp(). The pod is deleted before that happens, so kubelet just deletes the DSW entry and the volume is never cleaned up.

mkimuram added a commit to mkimuram/kubernetes that referenced this issue Jul 19, 2019
@jsafrane
Member Author

I'm curious how the reconstruction e2e tests passed

They pass because their containers do not handle SIGTERM and it takes 30 seconds to kill them, so kubelet has enough time to SetUp the volume during reconciliation.

@jsafrane
Member Author

It turns out that I tested with a broken version of #80743; reconstruction is broken only for ephemeral volumes that don't use a mount.

@jingxu97
Copy link
Contributor

jingxu97 commented Jul 30, 2019 via email

@jsafrane
Member Author

Yes, sorry again about the noise.

@jsafrane jsafrane changed the title CSI volume reconstruction does not work CSI volume reconstruction does not work for ephemeral volumes Jul 31, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2019
@msau42
Member

msau42 commented Nov 27, 2019

/remove-lifecycle stale
cc @pohly

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 27, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 25, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 28, 2020
@pohly
Contributor

pohly commented Aug 3, 2020

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 3, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 1, 2020
@pohly
Contributor

pohly commented Nov 2, 2020

/remove-lifecycle rotten

@pohly
Contributor

pohly commented Nov 2, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 2, 2020
@verult
Contributor

verult commented Jul 28, 2021

From #103651: this issue surfaced through e2e tests using the hostPath CSI driver. The cause is IsLikelyNotMountPoint() (which is used in CheckVolumeExistenceOperation() for reconstruction): it doesn't play well with volumes that bind-mount from a directory instead of a device.
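
For context, a minimal sketch of the device-number heuristic this kind of check is generally described as using (my simplification, not the actual mount-utils code): a bind mount taken from a directory on the same filesystem keeps the parent's device number, so the heuristic reports "not a mount point" even though it is one.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// isLikelyNotMountPoint compares the device number of path with its parent.
// Bind mounts within the same filesystem defeat this heuristic.
func isLikelyNotMountPoint(path string) (bool, error) {
	stat, err := os.Lstat(path)
	if err != nil {
		return true, err
	}
	parentStat, err := os.Lstat(filepath.Dir(path))
	if err != nil {
		return true, err
	}
	// Same device as the parent => likely not a mount point.
	return stat.Sys().(*syscall.Stat_t).Dev == parentStat.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
	notMnt, err := isLikelyNotMountPoint("/var/lib/kubelet/pods/<pod UID>/volumes/kubernetes.io~csi/<volume name>/mount")
	fmt.Println(notMnt, err)
}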

@pohly
Contributor

pohly commented Sep 9, 2021

To work around this bug, volume lifecycle checking was disabled for all tests, not just the subpath test:
f1e1f3a#r56140970

This check is useful and should be enabled again.

@dbgoytia
Contributor

dbgoytia commented Oct 4, 2021

Hey everyone! I arrived at this conversation by investigating issue #105242 (which seems to me like a duplicate of #103651). I'm wondering if there's something I can help with? I'm still a bit new to the codebase, though.

I was able to reproduce the problem by using the following ginkgo focus command in the e2e framework:

./hack/ginkgo-e2e.sh --provider=gce --ginkgo.focus="CSI.Volumes.*csi-hostpath*.*Dynamic.PV.*default.fs.*should.unmount.if.pod.is.force.deleted.while.kubelet.is.down" --storage.testdriver=/home/build/oss/refactor/test-linux.yaml

And I do see that if we manually add --check-volume-lifecycle=false to the statefulset/csi-hostpathplugin, the tests pass correctly.

{"msg":"PASSED [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is force deleted while kubelet is down [Disruptive][Slow][LinuxOnly]","total":1,"completed":1,"skipped":6322,"failed":0}

I also see that during the process something in the cleanup function doesn't correctly delete the PVs after the tests are done, so they still show up in kubectl get pv.

@dbgoytia
Contributor

dbgoytia commented Oct 6, 2021

I think I found a bit more info on the issue @jsafrane, which could probably be of some help. I think the volume is not being correctly marked as unmounted inside operation_generator.go in GenerateUnmountVolumeFunc. I'm not sure why, but I'll continue to investigate.

journalctl -u kubelet | grep 'UnmountVolume.TearDown succeeded for volume "pvc-cfed9346-6b91-4a2a-9fdd-67f81' -A10
Oct 06 18:45:22 e2e-test-build-minion-group-j4tc kubelet[4635]: I1006 18:45:22.616963    4635 operation_generator.go:867] UnmountVolume.TearDown succeeded for volume "pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea" (OuterVolumeSpecName: "") pod "e438ee68-674a-4ab9-9943-ae63a3fe5a3f" (UID: "e438ee68-674a-4ab9-9943-ae63a3fe5a3f"). InnerVolumeSpecName "pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea". PluginName "kubernetes.io/csi", VolumeGidValue ""
Oct 06 18:45:22 e2e-test-build-minion-group-j4tc kubelet[4635]: I1006 18:45:22.616993    4635 operation_generator.go:880] "DEBUG" PodName=e438ee68-674a-4ab9-9943-ae63a3fe5a3f VolumeName=pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea
Oct 06 18:45:22 e2e-test-build-minion-group-j4tc kubelet[4635]: E1006 18:45:22.617017    4635 operation_generator.go:883] UnmountVolume.MarkVolumeAsUnmounted failed for volume "" (UniqueName: "pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea") pod "e438ee68-674a-4ab9-9943-ae63a3fe5a3f" (UID: "e438ee68-674a-4ab9-9943-ae63a3fe5a3f") : no volume with the name "pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea" exists in the list of attached volumes

I think this error is coming from DeletePodFromVolume in actual_state_of_world.go

@dbgoytia
Contributor

dbgoytia commented Oct 7, 2021

I added a couple of extra debug lines there in the actual_state_of_world.go file, and I can see that when DeletePodFromVolume tries to see if the volume exists, it doesn't find it in the asw.attachedVolumes map, like so:

func (asw *actualStateOfWorld) DeletePodFromVolume(
	podName volumetypes.UniquePodName, volumeName v1.UniqueVolumeName) error {
	asw.Lock()
	defer asw.Unlock()

	volumeObj, volumeExists := asw.attachedVolumes[volumeName]
	klog.InfoS("DEBUG:", "volumeName", volumeName)
	klog.InfoS("DEBUG:", "volumeObj", volumeObj, "volumeExists", volumeExists, "asw.attachedVolumes", asw.attachedVolumes)
	klog.Info("DEBUG:", "volumeExists",volumeExists)
	if !volumeExists {
		return fmt.Errorf(
			"no volume with the name %q exists in the list of attached volumes",
			volumeName)
	}

	_, podExists := volumeObj.mountedPods[podName]
	if podExists {
		delete(asw.attachedVolumes[volumeName].mountedPods, podName)
	}

	return nil
}
Oct 07 00:04:30 e2e-test-build-minion-group-klvd kubelet[4740]: I1007 00:04:30.677329    4740 actual_state_of_world.go:662] "DEBUG:" volumeName=pvc-295aa817-8cdd-4dac-818a-6c81b79778f5
Oct 07 00:04:30 e2e-test-build-minion-group-klvd kubelet[4740]: I1007 00:04:30.679563    4740 actual_state_of_world.go:663] "DEBUG:" volumeObj={volumeName: mountedPods:map[] spec:<nil> pluginName: pluginIsAttachable:false deviceMountState: devicePath: deviceMountPath: volumeInUseErrorForExpansion:false} volumeExists=false

Probably we should search for the name of the attached volume differently for ephemeral volumes?

@dobsonj
Member

dobsonj commented Feb 2, 2022

/assign

dobsonj added a commit to dobsonj/kubernetes that referenced this issue Mar 25, 2022
@dobsonj
Copy link
Member

dobsonj commented Mar 25, 2022

There are two problems:

  1. The IsLikelyNotMountPoint() call that verult already called out, which is known not to work for bind mounts. The mountpoint never gets found, so volume reconstruction fails with "... is not mounted", the PVC never gets added to the list of attached volumes, and that's why we end up hitting this error:
E0324 22:56:00.476853 2430927 operation_generator.go:856] UnmountVolume.MarkVolumeMountAsUncertain failed for volume "" (UniqueName: "pvc-1787a6c7-ccd3-44a1-b6ad-474da06c5e2b") pod "bf5d71e6-da6a-448a-bab2-f35b2a938d45" (UID: "bf5d71e6-da6a-448a-bab2-f35b2a938d45") : no volume with the name "pvc-1787a6c7-ccd3-44a1-b6ad-474da06c5e2b" exists in the list of attached volumes
  2. CSI inline volumes hit a different error (in addition to the first problem).
I0323 22:28:34.117352   76056 reconciler.go:388] "Could not construct volume information, cleaning up mounts" podName=55703bf4-9ee4-4fd5-8f4c-1e20f82391e4 volumeSpecName="my-csi-volume" err="failed to GetVolumeName from volumePlugin for volumeSpec \"my-csi-volume\" err=kubernetes.io/csi: plugin.GetVolumeName failed to extract volume source from spec: unexpected api.CSIVolumeSource found in volume.Spec"

The call path looks like this (starting in reconcile.go):

reconstructVolume > GetUniqueVolumeNameFromSpec > GetVolumeName > getPVSourceFromSpec

Inline volumes do not have a PV spec, so it fails here. We should be calling GetUniqueVolumeNameFromSpecWithPod() instead.
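
A rough sketch of the difference between the two helpers, with purely illustrative name formats (the real functions live in pkg/volume/util and may format names differently): a PV-backed volume gets its unique name from the plugin name plus whatever the plugin extracts from the PV spec, while an inline (ephemeral) volume has no PV, so the pod UID plus the in-pod volume name has to be used instead.

package main

import "fmt"

// uniqueNameFromPV: illustrative format for a PV-backed volume, derived from
// the plugin name and the name the plugin extracts from the PV spec.
func uniqueNameFromPV(pluginName, volumeName string) string {
	return fmt.Sprintf("%s/%s", pluginName, volumeName)
}

// uniqueNameForInline: illustrative format for an inline volume, derived from
// the pod UID and the volume's name inside the pod spec.
func uniqueNameForInline(pluginName, podUID, specName string) string {
	return fmt.Sprintf("%s/%s-%s", pluginName, podUID, specName)
}

func main() {
	fmt.Println(uniqueNameFromPV("kubernetes.io/csi", "pvc-1787a6c7-ccd3-44a1-b6ad-474da06c5e2b"))
	fmt.Println(uniqueNameForInline("kubernetes.io/csi", "55703bf4-9ee4-4fd5-8f4c-1e20f82391e4", "my-csi-volume"))
}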

@mkimuram
Contributor

  1. The IsLikelyNotMountPoint() call that verult already called out, which is known not to work for bind mounts. The mountpoint never gets found, so volume reconstruction fails with "... is not mounted", the PVC never gets added to the list of attached volumes, and that's why we end up hitting this error:

There is ongoing work that changes IsNotMountPoint to use the openat2(2) syscall to detect mount points via MountedFast. With openat2(2), bind mounts are detected properly and quickly, but it requires kernel v5.6 or later. So we would also be able to use openat2(2) in IsLikelyNotMountPoint. Then the bind-mount issue can be resolved at least for kernel v5.6+, and we can focus on how to solve this issue for old kernels.
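
A minimal sketch of that approach as I understand it, assuming golang.org/x/sys/unix on a Linux 5.6+ kernel (this is not the actual mount-utils implementation): resolving the last path component from its parent with RESOLVE_NO_XDEV fails with EXDEV exactly when the component crosses a mount boundary.

package main

import (
	"errors"
	"fmt"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// isMountPointFast returns true if path is a mount point, detected via
// openat2(2) with RESOLVE_NO_XDEV (requires Linux >= 5.6).
func isMountPointFast(path string) (bool, error) {
	parentFD, err := unix.Open(filepath.Dir(path), unix.O_PATH|unix.O_CLOEXEC, 0)
	if err != nil {
		return false, err
	}
	defer unix.Close(parentFD)

	how := unix.OpenHow{
		Flags:   unix.O_PATH | unix.O_CLOEXEC,
		Resolve: unix.RESOLVE_NO_XDEV, // fail if resolution crosses a mount boundary
	}
	fd, err := unix.Openat2(parentFD, filepath.Base(path), &how)
	if err != nil {
		if errors.Is(err, unix.EXDEV) {
			return true, nil // crossed into another mount: path is a mount point
		}
		return false, err
	}
	unix.Close(fd)
	return false, nil
}

func main() {
	mounted, err := isMountPointFast("/var/lib/kubelet/pods/<pod UID>/volumes/kubernetes.io~csi/<volume name>/mount")
	fmt.Println(mounted, err)
}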

dobsonj added a commit to dobsonj/kubernetes that referenced this issue Jun 1, 2022
muyangren2 pushed a commit to muyangren2/kubernetes that referenced this issue Jul 14, 2022
dobsonj added a commit to dobsonj/kubernetes that referenced this issue Oct 25, 2022
dobsonj added a commit to dobsonj/kubernetes that referenced this issue Oct 25, 2022