
CSI volume reconstruction does not work for ephemeral volumes #79980

Closed
jsafrane opened this issue Jul 10, 2019 · 29 comments · Fixed by #108997
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@jsafrane
Member

jsafrane commented Jul 10, 2019

When a pod is marked as deleted while kubelet is down / being restarted, the newly started kubelet does not clean up the pod's CSI filesystem volumes.

The newly started kubelet tries to reconstruct the volume using CSI's ConstructVolumeSpec function. This part appears to work; the CSI volume plugin loads its JSON file.

But then VolumeManager checks whether the volume is still mounted in the /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d/mount directory.

There are two issues:

  1. CSI does not require volumes to be presented as mounts; they can be just directories with files in them. This will be the case for most in-line volumes.

  2. Even if the CSI driver uses a mount, kubelet mounts it into /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d/mount, so checking /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d does not make sense (see the path sketch below).
    Kubelet checks the right directory given by GetPath().
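
For reference, a minimal sketch of the two directories involved, built from the paths quoted above. Treating the "mount" subdirectory as the NodePublish target returned by GetPath() is my assumption for illustration, not something confirmed in this issue.

package main

import (
	"fmt"
	"path/filepath"
)

// csiVolumeDir is the per-pod volume directory referenced in this issue.
func csiVolumeDir(kubeletDir, podUID, specVolID string) string {
	return filepath.Join(kubeletDir, "pods", podUID, "volumes", "kubernetes.io~csi", specVolID)
}

// csiPublishTarget is the "mount" subdirectory, assumed here to be what GetPath() points at.
func csiPublishTarget(kubeletDir, podUID, specVolID string) string {
	return filepath.Join(csiVolumeDir(kubeletDir, podUID, specVolID), "mount")
}

func main() {
	const (
		podUID = "9440e7e5-d454-4555-84b7-d72e43ec4b3a"
		volID  = "pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d"
	)
	fmt.Println("volume directory:", csiVolumeDir("/var/lib/kubelet", podUID, volID))
	fmt.Println("publish target:  ", csiPublishTarget("/var/lib/kubelet", podUID, volID))
}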

@jsafrane jsafrane added the kind/bug Categorizes issue or PR as related to a bug. label Jul 10, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 10, 2019
@jsafrane
Member Author

/sig storage

related to #79896

cc @jingxu97 @vladimirvivien @msau42

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 10, 2019
@jsafrane
Member Author

Maybe a stupid idea... Reconciler sync has found a directory in /var/lib/kubelet/pods/<pod UID>/volumes/<plugin name>/<volume name>. Reconciler has no way to check whether that CSI volume (or any other volume) is really mounted / published there. Depending on DSW:

  • The volume is in DSW -> from the comment here it seems that SetUp (=NodePublish) will be called on the volume:

if volumeInDSW {
	// Some pod needs the volume. And it exists on disk. Some previous
	// kubelet must have created the directory, therefore it must have
	// reported the volume as in use. Mark the volume as in use also in
	// this new kubelet so reconcile() calls SetUp and re-mounts the
	// volume if it's necessary.
	volumeNeedReport = append(volumeNeedReport, reconstructedVolume.volumeName)
	klog.V(4).Infof("Volume exists in desired state (volume.SpecName %s, pod.UID %s), marking as InUse", volume.volumeSpecName, volume.podName)
	continue

  • The volume is not in DSW and it will be added to ASW, so it's unmounted / unpublished in the next sync:

klog.V(2).Infof(
	"Reconciler sync states: could not find pod information in desired state, update it in actual state: %+v",
	reconstructedVolume)
volumesNeedUpdate[reconstructedVolume.volumeName] = reconstructedVolume

In both cases, we can expect that the volume plugin / CSI driver is idempotent and SetUp() won't do anything if the volume is already set up / TearDown() will not do anything if the volume has been torn down already.

Are our volume plugins really idempotent? IMO they are. We perhaps don't need to check for the presence of a mount point anywhere; the presence of the directory /var/lib/kubelet/pods/<pod UID>/volumes/<plugin name>/<volume name> should be enough to reconstruct the volume and set it up / tear it down.
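
A minimal sketch of that idea, only checking that the per-pod volume directory exists; the kubelet root and the placeholder names are assumptions for illustration, not the actual reconciler code.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// volumeDirExists reports whether the per-pod volume directory is present,
// without caring whether anything is actually mounted there.
func volumeDirExists(kubeletDir, podUID, pluginName, volumeName string) (bool, error) {
	dir := filepath.Join(kubeletDir, "pods", podUID, "volumes", pluginName, volumeName)
	info, err := os.Stat(dir)
	if os.IsNotExist(err) {
		return false, nil // nothing to reconstruct
	}
	if err != nil {
		return false, err
	}
	return info.IsDir(), nil
}

func main() {
	ok, err := volumeDirExists("/var/lib/kubelet", "<pod UID>", "kubernetes.io~csi", "<volume name>")
	fmt.Println(ok, err)
}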

Adding @gnufied to the loop.

@jsafrane
Member Author

Filed #79980, PTAL

@msau42
Copy link
Member

msau42 commented Jul 10, 2019

I'm curious how the reconstruction e2e tests passed

@jsafrane
Member Author

I'm curious how the reconstruction e2e tests passed

They start kubelet with a pod already deleted. If the pod still exists, volume reconstruction can see it in DSW and does nothing when reconstructVolume fails:

reconstructedVolume, err := rc.reconstructVolume(volume)
if err != nil {
	if volumeInDSW {
		// Some pod needs the volume, don't clean it up and hope that
		// reconcile() calls SetUp and reconstructs the volume in ASW.
		klog.V(4).Infof("Volume exists in desired state (volume.SpecName %s, pod.UID %s), skip cleaning up mounts", volume.volumeSpecName, volume.podName)
		continue

[Reconstruction fails because it checks for a mount in the wrong directory.]

In this case, kubelet waits for the volume to appear in VolumesInUse before calling SetUp(). The pod is deleted before that happens, so kubelet just deletes the DSW entry and the volume is never cleaned up.

mkimuram added a commit to mkimuram/kubernetes that referenced this issue Jul 19, 2019
@jsafrane
Member Author

I'm curious how the reconstruction e2e tests passed

They pass because their containers do not handle SIGTERM and it takes 30 seconds to kill them, so kubelet has enough time to SetUp the volume during reconciliation.

@jsafrane
Member Author

It turns out that I tested with a broken version of #80743; reconstruction is broken only for ephemeral volumes that don't use a mount.

@jingxu97
Copy link
Contributor

jingxu97 commented Jul 30, 2019 via email

@jsafrane
Member Author

Yes, sorry again about the noise.

@jsafrane jsafrane changed the title CSI volume reconstruction does not work CSI volume reconstruction does not work for ephemeral volumes Jul 31, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2019
@msau42
Member

msau42 commented Nov 27, 2019

/remove-lifecycle stale
cc @pohly

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 27, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 25, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 28, 2020
@pohly
Contributor

pohly commented Aug 3, 2020

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 3, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 1, 2020
@pohly
Contributor

pohly commented Nov 2, 2020

/remove-lifecycle rotten

@pohly
Contributor

pohly commented Nov 2, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 2, 2020
@verult
Contributor

verult commented Jul 28, 2021

From #103651: this issue surfaced through e2e tests using the hostPath CSI driver. The cause is IsLikelyNotMountPoint() (which is used in CheckVolumeExistenceOperation() for reconstruction): it doesn't play well with volumes that bind-mount from a directory instead of a device.
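
For context, a minimal sketch of the device-number heuristic this kind of check is generally described as using (my simplification, not the actual mount-utils code): a bind mount taken from a directory on the same filesystem keeps the parent's device number, so the heuristic reports "not a mount point" even though it is one.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// isLikelyNotMountPoint compares the device number of path with its parent.
// Bind mounts within the same filesystem defeat this heuristic.
func isLikelyNotMountPoint(path string) (bool, error) {
	stat, err := os.Lstat(path)
	if err != nil {
		return true, err
	}
	parentStat, err := os.Lstat(filepath.Dir(path))
	if err != nil {
		return true, err
	}
	// Same device as the parent => likely not a mount point.
	return stat.Sys().(*syscall.Stat_t).Dev == parentStat.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
	notMnt, err := isLikelyNotMountPoint("/var/lib/kubelet/pods/<pod UID>/volumes/kubernetes.io~csi/<volume name>/mount")
	fmt.Println(notMnt, err)
}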

@pohly
Contributor

pohly commented Sep 9, 2021

To work around this bug, volume lifecycle checking was disabled for all tests, not just the subpath test:
f1e1f3a#r56140970

This check is useful and should be enabled again.

@dbgoytia
Contributor

dbgoytia commented Oct 4, 2021

Hey everyone! I arrived at this conversation by investigating issue #105242 (which seems to me like a duplicate of #103651). I'm wondering if there's something I can help with? I'm still a bit new to the codebase, though.

I was able to reproduce the problem by using the following ginkgo focus command in the e2e framework:

./hack/ginkgo-e2e.sh --provider=gce --ginkgo.focus="CSI.Volumes.*csi-hostpath*.*Dynamic.PV.*default.fs.*should.unmount.if.pod.is.force.deleted.while.kubelet.is.down" --storage.testdriver=/home/build/oss/refactor/test-linux.yaml

And I do see that if we manually add --check-volume-lifecycle=false to the statefulset/csi-hostpathplugin, the tests pass correctly.

{"msg":"PASSED [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is force deleted while kubelet is down [Disruptive][Slow][LinuxOnly]","total":1,"completed":1,"skipped":6322,"failed":0}

I also see that during the process something in the cleanup function doesn't correctly delete the PVs after the tests are done, so they still show up in kubectl get pv.

@dbgoytia
Contributor

dbgoytia commented Oct 6, 2021

I think I found a bit more info on the issue @jsafrane, which could probably be of some help. I think the volume is not being correctly marked as unmounted inside operation_generator.go in GenerateUnmountVolumeFunc. I'm not sure why, but I'll continue to investigate.

journalctl -u kubelet | grep 'UnmountVolume.TearDown succeeded for volume "pvc-cfed9346-6b91-4a2a-9fdd-67f81' -A10
Oct 06 18:45:22 e2e-test-build-minion-group-j4tc kubelet[4635]: I1006 18:45:22.616963    4635 operation_generator.go:867] UnmountVolume.TearDown succeeded for volume "pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea" (OuterVolumeSpecName: "") pod "e438ee68-674a-4ab9-9943-ae63a3fe5a3f" (UID: "e438ee68-674a-4ab9-9943-ae63a3fe5a3f"). InnerVolumeSpecName "pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea". PluginName "kubernetes.io/csi", VolumeGidValue ""
Oct 06 18:45:22 e2e-test-build-minion-group-j4tc kubelet[4635]: I1006 18:45:22.616993    4635 operation_generator.go:880] "DEBUG" PodName=e438ee68-674a-4ab9-9943-ae63a3fe5a3f VolumeName=pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea
Oct 06 18:45:22 e2e-test-build-minion-group-j4tc kubelet[4635]: E1006 18:45:22.617017    4635 operation_generator.go:883] UnmountVolume.MarkVolumeAsUnmounted failed for volume "" (UniqueName: "pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea") pod "e438ee68-674a-4ab9-9943-ae63a3fe5a3f" (UID: "e438ee68-674a-4ab9-9943-ae63a3fe5a3f") : no volume with the name "pvc-cfed9346-6b91-4a2a-9fdd-67f8134ec3ea" exists in the list of attached volumes

I think this error is coming from DeletePodFromVolume in actual_state_of_world.go

@dbgoytia
Contributor

dbgoytia commented Oct 7, 2021

I added a couple of extra debug lines there in the actual_state_of_world.go file, and I can see that when DeletePodFromVolume tries to see if the volume exists, it doesn't find it in the asw.attachedVolumes map, like so:

func (asw *actualStateOfWorld) DeletePodFromVolume(
	podName volumetypes.UniquePodName, volumeName v1.UniqueVolumeName) error {
	asw.Lock()
	defer asw.Unlock()

	volumeObj, volumeExists := asw.attachedVolumes[volumeName]
	klog.InfoS("DEBUG:", "volumeName", volumeName)
	klog.InfoS("DEBUG:", "volumeObj", volumeObj, "volumeExists", volumeExists, "asw.attachedVolumes", asw.attachedVolumes)
	klog.Info("DEBUG:", "volumeExists",volumeExists)
	if !volumeExists {
		return fmt.Errorf(
			"no volume with the name %q exists in the list of attached volumes",
			volumeName)
	}

	_, podExists := volumeObj.mountedPods[podName]
	if podExists {
		delete(asw.attachedVolumes[volumeName].mountedPods, podName)
	}

	return nil
}
Oct 07 00:04:30 e2e-test-build-minion-group-klvd kubelet[4740]: I1007 00:04:30.677329    4740 actual_state_of_world.go:662] "DEBUG:" volumeName=pvc-295aa817-8cdd-4dac-818a-6c81b79778f5
Oct 07 00:04:30 e2e-test-build-minion-group-klvd kubelet[4740]: I1007 00:04:30.679563    4740 actual_state_of_world.go:663] "DEBUG:" volumeObj={volumeName: mountedPods:map[] spec:<nil> pluginName: pluginIsAttachable:false deviceMountState: devicePath: deviceMountPath: volumeInUseErrorForExpansion:false} volumeExists=false

Probably we should search for the name of the attached volume differently for ephemeral volumes?

@dobsonj
Member

dobsonj commented Feb 2, 2022

/assign

dobsonj added a commit to dobsonj/kubernetes that referenced this issue Mar 25, 2022
@dobsonj
Copy link
Member

dobsonj commented Mar 25, 2022

There are two problems:

  1. The IsLikelyNotMountPoint() call that verult already called out, which is known not to work for bind mounts. The mountpoint never gets found, so volume reconstruction fails with "... is not mounted", the PVC never gets added to the list of attached volumes, and that's why we end up hitting this error:
E0324 22:56:00.476853 2430927 operation_generator.go:856] UnmountVolume.MarkVolumeMountAsUncertain failed for volume "" (UniqueName: "pvc-1787a6c7-ccd3-44a1-b6ad-474da06c5e2b") pod "bf5d71e6-da6a-448a-bab2-f35b2a938d45" (UID: "bf5d71e6-da6a-448a-bab2-f35b2a938d45") : no volume with the name "pvc-1787a6c7-ccd3-44a1-b6ad-474da06c5e2b" exists in the list of attached volumes
  2. CSI inline volumes hit a different error (in addition to the first problem).
I0323 22:28:34.117352   76056 reconciler.go:388] "Could not construct volume information, cleaning up mounts" podName=55703bf4-9ee4-4fd5-8f4c-1e20f82391e4 volumeSpecName="my-csi-volume" err="failed to GetVolumeName from volumePlugin for volumeSpec \"my-csi-volume\" err=kubernetes.io/csi: plugin.GetVolumeName failed to extract volume source from spec: unexpected api.CSIVolumeSource found in volume.Spec"

The call path looks like this (starting in reconcile.go):

reconstructVolume > GetUniqueVolumeNameFromSpec > GetVolumeName > getPVSourceFromSpec

Inline volumes do not have a PV spec, so it fails here. We should be calling GetUniqueVolumeNameFromSpecWithPod() instead.
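
A rough sketch of the difference between the two helpers, with purely illustrative name formats (the real functions live in pkg/volume/util and may format names differently): a PV-backed volume gets its unique name from the plugin name plus whatever the plugin extracts from the PV spec, while an inline (ephemeral) volume has no PV, so the pod UID plus the in-pod volume name has to be used instead.

package main

import "fmt"

// uniqueNameFromPV: illustrative format for a PV-backed volume, derived from
// the plugin name and the name the plugin extracts from the PV spec.
func uniqueNameFromPV(pluginName, volumeName string) string {
	return fmt.Sprintf("%s/%s", pluginName, volumeName)
}

// uniqueNameForInline: illustrative format for an inline volume, derived from
// the pod UID and the volume's name inside the pod spec.
func uniqueNameForInline(pluginName, podUID, specName string) string {
	return fmt.Sprintf("%s/%s-%s", pluginName, podUID, specName)
}

func main() {
	fmt.Println(uniqueNameFromPV("kubernetes.io/csi", "pvc-1787a6c7-ccd3-44a1-b6ad-474da06c5e2b"))
	fmt.Println(uniqueNameForInline("kubernetes.io/csi", "55703bf4-9ee4-4fd5-8f4c-1e20f82391e4", "my-csi-volume"))
}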

@mkimuram
Contributor

  1. The IsLikelyNotMountPoint() call that verult already called out, which is known not to work for bind mounts. The mountpoint never gets found, so volume reconstruction fails with "... is not mounted", the PVC never gets added to the list of attached volumes, and that's why we end up hitting this error:

There is ongoing work that changes IsNotMountPoint to use the openat2(2) syscall to detect mount points via MountedFast. With openat2(2), bind mounts are detected properly and quickly, but it requires kernel v5.6 or later. So we would also be able to use openat2(2) in IsLikelyNotMountPoint. Then the bind-mount issue can be resolved at least for kernel v5.6+, and we can focus on how to solve this issue for old kernels.
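
A minimal sketch of that approach as I understand it, assuming golang.org/x/sys/unix on a Linux 5.6+ kernel (this is not the actual mount-utils implementation): resolving the last path component from its parent with RESOLVE_NO_XDEV fails with EXDEV exactly when the component crosses a mount boundary.

package main

import (
	"errors"
	"fmt"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// isMountPointFast returns true if path is a mount point, detected via
// openat2(2) with RESOLVE_NO_XDEV (requires Linux >= 5.6).
func isMountPointFast(path string) (bool, error) {
	parentFD, err := unix.Open(filepath.Dir(path), unix.O_PATH|unix.O_CLOEXEC, 0)
	if err != nil {
		return false, err
	}
	defer unix.Close(parentFD)

	how := unix.OpenHow{
		Flags:   unix.O_PATH | unix.O_CLOEXEC,
		Resolve: unix.RESOLVE_NO_XDEV, // fail if resolution crosses a mount boundary
	}
	fd, err := unix.Openat2(parentFD, filepath.Base(path), &how)
	if err != nil {
		if errors.Is(err, unix.EXDEV) {
			return true, nil // crossed into another mount: path is a mount point
		}
		return false, err
	}
	unix.Close(fd)
	return false, nil
}

func main() {
	mounted, err := isMountPointFast("/var/lib/kubelet/pods/<pod UID>/volumes/kubernetes.io~csi/<volume name>/mount")
	fmt.Println(mounted, err)
}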

dobsonj added a commit to dobsonj/kubernetes that referenced this issue Jun 1, 2022
muyangren2 pushed a commit to muyangren2/kubernetes that referenced this issue Jul 14, 2022
dobsonj added a commit to dobsonj/kubernetes that referenced this issue Oct 25, 2022
dobsonj added a commit to dobsonj/kubernetes that referenced this issue Oct 25, 2022