Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement (un)stage/(un)publish volume according csi spec for mount mode #2195

Merged
merged 22 commits into from
Oct 16, 2024

Conversation

antonmyagkov
Copy link
Collaborator

@antonmyagkov antonmyagkov commented Oct 1, 2024

Problem:

CSI Driver implementation violates CSI specification in terms of stage/publish/unstage/unpublish volumes.
At this moment StageVolume step is completely ignored and start endpoint/mounting volumes happens at PublishVolume step. As a result CSI Driver doesn't support ReadWriteOnce access mode in the correct way and only one pod on the same node can mount the volume, however it should be allowed to mount the same volume into multiple pods on the same node.

According to CSI Driver specification:

  1. NodeStageVolume should mount the volume to the staging path
  2. NodePublishVolume should mount the volume to the target path
  3. NodeUnpublishVolume should unmount the volume from the target path
  4. NodeUnstageVolume should unmount the volume from the staging path

As we already have current implementation of CSI Driver in production clusters we need to handle migration from existing implementation of mounting volumes(only NodePublishVolume/NodeUnpublishVolume is implemented) to the new implementation.

The tricky part here is using different UnixSocketPath/InstanceId/ClientId for already bounded volumes and "new" volumes.
Current format of UnixSocketPath: socketsDir/podId/volumeId
New format of UnixSocketPath: socketsDir/nodeId/volumeId

Current format of InstanceId: podId
New format of InstanceId: nodeId

Current format of ClientId: clientID-podId
New format of ClientId: clientID-nodeId

Possible scenarios:

  1. Volume was staged and published
  2. CSI Driver was updated
  3. Volume was unpublished and unstaged <- here we should handle unpublish with old unix socket path

  1. Volume was staged and published
  2. CSI Driver was updated
  3. Kubelet restart had happened
  4. CSI Driver received stage and publish for the same volume again <- here we should handle stage/publish with old unix socket path

  1. CSI Driver was updated
  2. Volume was staged and published
  3. endpoint should be start with new unix socket path
  4. Volume was unpublished and unstaged
  5. UnstageVolume should stop endpoint with new unix socket path

  1. CSI Driver was updated
  2. Volume was staged and published on the node1 with RWO access mode
  3. Staging volume on the node2
  4. StageVolume on the node2 should return error

Migration is splitted for differnt modes
VM mode: #1982
Mount mode: #2195
Block mode: #2269

After migration of all volumes to the new endpoints we can remove backward compatibility
with old format of endpoints.

External links/documentation

https://github.com/container-storage-interface/spec/blob/master/spec.md#node-service-rpc

Copy link
Contributor

github-actions bot commented Oct 1, 2024

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 2ad8e33.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
5856 5844 0 8 4 0

@antonmyagkov antonmyagkov force-pushed the users/myagkov/csi-fix-stage-publish-steps branch from 62d8f3b to 2e70895 Compare October 2, 2024 10:41
Copy link
Contributor

github-actions bot commented Oct 2, 2024

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 2e70895.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
5856 5845 0 7 4 0

Copy link
Contributor

github-actions bot commented Oct 2, 2024

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit c3f12f1.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
5856 5848 0 4 4 0

Copy link
Contributor

github-actions bot commented Oct 3, 2024

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit e414b69.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
5856 5848 0 4 4 0

@antonmyagkov antonmyagkov added the blockstore Add this label to run only cloud/blockstore build and tests on PR label Oct 4, 2024
Copy link
Contributor

github-actions bot commented Oct 4, 2024

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 83eece4.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3296 3294 0 2 0 0

Copy link
Contributor

@aikuchin aikuchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Я вижу потенциальную проблему с рестартом кублета при переезде на новую схему: когда он рестартует то восстанавливает картину мира, полагаясь на идемпотентность запросов - просто повторно посылает NodeStage и NodePublish всем вольюмам, которые по его мнению должны быть замонтированы.
Если диск был подключен со старым драйвером Stage попробует стартовать эндпоинт по новому пути и я предполагаю что возникнет конфликт (если NBS в этом месте не останавливает автоматически старый эндпоинт).
В случае ВМ у нас в качесте признака что эндпоинт стартует на Stage является атрибут instanceId, для инфракубера я не вижу как можно отличить старый диск от нового.

@antonmyagkov antonmyagkov force-pushed the users/myagkov/csi-fix-stage-publish-steps branch from 83eece4 to d7dafeb Compare October 7, 2024 10:15
@antonmyagkov
Copy link
Collaborator Author

antonmyagkov commented Oct 7, 2024

Я вижу потенциальную проблему с рестартом кублета при переезде на новую схему: когда он рестартует то восстанавливает картину мира, полагаясь на идемпотентность запросов - просто повторно посылает NodeStage и NodePublish всем вольюмам, которые по его мнению должны быть замонтированы. Если диск был подключен со старым драйвером Stage попробует стартовать эндпоинт по новому пути и я предполагаю что возникнет конфликт (если NBS в этом месте не останавливает автоматически старый эндпоинт). В случае ВМ у нас в качесте признака что эндпоинт стартует на Stage является атрибут instanceId, для инфракубера я не вижу как можно отличить старый диск от нового.

Можно попробовать так:
Делаем StartEndpoint по новому формату эндпоинта. В случае ошибки mount conflict вернуть ОК и уже на publish примонтироваться по старому формату эндпоинта или вернуть ошибку в случае повторного mount conflict.

Copy link
Contributor

github-actions bot commented Oct 7, 2024

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit d7dafeb.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3303 3301 0 2 0 0

@antonmyagkov antonmyagkov force-pushed the users/myagkov/csi-fix-stage-publish-steps branch from d7dafeb to 5496f67 Compare October 8, 2024 17:18
Copy link
Contributor

github-actions bot commented Oct 8, 2024

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 5496f67.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3307 3300 0 3 4 0

Copy link
Contributor

github-actions bot commented Oct 9, 2024

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 9b3dc89.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3308 3301 0 3 4 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit cb4f860.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3307 3306 0 1 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 57f3167.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3315 0 0 0 0

@antonmyagkov antonmyagkov marked this pull request as ready for review October 11, 2024 15:31
@antonmyagkov antonmyagkov changed the title Implement stage/unstage/publish/unpublish volume according csi spec Implement stage/unstage/publish/unpublish volume according csi spec for mount mode Oct 11, 2024
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 1d88142.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3315 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit e924066.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3315 0 0 0 0

@antonmyagkov antonmyagkov changed the title Implement stage/unstage/publish/unpublish volume according csi spec for mount mode Implement (un)stage/(un)publish volume according csi spec for mount mode Oct 13, 2024
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 4f7f5c4.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3315 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 20b6040.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3315 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 93ca36c.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3314 0 1 0 0

@antonmyagkov antonmyagkov force-pushed the users/myagkov/csi-fix-stage-publish-steps branch from 93ca36c to 520bc78 Compare October 15, 2024 10:09
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 520bc78.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3314 0 1 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 520bc78.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3314 0 1 0 0

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 520bc78.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3315 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 9191b39.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3315 3315 0 0 0 0

@antonmyagkov antonmyagkov merged commit 080c2ce into main Oct 16, 2024
11 of 12 checks passed
@antonmyagkov antonmyagkov deleted the users/myagkov/csi-fix-stage-publish-steps branch October 16, 2024 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blockstore Add this label to run only cloud/blockstore build and tests on PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants