Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSINode creation fails because spec.drivers[*].nodeID can exceed the maximum allowed length of 128 characters #581

Closed
vpnachev opened this issue Aug 14, 2020 · 24 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@vpnachev
Copy link

vpnachev commented Aug 14, 2020

For GCP projects and nodes with long names the nodeID can become longer than 128 characters. If this happens, the CSINode object cannot be created because of this validation https://github.com/kubernetes/kubernetes/blob/d822b8b230bce0178400e8949ff3f45423d315d0/pkg/apis/storage/validation/validation.go#L345-L347 and the result is that no volume managed by the CSI driver can be attached and used on this node.

In my honest opinion, the problem is the used nodeID format

nodeIDFmt = "projects/%s/zones/%s/instances/%s"
which contains too much information which might not be needed.

According to the spec

// The identifier of the node as understood by the SP.
// This field is REQUIRED.
// This field MUST contain enough information to uniquely identify
// this specific node vs all other nodes supported by this plugin.
// This field SHALL be used by the CO in subsequent calls, including
// ControllerPublishVolume, to refer to this node.
// The SP is NOT responsible for global uniqueness of node_id across
// multiple SPs.

My understanding of the spec is that the CSI driver should not care for the global uniqueness of the Node, so only the Node name should be enough, isn't it? Or at least the placeholders projects, zones, and instances can be removed from the nodeID which will free up 25 characters.

I have quickly checked the names length limit:

  1. The maximum allowed project ID length is 30 (ref: https://cloud.google.com/resource-manager/docs/creating-managing-projects).
  2. The instance name should be no more than 63 characters (ref: https://cloud.google.com/compute/docs/naming-resources).
  3. The zone name is managed by Google, but currently the longest one is northamerica-northeast1-a with 25 characters.

With two / separators the total achieved length is 120 characters which is less than the allowed 128.

Here is a sample logs from the csi-node-driver-registrar from a k8s cluster on GCP (managed by Gardener).

I0505 13:45:10.412325       1 main.go:110] Version: v1.3.0-0-g6e9fff3e
I0505 13:45:10.412405       1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0505 13:45:10.412427       1 connection.go:151] Connecting to unix:///csi/csi.sock
I0505 13:45:10.412865       1 main.go:127] Calling CSI driver to discover driver name
I0505 13:45:10.412881       1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0505 13:45:10.412886       1 connection.go:181] GRPC request: {}
I0505 13:45:10.414462       1 connection.go:183] GRPC response: {"name":"pd.csi.storage.gke.io","vendor_version":"v0.7.0-gke.0"}
I0505 13:45:10.414820       1 connection.go:184] GRPC error: <nil>
I0505 13:45:10.414827       1 main.go:137] CSI driver name: "pd.csi.storage.gke.io"
I0505 13:45:10.414915       1 node_register.go:51] Starting Registration Server at: /registration/pd.csi.storage.gke.io-reg.sock
I0505 13:45:10.415103       1 node_register.go:60] Registration Server started at: /registration/pd.csi.storage.gke.io-reg.sock
I0505 13:45:10.870866       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0505 13:45:11.870960       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0505 13:45:13.585060       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: error updating CSINode object with CSI driver node info: error updating CSINode: timed out waiting for the condition; caused by: CSINode.storage.k8s.io "shoot--12345678--123-56789ab-cpu-worker-z1-7c4f48599f-q6vbk" is invalid: spec.drivers[0].nodeID: Invalid value: "projects/012-34-56789abcdefghij-klmnopq/zones/us-central1-a/instances/shoot--12345678--123-56789ab-cpu-worker-z1-7c4f48599f-q6vbk": must be 128 characters or less,}
E0505 13:45:13.585115       1 main.go:89] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: error updating CSINode object with CSI driver node info: error projects/012-34-56789abcdefghij-klmnopq/zones/us-central1-a/instances/shoot--12345678--123-56789ab-cpu-worker-z1-7c4f48599f-q6vbk": must be 128 characters or less, restarting registration container.

/kind bug

@vpnachev
Copy link
Author

/sig storage
/priority important-soon

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 14, 2020
@k8s-ci-robot
Copy link
Contributor

@vpnachev: The label(s) priority/ cannot be applied, because the repository doesn't have them

In response to this:

/sig storage
/priority important-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jiawei0227
Copy link
Contributor

Thanks for submitting this issue. The reason for 128 length limitation is because the CSI spec. We have decided to update the CSI spec to loose the length constraint on certain fields. And then an update to k8s will resolve this question.

@vpnachev
Copy link
Author

While I was investigating I was just about to open the issue on k/k side to change the limit, but then I saw that the CSI spec imply the 128 bytes limit.

For me, changing the spec is also fine, if it is not too much effort.

@simon-anz
Copy link

We ran into this also.

Registration process failed with error: plugin registration failed with err: error updating CSINode object with CSI driver node info: error updating CSINode: timed out waiting for the condition; caused by: CSINode.storage.k8s.io “removed” is invalid: spec.drivers[0].nodeID: Invalid value: “projects/our-project-30-char/zones/australia-southeast1-a/instances/our-instance-52-char”: must be 128 characters or less, restarting registration container

@msau42
Copy link
Contributor

msau42 commented Sep 3, 2020

Fixes in GKE are in the process of rolling out, first in rapid channel.

We will separately work with the CSI community to change the limit.

@Jiawei0227
Copy link
Contributor

This is available in GKE rapid and regular channel already.

@vpnachev
Copy link
Author

vpnachev commented Oct 7, 2020

How can other k8s distributions consume the fix?
Could you share more details about it?

@Jiawei0227
Copy link
Contributor

Jiawei0227 commented Oct 7, 2020

You can manually update pkg/apis/storage/validation/validation.go csiNodeIDMaxLength to a larger size to workaround the issue.

The plan for fixing it in OSS is:

  1. Update CSI spec to say the csi node max length can be longer than 128. Probably we will set to 256
  2. Update OSS Kubernetes to use the new version spec.

@ialidzhikov
Copy link
Contributor

You can manually update pkg/apis/storage/validation/validation.go csiNodeIDMaxLength to a larger size to workaround the issue.

This is not an option for parties which use upstream images and do not maintain fork of k/k.

The plan for fixing it in OSS is:

  1. Update CSI spec to say the csi node max length can be longer than 128. Probably we will set to 256
  2. Update OSS Kubernetes to use the new version spec.

@msau42 , are these activities part of the milestone for v1.20? Is the above mentioned change to k/k enough or the CSI sidecars need to be updated/adapted as well?

@msau42
Copy link
Contributor

msau42 commented Oct 7, 2020

Agree we should definitely discuss this in the next csi meeting @saad-ali. No changes to sidecars are anticipated.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2021
@ialidzhikov
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2021
@vpnachev
Copy link
Author

vpnachev commented Jan 13, 2021

@msau42 @Jiawei0227 is there any progress on this?
Could you refer an issue or pull request about the CSI spec change?

PS. I just saw container-storage-interface/spec#464, excuse me for the troubles.

@Jiawei0227
Copy link
Contributor

Hey @vpnachev , this is the PR. It is still waiting for some reply. I am expecting to get it submitted after next CSI community meeting which is next week.
container-storage-interface/spec#464

@apit
Copy link

apit commented Feb 6, 2021

We're using csi-powerscale on Tanzu. the driver tried to register 199 chars-long NodeID, with workers node FQDN is 139 chars of it. even with the new csiNodeIDMaxLength of 192, we're out of luck 😅

@Jiawei0227
Copy link
Contributor

Hmm, if its FQDN is 139, could you help to check why it is so long? Kubernetes has an object length limit of 253 https://www.google.com/search?q=kubernetes+object+name+length+limit&oq=kubernetes+object+lengt&aqs=chrome.2.69i57j0i22i30l2.5623j1j7&sourceid=chrome&ie=UTF-8

With the CSINode length 192 and some potential prefix/suffix we may or may not add to it. We think 192 is a safe length unless there are some special case that you encounter.

We had some discussion on the change or length limits in the CSI community meeting:
https://docs.google.com/document/d/1-oiNg5V_GtS_JBAEViVBhZ3BYVFlbSz70hreyaD7c5Y/edit#heading=h.pqkeeybx3la7

You are also welcome to join the CSI community meeting to bring up any concern you have. Thanks!

@apit
Copy link

apit commented Feb 7, 2021

This is from the registrar container of isilon pod in our Vmware TKGI (aka Enterprise Pivotal K8s Service):

Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: error updating CSINode object with CSI driver node info: error updating CSINode: timed out waiting for the condition; caused by: CSINode.storage.k8s.io "791b277c-3834-450e-9d89-2368f3d8dbb9" is invalid: spec.drivers[0].nodeID: Invalid value: "791b277c-3834-450e-9d89-2368f3d8dbb9=#=#=2ad4c584-1353-47cb-86c3-69cb730ee82a.worker.pks-24cfafd4-c644-4a18-9941-aba9dc7c743e.service-instance-24cfafd4-c644-4a18-9941-aba9dc7c743e.bosh=#=#=172.28.0.4": must be 128 characters or less, restarting registration container.

The second part of that nodeID is what hostname -A from inside the worker. I can tell it is unique across clusters. Btw I'm very much still learning around Kubernetes and TKGI so forgive my ignorance. Perhaps there's a checkbox or template somewhere to make it shorter in VSphere/BOSH/else.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2021
@Jiawei0227
Copy link
Contributor

/remove-lifecycle stale
We updated the CSI spec to length limit 256. Will update on the k8s side once csi spec 1.5 is cut

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2021
@ialidzhikov
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2021
@msau42
Copy link
Contributor

msau42 commented Aug 9, 2021

Kubernetes 1.22 increases the limit to 192 characters and Kubernetes 1.23 will allow 256 characters:
kubernetes/kubernetes#98753
kubernetes/kubernetes#101256

Closing the issue here as there are no changes required on the driver side.
/close

@k8s-ci-robot
Copy link
Contributor

@msau42: Closing this issue.

In response to this:

Kubernetes 1.22 increases the limit to 192 characters and Kubernetes 1.23 will allow 256 characters:
kubernetes/kubernetes#98753
kubernetes/kubernetes#101256

Closing the issue here as there are no changes required on the driver side.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Projects
None yet
Development

No branches or pull requests

9 participants