orthogonal requisite and preferred in TopologyRequirement #553

huww98 opened this issue Oct 1, 2023 · 0 comments

Is your feature request related to a problem?/Why is this needed

This is inspired by the discussion in kubernetes-csi/external-provisioner#221. To summarize the issue discussed there: if we want to comply with the CSI spec by telling the SP that the node selected by the scheduler must be able to access the volume (i.e. by specifying --strict-topology), we can pass only one item in requisite. And since preferred must be a subset of requisite, it is not useful here. As a result, we cannot pass any more context to the SP (the AllowedTopologies of the storage class, the zones in which nodes are deployed, etc.)

On the other hand, if we don't specify --strict-topology, Kubernetes assumes the first topology in preferred can access the volume, which is not guaranteed by the CSI spec.

Currently, external-provisioner also does not make good use of these two fields. From this table, we can see that requisite is always a randomly ordered version of preferred, which adds no information at all.

The original "one of" semantics of requisite is confusing by itself. One may naturally think an empty "one of" list means always false (e.g. python3 -c 'print(any([]))' prints False), but an empty list is defined as "no requirements" (always true) in CSI.
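The contrast between the two readings can be shown directly; this is a minimal sketch, and the function names are hypothetical:

```python
def naive_one_of(requisite, topology):
    # Naive reading of "one of": an empty list is never satisfied,
    # matching Python's any([]) == False.
    return any(topology == r for r in requisite)


def csi_one_of(requisite, topology):
    # CSI's actual definition: an empty requisite list means
    # "no requirements", so every topology is acceptable.
    if not requisite:
        return True
    return any(topology == r for r in requisite)


zone = {"zone": "Z2"}
print(naive_one_of([], zone))  # False
print(csi_one_of([], zone))    # True
```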

Describe the solution you'd like in detail

Change the semantic of the existing TopologyRequirement message:

  • The volume should be accessible from ALL of the requisite topologies, instead of the original "one of".
  • The preferred topologies are just hints to the SP (e.g. for placing replicas), and are not necessarily a subset of requisite.

This should be easy to understand, and it greatly simplifies the original x vs n conditions. requisite and preferred are orthogonal.

This would be a breaking change, so introduce a new STRICT_VOLUME_TOPOLOGY_REQUISITE controller capability to opt into the new semantics.
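Under the proposed semantics, the SP's CreateVolume check reduces to a simple superset test. A sketch, assuming topologies are represented as plain dicts and `accessible` is the set of topologies the SP can actually serve; the function name is illustrative:

```python
def check_requisite(requisite, accessible):
    # New semantics: the volume MUST be accessible from ALL
    # requisite topologies; otherwise CreateVolume MUST fail.
    missing = [t for t in requisite if t not in accessible]
    if missing:
        raise RuntimeError(
            f"cannot provision volume accessible from {missing}")


accessible = [{"zone": "Z2"}, {"zone": "Z3"}]
check_requisite([{"zone": "Z2"}], accessible)    # ok
# check_requisite([{"zone": "Z5"}], accessible)  # would raise
```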

On the Kubernetes side, to support the new semantics, we should set:

  • requisite
    • If WaitForFirstConsumer, the topology of the scheduler-selected node
    • Else, empty
  • preferred
    • If available, allowed topologies from storage class
    • Else if enabled by a new flag (say --aggregated-topology), the aggregated cluster topology, with the scheduler-selected node first.
    • Else, empty
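The rules above can be sketched as a small function on the CO side; the function name, its parameters, and the --aggregated-topology behavior are this proposal's, not existing external-provisioner code:

```python
def build_topology_requirement(wait_for_first_consumer,
                               selected_node_topology,
                               allowed_topologies,
                               aggregated_topology,
                               cluster_topologies):
    # requisite: only the scheduler-selected node's topology (WFFC);
    # otherwise empty, i.e. no hard requirement.
    requisite = [selected_node_topology] if wait_for_first_consumer else []

    # preferred: allowed topologies from the storage class if set;
    # else the aggregated cluster topology (selected node first) when
    # the proposed --aggregated-topology flag is enabled; else empty.
    if allowed_topologies:
        preferred = allowed_topologies
    elif aggregated_topology:
        # stable sort puts the scheduler-selected node first
        preferred = sorted(cluster_topologies,
                           key=lambda t: t != selected_node_topology)
    else:
        preferred = []
    return requisite, preferred
```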

The behavior is also clearer and simpler. Again, requisite and preferred are orthogonal.
For reference, the requisite currently generated by Kubernetes is the allowed topologies from the storage class or the aggregated cluster topology, with these special cases:

  • if WaitForFirstConsumer and --strict-topology is specified: the topology of the scheduler-selected node
  • if Immediate binding, --immediate-topology=false is specified, and allowed topologies from the storage class are not available: empty

I would expect minimal code changes for existing CSI drivers that work with Kubernetes:

  • For those using --strict-topology=true and --immediate-topology=false: they should now use --aggregated-topology=false. requisite is unchanged, and preferred carries more information about allowed topologies, which should do no harm.
  • For those using --strict-topology=false and --immediate-topology=true: they should now look at requisite for the hard requirement. But if they continue to assume the first item in preferred is a hard requirement, they will also continue to work with Kubernetes.
  • I don't think the other combinations are useful.

Describe alternatives you've considered

Add a requirement that the SP MUST ensure the volume is accessible from the first item of preferred.
This makes the already complex requirement even more complex. It seems breaking for SPs, but Kubernetes already expects SPs to implement this, or else it will fail to schedule the pod. So this way is breaking for the spec, but not for implementations.

Also, consider what happens if a distributed workload needs to access the same volume from more than one topology: the "first item of preferred" rule cannot express that.

Additional context

I'm not sure about the impact on the Mesos implementation.

Here is my draft of the new TopologyRequirement

// Current description applies if the SP has
// STRICT_VOLUME_TOPOLOGY_REQUISITE capability. See version 1.9.0 for
// the description if such capability is not present.
message TopologyRequirement {
  // Specifies the list of topologies the provisioned volume MUST be
  // accessible from.
  // This field is OPTIONAL. If TopologyRequirement is specified either
  // requisite or preferred or both MUST be specified.
  //
  // The SP MUST make the provisioned volume available to
  // all topologies from the list of requisite topologies. If it is
  // unable to do so, the SP MUST fail the CreateVolume call.
  //
  // The volume MAY be accessible from additional topologies. If it
  // is, the SP SHOULD prefer the topologies in preferred list.
  //
  // For example, if a volume should be accessible from two zones,
  // and requisite =
  //   {"region": "R1", "zone": "Z2"}
  // then the provisioned volume MUST be accessible from the "region"
  // "R1" and the "zone" "Z2". The SP MAY select the second zone
  // independently, e.g. "R1/Z4".
  repeated Topology requisite = 1;

  // Specifies the list of topologies the CO would prefer the volume to
  // be accessible from (in order of preference).
  //
  // This field is OPTIONAL. If TopologyRequirement is specified either
  // requisite or preferred or both MUST be specified.
  //
  // An SP MUST attempt to make the provisioned volume available using
  // the preferred topologies in order from first to last.
  //
  // If requisite is specified, the topologies in the preferred list
  // MAY also be present in the list of requisite topologies. In that
  // case, the SP MAY use this hint to determine where the primary
  // replica is placed.
  //
  // If topologies in the preferred list are not present in the list
  // of requisite topologies, the SP MAY use them as hints about
  // future access patterns, and MAY place additional replicas in
  // those topologies. The SP MAY use an opaque parameter in
  // CreateVolumeRequest to determine the number of replicas.
  //
  // Example:
  // requisite =
  //   {"zone": "Z2"},
  //   {"zone": "Z3"}
  // preferred =
  //   {"zone": "Z3"}
  //   {"zone": "Z2"}
  //   {"zone": "Z4"}
  // then the SP MUST make the provisioned volume accessible from
  // "zone" "Z3" and "Z2". The SP MAY place the primary replica in
  // "zone" "Z3". The SP MAY place additional replicas in "zone" "Z4".
  repeated Topology preferred = 2;
}
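A sketch of how an SP might act on the Z2/Z3/Z4 example at the end of the draft; the `pick_placements` name and the replica-count handling are purely illustrative:

```python
def pick_placements(requisite, preferred, replica_count):
    # Every requisite topology must be covered (the new "ALL of"
    # semantics); preferred order decides which hosts the primary.
    primary = next((t for t in preferred if t in requisite),
                   requisite[0] if requisite else preferred[0])
    placements = [primary] + [t for t in requisite if t != primary]
    # Remaining capacity may go to preferred-only topologies
    # as additional replicas.
    extras = [t for t in preferred if t not in placements]
    return (placements + extras)[:max(replica_count, len(placements))]


requisite = [{"zone": "Z2"}, {"zone": "Z3"}]
preferred = [{"zone": "Z3"}, {"zone": "Z2"}, {"zone": "Z4"}]
print(pick_placements(requisite, preferred, 3))
# [{'zone': 'Z3'}, {'zone': 'Z2'}, {'zone': 'Z4'}]
```

With the draft's example, the primary lands in Z3 (first preferred topology that is also requisite), Z2 is covered because it is requisite, and Z4 receives an extra replica only because the replica count allows it.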

The rationale for introducing --aggregated-topology to replace the original --strict-topology and --immediate-topology:

Both --strict-topology and --immediate-topology were introduced to resolve a similar issue, avoiding a long list of requisite topologies: one for WaitForFirstConsumer and one for Immediate binding. Under this proposal, the preferred list is independent of the binding timing, so there is no reason to configure it based on binding timing.

P.S. I'm waiting for #552 to be approved to be able to run make on my MacBook, so I can open a PR for this proposal.
