Is your feature request related to a problem?/Why is this needed
This is inspired by the discussion in kubernetes-csi/external-provisioner#221 . To summarize the issue discussed there: if we want to comply with the CSI spec by telling the SP that the node selected by the scheduler must be able to access the volume (i.e. specify --strict-topology), we can pass only one item in requisite. And preferred must be a subset of requisite, so it is not useful here. As a result we cannot pass more context to the SP (the AllowedTopologies of the storage class, the zones in which we deployed nodes, etc.)
On the other hand, if we don't specify --strict-topology, Kubernetes assumes the volume is accessible from the first topology in preferred, which is not guaranteed by the CSI spec.
Currently external-provisioner also does not make good use of these two fields. From this table, we can see that requisite is always a randomly ordered version of preferred, which adds no information at all.
The original "one of" semantics of requisite is a bit confusing by itself. One may naturally think an empty "one of" list means always false (e.g. python3 -c 'print(any([]))' prints False). But the empty list is defined as "no requirements" (always true) in CSI.
Describe the solution you'd like in detail
Change the semantics of the existing TopologyRequirement message:
The volume should be accessible from ALL of the requisite topologies, instead of the original "one of".
The preferred topologies are just hints to the SP (e.g. for placing replicas), and not necessarily a subset of requisite.
This should be easy to understand, and it greatly simplifies the original x vs n conditions. requisite and preferred are orthogonal.
This would be a breaking change, so introduce a new STRICT_VOLUME_TOPOLOGY_REQUISITE controller capability to enable the new semantics.
On the Kubernetes side, to support the new semantics, we should set the fields as follows (see the sketch after this list):
requisite
  If WaitForFirstConsumer, the topology of the scheduler-selected node
  Else, empty
preferred
  If available, the allowed topologies from the storage class
  Else, if enabled by a new flag (say --aggregated-topology), the aggregated cluster topology, with the scheduler-selected node's topology first
  Else, empty
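A minimal sketch of how the provisioner side could populate these fields under this proposal, using the existing Go bindings (csi.Topology and csi.TopologyRequirement are real generated types). The function, its inputs (selectedNodeTopology, allowedTopologies, aggregatedClusterTopology) and the moveToFront helper are hypothetical, not real external-provisioner APIs:

```go
package sketch

import (
	"reflect"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// buildTopologyRequirement sketches the proposed behavior: requisite carries the
// hard requirement (the scheduler-selected node for WaitForFirstConsumer, empty
// otherwise), while preferred carries orthogonal hints.
func buildTopologyRequirement(
	waitForFirstConsumer bool,
	selectedNodeTopology *csi.Topology, // topology of the scheduler-selected node, nil for Immediate binding
	allowedTopologies []*csi.Topology, // AllowedTopologies from the StorageClass, may be empty
	aggregatedClusterTopology []*csi.Topology, // topologies of all nodes in the cluster
	aggregatedTopologyEnabled bool, // the proposed --aggregated-topology flag
) *csi.TopologyRequirement {
	req := &csi.TopologyRequirement{}

	// requisite: the volume MUST be accessible from all of these topologies.
	if waitForFirstConsumer && selectedNodeTopology != nil {
		req.Requisite = []*csi.Topology{selectedNodeTopology}
	}

	// preferred: hints only, not necessarily a subset of requisite.
	switch {
	case len(allowedTopologies) > 0:
		req.Preferred = allowedTopologies
	case aggregatedTopologyEnabled:
		req.Preferred = moveToFront(aggregatedClusterTopology, selectedNodeTopology)
	}
	return req
}

// moveToFront returns topologies with `first` (if present) moved to the front,
// so the SP sees the scheduler-selected node's topology as the top hint.
func moveToFront(topologies []*csi.Topology, first *csi.Topology) []*csi.Topology {
	if first == nil {
		return topologies
	}
	out := []*csi.Topology{}
	for _, t := range topologies {
		if reflect.DeepEqual(t.Segments, first.Segments) {
			out = append([]*csi.Topology{t}, out...)
		} else {
			out = append(out, t)
		}
	}
	return out
}
```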
The behavior is also clearer and simpler. Again, requisite and preferred are orthogonal.
For reference, the original requisite generated by Kubernetes is the allowed topologies from the storage class or the aggregated cluster topology, with these special cases:
if WaitForFirstConsumer and --strict-topology is specified, it is the topology of the scheduler-selected node
if Immediate binding and --immediate-topology=false is specified and the allowed topologies from the storage class are not available, it is empty
I would expect minimal code changes to existing CSI drivers that work with Kubernetes:
Drivers using --strict-topology=true and --immediate-topology=false should now use --aggregated-topology=false. The requisite is unchanged, and preferred contains more information about allowed topologies, which should do no harm.
Drivers using --strict-topology=false and --immediate-topology=true should now look at requisite for the hard requirement (see the sketch after this list). But if they continue to treat the first item in preferred as a hard requirement, they should also continue to work with Kubernetes.
I don't think the other flag combinations are useful.
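A minimal sketch of how an SP could enforce the proposed "accessible from ALL of requisite" rule in its CreateVolume handler, again using the existing Go bindings. checkRequisite, containsTopology and the exact-segment-match comparison are hypothetical and driver-specific; the RESOURCE_EXHAUSTED code mirrors the spec's existing error for a volume that cannot be provisioned in the requested topology:

```go
package sketch

import (
	"reflect"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkRequisite verifies the proposed semantics: the volume's accessible
// topologies must cover every entry in requisite. An empty requisite means
// "no requirements" and always passes.
func checkRequisite(req *csi.TopologyRequirement, volumeTopologies []*csi.Topology) error {
	if req == nil || len(req.Requisite) == 0 {
		return nil
	}
	for _, want := range req.Requisite {
		if !containsTopology(volumeTopologies, want) {
			return status.Errorf(codes.ResourceExhausted,
				"volume is not accessible from requisite topology %v", want.Segments)
		}
	}
	return nil
}

// containsTopology reports whether any topology in the list has exactly the
// same segments as want. Real drivers may need a domain-specific match
// (e.g. treating a missing segment as a wildcard).
func containsTopology(topologies []*csi.Topology, want *csi.Topology) bool {
	for _, t := range topologies {
		if reflect.DeepEqual(t.Segments, want.Segments) {
			return true
		}
	}
	return false
}
```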
Describe alternatives you've considered
Add a requirement that the SP MUST ensure the volume is accessible from the first item of preferred.
This makes the already complex requirement even more complex. It seems breaking for SPs, but Kubernetes already expects SPs to implement this behavior, otherwise Kubernetes will fail to schedule the pod. So this alternative breaks the spec, but not the implementations.
And consider: what if a distributed workload needs to access the same volume from more than one topology?
Additional context
I'm not sure about the impact on the Mesos implementation.
Here is my draft of the new TopologyRequirement:
```protobuf
// Current description applies if the SP has
// STRICT_VOLUME_TOPOLOGY_REQUISITE capability. See version 1.9.0 for
// the description if such capability is not present.
message TopologyRequirement {
  // Specifies the list of topologies the provisioned volume MUST be
  // accessible from.
  // This field is OPTIONAL. If TopologyRequirement is specified either
  // requisite or preferred or both MUST be specified.
  //
  // The SP MUST make the provisioned volume available to
  // all topologies from the list of requisite topologies. If it is
  // unable to do so, the SP MUST fail the CreateVolume call.
  //
  // The volume MAY be accessible from additional topologies. If it
  // is, the SP SHOULD prefer the topologies in the preferred list.
  //
  // For example, if a volume should be accessible from a single zone,
  // and requisite =
  //   {"region": "R1", "zone": "Z2"}
  // then the provisioned volume MUST be accessible from the "region"
  // "R1" and the "zone" "Z2". The SP MAY select a second zone
  // independently, e.g. "R1/Z4".
  repeated Topology requisite = 1;

  // Specifies the list of topologies the CO would prefer the volume to
  // be accessible from (in order of preference).
  //
  // This field is OPTIONAL. If TopologyRequirement is specified either
  // requisite or preferred or both MUST be specified.
  //
  // An SP MUST attempt to make the provisioned volume available using
  // the preferred topologies in order from first to last.
  //
  // If requisite is specified, the topologies in the preferred list MAY
  // also be present in the list of requisite topologies. In such a case,
  // the SP MAY use this hint to determine where the primary replica is
  // placed.
  //
  // If the topologies in the preferred list are not present in the list
  // of requisite topologies, the SP MAY use them as hints about future
  // access patterns, and MAY place additional replicas in those
  // topologies. The SP MAY use an opaque parameter in
  // CreateVolumeRequest to determine the number of replicas.
  //
  // Example:
  // requisite =
  //   {"zone": "Z2"},
  //   {"zone": "Z3"}
  // preferred =
  //   {"zone": "Z3"},
  //   {"zone": "Z2"},
  //   {"zone": "Z4"}
  // then the SP MUST make the provisioned volume accessible from
  // "zone" "Z3" and "Z2". The SP MAY place the primary replica in
  // "zone" "Z3". The SP MAY place additional replicas in "zone" "Z4".
  repeated Topology preferred = 2;
}
```
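To make the example in the comments concrete, here is what such a request could look like with the existing Go bindings (csi.CreateVolumeRequest, csi.TopologyRequirement and csi.Topology are real generated types; the volume name is made up and other required fields such as volume capabilities are omitted for brevity):

```go
package sketch

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// exampleRequest mirrors the example in the draft comments. Under the proposed
// semantics the SP MUST make the volume accessible from zones Z2 and Z3, and
// MAY use the preferred ordering (and the extra Z4 entry) as placement hints.
func exampleRequest() *csi.CreateVolumeRequest {
	return &csi.CreateVolumeRequest{
		Name: "example-volume", // hypothetical name; other required fields omitted
		AccessibilityRequirements: &csi.TopologyRequirement{
			Requisite: []*csi.Topology{
				{Segments: map[string]string{"zone": "Z2"}},
				{Segments: map[string]string{"zone": "Z3"}},
			},
			Preferred: []*csi.Topology{
				{Segments: map[string]string{"zone": "Z3"}},
				{Segments: map[string]string{"zone": "Z2"}},
				{Segments: map[string]string{"zone": "Z4"}},
			},
		},
	}
}
```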
The rationale for introducing --aggregated-topology to replace the original --strict-topology and --immediate-topology:
Both --strict-topology and --immediate-topology were introduced to address a similar issue: avoiding a long list of requirements, one flag for WaitForFirstConsumer and one for Immediate binding. Under this proposal, the preferred list is no longer related to the binding timing, so there is no reason to configure it based on binding timing.
P.S. I'm waiting for #552 to be approved to be able to run make on my MacBook, so I can open a PR for this proposal.