Support custom TopologyKey in Affinity Assistant #3731
/kind feature
My experience of this is that the pods that did not get placed on the node will run once the first task pods have completed, so they all eventually complete.
The point of the AA was to provide "Node Affinity", since "Zone Affinity" does not solve the problem. But the Affinity Assistant can be disabled.
This is a bit unfortunate. My initial thought was that we could provide a possibility to add a custom PodSpec for the Affinity Assistant, but this has not been implemented. Now, the Affinity Assistant has some shortcomings, e.g. the one you describe, and its future is in question (see, for example, Design doc: Task parallelism when using workspace), so my opinion is that we should not add new API surface that is specific to the Affinity Assistant, but rather find a different solution to the problem, e.g. the one described in tektoncd/community#318
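For reference, disabling the Affinity Assistant is done through Tekton's `feature-flags` ConfigMap. A minimal sketch, assuming the default `tekton-pipelines` installation namespace (verify the flag against your installed release):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines   # assumes the default installation namespace
data:
  # "true" turns the Affinity Assistant off, so TaskRun pods are scheduled
  # purely by the regular Kubernetes scheduler and PVC topology constraints.
  disable-affinity-assistant: "true"
```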
That would generally be the case, but there are a few situations that result in a deadlock:
#2630 mentions issues with pods spread among different zones, but within an availability zone there shouldn't be any PVC visibility problem. We are currently running an almost exact copy of Tekton's Affinity Assistant with AZ affinity and everything works as intended.
Though I agree that including the full PipelineRun inside a single pod aligns better with Kubernetes scheduling and is cleaner than an Assistant, it is also true that using an Assistant with an AZ TopologyKey solves all those problems. In addition, it might be harder to schedule very large Pipelines with that proposed solution.
Yes, #2630 addresses two problems in one: both "Task parallelism" and "Regional Clusters". It was not described as two different problems in that issue, I apologize for that. This is more clearly described in #3563. The property that solves "Task parallelism" (this is specific to PVCs with RWO access mode) is the "Node Affinity" property; "Zone Affinity" does not solve that. The "Node Affinity" also makes sequential Pipelines faster, since the PVC is only mounted once - this was a positive, unintended side-effect. The property that solves "Regional clusters" (this is specific to PVCs with a non-regional storageClass) is the limitation to use at most one PVC per Task. So effectively, I don't think the AA is needed for this as long as that limitation is followed. I think I have stated differently before, but it was not clear when #2630 was written; it is more clearly described in #3563. You may have a situation where you only have the latter problem, depending on what PVCs you are using.
An AZ TopologyKey does not solve the "Task parallelism" problem for PVCs with RWO access modes. But, yes, you are right, it might help with "Regional Clusters" if you use PVCs with RWX (ReadWriteMany) access modes but only a zonal storageClass - good point.
This is also a problem with the Affinity Assistant solution, as your issue shows, especially regarding, as you said:
The solution in tektoncd/community#318 solves both of those problems, but as you say, large Pipelines are not without scheduling problems with that solution either. In general, using dedicated nodes for pipeline workload should help with this problem both when using the single-pod solution and when using the affinity-assistant solution. But yes, there might be a gap when using PVCs with RWX access mode and regional clusters - using only one AZ via nodeSelectors or tolerations is probably the best workaround currently for this setup?
True, AZ affinity doesn't solve Task parallelism, but it's marginally better than the current single-node affinity: it solves the scheduling issue if you don't need parallelism. Using AZ nodeSelectors, correct me if I'm wrong, would entail setting the AZ in advance. With the Assistant we can just let it get scheduled anywhere and all other pods will follow, so it has better AZ failure tolerance.
Yes, this is correct. The best current workaround for that is to set nodeSelectors on the PipelineRun - using a custom mutating webhook to, perhaps randomly (or intelligently?), set the nodeSelector for the PipelineRun. I have not tested this setup, but it might be a way to address the problem you describe?
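As a rough illustration of that workaround, a PipelineRun can pin all of its pods to one zone via the pod template. A minimal sketch, assuming the well-known `topology.kubernetes.io/zone` node label; the names and zone value are hypothetical:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-and-test-run        # hypothetical name
spec:
  pipelineRef:
    name: build-and-test          # hypothetical Pipeline
  podTemplate:
    nodeSelector:
      # Pin every TaskRun pod (and, with WaitForFirstConsumer binding, the PVC)
      # to a single availability zone. A mutating webhook could inject this
      # selector with a chosen zone instead of hardcoding it here.
      topology.kubernetes.io/zone: us-east-1a   # hypothetical zone
  workspaces:
    - name: shared-workspace       # hypothetical workspace name
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
```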
Now, I don't understand what problems you are solving with a custom affinity assistant for AZ affinity. Isn't this problem solved without any Affinity Assistant, just by following the limit of using at most one PVC per TaskRun (Pod)? My experience is that this alone should solve the problem.
We'd need to somehow retrieve the available AZs in the current cluster, which is hackier than the Assistant solution: with the Assistant we don't need to know or care about the available AZs. So I think we'll stick to our custom Assistant for now. We just wondered if it made sense to put together a PR to support that in Tekton itself instead of duplicating an Assistant that already exists, but as you pointed out, it doesn't solve every problem. Thanks!
We have a PipelineRun which starts with a git-clone task whose output is later used by other tasks. We need to make sure all task pods get scheduled in the same AZ to keep the workspace PVC visible to all of them. With the default Assistant we randomly hit the scheduler deadlock that I mentioned; with AZ affinity that problem is solved. Would we be able to share a workspace between all tasks with your proposed solution?
I can see that problem with the Affinity Assistant. But what problems do you get with this kind of Pipeline with the Affinity Assistant disabled? My understanding is that it should work equally well as your custom Affinity Assistant. But I am interested to hear what issues you run into, if not.
Sorry, I just edited my previous answer. I was wondering how using one PVC per TaskRun would allow us to share the workspace between tasks. The Affinity Assistant allows us to set a no-op initial pod for the others to follow; otherwise no pod gets scheduled. And we need affinity to make all pods share the same AZ and keep the PVC visible. I'm not sure I understand your proposed solution - would that solve this problem?
Yes, letting multiple Tasks share data via a PVC Workspace did work before the Affinity Assistant was introduced in #2630. So I am interested to hear if you run into issues with your Pipeline when disabling the Affinity Assistant. What does not work well is having two parallel Tasks concurrently access the Workspace, but that does not work with only AZ affinity either. There might be different behavior depending on the volumeBindingMode of your storageClass, but I don't think that should affect your use case; it is a behavior that is a bit difficult to grasp.
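For context on that volumeBindingMode remark: the commonly recommended setting for topology-constrained storage is `WaitForFirstConsumer`, which delays PVC binding until the first pod using it is scheduled, so the volume is provisioned in that pod's zone. A minimal sketch, assuming the AWS EBS CSI provisioner mentioned later in this thread (the class name is hypothetical):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer      # hypothetical name
provisioner: ebs.csi.aws.com       # assumes the AWS EBS CSI driver
# The PVC is not bound (and the EBS volume not created) until the first pod
# that uses it is scheduled; the volume then lands in that pod's AZ, and
# later pods using the PVC are scheduled into the same AZ.
volumeBindingMode: WaitForFirstConsumer
```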
If we remove the Assistant, won't the task pods get scheduled anywhere, maybe into different zones? Does the PipelineRun somehow ensure that the workspace remains visible to all of them?
A Pod using a PVC is usually scheduled to where the PVC is - this is, for example, how it works when the PVC is on a local disk - and this is also how Tekton worked before #2630 (or works when the Affinity Assistant is disabled). But storage is a complicated field, so I cannot promise anything; it sounds like it should work that way for you too, but it might depend on how your storage system works. I am interested to hear if you get issues without the Affinity Assistant. The case I know does not work is when a Task uses two PVCs and those two PVCs have been used by two different previous Tasks that might have run in two different AZs. This was the case in #2546.
We will try, but my understanding is that the workspace PVC is created at the PipelineRun level. In our case, our storage backend is AWS EBS, so the PVC is only visible within its AZ. If the different task pods are scheduled in different AZs, I fail to see how that workspace PVC can be shared between task pods. Maybe I'm missing something. EDIT: OK, I get it now - the PVC is attached to the pod at creation time, so the pod will automatically be scheduled in the same AZ as the PVC. We totally missed that point, sorry about that. It should work perfectly without the Assistant in single-workspace-PVC scenarios.
Closing this up, @jlpettersson thanks a lot for your help!
If I understand tektoncd/pipeline#3731 correctly, we don't need it, since our tasks run sequentially. Since we use two sub-path mounts, we get two assistants, which could be scheduled on different nodes. In this situation our TaskRun pod can never start: `Multi-Attach error for volume "pvc-aac89874" Volume is already used by pod(s) affinity-assistant-38e-0`
We get one affinity assistant per "use" of the workspace (not per PVC, not per Task). We are mounting 2 subPaths of the same PVC in the same Task, and that's why we get 2 affinity assistants. This is a problem when the provider allows you to use a PVC from 2 different nodes but not in parallel. That's because your 2 affinity assistants could land on 2 different nodes and they would both try to mount the PVC, which is not allowed. So, depending on the storage class attributes, this could fail.

Notes:
* tektoncd/pipeline#3731
* our tasks run sequentially
* previously resulted in `Multi-Attach error for volume "pvc-aac89874" Volume is already used by pod(s) affinity-assistant-38e-0`
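As an illustration of the "one assistant per use of the workspace" behavior described above, here is a hedged sketch of a PipelineRun that binds the same PVC twice with different subPaths (all names and paths are hypothetical); each binding gets its own affinity assistant, so they can end up on different nodes:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: two-subpath-run            # hypothetical name
spec:
  pipelineRef:
    name: build-pipeline           # hypothetical Pipeline
  workspaces:
    # Two bindings of the same underlying PVC -> two affinity assistants.
    - name: sources                # hypothetical workspace name
      subPath: sources
      persistentVolumeClaim:
        claimName: shared-pvc      # hypothetical existing PVC
    - name: cache                  # hypothetical workspace name
      subPath: cache
      persistentVolumeClaim:
        claimName: shared-pvc      # same PVC, second "use"
```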
Feature Request
Currently the Affinity Assistant hardcodes `kubernetes.io/hostname` as the affinity TopologyKey (pipeline/pkg/pod/pod.go, line 335 at 08177fa).
This setting results in a "scheduler deadlock" when some of the Task's pods fit on one node but not all of them. To solve this, we'd like to use the availability zone as the topology key instead, but because the setting is hardcoded we have to implement our own Affinity Assistant. Is there a reason not to support a custom TopologyKey?
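For concreteness, the kind of affinity term this request is about would look roughly like the following on the TaskRun pods. This is a hedged sketch, not the actual spec Tekton generates, and the label selector shown is hypothetical:

```yaml
# Pod affinity toward the affinity-assistant pod, scoped to a zone instead of
# a single node. With kubernetes.io/hostname (the current hardcoded value) all
# pods must fit on one node; with a zone topology key they only need to land
# in the same availability zone.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: affinity-assistant   # hypothetical label
        topologyKey: topology.kubernetes.io/zone              # instead of kubernetes.io/hostname
```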