Support custom TopologyKey in Affinity Assistant #3731
/kind feature
My experience of this is that the pods that did not get placed on the node will run once the first task pods have completed, so they all eventually complete.
The point of the AA was to provide "Node Affinity", since "Zone Affinity" does not solve the problem. But the Affinity Assistant can be disabled.
This is a bit unfortunate. My initial thought was that we could provide a possibility to add a custom PodSpec for the Affinity Assistant, but this has not been implemented. Now, the Affinity Assistant has some shortcomings, e.g. the one you describe, and its future is in question (see, for example, Design doc: Task parallelism when using workspace), so my opinion is that we should not add new API surface that is specific to the Affinity Assistant, but rather find a different solution to the problem, e.g. the one described in tektoncd/community#318
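For reference, disabling the Affinity Assistant is done through Tekton's `feature-flags` ConfigMap. A minimal sketch, assuming the default `tekton-pipelines` installation namespace (verify the flag against your installed release):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines   # assumes the default installation namespace
data:
  # "true" turns the Affinity Assistant off, so TaskRun pods are scheduled
  # purely by the regular Kubernetes scheduler and PVC topology constraints.
  disable-affinity-assistant: "true"
```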
That would generally be the case, but there are a few situations that result in a deadlock:
#2630 mentions issues with pods spread among different zones, but within an availability zone there shouldn't be any PVC visibility problem. We are currently running an almost exact copy of Tekton's Affinity Assistant with AZ affinity and everything works as intended.
Though I agree that including the full PipelineRun inside a single pod aligns better with Kubernetes scheduling and is cleaner than an Assistant, it is also true that using an Assistant with an AZ TopologyKey solves all those problems. In addition, it might be harder to schedule very large Pipelines with that proposed solution.
Yes, #2630 addresses two problems in one: both "Task parallelism" and "Regional Clusters". It was not described as two different problems in that issue, I apologize for that. This is more clearly described in #3563. The property that solves "Task parallelism" (this is specific to PVCs with RWO access mode) is the "Node Affinity" property; "Zone Affinity" does not solve that. The "Node Affinity" also makes sequential Pipelines faster, since the PVC is only mounted once - this was a positive, unintended side-effect. The property that solves "Regional clusters" (this is specific to PVCs with a non-regional storageClass) is the limitation to use at most one PVC per Task. So effectively, I don't think the AA is needed for this as long as that limitation is followed. I think I have stated differently before, but it was not clear when #2630 was written; it is more clearly described in #3563. You may have a situation where you only have the latter problem, depending on what PVCs you are using.
An AZ TopologyKey does not solve the "Task parallelism" problem for PVCs with RWO access modes. But, yes, you are right, it might help with "Regional Clusters" if you use PVCs with RWX (ReadWriteMany) access modes but only a zonal storageClass - good point.
This is also a problem with the Affinity Assistant solution, as your issue shows, especially regarding, as you said:
The solution in tektoncd/community#318 solves both of those problems, but as you say, large Pipelines are not without scheduling problems with that solution either. In general, using dedicated nodes for pipeline workload should help with this problem both when using the single-pod solution and when using the affinity-assistant solution. But yes, there might be a gap when using PVCs with RWX access mode and regional clusters - using only one AZ via nodeSelectors or tolerations is probably the best workaround currently for this setup?
True, AZ affinity doesn't solve Task parallelism, but it's marginally better than the current single-node affinity: it solves the scheduling issue if you don't need parallelism. Using AZ nodeSelectors, correct me if I'm wrong, would entail setting the AZ in advance. With the Assistant we can just let it get scheduled anywhere and all other pods will follow, so it has better AZ failure tolerance.
Yes, this is correct. The best current workaround for that is to set nodeSelectors on the PipelineRun - using a custom mutating webhook to, perhaps randomly (or intelligently?), set the nodeSelector for the PipelineRun. I have not tested this setup, but it might be a way to address the problem you describe?
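As a rough illustration of that workaround, a PipelineRun can pin all of its pods to one zone via the pod template. A minimal sketch, assuming the well-known `topology.kubernetes.io/zone` node label; the names and zone value are hypothetical:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-and-test-run        # hypothetical name
spec:
  pipelineRef:
    name: build-and-test          # hypothetical Pipeline
  podTemplate:
    nodeSelector:
      # Pin every TaskRun pod (and, with WaitForFirstConsumer binding, the PVC)
      # to a single availability zone. A mutating webhook could inject this
      # selector with a chosen zone instead of hardcoding it here.
      topology.kubernetes.io/zone: us-east-1a   # hypothetical zone
  workspaces:
    - name: shared-workspace       # hypothetical workspace name
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
```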
Now, I don't understand what problems you are solving with a custom affinity assistant for AZ affinity. Isn't this problem solved without any Affinity Assistant, just by following the limit of using at most one PVC per TaskRun (Pod)? My experience is that this alone should solve the problem.
We'd need to somehow retrieve the available AZs in the current cluster, which is hackier than the Assistant solution: with the Assistant we don't need to know or care about the available AZs. So I think we'll stick to our custom Assistant for now. We just wondered if it made sense to put together a PR to support that in Tekton itself instead of duplicating an Assistant that already exists, but as you pointed out, it doesn't solve every problem. Thanks!
We have a PipelineRun which starts with a git-clone task whose output is later used by other tasks. We need to make sure all task pods get scheduled in the same AZ to keep the workspace PVC visible to all of them. With the default Assistant we randomly hit the scheduler deadlock that I mentioned; with AZ affinity that problem is solved. Would we be able to share a workspace between all tasks with your proposed solution?
I can see that problem with the Affinity Assistant. But what problems do you get with this kind of Pipeline with the Affinity Assistant disabled? My understanding is that it should work equally well as your custom Affinity Assistant. But I am interested to hear what issues you run into, if not.
Sorry, I just edited my previous answer. I was wondering how using one PVC per TaskRun would allow us to share the workspace between tasks. The Affinity Assistant allows us to set a no-op initial pod for the others to follow; otherwise no pod gets scheduled. And we need affinity to make all pods share the same AZ and keep the PVC visible. I'm not sure I understand your proposed solution - would that solve this problem?
Yes, letting multiple Tasks share data via a PVC Workspace did work before the Affinity Assistant was introduced in #2630. So I am interested to hear if you run into issues with your Pipeline when disabling the Affinity Assistant. What does not work well is having two parallel Tasks concurrently access the Workspace, but that does not work with only AZ affinity either. There might be different behavior depending on the volumeBindingMode of your storageClass, but I don't think that should affect your use case; it is a behavior that is a bit difficult to grasp.
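For context on that volumeBindingMode remark: the commonly recommended setting for topology-constrained storage is `WaitForFirstConsumer`, which delays PVC binding until the first pod using it is scheduled, so the volume is provisioned in that pod's zone. A minimal sketch, assuming the AWS EBS CSI provisioner mentioned later in this thread (the class name is hypothetical):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer      # hypothetical name
provisioner: ebs.csi.aws.com       # assumes the AWS EBS CSI driver
# The PVC is not bound (and the EBS volume not created) until the first pod
# that uses it is scheduled; the volume then lands in that pod's AZ, and
# later pods using the PVC are scheduled into the same AZ.
volumeBindingMode: WaitForFirstConsumer
```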
If we remove the Assistant, won't the task pods get scheduled anywhere, maybe into different zones? Does the PipelineRun somehow ensure that the workspace remains visible to all of them?
A Pod using a PVC is usually scheduled to where the PVC is - this is, for example, how it works when the PVC is on a local disk - and this is also how Tekton worked before #2630 (or works when the Affinity Assistant is disabled). But storage is a complicated field, so I cannot promise anything; it sounds like it should work that way for you too, but it might depend on how your storage system works. I am interested to hear if you get issues without the Affinity Assistant. The case I know does not work is when a Task uses two PVCs and those two PVCs have been used by two different previous Tasks that might have run in two different AZs. This was the case in #2546.
We will try, but my understanding is that the workspace PVC is created at the PipelineRun level. In our case, our storage backend is AWS EBS, so the PVC is only visible within its AZ. If the different task pods are scheduled in different AZs, I fail to see how that workspace PVC can be shared between task pods. Maybe I'm missing something. EDIT: OK, I get it now - the PVC is attached to the pod at creation time, so the pod will automatically be scheduled in the same AZ as the PVC. We totally missed that point, sorry about that. It should work perfectly without the Assistant in single-workspace-PVC scenarios.
Closing this up, @jlpettersson thanks a lot for your help!
If I understand tektoncd/pipeline#3731 correctly, we don't need it, since our tasks run sequentially. Since we use two sub-path mounts, we get two assistants, which could be scheduled on different nodes. In this situation our TaskRun pod can never start: `Multi-Attach error for volume "pvc-aac89874" Volume is already used by pod(s) affinity-assistant-38e-0`
We get one affinity assistant per "use" of the workspace (not per PVC, not per Task). We are mounting 2 subPaths of the same PVC in the same Task, and that's why we get 2 affinity assistants. This is a problem when the provider allows you to use a PVC from 2 different nodes but not in parallel. That's because your 2 affinity assistants could land on 2 different nodes and they would both try to mount the PVC, which is not allowed. So, depending on the storage class attributes, this could fail.

Notes:
* tektoncd/pipeline#3731
* our tasks run sequentially
* previously resulted in `Multi-Attach error for volume "pvc-aac89874" Volume is already used by pod(s) affinity-assistant-38e-0`
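As an illustration of the "one assistant per use of the workspace" behavior described above, here is a hedged sketch of a PipelineRun that binds the same PVC twice with different subPaths (all names and paths are hypothetical); each binding gets its own affinity assistant, so they can end up on different nodes:

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: two-subpath-run            # hypothetical name
spec:
  pipelineRef:
    name: build-pipeline           # hypothetical Pipeline
  workspaces:
    # Two bindings of the same underlying PVC -> two affinity assistants.
    - name: sources                # hypothetical workspace name
      subPath: sources
      persistentVolumeClaim:
        claimName: shared-pvc      # hypothetical existing PVC
    - name: cache                  # hypothetical workspace name
      subPath: cache
      persistentVolumeClaim:
        claimName: shared-pvc      # same PVC, second "use"
```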
Feature Request
Currently the Affinity Assistant hardcodes `kubernetes.io/hostname` as the affinity TopologyKey (pipeline/pkg/pod/pod.go, line 335 at 08177fa).
This setting results in a "scheduler deadlock" when some of the Task's pods fit on one node but not all of them. To solve this, we'd like to use the availability zone as the topology key instead, but because the setting is hardcoded we have to implement our own Affinity Assistant. Is there a reason not to support a custom TopologyKey?
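For concreteness, the kind of affinity term this request is about would look roughly like the following on the TaskRun pods. This is a hedged sketch, not the actual spec Tekton generates, and the label selector shown is hypothetical:

```yaml
# Pod affinity toward the affinity-assistant pod, scoped to a zone instead of
# a single node. With kubernetes.io/hostname (the current hardcoded value) all
# pods must fit on one node; with a zone topology key they only need to land
# in the same availability zone.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: affinity-assistant   # hypothetical label
        topologyKey: topology.kubernetes.io/zone              # instead of kubernetes.io/hostname
```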