[Bug]: GPUs cannot be used with Tainted Nodes #633

andrewballantyne · 2022-10-06T17:10:07Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

The difference is that I usually add a taint called "notebooksonly" to my nodes. I added it to the GPU node as well, because:
A) that node has no taint by default
B) I had noticed random pods landing on it and hogging it for nothing
However, a few minutes after I add the taint, the drop-down disappears.
When I remove the taint, the drop-down reappears.

From the downstream issue https://issues.redhat.com/browse/RHODS-4769

Expected Behavior

GPUs should work for both using tolerations & not using tolerations.

Steps To Reproduce

See the current behaviour / downstream ticket.

This is likely related to the Prometheus query omitting results, but that is just a guess. It might also be affected by #573 and support for scaling (which could change the "show the dropdown" logic, and may not work with unscalable nodes that are tainted).

Workaround (if any)

No response

OpenShift Infrastructure Version

No response

Openshift Version

No response

What browsers are you seeing the problem on?

No response

Open Data Hub Version

2.3.0

Relevant log output

No response

guimou · 2023-07-10T17:56:00Z

As it is, tolerations added manually to a Notebook definition are "erased" when you start the Notebook through the UI. This is incompatible with multiple pools of GPUs that you have to taint to properly schedule specific workloads, or just with standard management of workload placement through taints.
We should either:

- Leave the tolerations in the Notebook CR alone, or only reconcile the ones we provide access to through the UI, such as GPU request = nvidia.com/gpu toleration.
- Or provide a mechanism in the UI to fully manage tolerations on Notebooks.
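
To illustrate the first option, here is a rough sketch of reconciling only the tolerations the UI owns while leaving manually added ones untouched. It is not the dashboard's actual code: `UI_MANAGED_KEYS`, `reconcile_tolerations`, and the example toleration values are hypothetical names and values made up for this example, and the assumption is that the UI only manages the `nvidia.com/gpu` key.

```python
from typing import Dict, List

Toleration = Dict[str, str]

# Assumption: the only toleration key the UI manages is the GPU one.
UI_MANAGED_KEYS = {"nvidia.com/gpu"}


def reconcile_tolerations(
    existing: List[Toleration],
    ui_tolerations: List[Toleration],
) -> List[Toleration]:
    """Keep tolerations the user added by hand; replace only UI-managed ones."""
    preserved = [t for t in existing if t.get("key") not in UI_MANAGED_KEYS]
    return preserved + list(ui_tolerations)


# Hypothetical Notebook CR state: one manual toleration, one UI-managed one.
existing = [
    {"key": "notebooksonly", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
]

# Starting the Notebook without a GPU: the UI supplies no GPU toleration,
# so only the UI-managed entry is dropped and the manual one survives.
print(reconcile_tolerations(existing, []))
```

With this approach a toleration added by hand for a tainted GPU pool would survive a restart through the UI, at the cost of the UI never cleaning up tolerations it does not know about.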

andrewballantyne · 2023-07-10T20:26:13Z

So today we have a toleration for nvidia.com/gpu (code link) -- this should be covered if you taint your node with the same. We also have the "NotebooksOnly" (configurable) default toleration for Notebooks if you check out the cluster settings as an admin -- but that's not tied to (Nvidia) GPU, that's additional.
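
For reference, a small sketch of how Kubernetes matches tolerations against node taints, to show why a node carrying both taints needs both tolerations on the pod. This is a simplified model of the scheduler rule, not dashboard code: the helper names `tolerates` and `schedulable` are hypothetical, and the `NoSchedule` effect on the example taints is an assumption, since the issue does not say which effect was used.

```python
from typing import Dict, List

Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")  # Kubernetes default
    key = toleration.get("key", "")
    if key == "":
        # An empty key is only valid with Exists and matches every key.
        if operator != "Exists":
            return False
    elif key != taint["key"]:
        return False
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches all effects
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: List[Toleration], taints: List[Taint]) -> bool:
    """Every hard taint on the node must be tolerated by the pod."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Assumed taints on the GPU node from the report (effect is a guess).
node_taints = [
    {"key": "nvidia.com/gpu", "effect": "NoSchedule"},
    {"key": "notebooksonly", "effect": "NoSchedule"},
]
gpu_tol = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
notebooks_tol = {"key": "notebooksonly", "operator": "Exists", "effect": "NoSchedule"}

print(schedulable([gpu_tol], node_taints))
print(schedulable([gpu_tol, notebooks_tol], node_taints))
```

In this model the GPU toleration alone is not enough once the node also carries the notebooks taint, which is why the two tolerations have to be applied together.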

I would imagine this is handled already to some degree 🤔

In the work for accelerators, I imagine this will be more flexible and seamless for what is being tainted/tolerated... effectively this should be done, no?

See [UI] Habana Support Part 1 #1450
- More specifically the CR we will create to allow customization for accelerators (Create a custom resource to track accelerator devices #1371)

dgutride · 2023-11-13T18:54:45Z

This is no longer valid due to Habana work - moving to closed after talking to Gage. Please reopen with questions if there are any misunderstandings.

andrewballantyne added this to ODH Dashboard Planning Oct 6, 2022

andrewballantyne moved this to Backlog in ODH Dashboard Planning Oct 6, 2022

andrewballantyne added this to the Next Release milestone Oct 6, 2022

andrewballantyne moved this from Backlog to To do in ODH Dashboard Planning Oct 6, 2022

andrewballantyne modified the milestones: v2.4.0, Next Release Oct 31, 2022

andrewballantyne modified the milestones: v2.5.0, Next Release Nov 18, 2022

andrewballantyne removed this from the Next Release milestone Nov 29, 2022

andrewballantyne added this to the Next Release milestone Jan 16, 2023

andrewballantyne removed this from the Next Release milestone Jan 23, 2023

andrewballantyne added this to the Next Release milestone Feb 17, 2023

andrewballantyne removed this from the Next Release milestone Mar 9, 2023

lucferbux moved this from To do to Backlog in ODH Dashboard Planning Apr 26, 2023

andrewballantyne added the feature/accelerator-support All things related to Accelerators label May 18, 2023

andrewballantyne added the field-priority Flag to track improvements that are for stability -- effort to put in front of new functionality label Jun 28, 2023

andrewballantyne moved this from Backlog to To do in ODH Dashboard Planning Jun 28, 2023

jkoehler-redhat added this to ODH Feature Tracking Jul 19, 2023

jkoehler-redhat moved this to Dashboard in ODH Feature Tracking Jul 19, 2023

lucferbux modified the milestones: Current Release, Upcoming Release Aug 9, 2023

Gkrumbach07 self-assigned this Aug 29, 2023

Gkrumbach07 moved this from To do to In progress in ODH Dashboard Planning Aug 29, 2023

Gkrumbach07 removed their assignment Aug 29, 2023

Gkrumbach07 moved this from In progress to To do in ODH Dashboard Planning Aug 29, 2023

andrewballantyne removed this from the Current Release milestone Sep 15, 2023

andrewballantyne added this to Internal tracking Oct 5, 2023

adrien-legros mentioned this issue Oct 26, 2023

[Feature Request]: Inject default configurations in Notebook CR on workbench creation #2015

Closed

dgutride assigned Gkrumbach07 Nov 13, 2023

dgutride closed this as completed Nov 13, 2023

github-project-automation bot moved this from Dev To do to Done in ODH Dashboard Planning Nov 13, 2023

github-project-automation bot moved this to Done in Internal tracking Nov 13, 2023
