[Bug]: GPUs cannot be used with Tainted Nodes #633

Closed
1 task done
andrewballantyne opened this issue Oct 6, 2022 · 3 comments
Assignees
Labels
feature/accelerator-support All things related to Accelerators feature/notebook-controller KubeFlow NoteBook Controller (KFNBC) Feature field-priority Flag to track improvements that are for stability -- effort to put in front of new functionality kind/bug Something isn't working priority/high Important issue that needs to be resolved asap. Releases should not have too many of these.

Comments

@andrewballantyne
Member

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

The difference is that I usually add a taint called "notebooksonly" to my nodes. I added it to the GPU node because:
A) that node has no taint by default
B) I had noticed random pods landing on it and hogging it for nothing
However, a few minutes after I add the taint, the drop-down disappears.
When I remove the taint, the drop-down reappears.

From the downstream issue https://issues.redhat.com/browse/RHODS-4769

Expected Behavior

GPUs should work both with and without tolerations.

Steps To Reproduce

See the current behaviour / downstream ticket.

This is likely related to the Prometheus query omitting results, but that is just a guess. It might also be affected by #573 and the support for scaling, which could change the "show the dropdown" logic and may not work with unscalable nodes that are tainted.

Workaround (if any)

No response

OpenShift Infrastructure Version

No response

Openshift Version

No response

What browsers are you seeing the problem on?

No response

Open Data Hub Version

2.3.0

Relevant log output

No response

@andrewballantyne andrewballantyne added kind/bug Something isn't working untriaged Indicates the newly create issue has not been triaged yet feature/notebook-controller KubeFlow NoteBook Controller (KFNBC) Feature priority/high Important issue that needs to be resolved asap. Releases should not have too many of these. and removed untriaged Indicates the newly create issue has not been triaged yet labels Oct 6, 2022
@andrewballantyne andrewballantyne added this to the Next Release milestone Oct 6, 2022
@andrewballantyne andrewballantyne moved this from Backlog to To do in ODH Dashboard Planning Oct 6, 2022
@andrewballantyne andrewballantyne modified the milestones: v2.4.0, Next Release Oct 31, 2022
@andrewballantyne andrewballantyne modified the milestones: v2.5.0, Next Release Nov 18, 2022
@andrewballantyne andrewballantyne removed this from the Next Release milestone Nov 29, 2022
@andrewballantyne andrewballantyne added this to the Next Release milestone Jan 16, 2023
@andrewballantyne andrewballantyne removed this from the Next Release milestone Jan 23, 2023
@andrewballantyne andrewballantyne added this to the Next Release milestone Feb 17, 2023
@andrewballantyne andrewballantyne removed this from the Next Release milestone Mar 9, 2023
@lucferbux lucferbux moved this from To do to Backlog in ODH Dashboard Planning Apr 26, 2023
@andrewballantyne andrewballantyne added the feature/accelerator-support All things related to Accelerators label May 18, 2023
@andrewballantyne andrewballantyne added the field-priority Flag to track improvements that are for stability -- effort to put in front of new functionality label Jun 28, 2023
@andrewballantyne andrewballantyne moved this from Backlog to To do in ODH Dashboard Planning Jun 28, 2023
@guimou
Member

guimou commented Jul 10, 2023

As it is, tolerations added manually to a Notebook definition are "erased" when you start the Notebook through the UI. This is incompatible with having multiple pools of GPUs that you must taint to schedule specific workloads properly, or simply with standard management of workload placement through taints.
We should either:

  • Leave the tolerations in the Notebook CR alone, or only reconcile the ones we provide access to through the UI, such as GPU request = nvidia.com/gpu toleration.
  • Or provide a mechanism in the UI to fully manage tolerations on Notebooks.
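
To make the first option concrete, here is a minimal sketch of "only reconcile the ones we own". It is not the dashboard's actual code: the `Toleration` shape is simplified, and `UI_MANAGED_KEYS` and `reconcileTolerations` are hypothetical names made up for illustration. The only key taken from this thread is `nvidia.com/gpu`.

```typescript
// Simplified, hypothetical shape of a Kubernetes toleration.
interface Toleration {
  key?: string;
  operator?: 'Exists' | 'Equal';
  value?: string;
  effect?: 'NoSchedule' | 'PreferNoSchedule' | 'NoExecute';
  tolerationSeconds?: number;
}

// Assumption: the taint keys the UI owns and is allowed to add or remove.
const UI_MANAGED_KEYS = new Set<string>(['nvidia.com/gpu']);

/**
 * Keep every toleration the UI does not own, drop stale UI-owned ones,
 * then append the UI-owned tolerations wanted for this start request.
 */
function reconcileTolerations(
  existing: Toleration[],
  desiredManaged: Toleration[],
  managedKeys: Set<string> = UI_MANAGED_KEYS,
): Toleration[] {
  // Tolerations added by hand (or by anything else) pass through untouched.
  const userOwned = existing.filter(
    (t) => t.key === undefined || !managedKeys.has(t.key),
  );
  // Only keys the UI owns are accepted from the UI side.
  const managed = desiredManaged.filter(
    (t) => t.key !== undefined && managedKeys.has(t.key),
  );
  return [...userOwned, ...managed];
}
```

With this approach, a hand-added toleration for a custom GPU pool taint would survive a start from the UI, while the `nvidia.com/gpu` toleration would still follow the GPU request.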

@andrewballantyne
Member Author

So today we have a toleration for nvidia.com/gpu (code link) -- this should be covered if you taint your node with the same. We also have the "NotebooksOnly" (configurable) default toleration for Notebooks if you check out the cluster settings as an admin -- but that's not tied to (Nvidia) GPU, that's additional.

I would imagine this is handled already to some degree 🤔

In the work for accelerators, I imagine this will be more flexible and seamless for what is being tainted/tolerated... effectively this should be done, no?
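
For reference, a rough sketch of how Kubernetes matches tolerations against taints, to show why a pod that only tolerates `nvidia.com/gpu` is still kept off a node that also carries a "notebooksonly" taint. This is an illustration of the standard matching rules, not code from the dashboard; the helper names are hypothetical, and the `NoSchedule` effect on both taints is an assumption, since the issue does not say which effect was used.

```typescript
// Simplified, hypothetical shapes for illustration only.
type TaintEffect = 'NoSchedule' | 'PreferNoSchedule' | 'NoExecute';

interface Taint {
  key: string;
  value?: string;
  effect: TaintEffect;
}

interface Toleration {
  key?: string;
  operator?: 'Exists' | 'Equal';
  value?: string;
  effect?: TaintEffect;
  tolerationSeconds?: number;
}

// True when a single toleration matches a single taint.
function tolerates(toleration: Toleration, taint: Taint): boolean {
  // A toleration without an effect matches every effect.
  if (toleration.effect !== undefined && toleration.effect !== taint.effect) {
    return false;
  }
  // The operator defaults to Equal when it is not set.
  const operator = toleration.operator ?? 'Equal';
  // An empty key is only valid with Exists and then matches every taint key.
  if (toleration.key === undefined || toleration.key === '') {
    return operator === 'Exists';
  }
  if (toleration.key !== taint.key) {
    return false;
  }
  return operator === 'Exists' || (toleration.value ?? '') === (taint.value ?? '');
}

// Taints that would keep the pod off the node (PreferNoSchedule is only a soft preference).
function untoleratedTaints(taints: Taint[], tolerations: Toleration[]): Taint[] {
  return taints.filter(
    (taint) =>
      taint.effect !== 'PreferNoSchedule' &&
      !tolerations.some((t) => tolerates(t, taint)),
  );
}

// Assumed setup: a GPU node tainted twice, and a pod with only the GPU toleration.
const nodeTaints: Taint[] = [
  { key: 'nvidia.com/gpu', effect: 'NoSchedule' },
  { key: 'notebooksonly', effect: 'NoSchedule' },
];
const podTolerations: Toleration[] = [
  { key: 'nvidia.com/gpu', operator: 'Exists', effect: 'NoSchedule' },
];

// The only blocking taint left is "notebooksonly".
console.log(untoleratedTaints(nodeTaints, podTolerations).map((t) => t.key));
```

Every blocking taint on the node needs a matching toleration on the pod, so the GPU toleration alone does not cover extra taints an admin adds to the GPU node.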

@lucferbux lucferbux modified the milestones: Current Release, Upcoming Release Aug 9, 2023
@Gkrumbach07 Gkrumbach07 self-assigned this Aug 29, 2023
@dgutride
Contributor

This is no longer valid due to Habana work - moving to closed after talking to Gage. Please reopen with questions if there are any misunderstandings.

Projects
Status: Done
Status: Dashboard
Archived in project
Development

No branches or pull requests

5 participants