Add autoscale check for GPU #573
Conversation
Skipping CI for Draft Pull Request.
…into gpu-autoscale
@andrewballantyne Should I add the autoscaler value to the available GPUs and let the dropdown just show a higher number, with no UX difference?
@maroroman see Jeff's comment here, I think the answer to your question is this sentence:
No UX difference, just show the numbers.
Force-pushed from 1450cec to 067ce2f
Review thread on frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx (outdated, resolved)
Force-pushed from 044c57f to aadae4d
FYI: Added one more force push to deal with a log and fix the role YAML.
Review thread on frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx (outdated, resolved)
…Field.tsx Use the value from the reduce
Review thread on frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx (outdated, resolved)
Added some data caching on the backend and removed the re-fetching toggle on the frontend.
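The caching mentioned above could look something like the following minimal sketch. The names (`GpuData`, `fetchGpuDataFromCluster`) and the 30-second TTL are assumptions for illustration, not the PR's actual code:

```typescript
// Sketch of simple time-based caching for expensive cluster queries.
// All identifiers here are hypothetical stand-ins for the PR's real code.
type GpuData = { available: number; autoscalerMax: number };

let cachedGpuData: GpuData | null = null;
let cacheTimestamp = 0;
const CACHE_TTL_MS = 30_000; // assumed TTL; the real value may differ

// Placeholder for the real Kubernetes/Prometheus round-trips.
const fetchGpuDataFromCluster = async (): Promise<GpuData> => {
  return { available: 0, autoscalerMax: 0 };
};

const getGpuData = async (): Promise<GpuData> => {
  const now = Date.now();
  if (cachedGpuData && now - cacheTimestamp < CACHE_TTL_MS) {
    return cachedGpuData; // serve cached value, skip cluster round-trips
  }
  cachedGpuData = await fetchGpuDataFromCluster();
  cacheTimestamp = now;
  return cachedGpuData;
};
```

With a cache like this, repeated requests within the TTL return the same object instead of re-querying the cluster, which pairs with removing the frontend's re-fetching toggle.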
@maroroman we've made some progress here, but please test with a built image:
Some comments on the existing code: there's another `await` inside a for loop that you need to convert to `Promise.all`. If each request takes 3 seconds and there are 10 nodes, the user will wait 30 seconds for the dialog to get a GPU value, and the wait grows O(n) as the cluster adds nodes. I think this is affecting all current users. Unless a request needs a previous result, you shouldn't wait before executing. I think you also have to take out the per-request token fetch: this token should be found at startup and added to `fastify.kube.saToken` so you don't have to fetch it on every request.
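The sequential-vs-parallel point above can be sketched as follows. The function names and the `NodeGpuInfo` shape are hypothetical, not the PR's actual identifiers:

```typescript
// Illustration of the review comment: awaiting inside a for loop makes
// total latency grow linearly with node count; Promise.all starts all
// requests at once. All names here are illustrative.
type NodeGpuInfo = { node: string; gpus: number };

// Stand-in for a per-node API request.
const fetchNodeGpuInfo = async (node: string): Promise<NodeGpuInfo> => {
  return { node, gpus: 0 };
};

// Before: each await blocks the next request -> O(n) total wall time.
const getGpuInfoSequential = async (nodes: string[]): Promise<NodeGpuInfo[]> => {
  const results: NodeGpuInfo[] = [];
  for (const node of nodes) {
    results.push(await fetchNodeGpuInfo(node)); // waits before starting the next
  }
  return results;
};

// After: fire all requests immediately and await them together,
// so total wall time is roughly that of the slowest single request.
const getGpuInfoParallel = (nodes: string[]): Promise<NodeGpuInfo[]> =>
  Promise.all(nodes.map((node) => fetchNodeGpuInfo(node)));
```

`Promise.all` preserves input order in its results, so this is a drop-in change when no request depends on a previous result.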
/lgtm
…into gpu-autoscale
/unhold
Forgot to unhold this after Chris approved it. Doing so now.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andrewballantyne, cfchase. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
* Add autoscale check
* Clean after rebase
* Add scalable gpus to dropdown count
* Change code to use MachineAutoscalers
* Use machineautoscalers, add logic to frontend
* Check available machine replicas in max scalable calculation
* Add role change and fix availability check
* Fix machineSet get logic
* Update frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx: Use the value from the reduce
* Update backend/src/routes/api/gpu/gpuUtils.ts
* Update backend/src/routes/api/gpu/gpuUtils.ts
* Update frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx
* Add saToken to kube, optimise GPU data getting

Co-authored-by: Andrew Ballantyne <8126518+andrewballantyne@users.noreply.github.com>
Resolves #407
Description
This PR adds logic to check the MachineAutoscaler CRs for the maximum number of scalable GPU nodes.
How Has This Been Tested?
Testing was done on an OSD cluster with a GPU node with autoscaling enabled. The backend reads the
machine.openshift.io/GPU
value, which should contain the true GPU amount (correct me if my assumption is wrong). The resulting backend response was this:
The value is used by the frontend to extend the dropdown when the number of scalable GPUs exceeds the GPU count reported by Prometheus metrics.
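The dropdown-extension logic described above could be sketched as follows. The function names are hypothetical, not the PR's actual code in GPUSelectField.tsx:

```typescript
// Sketch of the described behavior: the dropdown's upper bound is the
// larger of the Prometheus-reported GPU count and the GPU count reachable
// via the MachineAutoscaler. Names are illustrative stand-ins.
const getMaxGpuOptions = (reportedGpus: number, autoscalerMaxGpus: number): number =>
  Math.max(reportedGpus, autoscalerMaxGpus);

// Build the select options 0..max, shown with no UX difference.
const gpuOptions = (max: number): number[] =>
  Array.from({ length: max + 1 }, (_, i) => i);
```

For example, with 2 GPUs reported by Prometheus and an autoscaler ceiling of 4, the dropdown would offer 0 through 4.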
Merge criteria: