fix: sealing: Make lotus-worker report GPU usage to miner during ReplicaUpdate task #10770

Closed
wants to merge 2 commits into from

Conversation

jcrowthe

Related Issues

Fixes #10767

Proposed Changes

This PR fixes a bug in which a lotus-worker never reports GPU usage to the lotus-miner while it is running a ReplicaUpdate task.

Output of lotus-miner sealing workers prior to this change:

Worker 84411102-45c2-4581-b988-a8b8e268d0c9, host worker6
TASK: RU(1/1)
CPU:  [||||||                                                          ] 6/64 core(s) in use
RAM:  [|||||||||||||||||||||||||||||                                   ] 44% 224.5 GiB/503.8 GiB
VMEM: [|||||||||||||||||||||||||||||                                   ] 44% 224.5 GiB/503.8 GiB
GPU:  [                                                                ] 0% 0.00/1 gpu(s) in use
GPU: NVIDIA GeForce RTX 3090, not used

After this change:

Worker 4ffe89c9-a31d-434a-8bf0-208aab47f0d4, host worker6
TASK: RU(1/1)
CPU:  [||||||                                                          ] 6/64 core(s) in use
RAM:  [||                                                              ] 1% 13.83 GiB/503.8 GiB
VMEM: [||                                                              ] 1% 9.83 GiB/503.8 GiB
GPU:  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100% 1.00/1 gpu(s) in use
GPU: NVIDIA GeForce RTX 3090, used

This fixes a scheduling bug where lotus-miner might schedule too many GPU-based tasks to a single worker, causing CUDA out-of-memory conditions.

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@jcrowthe jcrowthe requested a review from a team as a code owner April 26, 2023 23:00

jcrowthe commented May 3, 2023

Looks like there were some docs needing generation as well. Closing mine in favor of the identical #10806 with docs.

@jcrowthe jcrowthe closed this May 3, 2023
Development

Successfully merging this pull request may close these issues.

Miner task scheduling is unaware that ReplicaUpdate uses a GPU resource