Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extern/sector-storage: fix GPU usage overwrite bug #4627

Merged
merged 1 commit into from
Oct 29, 2020
Merged

extern/sector-storage: fix GPU usage overwrite bug #4627

merged 1 commit into from
Oct 29, 2020

Conversation

karalabe
Copy link
Contributor

GPU tracking currently is a bit broken on master because each subsequently scheduled task blindly overwrites the flag, irrelevant of what its current value is. E.g.:

  • Schedule C2, mark GPU as used
  • Schedule PC1, mark GPU as unused (blindly overwrite)
  • Schedule C2, mark GPU as used (double usage allocated)

The second C2 should not have been permitted to be allocated onto the same GPU, but since PC1 blindly overwritten the flag, a second C2 becomes schedulable. Repeat until the GPU is overloaded and starts going OOM (PC2 cannot recover from it).

The fix is that tasks should only set the flag if they themselves requested the GPU (and have been granted). If they didn't request the GPU, they should most definitely not set the GPU as unused.


Funky caveat: On my system at least, overlapping GPU allocation results in better resource use because I can run multiple PC2 or PC2+C2 on the same GPU. This fix will actually prevent that, but IMHO it's nonetheless necessary because a better GPU scheduling can't be written as long as tasks can exceed the available video RAM and crash.

@magik6k magik6k linked an issue Oct 28, 2020 that may be closed by this pull request
@magik6k
Copy link
Contributor

magik6k commented Oct 29, 2020

Thanks!

@magik6k magik6k merged commit f0f75e2 into filecoin-project:master Oct 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Worker API return wrong data
2 participants