Fix #87 - read GPU information directly from drivers #227
This is WIP, but essentially complete for NVIDIA. AMD will work the same way: although the AMD documentation mostly covers a C++ interface, there is a lower-level, documented C interface that will be fine for us. As noted on the bug, there are issues with linking, and the static libraries (or at least the object files) must be built on hosts that have the GPU SDKs installed. For now, the .a files are part of this patch.
For reasons not yet understood, linking against the static libraries works on some hosts but fails on others (including on CI).
Branch force-pushed from df2d1db to 768d499.
The linking issue appears to be a GCC problem. If I ensure that GCC11 is loaded with the appropriate |
Branch force-pushed from 71c7622 to d615f4f.
Since betzy is literally on fire I guess I won't be checking those nodes, but I'm pretty satisfied that this approach will work everywhere, for now. I'm a little worried about having to load rocm_smi64.so.7 (why ".7"??!) but that's how it is. If we had an error channel, it would be possible to check for other versions if the .7 is suddenly not there, and to report back if other versions are found. Presumably NVIDIA has the same problem. There, we load a file that is a symlink to the .1 file, which in turn is a symlink to a file that has the driver version number encoded in its name. It's possible we should load the .1 file directly and then have a similar fallback / reporting mechanism for a version change. I think dealing with the versioning will have to be a separate bug.
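To make the fallback / reporting idea concrete, here is a minimal sketch in Python rather than in the project's own code. Everything named here is an assumption for illustration: the helper `load_first_available`, the candidate file names (including the `lib` prefix and any version other than `.7`), and the idea of returning the failures to the caller in place of a real error channel.

```python
import ctypes
from typing import Callable, Iterable, List, Optional, Tuple


def load_first_available(
    candidates: Iterable[str],
    opener: Callable[[str], object] = ctypes.CDLL,
) -> Tuple[Optional[object], Optional[str], List[str]]:
    """Try each library name in order.

    Returns (handle, name_used, failures). `failures` lists every name that
    could not be loaded before the first success, so the caller can report
    that the preferred version was missing and which one was used instead.
    """
    failures: List[str] = []
    for name in candidates:
        try:
            # ctypes.CDLL raises OSError when the library cannot be loaded.
            return opener(name), name, failures
        except OSError as exc:
            failures.append(f"{name}: {exc}")
    return None, None, failures


# Hypothetical candidate lists: preferred version first, fallbacks after.
# Only the ".7" and ".1" suffixes come from the discussion above.
AMD_CANDIDATES = ["librocm_smi64.so.7", "librocm_smi64.so"]
NVIDIA_CANDIDATES = ["libnvidia-ml.so.1", "libnvidia-ml.so"]
```

A caller would invoke `load_first_available(AMD_CANDIDATES)` and, if `failures` is non-empty or the handle is `None`, pass that list on to whatever error reporting eventually exists. The `opener` parameter exists only so the logic can be exercised without a GPU driver installed.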
Branch force-pushed from ce9125e to 0240a46.
Reviewing ...
@bast, it'd be good to get this reviewed asap to see whether this approach is sensible and acceptable. Even if the review is positive, it's probably best not to land it quite yet: I'd like to fix the outstanding items on the to-do list above.
I finally read up on #87 and also the diff, and I think this is a good solution. It increases complexity a bit but also solves other problems, and I would not be able to suggest a better solution at this point.
OK, thanks. I'm not real fond of including compiled code - it's basically a vector for a hard-to-spot supply chain attack - but given that the data structures in the header files are not fully documented and are available only on some systems, there's really no good way for us to avoid the C compiler. We might consider requiring 2FA when pushing to the official repo, and some routines for maintaining the integrity of the binary blobs. I'm already developing mostly on my own fork so this will hardly impact me.
Good point about 2FA and careful review.
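One possible shape for such an integrity routine is a checksum manifest that is reviewed alongside the blobs and checked in CI. The sketch below is an assumption about how that could look, not something this PR implements; the function names and the manifest format (a mapping from relative path to SHA-256 hex digest) are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Dict, List


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected: Dict[str, str], root=".") -> List[str]:
    """Return the sorted relative paths whose file is missing or whose
    digest differs from the expected value in the manifest."""
    bad = []
    for rel, want in expected.items():
        path = Path(root) / rel
        got = sha256_of(path) if path.is_file() else None
        if got is None or not hmac.compare_digest(got, want.lower()):
            bad.append(rel)
    return sorted(bad)
```

A CI step could fail the build whenever `find_mismatches` returns a non-empty list, so that any change to a committed `.a` file has to come with a visible, reviewable change to the manifest.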
Branch force-pushed from 0098732 to 509a054.
@bast, I think this has no outstanding known issues and is ready for final review.
Thank you for all the work!
TODO short term:
... @ 7GB
but it's 7.95GB or thereabouts; it should round and not truncate
TODO longer term:
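The short-term rounding item above comes down to integer arithmetic. A minimal sketch, assuming the value is available as a byte count and that "GB" means 2^30 bytes (both are assumptions; the function names are hypothetical):

```python
GIB = 1024 ** 3  # assumption: "GB" in the report means 2**30 bytes


def gib_truncated(n_bytes: int) -> int:
    """Integer division drops the fractional part."""
    return n_bytes // GIB


def gib_rounded(n_bytes: int) -> int:
    """Adding half a unit before dividing rounds to the nearest whole unit."""
    return (n_bytes + GIB // 2) // GIB
```

With a card of about 7.95 units, `gib_truncated` yields 7 while `gib_rounded` yields 8, which is the behaviour the item asks for.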