Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #87 - read GPU information directly from drivers #227

Merged
merged 4 commits into from
Jan 21, 2025

Conversation

lars-t-hansen
Copy link
Collaborator

@lars-t-hansen lars-t-hansen commented Dec 20, 2024

TODO short term:

  • UUID for AMD devices, can get this with the amdsmi library if available, but should not fail if not available
  • A little better documentation in sonar-nvidia.h
  • Linking issues on the build bot and indeed on the dev systems if the "right" gcc is not loaded, needs exploration
  • amd_smi.rs: extract architecture somehow?
  • amd_smi.rs: extract mem_pct somehow? do we need it?
  • sonar-nvidia.c etc - perf state need not be formatted as string at this api, better if it isn't
  • sonar-nvidia.c etc - compute mode can maybe be transmitted as some well-known integer instead
  • (fixing those two allows us to derive the Default for the nvidia crad state)
  • sonar-amd.c etc - clean up buffer sizes as much as possible, and fix the comment about this
  • more testing, perhaps esp nvidia systems, the AMD tests look ok for now
  • sysinfo now reports ... @ 7GB but it's 7.95GB or thereabouts, it should round and not truncate

TODO longer term:

  • check all available hardware to make sure we don't need the fallback solutions
    • Saga a100
    • Saga accel
  • remove the fallback code entirely
  • File bug about dealing with .so versioning, see comment below
  • Fix the documentation: it says that we run nvidia-smi and rocm-smi, but after this patch we don't

@lars-t-hansen lars-t-hansen changed the title W 87 gpu apis Fix #87 - read GPU information directly from drivers Dec 20, 2024
@lars-t-hansen
Copy link
Collaborator Author

This is WIP, but essentially complete for NVIDIA. AMD will be the same - though the AMD documentation is mostly for a C++ interface, there is a lower-level, documented C interface that will be fine for us.

As noted on the bug, there are issues with linking, and the fact that the static libraries (or at least object files) must be built on hosts that have the GPU SDKs installed. For now, the .a files are part of this patch.

@lars-t-hansen
Copy link
Collaborator Author

For reasons not understood, linking against the static libraries works on some hosts but fails on others (including on CI).

@lars-t-hansen
Copy link
Collaborator Author

The linking issue appears to be a GCC problem. If I ensure that GCC11 is loaded with the appropriate module load command then linking works on both ML1 (cuda+gcc11) and ML4 (hipSYCL+gcc11) ; with the default GCC (gcc8), it does not.

@lars-t-hansen
Copy link
Collaborator Author

Since betzy is literally on fire I guess I won't be checking those nodes but I'm pretty satisfied that this approach will work everywhere, for now. I'm a little worried about having to load rocm_smi64.so.7 (why ".7"??!) but that's how it is. If we had an error channel it would be possible to check for other versions if the .7 is suddenly not there, and report back if other versions are found.

Presumably nvidia has the same problem. There, we load a file that is a symlink to the .1 file which is a symlink to a file that has the driver version number encoded in it. It's possible we should load the .1 file directly, and then have a similar fallback / reporting mechanism for a version change.

I think dealing with the versioning will have to be a separate bug.

@lars-t-hansen lars-t-hansen marked this pull request as ready for review January 9, 2025 12:31
@lars-t-hansen lars-t-hansen requested a review from bast January 9, 2025 12:32
@bast
Copy link
Member

bast commented Jan 13, 2025

Reviewing ...

@lars-t-hansen
Copy link
Collaborator Author

@bast, it'd be good to get this reviewed asap to see if this approach is sensible and acceptable, but probably best not to land quite yet even if review is positive, I'd like to fix the outstanding items on the to-do list above.

@bast
Copy link
Member

bast commented Jan 13, 2025

I finally read up on #87 and also the diff and I think this is a good solution. It increases complexity a bit but also solves other problems and I would not be able to suggest a better solution at this point.

@lars-t-hansen
Copy link
Collaborator Author

I finally read up on #87 and also the diff and I think this is a good solution. It increases complexity a bit but also solves other problems and I would not be able to suggest a better solution at this point.

OK, thanks.

I'm not real fond of including compiled code - it's basically a vector for a hard-to-spot supply chain attack - but given that the data structures in the header files are not fully documented and available only on some systems there's really no good way for us to avoid the C compiler.

We might consider requiring 2fa when pushing to the official repo, and some routines for maintaining the integrity of the binary blobs. I'm already developing mostly on my own fork so this will hardly impact me.

@bast
Copy link
Member

bast commented Jan 13, 2025

Good point about 2FA and careful review.

@lars-t-hansen
Copy link
Collaborator Author

@bast, I think this has no outstanding known issues and is ready for final review.

@bast bast merged commit 0d5077d into NordicHPC:main Jan 21, 2025
2 checks passed
@bast
Copy link
Member

bast commented Jan 21, 2025

Thank you for all the work!

@lars-t-hansen lars-t-hansen deleted the w-87-gpu-apis branch January 30, 2025 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants