Fix #87 - read GPU information directly from drivers #227

lars-t-hansen · 2024-12-20T13:10:36Z

TODO short term:

TODO longer term:

- check all available hardware to make sure we don't need the fallback solutions
  - Saga a100
  - Saga accel
- remove the fallback code entirely
- File bug about dealing with .so versioning, see comment below
- Fix the documentation: it says that we run nvidia-smi and rocm-smi, but after this patch we don't

lars-t-hansen · 2024-12-20T13:12:57Z

This is WIP, but essentially complete for NVIDIA. AMD will be the same - though the AMD documentation is mostly for a C++ interface, there is a lower-level, documented C interface that will be fine for us.

As noted on the bug, there are issues with linking, and the fact that the static libraries (or at least object files) must be built on hosts that have the GPU SDKs installed. For now, the .a files are part of this patch.

lars-t-hansen · 2024-12-20T13:13:41Z

For reasons not understood, linking against the static libraries works on some hosts but fails on others (including on CI).

lars-t-hansen · 2025-01-06T07:10:54Z

The linking issue appears to be a GCC problem. If I ensure that GCC 11 is loaded with the appropriate module load command, then linking works on both ML1 (cuda+gcc11) and ML4 (hipSYCL+gcc11); with the default GCC (gcc8), it does not.

lars-t-hansen · 2025-01-09T12:03:10Z

Since betzy is literally on fire I guess I won't be checking those nodes but I'm pretty satisfied that this approach will work everywhere, for now. I'm a little worried about having to load rocm_smi64.so.7 (why ".7"??!) but that's how it is. If we had an error channel it would be possible to check for other versions if the .7 is suddenly not there, and report back if other versions are found.
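
The fallback-and-report idea could look roughly like the sketch below. It is written in Python with `ctypes` purely for illustration and is not the code in this patch; the helper name `load_first_available` and the candidate file names are assumptions, and the error dictionary stands in for the error channel that does not exist yet.

```python
import ctypes


def load_first_available(candidates):
    """Try each shared-library name in order.

    Returns (handle, name, errors). handle and name are None if nothing
    could be loaded; errors maps every name that failed to the loader's
    message, so a caller can report which versions were missing.
    """
    errors = {}
    for name in candidates:
        try:
            return ctypes.CDLL(name), name, errors
        except OSError as exc:
            errors[name] = str(exc)
    return None, None, errors


# Hypothetical candidate list: the expected version first, then others
# worth probing. The exact file names on a given system are an assumption.
ROCM_CANDIDATES = ["librocm_smi64.so.%d" % v for v in (7, 6, 5)]

if __name__ == "__main__":
    handle, name, errors = load_first_available(ROCM_CANDIDATES)
    if handle is None:
        print("no SMI library found; tried:", ", ".join(errors))
    elif name != ROCM_CANDIDATES[0]:
        print("expected", ROCM_CANDIDATES[0], "but loaded", name)
```

Keeping the failures alongside the handle means the caller can tell "nothing found" apart from "found a different version than expected", which is the case worth reporting.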

Presumably nvidia has the same problem. There, we load a file that is a symlink to the .1 file which is a symlink to a file that has the driver version number encoded in it. It's possible we should load the .1 file directly, and then have a similar fallback / reporting mechanism for a version change.
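
To see which file a given library name really ends up at, the symlink chain can be walked one hop at a time. The following is a small illustrative sketch, not part of the patch; the function name `symlink_chain` is made up here, and the path you pass in depends on where the driver libraries live on the host.

```python
import os


def symlink_chain(path):
    """Return [path, target, target-of-target, ...] ending at a non-symlink."""
    chain = [path]
    seen = {path}
    while os.path.islink(chain[-1]):
        link = chain[-1]
        target = os.readlink(link)
        # A relative target is resolved against the directory holding the link;
        # os.path.join leaves an absolute target unchanged.
        target = os.path.normpath(os.path.join(os.path.dirname(link), target))
        if target in seen:  # guard against symlink loops
            break
        seen.add(target)
        chain.append(target)
    return chain
```

The second element of the returned list is the intermediate versioned name (the ".1" file in the NVIDIA case), and the last element is the file with the driver version in its name.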

I think dealing with the versioning will have to be a separate bug.

bast · 2025-01-13T10:27:29Z

Reviewing ...

lars-t-hansen · 2025-01-13T11:30:42Z

@bast, it'd be good to get this reviewed asap to see if this approach is sensible and acceptable, but it's probably best not to land it quite yet even if the review is positive; I'd like to fix the outstanding items on the to-do list above first.

bast · 2025-01-13T12:39:15Z

I finally read up on #87 and also the diff and I think this is a good solution. It increases complexity a bit but also solves other problems and I would not be able to suggest a better solution at this point.

lars-t-hansen · 2025-01-13T12:52:13Z

> I finally read up on #87 and also the diff and I think this is a good solution. It increases complexity a bit but also solves other problems and I would not be able to suggest a better solution at this point.

OK, thanks.

I'm not real fond of including compiled code - it's basically a vector for a hard-to-spot supply chain attack - but given that the data structures in the header files are not fully documented and available only on some systems there's really no good way for us to avoid the C compiler.

We might consider requiring 2FA when pushing to the official repo, and some routines for maintaining the integrity of the binary blobs. I'm already developing mostly on my own fork so this will hardly impact me.
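
One possible integrity routine, sketched here as an assumption rather than anything agreed in this thread, is to keep a reviewed manifest of SHA-256 digests for the checked-in `.a` files and verify it in CI. The function names and the manifest shape (a dict from path to hex digest) are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_blobs(manifest):
    """manifest maps file path -> expected hex digest.

    Returns the paths whose current digest differs from the manifest.
    A missing file raises FileNotFoundError rather than being skipped.
    """
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]
```

A check like this only helps if changes to the manifest itself get careful review, since an attacker who can replace a blob can usually also edit the digest next to it.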

bast · 2025-01-13T12:53:38Z

Good point about 2FA and careful review.

lars-t-hansen · 2025-01-14T07:30:59Z

@bast, I think this has no outstanding known issues and is ready for final review.

bast · 2025-01-21T20:46:07Z

Thank you for all the work!

lars-t-hansen changed the title ~~W 87 gpu apis~~ Fix #87 - read GPU information directly from drivers Dec 20, 2024

lars-t-hansen force-pushed the w-87-gpu-apis branch from df2d1db to 768d499 Compare January 6, 2025 07:04

lars-t-hansen force-pushed the w-87-gpu-apis branch from 71c7622 to d615f4f Compare January 7, 2025 12:48

lars-t-hansen force-pushed the w-87-gpu-apis branch from ce9125e to 0240a46 Compare January 9, 2025 12:26

lars-t-hansen marked this pull request as ready for review January 9, 2025 12:31

lars-t-hansen requested a review from bast January 9, 2025 12:32

Lars T Hansen added 4 commits January 14, 2025 08:10

- 5c910e5 For NordicHPC#87 - Read GPU data via SMI library for NVIDIA and AMD
- 92787c0 For NordicHPC#87 - remove old smi text parsing code
- 6937c2d For NordicHPC#87 - update README
- 509a054 For NordicHPC#87 - improved GPU test cases

lars-t-hansen force-pushed the w-87-gpu-apis branch from 0098732 to 509a054 Compare January 14, 2025 07:11

bast merged commit 0d5077d into NordicHPC:main Jan 21, 2025
2 checks passed

lars-t-hansen deleted the w-87-gpu-apis branch January 30, 2025 07:24
