feat: Add GPU discovery utilities #4695
Conversation
Walkthrough
This PR introduces GPU discovery utilities for fault-tolerance testing by adding a new gpu_discovery.py helpers module.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (3)
56-58: Use logging.exception for better debugging.

Replace logging.error with logging.exception to automatically include the full traceback. This pattern applies to all exception handlers in this module (lines 57, 134, 175, 226, 270).

Apply this diff:

```diff
 except Exception as e:
-    logger.error(f"Failed to get GPU IDs from pod {pod.name}: {e}")
+    logger.exception(f"Failed to get GPU IDs from pod {pod.name}: {e}")
     return []
```

Repeat the same change for the exception handlers in:
- get_gpu_id_for_process (line 134)
- get_gpu_pci_address (line 175)
- get_gpu_info (line 226)
- get_processes_on_gpu (line 270)

A sketch of the resulting handler pattern is shown below.
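For illustration only, a handler rewritten this way might look as follows; the function body and nvidia-smi invocation are assumptions, not the PR's actual code:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def get_available_gpu_ids_sketch() -> list[int]:
    """Sketch: list GPU indices via nvidia-smi, logging a full traceback on failure."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return [int(line.strip()) for line in out.stdout.splitlines() if line.strip().isdigit()]
    except Exception:
        # logging.exception logs at ERROR level and appends the active traceback,
        # so the message no longer needs to interpolate the exception object.
        logger.exception("Failed to get GPU IDs")
        return []
```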
61-135: Clarify the return value semantics in edge cases.

The function returns 0 when no GPUs exist (line 89) or on exception (line 135), but returns gpu_ids[0] when the process isn't found (line 131). This creates ambiguity: does 0 mean "GPU 0" or "error/not found"? Consider documenting this behavior more explicitly in the docstring, especially since callers might not distinguish between "no GPUs available" and "process is on GPU 0".

For example, update the Returns section:

```diff
 Returns:
-    GPU ID (0-N) where the process is running, or 0 if not found
+    GPU ID (0-N) where the process is running. Returns 0 if no GPUs exist or
+    on error. Returns the first available GPU if the process isn't found on
+    any GPU (process may not have initialized CUDA yet).
```

An illustrative alternative that avoids the ambiguity is sketched below.
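One way to remove the ambiguity entirely, shown here only as an illustrative alternative (the PR keeps the 0 default), is to return None for the not-found cases:

```python
def get_gpu_id_for_process_sketch(pid: int, gpu_ids: list[int]) -> int | None:
    """Sketch: None means 'no GPUs / not found', so 0 can only ever mean GPU 0."""
    if not gpu_ids:
        return None  # no GPUs visible in the pod
    for gpu_id in gpu_ids:
        if pid in processes_on_gpu(gpu_id):
            return gpu_id
    return None  # process not attached to any GPU (e.g. CUDA not initialized yet)


def processes_on_gpu(gpu_id: int) -> list[int]:
    """Placeholder so the sketch is self-contained; a real version would query nvidia-smi."""
    return []
```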
56-56: Consider catching more specific exceptions.

All functions catch broad Exception, which is flagged by static analysis. While this provides robustness for a testing utility (returning safe defaults on any failure), consider catching more specific exceptions like subprocess.CalledProcessError, AttributeError, or ValueError to better distinguish between different failure modes. However, given this is a fault-tolerance testing utility where robustness is critical, the current approach is acceptable if paired with logging.exception (already suggested) for full tracebacks.

Also applies to: 133-133, 174-174, 225-225, 269-269

A sketch of narrower handlers follows.
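A minimal sketch of what narrower handlers could look like, assuming the queries go through subprocess (the function name and signature are hypothetical):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def query_gpu_field_sketch(gpu_id: int, field: str) -> str | None:
    """Sketch: catch the failure modes we actually expect instead of bare Exception."""
    try:
        out = subprocess.run(
            ["nvidia-smi", f"--id={gpu_id}", f"--query-gpu={field}", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        )
        return out.stdout.strip() or None
    except subprocess.CalledProcessError:
        logger.exception("nvidia-smi exited non-zero for GPU %s", gpu_id)
    except subprocess.TimeoutExpired:
        logger.exception("nvidia-smi timed out for GPU %s", gpu_id)
    except (OSError, ValueError):
        logger.exception("Unexpected failure querying GPU %s", gpu_id)
    return None
```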
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1 hunks)
- tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4695/merge) by tzulingk.
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py
[error] 1-1: Black formatting check failed. The hook reformatted 1 file. Run 'black' to fix code style issues in this file.
🪛 Ruff (0.14.7)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py
54-54: Consider moving this statement to an else block
(TRY300)
56-56: Do not catch blind exception: Exception
(BLE001)
57-57: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
133-133: Do not catch blind exception: Exception
(BLE001)
134-134: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
172-172: Consider moving this statement to an else block
(TRY300)
174-174: Do not catch blind exception: Exception
(BLE001)
175-175: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
225-225: Do not catch blind exception: Exception
(BLE001)
226-226: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
267-267: Consider moving this statement to an else block
(TRY300)
269-269: Do not catch blind exception: Exception
(BLE001)
270-270: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
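For context, the TRY300 findings above ask for the success-path return to sit in an else block so it is visibly outside the guarded code; roughly, as a sketch rather than the code under review:

```python
import subprocess


def run_nvidia_smi(args: list[str]) -> str | None:
    # Illustrative restructure only, not the PR's actual code.
    try:
        out = subprocess.run(
            ["nvidia-smi", *args], capture_output=True, text=True, check=True
        )
    except subprocess.CalledProcessError:
        return None
    else:
        # TRY300: returning from `else` makes it clear the except clause
        # only guards the subprocess call itself.
        return out.stdout.strip()
```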
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: sglang (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (3)
138-176: LGTM: PCI address retrieval logic is sound. The function correctly queries nvidia-smi for the PCI bus ID, handles empty output, and returns None on failure.
230-271: LGTM: Process listing logic is sound. The function correctly queries nvidia-smi for processes, parses PIDs line by line with validation, and returns an empty list on failure (a sketch of this pattern follows).
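A rough sketch of that line-by-line parsing, assuming the PIDs come from nvidia-smi's compute-apps query (the function name is hypothetical):

```python
import subprocess


def processes_on_gpu_sketch(gpu_id: int) -> list[int]:
    """Sketch: list compute PIDs on one GPU, skipping any line that is not an integer."""
    try:
        out = subprocess.run(
            ["nvidia-smi", f"--id={gpu_id}",
             "--query-compute-apps=pid", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except Exception:
        return []  # safe default on any failure, matching the reviewed behavior
    pids = []
    for line in out.stdout.splitlines():
        line = line.strip()
        if line.isdigit():  # validate before converting
            pids.append(int(line))
    return pids
```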
210-223: The CSV parsing approach is correct for nvidia-smi's format. nvidia-smi's --format=csv,noheader output is plain comma-separated without field quoting (unlike RFC 4180 CSV), and GPU model names do not contain commas in practice. The existing validation (len(parts) < 5) adequately handles unexpected output; a sketch of this parsing is shown below.
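As a sketch of the parsing that comment describes (the exact query fields and function name are assumptions, not taken from the PR diff):

```python
import subprocess


def gpu_info_sketch(gpu_id: int) -> dict | None:
    """Sketch: split nvidia-smi's unquoted CSV output and validate the field count."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_id}",
         "--query-gpu=index,name,pci.bus_id,memory.total,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    parts = [p.strip() for p in out.stdout.strip().split(",")]
    if len(parts) < 5:  # unexpected shape: bail out rather than mis-map fields
        return None
    return {
        "index": parts[0],
        "name": parts[1],
        "pci_address": parts[2],
        "memory_total": parts[3],
        "driver_version": parts[4],
    }
```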
tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1)

11-34: LGTM: Public API exports are correctly configured. The new GPU discovery functions are properly added to __all__ and imported from .gpu_discovery. The categorization with comments improves readability.
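For illustration, the re-export might look roughly like this (the function names come from the PR description; the grouping comment and layout are assumptions, not the actual file contents):

```python
# helpers/__init__.py -- illustrative sketch only

# GPU discovery utilities
from .gpu_discovery import (
    get_available_gpu_ids,
    get_gpu_id_for_process,
    get_gpu_info,
    get_gpu_pci_address,
    get_processes_on_gpu,
)

__all__ = [
    "get_available_gpu_ids",
    "get_gpu_id_for_process",
    "get_gpu_info",
    "get_gpu_pci_address",
    "get_processes_on_gpu",
]
```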
…ult injection tests Signed-off-by: tzulingk@nvidia.com <tzulingk@nvidia.com>
Force-pushed from 62380df to 9218156
Overview:
This PR introduces GPU discovery utilities for the hardware fault injection testing framework. The new utilities enable precise process-to-GPU mapping, which is essential for targeting specific GPUs during fault injection tests.
Details:
Added a new gpu_discovery.py module with the following key functions:

- get_available_gpu_ids(): Retrieves all GPU IDs available in a pod, correctly handling non-sequential GPU configurations (e.g., [0, 1, 3, 7])
- get_gpu_id_for_process(): Maps a process PID to the GPU it's using by querying nvidia-smi, with proper handling of CUDA_VISIBLE_DEVICES remapping
- get_gpu_pci_address(): Obtains the PCI bus address for a GPU, which is used in kernel XID messages to identify physical hardware
- get_gpu_info(): Retrieves comprehensive GPU information including name, PCI address, memory, and driver version
- get_processes_on_gpu(): Lists all process IDs running compute workloads on a specific GPU

The module includes comprehensive error handling, logging, and documentation. All functions are exported through the helpers package __init__.py for easy import. A hypothetical usage sketch is shown below.
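For example, a fault-injection test might combine the helpers roughly like this (the pod fixture, signatures, and test flow are hypothetical; only the helper names come from this PR):

```python
# Hypothetical usage inside a fault-injection test; signatures are assumed.
from helpers import (
    get_available_gpu_ids,
    get_gpu_id_for_process,
    get_gpu_pci_address,
)


def pick_fault_target(pod, target_pid: int) -> tuple[int, str | None]:
    """Return (gpu_id, pci_address) for the GPU that target_pid is running on."""
    gpu_ids = get_available_gpu_ids(pod)        # e.g. [0, 1, 3, 7] on non-sequential setups
    assert gpu_ids, "no GPUs visible in this pod"
    gpu_id = get_gpu_id_for_process(target_pid)
    pci = get_gpu_pci_address(gpu_id)           # PCI bus ID as it appears in kernel XID messages
    return gpu_id, pci
```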
Where should the reviewer start?
Start with tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py to review the core GPU discovery logic, particularly the get_gpu_id_for_process() function, which handles the critical process-to-GPU mapping.

Related Issues:
DIS-1124