
Conversation


@tzulingk tzulingk commented Dec 2, 2025

Overview:

This PR introduces GPU discovery utilities for the hardware fault injection testing framework. The new utilities enable precise process-to-GPU mapping, which is essential for targeting specific GPUs during fault injection tests.

Details:

Added a new gpu_discovery.py module with the following key functions:

  • get_available_gpu_ids(): Retrieves all GPU IDs available in a pod, correctly handling non-sequential GPU configurations (e.g., [0, 1, 3, 7])
  • get_gpu_id_for_process(): Maps a process PID to the GPU it's using by querying nvidia-smi, with proper handling of CUDA_VISIBLE_DEVICES remapping
  • get_gpu_pci_address(): Obtains the PCI bus address for a GPU, which is used in kernel XID messages to identify physical hardware
  • get_gpu_info(): Retrieves comprehensive GPU information including name, PCI address, memory, and driver version
  • get_processes_on_gpu(): Lists all process IDs running compute workloads on a specific GPU

The module includes comprehensive error handling, logging, and documentation. All functions are exported through the helpers package __init__.py for easy import.
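
For illustration, here is a minimal sketch of the kind of nvidia-smi query get_available_gpu_ids() wraps. It calls nvidia-smi locally via subprocess to stay self-contained; the real helper executes the command inside the target pod, and the exact query flags used in the module may differ.

```python
import subprocess

def sketch_get_available_gpu_ids() -> list[int]:
    # Ask nvidia-smi for one GPU index per output line.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Keep whatever indices the node exposes, which may be non-sequential,
    # e.g. [0, 1, 3, 7] on a pod with a partial GPU assignment.
    return sorted(int(line) for line in result.stdout.splitlines() if line.strip().isdigit())
```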

Where should the reviewer start?

Start with tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py to review the core GPU discovery logic, particularly:

  • The get_gpu_id_for_process() function, which handles the critical process-to-GPU mapping (see the sketch after this list)
  • Error handling and edge cases (no GPUs found, process not using GPU yet, etc.)
  • The logic for parsing nvidia-smi output across different formats
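
As a rough guide to what that process-to-GPU mapping involves, here is a hedged sketch (not the module's actual implementation): build a UUID-to-index map from `--query-gpu`, then look the PID up in `--query-compute-apps`. The real helper runs these commands inside the pod and handles additional edge cases.

```python
import subprocess

def _smi_lines(*args: str) -> list[str]:
    # In the real module this command is executed inside the pod, not locally.
    result = subprocess.run(
        ["nvidia-smi", *args], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

def sketch_get_gpu_id_for_process(pid: int) -> int | None:
    # Map each GPU UUID to its index, then find the PID among compute apps.
    uuid_to_index: dict[str, int] = {}
    for line in _smi_lines("--query-gpu=index,uuid", "--format=csv,noheader"):
        index, uuid = (part.strip() for part in line.split(",", 1))
        uuid_to_index[uuid] = int(index)

    for line in _smi_lines("--query-compute-apps=pid,gpu_uuid", "--format=csv,noheader"):
        proc_pid, uuid = (part.strip() for part in line.split(",", 1))
        if proc_pid.isdigit() and int(proc_pid) == pid:
            return uuid_to_index.get(uuid)

    # None here means the process has not opened a CUDA context on any GPU yet.
    return None
```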

Related Issues:

DIS-1124

Summary by CodeRabbit

  • Tests
    • Added GPU discovery utilities to support fault tolerance testing, enabling queries for GPU availability, process-to-GPU mapping, PCI addresses, and per-GPU information in test environments.


@tzulingk tzulingk requested review from a team as code owners December 2, 2025 06:06
@github-actions github-actions bot added the feat label Dec 2, 2025
@tzulingk tzulingk requested a review from nv-oviya December 2, 2025 06:06
@tzulingk tzulingk enabled auto-merge (squash) December 2, 2025 06:06

coderabbitai bot commented Dec 2, 2025

Walkthrough

This PR introduces GPU discovery utilities for fault-tolerance testing by creating a new gpu_discovery.py module containing five functions to query GPU information within Kubernetes pods via nvidia-smi, and exposes these functions through the helpers package public API.

Changes

Cohort: GPU Discovery Utilities
File(s): tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py
Summary: New module introducing five GPU utility functions: get_available_gpu_ids() queries available GPUs, get_gpu_id_for_process() maps process PIDs to GPUs, get_gpu_pci_address() retrieves PCI addresses, get_gpu_info() fetches per-GPU details, and get_processes_on_gpu() lists process PIDs on a specific GPU. All functions execute nvidia-smi commands via pod exec and handle error scenarios with logging and safe defaults.

Cohort: Public API Expansion
File(s): tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py
Summary: Updated to expose the five GPU discovery functions (get_available_gpu_ids, get_gpu_id_for_process, get_gpu_pci_address, get_gpu_info, get_processes_on_gpu) through a new import block while preserving existing exports unchanged.
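
For reference, the export block described above would look roughly like the following sketch; the actual ordering and the pre-existing exports in __init__.py are not shown here.

```python
# helpers/__init__.py (sketch; existing exports omitted)
from .gpu_discovery import (
    get_available_gpu_ids,
    get_gpu_id_for_process,
    get_gpu_info,
    get_gpu_pci_address,
    get_processes_on_gpu,
)

__all__ = [
    # ... existing helper exports preserved ...
    "get_available_gpu_ids",
    "get_gpu_id_for_process",
    "get_gpu_info",
    "get_gpu_pci_address",
    "get_processes_on_gpu",
]
```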

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • nvidia-smi command parsing: Verify CSV output parsing logic correctly handles varied GPU configurations and non-sequential GPU IDs
  • Error handling and defaults: Ensure all functions gracefully handle empty outputs, command failures, and edge cases (e.g., process not found on GPU)
  • Pod exec integration: Confirm subprocess execution and error propagation work correctly within Kubernetes pod context
  • PID matching logic in get_gpu_id_for_process(): Validate the iteration and matching logic, especially with CUDA_VISIBLE_DEVICES remapping scenarios
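
To make the last point concrete, the snippet below illustrates the CUDA_VISIBLE_DEVICES remapping concern in isolation: nvidia-smi reports global GPU indices, while the application only sees the devices listed in the variable, renumbered from zero. This is an illustration of the pitfall, not code from the PR, and it ignores UUID-style entries in the variable.

```python
import os

def global_to_cuda_ordinal(global_gpu_id: int) -> int | None:
    # With CUDA_VISIBLE_DEVICES="2,3" the process sees devices 0 and 1,
    # while nvidia-smi still reports the global indices 2 and 3.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return global_gpu_id  # no remapping in effect
    visible_ids = [int(v) for v in visible.split(",") if v.strip().isdigit()]
    return visible_ids.index(global_gpu_id) if global_gpu_id in visible_ids else None
```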

Poem

🐰✨ A hop through GPU lands so bright,
Discovery utilities taking flight,
With nvidia-smi and pod commands keen,
We map the processes and GPU scenes,
Let testing faults hop on GPUs free! 🚀

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title 'feat: Add GPU discovery utilities' directly and concisely describes the main change in the PR—adding new GPU discovery utility functions for the fault injection testing framework.
  • Description check (✅ Passed): The PR description follows the template structure with all required sections completed: Overview explains the purpose, Details list the new functions and their capabilities, 'Where should the reviewer start' provides guidance, and a related issue is referenced.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (3)

56-58: Use logging.exception for better debugging.

Replace logging.error with logging.exception to automatically include the full traceback. This pattern applies to all exception handlers in this module (lines 57, 134, 175, 226, 270).

Apply this diff:

     except Exception as e:
-        logger.error(f"Failed to get GPU IDs from pod {pod.name}: {e}")
+        logger.exception(f"Failed to get GPU IDs from pod {pod.name}: {e}")
         return []

Repeat the same change for exception handlers in:

  • get_gpu_id_for_process (line 134)
  • get_gpu_pci_address (line 175)
  • get_gpu_info (line 226)
  • get_processes_on_gpu (line 270)

61-135: Clarify the return value semantics in edge cases.

The function returns 0 when no GPUs exist (line 89) or on exception (line 135), but returns gpu_ids[0] when the process isn't found (line 131). This creates ambiguity: does 0 mean "GPU 0" or "error/not found"?

Consider documenting this behavior more explicitly in the docstring, especially since callers might not distinguish between "no GPUs available" and "process is on GPU 0".

For example, update the Returns section:

     Returns:
-        GPU ID (0-N) where the process is running, or 0 if not found
+        GPU ID (0-N) where the process is running. Returns 0 if no GPUs exist or 
+        on error. Returns the first available GPU if the process isn't found on 
+        any GPU (process may not have initialized CUDA yet).

56-56: Consider catching more specific exceptions.

All functions catch broad Exception, which is flagged by static analysis. While this provides robustness for a testing utility (returning safe defaults on any failure), consider catching more specific exceptions like subprocess.CalledProcessError, AttributeError, or ValueError to better distinguish between different failure modes.

However, given this is a fault-tolerance testing utility where robustness is critical, the current approach is acceptable if paired with logging.exception (already suggested) for full tracebacks.

Also applies to: 133-133, 174-174, 225-225, 269-269
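
As a sketch of what the narrower handlers suggested above might look like (the exact exception set depends on how the module shells out to nvidia-smi inside the pod, so treat the tuple below as an assumption):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def sketch_get_available_gpu_ids() -> list[int]:
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return [int(line) for line in result.stdout.splitlines() if line.strip().isdigit()]
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        # logging.exception records the traceback that logging.error would omit.
        logger.exception("Failed to get GPU IDs")
        return []
```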

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb4432e and 93efd7f.

📒 Files selected for processing (2)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4695/merge) by tzulingk.
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py

[error] 1-1: Black formatting check failed. The hook reformatted 1 file. Run 'black' to fix code style issues in this file.

🪛 Ruff (0.14.7)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py

  • 54-54: Consider moving this statement to an else block (TRY300)
  • 56-56: Do not catch blind exception: Exception (BLE001)
  • 57-57: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 133-133: Do not catch blind exception: Exception (BLE001)
  • 134-134: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 172-172: Consider moving this statement to an else block (TRY300)
  • 174-174: Do not catch blind exception: Exception (BLE001)
  • 175-175: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 225-225: Do not catch blind exception: Exception (BLE001)
  • 226-226: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 267-267: Consider moving this statement to an else block (TRY300)
  • 269-269: Do not catch blind exception: Exception (BLE001)
  • 270-270: Use logging.exception instead of logging.error; replace with exception (TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (3)

138-176: LGTM: PCI address retrieval logic is sound.

The function correctly queries nvidia-smi for the PCI bus ID, handles empty output, and returns None on failure.


230-271: LGTM: Process listing logic is sound.

The function correctly queries nvidia-smi for processes, parses PIDs line by line with validation, and returns an empty list on failure.


210-223: The CSV parsing approach is correct for nvidia-smi's format. nvidia-smi's --format=csv,noheader output is plain comma-separated without field quoting (unlike RFC 4180 CSV), and GPU model names do not contain commas in practice. The existing validation (len(parts) < 5) adequately handles unexpected output.
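
A tiny example of what that unquoted CSV parsing looks like (the sample line and the five-field layout are hypothetical; the real query fields in get_gpu_info() may differ):

```python
# Hypothetical --format=csv,noheader line; plain comma-splitting is enough
# because nvidia-smi does not quote fields and GPU names contain no commas.
line = "0, NVIDIA H100 80GB HBM3, 00000000:1B:00.0, 81559 MiB, 535.129.03"
parts = [part.strip() for part in line.split(",")]
if len(parts) < 5:
    raise ValueError(f"Unexpected nvidia-smi output: {line!r}")
index, name, pci_address, memory_total, driver_version = parts
```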

tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1)

11-34: LGTM: Public API exports are correctly configured.

The new GPU discovery functions are properly added to __all__ and imported from .gpu_discovery. The categorization with comments improves readability.

…ult injection tests

Signed-off-by: tzulingk@nvidia.com <tzulingk@nvidia.com>
