
Conversation


@tzulingk tzulingk commented Dec 2, 2025

Overview:

This PR introduces GPU discovery utilities for the hardware fault injection testing framework. The new utilities enable precise process-to-GPU mapping, which is essential for targeting specific GPUs during fault injection tests.

Details:

Added a new gpu_discovery.py module with the following key functions:

  • get_available_gpu_ids(): Retrieves all GPU IDs available in a pod, correctly handling non-sequential GPU configurations (e.g., [0, 1, 3, 7])
  • get_gpu_id_for_process(): Maps a process PID to the GPU it's using by querying nvidia-smi, with proper handling of CUDA_VISIBLE_DEVICES remapping
  • get_gpu_pci_address(): Obtains the PCI bus address for a GPU, which is used in kernel XID messages to identify physical hardware
  • get_gpu_info(): Retrieves comprehensive GPU information including name, PCI address, memory, and driver version
  • get_processes_on_gpu(): Lists all process IDs running compute workloads on a specific GPU

The module includes comprehensive error handling, logging, and documentation. All functions are exported through the helpers package __init__.py for easy import.
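
For illustration, here is a minimal sketch of the kind of nvidia-smi query get_available_gpu_ids() wraps. It calls nvidia-smi locally via subprocess to stay self-contained; the real helper executes the command inside the target pod, and the exact query flags used in the module may differ.

```python
import subprocess

def sketch_get_available_gpu_ids() -> list[int]:
    # Ask nvidia-smi for one GPU index per output line.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Keep whatever indices the node exposes, which may be non-sequential,
    # e.g. [0, 1, 3, 7] on a pod with a partial GPU assignment.
    return sorted(int(line) for line in result.stdout.splitlines() if line.strip().isdigit())
```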

Where should the reviewer start?

Start with tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py to review the core GPU discovery logic, particularly:

  • The get_gpu_id_for_process() function, which handles the critical process-to-GPU mapping (see the sketch after this list)
  • Error handling and edge cases (no GPUs found, process not using GPU yet, etc.)
  • The logic for parsing nvidia-smi output across different formats
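
As a rough guide to what that process-to-GPU mapping involves, here is a hedged sketch (not the module's actual implementation): build a UUID-to-index map from `--query-gpu`, then look the PID up in `--query-compute-apps`. The real helper runs these commands inside the pod and handles additional edge cases.

```python
import subprocess

def _smi_lines(*args: str) -> list[str]:
    # In the real module this command is executed inside the pod, not locally.
    result = subprocess.run(
        ["nvidia-smi", *args], capture_output=True, text=True, check=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

def sketch_get_gpu_id_for_process(pid: int) -> int | None:
    # Map each GPU UUID to its index, then find the PID among compute apps.
    uuid_to_index: dict[str, int] = {}
    for line in _smi_lines("--query-gpu=index,uuid", "--format=csv,noheader"):
        index, uuid = (part.strip() for part in line.split(",", 1))
        uuid_to_index[uuid] = int(index)

    for line in _smi_lines("--query-compute-apps=pid,gpu_uuid", "--format=csv,noheader"):
        proc_pid, uuid = (part.strip() for part in line.split(",", 1))
        if proc_pid.isdigit() and int(proc_pid) == pid:
            return uuid_to_index.get(uuid)

    # None here means the process has not opened a CUDA context on any GPU yet.
    return None
```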

Related Issues:

DIS-1124

Summary by CodeRabbit

  • Tests
    • Added GPU discovery utilities to support fault tolerance testing, enabling queries for GPU availability, process-to-GPU mapping, PCI addresses, and per-GPU information in test environments.


@tzulingk tzulingk requested review from a team as code owners December 2, 2025 06:06
@github-actions github-actions bot added the feat label Dec 2, 2025
@tzulingk tzulingk requested a review from nv-oviya December 2, 2025 06:06
@tzulingk tzulingk enabled auto-merge (squash) December 2, 2025 06:06

coderabbitai bot commented Dec 2, 2025

Walkthrough

This PR introduces GPU discovery utilities for fault-tolerance testing by creating a new gpu_discovery.py module containing five functions to query GPU information within Kubernetes pods via nvidia-smi, and exposes these functions through the helpers package public API.

Changes

Cohort: GPU Discovery Utilities
File(s): tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py
Summary: New module introducing five GPU utility functions: get_available_gpu_ids() queries available GPUs, get_gpu_id_for_process() maps process PIDs to GPUs, get_gpu_pci_address() retrieves PCI addresses, get_gpu_info() fetches per-GPU details, and get_processes_on_gpu() lists process PIDs on a specific GPU. All functions execute nvidia-smi commands via pod exec and handle error scenarios with logging and safe defaults.

Cohort: Public API Expansion
File(s): tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py
Summary: Updated to expose the five GPU discovery functions (get_available_gpu_ids, get_gpu_id_for_process, get_gpu_pci_address, get_gpu_info, get_processes_on_gpu) through a new import block while preserving existing exports unchanged.
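
For reference, the export block described above would look roughly like the following sketch; the actual ordering and the pre-existing exports in __init__.py are not shown here.

```python
# helpers/__init__.py (sketch; existing exports omitted)
from .gpu_discovery import (
    get_available_gpu_ids,
    get_gpu_id_for_process,
    get_gpu_info,
    get_gpu_pci_address,
    get_processes_on_gpu,
)

__all__ = [
    # ... existing helper exports preserved ...
    "get_available_gpu_ids",
    "get_gpu_id_for_process",
    "get_gpu_info",
    "get_gpu_pci_address",
    "get_processes_on_gpu",
]
```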

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • nvidia-smi command parsing: Verify CSV output parsing logic correctly handles varied GPU configurations and non-sequential GPU IDs
  • Error handling and defaults: Ensure all functions gracefully handle empty outputs, command failures, and edge cases (e.g., process not found on GPU)
  • Pod exec integration: Confirm subprocess execution and error propagation work correctly within Kubernetes pod context
  • PID matching logic in get_gpu_id_for_process(): Validate the iteration and matching logic, especially with CUDA_VISIBLE_DEVICES remapping scenarios
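
To make the last point concrete, the snippet below illustrates the CUDA_VISIBLE_DEVICES remapping concern in isolation: nvidia-smi reports global GPU indices, while the application only sees the devices listed in the variable, renumbered from zero. This is an illustration of the pitfall, not code from the PR, and it ignores UUID-style entries in the variable.

```python
import os

def global_to_cuda_ordinal(global_gpu_id: int) -> int | None:
    # With CUDA_VISIBLE_DEVICES="2,3" the process sees devices 0 and 1,
    # while nvidia-smi still reports the global indices 2 and 3.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return global_gpu_id  # no remapping in effect
    visible_ids = [int(v) for v in visible.split(",") if v.strip().isdigit()]
    return visible_ids.index(global_gpu_id) if global_gpu_id in visible_ids else None
```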

Poem

🐰✨ A hop through GPU lands so bright,
Discovery utilities taking flight,
With nvidia-smi and pod commands keen,
We map the processes and GPU scenes,
Let testing faults hop on GPUs free! 🚀

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title 'feat: Add GPU discovery utilities' directly and concisely describes the main change in the PR—adding new GPU discovery utility functions for the fault injection testing framework.
  • Description check (✅ Passed): The PR description follows the template structure with all required sections completed: Overview explains the purpose, Details list the new functions and their capabilities, 'Where should the reviewer start' provides guidance, and a related issue is referenced.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (3)

56-58: Use logging.exception for better debugging.

Replace logging.error with logging.exception to automatically include the full traceback. This pattern applies to all exception handlers in this module (lines 57, 134, 175, 226, 270).

Apply this diff:

     except Exception as e:
-        logger.error(f"Failed to get GPU IDs from pod {pod.name}: {e}")
+        logger.exception(f"Failed to get GPU IDs from pod {pod.name}: {e}")
         return []

Repeat the same change for exception handlers in:

  • get_gpu_id_for_process (line 134)
  • get_gpu_pci_address (line 175)
  • get_gpu_info (line 226)
  • get_processes_on_gpu (line 270)

61-135: Clarify the return value semantics in edge cases.

The function returns 0 when no GPUs exist (line 89) or on exception (line 135), but returns gpu_ids[0] when the process isn't found (line 131). This creates ambiguity: does 0 mean "GPU 0" or "error/not found"?

Consider documenting this behavior more explicitly in the docstring, especially since callers might not distinguish between "no GPUs available" and "process is on GPU 0".

For example, update the Returns section:

     Returns:
-        GPU ID (0-N) where the process is running, or 0 if not found
+        GPU ID (0-N) where the process is running. Returns 0 if no GPUs exist or 
+        on error. Returns the first available GPU if the process isn't found on 
+        any GPU (process may not have initialized CUDA yet).

56-56: Consider catching more specific exceptions.

All functions catch broad Exception, which is flagged by static analysis. While this provides robustness for a testing utility (returning safe defaults on any failure), consider catching more specific exceptions like subprocess.CalledProcessError, AttributeError, or ValueError to better distinguish between different failure modes.

However, given this is a fault-tolerance testing utility where robustness is critical, the current approach is acceptable if paired with logging.exception (already suggested) for full tracebacks.

Also applies to: 133-133, 174-174, 225-225, 269-269
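
As a sketch of what the narrower handlers suggested above might look like (the exact exception set depends on how the module shells out to nvidia-smi inside the pod, so treat the tuple below as an assumption):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def sketch_get_available_gpu_ids() -> list[int]:
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return [int(line) for line in result.stdout.splitlines() if line.strip().isdigit()]
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        # logging.exception records the traceback that logging.error would omit.
        logger.exception("Failed to get GPU IDs")
        return []
```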

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb4432e and 93efd7f.

📒 Files selected for processing (2)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1 hunks)
  • tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4695/merge) by tzulingk.
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py

[error] 1-1: Black formatting check failed. The hook reformatted 1 file. Run 'black' to fix code style issues in this file.

🪛 Ruff (0.14.7)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py

  • 54-54: Consider moving this statement to an else block (TRY300)
  • 56-56: Do not catch blind exception: Exception (BLE001)
  • 57-57: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 133-133: Do not catch blind exception: Exception (BLE001)
  • 134-134: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 172-172: Consider moving this statement to an else block (TRY300)
  • 174-174: Do not catch blind exception: Exception (BLE001)
  • 175-175: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 225-225: Do not catch blind exception: Exception (BLE001)
  • 226-226: Use logging.exception instead of logging.error; replace with exception (TRY400)
  • 267-267: Consider moving this statement to an else block (TRY300)
  • 269-269: Do not catch blind exception: Exception (BLE001)
  • 270-270: Use logging.exception instead of logging.error; replace with exception (TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/fault_tolerance/hardware/fault-injection-service/helpers/gpu_discovery.py (3)

138-176: LGTM: PCI address retrieval logic is sound.

The function correctly queries nvidia-smi for the PCI bus ID, handles empty output, and returns None on failure.


230-271: LGTM: Process listing logic is sound.

The function correctly queries nvidia-smi for processes, parses PIDs line by line with validation, and returns an empty list on failure.


210-223: The CSV parsing approach is correct for nvidia-smi's format. nvidia-smi's --format=csv,noheader output is plain comma-separated without field quoting (unlike RFC 4180 CSV), and GPU model names do not contain commas in practice. The existing validation (len(parts) < 5) adequately handles unexpected output.
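
A tiny example of what that unquoted CSV parsing looks like (the sample line and the five-field layout are hypothetical; the real query fields in get_gpu_info() may differ):

```python
# Hypothetical --format=csv,noheader line; plain comma-splitting is enough
# because nvidia-smi does not quote fields and GPU names contain no commas.
line = "0, NVIDIA H100 80GB HBM3, 00000000:1B:00.0, 81559 MiB, 535.129.03"
parts = [part.strip() for part in line.split(",")]
if len(parts) < 5:
    raise ValueError(f"Unexpected nvidia-smi output: {line!r}")
index, name, pci_address, memory_total, driver_version = parts
```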

tests/fault_tolerance/hardware/fault-injection-service/helpers/__init__.py (1)

11-34: LGTM: Public API exports are correctly configured.

The new GPU discovery functions are properly added to __all__ and imported from .gpu_discovery. The categorization with comments improves readability.

…ult injection tests

Signed-off-by: tzulingk@nvidia.com <tzulingk@nvidia.com>
