Rewrite hadoop-ai regex to match gpu info #2681
@@ -16,7 +16,7 @@ index 8801b4a940f..30d33086516 100644
*/
Pattern GPU_INFO_FORMAT =
- Pattern.compile("\\s+([0-9]{1,2})\\s+[\\s\\S]*\\s+(0|1|N/A|Off)\\s+");
+ Pattern.compile("\\s+([0-9]{1,2})\\s+[\\s\\S]*\\s+(\\d+|N/A|Off)\\s+");
+ Pattern.compile("[|]\\s+([0-9]{1,2})[^|]*[|][^|]*[|]\\s+(\\d+|N/A|Off)\\s+[|]");
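To see what the rewritten pattern changes in practice, the sketch below runs the old and the new GPU_INFO_FORMAT expressions against a single table row. The row is an illustrative line in the style of the default nvidia-smi table (an assumed layout; real column contents vary by driver and GPU), and the class GpuInfoRegexDemo and its parse helper are hypothetical names used only for this demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo comparing the old and new GPU_INFO_FORMAT patterns. */
public class GpuInfoRegexDemo {
    // Pattern before the change: unanchored, with a greedy [\s\S]* in the middle.
    static final Pattern OLD_FORMAT =
        Pattern.compile("\\s+([0-9]{1,2})\\s+[\\s\\S]*\\s+(0|1|N/A|Off)\\s+");
    // Pattern after the change: tied to the '|' column separators, so group 2
    // is read from the column after the second separator.
    static final Pattern NEW_FORMAT =
        Pattern.compile("[|]\\s+([0-9]{1,2})[^|]*[|][^|]*[|]\\s+(\\d+|N/A|Off)\\s+[|]");

    /** Returns {gpuIndex, eccField} for the first match, or null if nothing matches. */
    static String[] parse(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        // Illustrative row only; the exact layout depends on the nvidia-smi version.
        String row = "|   0  Tesla V100-SXM2...  On   | 00000000:34:00.0 Off |                   12 |";
        String[] oldResult = parse(OLD_FORMAT, row);
        String[] newResult = parse(NEW_FORMAT, row);
        System.out.println("old: gpu=" + oldResult[0] + " ecc=" + oldResult[1]);
        System.out.println("new: gpu=" + newResult[0] + " ecc=" + newResult[1]);
    }
}
```

With this row the program prints "old: gpu=0 ecc=Off" and then "new: gpu=0 ecc=12". The old alternation (0|1|N/A|Off) cannot match the two-digit count 12, so backtracking settles on the earlier " Off " token in the second column; the anchored pattern reads the intended column.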
Please parse the structured output instead, e.g.
- for GPU memory, use
  nvidia-smi -q -d MEMORY
  Attached GPUs : 16
  GPU 00000000:34:00.0
      FB Memory Usage
          Total : 32480 MiB
          Used : 0 MiB
          Free : 32480 MiB
      BAR1 Memory Usage
          Total : 32768 MiB
          Used : 2 MiB
          Free : 32766 MiB
- for GPU ECC, use
  nvidia-smi -q -d ECC
  Attached GPUs : 16
  GPU 00000000:34:00.0
      Ecc Mode
          Current : Enabled
          Pending : Enabled
      ECC Errors
          Volatile
              Single Bit
                  Device Memory : 0
                  Register File : 0
                  L1 Cache : 0
                  L2 Cache : 0
                  Texture Memory : N/A
                  Texture Shared : N/A
                  CBU : N/A
                  Total : 0
              Double Bit
                  Device Memory : 0
                  Register File : 0
                  L1 Cache : 0
                  L2 Cache : 0
                  Texture Memory : N/A
                  Texture Shared : N/A
                  CBU : 0
                  Total : 0
          Aggregate
              Single Bit
                  Device Memory : 0
                  Register File : 0
                  L1 Cache : 0
                  L2 Cache : 0
                  Texture Memory : N/A
                  Texture Shared : N/A
                  CBU : N/A
                  Total : 0
              Double Bit
                  Device Memory : 0
                  Register File : 0
                  L1 Cache : 0
                  L2 Cache : 0
                  Texture Memory : N/A
                  Texture Shared : N/A
                  CBU : 0
                  Total : 0
Or use the XML output. Otherwise, the changes are useless.
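The -q text output can be read as a small tree: a line without a value opens a section, and a "Key : value" line belongs to the innermost open section. A minimal sketch of that idea follows, assuming Java 11+, that each nesting level is indented further than its parent (as nvidia-smi does in its -q output), and that keys are separated from values by whitespace, a colon and a space. NvidiaSmiQueryParser and the slash-joined key format are hypothetical choices for illustration, not part of hadoop-ai.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the indented "Key : value" text printed by nvidia-smi -q. */
public class NvidiaSmiQueryParser {
    // "Key<spaces>: value". A line such as "GPU 00000000:34:00.0" has no
    // whitespace before its colons, so it is treated as a section header.
    private static final Pattern KEY_VALUE = Pattern.compile("^(.+?)\\s+:\\s*(.*)$");

    /** Flattens the output into "Section/Subsection/Key" -> value using indentation. */
    public static Map<String, String> parse(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Deque<String> path = new ArrayDeque<>();     // open section names, outermost first
        Deque<Integer> indents = new ArrayDeque<>(); // indentation of each open section
        for (String raw : text.split("\\R")) {
            if (raw.isBlank()) {
                continue;
            }
            int indent = raw.length() - raw.stripLeading().length();
            String line = raw.strip();
            // Close every section indented at least as deeply as this line.
            while (!indents.isEmpty() && indents.peekLast() >= indent) {
                indents.removeLast();
                path.removeLast();
            }
            Matcher m = KEY_VALUE.matcher(line);
            if (m.matches()) {
                String prefix = path.isEmpty() ? "" : String.join("/", path) + "/";
                result.put(prefix + m.group(1), m.group(2));
            } else {
                path.addLast(line); // a header such as "FB Memory Usage"
                indents.addLast(indent);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Excerpt in the shape of nvidia-smi -q -d ECC output (4 spaces per level assumed).
        String sample = String.join("\n",
            "Attached GPUs : 16",
            "GPU 00000000:34:00.0",
            "    Ecc Mode",
            "        Current : Enabled",
            "    ECC Errors",
            "        Volatile",
            "            Double Bit",
            "                Device Memory : 0",
            "                Total : 0");
        Map<String, String> info = parse(sample);
        System.out.println(info.get("GPU 00000000:34:00.0/ECC Errors/Volatile/Double Bit/Total"));
    }
}
```

Keying each value by its full section path keeps the many "Total" fields (FB, BAR1, and every ECC bucket) apart, which a single line-oriented regex cannot do reliably.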
This is a mitigation; we could create an issue for the TODO items and evaluate their priority.
Another question: we haven't found a detailed explanation of the structured output; there is a similar issue in #2534.
You can find the output details in the nvidia-smi docs.
nvidia-smi (the NVIDIA System Management Interface) is a command-line tool built on the NVIDIA Management Library (NVML), and NVML also has Python bindings, which are backwards compatible with earlier NVML versions. It's better to query the status through the NVML API.
TODO:
Acquire structured GPU info via
nvidia-smi -q -x
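A sketch of this TODO, assuming a JDK 9+ runtime and nvidia-smi on the PATH: it runs nvidia-smi with -q together with -x (the full query rendered as XML) and reads each GPU's FB memory total with the JDK's DOM parser. The element names nvidia_smi_log, gpu (with an id attribute), fb_memory_usage and total are what nvidia-smi's XML uses as far as we know, so check them against the output of your driver version; NvidiaSmiXml, run and fbMemoryTotal are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: query nvidia-smi in XML mode and read FB memory totals. */
public class NvidiaSmiXml {
    // Full query (-q) rendered as XML (-x).
    static final List<String> COMMAND = List.of("nvidia-smi", "-q", "-x");

    /** Runs the command and returns its stdout; stderr is passed through to ours. */
    static byte[] run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        byte[] out;
        try (InputStream in = p.getInputStream()) {
            out = in.readAllBytes(); // blocks until the process closes stdout
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException(command + " exited with status " + exit);
        }
        return out;
    }

    /** Maps each GPU's id attribute to its fb_memory_usage/total text, e.g. "32480 MiB". */
    static Map<String, String> fbMemoryTotal(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // nvidia-smi's XML may declare an external DTD; don't try to load it.
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> totals = new LinkedHashMap<>();
            NodeList gpus = doc.getElementsByTagName("gpu");
            for (int i = 0; i < gpus.getLength(); i++) {
                Element gpu = (Element) gpus.item(i);
                NodeList fb = gpu.getElementsByTagName("fb_memory_usage");
                if (fb.getLength() == 0) {
                    continue; // element missing in this output: skip the GPU
                }
                NodeList total = ((Element) fb.item(0)).getElementsByTagName("total");
                if (total.getLength() > 0) {
                    totals.put(gpu.getAttribute("id"), total.item(0).getTextContent().trim());
                }
            }
            return totals;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse nvidia-smi XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = new String(run(COMMAND), StandardCharsets.UTF_8);
        fbMemoryTotal(xml).forEach((id, total) ->
                System.out.println(id + " FB memory total: " + total));
    }
}
```

Parsing named elements rather than column positions means a driver that adds or reorders fields does not silently shift the values being read, which is the failure mode the regex approach is exposed to.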