This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Rewrite hadoop-ai regex to match gpu info #2681

Merged on May 7, 2019 (2 commits)

2 changes: 1 addition & 1 deletion src/hadoop-ai/build/build-pre.sh
@@ -22,7 +22,7 @@ pushd $(dirname "$0") > /dev/null
hadoopBinaryDir="/hadoop-binary"

# When changing the patch id, please update it.
-patchId="12940533-12933562-docker_executor-12944563-fix1-20190410"
+patchId="12940533-12933562-docker_executor-12944563-fix1-20190426"

hadoopBinaryPath="${hadoopBinaryDir}/hadoop-2.9.0.tar.gz"
cacheVersion="${hadoopBinaryDir}/${patchId}-done"
4 changes: 2 additions & 2 deletions src/hadoop-ai/build/hadoop-ai-fix.patch
@@ -16,7 +16,7 @@ index 8801b4a940f..30d33086516 100644
*/
Pattern GPU_INFO_FORMAT =
- Pattern.compile("\\s+([0-9]{1,2})\\s+[\\s\\S]*\\s+(0|1|N/A|Off)\\s+");
-+ Pattern.compile("\\s+([0-9]{1,2})\\s+[\\s\\S]*\\s+(\\d+|N/A|Off)\\s+");
++ Pattern.compile("[|]\\s+([0-9]{1,2})[^|]*[|][^|]*[|]\\s+(\\d+|N/A|Off)\\s+[|]");
Member

Please parse the structured output instead, e.g.

  • for gpu memory, use nvidia-smi -q -d MEMORY
Attached GPUs                       : 16
GPU 00000000:34:00.0
    FB Memory Usage
        Total                       : 32480 MiB
        Used                        : 0 MiB
        Free                        : 32480 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 2 MiB
        Free                        : 32766 MiB
  • for gpu ecc, use nvidia-smi -q -d ECC
Attached GPUs                       : 16
GPU 00000000:34:00.0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : 0
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : 0
                Total               : 0

Or use the XML output (nvidia-smi -q -x). Otherwise, the changes are useless.
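To illustrate the suggestion, here is a minimal Java sketch that reads the "FB Memory Usage" block of nvidia-smi -q -d MEMORY as key/value lines instead of matching the table layout. The class name NvidiaSmiMemoryParser and its parse method are hypothetical and not part of hadoop-ai, and the sketch assumes the output keeps the shape quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the hadoop-ai patch.
final class NvidiaSmiMemoryParser {
    // A line such as "GPU 00000000:34:00.0" starts a new device section.
    private static final Pattern GPU_HEADER = Pattern.compile("^GPU\\s+(\\S+)\\s*$");
    // Key/value lines such as "Total : 32480 MiB".
    private static final Pattern MIB_VALUE =
        Pattern.compile("^\\s*(Total|Used|Free)\\s*:\\s*(\\d+)\\s*MiB\\s*$");

    /** Returns bus id -> {Total, Used, Free} in MiB, read from "FB Memory Usage" only. */
    static Map<String, Map<String, Long>> parse(String output) {
        Map<String, Map<String, Long>> result = new LinkedHashMap<>();
        Map<String, Long> current = null;
        boolean inFbSection = false;
        for (String line : output.split("\\r?\\n")) {
            Matcher header = GPU_HEADER.matcher(line);
            if (header.matches()) {
                current = new LinkedHashMap<>();
                result.put(header.group(1), current);
                inFbSection = false;
                continue;
            }
            String trimmed = line.trim();
            if (trimmed.endsWith("Memory Usage")) {
                // Track which sub-block we are in, so BAR1 values are skipped.
                inFbSection = trimmed.equals("FB Memory Usage");
                continue;
            }
            Matcher value = MIB_VALUE.matcher(line);
            if (current != null && inFbSection && value.matches()) {
                current.put(value.group(1), Long.parseLong(value.group(2)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample taken from the output quoted in the comment above.
        String sample = String.join("\n",
            "Attached GPUs                       : 16",
            "GPU 00000000:34:00.0",
            "    FB Memory Usage",
            "        Total                       : 32480 MiB",
            "        Used                        : 0 MiB",
            "        Free                        : 32480 MiB",
            "    BAR1 Memory Usage",
            "        Total                       : 32768 MiB",
            "        Used                        : 2 MiB",
            "        Free                        : 32766 MiB");
        System.out.println(parse(sample));
    }
}
```

Keying on the section and field names means the parse does not depend on column positions or on the "|" separators of the default table view.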

Member Author

This is a mitigation; we could create an issue for the to-do items and evaluate their priority.

Member Author

Another question: we haven't found a detailed explanation of the structured output. There is a similar issue in #2534.

Member

You can find the output details in the nvidia-smi docs.

nvidia-smi (the NVIDIA System Management Interface) is built on top of the NVIDIA Management Library (NVML), and there are also Python bindings for NVML. Unlike the nvidia-smi output, NVML and its Python bindings are backwards compatible, so it's better to query the status through the NVML API.

Pattern GPU_MEM_FORMAT =
Pattern.compile("([0-9]+)MiB\\s*/\\s*([0-9]+)MiB");
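As a quick check of what the two patterns in this hunk capture, here is a small Java sketch that applies them line by line with find(). How the patched Hadoop code actually feeds nvidia-smi output to these patterns is not visible in this diff, so the per-line use is an assumption. The class GpuRowRegexDemo and the sample rows are hypothetical: the rows are hand-written in the shape of the default nvidia-smi table, not captured output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the two pattern strings come from the patch.
final class GpuRowRegexDemo {
    static final Pattern GPU_INFO_FORMAT =
        Pattern.compile("[|]\\s+([0-9]{1,2})[^|]*[|][^|]*[|]\\s+(\\d+|N/A|Off)\\s+[|]");
    static final Pattern GPU_MEM_FORMAT =
        Pattern.compile("([0-9]+)MiB\\s*/\\s*([0-9]+)MiB");

    /** Returns {gpuIndex, eccValue} for a device row, or null for any other line. */
    static String[] parseInfo(String line) {
        Matcher m = GPU_INFO_FORMAT.matcher(line);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    /** Returns {usedMiB, totalMiB}, or null when the line has no memory column. */
    static long[] parseMemory(String line) {
        Matcher m = GPU_MEM_FORMAT.matcher(line);
        return m.find()
            ? new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))}
            : null;
    }

    public static void main(String[] args) {
        // Illustrative rows: a device row with an ECC count of 2, then its status row.
        String deviceRow =
            "|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    2 |";
        String statusRow =
            "| N/A   38C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |";

        String[] info = parseInfo(deviceRow);
        System.out.println("index=" + info[0] + " ecc=" + info[1]);
        System.out.println("status row is a device row: " + (parseInfo(statusRow) != null));

        long[] mem = parseMemory(statusRow);
        System.out.println("used=" + mem[0] + "MiB total=" + mem[1] + "MiB");
    }
}
```

Anchoring each capture between "|" separators ties the ECC group to the third column of the device row, so the "Off" values in the first two columns are not picked up as the ECC value.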

@@ -36,7 +36,7 @@ index 8801b4a940f..30d33086516 100644
gpuAttributeCapacity |= (1L << index);
} else {
- LOG.error("ignored error: gpu " + index + " ECC code is 1, will make this gpu unavailable");
-+ LOG.error("ignored error: gpu " + index + " ECC code is " + mat.group(2) + ", will make this gpu unavailable");
++ LOG.error("GPU error: gpu " + index + " ECC code is " + mat.group(2) + ", will make this gpu unavailable");
}
}
}