2 changes: 1 addition & 1 deletion .gitmodules
@@ -13,7 +13,7 @@
[submodule "sw/nic/third-party/abseil-cpp"]
path = sw/nic/third-party/abseil-cpp
url = git@github.com:abseil/abseil-cpp.git
branch = 20250512.1
branch = master
[submodule "sw/nic/third-party/boost"]
path = sw/nic/third-party/boost_1_88_0
url = git@github.com:boostorg/boost.git
79 changes: 52 additions & 27 deletions README.md
@@ -1,58 +1,83 @@
GPU Agent provides programmable APIs to configure and monitor AMD Instinct GPUs
# GPU Agent provides programmable APIs to configure and monitor AMD Instinct GPUs

To build GPU Agent, follow the steps below:
## To build GPU Agent, follow the steps below:

1. setup workspace (required once)
### setup workspace (required once)

```
# git submodule update --init --recursive -f
```bash
$ git submodule update --init --recursive -f
```

2. create build container image (required once)
### create build container image (required once)

```
# make build-container
```bash
$ make build-container
```

3. vendor setup workspace (required once)
### Building artifacts

Follow either of the two methods below to build gpuagent and gpuctl binaries

#### Manual Steps

vendor setup workspace for manual building (required once)

- choose build/developer environment
- rhel9
- rhel9
```bash
$ GPUAGENT_BLD_CONTAINER_IMAGE=gpuagent-builder-rhel:9 make docker-shell
[user@host]# GPUAGENT_BLD_CONTAINER_IMAGE=gpuagent-builder-rhel:9 make docker-shell
```

- ubuntu 22.04
- ubuntu 22.04
```bash
$ GPUAGENT_BLD_CONTAINER_IMAGE=gpuagent-bldr-ubuntu:22.04 make docker-shell
[user@host]# GPUAGENT_BLD_CONTAINER_IMAGE=gpuagent-bldr-ubuntu:22.04 make docker-shell
```

- golang dependency setup (required once)
```bash
[user@build-container ]# make gopkglist
```
- golang vendor setup
```bash
[root@dev gpu-agent]# cd sw/nic/gpuagent/
[root@dev gpuagent]# go mod vendor

```

4. building artifacts
- choose build base os
- rhel9
```bash
# GPUAGENT_BLD_CONTAINER_IMAGE=gpuagent-builder-rhel:9 make gpuagent
[user@build-container ]# cd sw/nic/gpuagent/
[user@build-container ]# go mod vendor
```

- ubuntu 22.04
- build gpuagent (within build-container)
```bash
$ GPUAGENT_BLD_CONTAINER_IMAGE=gpuagent-bldr-ubuntu:22.04 make docker-shell
[user@build-container ]# make
```

5. artifacts location
#### Full target build in single step (from host)

Choose build base os

- rhel9
```bash
[user@host]# GPUAGENT_BLD_CONTAINER_IMAGE=gpuagent-builder-rhel:9 make gpuagent
```

- ubuntu 22.04
```bash
[user@host]# GPUAGENT_BLD_CONTAINER_IMAGE=gpuagent-bldr-ubuntu:22.04 make gpuagent
```

### Artifacts location
- gpuagent binary can be found at - ${TOP_DIR}/sw/nic/build/x86_64/sim/bin/gpuagent
- gpuctl binary can be found at - ${TOP_DIR}/sw/nic/build/x86_64/sim/bin/gpuctl

6. To clean the build artifacts (run it within build-container)
### To clean the build artifacts (run it within build-container)

```bash
[root@dev gpu-agent]# make -C sw/nic/gpuagent clean
[root@dev gpu-agent]#
```

# Things to note
- When updating the amdsmi library to a different version, make sure the libamdsmi.so libraries are built correctly and are available in the sw/nic/build/x86_64/sim/lib/ path. They are required at runtime, and a library version mismatch may lead to runtime issues. These libraries are built from [amdsmi git](https://github.com/rocm/amdsmi/). The commit/tag that the current gpuagent is built against can be found in [version.txt](sw/nic/third-party/rocm/amd_smi_lib/version.txt)
- apply the patches for amdsmi found [here](patch/amdsmi)
- amdsmi build instructions are available [here](sw/nic/gpuagent/api/smi/amdsmi/README.md)

# Troubleshooting
- If you face any issue with golang dependencies, re-run the `make gopkglist` and `go mod vendor` commands.
- Some go files are generated at build time. If you face any issue related to missing files, run the `make gpuagent` command within the build-container, then re-run the `go mod vendor` command.
59 changes: 59 additions & 0 deletions patch/amdsmi/slow.patch
@@ -0,0 +1,59 @@
diff --git a/src/amd_smi/amd_smi.cc b/src/amd_smi/amd_smi.cc
index 2bb04732..a169b0bc 100644
--- a/src/amd_smi/amd_smi.cc
+++ b/src/amd_smi/amd_smi.cc
@@ -635,6 +635,11 @@ amdsmi_get_gpu_device_uuid(amdsmi_processor_handle processor_handle,
return status;
}

+// Add a static cache for KFD nodes with initialization flag
+static std::once_flag kfd_nodes_initialized;
+static std::map<uint64_t, std::shared_ptr<amd::smi::KFDNode>> cached_nodes;
+static uint32_t cached_smallest_node_id = std::numeric_limits<uint32_t>::max();
+
amdsmi_status_t
amdsmi_get_gpu_enumeration_info(amdsmi_processor_handle processor_handle,
amdsmi_enumeration_info_t *info){
@@ -663,25 +668,26 @@ amdsmi_get_gpu_enumeration_info(amdsmi_processor_handle processor_handle,
info->drm_render = gpu_device->get_drm_render_minor();

// Retrieve HIP ID (difference from the smallest node ID) and HSA ID
- std::map<uint64_t, std::shared_ptr<amd::smi::KFDNode>> nodes;
- if (amd::smi::DiscoverKFDNodes(&nodes) == 0) {
- uint32_t smallest_node_id = std::numeric_limits<uint32_t>::max();
- for (const auto& node_pair : nodes) {
- uint32_t node_id = 0;
- if (node_pair.second->get_node_id(&node_id) == 0) {
- smallest_node_id = std::min(smallest_node_id, node_id);
+ // Initialize KFD nodes once
+ std::call_once(kfd_nodes_initialized, []() {
+ if (amd::smi::DiscoverKFDNodes(&cached_nodes) == 0) {
+ for (const auto& node_pair : cached_nodes) {
+ uint32_t node_id = 0;
+ if (node_pair.second->get_node_id(&node_id) == 0) {
+ cached_smallest_node_id = std::min(cached_smallest_node_id, node_id);
+ }
}
}
+ });

- // Default to 0xffffffff as not supported
- info->hsa_id = std::numeric_limits<uint32_t>::max();
- info->hip_id = std::numeric_limits<uint32_t>::max();
- amdsmi_kfd_info_t kfd_info;
- status = amdsmi_get_gpu_kfd_info(processor_handle, &kfd_info);
- if (status == AMDSMI_STATUS_SUCCESS) {
- info->hsa_id = kfd_info.node_id;
- info->hip_id = kfd_info.node_id - smallest_node_id;
- }
+ // Default to 0xffffffff as not supported
+ info->hsa_id = std::numeric_limits<uint32_t>::max();
+ info->hip_id = std::numeric_limits<uint32_t>::max();
+ amdsmi_kfd_info_t kfd_info;
+ status = amdsmi_get_gpu_kfd_info(processor_handle, &kfd_info);
+ if (status == AMDSMI_STATUS_SUCCESS) {
+ info->hsa_id = kfd_info.node_id;
+ info->hip_id = kfd_info.node_id - cached_smallest_node_id;
}

// Retrieve HIP UUID
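
Review note on patch/amdsmi/slow.patch: the hunk above replaces a per-call `DiscoverKFDNodes` scan with a one-time scan guarded by `std::call_once`. The following is a minimal, self-contained sketch of that caching pattern, not the amdsmi code itself. `DiscoverNodes`, `SmallestNodeId`, `HipId`, `kUnsupported`, `discover_calls` and the sample node ids are hypothetical stand-ins introduced only for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <map>
#include <mutex>

// 0xffffffff marks "not supported", mirroring the default used in the patch.
constexpr uint32_t kUnsupported = std::numeric_limits<uint32_t>::max();

// Counts how often the (expensive) discovery runs; for illustration only.
static int discover_calls = 0;

// Hypothetical stand-in for amd::smi::DiscoverKFDNodes: fills a map of
// key -> node id and returns 0 on success. The node ids are made up.
int DiscoverNodes(std::map<uint64_t, uint32_t>* nodes) {
  ++discover_calls;
  (*nodes)[0x10] = 2;
  (*nodes)[0x20] = 3;
  (*nodes)[0x30] = 5;
  return 0;
}

// Returns the smallest node id, running discovery at most once per process.
// If discovery fails, the cached value stays at kUnsupported.
uint32_t SmallestNodeId() {
  static std::once_flag once;
  static uint32_t smallest = kUnsupported;
  std::call_once(once, [] {
    std::map<uint64_t, uint32_t> nodes;
    if (DiscoverNodes(&nodes) == 0) {
      for (const auto& entry : nodes) {
        smallest = std::min(smallest, entry.second);
      }
    }
  });
  return smallest;
}

// HIP id is the offset from the smallest node id. The guard avoids unsigned
// wrap-around when discovery failed or the node id is below the cached minimum.
uint32_t HipId(uint32_t node_id) {
  const uint32_t smallest = SmallestNodeId();
  if (smallest == kUnsupported || node_id < smallest) {
    return kUnsupported;
  }
  return node_id - smallest;
}
```

Two points may be worth checking against the real patch. As shown in the hunk, `hip_id` is computed as `kfd_info.node_id - cached_smallest_node_id` even when `DiscoverKFDNodes` failed, in which case the subtraction wraps around; the sketch guards that case explicitly. The cached result is also never refreshed, so this approach assumes the set of KFD nodes does not change during the lifetime of the process.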
6 changes: 3 additions & 3 deletions sw/nic/gpuagent/api/smi/amdsmi/README.md
@@ -35,12 +35,12 @@
5. sudo make install
```

this generates amd-smi .so file under /opt/rocm/lib64/, example /opt/rocm/lib64/libamd_smi.so.24.6
this generates amd-smi .so file under /opt/rocm/lib64/, example /opt/rocm/lib64/libamd_smi.so.2.*

### upload new amd-smi library to assets using the following steps:

1. cp /opt/rocm/lib64/libamd_smi.so.24.6 /sw/nic/third-party/rocm/amd_smi_lib/x86_64/lib/
1. cp /opt/rocm/lib64/libamd_smi.so.2* /sw/nic/third-party/rocm/amd_smi_lib/x86_64/lib/
2. fix symlinks in /sw/nic/third-party/rocm/amd_smi_lib/x86_64/lib/ as required
3. copy the required version of amdsmi.h from https://github.com/ROCm/amdsmi/ to /sw/nic/third-party/rocm/amd_smi_lib/include/amd_smi/amdsmi.h
4. upload assets to minio server using
4. upload assets to minio server using (internal)
1. tar cvz $(cat /sw/minio/third_party_libs.txt) | /bin/asset-push -ac assets-colo.pensando.io:9000 sw-repository third_party_libs ${NEW_VERSION} /dev/stdin