[SWDEV-488276/SWDEV-497613] Update memory partition set functionality
Changes:
  - Added warning screen to ROCm SMI users
    setting memory partition
  - Added new API (rsmi_dev_memory_partition_capabilities_get)
    to retrieve memory partition capabilities
    (What users can set memory partition modes to)
  - Increased time-bar for CLI sets display to 40 seconds
  - API now waits until the driver reloads with SYSFS files active
  - [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
    not properly displaying for MI2x or Navi 3x ASICs.
  - Updated tests

Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
(cherry picked from commit 4690227)
charis-poag-amd authored and Maisam Arif committed Nov 13, 2024
1 parent 57f3f84 commit 55c0f58
Showing 10 changed files with 595 additions and 123 deletions.
24 changes: 19 additions & 5 deletions CHANGELOG.md
@@ -8,6 +8,11 @@ Full documentation for rocm_smi_lib is available at [https://rocm.docs.amd.com/]

### Added

- **Added `rsmi_dev_memory_partition_capabilities_get`, which returns the driver's memory partition capabilities.**
The driver can now report which memory partition modes the user can set. Users can now see the available
memory partition modes when a memory partition mode set (`rsmi_dev_memory_partition_set`) returns an invalid argument error.


- **Added support for GPU metrics 1.6 to `rsmi_dev_gpu_metrics_info_get()`**
Updated `rsmi_dev_gpu_metrics_info_get()` and structure `rsmi_gpu_metrics_t` to include new fields for PVIOL / TVIOL, XCP (Graphics Compute Partitions) stats, and pcie_lc_perf_other_end_recovery:
- `uint64_t accumulation_counter` - used for all throttled calculations
@@ -27,17 +32,26 @@ Updated `rsmi_dev_gpu_metrics_info_get()` and structure `rsmi_gpu_metrics_t` to
- **Added ability to view raw GPU metrics with `rocm-smi --showmetrics`**
Users can now view GPU metrics with the new `rocm-smi --showmetrics` option. Unlike AMD SMI (or other ROCm SMI interfaces), these values are ***not*** converted into applicable units as users may see in `amd-smi metric`. Units are displayed as reported by the driver, without the conversion applied by other AMD SMI/ROCm SMI interfaces that use the same data. Note that a field displaying `N/A` means the ASIC does not support that field, or that backward compatibility was not provided in a newer ASIC's GPU metric structure.
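
For the `rsmi_dev_memory_partition_capabilities_get` entry above, a script might check a requested mode against the reported capabilities before attempting a set. The following is a minimal sketch only: it assumes the capabilities string is a comma- or space-separated list such as `NPS1, NPS4` (the exact format is driver-defined and is an assumption here), and `parse_capabilities` and `is_mode_available` are hypothetical helper names, not part of ROCm SMI.

```python
from typing import List


def parse_capabilities(caps: str) -> List[str]:
    """Split a capabilities string into upper-cased mode names.

    Assumption: modes are separated by commas and/or whitespace.
    An empty string or "N/A" yields an empty list.
    """
    if not caps or caps.strip().upper() == "N/A":
        return []
    return [token.upper() for token in caps.replace(",", " ").split()]


def is_mode_available(requested: str, caps: str) -> bool:
    """Return True if the requested mode appears in the capabilities."""
    return requested.strip().upper() in parse_capabilities(caps)
```

A caller could pass the string returned by the Python wrapper `getMemoryPartitionCapabilities` added in this commit as `caps`.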

### Changed

- **Added back in C++ tests for `memorypartition_read_write`**.
Because the driver now provides all features needed for memory partition writes, we have re-enabled `memorypartition_read_write`.

- **Updated `rsmi_dev_memory_partition_set` to not return until a successful restart of AMD GPU Driver.**
The API now checks for up to approximately 40 seconds for a successful restart of the AMD GPU driver. It also checks that the memory partition (NPS) SYSFS files have been updated to reflect the requested memory partition (NPS) mode; otherwise, it reports an error back to the user. Accordingly, ROCm SMI's CLI now reflects the maximum wait of 40 seconds while a memory partition change is in progress.

- **All APIs now have the ability to catch driver reporting invalid arguments.**
ROCm SMI APIs can now return `RSMI_STATUS_INVALID_ARGS` when the driver returns `EINVAL`.
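
The wait-until-reloaded behaviour described above for `rsmi_dev_memory_partition_set` follows a common poll-with-deadline pattern. The sketch below illustrates that pattern only; it is not the library's implementation (which lives in the C++ sources), and `wait_until` plus its parameters are hypothetical names. The `condition` callable stands in for "the driver has reloaded and the NPS SYSFS file shows the requested mode".

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 40.0,
               interval_s: float = 1.0) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met, False on timeout.
    The condition is always checked at least once.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller would treat a `False` result as the error case, which in the changelog's terms corresponds to reporting a failure back to the user.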

### Removed

- **Removed `--resetcomputepartition`, and `--resetmemorypartition` options and associated APIs**.
- This change is part of the partition feature redesign.
- The related APIs `rsmi_dev_compute_partition_reset()` and `rsmi_dev_memory_partition_reset()` were also removed.

- **Temporarily disabled C++ tests for `memorypartition_read_write`**.
- This change is part of the partition feature redesign.
- SMI's workflow needs to be adjusted in order to accommodate incoming driver changes to enable
the dynamic memory partition feature. We plan on re-enabling testing for this feature during ROCm
6.4.
### Resolved issues

- **Fixed `rsmi_dev_target_graphics_version_get`, `rocm-smi --showhw`, and `rocm-smi --showprod` not displaying properly for MI2x or Navi 3x ASICs.**
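
The CLI side of this fix, as it appears in the `getTargetGfxVersion` hunk of this commit, rewrites the last two digits of a four-digit version as hexadecimal when the market name contains `Instinct MI2`. A standalone sketch of that conversion follows; `format_gfx_version` is a hypothetical name, and the market name strings in the usage lines are illustrative assumptions.

```python
def format_gfx_version(raw_version: int, market_name: str) -> str:
    """Build a "gfx..." string from a numeric target graphics version.

    Mirrors the logic in this commit's getTargetGfxVersion: for
    four-digit versions on "Instinct MI2" devices, the last two
    decimal digits are re-encoded as hex (for example 10 -> "a").
    """
    version = str(raw_version)
    if len(version) == 4 and "Instinct MI2" in market_name:
        version = version[:2] + format(int(version[2:]), "x")
    return "gfx" + version


print(format_gfx_version(9010, "AMD Instinct MI210"))  # gfx90a
print(format_gfx_version(1100, "Example Navi 3x card"))  # gfx1100
```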

### Upcoming changes

33 changes: 33 additions & 0 deletions include/rocm_smi/rocm_smi.h
@@ -4181,6 +4181,39 @@ rsmi_status_t
rsmi_dev_memory_partition_get(uint32_t dv_ind, char *memory_partition,
uint32_t len);

/**
* @brief Retrieves the available memory partition capabilities
* for a desired device
*
* @details
* Given a device index @p dv_ind , a string @p memory_partition_caps ,
* and a uint32 @p len , this function will attempt to obtain the device's
* available memory partition capabilities string. Upon successful
* retrieval, the device's available memory partition capabilities
* string shall be stored in the passed @p memory_partition_caps
* char string variable.
*
* @param[in] dv_ind a device index
*
* @param[inout] memory_partition_caps a pointer to a char string variable,
* which the device's available memory partition capabilities will be written to.
*
* @param[in] len the length of the caller provided buffer
* @p memory_partition_caps , suggested length is 30 or greater.
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
* @retval ::RSMI_STATUS_UNEXPECTED_DATA data provided to function is not valid
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function
* @retval ::RSMI_STATUS_INSUFFICIENT_SIZE is returned if @p len bytes is not
* large enough to hold the entire memory partition value. In this case,
* only @p len bytes will be written.
*
*/
rsmi_status_t rsmi_dev_memory_partition_capabilities_get(
uint32_t dv_ind, char *memory_partition_caps, uint32_t len);

/**
* @brief Modifies a selected device's current memory partition setting.
*
6 changes: 5 additions & 1 deletion include/rocm_smi/rocm_smi_device.h
@@ -172,7 +172,8 @@ enum DevInfoTypes {
kDevGpuReset,
kDevAvailableComputePartition,
kDevComputePartition,
kDevMemoryPartition
kDevMemoryPartition,
kDevAvailableMemoryPartition,
};

typedef struct {
@@ -227,6 +228,8 @@ class Device {
bool DeviceAPISupported(std::string name, uint64_t variant,
uint64_t sub_variant);
rsmi_status_t restartAMDGpuDriver(void);
rsmi_status_t isRestartInProgress(bool *isRestartInProgress,
bool *isAMDGPUModuleLive);
rsmi_status_t storeDevicePartitions(uint32_t dv_ind);
template <typename T> std::string readBootPartitionState(uint32_t dv_ind);
rsmi_status_t check_amdgpu_property_reinforcement_query(uint32_t dev_idx, AMDGpuVerbTypes_t verb_type);
@@ -244,6 +247,7 @@ class Device {
static const std::map<DevInfoTypes, const char*> devInfoTypesStrings;
void set_smi_device_id(uint32_t i) { m_device_id = i; }
void set_smi_partition_id(uint32_t i) { m_partition_id = i; }
static const char* get_type_string(DevInfoTypes type);

private:
std::shared_ptr<Monitor> monitor_;
5 changes: 4 additions & 1 deletion include/rocm_smi/rocm_smi_utils.h
@@ -90,7 +90,8 @@ std::pair<bool, std::string> executeCommand(std::string command,
rsmi_status_t storeTmpFile(uint32_t dv_ind, std::string parameterName,
std::string stateName, std::string storageData);
std::vector<std::string> getListOfAppTmpFiles();
bool containsString(std::string originalString, std::string substring);
bool containsString(std::string originalString, std::string substring,
bool displayComparisons = false);
std::tuple<bool, std::string> readTmpFile(
uint32_t dv_ind,
std::string stateName,
@@ -138,6 +139,8 @@ std::string removeNewLines(const std::string &s);

std::string removeString(const std::string origStr,
const std::string &removeMe);
void system_wait(int milli_seconds);
int countDigit(uint64_t n);
template <typename T>
std::string print_int_as_hex(T i, bool showHexNotation = true,
int overloadBitSize = 0) {
72 changes: 65 additions & 7 deletions python_smi_tools/rocm_smi.py
@@ -391,11 +391,16 @@ def getTargetGfxVersion(device, silent=False):
:param silent: Turn on to silence error output
(you plan to handle manually). Default is off.
"""
gfx_version = c_uint64()
target_graphics_version = c_uint64()
market_name = str(getDeviceName(device, True))
gfx_ver_ret = "N/A"
ret = rocmsmi.rsmi_dev_target_graphics_version_get(device, byref(gfx_version))
ret = rocmsmi.rsmi_dev_target_graphics_version_get(device, byref(target_graphics_version))
target_graphics_version = str(target_graphics_version.value)
if rsmi_ret_ok(ret, device, 'get_target_gfx_version', silent=silent):
gfx_ver_ret = "gfx" + str(gfx_version.value)
if len(target_graphics_version) == 4 and ("Instinct MI2" in market_name):
hex_part = str(hex(int(str(target_graphics_version)[2:]))).replace("0x", "")
target_graphics_version = str(target_graphics_version)[:2] + hex_part
gfx_ver_ret = "gfx" + str(target_graphics_version)
return gfx_ver_ret

def getNodeId(device, silent=False):
@@ -753,6 +758,19 @@ def getMemoryPartition(device, silent=True):
return str(currentMemoryPartition.value.decode())
return "N/A"

def getMemoryPartitionCapabilities(device, silent=True):
""" Return the current memory partition capabilities of a given device
:param device: DRM device identifier
:param silent: Turn on to silence error output
(you plan to handle manually). Default is on.
"""
memoryPartitionCapabilities = create_string_buffer(MAX_BUFF_SIZE)
ret = rocmsmi.rsmi_dev_memory_partition_capabilities_get(device, memoryPartitionCapabilities, MAX_BUFF_SIZE)
if rsmi_ret_ok(ret, device, 'get_memory_partition_capabilities', silent) and memoryPartitionCapabilities.value.decode():
return str(memoryPartitionCapabilities.value.decode())
return "N/A"


def print2DArray(dataArray):
""" Print 2D Array with uniform spacing """
@@ -1823,14 +1841,20 @@ def showProgressbar(title="", timeInSeconds=13):
time.sleep(1)


def setMemoryPartition(deviceList, memoryPartition):
def setMemoryPartition(deviceList, memoryPartition, autoRespond):
""" Sets memory partition (NPS mode) for a list of devices
:param deviceList: List of DRM devices (can be a single-item list)
:param memoryPartition: Memory Partition type to set as
:param autoRespond: Response to automatically provide for all prompts
"""
addExtraLine=False
printLogSpacer(' Set memory partition to %s ' % (str(memoryPartition).upper()))
confirmChangingMemoryPartitionAndReloadingAMDGPU(autoRespond)
for device in deviceList:
current_memory_partition = getMemoryPartition(device, silent=True)
if current_memory_partition == 'N/A':
printLog(device, 'Not supported on the given system', None, addExtraLine)
continue
memoryPartition = memoryPartition.upper()
if memoryPartition not in memory_partition_type_l:
printErrLog(device, 'Invalid memory partition type %s'
@@ -1839,8 +1863,9 @@ def setMemoryPartition(deviceList, memoryPartition):
(', '.join(map(str, memory_partition_type_l))) ))
return (None, None)

kTimeWait = 40
t1 = multiprocessing.Process(target=showProgressbar,
args=("Updating memory partition",13,))
args=("Updating memory partition",kTimeWait,))
t1.start()
addExtraLine=True
start=time.time()
@@ -1862,12 +1887,19 @@ def setMemoryPartition(deviceList, memoryPartition):
printLog(device, 'Permission denied', None, addExtraLine)
elif ret == rsmi_status_t.RSMI_STATUS_NOT_SUPPORTED:
printLog(device, 'Not supported on the given system', None, addExtraLine)
elif ret == rsmi_status_t.RSMI_STATUS_INVALID_ARGS:
printLog(device, 'Device does not support setting to ' + str(memoryPartition).upper(), None, addExtraLine)
memory_partition_caps = getMemoryPartitionCapabilities(device, silent=True)
printLog(device, 'Available memory partition modes: ' + str(memory_partition_caps).upper(), None, addExtraLine)
elif ret == rsmi_status_t.RSMI_STATUS_BUSY:
printLog(device, 'Device is currently busy, try again later',
None, addExtraLine)
elif ret == rsmi_status_t.RSMI_STATUS_AMDGPU_RESTART_ERR:
printLog(device, 'Issue reloading driver, please check dmesg for errors',
None, addExtraLine)
else:
rsmi_ret_ok(ret, device, 'set_memory_partition')
printErrLog(device, 'Failed to retrieve memory partition, even though device supports it.')
printErrLog(device, 'Failed to set memory partition, even though device supports it.')
printLogSpacer()

def showVersion(isCSV=False):
@@ -3844,6 +3876,32 @@ def confirmOutOfSpecWarning(autoRespond):
else:
sys.exit('Confirmation not given. Exiting without setting value')

def confirmChangingMemoryPartitionAndReloadingAMDGPU(autoRespond):
""" Print the warning for changing the memory partition mode and reloading the amdgpu driver, and prompt user to accept the terms.
:param autoRespond: Response to automatically provide for all prompts
"""
print('''
******WARNING******\n
Setting Dynamic Memory (NPS) partition modes requires users to quit all GPU workloads.
ROCm SMI will then attempt to change memory (NPS) partition mode.
Upon a successful set, ROCm SMI will then initiate an action to restart amdgpu driver.
This action will change all GPUs in the hive to the requested memory (NPS) partition mode.
Please use this utility with caution.
''')
if not autoRespond:
user_input = input('Do you accept these terms? [Y/N] ')
else:
user_input = autoRespond
if user_input in ['Yes', 'yes', 'y', 'Y', 'YES']:
print('')
return
else:
print('Confirmation not given. Exiting without setting value')
printLogSpacer()
sys.exit(1)


def doesDeviceExist(device):
""" Check whether the specified device exists
@@ -4503,7 +4561,7 @@ def isConciseInfoRequested(args):
if args.setcomputepartition:
setComputePartition(deviceList, args.setcomputepartition[0])
if args.setmemorypartition:
setMemoryPartition(deviceList, args.setmemorypartition[0])
setMemoryPartition(deviceList, args.setmemorypartition[0], args.autorespond)
if args.resetprofile:
resetProfile(deviceList)
if args.resetxgmierr: