740 add generic support for different gpu hardware #3371
Conversation
Force-pushed 70c500c to 54ef2bd
Force-pushed 54ef2bd to bc96fa7
agunapal left a comment
Thank you for the PR. We are still reviewing it.
```dockerfile
# For reference:
# https://docs.docker.com/develop/develop-images/build_enhancements/

ARG BASE_IMAGE=ubuntu:24.04
```
The rest of the TorchServe images are still on Ubuntu 20.04, as we had issues with GitHub runners on newer versions. Haven't tried this in a while.
Thanks for the feedback @agunapal
```
--index-url https://download.pytorch.org/whl/rocm6.2
torch==2.5.1+rocm6.2; sys_platform == 'linux'
torchvision==0.20.1+rocm6.2; sys_platform == 'linux'
torchaudio==2.5.1+rocm6.2; sys_platform == 'linux'
```
We haven't yet updated the PyTorch version for the rest of the project, but this should be OK. I will update it for the other platforms too.
@smedegaard Looks like the PR is breaking TorchServe on
It seems like this test is failing in CI: https://github.com/nod-ai/serve/blob/31824434aa2acd3ff8261bd18cf6f1d925b8e22a/frontend/server/src/test/java/org/pytorch/serve/util/ConfigManagerTest.java#L110
smedegaard left a comment
🚀
Co-authored-by: Samu Tamminen <stammine@amd.com>
…leUtil GPU env value
Force-pushed 844806c to cc0809d
agunapal left a comment
LGTM. Tested the changes manually on Graviton 3 and there are no issues.
The failures with the runners can be debugged at a later point.

Description
This PR decouples the hardware layer from the front- and backend of TorchServe.
Relates to #740
'Add AMD backend support'
- Rony Leppänen <rleppane@amd.com>

'Add AMD frontend support'
- Anders Smedegaard Pedersen <asmedega@amd.com>

'Add Dockerfile.rocm'
- Samu Tamminen <stammine@amd.com>
- Jarkko Lehtiranta <jlehtira@amd.com>

'Add AMD documentation'
- Anders Smedegaard Pedersen <asmedega@amd.com>
- Rony Leppänen <rleppane@amd.com>
- Jarkko Lehtiranta <jlehtira@amd.com>

Other contributions:
- Bipradip Chowdhury <bichowdh@amd.com>
- Jarkko Vainio <javainio@amd.com>
- Tero Kemppi <tekemppi@amd.com>
Requirement Files
Added `requirements/torch_rocm62.txt`, `requirements/torch_rocm61.txt` and `requirements/torch_rocm60.txt` for easy install of the dependencies needed for AMD support.

Backend
The Python backend currently supports NVIDIA GPUs using hardware-specific libraries. There were also a number of functions that could be refactored to use more generalized interfaces.
Changes Made to Backend
Extended `print_env_info` for AMD GPUs and reimplemented a number of functions.
The Java frontend, which acts as the workload manager, had calls to SMIs hard-coded in a few places. This made it difficult for TorchServe to support multiple hardware vendors in a graceful manner.
Changes Made to Frontend
We've introduced a new package `org.pytorch.serve.device` with the classes `SystemInfo` and `Accelerator`. `SystemInfo` holds an array list of `Accelerator` objects that hold static information about the specific accelerators on a machine, as well as the relevant metrics.

Instead of calling the SMIs directly in multiple places in the frontend code, we have abstracted the hardware away by adding an instance of `SystemInfo` to the pre-existing `ConfigManager`. Now the frontend can get data from the hardware via the methods on `SystemInfo` without knowing about the specifics of the hardware and SMIs.

To implement the specifics for each of the vendors that were already partially supported, we have created a number of utility classes that communicate with the hardware via the relevant SMI.
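To make the abstraction concrete, here is a rough sketch of how frontend code could read accelerator metrics through `SystemInfo`. The `getSystemInfo()` accessor on `ConfigManager` is an assumption for illustration; the `SystemInfo` and `Accelerator` methods match the class diagram further down.

```java
// Hedged sketch: reading accelerator metrics through the new abstraction
// instead of shelling out to an SMI directly. ConfigManager#getSystemInfo()
// is an assumed accessor name; the SystemInfo and Accelerator methods are
// taken from the class diagram in this PR.
public class AcceleratorMetricsSketch {
    public static void logUtilization(ConfigManager configManager) {
        SystemInfo systemInfo = configManager.getSystemInfo(); // assumed accessor
        if (!systemInfo.hasAccelerators()) {
            return; // CPU-only host: nothing to report
        }
        systemInfo.updateAcceleratorMetrics(); // refresh via the vendor's SMI
        for (Accelerator accelerator : systemInfo.getAccelerators()) {
            System.out.println(accelerator.utilizationToString());
        }
    }
}
```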
The following steps are taken in the `SystemInfo` constructor:

1. Detect the accelerator vendor by running `which {relevant smi}` for each of the supported vendors (`where` is used on Windows systems). This is how vendor detection was done previously; there might be more robust ways. A sketch of this step follows the list.
2. Create the matching utility class, e.g. `ROCmUtil` for AMD.
3. Detect the visible devices, respecting `HIP_VISIBLE_DEVICES` for AMD, `CUDA_VISIBLE_DEVICES` for NVIDIA and `XPU_VISIBLE_DEVICES` for Intel. All devices are detected if the relevant environment variable is not set.
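A minimal sketch of the detection step, assuming the probed SMI binary names (`rocm-smi`, `nvidia-smi`, `xpu-smi`, `system_profiler`); the real `SystemInfo.detectVendorType()` and `isCommandAvailable(String)` may differ in detail:

```java
import java.io.IOException;

// Minimal sketch of vendor detection via `which {relevant smi}` (`where` on
// Windows). The probed binary names are assumptions for illustration; the
// actual SystemInfo implementation may differ.
public class VendorDetectionSketch {
    enum AcceleratorVendor { AMD, NVIDIA, INTEL, APPLE, UNKNOWN }

    static AcceleratorVendor detectVendorType() {
        if (isCommandAvailable("rocm-smi")) return AcceleratorVendor.AMD;
        if (isCommandAvailable("nvidia-smi")) return AcceleratorVendor.NVIDIA;
        if (isCommandAvailable("xpu-smi")) return AcceleratorVendor.INTEL;
        if (isCommandAvailable("system_profiler")) return AcceleratorVendor.APPLE;
        return AcceleratorVendor.UNKNOWN;
    }

    static boolean isCommandAvailable(String command) {
        boolean isWindows = System.getProperty("os.name").toLowerCase().contains("win");
        String locator = isWindows ? "where" : "which"; // per the steps above
        try {
            Process process = new ProcessBuilder(locator, command).start();
            return process.waitFor() == 0; // exit code 0 means the command exists
        } catch (IOException | InterruptedException e) {
            return false;
        }
    }
}
```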
The following class diagram shows how the new classes relate to the existing code:

```mermaid
classDiagram
    class SystemInfo {
        -Logger logger
        -AcceleratorVendor acceleratorVendor
        -ArrayList~Accelerator~ accelerators
        -IAcceleratorUtility acceleratorUtil
        +SystemInfo()
        -createAcceleratorUtility() IAcceleratorUtility
        -populateAccelerators()
        +hasAccelerators() boolean
        +getNumberOfAccelerators() Integer
        +static detectVendorType() AcceleratorVendor
        -static isCommandAvailable(String) boolean
        +getAccelerators() ArrayList~Accelerator~
        -updateAccelerators(List~Accelerator~)
        +updateAcceleratorMetrics()
        +getAcceleratorVendor() AcceleratorVendor
        +getVisibleDevicesEnvName() String
    }
    class Accelerator {
        +Integer id
        +AcceleratorVendor vendor
        +String model
        +IAcceleratorUtility acceleratorUtility
        +Float usagePercentage
        +Float memoryUtilizationPercentage
        +Integer memoryAvailableMegabytes
        +Integer memoryUtilizationMegabytes
        +getVendor() AcceleratorVendor
        +getAcceleratorModel() String
        +getAcceleratorId() Integer
        +getMemoryAvailableMegaBytes()
        +getUsagePercentage() Float
        +getMemoryUtilizationPercentage()
        +getMemoryUtilizationMegabytes()
        +setMemoryAvailableMegaBytes()
        +setUsagePercentage(Float)
        +setMemoryUtilizationPercentage(Float)
        +setMemoryUtilizationMegabytes(Integer)
        +utilizationToString()
        +updateDynamicAttributes()
    }
    class AcceleratorVendor {
        <<enumeration>>
        AMD
        NVIDIA
        INTEL
        APPLE
        UNKNOWN
    }
    class IAcceleratorUtility {
        <<interface>>
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +getUpdatedAcceleratorsUtilization()
    }
    class ICsvSmiParser {
        <<interface>>
        +csvSmiOutputToAccelerators()
    }
    class IJsonSmiParser {
        <<interface>>
        +jsonOutputToAccelerators()
        +extractAcceleratorId()
        +jsonObjectToAccelerator()
        +extractAccelerators()
    }
    class CudaUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +parseAccelerator()
        +parseUpdatedAccelerator()
    }
    class ROCmUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +extractAccelerators()
        +extractAcceleratorId()
        +jsonObjectToAccelerator()
    }
    class XpuUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +parseDiscoveryOutput()
        +parseUtilizationOutput()
    }
    class AppleUtil {
        +getGpuEnvVariableName()
        +getUtilizationSmiCommand()
        +getAvailableAccelerators()
        +smiOutputToUpdatedAccelerators()
        +jsonObjectToAccelerator()
        +extractAcceleratorId()
        +extractAccelerators()
    }
    class ConfigManager {
        -SystemInfo systemInfo
        +init(Arguments args)
    }
    class WorkerLifeCycle {
        -ConfigManager configManager
        -ModelManager modelManager
        -Model model
    }
    class WorkerThread {
        #ConfigManager configManager
        #int port
        #Model model
        #WorkerState state
        #WorkerLifeCycle lifeCycle
    }
    class AsyncWorkerThread {
        #boolean loadingFinished
        #CountDownLatch latch
        +run()
        #connect()
    }
    WorkerLifeCycle --> "1" ConfigManager
    WorkerLifeCycle --> "1" Model
    WorkerLifeCycle --> "1" Connector
    WorkerThread --> "1" ConfigManager
    WorkerThread --> "1" WorkerLifeCycle
    AsyncWorkerThread --|> WorkerThread
    ConfigManager "1" --> "1" SystemInfo
    ConfigManager "1" --> "*" Accelerator
    SystemInfo --> "0..*" Accelerator
    SystemInfo --> "1" IAcceleratorUtility
    SystemInfo --> "1" AcceleratorVendor
    Accelerator --> "1" AcceleratorVendor
    CudaUtil ..|> IAcceleratorUtility
    CudaUtil ..|> ICsvSmiParser
    ROCmUtil ..|> IAcceleratorUtility
    ROCmUtil ..|> IJsonSmiParser
    XpuUtil ..|> IAcceleratorUtility
    XpuUtil ..|> ICsvSmiParser
    AppleUtil ..|> IAcceleratorUtility
    AppleUtil ..|> IJsonSmiParser
```
Documentation

Moved the hardware-specific docs to `serve/docs/hardware_support/` and added them under "Hardware Support" in the TOC.
Feature/Issue validation/testing
We built a new docker container for ROCm using `Dockerfile.rocm` and the build argument `USE_ROCM_VERSION`. For other platforms we used the `build_image.sh` script.

Run containers
Tests
Note: the test `test_handler.py::test_huggingface_bert_model_parallel_inference` fails due to: `ValueError: Input length of input_ids is 150, but max_length is set to 50. This can lead to unexpected behavior. You should consider increasing max_length or, better yet, setting max_new_tokens.`

This indicates that preprocessing uses a different `max_length` than inference, which can be verified by looking at the handler from when the test was originally implemented: `model.generate()` uses `max_length=50` by default, while the tokenizer uses `max_length` from `setup_config` (`max_length=150`). It seems that the BERT-based `Textgeneration.mar` needs an update.