
Batch inference example crashes when loading resnet152 model #2467

Closed
mmeendez8 opened this issue Jul 17, 2023 · 2 comments

mmeendez8 (Contributor)
🐛 Describe the bug

I tried to run the batch inference example by following the documentation.

  1. Launch TorchServe with Docker:
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -v PATH:/mnt/models/model-store  -v PATH:/home/model-server/config.properties  pytorch/torchserve:latest-gpu
  2. Register the model with curl:
curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=3&max_batch_delay=10&initial_workers=1"
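For reference, the registration call above can be reproduced from Python with only the standard library. A minimal sketch; the endpoint and query parameters are taken verbatim from the curl command:

```python
from urllib.parse import urlencode

# Management API endpoint and parameters from the curl command above.
params = {
    "url": "https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar",
    "batch_size": 3,        # batch up to 3 requests together
    "max_batch_delay": 10,  # wait at most 10 ms to fill a batch
    "initial_workers": 1,
}
register_url = "http://localhost:8081/models?" + urlencode(params)

# POSTing this URL (e.g. with urllib.request) registers the model:
# urllib.request.urlopen(urllib.request.Request(register_url, method="POST"))
```

Note that `urlencode` percent-encodes the nested MAR URL, which is what the management API expects.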

Error logs

2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - Failed to load model resnet-152-batch_v2, exception Attempted to set the storage of a tensor on device "cpu" to a storage on different device "cuda:0".  This is no longer allowed; the devices must match.
2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_service_worker.py", line 131, in load_model
2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     service = model_loader.load(
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_loader.py", line 135, in load
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     initialize_fn(service.context)
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/torch_handler/vision_handler.py", line 23, in initialize
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     super().initialize(context)
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/torch_handler/base_handler.py", line 157, in initialize
2023-07-17T14:20:37,173 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     self.model = self._load_pickled_model(
2023-07-17T14:20:37,173 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/torch_handler/base_handler.py", line 258, in _load_pickled_model
2023-07-17T14:20:37,173 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     state_dict = torch.load(model_pt_path, map_location=map_location)
2023-07-17T14:20:37,173 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/torch/serialization.py", line 815, in load
2023-07-17T14:20:37,174 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
2023-07-17T14:20:37,174 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/torch/serialization.py", line 1018, in _legacy_load
2023-07-17T14:20:37,174 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     return legacy_load(f)
2023-07-17T14:20:37,174 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/torch/serialization.py", line 945, in legacy_load
2023-07-17T14:20:37,175 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     tensor = torch.tensor([], dtype=storage.dtype).set_(
2023-07-17T14:20:37,175 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - RuntimeError: Attempted to set the storage of a tensor on device "cpu" to a storage on different device "cuda:0".  This is no longer allowed; the devices must match.
2023-07-17T14:20:37,178 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - Backend worker process died.
2023-07-17T14:20:37,179 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-07-17T14:20:37,179 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_service_worker.py", line 253, in <module>
2023-07-17T14:20:37,179 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     worker.run_server()
2023-07-17T14:20:37,180 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_service_worker.py", line 221, in run_server
2023-07-17T14:20:37,180 [DEBUG] W-9000-resnet-152-batch_v2_2.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-07-17T14:20:37,180 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2023-07-17T14:20:37,180 [INFO ] W-9000-resnet-152-batch_v2_2.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 11997
2023-07-17T14:20:37,180 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_service_worker.py", line 189, in handle_connection
2023-07-17T14:20:37,181 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     raise RuntimeError("{} - {}".format(code, result))
2023-07-17T14:20:37,180 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
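The worker's Python traceback above is interleaved with frontend (`WorkerThread`, `epollEventLoopGroup`) lines. When digging through logs like these, a small stdlib filter that keeps only the backend `MODEL_LOG` payload can help. A sketch, assuming the `... MODEL_LOG - <text>` line shape shown above:

```python
def extract_model_log(lines):
    """Keep only the text after the 'MODEL_LOG - ' marker in worker log lines."""
    marker = "MODEL_LOG - "
    out = []
    for line in lines:
        idx = line.find(marker)
        if idx != -1:
            out.append(line[idx + len(marker):])
    return out

# Two lines from the log above: one backend, one frontend.
sample = [
    "2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout "
    "MODEL_LOG - Traceback (most recent call last):",
    "2023-07-17T14:20:37,180 [DEBUG] W-9000-resnet-152-batch_v2_2.0 "
    "org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true",
]
filtered = extract_model_log(sample)
```

Applied to the full log, this leaves just the Python traceback ending in the `RuntimeError` about mismatched storage devices.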

Installation instructions

Used Docker.

Model Packaging

https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar

config.properties

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
job_queue_size=10
model_store=/mnt/models/model-store
max_request_size=100000000
max_response_size=100000000

Versions

Latest Docker image.

Repro instructions

  1. Launch TorchServe with Docker:
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -v PATH:/mnt/models/model-store  -v PATH:/home/model-server/config.properties  pytorch/torchserve:latest-gpu
  2. Register the model with curl:
curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=3&max_batch_delay=10&initial_workers=1"

Possible Solution

No response

jagadeeshi2i (Collaborator) commented Aug 22, 2023

@mmeendez8 seems the weights need to be updated - #2467

sachanub (Collaborator) commented Oct 26, 2023

As mentioned by @jagadeeshi2i, the weights in the ResNet-152 MAR file had to be updated, from resnet152-b121ed2d.pth to resnet152-394f9c45.pth. The updated MAR file is available here: https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar

Steps to perform successful inference:

  • Download model artifact:
wget https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar
  • Pull latest TorchServe GPU container:
docker pull pytorch/torchserve:latest-gpu
  • Create a config.properties file:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
job_queue_size=10
model_store=/home/model-server/model-store
max_request_size=100000000
max_response_size=100000000
load_models=ALL
  • Launch TorchServe in the Docker container:
docker run --rm -it --gpus all -p 127.0.0.1:8080:8080 -p 127.0.0.1:8081:8081 --name mar -v $(pwd)/resnet-152-batch_v2.mar:/home/model-server/model-store/resnet-152-batch_v2.mar -v $(pwd)/config.properties:/home/model-server/config.properties  pytorch/torchserve:latest-gpu
  • Perform inference with kitten.jpg image:
curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg
curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
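`curl -T` uploads the file body with an HTTP PUT; the same request can be built with Python's standard library. A sketch, assuming the TorchServe container from the steps above is running on localhost:8080:

```python
import urllib.request

def build_prediction_request(image_bytes: bytes) -> urllib.request.Request:
    # Equivalent of: curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
    return urllib.request.Request(
        "http://localhost:8080/predictions/resnet-152-batch_v2",
        data=image_bytes,
        method="PUT",
    )

# With the server up, sending the request returns the JSON scores:
# with open("kitten.jpg", "rb") as f:
#     with urllib.request.urlopen(build_prediction_request(f.read())) as resp:
#         print(resp.read().decode())
```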

Inference Output:

{
  "tiger_cat": 0.5798614621162415,
  "tabby": 0.38344162702560425,
  "Egyptian_cat": 0.0342114195227623,
  "lynx": 0.0005819813231937587,
  "quilt": 0.000273319921689108
}
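The response is plain JSON mapping class labels to scores, so picking the top-1 prediction programmatically is straightforward. A small sketch over the output above:

```python
import json

# The JSON body returned by the predictions endpoint above.
response = '''{
  "tiger_cat": 0.5798614621162415,
  "tabby": 0.38344162702560425,
  "Egyptian_cat": 0.0342114195227623,
  "lynx": 0.0005819813231937587,
  "quilt": 0.000273319921689108
}'''
scores = json.loads(response)
top_label = max(scores, key=scores.get)  # class with the highest score
```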
