
Batch inference example crashes when loading resnet152 model #2467

Closed
mmeendez8 opened this issue Jul 17, 2023 · 2 comments

mmeendez8 (Contributor)
🐛 Describe the bug

I tried to run the batch inference example by following the documentation.

  1. Launch TorchServe with Docker:
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -v PATH:/mnt/models/model-store  -v PATH:/home/model-server/config.properties  pytorch/torchserve:latest-gpu
  2. Register the model with curl:
curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=3&max_batch_delay=10&initial_workers=1"
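For reference, the registration call above can be reproduced from Python with only the standard library. A minimal sketch; the endpoint and query parameters are taken verbatim from the curl command:

```python
from urllib.parse import urlencode

# Management API endpoint and parameters from the curl command above.
params = {
    "url": "https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar",
    "batch_size": 3,        # batch up to 3 requests together
    "max_batch_delay": 10,  # wait at most 10 ms to fill a batch
    "initial_workers": 1,
}
register_url = "http://localhost:8081/models?" + urlencode(params)

# POSTing this URL (e.g. with urllib.request) registers the model:
# urllib.request.urlopen(urllib.request.Request(register_url, method="POST"))
```

Note that `urlencode` percent-encodes the nested MAR URL, which is what the management API expects.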

Error logs

2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - Failed to load model resnet-152-batch_v2, exception Attempted to set the storage of a tensor on device "cpu" to a storage on different device "cuda:0".  This is no longer allowed; the devices must match.
2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_service_worker.py", line 131, in load_model
2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     service = model_loader.load(
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_loader.py", line 135, in load
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     initialize_fn(service.context)
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/torch_handler/vision_handler.py", line 23, in initialize
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     super().initialize(context)
2023-07-17T14:20:37,172 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/torch_handler/base_handler.py", line 157, in initialize
2023-07-17T14:20:37,173 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     self.model = self._load_pickled_model(
2023-07-17T14:20:37,173 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/torch_handler/base_handler.py", line 258, in _load_pickled_model
2023-07-17T14:20:37,173 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     state_dict = torch.load(model_pt_path, map_location=map_location)
2023-07-17T14:20:37,173 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/torch/serialization.py", line 815, in load
2023-07-17T14:20:37,174 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
2023-07-17T14:20:37,174 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/torch/serialization.py", line 1018, in _legacy_load
2023-07-17T14:20:37,174 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     return legacy_load(f)
2023-07-17T14:20:37,174 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/torch/serialization.py", line 945, in legacy_load
2023-07-17T14:20:37,175 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     tensor = torch.tensor([], dtype=storage.dtype).set_(
2023-07-17T14:20:37,175 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - RuntimeError: Attempted to set the storage of a tensor on device "cpu" to a storage on different device "cuda:0".  This is no longer allowed; the devices must match.
2023-07-17T14:20:37,178 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - Backend worker process died.
2023-07-17T14:20:37,179 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG - Traceback (most recent call last):
2023-07-17T14:20:37,179 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_service_worker.py", line 253, in <module>
2023-07-17T14:20:37,179 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     worker.run_server()
2023-07-17T14:20:37,180 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_service_worker.py", line 221, in run_server
2023-07-17T14:20:37,180 [DEBUG] W-9000-resnet-152-batch_v2_2.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-07-17T14:20:37,180 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2023-07-17T14:20:37,180 [INFO ] W-9000-resnet-152-batch_v2_2.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 11997
2023-07-17T14:20:37,180 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -   File "/home/venv/lib/python3.9/site-packages/ts/model_service_worker.py", line 189, in handle_connection
2023-07-17T14:20:37,181 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout MODEL_LOG -     raise RuntimeError("{} - {}".format(code, result))
2023-07-17T14:20:37,180 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
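The worker's Python traceback above is interleaved with frontend (`WorkerThread`, `epollEventLoopGroup`) lines. When digging through logs like these, a small stdlib filter that keeps only the backend `MODEL_LOG` payload can help. A sketch, assuming the `... MODEL_LOG - <text>` line shape shown above:

```python
def extract_model_log(lines):
    """Keep only the text after the 'MODEL_LOG - ' marker in worker log lines."""
    marker = "MODEL_LOG - "
    out = []
    for line in lines:
        idx = line.find(marker)
        if idx != -1:
            out.append(line[idx + len(marker):])
    return out

# Two lines from the log above: one backend, one frontend.
sample = [
    "2023-07-17T14:20:37,171 [INFO ] W-9000-resnet-152-batch_v2_2.0-stdout "
    "MODEL_LOG - Traceback (most recent call last):",
    "2023-07-17T14:20:37,180 [DEBUG] W-9000-resnet-152-batch_v2_2.0 "
    "org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true",
]
filtered = extract_model_log(sample)
```

Applied to the full log, this leaves just the Python traceback ending in the `RuntimeError` about mismatched storage devices.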

Installation instructions

Used Docker.

Model Packaging

https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar

config.properties

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
job_queue_size=10
model_store=/mnt/models/model-store
max_request_size=100000000
max_response_size=100000000

Versions

Latest Docker image.

Repro instructions

  1. Launch TorchServe with Docker:
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 -v PATH:/mnt/models/model-store  -v PATH:/home/model-server/config.properties  pytorch/torchserve:latest-gpu
  2. Register the model with curl:
curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=3&max_batch_delay=10&initial_workers=1"

Possible Solution

No response

jagadeeshi2i (Collaborator) commented Aug 22, 2023

@mmeendez8 seems the weights need to be updated - #2467

sachanub (Collaborator) commented Oct 26, 2023

As mentioned by @jagadeeshi2i, the weights in the ResNet-152 MAR file had to be updated, from resnet152-b121ed2d.pth to resnet152-394f9c45.pth. The updated MAR file is available here: https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar

Steps to perform successful inference:

  • Download model artifact:
wget https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar
  • Pull latest TorchServe GPU container:
docker pull pytorch/torchserve:latest-gpu
  • Create a config.properties file:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_metrics_api=true
metrics_format=prometheus
number_of_netty_threads=4
job_queue_size=10
model_store=/home/model-server/model-store
max_request_size=100000000
max_response_size=100000000
load_models=ALL
  • Launch TorchServe in the Docker container:
docker run --rm -it --gpus all -p 127.0.0.1:8080:8080 -p 127.0.0.1:8081:8081 --name mar -v $(pwd)/resnet-152-batch_v2.mar:/home/model-server/model-store/resnet-152-batch_v2.mar -v $(pwd)/config.properties:/home/model-server/config.properties  pytorch/torchserve:latest-gpu
  • Perform inference with kitten.jpg image:
curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg
curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
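`curl -T` uploads the file body with an HTTP PUT; the same request can be built with Python's standard library. A sketch, assuming the TorchServe container from the steps above is running on localhost:8080:

```python
import urllib.request

def build_prediction_request(image_bytes: bytes) -> urllib.request.Request:
    # Equivalent of: curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
    return urllib.request.Request(
        "http://localhost:8080/predictions/resnet-152-batch_v2",
        data=image_bytes,
        method="PUT",
    )

# With the server up, sending the request returns the JSON scores:
# with open("kitten.jpg", "rb") as f:
#     with urllib.request.urlopen(build_prediction_request(f.read())) as resp:
#         print(resp.read().decode())
```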

Inference Output:

{
  "tiger_cat": 0.5798614621162415,
  "tabby": 0.38344162702560425,
  "Egyptian_cat": 0.0342114195227623,
  "lynx": 0.0005819813231937587,
  "quilt": 0.000273319921689108
}
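The response is plain JSON mapping class labels to scores, so picking the top-1 prediction programmatically is straightforward. A small sketch over the output above:

```python
import json

# The JSON body returned by the predictions endpoint above.
response = '''{
  "tiger_cat": 0.5798614621162415,
  "tabby": 0.38344162702560425,
  "Egyptian_cat": 0.0342114195227623,
  "lynx": 0.0005819813231937587,
  "quilt": 0.000273319921689108
}'''
scores = json.loads(response)
top_label = max(scores, key=scores.get)  # class with the highest score
```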
