Compilation takes a long time and requires huge memory with large input images, error with yolov5l6 #434
Comments
Thanks for reporting the issue. We will take a look.
@jeffhataws for your information, I am using Deep Learning AMI (Ubuntu 18.04) Version 61.0.
@hadilou , I tried Deep Learning AMI (Ubuntu 18.04) Version 61.1 and was able to compile and infer 2048x2048 yolov5 using the preinstalled versions and scripts from #253
Setup:
Compilation script:
Inference script:
I will try to run with your versions of Neuron packages and see if there's a difference. In the meantime, please try "sudo rmmod neuron; sudo modprobe neuron" and rerun to see if that helps.
I am still unable to reproduce the error after updating to the latest packages like you have (except for torch/torchvision, for which I don't have cu113 versions), and using the scripts above.
If you are able to reproduce the error with the scripts above, please send the compiled model file "model_converted_2k.pt" to aws-neuron-support@amazon.com. Please also send us the instance ID of your instance. Thanks.
Hi @jeffhataws. Thank you for the replies. I will get back to you next week.
I timed the compilation script for 2048x2048 image size and see only ~5 minutes on inf1.6xlarge: real 5m23.163s
If I use the trace options you have above (verbose="Debug", dynamic_batch_size=True, etc.) the time went up a little: real 6m46.196s
On c5.4xlarge, I see the following time for compilation with 2048x2048 image size and no additional trace options (memory usage during compilation goes up to about 4.6GB resident memory): real 9m1.348s
Perhaps you can check the model size you have. I see: "YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients"
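(Side note: compile time and peak resident memory can also be checked from inside Python. This is only a minimal sketch with a small stand-in model; the actual scripts referenced above would use the YOLOv5 model and a 1x3x2048x2048 dummy image.)
import time
import resource

import torch
import torch.neuron  # requires the torch-neuron package (e.g. on the DLAMI)

# Stand-in model and input; substitute the YOLOv5 model and the real dummy image.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
fake_image = torch.zeros(1, 3, 224, 224)

start = time.time()
model_neuron = torch.neuron.trace(model, example_inputs=[fake_image])
elapsed = time.time() - start

# On Linux, ru_maxrss is reported in kilobytes.
peak_rss_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)
print(f"compile time: {elapsed:.1f} s, peak resident memory: {peak_rss_gb:.1f} GB")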
I believe I am using yolov5l6. It has 476 layers, 76126356 parameters, 0 gradients.
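(For anyone else checking which variant they are running, a quick way to confirm the parameter count; the torch.hub entry point below assumes the public ultralytics/yolov5 repo and may differ slightly between yolov5 releases.)
import torch

# Pull the yolov5l6 checkpoint via torch.hub (downloads weights on first use).
model = torch.hub.load('ultralytics/yolov5', 'yolov5l6')

# Rough size check: roughly 76M parameters for yolov5l6 versus roughly 7M for yolov5s,
# which explains the large difference in compile time and memory.
n_params = sum(p.numel() for p in model.parameters())
print(f"yolov5l6: {n_params} parameters")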
Thanks @hadilou. With yolov5l6 I was able to compile (151 minutes on c5n.18xlarge) and also reproduced the failure that you see ("inference timeout") on inf1.6xlarge. We will take a look.
Nice, will be waiting. Thanks
Hi @hadilou , Thanks for filing the issue. We have identified the problem and will fix it in a future release of Neuron SDK. In the meantime, to unblock you for the current production version of Neuron SDK, please disable inplace updates with the following version of compile code which enables successful compilation and inference:
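(The exact snippet is not preserved in this extract. As a rough sketch only, one way to disable inplace updates in yolov5 before tracing is to clear every `inplace` flag the model exposes, notably on the Detect head; the attribute walk below is based on the public yolov5 code, not the verbatim fix that was posted.)
import torch
import torch.neuron

# Load yolov5l6 without the AutoShape wrapper so the raw detection model is traced.
model = torch.hub.load('ultralytics/yolov5', 'yolov5l6', autoshape=False).eval()

# Turn off inplace tensor updates wherever yolov5 exposes an `inplace` flag
# (notably the Detect head), so the traced graph avoids inplace ops.
for m in model.modules():
    if hasattr(m, 'inplace'):
        m.inplace = False

fake_image = torch.zeros(1, 3, 2048, 2048)
model_neuron = torch.neuron.trace(model, example_inputs=[fake_image])
model_neuron.save('model_converted_2k.pt')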
Please let us know if it works for you.
Hi @jeffhataws Thanks for the update. I will try it out this week and let you know if it works for me.
I can confirm this solves the issue, together with issue #435. Closing this issue. Thanks @jeffhataws
Hi @jeffhataws Below is the error message for 7680x7680 input images with 16 neuron cores and batch size 1 on a c5.24xlarge instance.
Envs:
My guess is that the error is related to the memory requirements of the compiler, but it would be nice if you could have a look at it. Lastly, I was able to compile the network for 3840x3840 images. Thanks :)
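(For context, the compiled shape is fixed by the dummy input handed to the tracer, so each resolution is a separate compilation whose memory footprint grows with the input area. A rough, illustrative sketch of the per-resolution trace calls; file names and sizes are placeholders, and the 7680 case is the one that fails here.)
import torch
import torch.neuron

model = torch.hub.load('ultralytics/yolov5', 'yolov5l6', autoshape=False).eval()
for m in model.modules():          # same inplace workaround as in the sketch above
    if hasattr(m, 'inplace'):
        m.inplace = False

for size in (3840, 7680):
    fake_image = torch.zeros(1, 3, size, size)  # compile-time memory grows roughly with this area
    model_neuron = torch.neuron.trace(model, example_inputs=[fake_image])
    model_neuron.save(f'model_converted_{size}.pt')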
Hi @hadilou, The fix for this issue will be added to an upcoming Neuron release. We will update this issue once it is available. Regards,
Thank you @aws-owinop. Regards,
Any news on this?
Hi hadilou, Thank you for your patience. As mentioned before, the fix for this will be in the upcoming release. We will update this issue once it is available. Thanks.
Hi, we have released a fix for this in our latest release:
Hi,
I have compiled Yolov5 following this issue. I compiled the network for 640x640 and 2048x2048 input sizes. I was able to run the predictions for 640 inputs once, then tried to run the predictions with 2048 images unsuccessfully; I got the error below. After getting this error, any predictions on 640 images resulted in the same error as well. I am using an inf1.6xlarge instance.
`2022-Jun-16 09:33:25.0186 7803:7803 ERROR TDRV:notification_consume_errors Error notifications found on NC:0; action=INFER_ERROR_SUBTYPE_MODEL; error_id=8; error string:Event double set
2022-Jun-16 09:33:25.0186 7803:7803 ERROR TDRV:model_start Ignoring errors found during model load.
2022-Jun-16 09:33:27.0187 7803:7803 ERROR TDRV:exec_consume_infer_status_notifications Missing infer_status notification: (0:0)
2022-Jun-16 09:33:27.0187 7803:7803 ERROR TDRV:exec_consume_infer_status_notifications Missing infer_status notification: (0:1)
2022-Jun-16 09:33:27.0187 7803:7803 ERROR TDRV:exec_consume_infer_status_notifications Missing infer_status notification: (0:2)
2022-Jun-16 09:33:27.0187 7803:7803 ERROR TDRV:exec_consume_infer_status_notifications Missing infer_status notification: (1:0)
2022-Jun-16 09:33:27.0187 7803:7803 ERROR TDRV:exec_consume_infer_status_notifications Missing infer_status notification: (1:1)
2022-Jun-16 09:33:27.0187 7803:7803 ERROR TDRV:exec_consume_infer_status_notifications Missing infer_status notification: (1:2)
2022-Jun-16 09:33:27.0187 7803:7803 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) inference timeout (2000 ms) on Neuron Device 0 NC 0, waiting for execution completion notification
2022-Jun-16 09:33:27.0188 7803:7803 ERROR NMGR:dlr_infer Inference completed with err: 5
Traceback (most recent call last):
File "/home/ubuntu/aws_neuron/infer.py", line 183, in
benchmark()
File "/home/ubuntu/aws_neuron/infer.py", line 150, in benchmark
models = [load_model() for _ in range(n_cores)]
File "/home/ubuntu/aws_neuron/infer.py", line 150, in
models = [load_model() for _ in range(n_cores)]
File "/home/ubuntu/aws_neuron/infer.py", line 115, in load_model
model(image)
File "/home/ubuntu/anaconda3/envs/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/torch_neuron/convert.py", line 11, in forward
_NeuronGraph_0 = getattr(self, "_NeuronGraph#0")
model = _NeuronGraph_0.model
_0 = ops.neuron.forward_v2_1([argument_1], model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return _0
Traceback of TorchScript, original code (most recent call last):
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/decorators.py(373): forward
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(546): call
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(205): run_op
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(194): call
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(217): forward
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/_trace.py(965): trace_module
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/_trace.py(750): trace
/home/ec2-user/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(183): trace
aws_neuron/convert.py(19): <module>
RuntimeError: Failed to execute the model status=5 message=Timeout Exceeded`
This is what my compilation code looks like:
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[fake_image],
    verbose="Debug",  # debug
    compiler_workdir="./neuron_work_dir/",
    dynamic_batch_size=True,
    compiler_args=['--neuroncore-pipeline-cores', str(1)],
)
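(For completeness, a hedged sketch of how the compiled artifact is typically saved and then loaded once per NeuronCore for inference; the file name and core count below are placeholders, and as far as I understand torch-neuron places successive loads on successive free NeuronCores by default, which is what the per-core load_model() loop in the traceback relies on.)
import torch
import torch.neuron  # registers the Neuron ops needed to deserialize the traced model

# Save once after tracing (placeholder file name).
model_neuron.save('model_converted_2k.pt')

# At inference time, load one copy per NeuronCore that should serve requests.
n_cores = 4  # placeholder; inf1.6xlarge exposes 16 NeuronCores
models = [torch.jit.load('model_converted_2k.pt') for _ in range(n_cores)]

image = torch.zeros(1, 3, 2048, 2048)
with torch.no_grad():
    outputs = [m(image) for m in models]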
Any help is much appreciated.