
inference onnx error #110

Closed
EthanMarx opened this issue Jul 1, 2022 · 13 comments · Fixed by #125

@EthanMarx
Collaborator

Running into the following issue when performing inference:
RuntimeError: in ensemble 'bbhnet-stream', onnx runtime error 1: Non-zero status code returned while running FusedConv node. Name:'Conv_203_Add_204_Relu_205' Status Message: CUDNN error executing cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data)

Whichever onnxruntime version we are using doesn't seem to like this convolutional layer?

@alecgunny
Collaborator

Hmm, yeah, it seems like this is some mismatch between the ONNX opset used during export and the one used at inference time. The release notes for the Triton container version we use indicate that it ships onnxruntime 1.10.0, which, per onnx's version documentation, should support opset 13 (the default used by Torch), so this is a bit bizarre. Could be worth loading the exported model with the onnxruntime Python bindings explicitly and trying to run inference with it.
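
(Something like the following, where the model path and the handling of dynamic dimensions are placeholders, not the actual export/infer code:)

import numpy as np
import onnxruntime as ort

# placeholder path to the exported model
session = ort.InferenceSession("bbhnet.onnx", providers=["CUDAExecutionProvider"])

# build a random input matching the model's first input, filling any dynamic dims with 1
inp = session.get_inputs()[0]
shape = [dim if isinstance(dim, int) else 1 for dim in inp.shape]
x = np.random.randn(*shape).astype(np.float32)

outputs = session.run(None, {inp.name: x})
print([o.shape for o in outputs])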

Then again one interesting question is: does this happen on the very first inference, or is Triton able to process a few inputs before this happens?

@EthanMarx
Collaborator Author

So in the export project I actually hard-coded opset version 15, which I think is the most recent. This could explain the discrepancy. I'll investigate exporting the model with opset 13, and if that doesn't work I'll play around with loading it in directly.
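
(Concretely, the change is just pinning opset_version in the export call; the network and input shape below are placeholders standing in for the real ones from the export project.)

import torch

model = torch.nn.Conv1d(2, 8, kernel_size=3)   # placeholder network
dummy_input = torch.randn(1, 2, 2048)          # placeholder input shape

torch.onnx.export(
    model,
    dummy_input,
    "bbhnet.onnx",
    opset_version=13,  # instead of the hard-coded 15
)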

@EthanMarx
Collaborator Author

EthanMarx commented Jul 3, 2022

I found this related issue where adding optimization {graph : {level : -1}} to the config.pbtxt fixed the problem. I manually added this and the requests are getting through. I haven't thoroughly looked into exactly how this solves the problem, but for now it seems to work... It looks like it slows down the model, though, so probably not a good long-term solution.
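
(For concreteness, the stanza as it appears in the model's config.pbtxt; as far as I can tell, level -1 disables onnxruntime's graph optimizations, which is presumably why the FusedConv node from the error goes away.)

optimization {
  graph {
    level: -1
  }
}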

@alecgunny
Collaborator

Ah ok, this is a good find and should be functionality to add into hermes.quiver; I'll take a note to open an issue about it. When I'm back online I'll see if I can actually quantify the slowdown by comparing times both on Triton and when loading the model with onnxruntime, and post that on the issue you linked, to hopefully motivate Triton to bump the onnxruntime build they're using, since that seems to be contributing to the issue.

@EthanMarx
Collaborator Author

Some statistics on the first run:
5 timeslides of 13,000 seconds each finished in ~7 hours. At an inference rate of 16 Hz that's ~210,000 inferences per slide, meaning the average inference time is ~0.025 seconds. Used 4 model instances per GPU.
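
(A rough back-of-the-envelope check of those numbers, all approximate:)

seconds_per_slide = 13_000
inference_rate_hz = 16
n_slides = 5
wall_time_s = 7 * 3600

inferences_per_slide = seconds_per_slide * inference_rate_hz   # ~208,000
total_inferences = n_slides * inferences_per_slide             # ~1.04 million
print(wall_time_s / total_inferences)                          # ~0.024 s per inference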

@alecgunny
Collaborator

Ok cool, a couple of questions:

  • Is that 7 hours just for inference, or is that taking into account other factors like data loading?
  • What kind of GPU?
  • How many streams per GPU?
  • How does GPU utilization look during inference? (Consider running watch -n 0.2 "nvidia-smi -i <gpu-index>" in a separate terminal on the same node to get a sense for it)
  • Stillwater actually has a (possibly out-of-date) tool for requesting inference metrics from Triton, including per-model latencies and possibly GPU utilization as well (I forget whether I have this implemented, but if not it's somewhere in the paper code and I can dust it off pretty quickly). If the answer to the above is that this is just inference time and it's on a proper enterprise GPU (e.g. a V100), that's really slow and it might be worth implementing this to see where the bottleneck lies; one way to pull those stats is sketched right below this list. I'll link you to the tool and example usage next time I'm at my laptop.
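
(In the meantime, a sketch of pulling per-model latency stats directly from Triton's gRPC API, not the Stillwater tool itself; the URL and model name are placeholders.)

from tritonclient import grpc as triton

client = triton.InferenceServerClient(url="localhost:8001")
stats = client.get_inference_statistics(model_name="bbhnet-stream")

for model in stats.model_stats:
    infer = model.inference_stats
    count = infer.success.count
    if not count:
        continue
    # queue time vs. compute time tells you whether requests are piling up
    # waiting for the model or the model itself is slow
    queue_ms = infer.queue.ns / count / 1e6
    compute_ms = infer.compute_infer.ns / count / 1e6
    print(f"{model.name}: queue {queue_ms:.2f} ms, compute {compute_ms:.2f} ms")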

There are other directions we can go to make things faster if all this doesn't reveal a bug (use TensorRT, finally implement batching on the snapshotter, etc.), but that's slow enough that I'm wondering if there's some lower-hanging fruit we need to pick off first.

@EthanMarx
Collaborator Author

That 7 hours includes data loading time as well. However, with only 5 timeslides of 13,000 seconds, I can't imagine the data loading being the major bottleneck?

The GPUs on ldas-pcdev11:

NVIDIA GeForce GTX 1070
NVIDIA GeForce GTX 1080
NVIDIA TITAN X (Pascal)
NVIDIA GeForce GTX 1080 Ti

If the metric for GPU utilization is the Volatile GPU-Util column from nvidia-smi, none of them ever got above 2 percent. I used 4 streams per GPU, but I also played around with using 10 and it didn't seem to increase the utilization.

@alecgunny
Collaborator

Ah ok, this is all super interesting. Agreed it can't be data loading; would be curious to see the stats on a V100 if you can get your hands on one (if you can't get one on CIT, we can get you access to the ones we use for DeepClean).

Unfortunately there's a fair amount of ugly GPU-related stuff to look at that I'll probably have to dig into when I get back online. Besides the obvious fact that waiting 7 hours for results sucks, to what extent is this gonna bottleneck you over the next week or two?

Also if you have a chance and can share the server.log file that I think should get produced, that might be helpful as well.

@EthanMarx
Collaborator Author

Got it. I see a V100 on pcdev10 I can try. Would also be nice to get access to the dedicated DeepClean ones. I don't think this will be too much of a bottleneck: I still have to work on converting the analyze-o2 project for the injections, plus a few other things. Of course we can't really make any significant claims until we're able to efficiently process enough background data, but there are still a few things that need to get done in the meantime.

I did not see a server.log file, only the infer.log created by the infer project. Is this something created by hermes or Triton? In either case, here's the path to the directory where I've been storing outputs: /home/ethan.marx/bbhnet/first_training_run

@alecgunny
Collaborator

Ah no, that's my bad: you can pass a log_file arg to hermes.aeriel.serve.serve and it will pipe Triton's logs to that file. Not sure why I'm not setting it, but if you add something like:

from pathlib import Path

server_log_file = Path(log_file).parent / "server.log"
with serve(..., log_file=server_log_file):

in the main of infer.py, that should create one (also not sure why I annotated log_file as Optional[str] instead of Optional[Path], but fixing that would marginally simplify the snippet above too).

@EthanMarx
Collaborator Author

The server.log for the latest run is located here: /home/ethan.marx/bbhnet/third_training_run/log/server.log. Also just noticed some really weird behavior: for each timeslide, the first 100,000 or so requests take about an hour, while the second 100,000 take 7 minutes.

@alecgunny
Collaborator

alecgunny commented Jul 7, 2022

Ahhh ok I think I know what's going on (though I'll have a better idea when I have a chance to look at the logs).

TensorFlow (which is what implements the snapshotter) dynamically tries to find the fastest kernel for its ops at inference time. This means the first couple of inferences can take a while; not that long typically, but since we're flooding the system with requests while it's searching, and those requests are competing for resources, it's probably slowing everything to a crawl.

In principle Triton is supposed to warm this model up, but I haven't observed that it does, or if it does, that it makes much of a difference. A pretty straightforward fix should be to make some warm-up requests up front to each of the snapshotter models (something like the sketch below) so they settle on a fast kernel before we start overloading them. I can implement this quickly if I find some time.
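
(Roughly what I have in mind, as a sketch; the model name, input name, and shape are placeholders that would come from the model config in practice.)

import numpy as np
from tritonclient import grpc as triton

client = triton.InferenceServerClient(url="localhost:8001")

def warm_up(model_name: str, input_name: str, shape: tuple, n_requests: int = 10):
    # send a handful of dummy requests serially so TensorFlow can settle on
    # fast kernels before we start flooding the server with real requests
    x = np.random.randn(*shape).astype("float32")
    infer_input = triton.InferInput(input_name, list(x.shape), "FP32")
    infer_input.set_data_from_numpy(x)
    for _ in range(n_requests):
        client.infer(model_name, inputs=[infer_input])

warm_up("snapshotter", "stream", (1, 2, 16))   # placeholder values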

Longer term, I want to move away from using TensorFlow for the snapshotter in favor of Triton's more recent native state support (that's a Hermes issue), both for this reason and to drop the extra unnecessary dependency.

@alecgunny alecgunny self-assigned this Jul 21, 2022
@alecgunny
Collaborator

  • Short term, for automation purposes: set the graph optimization level on the config in the export script (sketched below).
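
(A sketch of what that could look like in the export script until it's handled properly in hermes.quiver: just append the stanza to the config Triton reads for the ONNX model. The repository path here is a placeholder.)

from pathlib import Path

config_path = Path("/path/to/model_repo/bbhnet/config.pbtxt")  # placeholder
stanza = "\noptimization { graph { level: -1 } }\n"
if "optimization" not in config_path.read_text():
    with open(config_path, "a") as f:
        f.write(stanza)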

@alecgunny alecgunny linked a pull request Jul 21, 2022 that will close this issue