
inference onnx error #110

Closed
EthanMarx opened this issue Jul 1, 2022 · 13 comments · Fixed by #125

@EthanMarx
Collaborator

Running into the following issue when performing inference:
RuntimeError: in ensemble 'bbhnet-stream', onnx runtime error 1: Non-zero status code returned while running FusedConv node. Name:'Conv_203_Add_204_Relu_205' Status Message: CUDNN error executing cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data)

Whichever onnxruntime version we are using doesn't seem to like this convolutional layer?

@alecgunny
Collaborator

Hmm, yeah, it seems like this is some mismatch between the ONNX opset used during export and the one used at inference time. The release notes for the Triton container version we use indicate that it ships onnxruntime 1.10.0, which, per onnx's version documentation, should support opset 13 (the default used by Torch), so this is a bit bizarre. Could be worth loading the exported model with the onnxruntime Python bindings explicitly and trying to run inference with it.
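
(Something like the following, where the model path and the handling of dynamic dimensions are placeholders, not the actual export/infer code:)

import numpy as np
import onnxruntime as ort

# placeholder path to the exported model
session = ort.InferenceSession("bbhnet.onnx", providers=["CUDAExecutionProvider"])

# build a random input matching the model's first input, filling any dynamic dims with 1
inp = session.get_inputs()[0]
shape = [dim if isinstance(dim, int) else 1 for dim in inp.shape]
x = np.random.randn(*shape).astype(np.float32)

outputs = session.run(None, {inp.name: x})
print([o.shape for o in outputs])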

Then again one interesting question is: does this happen on the very first inference, or is Triton able to process a few inputs before this happens?

@EthanMarx
Collaborator Author

So in the export project I actually hard-coded opset version 15, which I think is the most recent. This could explain the discrepancy. I'll investigate exporting the model with opset 13, and if that doesn't work I'll play around with loading it in directly.
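
(Concretely, the change is just pinning opset_version in the export call; the network and input shape below are placeholders standing in for the real ones from the export project.)

import torch

model = torch.nn.Conv1d(2, 8, kernel_size=3)   # placeholder network
dummy_input = torch.randn(1, 2, 2048)          # placeholder input shape

torch.onnx.export(
    model,
    dummy_input,
    "bbhnet.onnx",
    opset_version=13,  # instead of the hard-coded 15
)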

@EthanMarx
Collaborator Author

EthanMarx commented Jul 3, 2022

I found this related issue where adding optimization {graph : {level : -1}} to the config.pbtxt fixed the problem. I manually added this and the requests are getting through. I haven't thoroughly looked into exactly how this solves the problem, but for now it seems to work... It looks like it slows down the model, though, so probably not a good long-term solution.
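
(For concreteness, the stanza as it appears in the model's config.pbtxt; as far as I can tell, level -1 disables onnxruntime's graph optimizations, which is presumably why the FusedConv node from the error goes away.)

optimization {
  graph {
    level: -1
  }
}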

@alecgunny
Collaborator

Ah ok, this is a good find and should be functionality to add into hermes.quiver; I'll take a note to open an issue about it. When I'm back online I'll see if I can actually quantify the slowdown by comparing times both on Triton and when loading the model with onnxruntime, and post that on the issue you linked, to hopefully motivate Triton to bump the onnxruntime build they're using, since that seems to be contributing to the issue.

@EthanMarx
Collaborator Author

Some statistics on the first run:
5 timeslides of 13,000 seconds each finished in ~7 hours. At an inference rate of 16 Hz that's ~210,000 inferences per slide, meaning the average inference time is ~0.025 seconds. Used 4 model instances per GPU.
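
(A rough back-of-the-envelope check of those numbers, all approximate:)

seconds_per_slide = 13_000
inference_rate_hz = 16
n_slides = 5
wall_time_s = 7 * 3600

inferences_per_slide = seconds_per_slide * inference_rate_hz   # ~208,000
total_inferences = n_slides * inferences_per_slide             # ~1.04 million
print(wall_time_s / total_inferences)                          # ~0.024 s per inference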

@alecgunny
Collaborator

Ok cool, a couple of questions:

  • Is that 7 hours just for inference, or is that taking into account other factors like data loading?
  • What kind of GPU?
  • How many streams per GPU?
  • How does GPU utilization look during inference? (Consider running watch -n 0.2 "nvidia-smi -i <gpu-index>" in a separate terminal on the same node to get a sense for it)
  • Stillwater actually has a (possibly out-of-date) tool for requesting inference metrics from Triton, including per-model latencies and possibly GPU utilization as well (I forget whether I have this implemented, but if not it's somewhere in the paper code and I can dust it off pretty quickly). If the answer to the above is that this is just inference time and it's on a proper enterprise GPU (e.g. a V100), that's really slow and it might be worth implementing this to see where the bottleneck lies; one way to pull those stats is sketched right below this list. I'll link you to the tool and example usage next time I'm at my laptop.
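
(In the meantime, a sketch of pulling per-model latency stats directly from Triton's gRPC API, not the Stillwater tool itself; the URL and model name are placeholders.)

from tritonclient import grpc as triton

client = triton.InferenceServerClient(url="localhost:8001")
stats = client.get_inference_statistics(model_name="bbhnet-stream")

for model in stats.model_stats:
    infer = model.inference_stats
    count = infer.success.count
    if not count:
        continue
    # queue time vs. compute time tells you whether requests are piling up
    # waiting for the model or the model itself is slow
    queue_ms = infer.queue.ns / count / 1e6
    compute_ms = infer.compute_infer.ns / count / 1e6
    print(f"{model.name}: queue {queue_ms:.2f} ms, compute {compute_ms:.2f} ms")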

There are other directions we can go to make things faster if all this doesn't reveal a bug (use TensorRT, finally implement batching on the snapshotter, etc.), but that's slow enough that I'm wondering if there's some lower-hanging fruit we need to pick off first.

@EthanMarx
Collaborator Author

That 7 hours includes data loading time as well. However, with only 5 timeslides of 13,000 seconds, I can't imagine the data loading being the major bottleneck?

The GPUs on ldas-pcdev11:

NVIDIA GeForce GTX 1070
NVIDIA GeForce GTX 1080
NVIDIA TITAN X (Pascal)
NVIDIA GeForce GTX 1080 Ti

If the metric for GPU utilization is the Volatile GPU-Util column from nvidia-smi, none of them ever got above 2 percent. I used 4 streams per GPU, but I also played around with using 10 and it didn't seem to increase the utilization.

@alecgunny
Collaborator

Ah ok, this is all super interesting. Agreed it can't be data loading; would be curious to see the stats on a V100 if you can get your hands on one (if you can't get one on CIT, we can get you access to the ones we use for DeepClean).

Unfortunately there's a fair amount of ugly GPU-related stuff to look at that I'll probably have to dig into when I get back online. Besides the obvious fact that waiting 7 hours for results sucks, to what extent is this gonna bottleneck you over the next week or two?

Also if you have a chance and can share the server.log file that I think should get produced, that might be helpful as well.

@EthanMarx
Collaborator Author

Got it. I see a V100 on pcdev10 I can try. Would also be nice to get access to the dedicated DeepClean ones. I don't think this will be too much of a bottleneck: I still have to work on converting the analyze-o2 project for the injections, plus a few other things. Of course we can't really make any significant claims until we're able to efficiently process enough background data, but there are still a few things that need to get done in the meantime.

I did not see a server.log file, only the infer.log created by the infer project. Is this something created by hermes or Triton? In either case, here's the path to the directory where I've been storing outputs: /home/ethan.marx/bbhnet/first_training_run

@alecgunny
Collaborator

Ah no, that's my bad: you can pass a log_file arg to hermes.aeriel.serve.serve and it will pipe Triton's logs to that file. Not sure why I'm not setting it, but if you add something like:

from pathlib import Path

server_log_file = Path(log_file).parent / "server.log"
with serve(..., log_file=server_log_file):

in the main of infer.py, that should create one (also not sure why I annotated log_file as Optional[str] instead of Optional[Path], but fixing that would marginally simplify the snippet above too).

@EthanMarx
Collaborator Author

The server.log for the latest run is located here: /home/ethan.marx/bbhnet/third_training_run/log/server.log. Also just noticed some really weird behavior: for each timeslide, the first 100,000 or so requests take about an hour, while the second 100,000 take 7 minutes.

@alecgunny
Collaborator

alecgunny commented Jul 7, 2022

Ahhh ok I think I know what's going on (though I'll have a better idea when I have a chance to look at the logs).

TensorFlow (which is what implements the snapshotter) dynamically tries to find the fastest kernel for its ops at inference time. This means the first couple of inferences can take a while; not that long typically, but since we're flooding the system with requests while it's searching, and those requests are competing for resources, it's probably slowing everything to a crawl.

In principle Triton is supposed to warm this model up, but I haven't observed that it does, or if it does, that it makes much of a difference. A pretty straightforward fix should be to make some warm-up requests up front to each of the snapshotter models (something like the sketch below) so they settle on a fast kernel before we start overloading them. I can implement this quickly if I find some time.
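
(Roughly what I have in mind, as a sketch; the model name, input name, and shape are placeholders that would come from the model config in practice.)

import numpy as np
from tritonclient import grpc as triton

client = triton.InferenceServerClient(url="localhost:8001")

def warm_up(model_name: str, input_name: str, shape: tuple, n_requests: int = 10):
    # send a handful of dummy requests serially so TensorFlow can settle on
    # fast kernels before we start flooding the server with real requests
    x = np.random.randn(*shape).astype("float32")
    infer_input = triton.InferInput(input_name, list(x.shape), "FP32")
    infer_input.set_data_from_numpy(x)
    for _ in range(n_requests):
        client.infer(model_name, inputs=[infer_input])

warm_up("snapshotter", "stream", (1, 2, 16))   # placeholder values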

Longer term, I want to move away from using TensorFlow for the snapshotter in favor of Triton's more recent native state support (that's a Hermes issue), both for this reason and to drop the extra unnecessary dependency.

@alecgunny alecgunny self-assigned this Jul 21, 2022
@alecgunny
Collaborator

  • Short term, for automation purposes: set the graph optimization level on the config in the export script (sketched below).
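
(A sketch of what that could look like in the export script until it's handled properly in hermes.quiver: just append the stanza to the config Triton reads for the ONNX model. The repository path here is a placeholder.)

from pathlib import Path

config_path = Path("/path/to/model_repo/bbhnet/config.pbtxt")  # placeholder
stanza = "\noptimization { graph { level: -1 } }\n"
if "optimization" not in config_path.read_text():
    with open(config_path, "a") as f:
        f.write(stanza)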

@alecgunny alecgunny linked a pull request Jul 21, 2022 that will close this issue