inference onnx error #110
Hmm, yeah, it seems like this is some mismatch between the ONNX opset used during export and the one used at inference time. The release notes for the Triton container version that we use seem to indicate that they're using onnxruntime 1.10.0, which onnx's version documentation suggests should support opset 13 (which Torch uses by default), so this is a bit bizarre. Could be worth loading the exported model directly to confirm which opset it actually got exported with.

Then again, one interesting question is: does this happen on the very first inference, or is Triton able to process a few inputs before this happens?
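In case it helps, here's a minimal sketch of that kind of check using the `onnx` and `onnxruntime` Python packages; the file name `bbhnet.onnx` and the `(1, 2, 2048)` input shape are placeholder assumptions, not the project's real export settings.

```python
# Sketch: report the opset(s) the exported model declares, then try a local
# CPU inference outside of Triton as a sanity check.
# "bbhnet.onnx" and the input shape are placeholders.
import numpy as np
import onnx
import onnxruntime as ort

model = onnx.load("bbhnet.onnx")
onnx.checker.check_model(model)
for opset in model.opset_import:
    print(opset.domain or "ai.onnx", opset.version)

session = ort.InferenceSession("bbhnet.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
x = np.random.randn(1, 2, 2048).astype(np.float32)
print([out.shape for out in session.run(None, {input_name: x})])
```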
So in the export project I actually hard-coded opset version 15, which I think is the most recent. This could explain the discrepancy. I'll investigate exporting the model with 13, and if that doesn't work I'll play around with loading it in.
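For what it's worth, the opset is just an argument to `torch.onnx.export`; here's a hedged sketch of re-exporting with opset 13, using a tiny stand-in network and an assumed input shape rather than the project's real model and export code.

```python
# Sketch of exporting with an explicit opset. The Sequential model, input
# shape, and tensor names below are stand-ins, not the real BBHNet setup.
import torch

model = torch.nn.Sequential(
    torch.nn.Conv1d(2, 8, kernel_size=7),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool1d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 1),
).eval()

dummy_input = torch.randn(1, 2, 2048)
torch.onnx.export(
    model,
    dummy_input,
    "bbhnet.onnx",
    opset_version=13,  # match what the Triton container's onnxruntime supports
    input_names=["hoft"],
    output_names=["prob"],
    dynamic_axes={"hoft": {0: "batch"}, "prob": {0: "batch"}},
)
```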
I found this related issue that fixed the problem by adding
Ah ok this is a good find and should be functionality to add into
Some statistics on the first run:
Ok cool, a couple of questions:
There are other directions we can go to make things faster if all this doesn't reveal a bug (use TensorRT, finally implement batching on the snapshotter, etc.), but that's slow enough that I'm wondering if there's some easier, lower-hanging fruit we need to pick off first.
That 7 hours includes data loading time as well. However, with only 5 timeslides of 13000 seconds, I can't imagine the data loading time being the major bottleneck? The GPUs on ldas-pcdev11:
If the metric for GPU utilization is the
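In case it's useful for pinning down utilization: Triton exposes GPU utilization and per-model inference timings as Prometheus metrics (on port 8002 by default); here's a quick sketch of scraping them, assuming the server is reachable on localhost.

```python
# Sketch: scrape Triton's Prometheus metrics endpoint and print GPU
# utilization and request-latency lines. localhost:8002 is the default
# metrics port and is an assumption about how the server was launched.
import urllib.request

text = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()
for line in text.splitlines():
    if line.startswith(("nv_gpu_utilization", "nv_inference_request_duration_us")):
        print(line)
```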
Ah ok, this is all super interesting. Agreed it can't be data loading; would be curious to see the stats on a V100 if you can get your hands on one (if you can't get one on CIT, we can get you access to the ones we use for DeepClean). Unfortunately there's a fair amount of ugly GPU-related stuff to look at that I'll probably have to dig into when I get back online. Besides the obvious fact that waiting 7 hours for results sucks, to what extent is this gonna bottleneck you over the next week or two? Also if you have a chance and can share the
Got it. I see a V100 on pcdev10 I can try. Would also be nice to get access to the dedicated DeepClean ones. I don't think this will be too much of a bottleneck. Still have to work on converting the
I did not see a
Ah no, that's my bad, you can pass a `log_file` argument to `serve`:

```python
server_log_file = Path(log_file).parent / "server.log"
with serve(..., log_file=server_log_file):
    ...
```

in the
The
Ahhh ok, I think I know what's going on (though I'll have a better idea when I have a chance to look at the logs). TensorFlow (which is what implements the Snapshotter) dynamically tries to find the fastest kernel for each op at inference time. This means the first couple of inferences can take a while; not that long typically, but since we're overloading the system with requests while it's searching, and those requests are competing for resources, it's probably slowing everything to a crawl. In principle Triton is supposed to warm up this model, but I haven't observed that it does, or that it makes much of a difference if it does. A pretty straightforward fix should be to make some warm-up requests up front to each of the snapshotter models, to get them to settle on a fast kernel without overloading them. I can implement this quickly if I find some time. Longer term, I want to move away from using TensorFlow for the snapshotter in favor of Triton's more recent native state support, both for this reason and to drop the extra, unnecessary dependency; this is a Hermes issue.
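A rough sketch of that warm-up idea, assuming a gRPC client on the default port; the model name, input name, and shape are placeholders rather than the project's real snapshotter config.

```python
# Sketch: send a handful of dummy requests to a streaming model so TensorFlow
# can settle on its kernels before the real load starts. Model name, input
# name, shape, and server address are all placeholder assumptions.
import numpy as np
import tritonclient.grpc as triton

client = triton.InferenceServerClient("localhost:8001")

def warm_up(model_name, input_name, shape, n_requests=10):
    for _ in range(n_requests):
        x = np.random.randn(*shape).astype(np.float32)
        infer_input = triton.InferInput(input_name, list(x.shape), "FP32")
        infer_input.set_data_from_numpy(x)
        client.infer(model_name, [infer_input])

warm_up("snapshotter", "stream", (1, 2, 4096))
```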
Running into the following issue when performing inference:
RuntimeError: in ensemble 'bbhnet-stream', onnx runtime error 1: Non-zero status code returned while running FusedConv node. Name:'Conv_203_Add_204_Relu_205' Status Message: CUDNN error executing cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data)
Whichever onnxruntime version we are using doesn't seem to like the convolutional layer?