Native API returns: -996 invalid_kernel("uses-fp64-math") #285
Comments
Thanks for reporting the issue. We will try to reproduce it.
Here is a quick way to reproduce the problem; it works fine on CPU, but not on XPU:
Btw, it would be nice if
@fredlarochelle Thanks for sharing the reproducer.
I am using an A770 and when I call the function I am using 1.13.10+xpu, on Python 3.10.6 (the rest is the exact setup in the latest installation guide).
The reason for this error triggered by
import torch
import intel_extension_for_pytorch as ipex
device = torch.device('xpu' if torch.xpu.is_available() else 'cpu')
print(f'Using device: {device}, named {torch.xpu.get_device_name(0)}.')
# float16, bfloat16 and float32 get the same error...
x = torch.randn([100], dtype=torch.float16, device=device)
y = x * 0.5
y = y.to('cpu')
print(y)
Regarding the device name, it is retrieved entirely from the driver. May I know your driver info, such as the version and how you installed it?
That doesn't seem to solve the problem: running your code example, I still get the same error. Here is the full output:
But you are right that it seems related to the print function, or more precisely to the conversion to string. Trying to
Here is another weird example:
About the device name, I just checked with XPU Manager and it can't retrieve the full name either. It seems to be an upstream problem with the driver and not with ipex. For the driver install, I followed the exact instructions here and my driver version with
Oh, sorry. I'll correct it in the code snippet as well.
...
y = y.to('cpu')
print(y)
I'll check the
The |
Hi @JustSomeRandomUsername, since your workload is a training task, double precision may be used somewhere in it. Arc GPUs don't support double precision at the hardware level. Currently we don't recommend running training on Arc and ATS-M GPUs.
Oh yeah, my bad, I didn't catch it either! I can confirm that it works on my side too. And just to make sure, I double-checked and I get the -996 error for the
A note should be added somewhere in the documentation: all of this is different behavior from "standard" PyTorch with CUDA, where you don't normally need to transfer the tensor back to the CPU to print it.
Exactly. Thanks for the advice. We will document this as a known issue.
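For anyone on an affected version, the workaround that emerged above (move the tensor to the CPU, then print it) can be wrapped in a small helper. This is only a sketch: `to_host` and `safe_print` are hypothetical names, not part of PyTorch or ipex, and the helper is duck-typed on `.detach()` and `.cpu()` so that it needs no imports.

```python
def to_host(obj):
    """Return a CPU copy of a tensor-like object; pass anything else through.

    Duck-typed on purpose: anything exposing .detach() and .cpu(), as
    torch.Tensor does, is treated as a tensor.
    """
    has_detach = callable(getattr(obj, "detach", None))
    has_cpu = callable(getattr(obj, "cpu", None))
    if has_detach and has_cpu:
        return obj.detach().cpu()
    return obj


def safe_print(*args, **kwargs):
    """print() replacement that formats tensors on the CPU, not on the device."""
    print(*(to_host(a) for a in args), **kwargs)
```

With this in place, `safe_print(y)` would stand in for `print(y)` in the snippet above, so the string conversion runs on the host copy rather than on the XPU tensor.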
@jingxu10 It seems likely that the code uses doubles somewhere; I'll check whether removing them solves my problem. I have never heard that training is not recommended on Arc. Is this temporary? Do you expect training on Arc to get simpler with software updates?
To clarify this statement: compared to inference, training is more likely to invoke the double-precision data type. Since the hardware itself doesn't support double, training is OK as long as the workload doesn't depend on double. If it does, there will be accuracy issues.
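One rough way to check whether a model stores anything in double precision is to scan its parameters and buffers. The sketch below assumes `model` exposes `named_parameters()` and `named_buffers()` the way `torch.nn.Module` does; `find_fp64_tensors` is a hypothetical helper name, and dtypes are compared by their string form (`str(torch.float64)` is `"torch.float64"`) so the block has no imports.

```python
FP64_DTYPE_NAME = "torch.float64"


def find_fp64_tensors(model):
    """List the names of parameters and buffers stored in double precision.

    Only stored tensors are inspected; doubles created during the forward
    or backward pass (for example from a NumPy array) are not detected.
    """
    names = []
    for accessor in ("named_parameters", "named_buffers"):
        for name, tensor in getattr(model, accessor)():
            if str(tensor.dtype) == FP64_DTYPE_NAME:
                names.append(name)
    return names
```

If the list is non-empty, calling `model.float()` casts the floating-point parameters and buffers to float32. An empty list does not prove the workload is free of fp64, since intermediate tensors are outside what this scan sees.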
I'm seeing this using BoostingMonocularDepth with torch._C._nn.upsample_bilinear2d and Arc A770. Isn't there a way to emulate fp64? An environment variable perhaps?
Here is a quick reproducer:
Also, the function
And the error @tripzero is getting is probably related to all the
Looks like setting IGC_EnableDPEmulation=1 doesn't help, though you would think it would...
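A possible explanation, offered as an assumption rather than something confirmed in this thread: Intel's compute-runtime FAQ describes FP64 emulation as needing two variables, `IGC_EnableDPEmulation=1` and `OverrideDefaultFP64Settings=1`. The sketch below sets both from Python; `enable_fp64_emulation` is a hypothetical helper, and whether emulation actually resolves the -996 error with ipex is untested here.

```python
import os

# IGC_EnableDPEmulation is the variable mentioned in this thread.
# OverrideDefaultFP64Settings is an assumption taken from Intel's
# compute-runtime FAQ, not from this thread.
FP64_EMULATION_ENV = {
    "IGC_EnableDPEmulation": "1",
    "OverrideDefaultFP64Settings": "1",
}


def enable_fp64_emulation(env=os.environ):
    """Set the emulation variables in `env` and return the ones that changed."""
    changed = {}
    for key, value in FP64_EMULATION_ENV.items():
        if env.get(key) != value:
            env[key] = value
            changed[key] = value
    return changed
```

The call would presumably need to happen before `import torch` and `import intel_extension_for_pytorch`, so that the variables are already set when the driver initializes. Emulated double precision is expected to be much slower than native fp32.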
Would it be possible to wrap the _str function in a future release, so that we can use the print function safely?
It is fixed in the latest code. Would you access
I am trying to train a YOLOv7 model on my A770 on Ubuntu 22.04. I have oneAPI and Intel Extension for PyTorch installed, and I have trained a different model successfully.
But with this model I am getting a
"RuntimeError: Native API failed. Native API returns: -996 (Function exists but address is not available)
invalid_kernel("uses-fp64-math")
-996 (Function exists but address is not available)"
I am running the code from
https://github.com/WongKinYiu/yolov7
with slight modifications for running on xpu.
import intel_extension_for_pytorch
changed device to "xpu"
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
I am still a bit new to working with neural nets. Am I missing something obvious, or is this an Intel-specific driver problem?
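When the failure is buried in a long training log, it can help to pick out the kernel reason from the exception text. The following is a minimal sketch that parses the message format quoted above; `invalid_kernel_reason` and `is_fp64_kernel_error` are hypothetical helper names, and the assumption is that the `invalid_kernel("...")` fragment keeps this exact shape.

```python
import re

# Matches fragments such as: invalid_kernel("uses-fp64-math")
_INVALID_KERNEL = re.compile(r'invalid_kernel\("([^"]+)"\)')


def invalid_kernel_reason(exc):
    """Return the reason inside invalid_kernel("..."), or None if absent."""
    match = _INVALID_KERNEL.search(str(exc))
    return match.group(1) if match else None


def is_fp64_kernel_error(exc):
    """True if the exception reports a kernel rejected for using fp64 math."""
    return invalid_kernel_reason(exc) == "uses-fp64-math"
```

A training loop could catch `RuntimeError`, call `is_fp64_kernel_error(err)`, and print a hint that some operation in the model or data pipeline is using double precision before re-raising.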