
Native API returns: -996 invalid_kernel("uses-fp64-math") #285

Closed

JustSomeRandomUsername opened this issue Jan 18, 2023 · 18 comments

Labels: ARC (ARC GPU), Crash (Execution crashes)

@JustSomeRandomUsername

I am trying to train a YOLOv7 model on my A770 on Ubuntu 22.04. I have oneAPI and Intel Extension for PyTorch installed, and I have trained a different model successfully.

But with this model I am getting:

"RuntimeError: Native API failed. Native API returns: -996 (Function exists but address is not available)
invalid_kernel("uses-fp64-math")
-996 (Function exists but address is not available)"

I am running the code from https://github.com/WongKinYiu/yolov7 with slight modifications for running on the xpu device:

- import intel_extension_for_pytorch
- changed the device to "xpu"
- model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)

I am still a bit new to working on neural nets. Am I missing something obvious, or is this an Intel-specific driver problem?
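
For context, here is a minimal sketch of that kind of modification, with a toy nn.Linear standing in for the YOLOv7 model (the model, shapes, and learning rate here are illustrative, not from the repo):

import torch
import intel_extension_for_pytorch as ipex

# Toy stand-in for the YOLOv7 model; the actual repo code differs.
model = torch.nn.Linear(8, 2).to('xpu')
model.train()  # ipex.optimize expects training mode when an optimizer is passed
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Same call as described above.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)

x = torch.randn(4, 8, device='xpu')
loss = model(x).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()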

@jingxu10
Contributor

Thanks for reporting the issue. We will try to reproduce it.

@fredlarochelle

fredlarochelle commented Feb 8, 2023

Here is a quick way to reproduce the problem; it works fine on cpu, but not on xpu:

import torch
import intel_extension_for_pytorch as ipex

device = torch.device('xpu' if torch.xpu.is_available() else 'cpu')
print(f'Using device: {device}, named {torch.xpu.get_device_name(0)}.')

# float16, bfloat16 and float32 get the same error...
x = torch.randn([100], dtype=torch.float16, device=device)

print(x * 0.5)

Btw, it would be nice if torch.xpu.get_device_name(0) returned the name of the GPU instead of the ID!

@jingxu10
Contributor

jingxu10 commented Feb 8, 2023

@fredlarochelle Thanks for sharing the reproducer.
By the way, could you share the output of that device name print? Also, what is your GPU? Arc? Are you using 1.13.10+xpu or 1.10?
Running the script in my environment shows Using device: xpu, named Intel(R) Data Center GPU Flex Series 170 [0x56c0].

@fredlarochelle

fredlarochelle commented Feb 8, 2023

I am using an A770 and when I call the function get_device_name(), it returns Intel(R) Graphics [0x56a0]. Running my script, it returns Using device: xpu, named Intel(R) Graphics [0x56a0]..

I am using 1.13.10+xpu, on Python 3.10.6 (the rest is the exact setup in the latest installation guide).

@jingxu10
Contributor

jingxu10 commented Feb 9, 2023

The reason this error is triggered by print is that the double data type is used internally by print, which is not supported by the GPU hardware. Please move the tensor to the CPU first, then print.

import torch
import intel_extension_for_pytorch as ipex

device = torch.device('xpu' if torch.xpu.is_available() else 'cpu')
print(f'Using device: {device}, named {torch.xpu.get_device_name(0)}.')

# float16, bfloat16 and float32 get the same error...
x = torch.randn([100], dtype=torch.float16, device=device)
y = x * 0.5
y = y.to('cpu')
print(y)

Regarding the device name, it is retrieved entirely from the driver. May I know your driver info, like the version and/or how you installed it?

@fredlarochelle

That doesn't seem to solve the problem; running your code example, I still get the same error. Here is the full output:

[screenshot of the full error output]

But you are right that it seems related to the print function, or actually to the conversion to string. Trying print(x) gives the same error, and replacing print() with str() also gives the same error. But, digging around a bit, I found a weird case:

import torch
import intel_extension_for_pytorch as ipex

# works fine
x = torch.ones([10], dtype=torch.float16, device='cpu')
x.to('xpu')
x.to('cpu')
print(x)

# doesn't work
x = torch.ones([10], dtype=torch.float16, device='xpu')
x.to('cpu')
print(x)

# but casting to integer works?
x = torch.ones([10], dtype=torch.float16, device='xpu')
x.to('cpu')
print(x.int())

Here is another weird example:

import torch
import intel_extension_for_pytorch as ipex

# works fine
torch.arange(10, device='cpu')

# works fine
torch.arange(10, device='xpu')

# works fine
torch.arange(10, dtype=torch.float32, device='cpu')

# doesn't work
torch.arange(10, dtype=torch.float32, device='xpu')

About the device name, I just checked with XPU Manager and it too can't retrieve the full name. It seems to be a problem upstream with the driver, not with ipex. For the driver install, I followed the exact instructions here, and my driver version according to clinfo is 22.49.25018.21.

@jingxu10
Contributor

jingxu10 commented Feb 9, 2023

Oh, sorry. I'll correct it in the code snippet as well.

...
y = y.to('cpu')
print(y)

I'll check the arange one.

@jingxu10
Contributor

jingxu10 commented Feb 9, 2023

The arange one works on my side.
@ashokei @sanchitintel, could you double-check on your side?

@jingxu10
Contributor

jingxu10 commented Feb 9, 2023

Hi @JustSomeRandomUsername, since your workload is a training task, double precision could be used internally in this case. Arc GPUs don't support double precision at the hardware level. Currently we don't recommend running training on Arc and ATS-M GPUs.

@fredlarochelle

fredlarochelle commented Feb 9, 2023

Oh yeah, my bad, I didn't catch it either! I can confirm that it works on my side too.

And just to make sure, I double-checked, and I still get the -996 error for the arange one.

A note should be added somewhere in the documentation; all of this is different behavior from "standard" PyTorch with CUDA, where you don't need to transfer the tensor back to the CPU to print it.
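
For reference, a corrected version of the earlier "weird case" snippet; the missing piece, as spotted above, is that Tensor.to() returns a new tensor rather than modifying the tensor in place:

import torch
import intel_extension_for_pytorch as ipex

x = torch.ones([10], dtype=torch.float16, device='xpu')
x = x.to('cpu')  # assign the result; .to() is not in-place
print(x)         # prints normally once the tensor is on the CPU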

@jingxu10
Contributor

jingxu10 commented Feb 9, 2023

Exactly. Thanks for the advice. We will document this as a known issue.

@JustSomeRandomUsername
Author

@jingxu10 It seems likely that the code uses doubles somewhere; I'll check whether removing the use of doubles solves my problem. I had never heard that training is not recommended on Arc. Is this temporary? Do you expect training on Arc to get simpler with software updates?

@jingxu10
Contributor

jingxu10 commented Feb 9, 2023

To clarify this statement: compared to inference, training is more likely to invoke the double precision data type. Since the hardware itself doesn't support double, it is OK if the training workload doesn't depend on the usage of double. If it does, there will be accuracy issues.
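
A minimal sketch of the kind of guard this implies, assuming fp32 precision is acceptable for the workload (this only covers Python-level tensors; the -996 error can also come from fp64 used internally by an operator):

import torch
import intel_extension_for_pytorch as ipex

def to_xpu_fp32(t):
    # Hypothetical helper: downcast doubles before moving to an Arc GPU,
    # since the hardware has no fp64 support.
    if t.dtype == torch.float64:
        t = t.float()
    return t.to('xpu')

x = to_xpu_fp32(torch.randn(4, dtype=torch.float64))
print(x.to('cpu'))  # move back to the CPU before printing, as discussed above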

@tripzero

I'm seeing this using BoostingMonocularDepth with torch._C._nn.upsample_bilinear2d and an Arc A770. Isn't there a way to emulate fp64? An environment variable, perhaps?

@fredlarochelle

Here is a quick reproducer:

import torch
import intel_extension_for_pytorch as ipex

X = torch.rand(9, dtype=torch.float32, device='xpu').reshape((1, 1, 3, 3))
up = torch.nn.UpsamplingBilinear2d((9, 9))
X_up = up(X)

Also, the function upsample_bilinear2d_out_frame() is not implemented for BFloat16. Is there any plan to implement it, or is it related to pytorch/pytorch#88536?

And the error @tripzero is getting is probably related to all the int64_t in that function. Is there any plan to diverge from PyTorch and reimplement the operators that use fp64 math in fp32, or is emulating fp64 the way you are planning to go forward?

@tripzero

Looks like setting IGC_EnableDPEmulation=1 doesn't help, though you would think it would...
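
One possible pitfall, as an assumption on my part: these compute-runtime flags are read at driver initialization, so they have to be in the environment before torch and ipex load, and Intel's compute-runtime documentation pairs IGC_EnableDPEmulation with OverrideDefaultFP64Settings. A sketch:

import os

# Assumption: both flags must be set before the driver is initialized,
# i.e. before torch / intel_extension_for_pytorch are imported.
os.environ['OverrideDefaultFP64Settings'] = '1'
os.environ['IGC_EnableDPEmulation'] = '1'

import torch
import intel_extension_for_pytorch as ipex

x = torch.arange(10, dtype=torch.float32, device='xpu')  # the op that raised -996 above
print(x.to('cpu'))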

@jingxu10 added the ARC (ARC GPU) and Crash (Execution crashes) labels on Feb 26, 2023
@billxc

billxc commented Mar 18, 2023

> The reason this error is triggered by print is that the double data type is used internally by print, which is not supported by the GPU hardware. Please move the tensor to the CPU first, then print.

Is it possible to wrap the _str function in a future release so that we can use the print function safely?

@jingxu10
Contributor

It is fixed in the latest code. Would you check out the xpu-master branch and use the compile_bundle.sh script under the scripts folder to build a binary with the latest code base? Thank you.
