SGEMM broken with 1.6.2 on Intel ARC #533

Closed
0cc4m opened this issue Feb 20, 2024 · 24 comments

Comments

@0cc4m

0cc4m commented Feb 20, 2024

output_1.6.1.txt
output_1.6.2.txt

In 1.6.1 SGEMM gives the correct results on Intel ARC (tested on an A770 on Arch Linux), but with 1.6.2 the results are wrong.

Let me know if more data is needed.

@CNugteren
Owner

This is most likely because of #503, which added new tuning results for the A770, as shared here. Apparently they are not (always?) valid, or there is a bug in either of your A770 drivers.
I think the best path forward is to remove those tuning results for now, and then later optionally re-tune on @0cc4m's set-up and see if that leads to errors as well.

@tangjinchuan

Dear Cedric,
I have put the tuning results based on my A770 card today for you to look at or update.
#1 (comment)

@CNugteren
Owner

> I have put the tuning results based on my A770 card today for you to look at or update.

Thanks, but I don't have an A770 to test it. Could you or @0cc4m apply those tuning results, re-compile the library, and then run the tests?

@tangjinchuan

I suggest @0cc4m give it a try to check urgent test cases.

@0cc4m
Author

0cc4m commented Feb 23, 2024

No, that tuning doesn't work for me. I even tuned it myself and it still fails the tests. It seems the Linux driver (mesa version 24.0.1) for A770 has some issue that means the results are wrong when CLBlast is tuned. Very odd.

@tangjinchuan

I see. Please try installing the Intel compute runtime if you have not already. The tuning is optimized for the Intel NEO platform, not the open-source Mesa platform.
https://github.com/intel/compute-runtime
The command would be similar to
apt-get install intel-opencl-icd
Also, please provide clinfo if the problem persists.
The Intel team told me that Linux and Windows use the same codebase (https://community.intel.com/t5/GPU-Compute-Software/Driver-problem-with-OpenCL-kernel/m-p/1362525), hence the tuning should work properly.
Let's see if any goodness comes out of this.

@0cc4m
Author

0cc4m commented Feb 24, 2024

@tangjinchuan Sorry, my bad. Somehow I had it in my head that the Intel OpenCL driver is in mesa, but of course it's separate. I'm using intel-compute-runtime 23.48.27912.11-1.

@tangjinchuan

@0cc4m Please give the latest version [24.05.28454.6] a go: https://github.com/intel/compute-runtime/releases, all the install steps are there. If the problem persists, try rebooting the PC; I once had such a problem after updating on Windows. Meanwhile, it would be very helpful to have a test case for a two-matrix multiplication (just two matrices that trigger the problem) so that I can see whether it produces the same problem on my card. Locating the problem inside an LLM inference framework would be too much for me.
By the way, thanks for your contributions to llama.cpp.

@0cc4m
Author

0cc4m commented Feb 25, 2024

> @0cc4m Please give the latest version [24.05.28454.6] a go: https://github.com/intel/compute-runtime/releases, all the install steps are there. If the problem persists, try rebooting the PC; I once had such a problem after updating on Windows. Meanwhile, it would be very helpful to have a test case for a two-matrix multiplication (just two matrices that trigger the problem) so that I can see whether it produces the same problem on my card. Locating the problem inside an LLM inference framework would be too much for me. By the way, thanks for your contributions to llama.cpp.

I'm not using Windows, the issue is persistent across reboots and I'm not using llama.cpp to test it. I'm just running clblast_test_xgemm as described in how to test the CLBlast library for correctness.

It still fails on intel-compute-runtime version 24.05.28454.6 once tuned for A770. It works fine if not tuned.

@tangjinchuan

Thanks! I will give the test case a try tomorrow. In the meantime, any Windows users can also give it a go.
clblast_test_xgemm.zip

@tangjinchuan

@CNugteren Could you please have a look at the results? It only produced ":". Is this correct? I used openBLAS as the comparison library, and the executable did not show any test cases the way @0cc4m's output does.
A770 test.txt

@tangjinchuan

tangjinchuan commented Feb 26, 2024

@0cc4m I remember that one test update between 1.6.1 and 1.6.2 is here: afb3d8a
If you were previously using 1.6.2 with the newly tuned results, I guess this is worth trying: replace the xgemm folder of 1.6.1 with the one from 1.6.2, so that 1.6.1 gets the xgemm tuning results. 6e2ab6e#diff-aad42a9516d39f8f4e703348e73e0de9a15f068041676d7a3d32510864abb9d9
Then test 1.6.1 to see whether the problem is related to the updated preprocessor.
Sorry that I could not do this myself; it is getting late here.

@CNugteren
Owner

> @CNugteren Could you please have a look at the results? It only produced ":". Is this correct? I used openBLAS as the comparison library, and the executable did not show any test cases the way @0cc4m's output does. A770 test.txt

No, that is not testing anything. I think it just crashes silently? The test results should look like output_1.6.1.txt in the first message in this thread.

BTW, it could also be that there is a bug in CLBlast or in the tuner. If @0cc4m has time, perhaps he/she could change some values in the tuning database, re-compile, and re-run the tests. E.g. starting from the GEMM values themselves in https://github.com/CNugteren/CLBlast/pull/503/files#diff-aad42a9516d39f8f4e703348e73e0de9a15f068041676d7a3d32510864abb9d9R164. If we know which line in that PR causes the issues, then we are one step further. Then, next we could play with individual values of the tuning parameters, and try to find out which combinations of parameters are invalid, and which ones are still valid. But that might be some work.
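
The narrowing-down described here could be sketched as a bisection over the changed database entries. This is only an illustration under stated assumptions: `tests_fail` is a hypothetical callback that would apply a subset of entries, re-compile CLBlast, and run the tests; the entry names are placeholders; and the sketch assumes that exactly one entry is responsible. None of this is part of CLBlast.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_culprit(entries: Sequence[T],
                 tests_fail: Callable[[Sequence[T]], bool]) -> T:
    """Return the single entry whose presence makes tests_fail() return True.

    Assumes exactly one bad entry, and that tests_fail(subset) is True
    if and only if the subset contains that entry.
    """
    candidates = list(entries)
    if not tests_fail(candidates):
        raise ValueError("tests pass with all entries applied; nothing to bisect")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if tests_fail(half) else candidates[len(half):]
    return candidates[0]


# Stand-in for "apply entries, re-compile, run the tests" (hypothetical).
changed = ["Xgemm", "XgemmDirect", "Xgemv", "Copy", "Pad", "Transpose"]
bad = "XgemmDirect"  # placeholder, not a finding from this thread
culprit = find_culprit(changed, lambda subset: bad in subset)
print(culprit)
```

With N changed entries this needs roughly log2(N) rebuild-and-test cycles. If several entries interact, the single-culprit assumption breaks and a more careful search over combinations would be needed.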

@tangjinchuan

The Xe laptop used to compile the test cases also only produces one ":". I guess the problem is not specific to the A770.

@CNugteren
Owner

> The Xe laptop used to compile the test cases also only produces one ":". I guess the problem is not specific to the A770.

It could be an issue with your reference library (your CBLAS library). Perhaps you can open a separate issue for this, but it might not be CLBlast related. Please report the output of the test binary with the -verbose flag given and give some details about the reference library and your system.

@tangjinchuan

tangjinchuan commented Feb 27, 2024

@CNugteren I mistook the tuner for a test case. It still does not work with either openBLAS or clBLAS.

  • Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -full_test [false]
    -verbose [false]
    -clblas 1 [=default]

  • Running on OpenCL device 'Intel(R) Iris(R) Xe Graphics'.

  • Starting tests for the 'SGEMM' routine. Legend:
    : -> Test produced correct results
    . -> Test returned the correct error code
    X -> Test produced incorrect results
    / -> Test returned an incorrect error code
    \ -> Test not executed: OpenCL-kernel compilation error
    o -> Test not executed: Unsupported precision
    - -> Test not completed: Reference CBLAS doesn't output error codes
  • Testing with error margins of 0.5% (relative) and 0.001 (absolute)

  • Testing 'regular behaviour' for '101 (row-major) 111 (regular) 111 (regular)':
    :
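
The margins reported in this log (0.5% relative, 0.001 absolute) suggest a comparison along the following lines. This is a sketch of one common way to combine two such margins, assumed for illustration; it is not taken from CLBlast's test code, and the function name `is_close` is hypothetical.

```python
def is_close(result: float, reference: float,
             rel_margin: float = 0.005, abs_margin: float = 0.001) -> bool:
    """Accept a value if it is within the absolute OR the relative margin."""
    diff = abs(result - reference)
    if diff <= abs_margin:
        # The absolute margin covers values close to zero,
        # where a relative comparison would be too strict.
        return True
    scale = max(abs(result), abs(reference))
    return diff <= rel_margin * scale


print(is_close(100.0, 100.4))  # difference below 0.5% of the larger value
print(is_close(100.0, 101.0))  # difference of about 1%
print(is_close(0.0, 0.0005))   # difference below the absolute margin
```

Under this assumed rule, a test would be marked ":" only if every output element passes the check, and "X" otherwise.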

@CNugteren
Owner

@tangjinchuan Perhaps you can open a separate issue for this. Please report the output of the test binary with the -verbose flag given and give some details about the reference library and your system.

@tangjinchuan

tangjinchuan commented Feb 27, 2024

@CNugteren It seems that A770 has problems with versions 153 -162. I only tried these 3 versions.
F_2.zip

In my own app, I only use SGEMM for a plain A*B product, with the other parameters set to match that case. As a result, I did not notice anything wrong. I would also like to mention that Intel has 32-bit support, while the long type is handled by emulation.
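
The test log earlier in this thread labels its configuration as '101 (row-major) 111 (regular) 111 (regular)', so a plain A*B call exercises only one combination of the layout and transpose options. As a rough illustration of the other cases, here is a minimal CPU reference. It assumes the CBLAS-style codes 111 (no transpose) and 112 (transpose), row-major storage, alpha = 1 and beta = 0; `ref_gemm` is a hypothetical helper, not a CLBlast function.

```python
NO_TRANS, TRANS = 111, 112  # CBLAS-style transpose codes (assumed here)


def ref_gemm(m, n, k, a, b, a_trans=NO_TRANS, b_trans=NO_TRANS):
    """Return C = op(A) * op(B) as a flat row-major list of m*n values.

    a and b are flat row-major lists. op(A) is m x k and op(B) is k x n,
    so A is stored as k x m when a_trans == TRANS (likewise for B).
    """
    def a_at(i, p):
        return a[i * k + p] if a_trans == NO_TRANS else a[p * m + i]

    def b_at(p, j):
        return b[p * n + j] if b_trans == NO_TRANS else b[j * k + p]

    return [sum(a_at(i, p) * b_at(p, j) for p in range(k))
            for i in range(m) for j in range(n)]


# A is 2x3 and B is 3x2, both row-major.
a = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]
b = [7.0, 8.0,
     9.0, 10.0,
     11.0, 12.0]
c_plain = ref_gemm(2, 2, 3, a, b)

# The same product, but with A supplied pre-transposed (stored as 3x2).
a_t = [1.0, 4.0,
       2.0, 5.0,
       3.0, 6.0]
c_trans = ref_gemm(2, 2, 3, a_t, b, a_trans=TRANS)
print(c_plain, c_trans)
```

Both calls describe the same mathematical product, so a GPU result for a transposed configuration could be compared element by element against such a reference.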

@tangjinchuan

Dear all,
I have filed a report to seek help from Intel.
https://community.intel.com/t5/GPU-Compute-Software/bd-p/gpu-compute-software

@CNugteren
Owner

@tangjinchuan Thanks for testing and filing a report.

@0cc4m Thanks for reporting the issue. Since you didn't provide any further debug info and since I don't have an A770 myself, I decided to simply revert the tuning results for now and add a note to the README about this. See #539.

I'll leave this issue open if there is anyone who wants to debug this further.

@CNugteren
Owner

This issue is very likely solved with #543. If anyone with an A770 has the time to build CLBlast and the tests from source and run them, that would be great.

@0cc4m
Author

0cc4m commented May 27, 2024

> This issue is very likely solved with #543. If anyone with an A770 has the time to build CLBlast and the tests from source and run them, that would be great.

Thank you, I'll test it within the next few days if no one else does it first.

@tangjinchuan

That's wonderful.
Please do test it. I would like to test it, but I am recovering from a heavy flu.

@0cc4m
Author

0cc4m commented Jun 2, 2024

@CNugteren I built the latest master branch and ran clblast_test_xgemm. I don't see any reported errors anymore.

output_master.txt
