ggml-vulkan: adds support for op CONV_TRANSPOSE_1D #13813


Open: wants to merge 4 commits into master

Conversation

etasnadi

  • ggml-vulkan: adds support for op CONV_TRANSPOSE_1D

  • test-backend-ops: adds additional tests for CONV_TRANSPOSE_1D

  • test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
@github-actions github-actions bot added the following labels on May 26, 2025: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
Number of additional tests reduced to 108.
@etasnadi
Author

etasnadi commented May 27, 2025

The Ubuntu 22 runner stopped because test-backend-ops crashes with a segfault. I've observed this behavior on an unmodified build before as well, so I don't know whether it's related to my modifications or whether they just make it more likely to happen.

Edit: when the app crashes with a segfault, it is always during exit, after the tests have been executed.

@jeffbolznv
Collaborator

It's showing failures in some of the conv_transpose_1d tests, e.g.:

29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=1,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.459058786 > 0.000000100 FAIL
29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=2,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.222932697 > 0.000000100 FAIL
29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=3,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.167964696 > 0.000000100 FAIL

Then it appears to crash on the first test after the conv_transpose_1d tests, which does seem like it's related to this change somehow.
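For context on the FAIL lines above: test-backend-ops compares the backend result against a CPU reference using a normalized mean squared error and fails the test when it exceeds a small threshold (0.000000100 in the log). A minimal sketch of such a metric in Python (illustrative only; the exact formula used by test-backend-ops may differ):

```python
def nmse(reference, observed):
    """Normalized mean squared error: sum of squared differences
    divided by the total squared magnitude of the reference."""
    num = sum((r - o) ** 2 for r, o in zip(reference, observed))
    den = sum(r ** 2 for r in reference)
    return num / den

# A bit-exact result gives NMSE = 0; anything above the test
# threshold (1e-7 in the log above) is reported as FAIL.
print(nmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```

NMSE values as large as 0.46, as in the log, indicate the output is substantially wrong rather than merely imprecise.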

@etasnadi
Author

> It's showing failures in some of the conv_transpose_1d tests, e.g.:
>
> 29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=1,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.459058786 > 0.000000100 FAIL
> 29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=2,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.222932697 > 0.000000100 FAIL
> 29:   CONV_TRANSPOSE_1D(ne_input=[2173,1,1,1],ne_kernel=[1337,1,1,1],s0=3,p0=0,d0=1): [CONV_TRANSPOSE_1D] NMSE = 0.167964696 > 0.000000100 FAIL
>
> Then it appears to crash on the first test after the conv_transpose_1d tests, which does seem like it's related to this change somehow.

That's interesting. I could finally reproduce these errors with llvmpipe, but not with the discrete GPU. I need time to investigate the root cause.

@jeffbolznv
Collaborator

Wild guess - any chance of uninitialized shared memory?

@etasnadi
Author

> Wild guess - any chance of uninitialized shared memory?

It could be, but there is also a chance that the code is correct and llvmpipe itself is the root cause of the failing tests.

I executed parameter sweeps over (Cin, K, L), and it turns out that if one parameter (or a combination of them) is large enough, the test fails. Even Cin=20000, L=64, K=4 fails, which is suspicious because Cin does not influence the logic much; Cin=20000, L=64, K=3 passes.
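To make the role of these parameters concrete, here is a minimal CPU reference for a single-output-channel 1-D transposed convolution (illustrative only; names and data layout are not ggml's actual implementation). The output length follows (L - 1)*s0 + d0*(K - 1) + 1 - 2*p0, and Cin only adds an outer accumulation loop, which matches the observation that it "does not influence the logic too much":

```python
def conv_transpose_1d(x, w, s0=1, p0=0, d0=1):
    """1-D transposed convolution, one output channel.
    x: list of Cin input rows, each of length L.
    w: list of Cin kernel rows, each of length K.
    Reference sketch only, not ggml's actual kernel."""
    L, K = len(x[0]), len(w[0])
    out_len = (L - 1) * s0 + d0 * (K - 1) + 1 - 2 * p0
    y = [0.0] * out_len
    for c in range(len(x)):      # input channels accumulate into y
        for i in range(L):       # each input element scatters...
            for k in range(K):   # ...across the (dilated) kernel
                o = i * s0 + k * d0 - p0
                if 0 <= o < out_len:
                    y[o] += x[c][i] * w[c][k]
    return y

print(conv_transpose_1d([[1.0, 2.0, 3.0]], [[1.0, 1.0]]))
# [1.0, 3.0, 5.0, 3.0]
```

With s0=2 the same input scatters without overlap, giving [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]; the total work grows with Cin*L*K, which is why large values of any one parameter can push a software rasterizer like llvmpipe past its limits.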

Argmax also segfaults on llvmpipe on my computer for some reason.

etasnadi added 2 commits May 28, 2025 23:06
* Removes extra whitespace.

* Adds int64->int32 casts to prevent possible warnings.
@etasnadi
Author

etasnadi commented May 28, 2025

> Wild guess - any chance of uninitialized shared memory?

No. I was not aware that loop iterations are simply skipped after reaching a certain limit on llvmpipe, so I decided to reduce the maximum test problem size to 1337x13 (input x kernel) in order to pass the tests on llvmpipe in commit 2813cf4. Llvmpipe starts to trim loops after ~30k iterations, so this number should be safe and also more realistic than my previous edge cases. 5120x512 also worked for me with 128 threads, resulting in ~20k iterations per thread.
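The per-thread iteration estimate above checks out, assuming the work is split evenly across the 128 threads of the workgroup (an assumption on my part; the actual shader partitioning may differ):

```python
# Total scatter work for a 5120x512 (input x kernel) problem,
# divided across 128 threads as described in the comment above.
total_iters = 5120 * 512
per_thread = total_iters // 128
print(per_thread)  # 20480, i.e. ~20k iterations per thread
```

This stays below the ~30k-iteration limit at which llvmpipe reportedly starts trimming loops.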
