
Conv2D: Add CPU version #14320


Draft · wants to merge 1 commit into master

Conversation

@am17an (Collaborator) commented Jun 21, 2025

Adding this as a draft because, at the moment, it does not seem to be consistently faster than going through im2col, though in some cases it is. The solution is currently completely unoptimized and I'm looking to optimize it; it might also be useful for #14316.

| Input Size | Kernel Config | IM2COL (ms) | SIMD (ms) | Speedup |
|---|---|---|---|---|
| 8x8x3 | 3x3x3→16 s1 p0 | 0.300 | 0.013 | 23.08x SIMD |
| 8x8x3 | 3x3x3→16 s1 p1 | 0.020 | 0.017 | 1.18x SIMD |
| 16x16x8 | 5x5x8→32 s2 p2 | 0.066 | 0.070 | 1.06x IM2COL |
| 32x32x64 | 1x1x64→128 s1 p0 | 0.930 | 6.485 | 6.97x IM2COL |
| 16x16x16 | 3x3x16→32 s1 p1 | 0.359 | 0.485 | 1.35x IM2COL |
| 64x64x3 | 3x3x3→32 s1 p1 | 1.387 | 2.757 | 1.99x IM2COL |
| 128x128x16 | 3x3x16→32 s1 p1 | 9.760 | 73.721 | 7.55x IM2COL |
| 128x128x32 | 3x3x32→64 s1 p1 | 20.337 | 187.484 | 9.22x IM2COL |
| 64x64x64 | 3x3x64→128 s1 p1 | 11.420 | 235.696 | 20.64x IM2COL |
| 224x224x3 | 3x3x3→32 s1 p1 | 14.899 | 25.178 | 1.69x IM2COL |
| 224x224x3 | 7x7x3→64 s2 p3 | 10.947 | 69.425 | 6.34x IM2COL |
| 512x512x3 | 3x3x3→16 s1 p1 | 46.892 | 53.811 | 1.15x IM2COL |
| 512x512x3 | 3x3x3→16 s2 p1 | 13.348 | 17.387 | 1.30x IM2COL |
| 56x56x64 | 1x1x64→128 s1 p0 | 2.848 | 5.834 | 2.05x IM2COL |
| 28x28x128 | 1x1x128→256 s1 p0 | 1.460 | 4.830 | 3.31x IM2COL |
| 14x14x256 | 1x1x256→512 s1 p0 | 0.897 | 5.093 | 5.68x IM2COL |
| 256x256x8 | 3x3x8→8 s1 p1 | 17.228 | 7.223 | 2.39x SIMD |
| 512x512x4 | 3x3x4→4 s1 p1 | 36.965 | 10.013 | 3.69x SIMD |
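
For orientation, a minimal sketch of a direct (no im2col buffer) CPU conv2d loop, of the kind this PR is benchmarking against the im2col path. This is illustrative only: the layout (NCHW-style, contiguous f32) and the name `conv2d_direct_f32` are assumptions, not code from this PR.

```c
#include <stddef.h>

// Naive direct convolution (cross-correlation), f32, single batch.
// src: [C_in][IH][IW], ker: [C_out][C_in][KH][KW], dst: [C_out][OH][OW]
static void conv2d_direct_f32(
        const float * src, const float * ker, float * dst,
        int IW, int IH, int C_in,
        int KW, int KH, int C_out,
        int stride, int pad) {
    const int OW = (IW + 2*pad - KW)/stride + 1;
    const int OH = (IH + 2*pad - KH)/stride + 1;

    for (int oc = 0; oc < C_out; ++oc) {
        for (int oy = 0; oy < OH; ++oy) {
            for (int ox = 0; ox < OW; ++ox) {
                float acc = 0.0f;
                for (int ic = 0; ic < C_in; ++ic) {
                    for (int ky = 0; ky < KH; ++ky) {
                        const int iy = oy*stride + ky - pad;
                        if (iy < 0 || iy >= IH) continue; // zero padding
                        for (int kx = 0; kx < KW; ++kx) {
                            const int ix = ox*stride + kx - pad;
                            if (ix < 0 || ix >= IW) continue; // zero padding
                            acc += src[(ic*IH + iy)*IW + ix] *
                                   ker[((oc*C_in + ic)*KH + ky)*KW + kx];
                        }
                    }
                }
                dst[(oc*OH + oy)*OW + ox] = acc;
            }
        }
    }
}
```

The innermost accumulation over `C_in*KH*KW` is the part that vectorization/threading would target; the im2col path instead materializes those patches into a scratch buffer and runs a matrix multiplication over it.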

@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Jun 21, 2025
@etasnadi (Contributor) commented Jun 22, 2025

Check memory usage too. Naive im2col can use a lot of memory (maybe not the CPU version?), so even if your code is slower it is worth adding such an in-place version, especially for training conv layers, where memory footprint matters a lot.

Isn't vec_dot_f16/f32 faster than omp for computing the inner products?
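
To make the memory point concrete, a back-of-the-envelope estimate for one of the benchmark cases above, assuming a naive f32 im2col buffer of one `KW*KH*C_in` column per output position (my assumption, not measured from this PR):

```c
#include <stdio.h>

// Rough im2col scratch-buffer size for the 512x512x3 input,
// 3x3 kernel, stride 1, pad 1 case (output stays 512x512).
int main(void) {
    const long OW = 512, OH = 512;        // output positions
    const long KW = 3, KH = 3, C_in = 3;  // kernel size and input channels
    const long bytes = OW*OH * KW*KH*C_in * (long) sizeof(float);
    // ~27 MiB of scratch, versus a ~3 MiB input tensor.
    printf("im2col buffer: %.1f MiB\n", bytes/(1024.0*1024.0));
    return 0;
}
```

A direct kernel avoids that intermediate entirely, which is where the training use case would benefit.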

Labels: ggml (changes relating to the ggml tensor library for machine learning)

2 participants