Extremely slow performance on Ryzen 7950X3D #7

n00mkrad · 2023-08-15T21:06:49Z

Running the line from the readme, I get this:

step 1 sampling completed, taking 50.97s

Compiled with cmake on Windows. Shouldn't it be a little bit faster?


klosax · 2023-08-15T22:05:07Z

See my tests here #6

leejet · 2023-08-16T00:31:07Z

Yes, I'm trying to modify GGML to make it run faster. Could you add the -v parameter to print out your System Info and Options so I can take a look?

n00mkrad · 2023-08-16T07:32:37Z

> Yes, I'm trying to modify GGML to make it run faster. Could you add the -v parameter to print out your System Info and Options so I can take a look?

Option:
    n_threads:       32
    mode:            txt2img
    model_path:      models/sd-1.5-ggml-model-q4_1.bin
    output_path:     output.png
    init_img:
    prompt:          photo of a lovely cat, high quality
    negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
    cfg_scale:       7.50
    width:           512
    height:          512
    sample_method:   eular a
    sample_steps:    20
    strength:        0.75
    seed:            1
System Info:
    BLAS = 0
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 0
    ARM_FMA = 0
    F16C = 0
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2500 - loading model from 'models/sd-1.5-ggml-model-q4_1.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO]  stable-diffusion.cpp:2525 - ftype: q4_1
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO]  stable-diffusion.cpp:2570 - params ctx size =  1454.75 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size =  1454.34MB
[INFO]  stable-diffusion.cpp:2715 - loading model from 'models/sd-1.5-ggml-model-q4_1.bin' completed, taking 1.03s
[DEBUG] stable-diffusion.cpp:353  - split prompt "photo of a lovely cat, high quality" to tokens ["photo</w>", "of</w>", "a</w>", "lovely</w>", "cat</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:2752 - condition context need 1.46MB static memory, with work_size needing 0.28MB
[DEBUG] stable-diffusion.cpp:2776 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.72s
[INFO]  stable-diffusion.cpp:2796 - condition graph use 4.39MB of memory: static 1.46MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353  - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry</w>", ",</w>", "ugly</w>", ",</w>", "<|endoftext|>", "compression</w>", ",</w>", "artifacts</w>", ",</w>", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2752 - condition context need 1.46MB static memory, with work_size needing 0.28MB
[DEBUG] stable-diffusion.cpp:2776 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.37s
[INFO]  stable-diffusion.cpp:2796 - condition graph use 4.39MB of memory: static 1.46MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3243 - get_learned_condition completed, taking 1.10s
[INFO]  stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2848 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 42.43s
[DEBUG] stable-diffusion.cpp:2993 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet

Green-Sky · 2023-08-16T08:06:52Z

> System Info:
> BLAS = 0
> SSE3 = 0
> AVX = 0
> AVX2 = 0
> AVX512 = 0
> AVX512_VBMI = 0
> AVX512_VNNI = 0
> FMA = 0
> NEON = 0
> ARM_FMA = 0
> F16C = 0
> FP16_VA = 0

how did you build your sd? some features here should be enabled on any platform (AVX2 on almost all x86 CPUs out there)
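
A quick way to spot this kind of build problem is to scan the `System Info` block printed by `-v` for x86 SIMD features reported as 0. The following is a minimal sketch, assuming a POSIX shell with `awk` (for example Git Bash on Windows). The helper name `zero_simd_flags` is made up for illustration and is not part of stable-diffusion.cpp; the flag names are taken from the output posted above.

```shell
#!/bin/sh
# zero_simd_flags: read `sd -v` output on stdin and print the x86 SIMD
# features (SSE3, AVX, AVX2, FMA, F16C) that are reported as 0.
# Hypothetical helper, not part of stable-diffusion.cpp.
zero_simd_flags() {
    awk '
        # Lines look like "    AVX2 = 0": $1 is the name, $3 the value.
        $2 == "=" && $3 == "0" && $1 ~ /^(SSE3|AVX|AVX2|FMA|F16C)$/ {
            printf "%s%s", sep, $1
            sep = " "
        }
        END { if (sep != "") print "" }
    '
}

# Example with a fragment of the output posted above.
# prints: SSE3 AVX AVX2 FMA F16C
zero_simd_flags <<'EOF'
System Info:
    BLAS = 0
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    FMA = 0
    F16C = 0
EOF
```

An empty result would mean all five features were compiled in. Any names printed on a recent x86 CPU suggest the compiler flags did not take effect.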

n00mkrad · 2023-08-16T08:08:00Z

> System Info:
> BLAS = 0
> SSE3 = 0
> AVX = 0
> AVX2 = 0
> AVX512 = 0
> AVX512_VBMI = 0
> AVX512_VNNI = 0
> FMA = 0
> NEON = 0
> ARM_FMA = 0
> F16C = 0
> FP16_VA = 0
>
> how did you build your sd? some features here should be enabled on any platform (AVX2 on almost all x86 CPUs out there)

Installed cmake and ran the commands from the readme.
I'm trying it again right now after installing CUDA and using cmake .. -DGGML_CUBLAS=ON.

Green-Sky · 2023-08-16T08:11:16Z

also, your number of threads seems excessive, try reducing that to match the physical core count.

n00mkrad · 2023-08-16T08:12:37Z

> also, your number of threads seems excessive, try reducing that to match the physical core count.

The default only gave me around 60% utilization. But yeah I think 32 is too much. Didn't impact performance either way though.
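
For reference, the 7950X3D has 16 physical cores and 32 hardware threads, so `n_threads: 32` counts every logical CPU. On Linux, the physical core count can be read from `/proc/cpuinfo` by counting unique (physical id, core id) pairs. This is a sketch under that assumption: the helper name `physical_cores` is hypothetical, and the approach only works where `/proc/cpuinfo` exposes those two fields (typically x86 Linux, not Windows).

```shell
#!/bin/sh
# physical_cores: count unique (physical id, core id) pairs in
# /proc/cpuinfo-style text read from stdin. Hypothetical helper.
physical_cores() {
    awk -F': *' '
        /^physical id/ { pkg = $2 }               # CPU package of this entry
        /^core id/     { seen[pkg "/" $2] = 1 }   # one key per physical core
        END { n = 0; for (k in seen) n++; print n }
    '
}

# Example: four logical CPUs that share two physical cores.
# prints: 2
physical_cores <<'EOF'
processor   : 0
physical id : 0
core id     : 0
processor   : 1
physical id : 0
core id     : 0
processor   : 2
physical id : 0
core id     : 1
processor   : 3
physical id : 0
core id     : 1
EOF
```

On a real machine this would be called as `physical_cores < /proc/cpuinfo`, and the result passed to `sd` with `-t`.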

n00mkrad · 2023-08-16T08:13:55Z

My compile log:

MSBuild version 17.6.3+07e294721 for .NET Framework

  1>Checking Build System
  Building Custom Rule stable-diffusion.cpp/ggml/src/CMakeLists.txt
  Compiling CUDA source file ..\..\..\ggml\src\ggml-cuda.cu...

  stable-diffusion.cpp\build\ggml\src>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\
  HostX64\x64" -x cu   -I"stable-diffusion.cpp\ggml\src\." -I"stable-diffusion.cpp\ggml\src\..\include" -I"stable-diffusion.cpp\ggml\src\..\include\ggml" -I"C:\Program Files\NVIDIA GPU Comput
  ing Toolkit\CUDA\v11.8\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include"     --keep-dir x64\Release  -maxrregcount=0  --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,s
  m_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] -Xcompiler="/EHsc -Ob2"   -D_WINDOWS -DNDEBUG -DGGML_USE_CUBLAS -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DGGML_USE_CUBLAS -D"CMAKE_INTDIR=\"Release
  \"" -Xcompiler "/EHsc /W3 /nologo /O2 /Fdstable-diffusion.cpp\build\ggml\src\Release\ggml.pdb /FS   /MD /GR" -o ggml.dir\Release\ggml-cuda.obj "stable-diffusion.cpp\ggml\src\ggml-cuda.cu"
  ggml-cuda.cu
cl : command line  warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.c
cl : command line  warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.vcxproj -> stable-diffusion.cpp\build\ggml\src\Release\ggml.lib
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
  stable-diffusion.cpp
  stable-diffusion.vcxproj -> stable-diffusion.cpp\build\Release\stable-diffusion.lib
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
  main.cpp
stable-diffusion.cpp\stb_image_write.h(776,13): warning C4996: 'sprintf': This function or variable may be unsafe. Consider using sprintf_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for
 details. [stable-diffusion.cpp\build\sd.vcxproj]
  sd.vcxproj -> stable-diffusion.cpp\build\Release\sd.exe
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt

It's at 30 seconds per sampling step now.
I wonder about this part:

 cl : command line  warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.c
cl : command line  warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]

Does that imply it failed to enable AVX/AVX2 stuff?

Green-Sky · 2023-08-16T08:17:14Z

> It's at 30 seconds per sampling step now.
> I wonder about this part:
> ....
> Does that imply it failed to enable AVX/AVX2 stuff?

no, i think that's the cuda compiler.

Green-Sky · 2023-08-16T08:17:59Z

or maybe not? hm, what is your platform / what platform are you building for?

Green-Sky · 2023-08-16T08:18:48Z

ran it with my build + adjusted threads to 10 (i have 12 physical cores)

$ ./sd -t 10 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v
Option:
    n_threads:       10
    mode:            txt2img
    model_path:      ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin
    output_path:     output.png
    init_img:
    prompt:          photo of a lovely cat, high quality
    negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   eular a
    sample_steps:    20
    strength:        0.75
    seed:            42
System Info:
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2500 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO]  stable-diffusion.cpp:2525 - ftype: q8_0
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO]  stable-diffusion.cpp:2570 - params ctx size =  1618.72 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size =  1618.31MB
[INFO]  stable-diffusion.cpp:2715 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.44s
[DEBUG] stable-diffusion.cpp:353  - split prompt "photo of a lovely cat, high quality" to tokens ["photo</w>", "of</w>", "a</w>", "lovely</w>", "cat</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.05s
[INFO]  stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353  - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry</w>", ",</w>", "ugly</w>", ",</w>", "<|endoftext|>", "compression</w>", ",</w>", "artifacts</w>", ",</w>", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.05s
[INFO]  stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3243 - get_learned_condition completed, taking 0.10s
[INFO]  stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2846 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 15.96s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 2 sampling completed, taking 15.68s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 3 sampling completed, taking 15.83s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 4 sampling completed, taking 15.90s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 5 sampling completed, taking 15.93s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 6 sampling completed, taking 15.79s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 7 sampling completed, taking 15.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 8 sampling completed, taking 15.66s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 9 sampling completed, taking 15.71s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 10 sampling completed, taking 15.85s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 11 sampling completed, taking 15.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 12 sampling completed, taking 15.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 13 sampling completed, taking 15.85s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 14 sampling completed, taking 15.90s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 15 sampling completed, taking 15.84s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 16 sampling completed, taking 16.07s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 17 sampling completed, taking 15.88s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 18 sampling completed, taking 15.98s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 19 sampling completed, taking 15.89s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 20 sampling completed, taking 15.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3001 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:3005 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3258 - sampling completed, taking 316.81s
[DEBUG] stable-diffusion.cpp:3153 - vae context need 1153.12MB static memory, with work_size needing 1152.00MB
[DEBUG] stable-diffusion.cpp:3179 - computing vae graph completed, taking 50.49s
[INFO]  stable-diffusion.cpp:3188 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[DEBUG] stable-diffusion.cpp:3192 - 3145728 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3265 - decode_first_stage completed, taking 50.53s
[INFO]  stable-diffusion.cpp:3266 - txt2img completed in 367.45s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'

edit: also i used q8_0 instead of q4_1
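
To compare runs like the ones in this thread without reading every line, the per-step timings can be summarized from the log. This is a minimal sketch, assuming the log format shown above (`step N sampling completed, taking X.XXs`) and a POSIX shell with `awk`. The helper name `step_stats` is made up for illustration.

```shell
#!/bin/sh
# step_stats: read sd log output on stdin and print the number of sampling
# steps, the mean time per step, and the summed time. Hypothetical helper.
step_stats() {
    awk '
        /step [0-9]+ sampling completed, taking/ {
            t = $NF              # last field looks like "15.96s"
            sub(/s$/, "", t)     # drop the trailing unit
            sum += t
            n++
        }
        END {
            if (n > 0)
                printf "%d steps, mean %.2fs, total %.2fs\n", n, sum / n, sum
        }
    '
}

# Example with the first three steps of the log above.
# prints: 3 steps, mean 15.82s, total 47.47s
step_stats <<'EOF'
[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 15.96s
[INFO]  stable-diffusion.cpp:2989 - step 2 sampling completed, taking 15.68s
[INFO]  stable-diffusion.cpp:2989 - step 3 sampling completed, taking 15.83s
EOF
```

Fed the full 20-step log, the summed total should land close to the `sampling completed, taking 316.81s` line, since the final summary line itself is not matched by the pattern.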

n00mkrad · 2023-08-16T08:21:16Z

Well this part is definitely off:

System Info:
    BLAS = 1
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 0
    ARM_FMA = 0
    F16C = 0
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0

I assume the lack of AVX is a compiler issue. But no idea how to fix that, it seems to be up to date.

Green-Sky · 2023-08-16T08:23:59Z

OH, tell me more about your build environment/process, because it is trying to enable gcc/clang avx2 (-mavx2) on a msvc compiler
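
The `D9002` warnings in the compile log are how this shows up: `cl` reports each GCC/Clang-style option it does not recognize and then ignores it. Below is a small sketch for pulling those options out of a saved build log, assuming a POSIX shell with `sed` and `sort`. The helper name `ignored_msvc_options` is hypothetical.

```shell
#!/bin/sh
# ignored_msvc_options: read an MSBuild log on stdin and list each option
# that cl reported as unknown (warning D9002), one per line, deduplicated.
# Hypothetical helper.
ignored_msvc_options() {
    sed -n "s/.*warning D9002: ignoring unknown option '\([^']*\)'.*/\1/p" |
        LC_ALL=C sort -u
}

# Example with the four warnings from the compile log above.
# prints -mavx, -mavx2, -mf16c and -mfma, one per line.
ignored_msvc_options <<'EOF'
cl : command line  warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
EOF
```

Any `-m...` option in that list was dropped by the compiler, which is consistent with the SIMD entries in `System Info` staying at 0.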

Green-Sky · 2023-08-16T08:25:56Z

If you poke around in your build directory, you should find a CMakeCache.txt, inside there you can add /arch:AVX2 to CMAKE_CXX_FLAGS:STRING= and CMAKE_C_FLAGS:STRING=
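
After editing the cache by hand, it can help to confirm that both entries actually carry the flag before rebuilding. This is a minimal sketch, assuming a POSIX shell with `grep`. The helper name `has_avx2_flag` is hypothetical, and the flag values in the sample file are illustrative rather than copied from a real cache.

```shell
#!/bin/sh
# has_avx2_flag: succeed only if both CMAKE_C_FLAGS and CMAKE_CXX_FLAGS in
# the given CMakeCache.txt contain /arch:AVX2. Hypothetical helper.
has_avx2_flag() {
    cache=$1
    grep -q '^CMAKE_C_FLAGS:STRING=.*/arch:AVX2' "$cache" &&
        grep -q '^CMAKE_CXX_FLAGS:STRING=.*/arch:AVX2' "$cache"
}

# Example: a sample cache where only the C flags were edited.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
CMAKE_C_FLAGS:STRING=/DWIN32 /D_WINDOWS /W3 /arch:AVX2
CMAKE_CXX_FLAGS:STRING=/DWIN32 /D_WINDOWS /W3 /GR /EHsc
EOF
if has_avx2_flag "$tmp"; then
    echo "AVX2 set for C and C++"
else
    echo "AVX2 missing from at least one flag line"
fi
rm -f "$tmp"
```

This only verifies that the edit was saved. Whether the build then picks the flag up is a separate question.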

n00mkrad · 2023-08-16T09:25:49Z

> OH, tell me more about your build environment/process because it is trying to enable gcc/clang avx2 (-mavx2) on a msvc compiler

Windows 10 22H2, VS 2022 with Build Tools installed, CUDA Toolkit 11.8 installed, cmake installed using their setup.

> If you poke around in your build directory, you should find a CMakeCache.txt, inside there you can add /arch:AVX2 to CMAKE_CXX_FLAGS:STRING= and CMAKE_C_FLAGS:STRING=

That didn't seem to change anything. I ran cmake --build . --config Release again and same result.

Green-Sky · 2023-08-16T10:53:17Z

very funky, @leejet i will probably make a pr later with improved cmake (by copying from llama.cpp)

leejet · 2023-08-16T12:30:04Z

> very funky, @leejet i will probably make a pr later with improved cmake (by copying from llama.cpp)

The latest GGML code has already fixed this issue. I will rebase my code onto the latest GGML code.

leejet · 2023-08-16T14:30:41Z

> Does that imply it failed to enable AVX/AVX2 stuff?

@n00mkrad the issue has been fixed. You can pull the latest code and give it a try. Don't forget to update the submodule as well.

git pull origin master
git submodule update

n00mkrad · 2023-08-22T16:20:04Z

Works. Still very slow, but I guess that's expected.
About 7 sec per step with cuBLAS, 30 sec without.

n00mkrad closed this as completed Aug 22, 2023
