Extremely slow performance on Ryzen 7950X3D #7

Closed
n00mkrad opened this issue Aug 15, 2023 · 19 comments

Comments

@n00mkrad

Running the line from the readme, I get this:

step 1 sampling completed, taking 50.97s

Compiled with cmake on Windows. Shouldn't it be a little bit faster?

@klosax

klosax commented Aug 15, 2023

See my tests here #6

@leejet
Owner

leejet commented Aug 16, 2023

Yes, I'm trying to modify GGML to make it run faster. Could you add the -v parameter to print out your System Info and Options so I can take a look?

@n00mkrad
Author

Yes, I'm trying to modify GGML to make it run faster. Could you add the -v parameter to print out your System Info and Options so I can take a look?

Option:
    n_threads:       32
    mode:            txt2img
    model_path:      models/sd-1.5-ggml-model-q4_1.bin
    output_path:     output.png
    init_img:
    prompt:          photo of a lovely cat, high quality
    negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
    cfg_scale:       7.50
    width:           512
    height:          512
    sample_method:   eular a
    sample_steps:    20
    strength:        0.75
    seed:            1
System Info:
    BLAS = 0
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 0
    ARM_FMA = 0
    F16C = 0
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2500 - loading model from 'models/sd-1.5-ggml-model-q4_1.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO]  stable-diffusion.cpp:2525 - ftype: q4_1
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO]  stable-diffusion.cpp:2570 - params ctx size =  1454.75 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size =  1454.34MB
[INFO]  stable-diffusion.cpp:2715 - loading model from 'models/sd-1.5-ggml-model-q4_1.bin' completed, taking 1.03s
[DEBUG] stable-diffusion.cpp:353  - split prompt "photo of a lovely cat, high quality" to tokens ["photo</w>", "of</w>", "a</w>", "lovely</w>", "cat</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:2752 - condition context need 1.46MB static memory, with work_size needing 0.28MB
[DEBUG] stable-diffusion.cpp:2776 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.72s
[INFO]  stable-diffusion.cpp:2796 - condition graph use 4.39MB of memory: static 1.46MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353  - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry</w>", ",</w>", "ugly</w>", ",</w>", "<|endoftext|>", "compression</w>", ",</w>", "artifacts</w>", ",</w>", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2752 - condition context need 1.46MB static memory, with work_size needing 0.28MB
[DEBUG] stable-diffusion.cpp:2776 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.37s
[INFO]  stable-diffusion.cpp:2796 - condition graph use 4.39MB of memory: static 1.46MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3243 - get_learned_condition completed, taking 1.10s
[INFO]  stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2848 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 42.43s
[DEBUG] stable-diffusion.cpp:2993 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet

@Green-Sky
Contributor

System Info:
BLAS = 0
SSE3 = 0
AVX = 0
AVX2 = 0
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 0
NEON = 0
ARM_FMA = 0
F16C = 0
FP16_VA = 0

How did you build your sd? Some of these features should be enabled on any platform (AVX2 is available on almost all x86 CPUs out there).

@n00mkrad
Author

System Info:
BLAS = 0
SSE3 = 0
AVX = 0
AVX2 = 0
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 0
NEON = 0
ARM_FMA = 0
F16C = 0
FP16_VA = 0

How did you build your sd? Some of these features should be enabled on any platform (AVX2 is available on almost all x86 CPUs out there).

Installed cmake and ran the commands from the readme.
I'm trying it again right now after installing CUDA and using cmake .. -DGGML_CUBLAS=ON.

@Green-Sky
Contributor

Also, your number of threads seems excessive; try reducing it to match the physical core count.
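
For reference, the 7950X3D has 16 physical cores and 32 hardware threads, so `-t 16` would be the matching value. A minimal sketch of how a thread count could be derived from the logical count, assuming two threads per core; the helper names here are hypothetical and not part of sd or ggml:

```cpp
#include <thread>

// Estimate physical cores from the logical processor count.
// threads_per_core is an assumption supplied by the caller (2 on CPUs
// with SMT); the C++ standard library cannot detect it.
constexpr unsigned physical_cores(unsigned logical, unsigned threads_per_core) {
    return (logical == 0 || threads_per_core == 0 || logical < threads_per_core)
               ? 1u  // fall back to one thread when the count is unknown
               : logical / threads_per_core;
}

// std::thread::hardware_concurrency() reports logical processors and may
// return 0 when the value cannot be determined.
inline unsigned suggested_threads(unsigned threads_per_core = 2) {
    return physical_cores(std::thread::hardware_concurrency(), threads_per_core);
}
```

With 32 logical processors and two threads per core, `physical_cores(32, 2)` gives 16. Whether that is actually the fastest setting still has to be measured.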

@n00mkrad
Author

Also, your number of threads seems excessive; try reducing it to match the physical core count.

The default only gave me around 60% utilization. But yeah I think 32 is too much. Didn't impact performance either way though.

@n00mkrad
Author

My compile log:

MSBuild version 17.6.3+07e294721 for .NET Framework

  1>Checking Build System
  Building Custom Rule stable-diffusion.cpp/ggml/src/CMakeLists.txt
  Compiling CUDA source file ..\..\..\ggml\src\ggml-cuda.cu...

  stable-diffusion.cpp\build\ggml\src>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\
  HostX64\x64" -x cu   -I"stable-diffusion.cpp\ggml\src\." -I"stable-diffusion.cpp\ggml\src\..\include" -I"stable-diffusion.cpp\ggml\src\..\include\ggml" -I"C:\Program Files\NVIDIA GPU Comput
  ing Toolkit\CUDA\v11.8\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include"     --keep-dir x64\Release  -maxrregcount=0  --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,s
  m_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] -Xcompiler="/EHsc -Ob2"   -D_WINDOWS -DNDEBUG -DGGML_USE_CUBLAS -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DGGML_USE_CUBLAS -D"CMAKE_INTDIR=\"Release
  \"" -Xcompiler "/EHsc /W3 /nologo /O2 /Fdstable-diffusion.cpp\build\ggml\src\Release\ggml.pdb /FS   /MD /GR" -o ggml.dir\Release\ggml-cuda.obj "stable-diffusion.cpp\ggml\src\ggml-cuda.cu"
  ggml-cuda.cu
cl : command line  warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.c
cl : command line  warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.vcxproj -> stable-diffusion.cpp\build\ggml\src\Release\ggml.lib
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
  stable-diffusion.cpp
  stable-diffusion.vcxproj -> stable-diffusion.cpp\build\Release\stable-diffusion.lib
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
  main.cpp
stable-diffusion.cpp\stb_image_write.h(776,13): warning C4996: 'sprintf': This function or variable may be unsafe. Consider using sprintf_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for
 details. [stable-diffusion.cpp\build\sd.vcxproj]
  sd.vcxproj -> stable-diffusion.cpp\build\Release\sd.exe
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt

It's at 30 seconds per sampling step now.
I wonder about this part:

 cl : command line  warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.c
cl : command line  warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]

Does that imply it failed to enable AVX/AVX2 stuff?

@Green-Sky
Contributor

It's at 30 seconds per sampling step now.
I wonder about this part:

....

Does that imply it failed to enable AVX/AVX2 stuff?

No, I think that's the CUDA compiler.

@Green-Sky
Contributor

Or maybe not? Hm, what is your platform, and what platform are you building for?

@Green-Sky
Contributor

Green-Sky commented Aug 16, 2023

Ran it with my build and adjusted threads to 10 (I have 12 physical cores).

$ ./sd -t 10 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v
Option:
    n_threads:       10
    mode:            txt2img
    model_path:      ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin
    output_path:     output.png
    init_img:
    prompt:          photo of a lovely cat, high quality
    negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   eular a
    sample_steps:    20
    strength:        0.75
    seed:            42
System Info:
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2500 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO]  stable-diffusion.cpp:2525 - ftype: q8_0
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO]  stable-diffusion.cpp:2570 - params ctx size =  1618.72 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size =  1618.31MB
[INFO]  stable-diffusion.cpp:2715 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.44s
[DEBUG] stable-diffusion.cpp:353  - split prompt "photo of a lovely cat, high quality" to tokens ["photo</w>", "of</w>", "a</w>", "lovely</w>", "cat</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.05s
[INFO]  stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353  - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry</w>", ",</w>", "ugly</w>", ",</w>", "<|endoftext|>", "compression</w>", ",</w>", "artifacts</w>", ",</w>", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.05s
[INFO]  stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3243 - get_learned_condition completed, taking 0.10s
[INFO]  stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2846 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 15.96s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 2 sampling completed, taking 15.68s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 3 sampling completed, taking 15.83s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 4 sampling completed, taking 15.90s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 5 sampling completed, taking 15.93s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 6 sampling completed, taking 15.79s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 7 sampling completed, taking 15.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 8 sampling completed, taking 15.66s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 9 sampling completed, taking 15.71s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 10 sampling completed, taking 15.85s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 11 sampling completed, taking 15.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 12 sampling completed, taking 15.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 13 sampling completed, taking 15.85s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 14 sampling completed, taking 15.90s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 15 sampling completed, taking 15.84s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 16 sampling completed, taking 16.07s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 17 sampling completed, taking 15.88s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 18 sampling completed, taking 15.98s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 19 sampling completed, taking 15.89s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 20 sampling completed, taking 15.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3001 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:3005 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3258 - sampling completed, taking 316.81s
[DEBUG] stable-diffusion.cpp:3153 - vae context need 1153.12MB static memory, with work_size needing 1152.00MB
[DEBUG] stable-diffusion.cpp:3179 - computing vae graph completed, taking 50.49s
[INFO]  stable-diffusion.cpp:3188 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[DEBUG] stable-diffusion.cpp:3192 - 3145728 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3265 - decode_first_stage completed, taking 50.53s
[INFO]  stable-diffusion.cpp:3266 - txt2img completed in 367.45s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'

[image: output]

Edit: also, I used q8_0 instead of q4_1.

@n00mkrad
Author

Well this part is definitely off:

System Info:
    BLAS = 1
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 0
    ARM_FMA = 0
    F16C = 0
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0

I assume the lack of AVX is a compiler issue. But no idea how to fix that, it seems to be up to date.
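
As far as I know, ggml derives these values from compiler-predefined macros at build time, not from a runtime CPU probe, so a build without the right switch prints 0 even on a CPU that supports AVX2. A minimal sketch of that idea, assuming the standard predefined macros; the helper names are made up for illustration and are not ggml's API:

```cpp
#include <cstdio>

// Each helper mirrors one "System Info" line: it returns 1 only if the
// compiler predefined the matching macro while building this file.
constexpr int built_with_avx() {
#if defined(__AVX__)
    return 1;
#else
    return 0;
#endif
}

constexpr int built_with_avx2() {
#if defined(__AVX2__)
    return 1;
#else
    return 0;
#endif
}

constexpr int built_with_fma() {
#if defined(__FMA__)
    return 1;
#else
    return 0;
#endif
}

constexpr int built_with_f16c() {
#if defined(__F16C__)
    return 1;
#else
    return 0;
#endif
}

// Print the flags in the same "NAME = 0/1" layout as the log above.
inline void print_simd_build_info() {
    std::printf("AVX = %d\n", built_with_avx());
    std::printf("AVX2 = %d\n", built_with_avx2());
    std::printf("FMA = %d\n", built_with_fma());
    std::printf("F16C = %d\n", built_with_f16c());
}
```

Building this with `-mavx2` (GCC/Clang) or `/arch:AVX2` (MSVC) should flip AVX and AVX2 to 1. FMA and F16C may still read 0 under MSVC, which to my knowledge does not predefine those two macros.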

@Green-Sky
Contributor

Oh, tell me more about your build environment/process, because it is trying to pass the GCC/Clang AVX2 flag (-mavx2) to an MSVC compiler.
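
To illustrate the mismatch: MSVC selects the instruction set with `/arch:` switches and ignores the GCC-style `-m` flags, which is what the D9002 warnings are reporting. A rough sketch of the mapping, requiring C++17; the function name is hypothetical, and the empty result for `-mfma` and `-mf16c` only means this sketch knows no one-to-one MSVC switch for them:

```cpp
#include <string_view>

// Map a GCC/Clang SIMD flag to the MSVC switch that plays the same role.
// Returns an empty view when this sketch has no one-to-one equivalent.
constexpr std::string_view msvc_arch_flag(std::string_view gcc_flag) {
    if (gcc_flag == "-mavx2") return "/arch:AVX2";
    if (gcc_flag == "-mavx")  return "/arch:AVX";
    return {};  // e.g. -mfma, -mf16c: no dedicated /arch: switch here
}
```

A build script would normally make this choice per compiler (for example, branching on MSVC in CMake) instead of passing the same flags everywhere.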

@Green-Sky
Contributor

If you poke around in your build directory, you should find a CMakeCache.txt. Inside it you can add /arch:AVX2 to CMAKE_CXX_FLAGS:STRING= and CMAKE_C_FLAGS:STRING=

@n00mkrad
Author

Oh, tell me more about your build environment/process, because it is trying to pass the GCC/Clang AVX2 flag (-mavx2) to an MSVC compiler.

Windows 10 22H2, VS 2022 with Build Tools installed, CUDA Toolkit 11.8 installed, cmake installed using their setup.

If you poke around in your build directory, you should find a CMakeCache.txt. Inside it you can add /arch:AVX2 to CMAKE_CXX_FLAGS:STRING= and CMAKE_C_FLAGS:STRING=

That didn't seem to change anything. I ran cmake --build . --config Release again and same result.

@Green-Sky
Contributor

Very funky. @leejet, I will probably make a PR later with improved CMake (by copying from llama.cpp).

@leejet
Owner

leejet commented Aug 16, 2023

Very funky. @leejet, I will probably make a PR later with improved CMake (by copying from llama.cpp).

The latest GGML code has already fixed this issue. I will rebase my code onto the latest GGML code.

@leejet
Owner

leejet commented Aug 16, 2023

Does that imply it failed to enable AVX/AVX2 stuff?

@n00mkrad the issue has been fixed. You can pull the latest code and give it a try. Don't forget to update the submodule as well.

git pull origin master
git submodule update

@n00mkrad
Author

Works. Still very slow, but I guess that's expected.
About 7 sec per step with cuBLAS, 30 sec without.
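
To put those numbers in perspective, here is a back-of-the-envelope sketch, assuming the 20 sample steps shown in the -v output earlier in the thread and ignoring model loading and VAE decode time; the function names are only illustrative:

```cpp
// Total sampling time for a run, given a constant time per step.
constexpr double total_sampling_seconds(int steps, double seconds_per_step) {
    return steps * seconds_per_step;
}

// How many times faster the improved configuration is than the baseline.
constexpr double speedup(double baseline_seconds, double improved_seconds) {
    return baseline_seconds / improved_seconds;
}
```

With 20 steps that works out to roughly 140 s of sampling with cuBLAS versus 600 s without, a speedup of about 4.3x.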
