6600/6600 XT/6650 XT gfx1032 libraries for compilation of Kobold.cpp #655

Open
jasyuiop opened this issue Feb 1, 2024 · 19 comments

@jasyuiop

jasyuiop commented Feb 1, 2024

Information

I have an RX 6600 (gfx1032) video card. On Linux I can use rocBLAS by setting "export HSA_OVERRIDE_GFX_VERSION=10.3.0", but on Windows rocBLAS has no kernel or TensileLibrary support for gfx1032.
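
For context on the override value: as far as I understand the naming convention, a gfx target name encodes major.minor.stepping, with the last two characters being the minor and stepping as hex digits. Under that assumption gfx1032 is 10.3.2, and an override of 10.3.0 makes the runtime treat the card as gfx1030. A minimal Python sketch of the mapping (the helper name gfx_to_version is made up for illustration):

```python
def gfx_to_version(target: str) -> str:
    """Convert a gfx target name to a major.minor.stepping string.

    Assumption: the last two characters are the minor and stepping as hex
    digits, and the digits between "gfx" and them are the decimal major.
    """
    if not target.startswith("gfx") or len(target) < 6:
        raise ValueError(f"not a gfx target name: {target!r}")
    body = target[3:]
    major = int(body[:-2])        # e.g. "10" in gfx1032
    minor = int(body[-2], 16)     # e.g. "3"
    stepping = int(body[-1], 16)  # e.g. "2"
    return f"{major}.{minor}.{stepping}"


if __name__ == "__main__":
    for name in ("gfx1032", "gfx1030"):
        print(name, "->", gfx_to_version(name))
```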

I had ROCm 5.5.1 installed on my system, so I used the rocm-5.5.1 branches of rocBLAS and Tensile.

I applied this patch to Tensile; https://raw.githubusercontent.com/ulyssesrr/docker-rocm-xtra/f25f12835c1d0a5efa80763b5381accf175b200e/rocm-xtra-rocblas-builder/patches/Tensile-fix-fallback-arch-build.patch

Resources I follow

ggerganov#1087 (comment)
#441
https://www.reddit.com/r/LocalLLaMA/comments/16d1hi0/guide_build_llamacpp_on_windows_with_amd_gpus_and/

Using the information here, I was able to create a "non-lazy merged library" for gfx1032. I could not create the "lazy" one no matter what I did.

Results

Using the generated Kernels.so-000-gfx1032.hsaco and TensileLibrary.dat files, I was able to load a 7B LLM completely onto the GPU in koboldcpp-rocm, and I got an average speed of 25 t/s in a new chat.

Progress

I have now installed ROCm 5.7.1 and am trying to build the lazy and non-lazy versions for gfx1032 without any patches, using the release/rocm-rel-5.7 branches of Tensile and rocBLAS. I don't know yet whether it will compile successfully; if it does, I will add those files.

The last word

I would appreciate it if you added these files to the pre-builds in future releases. @YellowRoseCx

Attachments

gfx1032_none_lazy.zip

@jasyuiop jasyuiop changed the title 6600/6600 XT/6650 XT Gfx1032 libraries for compilation of Kobold.cpp 6600/6600 XT/6650 XT gfx1032 libraries for compilation of Kobold.cpp Feb 1, 2024
@jasyuiop
Author

jasyuiop commented Feb 1, 2024

Information

EDIT: The one I created as "lazy" seems to be incomplete, so I built a "non-lazy" one with the rocBLAS and Tensile rel-5.7.1 branches and am attaching it.

With this commit, which was merged last week, I was able to generate the "lazy" library for gfx1032 without any patch. I used rocBLAS's develop branch. I will explain step by step how I did it below. ROCm/Tensile@efbe0c0

Setup

Install

Git for Windows
Visual Studio 2022 Build Tools

  • Tick “Desktop development with C++” workload.

ROCm Windows SDK (i used 5.7.1)
Strawberry perl
python 3.11

ADD PATH

Cmake and Ninja:

  • C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin
  • C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\Ninja

Git:

  • C:\Program Files\Git\bin

Perl:

  • C:\Strawberry\perl\bin

RC (when compiling koboldcpp I get an error saying rc not found, so I added it to path):

  • C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64

ROCM

  • C:\Program Files\AMD\ROCm\5.7\bin
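
Before starting a long build it can help to confirm that the PATH entries above actually took effect. The following is a minimal Python sketch, not part of the original steps; the executable names (cmake, ninja, git, perl, rc) are assumed from the folders listed, and missing_tools is a hypothetical helper name.

```python
import shutil


def missing_tools(names):
    """Return the tool names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Executable names assumed from the PATH entries listed above.
    tools = ["cmake", "ninja", "git", "perl", "rc"]
    missing = missing_tools(tools)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found.")
```

Run it from the same x64 native tools console you will build in, since that console has its own environment.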

vcpkg

Rocblas

go to another folder for example downloads etc.

Open x64 native tools as ADMIN and go to the rocblas folder

  • python rmake.py

A note here: I struggled with the rmake.py command for two days. Even when I passed -a with gfx1032 or other parameters, I still got an error. If you are luckier than me, you may not get an error here.

It doesn't matter if you get an error too; the command just needs to generate some files and put them in place.

After the rmake.py command has finished (continue in the x64 native tools console)

for non-lazy

  • .\build\release\virtualenv\Scripts\activate.bat
  • TensileCreateLibrary --architecture gfx1032 --code-object-version default --merge-files --library-format msgpack .\library\src\blas3\Tensile\Logic\asm_full C:\SomeOutputFolder HIP

for lazy

  • .\build\release\virtualenv\Scripts\activate.bat
  • TensileCreateLibrary --architecture gfx1032 --code-object-version default --merge-files --separate-architectures --lazy-library-loading --library-format msgpack .\library\src\blas3\Tensile\Logic\asm_full C:\SomeOutputFolder HIP

TensileCreateLibrary generated the kernel and TensileLibrary files without any error.

We now have our kernel and tensilelibrary files in the C:\SomeOutputFolder folder.
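
A quick way to confirm the non-lazy run produced what is needed is to look for the two file names mentioned in the first post, Kernels.so-000-gfx1032.hsaco and TensileLibrary.dat. The sketch below assumes those names carry over to other architectures by swapping the gfx part; other Tensile versions or the lazy mode may name files differently, and missing_outputs is a hypothetical helper.

```python
from pathlib import Path


def missing_outputs(folder, arch):
    """Return expected non-lazy output file names that are absent from folder.

    Assumption: the file names follow the pattern seen in this thread.
    """
    expected = [
        f"Kernels.so-000-{arch}.hsaco",
        "TensileLibrary.dat",
    ]
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


# Example call (path taken from the commands above):
# print(missing_outputs(r"C:\SomeOutputFolder", "gfx1032"))
```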

Attachments

files I generated for gfx1032;
gfx1032_non-lazy-rocblas-dev-branch.zip
gfx1032_lazy-rocblas-dev-branch.zip
gfx1032_none_lazy-rocm-5.7.1.zip

@jasyuiop
Author

jasyuiop commented Feb 1, 2024

I use openhermes-2.5-mistral-7b.Q6_K.gguf. I put the kernel and TensileLibrary files I shared above (https://github.com/LostRuins/koboldcpp/files/14129073/gfx1032_none_lazy-rocm-5.7.1.zip) under AMD\ROCm\5.7\bin\rocblas\library. I compiled the latest koboldcpp-rocm version myself for gfx1032. I am using HIP SDK 5.7.1.

My launch parameters for openhermes are as follows (kcpps):

{"model": null, "model_param": "D:/Ai/models/openhermes-2.5-mistral-7b.Q6_K.gguf", "port": 5001, "port_param": 5000, "host": "", "launch": false, "lora": null, "config": null, "threads": 8, "blasthreads": 8, "highpriority": false, "contextsize": 8192, "blasbatchsize": 512, "ropeconfig": [1.0, 10000.0], "smartcontext": false, "noshift": false, "bantokens": null, "forceversion": 0, "nommap": false, "usemlock": false, "noavx2": false, "debugmode": 0, "skiplauncher": false, "hordeconfig": null, "noblas": false, "useclblast": null, "usecublas": ["normal", "0"], "usevulkan": null, "gpulayers": 33, "tensor_split": null, "onready": "", "multiuser": 1, "remotetunnel": false, "foreground": false, "preloadstory": null, "quiet": false, "checkforupdates": 0, "ssl": null}
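
Since the settings file is plain JSON, the handful of values that matter for GPU offload can be pulled out with a few lines of Python. This is only an illustrative sketch: summarize_kcpps is a made-up helper, and the choice of which keys to show is an assumption about what is most relevant here. Note that in the config above the ROCm build is driven through the usecublas key.

```python
import json


def summarize_kcpps(text):
    """Pick out the GPU-offload related settings from a .kcpps JSON string."""
    cfg = json.loads(text)
    keys = ["model_param", "gpulayers", "contextsize", "blasbatchsize", "usecublas"]
    # Missing keys are reported as None rather than raising.
    return {key: cfg.get(key) for key in keys}


if __name__ == "__main__":
    # Trimmed copy of the settings shown above.
    sample = (
        '{"model_param": "D:/Ai/models/openhermes-2.5-mistral-7b.Q6_K.gguf", '
        '"gpulayers": 33, "contextsize": 8192, "blasbatchsize": 512, '
        '"usecublas": ["normal", "0"]}'
    )
    for key, value in summarize_kcpps(sample).items():
        print(f"{key}: {value}")
```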

This is the result:

Processing Prompt [BLAS] (316 / 316 tokens)
Generating (250 / 250 tokens)
ContextLimit: 566/8192, Processing:2.01s (6.4ms/T), Generation:11.61s (46.4ms/T), Total:13.62s (54.5ms/T = 18.36T/s)
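
The figures in that summary line can be cross-checked with simple arithmetic. The overall 18.36 T/s is the 250 generated tokens divided by the 13.62 s total, so it includes prompt processing time; the generation-only rate is somewhat higher. A small Python sketch (the function names are hypothetical, and the regular expression is written only for the line format shown above):

```python
import re

LINE = ("ContextLimit: 566/8192, Processing:2.01s (6.4ms/T), "
        "Generation:11.61s (46.4ms/T), Total:13.62s (54.5ms/T = 18.36T/s)")


def parse_times(line):
    """Extract processing, generation and total seconds from a summary line."""
    pattern = r"Processing:([\d.]+)s.*Generation:([\d.]+)s.*Total:([\d.]+)s"
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("unrecognized summary line")
    return tuple(float(group) for group in match.groups())


def rates(prompt_tokens, generated_tokens, line):
    """Recompute per-token times and throughput from the raw timings."""
    processing, generation, total = parse_times(line)
    return {
        "prompt_ms_per_token": 1000 * processing / prompt_tokens,
        "generation_ms_per_token": 1000 * generation / generated_tokens,
        "overall_tokens_per_s": generated_tokens / total,
        "generation_only_tokens_per_s": generated_tokens / generation,
    }


if __name__ == "__main__":
    # 316 prompt tokens and 250 generated tokens, as in the log above.
    for name, value in rates(316, 250, LINE).items():
        print(f"{name}: {value:.2f}")
```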

I see 7.7 GB of VRAM usage in Task Manager, and I think the result is great. I can say that I no longer need to dual-boot for LLMs :)

If you want to save people who have gfx1032 cards from compiling their own .exe like I did, you can add these files to the pre-built binaries 😄 @YellowRoseCx

@YellowRoseCx

Adding them into KoboldCpp-ROCm 1.57.1.yr1, hopefully everything works as intended xD
Thanks!

@jasyuiop
Author

jasyuiop commented Feb 11, 2024

I realized later that the "lazy" one I shared was incomplete and in fact unusable, so I added a note at the top of this post, #655 (comment), and then created and attached a "non-lazy" build for HIP SDK 5.7.1. The "non-lazy" one works smoothly and properly, so I recommend adding that one in the new version. I saw that the "lazy" one was added in the new version; unfortunately that one will not work :( I am adding the link again to avoid confusion @YellowRoseCx
https://github.com/LostRuins/koboldcpp/files/14129073/gfx1032_none_lazy-rocm-5.7.1.zip

@YellowRoseCx

I can't use the non-lazy one, because then I can't use the other ones for gfx1031: it would overwrite the file TensileLibrary.dat.
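
To see the clash concretely before copying anything, one can list which file names in a new build already exist in the target library folder. This is a minimal Python sketch with a hypothetical helper name (would_overwrite); it only compares names and does not inspect file contents.

```python
from pathlib import Path


def would_overwrite(src_dir, dest_dir):
    """Return names of files in src_dir that already exist in dest_dir."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    return sorted(
        item.name
        for item in src_dir.iterdir()
        if item.is_file() and (dest_dir / item.name).exists()
    )


# Example call (folder names are placeholders):
# print(would_overwrite(r"C:\SomeOutputFolder",
#                       r"C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library"))
```

With separate non-lazy builds for gfx1031 and gfx1032, the per-architecture kernel files have distinct names, so the shared TensileLibrary.dat is the name that would show up in this list.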

@jasyuiop
Author

jasyuiop commented Feb 11, 2024

Yes, that would be a problem; I didn't think about that. gfx1032 owners will have to compile it themselves, then. I wrote up how to compile and create an exe on Discord, and I'll share it here:

Copy make_pyinstaller_exe_rocm_only.bat to a new .bat file and change only the ROCm version in it, from 5.5 to 5.7.

Then run that .bat file. It will create the exe under koboldcpp-rocm\dist

@YellowRoseCx

YellowRoseCx commented Feb 11, 2024

Could you try compiling for GPU targets gfx1031 and gfx1032 together? It should then output only one TensileLibrary.dat.

@jasyuiop
Author

I'm glad you told me that :) It compiled without any problems. I used the rocBLAS and Tensile rel-5.7.1 branches.

python rmake.py -a gfx1031;gfx1032 --merge-architectures --no-lazy-library-loading -t "D:\Ai\5-7-1\Tensile" -d -j 16 -v

I'll explain step by step how I compiled it a little later, just for information :)

Attachments

gfx1031_gfx1032_none-lazy-rocm.5.7.1.zip

@jasyuiop
Author

Install

Git for Windows
Visual Studio 2022 Build Tools

  • Tick “Desktop development with C++” workload.

ROCm Windows SDK (i used 5.7.1)
Strawberry perl
python 3.11

ADD PATH

Cmake and Ninja:

  • C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin
  • C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\Ninja

Git:

  • C:\Program Files\Git\bin

Perl:

  • C:\Strawberry\perl\bin

ROCM:

  • C:\Program Files\AMD\ROCm\5.7\bin

vcpkg

Rocblas

go to another folder for example downloads etc.

Tensile

go to another folder

Open x64 native tools(without Admin) and go to the rocblas folder

  • python rdeps.py
  • python rmake.py -a gfx1031;gfx1032 --merge-architectures --no-lazy-library-loading -t "D:\Ai\5-7-1\Tensile" -d -j 16 -v

After the rmake.py command has finished, open the x64 native tools console as ADMIN

  • cmake --install build\release --prefix "C:\Program Files\AMD\ROCm\5.7"

@jasyuiop
Author

I always get this error when compiling with the parameters --lazy-library-loading --no-merge-architectures. If someone can tell me how to solve it, I can also compile the "lazy" one for gfx1031 and gfx1032.

I don't understand why; it compiles with --merge-architectures --no-lazy-library-loading without any error.

Reading logic files: Launching 16 threads for 108 tasks...
Reading logic files: Done.
[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||] 100% (0.4 secs elapsed)
Using fallback for arch: gfx1031
Using fallback for arch: gfx1032
# Writing Custom CMake
# Writing Kernels...
Generating kernels: Launching 16 threads...
Generating kernels: Done.
*
Compiling source kernels: Launching 16 threads...
Compiling source kernels: Done.
# Kernel Building elapsed time = 82.0 secs
# Tensile Library Writer DONE
################################################################################

[4/257] library\src\CMakeFiles\TENSILE_LIBRARY_TARGET.dir\utility.bat ecc6f16db1efb076
FAILED: library/src/CMakeFiles/TENSILE_LIBRARY_TARGET.util
library\src\CMakeFiles\TENSILE_LIBRARY_TARGET.dir\utility.bat ecc6f16db1efb076
Error copying file (if different) from "D:\Ai\5-7-1\rocBLAS\build\release\Tensile\library\TensileLibrary_lazy_gfx1032.dat" to "D:/Ai/5-7-1/rocBLAS/build/release/Tensile/library".
Batch file failed at line 61 with errorcode 1
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "D:\Ai\5-7-1\rocBLAS\rmake.py", line 512, in <module>
    main()
  File "D:\Ai\5-7-1\rocBLAS\rmake.py", line 505, in main
    if run_cmd(exe, opts):
       ^^^^^^^^^^^^^^^^^^
  File "D:\Ai\5-7-1\rocBLAS\rmake.py", line 468, in run_cmd
    proc = subprocess.run(program, check=True, stderr=subprocess.STDOUT, shell=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.2288.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'ninja.exe -j 16 --verbose all' returned non-zero exit status 1

@YellowRoseCx

I'm building a new koboldcpp version now to see if it works

@jasyuiop
Author

jasyuiop commented Feb 11, 2024

By the way, one thing I noticed is that the TensileLibrary.dat file may depend on the Tensile version rather than on the cards.

When I check the SHA, it gives the same result as my previous build. I also compared it with the kernel and library files from your first build, where you added gfx1031 support. I think that build was compiled (rocBLAS, Tensile) against HIP SDK 5.5.1, and that is why both the kernel and TensileLibrary SHAs differ from mine.

With new HIP versions and card support, it seems to work fine if you pick a base version (SDK, Tensile, rocBLAS) and ask card owners to compile with that version and send in the kernel file.
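
For anyone who wants to repeat the comparison, here is a minimal Python sketch of a file hash check. The thread does not say which SHA variant was used, so SHA-256 is an assumption, and the function names are made up for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical SHA-256 digests."""
    return sha256_of(path_a) == sha256_of(path_b)


# Example call (paths are placeholders):
# print(same_content(r"build_a\TensileLibrary.dat", r"build_b\TensileLibrary.dat"))
```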

@hiepxanh

I have a 6600 XT card now. Can I just use the zip file, or do I have to go through the build steps like you did? @jasyuiop I think that would be a bit too much overhead for me.

@jasyuiop
Author

You don't need to bother with compiling the kernel or koboldcpp, I compiled the kernel for gfx1032 and @YellowRoseCx added it to the new releases, just do the following and you're good

@hiepxanh

Aw so sweet, thank you so much @jasyuiop

@GoldenNocturne

Now let me explain here, I have been struggling with the rmake.py command for two days, even if I pass -a with gfx1032 or other parameters, I still get an error. If you are not as unlucky as me, you may not get an error here.

It doesn't matter if you also get an error, the command just needs to generate some things and put them in place

@jasyuiop for me it is stuck at:
[0/2] Re-checking globbed directories...
[2/400] Generating prototypes from C:/AI/rocBLAS/library/src.

Should i try waiting even longer or has the command finished doing what's needed?

@jasyuiop
Author

jasyuiop commented Feb 20, 2024

No, you should wait. But if you proceed as in the message you quoted, you may get an error.

If you follow all the steps as I describe here, you should not get any error: #655 (comment)

The reason I got an error there was that I was missing something; I realized it too late :)

@GoldenNocturne

Thanks. I'm actually trying to build for gfx1010, how should i adapt the process in the quoted comment?

@jasyuiop
Author

If you followed exactly the same steps, you only need to change the architecture parameter to gfx1010 (don't forget to change the path to the Tensile folder, and set the -j parameter according to how many cores you have):

python rmake.py -a gfx1010 --merge-architectures --no-lazy-library-loading -t "D:\Ai\5-7-1\Tensile" -d -j 16 -v
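
The same command can be assembled programmatically when building for several targets. The following Python sketch uses only the flags that appear in this thread; rmake_command is a hypothetical helper, and defaulting -j to os.cpu_count() is an assumption about a sensible value rather than a recommendation from the rocBLAS project.

```python
import os


def rmake_command(archs, tensile_dir, jobs=None):
    """Build the rmake.py argument list used in this thread for the given targets."""
    if not archs:
        raise ValueError("at least one architecture is required")
    if jobs is None:
        # Assumption: one job per logical core is a reasonable default.
        jobs = os.cpu_count() or 1
    return [
        "python", "rmake.py",
        "-a", ";".join(archs),          # multiple targets are separated by ';'
        "--merge-architectures",
        "--no-lazy-library-loading",
        "-t", str(tensile_dir),         # path to the local Tensile checkout
        "-d",
        "-j", str(jobs),
        "-v",
    ]


if __name__ == "__main__":
    cmd = rmake_command(["gfx1010"], r"D:\Ai\5-7-1\Tensile", jobs=16)
    # Joined with spaces for display only; quote the path yourself if it contains spaces.
    print(" ".join(cmd))
    # To actually run it from the rocBLAS folder:
    # import subprocess; subprocess.run(cmd, check=True)
```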
