Feature add cublas support #65
CMakeLists.txt (Outdated)

```
@@ -39,6 +39,7 @@ option(RWKV_FMA "rwkv: enable FMA"
 # 3rd party libs
 option(RWKV_ACCELERATE "rwkv: enable Accelerate framework" ON)
 option(RWKV_OPENBLAS "rwkv: use OpenBLAS" OFF)
+option(RWKV_CUBLAS "rwkv: use CuBLAS" OFF)
```
Suggested change:

```diff
-option(RWKV_CUBLAS "rwkv: use CuBLAS" OFF)
+option(RWKV_CUBLAS "rwkv: use cuBLAS" OFF)
```
CMakeLists.txt (Outdated)

```
        ${GGML_CUDA_SOURCES})

target_include_directories(ggml PUBLIC ${CMAKE_SOURCE_DIR}/ggml/include/ggml)
target_compile_features(ggml PUBLIC c_std_11) # don't bump
```
Suggested change:

```diff
-target_compile_features(ggml PUBLIC c_std_11) # don't bump
+target_compile_features(ggml PUBLIC c_std_11) # Don't bump
```
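For readers following along, here is a rough sketch of how such a cuBLAS toggle is typically wired up in ggml-based projects (modelled on llama.cpp's cuBLAS block; the exact variable names and file paths below are assumptions, not the code from this PR):

```cmake
if (RWKV_CUBLAS)
    cmake_minimum_required(VERSION 3.17)
    find_package(CUDAToolkit)

    if (CUDAToolkit_FOUND)
        enable_language(CUDA)
        # Compile the CUDA kernels and enable the cuBLAS code path in ggml.
        set(GGML_CUDA_SOURCES ggml/src/ggml-cuda.cu ggml/src/ggml-cuda.h)
        add_compile_definitions(GGML_USE_CUBLAS)
        # Link against the CUDA runtime and cuBLAS libraries.
        set(RWKV_EXTRA_LIBS ${RWKV_EXTRA_LIBS} CUDA::cudart CUDA::cublas CUDA::cublasLt)
    else()
        message(WARNING "cuBLAS not found")
    endif()
endif()
```

The `${GGML_CUDA_SOURCES}` variable is then appended to the ggml target's sources, which is what the hunk above shows.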
README.md (Outdated)

```
@@ -63,6 +63,17 @@ cmake --build . --config Release

If everything went OK, `bin\Release\rwkv.dll` file should appear.

##### WIndows + CuBlas
```
Suggested change:

```diff
-##### WIndows + CuBlas
+##### Windows + cuBLAS
```
README.md (Outdated)

```
@@ -63,6 +63,17 @@ cmake --build . --config Release

If everything went OK, `bin\Release\rwkv.dll` file should appear.

##### WIndows + CuBlas
copy cuda libs into release
https://github.com/ggerganov/llama.cpp/pull/1271/files
```
It can be unclear to users where the actual files are. I suggest leaving a link to a specific release that contains the CUDA DLLs: https://github.com/ggerganov/llama.cpp/releases/tag/master-fab49c6
README.md (Outdated)

```
cmake .. -DRWKV_CUBLAS=ON
cmake --build . --config Release
```

**Important** `cudart64_12.dll`, `cublas64_12.dll`, `cublasLt64_12.dll` should be copied from `{CUDA}/bin` into `build/bin/Release`
Let's move this above the build commands, because it is a prerequisite for the build.
This is not required for compiling, but it is required for running: since there are no static cuBLAS libraries for Windows, we need to manually copy the DLL files.
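For example, assuming the CUDA installer set the standard `%CUDA_PATH%` environment variable, the copy step might look like this (DLL names taken from the README change above; adjust to your CUDA version):

```bat
copy "%CUDA_PATH%\bin\cudart64_12.dll"   build\bin\Release\
copy "%CUDA_PATH%\bin\cublas64_12.dll"   build\bin\Release\
copy "%CUDA_PATH%\bin\cublasLt64_12.dll" build\bin\Release\
```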
Great PR! Some minor comments, and the Ubuntu build needs to be fixed before merge. BTW, do you have numbers on how much faster it is with cuBLAS?
@saharNooby I'm trying to fix the compile error on Linux. About the performance improvement, I also want to discuss it with you: when I tested locally, it seems the GPU is never used even though the compile succeeds. Working on this; I hope to give you an answer soon.
@saharNooby Still getting 0% GPU usage... will check after a break.
I think you need to offload the model tensors to the GPU. Something like this:
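(The comment originally pointed at example code. A minimal sketch of what such offloading could look like here, using the single-argument `ggml_cuda_transform_tensor` API that appears later in this PR; the layer field names follow rwkv.cpp's model struct and should be treated as illustrative:)

```cpp
// Sketch: move each layer's matmul weight matrices to VRAM so that ggml's
// cuBLAS path can pick them up. Only mul_mat is GPU-accelerated in ggml,
// so only the tensors feeding ggml_mul_mat are worth offloading.
const int n_gpu = std::min(n_gpu_layers, model->n_layer);
size_t vram_total = 0;

for (int i = 0; i < n_gpu; ++i) {
    const auto & layer = model->layers[i];

    ggml_cuda_transform_tensor(layer.att_key);        vram_total += ggml_nbytes(layer.att_key);
    ggml_cuda_transform_tensor(layer.att_value);      vram_total += ggml_nbytes(layer.att_value);
    ggml_cuda_transform_tensor(layer.att_receptance); vram_total += ggml_nbytes(layer.att_receptance);
    ggml_cuda_transform_tensor(layer.att_output);     vram_total += ggml_nbytes(layer.att_output);
    ggml_cuda_transform_tensor(layer.ffn_key);        vram_total += ggml_nbytes(layer.ffn_key);
    ggml_cuda_transform_tensor(layer.ffn_value);      vram_total += ggml_nbytes(layer.ffn_value);
    ggml_cuda_transform_tensor(layer.ffn_receptance); vram_total += ggml_nbytes(layer.ffn_receptance);
}
```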
Yeah, I found there is also no VRAM usage; will try to load the tensors on the GPU.
@ggerganov rwkv.cpp uses the graph with `ggml_map_unary_f32` and `ggml_map_binary_f32` for computation; it seems like there is a memory access issue when the GPU and CPU are accessed at the same time.
@saharNooby @ggerganov Here are my test results. I am using a 3060 Ti (8 GB VRAM) and an i7-13700K (24 threads).

- thread = 24, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON
- thread = 24, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = OFF
- thread = 4, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON
- thread = 4, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = OFF
- thread = 1, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON

Since I am using an i7-13700K, which is a very high-performance CPU, it is not so slow running on CPU only 😂
I am not able to finish all the work today, but will take time to finish it all by tomorrow.

Also, it seems the current version of ggml breaks the quantization logic, which means quantized models created before this release can no longer be used.
You must transform only the tensors that are used by `ggml_mul_mat`.

No other operators are currently supported for GPU, so there is no point in offloading the rest.

Yes, there have been some quantization changes recently. It is best to re-quantize all models using the latest version.
It looks like this actually might not conflict at the source code level, but the prints to stderr could probably be surrounded by checks for `print_errors`.
rwkv.cpp (Outdated)

```cpp
{
    const int n_gpu = std::min(n_gpu_layers, model->n_layer);

    fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
```
Suggested change:

```diff
-fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
+if (ctx->print_errors)
+    fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
```
rwkv.cpp
Outdated
ggml_cuda_transform_tensor(layer.ffn_receptance); vram_total += ggml_nbytes(layer.ffn_receptance); | ||
} | ||
|
||
fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024); |
Suggested change:

```diff
-fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
+if (ctx->print_errors)
+    fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
```
rwkv.cpp (Outdated)

```diff
@@ -227,7 +232,7 @@ struct rwkv_context * rwkv_init_from_file(const char * file_path, const uint32_t
 );

 RWKV_ASSERT_NULL(
-    model->data_type != 6,
+    model->data_type != 5,
```
This is Q4_2, which was removed last week; you should change the warning text to say Q4_2, and also keep an assert for Q4_3, which is still unsupported.
Will fix them all.

(This PR needs to be rebased on top of the new code.)
@saharNooby Your solution to the file reading issue is incorrect and does not work on Linux. Please use my actual fix #70, which I asked you to try earlier.

It works, thank you for the quick fix.

If you join the Discord, we can work together more quickly (https://discord.gg/qt9egFA7ve).
My plan is to do the ggml upgrade myself, so that I can add proper documentation and helpful messages for end users about why their models broke. After that, I think I will have no concerns about this PR.

Thanks, I will update the branch after your change.
…feature/add_cublas_support
…feature/add_cublas_support
Still has some problem with

Check that you are on the latest version of PyTorch (I don't know how you could be almost 2 years out of date, but worth checking anyway, I guess).

Found out the reason: I was using PyTorch 1.12.3; fixed it by upgrading to 2.0.1, thanks.
@saharNooby I have updated the branch; it should be fine to review and merge now.
rwkv.cpp
Outdated
for (int i = 0; i < n_gpu; ++i) { | ||
const auto & layer = model->layers[i]; | ||
|
||
ggml_cuda_transform_tensor(layer.ln1_weight); vram_total += ggml_nbytes(layer.ln1_weight); |
LN weights are tiny and have the same size as LN biases; I'm 95% sure they should not be on the GPU. This will not save much VRAM, but it will simplify the code.
Multiplications with these tensors will not be optimized if they are not on the GPU, though. Even very tiny tensors have to be on the GPU if other tensors are to interact with them.
But after multiplying by `ln.weight`, `x` is then added to `ln.bias`, which will be on the CPU -- so we will still need to download `x` from the GPU in both cases. And multiplying 1.5K floats (as I remember, the largest `n_embed` for RWKV) looks very okay for the CPU.
Maybe you're right; I did not look at the surrounding context.
If we really want to get rid of CPU-GPU transfer in the first layers, then all tensors of the first layers (including biases) need to be on the GPU.

Unfortunately, as I remember, ggml does not support any operations on GPU aside from matmul.
Somehow, by removing `ln.weight` from the offloaded tensors, the inference speed increased by 10-20%.
rwkv.cpp
Outdated
{ | ||
const int n_gpu = std::min(n_gpu_layers, model->n_layer); | ||
|
||
fprintf(stdout, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu); |
I think the library should not write anything to stdout/stderr. Either a proper logging framework needs to be added (and it will not), or the library should be silent and leave actual logging to application code. Or else it will interfere with application output, which, for example, may be a chat bot interface.
> I think the library should not write anything to stdout/stderr.

Agreed.

> Either a proper logging framework needs to be added (and it will not), or the library should be silent and leave actual logging to application code.

I like using callbacks for this; I use a C++ lambda callback in my newest version of some code (only for internal library/implementation use), but there is a way to do it in pure C. For example, the Lua scripting language does this.
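As a concrete illustration, here is a minimal sketch of the pure-C callback approach (the `rwkv_log_*` names are hypothetical, not part of rwkv.cpp's actual API):

```c
#include <stdio.h>

/* Hypothetical callback type: the application receives each log line and
 * decides what to do with it (print it, ignore it, route it to a logger). */
typedef void (*rwkv_log_callback)(const char * message, void * user_data);

static rwkv_log_callback g_log_callback = NULL;
static void * g_log_user_data = NULL;

/* Hypothetical setter the application calls once at startup. */
void rwkv_set_log_callback(rwkv_log_callback callback, void * user_data) {
    g_log_callback = callback;
    g_log_user_data = user_data;
}

/* Inside the library, fprintf(stderr, ...) calls become this instead;
 * with no callback registered, the library stays completely silent. */
static void rwkv_log(const char * message) {
    if (g_log_callback != NULL) {
        g_log_callback(message, g_log_user_data);
    }
}

/* Example application-side callback that simply prints to stderr. */
static void print_to_stderr(const char * message, void * user_data) {
    (void) user_data;
    fprintf(stderr, "%s\n", message);
}
```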
I stored the `gpu_layer` and `vram_total` in the context, so C++ developers can easily get the results.
Summary of the concerns so far:

I have no other concerns so far. @yorkzero831 Are you okay with me editing this PR to do minor formatting changes?

Sure, please.

I am away from my PC now; it will take hours before I am able to fix the comments, so feel free to update it.

@saharNooby I got home now, so I will try to fix the comments; hope you are not doing the same thing.
#62

Support cuBLAS feature