
Feature add cublas support #65

Merged · 22 commits · May 29, 2023

Conversation

@yorkzero831 (Contributor) commented May 21, 2023

#62

Adds cuBLAS support.

CMakeLists.txt Outdated
@@ -39,6 +39,7 @@ option(RWKV_FMA "rwkv: enable FMA"
# 3rd party libs
option(RWKV_ACCELERATE "rwkv: enable Accelerate framework" ON)
option(RWKV_OPENBLAS "rwkv: use OpenBLAS" OFF)
option(RWKV_CUBLAS "rwkv: use CuBLAS" OFF)
Collaborator:

Suggested change:
- option(RWKV_CUBLAS "rwkv: use CuBLAS" OFF)
+ option(RWKV_CUBLAS "rwkv: use cuBLAS" OFF)

CMakeLists.txt Outdated
${GGML_CUDA_SOURCES})

target_include_directories(ggml PUBLIC ${CMAKE_SOURCE_DIR}/ggml/include/ggml)
target_compile_features(ggml PUBLIC c_std_11) # don't bump
Collaborator:

Suggested change:
- target_compile_features(ggml PUBLIC c_std_11) # don't bump
+ target_compile_features(ggml PUBLIC c_std_11) # Don't bump

README.md Outdated
@@ -63,6 +63,17 @@ cmake --build . --config Release

If everything went OK, `bin\Release\rwkv.dll` file should appear.

##### WIndows + CuBlas
Collaborator:

Suggested change:
- ##### WIndows + CuBlas
+ ##### Windows + cuBLAS

README.md Outdated
@@ -63,6 +63,17 @@ cmake --build . --config Release

If everything went OK, `bin\Release\rwkv.dll` file should appear.

##### WIndows + CuBlas
copy cuda libs into release
https://github.com/ggerganov/llama.cpp/pull/1271/files
Collaborator:

It can be unclear for users where the actual files are. I suggest leaving a link to a specific release that contains the CUDA DLLs: https://github.com/ggerganov/llama.cpp/releases/tag/master-fab49c6

README.md Outdated
cmake .. -DRWKV_CUBLAS=ON
cmake --build . --config Release
```
**Important** `cudart64_12.dll`, `cublas64_12.dll`, `cublasLt64_12.dll` should be copied from `{CUDA}/bin` into `build/bin/Release`
Collaborator:

Let's move it above the build commands, because this is a prerequisite for the build.

Contributor Author:

This is not required for compiling, only for running: since there are no static cuBLAS libraries for Windows, we need to manually copy the DLL files.
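For reference, one possible way to copy those files on Windows, run from the repository root (a sketch only, assuming the CUDA Toolkit installer has set the `%CUDA_PATH%` environment variable and that the `*_12` DLL names match the installed CUDA version):

```
copy "%CUDA_PATH%\bin\cudart64_12.dll"   build\bin\Release\
copy "%CUDA_PATH%\bin\cublas64_12.dll"   build\bin\Release\
copy "%CUDA_PATH%\bin\cublasLt64_12.dll" build\bin\Release\
```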

@saharNooby linked an issue on May 21, 2023 that may be closed by this pull request
@saharNooby (Collaborator) commented:

Great PR! Some minor comments, and the Ubuntu build needs to be fixed before merge.

BTW, do you have numbers on how much faster it is with cuBLAS?

@yorkzero831 (Contributor Author) commented:

@saharNooby I'm trying to fix the compile error on Linux. About the performance improvement, I'd also like to discuss that with you: in my local tests the GPU never seems to be used even when the build succeeds. Working on it; I hope to have an answer for you soon.

@yorkzero831 changed the title from "Feature/add cublas support" to "[WIP] Feature/add cublas support" on May 21, 2023
@yorkzero831 (Contributor Author) commented:

@saharNooby Still getting 0% GPU usage... will check again after a break.

@ggerganov (Contributor) commented:

@yorkzero831

I think you need to offload the model tensors to the GPU.

Something like this:

https://github.com/ggerganov/llama.cpp/blob/905d87b70aa189623d500a28602d7a3a755a4769/llama.cpp#L1030-L1056

@yorkzero831 (Contributor Author) commented:

> I think you need to offload the model tensors to the GPU.
>
> Something like this:
>
> https://github.com/ggerganov/llama.cpp/blob/905d87b70aa189623d500a28602d7a3a755a4769/llama.cpp#L1030-L1056

Yeah, I found there is also no VRAM usage; I will try to load the tensors onto the GPU.

@yorkzero831 (Contributor Author) commented May 21, 2023

@ggerganov
Are you aware of this error?

[screenshot of the error]

rwkv.cpp uses the graph plus ggml_map_unary_f32 and ggml_map_binary_f32 for computation; it looks like there is a memory access issue when the GPU and CPU are accessed at the same time.

@yorkzero831 (Contributor Author) commented:

@saharNooby @ggerganov
Since rwkv.cpp uses map functions like https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv.cpp#L141, which may cause problems during computation when VRAM and RAM are accessed at the same time, I could only move part of the tensors to VRAM.

Here are my test results. I am using a 3060 Ti (8 GB VRAM) and an i7 13700K (24 threads).

thread = 24, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON

rwkv_init_from_file: [cublas] offloading 32 layers to GPU
rwkv_init_from_file: [cublas] total VRAM used: 4993 MB
Processing 185 prompt tokens, may take a while
Process time :25458.660364151 ms
Process time per token :137.61438034676218 ms

thread = 24, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = OFF

Loading RWKV model
Processing 185 prompt tokens, may take a while
Process time :34545.93539237976 ms
Process time per token :186.73478590475546 ms

thread = 4, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON

Loading RWKV model
rwkv_init_from_file: [cublas] offloading 32 layers to GPU
rwkv_init_from_file: [cublas] total VRAM used: 4993 MB
Processing 185 prompt tokens, may take a while
Process time :13440.271139144897 ms
Process time per token :72.6501142656481 ms

thread = 4, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = OFF

Loading RWKV model
Processing 185 prompt tokens, may take a while
Process time :42272.14694023132 ms
Process time per token :228.49809156881796 ms

thread = 1, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON

Loading RWKV model
rwkv_init_from_file: [cublas] offloading 32 layers to GPU
rwkv_init_from_file: [cublas] total VRAM used: 4993 MB
Processing 185 prompt tokens, may take a while
Process time :15258.867740631104 ms
Process time per token :82.48036616557353 ms

Since I am using an i7 13700K, which is a very high-performance CPU, running on the CPU alone is not that slow 😂
As the results show, with cuBLAS = ON it is better to use fewer threads; going from 24 threads down to 4 almost doubles the speed.

@yorkzero831 (Contributor Author) commented May 21, 2023

I am not able to finish all the work today, but will take time to finish it by tomorrow.
TODO:

  • make n_gpu_layers an input parameter
  • since Q4_2 has been removed from ggml, the README needs updating

@yorkzero831 (Contributor Author) commented:

Also, it seems the current version of ggml changes the quantization logic, which means quantized models created before this release will no longer be usable.

@ggerganov (Contributor) commented May 21, 2023

> @saharNooby @ggerganov Since rwkv.cpp uses map functions like https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv.cpp#L141, which may cause problems during computation when VRAM and RAM are accessed at the same time, I could only move part of the tensors to VRAM.

You must transform only the tensors that are used by ggml_mul_mat(), like:

  • layer.att_receptance
  • layer.att_key
  • layer.att_value
  • etc.

No other operators are currently supported on the GPU, so there is no point in offloading anything else (see the sketch below).

> Also, it seems the current version of ggml changes the quantization logic, which means quantized models created before this release will no longer be usable.

Yes, there have been some quantization changes recently. It is best to re-quantize all models using the latest version.
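For illustration, a minimal sketch of offloading only the matmul weights, in the spirit of the advice above. This is not the exact PR code: the `att_*`/`ffn_*` field names follow the tensors listed here and in the PR diff snippets, but the exact struct layout in rwkv.cpp may differ.

```
#include <algorithm>
#include <cstddef>

// Sketch: move only the weight matrices consumed by ggml_mul_mat() to VRAM for
// the first n_gpu_layers layers, and return the number of bytes offloaded.
// LN weights/biases and other small tensors stay in RAM, since ggml currently
// runs only matmul on the GPU.
static size_t offload_matmul_weights(struct rwkv_model * model, int n_gpu_layers) {
    const int n_gpu = std::min(n_gpu_layers, model->n_layer);
    size_t vram_total = 0;

    for (int i = 0; i < n_gpu; ++i) {
        const auto & layer = model->layers[i];

        ggml_cuda_transform_tensor(layer.att_key);        vram_total += ggml_nbytes(layer.att_key);
        ggml_cuda_transform_tensor(layer.att_value);      vram_total += ggml_nbytes(layer.att_value);
        ggml_cuda_transform_tensor(layer.att_receptance); vram_total += ggml_nbytes(layer.att_receptance);
        ggml_cuda_transform_tensor(layer.att_output);     vram_total += ggml_nbytes(layer.att_output);
        ggml_cuda_transform_tensor(layer.ffn_key);        vram_total += ggml_nbytes(layer.ffn_key);
        ggml_cuda_transform_tensor(layer.ffn_value);      vram_total += ggml_nbytes(layer.ffn_value);
        ggml_cuda_transform_tensor(layer.ffn_receptance); vram_total += ggml_nbytes(layer.ffn_receptance);
    }

    return vram_total;
}
```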

@york-yu-ctw mentioned this pull request on May 22, 2023
@yorkzero831 changed the title from "[WIP] Feature/add cublas support" to "Feature add cublas support" on May 22, 2023
@LoganDark (Contributor) commented May 24, 2023

It looks like this actually might not conflict at the source-code level, but the prints to stderr could probably be surrounded by checks for ctx->print_errors (which is brand new in my PR).

rwkv.cpp Outdated
{
const int n_gpu = std::min(n_gpu_layers, model->n_layer);

fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
Contributor:

Suggested change:
- fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
+ if (ctx->print_errors)
+     fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);

rwkv.cpp Outdated
ggml_cuda_transform_tensor(layer.ffn_receptance); vram_total += ggml_nbytes(layer.ffn_receptance);
}

fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
Contributor:

Suggested change:
- fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
+ if (ctx->print_errors)
+     fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);

rwkv.cpp Outdated
@@ -227,7 +232,7 @@ struct rwkv_context * rwkv_init_from_file(const char * file_path, const uint32_t
);

RWKV_ASSERT_NULL(
model->data_type != 6,
model->data_type != 5,
Contributor:

This is Q4_2, which was removed last week, but you should change the warning text to say Q4_2, and also keep a check for Q4_3, which is still unsupported.

Contributor Author:

Will fix them all.

@LoganDark (Contributor) commented:

(This PR needs to be rebased on top of the new code.)

@LoganDark (Contributor) commented May 24, 2023

@saharNooby Your solution to the file reading issue is incorrect and does not work on Linux. Please use my actual fix #70, which I asked you to try earlier.

@yorkzero831 (Contributor Author) commented May 24, 2023

> @saharNooby Your solution to the file reading issue is incorrect and does not work on Linux. Please use my actual fix #70, which I asked you to try earlier.

It works, thank you for the quick fix.

@LoganDark (Contributor) commented:

> It works, thank you for the quick fix.

If you join the Discord we can work together more quickly (https://discord.gg/qt9egFA7ve).

@saharNooby (Collaborator) commented:

My plan is to do the ggml upgrade myself, so that I can add proper documentation and helpful messages for end users explaining why their models broke. After that, I think I will have no concerns about this PR.

@yorkzero831 (Contributor Author) commented:

> My plan is to do the ggml upgrade myself, so that I can add proper documentation and helpful messages for end users explaining why their models broke. After that, I think I will have no concerns about this PR.

Thanks, I will update the branch after your change.

@yorkzero831 (Contributor Author) commented:

Still having a problem: 'Tensor' object has no attribute 'untyped_storage'.

@LoganDark (Contributor) commented:

> Still having a problem: 'Tensor' object has no attribute 'untyped_storage'.

Check that you are on the latest version of PyTorch (I don't know how you could be almost 2 years out of date, but worth checking anyway, I guess).

@yorkzero831 (Contributor Author) commented May 27, 2023

> > Still having a problem: 'Tensor' object has no attribute 'untyped_storage'.
>
> Check that you are on the latest version of PyTorch (I don't know how you could be almost 2 years out of date, but worth checking anyway, I guess).

Found out the reason: I was using PyTorch 1.12.3. Fixed it by upgrading to 2.0.1, thanks.
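For anyone who hits the same error: `Tensor.untyped_storage()` only exists in newer PyTorch releases, so upgrading is the usual fix, e.g.:

```
pip install --upgrade torch
```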

@yorkzero831 (Contributor Author) commented:

@saharNooby I have updated the branch; it should be fine to review and merge now.

rwkv.h: conversation resolved
rwkv.cpp Outdated
for (int i = 0; i < n_gpu; ++i) {
const auto & layer = model->layers[i];

ggml_cuda_transform_tensor(layer.ln1_weight); vram_total += ggml_nbytes(layer.ln1_weight);
Collaborator:

LN weights are tiny and have the same size as LN biases; I'm 95% sure they should not be on the GPU. This will not save much VRAM, but it will simplify the code.

Contributor (@LoganDark, May 28, 2023):

Multiplications with these tensors will not be optimized if they are not on the GPU, though. Even very tiny tensors have to be on the GPU if other tensors are to interact with them.

Collaborator:

But after multiplying by ln.weight, x is then added to ln.bias, which will be on the CPU -- so we will still need to download x from the GPU in both cases. And multiplying 1.5K floats (as I remember, the largest n_embed for RWKV) looks perfectly fine for the CPU.

Contributor:

Maybe you're right, I did not look at the surrounding context.

Collaborator:

If we really want to get rid of CPU-GPU transfers in the first layers, then all tensors of the first layers (including biases) need to be on the GPU.

Unfortunately, as I remember, ggml does not support any operations on the GPU aside from matmul.

Contributor Author:

Somehow, by no longer offloading ln.weight, inference speed increased by 10-20%.

rwkv.cpp Outdated
{
const int n_gpu = std::min(n_gpu_layers, model->n_layer);

fprintf(stdout, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
Collaborator:

I think the library should not write anything to stdout/stderr. Either a proper logging framework needs to be added (and it will not be), or the library should be silent and leave actual logging to application code. Otherwise it will interfere with the application's output, which may, for example, be a chat bot interface.

Contributor (@LoganDark, May 28, 2023):

> I think the library should not write anything to stdout/stderr.

Agreed.

> Either a proper logging framework needs to be added (and it will not be), or the library should be silent and leave actual logging to application code.

I like using callbacks for this. I use a C++ lambda callback in my newest version of some code (only for internal library/implementation use), but there is a way to do it in pure C; for example, the Lua scripting language does this. A rough sketch follows below.
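As an aside, a hypothetical sketch of what such a pure-C logging hook could look like (names like `rwkv_log_callback` are made up here; this is not the actual rwkv.cpp API):

```
/* Hypothetical sketch of a C-style logging hook; not the actual rwkv.cpp API.
   The application registers a function pointer, and the library routes its
   messages through it instead of printing to stdout/stderr. */
typedef void (*rwkv_log_callback)(const char * message, void * user_data);

struct rwkv_logger {
    rwkv_log_callback callback; /* NULL means the library stays silent */
    void * user_data;           /* opaque pointer handed back to the callback */
};

static void rwkv_log(const struct rwkv_logger * logger, const char * message) {
    if (logger != NULL && logger->callback != NULL) {
        logger->callback(message, logger->user_data);
    }
}
```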

Contributor Author:

I stored the gpu_layer and vram_total in the context, so C++ developers can easily get those values.

rwkv.cpp: conversation resolved
@saharNooby (Collaborator) commented:

Summary of the concerns so far:

  • adding a new argument to rwkv_load_from_file: compatibility needs to be tested, and the ABI preserved
  • the library should not write to stdout
  • the LN bias should not be on the GPU

I have no other concerns so far.

@yorkzero831 Are you okay with me editing this PR to do minor formatting changes?

@yorkzero831 (Contributor Author) commented:

> Summary of the concerns so far:
>
>   • adding a new argument to rwkv_load_from_file: compatibility needs to be tested, and the ABI preserved
>   • the library should not write to stdout
>   • the LN bias should not be on the GPU
>
> I have no other concerns so far.
>
> @yorkzero831 Are you okay with me editing this PR to do minor formatting changes?

Sure, please do.

@yorkzero831 (Contributor Author) commented:

> @yorkzero831 Are you okay with me editing this PR to do minor formatting changes?

I am away from my PC now, and it will be hours before I can address the comments, so feel free to update it.

@yorkzero831 (Contributor Author) commented:

@saharNooby I got home now, so I will try to address the comments; I hope you are not doing the same thing.

@saharNooby merged commit 241350f into RWKV:master on May 29, 2023
Successfully merging this pull request may close the issue: Add CuBLAS support.