Feature add cublas support #65
CMakeLists.txt (Outdated)

```
@@ -39,6 +39,7 @@ option(RWKV_FMA "rwkv: enable FMA"
 # 3rd party libs
 option(RWKV_ACCELERATE "rwkv: enable Accelerate framework" ON)
 option(RWKV_OPENBLAS "rwkv: use OpenBLAS" OFF)
+option(RWKV_CUBLAS "rwkv: use CuBLAS" OFF)
```
Suggested change:

```diff
-option(RWKV_CUBLAS "rwkv: use CuBLAS" OFF)
+option(RWKV_CUBLAS "rwkv: use cuBLAS" OFF)
```
CMakeLists.txt (Outdated)

```
        ${GGML_CUDA_SOURCES})

target_include_directories(ggml PUBLIC ${CMAKE_SOURCE_DIR}/ggml/include/ggml)
target_compile_features(ggml PUBLIC c_std_11) # don't bump
```
Suggested change:

```diff
-target_compile_features(ggml PUBLIC c_std_11) # don't bump
+target_compile_features(ggml PUBLIC c_std_11) # Don't bump
```
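For readers following along, here is a rough sketch of how such a cuBLAS toggle is typically wired up in ggml-based projects (modelled on llama.cpp's cuBLAS block; the exact variable names and file paths below are assumptions, not the code from this PR):

```cmake
if (RWKV_CUBLAS)
    cmake_minimum_required(VERSION 3.17)
    find_package(CUDAToolkit)

    if (CUDAToolkit_FOUND)
        enable_language(CUDA)
        # Compile the CUDA kernels and enable the cuBLAS code path in ggml.
        set(GGML_CUDA_SOURCES ggml/src/ggml-cuda.cu ggml/src/ggml-cuda.h)
        add_compile_definitions(GGML_USE_CUBLAS)
        # Link against the CUDA runtime and cuBLAS libraries.
        set(RWKV_EXTRA_LIBS ${RWKV_EXTRA_LIBS} CUDA::cudart CUDA::cublas CUDA::cublasLt)
    else()
        message(WARNING "cuBLAS not found")
    endif()
endif()
```

The `${GGML_CUDA_SOURCES}` variable is then appended to the ggml target's sources, which is what the hunk above shows.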
README.md (Outdated)

```
@@ -63,6 +63,17 @@ cmake --build . --config Release

If everything went OK, `bin\Release\rwkv.dll` file should appear.

##### WIndows + CuBlas
```
Suggested change:

```diff
-##### WIndows + CuBlas
+##### Windows + cuBLAS
```
README.md (Outdated)

```
@@ -63,6 +63,17 @@ cmake --build . --config Release

If everything went OK, `bin\Release\rwkv.dll` file should appear.

##### WIndows + CuBlas
copy cuda libs into release
https://github.com/ggerganov/llama.cpp/pull/1271/files
```
It can be unclear to users where the actual files are. I suggest leaving a link to a specific release that contains the CUDA DLLs: https://github.com/ggerganov/llama.cpp/releases/tag/master-fab49c6
README.md (Outdated)

```
cmake .. -DRWKV_CUBLAS=ON
cmake --build . --config Release
```

**Important** `cudart64_12.dll`, `cublas64_12.dll`, `cublasLt64_12.dll` should be copied from `{CUDA}/bin` into `build/bin/Release`
Let's move this above the build commands, because it is a prerequisite for the build.
This is not required for compiling, but it is required for running: since there are no static cuBLAS libraries for Windows, we need to manually copy the DLL files.
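For example, assuming the CUDA installer set the standard `%CUDA_PATH%` environment variable, the copy step might look like this (DLL names taken from the README change above; adjust to your CUDA version):

```bat
copy "%CUDA_PATH%\bin\cudart64_12.dll"   build\bin\Release\
copy "%CUDA_PATH%\bin\cublas64_12.dll"   build\bin\Release\
copy "%CUDA_PATH%\bin\cublasLt64_12.dll" build\bin\Release\
```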
Great PR! Some minor comments, and the Ubuntu build needs to be fixed before merge. BTW, do you have numbers on how much faster it is with cuBLAS?
@saharNooby I'm trying to fix the compile error on Linux. About the performance improvement, I also want to discuss it with you: when I tested locally, it seems the GPU is never used even though the compile succeeds. Working on this; I hope to give you an answer soon.
@saharNooby Still getting 0% GPU usage... will check after a break.
I think you need to offload the model tensors to the GPU. Something like this:
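(The comment originally pointed at example code. A minimal sketch of what such offloading could look like here, using the single-argument `ggml_cuda_transform_tensor` API that appears later in this PR; the layer field names follow rwkv.cpp's model struct and should be treated as illustrative:)

```cpp
// Sketch: move each layer's matmul weight matrices to VRAM so that ggml's
// cuBLAS path can pick them up. Only mul_mat is GPU-accelerated in ggml,
// so only the tensors feeding ggml_mul_mat are worth offloading.
const int n_gpu = std::min(n_gpu_layers, model->n_layer);
size_t vram_total = 0;

for (int i = 0; i < n_gpu; ++i) {
    const auto & layer = model->layers[i];

    ggml_cuda_transform_tensor(layer.att_key);        vram_total += ggml_nbytes(layer.att_key);
    ggml_cuda_transform_tensor(layer.att_value);      vram_total += ggml_nbytes(layer.att_value);
    ggml_cuda_transform_tensor(layer.att_receptance); vram_total += ggml_nbytes(layer.att_receptance);
    ggml_cuda_transform_tensor(layer.att_output);     vram_total += ggml_nbytes(layer.att_output);
    ggml_cuda_transform_tensor(layer.ffn_key);        vram_total += ggml_nbytes(layer.ffn_key);
    ggml_cuda_transform_tensor(layer.ffn_value);      vram_total += ggml_nbytes(layer.ffn_value);
    ggml_cuda_transform_tensor(layer.ffn_receptance); vram_total += ggml_nbytes(layer.ffn_receptance);
}
```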
Yeah, I found there is also no VRAM usage; will try to load the tensors on the GPU.
@ggerganov rwkv.cpp uses the graph with `ggml_map_unary_f32` and `ggml_map_binary_f32` for computation; it seems like there is a memory access issue when the GPU and CPU are accessed at the same time.
@saharNooby @ggerganov Here are my test results. I am using a 3060 Ti (8 GB VRAM) and an i7-13700K (24 threads).

- thread = 24, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON
- thread = 24, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = OFF
- thread = 4, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON
- thread = 4, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = OFF
- thread = 1, model = RWKV-4-Raven-7B-v11-Q5_1, cublas = ON

Since I am using an i7-13700K, which is a very high-performance CPU, it is not so slow running on CPU only 😂
I am not able to finish all the work today, but will take time to finish it all by tomorrow.

Also, it seems the current version of ggml breaks the quantization logic, which means quantized models created before this release can no longer be used.
You must transform only the tensors that are used by `ggml_mul_mat`.

No other operators are currently supported for GPU, so there is no point in offloading the rest.

Yes, there have been some quantization changes recently. It is best to re-quantize all models using the latest version.
It looks like this actually might not conflict at the source code level, but the prints to stderr could probably be surrounded by checks for `print_errors`.
rwkv.cpp (Outdated)

```cpp
{
    const int n_gpu = std::min(n_gpu_layers, model->n_layer);

    fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
```
Suggested change:

```diff
-fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
+if (ctx->print_errors)
+    fprintf(stderr, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu);
```
rwkv.cpp
Outdated
ggml_cuda_transform_tensor(layer.ffn_receptance); vram_total += ggml_nbytes(layer.ffn_receptance); | ||
} | ||
|
||
fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024); |
Suggested change:

```diff
-fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
+if (ctx->print_errors)
+    fprintf(stderr, "%s: [cuBLAS] total VRAM used: %zu MB\n", __func__, vram_total / 1024 / 1024);
```
rwkv.cpp (Outdated)

```diff
@@ -227,7 +232,7 @@ struct rwkv_context * rwkv_init_from_file(const char * file_path, const uint32_t
 );

 RWKV_ASSERT_NULL(
-    model->data_type != 6,
+    model->data_type != 5,
```
This is Q4_2, which was removed last week; you should change the warning text to say Q4_2, and also keep an assert for Q4_3, which is still unsupported.
Will fix them all.

(This PR needs to be rebased on top of the new code.)
@saharNooby Your solution to the file reading issue is incorrect and does not work on Linux. Please use my actual fix #70, which I asked you to try earlier.

It works, thank you for the quick fix.

If you join the Discord, we can work together more quickly (https://discord.gg/qt9egFA7ve).
My plan is to do the ggml upgrade myself, so that I can add proper documentation and helpful messages for end users about why their models broke. After that, I think I will have no concerns about this PR.

Thanks, I will update the branch after your change.
…feature/add_cublas_support
…feature/add_cublas_support
Still has some problem with

Check that you are on the latest version of PyTorch (I don't know how you could be almost 2 years out of date, but worth checking anyway, I guess).

Found out the reason: I was using PyTorch 1.12.3; fixed it by upgrading to 2.0.1, thanks.
@saharNooby I have updated the branch; it should be fine to review and merge now.
rwkv.cpp
Outdated
for (int i = 0; i < n_gpu; ++i) { | ||
const auto & layer = model->layers[i]; | ||
|
||
ggml_cuda_transform_tensor(layer.ln1_weight); vram_total += ggml_nbytes(layer.ln1_weight); |
LN weights are tiny and have the same size as LN biases; I'm 95% sure they should not be on the GPU. This will not save much VRAM, but it will simplify the code.
Multiplications with these tensors will not be optimized if they are not on the GPU, though. Even very tiny tensors have to be on the GPU if other tensors are to interact with them.
But after multiplying by `ln.weight`, `x` is then added to `ln.bias`, which will be on the CPU -- so we will still need to download `x` from the GPU in both cases. And multiplying 1.5K floats (as I remember, the largest `n_embed` for RWKV) looks very okay for the CPU.
Maybe you're right; I did not look at the surrounding context.
If we really want to get rid of CPU-GPU transfer in the first layers, then all tensors of the first layers (including biases) need to be on the GPU.

Unfortunately, as I remember, ggml does not support any operations on GPU aside from matmul.
Somehow, by removing `ln.weight` from the offloaded tensors, the inference speed increased by 10-20%.
rwkv.cpp
Outdated
{ | ||
const int n_gpu = std::min(n_gpu_layers, model->n_layer); | ||
|
||
fprintf(stdout, "%s: [cuBLAS] offloading %d layers to GPU\n", __func__, n_gpu); |
I think the library should not write anything to stdout/stderr. Either a proper logging framework needs to be added (and it will not), or the library should be silent and leave actual logging to application code. Or else it will interfere with application output, which, for example, may be a chat bot interface.
> I think the library should not write anything to stdout/stderr.

Agreed.

> Either a proper logging framework needs to be added (and it will not), or the library should be silent and leave actual logging to application code.

I like using callbacks for this; I use a C++ lambda callback in my newest version of some code (only for internal library/implementation use), but there is a way to do it in pure C. For example, the Lua scripting language does this.
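As a concrete illustration, here is a minimal sketch of the pure-C callback approach (the `rwkv_log_*` names are hypothetical, not part of rwkv.cpp's actual API):

```c
#include <stdio.h>

/* Hypothetical callback type: the application receives each log line and
 * decides what to do with it (print it, ignore it, route it to a logger). */
typedef void (*rwkv_log_callback)(const char * message, void * user_data);

static rwkv_log_callback g_log_callback = NULL;
static void * g_log_user_data = NULL;

/* Hypothetical setter the application calls once at startup. */
void rwkv_set_log_callback(rwkv_log_callback callback, void * user_data) {
    g_log_callback = callback;
    g_log_user_data = user_data;
}

/* Inside the library, fprintf(stderr, ...) calls become this instead;
 * with no callback registered, the library stays completely silent. */
static void rwkv_log(const char * message) {
    if (g_log_callback != NULL) {
        g_log_callback(message, g_log_user_data);
    }
}

/* Example application-side callback that simply prints to stderr. */
static void print_to_stderr(const char * message, void * user_data) {
    (void) user_data;
    fprintf(stderr, "%s\n", message);
}
```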
I stored the `gpu_layer` and `vram_total` in the context, so C++ developers can easily get the results.
Summary of the concerns so far:

I have no other concerns so far. @yorkzero831 Are you okay with me editing this PR to do minor formatting changes?

Sure, please.

I am away from my PC now; it will take hours before I am able to fix the comments, so feel free to update it.

@saharNooby I got home now, so I will try to fix the comments; hope you are not doing the same thing.
#62

Support cuBLAS feature