Sequence mode prototype #89
Conversation
This is a prototype of sequence mode.

```
Load model ... 1.318s
Serial mode to process 30 tokens ... 2.116s
Sequence mode to process 30 tokens ... 0.509s
Logits total diff = 0.00000
Logits identical = TRUE
```

This is only for testing. It runs into precision and capacity limits at large lengths. The goal is to support sequences of up to 25k tokens. It is also likely that the dedicated single-token functions should be brought back. Again, this is only a prototype.
This PR is currently dependent on ggerganov/ggml#229 due to relying on a single computation graph. This dependency will most likely go away.
No implementation advice is needed yet; there is still some low-hanging fruit left for me to take care of. This is just to get the news out that this is being worked on.
Just to be sure -- at line 910
Yes, width n_embed and height sequence_len.
Huge matmul should be fast! (Some say; I myself don't know.)
The OP includes a benchmark that shows a 4x speedup on processing 30 tokens :)
Also, tests are passing? Pleasant surprise (I guess I no longer use the faulty ggml function).
we still build one, but that seems necessary for ggml.
This is a huge speedup when the same sequence length is used many times in a row. I intend to clean up this code very soon.
It doesn't actually matter
The macOS failure is probably a false positive; it works fine on my Mac, but I need someone on Apple Silicon to test.
I think this pull request is ready for a review (not merge yet). There are still probably some touch-ups I need to do, but the code is almost clean enough to be production-ready, I think. It does not support anywhere near 25k tokens (in fact, going too far above 64 will probably crash upstream builds of ggml), but it can be improved in the future.
Great PR! Aside from some nits, concerns are:
I'll test sanitizer builds on a new temporary branch, logan-dark-sequence-mode, since I have no permission to push workflow changes directly into this PR.
Lol, just adding |
I guess we may as well leave it on, since it seems to work fine on my Mac. Real Macs probably have AVX2 and FMA.
That's an option, but I have another idea I will try tomorrow: I will literally insert a macro at each line that prints the line number, and see on what line it stops executing. I did that before; it is a crude way, but it works. Then, once the line is known, I can try to rewrite that part.
@LoganDark I've tried to debug it, but I have no more ideas and/or willingness to go further. Looks like a compiler bug, IDK... Please change the macOS build command in
Oops, closed accidentally. @LoganDark You've said the PR is ready for review, but not merge; is it still not ready to be merged? |
Sanitizer is enabled to fix issues discovered when testing RWKV#89. It needs to be disabled as soon as possible (that is, as soon as master can be built on the macOS GitHub runner again).
I'll perform another pass over it personally and make sure. My main concern is that long sequence lengths build computation graphs so large that they cannot fit inside the ggml_cgraph struct; you'll hit ggml asserts at runtime that crash the program due to exceeding the node limit. I don't know if there's any way to fix this as long as ggml doesn't support iteration in cgraphs.
Yeah, 8523841 is what I missed, lmao!! |
I think I'd be okay merging this, since further improvements can be made in the future. I am experimenting with making performance graphs that compare sequence mode to serial mode. Here is one in log scale. That drop around 32 tokens is so not an outlier: I don't know what in the world cuBLAS is doing to achieve this, but I'm interested and also scared, lol.
I have the last comment (if nothing more changes in this PR) about BTW,
MacOS build was failing because of random ggml asserts. Maybe something overflowed here too. |
Those are unrelated |
For the record, it doesn't make me very happy to discard the files I used while developing this branch; that deprives future developers of the tests they would need to do the same work that I've done. But there isn't a framework for those kinds of tests (except for maybe the