add falcon7b example #231
Conversation
Nice!
Well done guys! Really excited for this
Cool - will take a look soon
does copies with
Are you working on a 40B branch already?
I'm not presently - being that it's big, it's a bit more inconvenient to hack on, as I'd need to use a bigger machine than I'm usually on for dev stuff.
Did you see https://huggingface.co/jploski/falcon40b-mini-shakespeare ?
I have now :) I still probably won't get to it soon - but if someone figures out how to support that before this lands I'm happy to incorporate it
I did some work regarding 40B support today: 27cf1ad After making my head nearly explode several times I reached a point where it generates okay-sounding prose from the falcon40b-mini-shakespeare model, but it does not match the Python version output exactly as it should (and as it does for the 7B version). The main obstacle seems to be that I am unable to make ggml_repeat broadcast multiple keys like the "k = torch.broadcast_to(k, q.shape)" in Python does (I get "1,2,1,2" instead of "1,1,2,2", so to say). Another big problem is that I only got the query matrix to look like the original Python one through some brute-force offset calculations and copying of subvectors. It probably won't scale at all. I'm under the impression that what needs to be done there can't be done using just reshaping or view operations. The memory format (as stored in Python and written by the conversion script) seems to be very difficult to work with in GGML. Or maybe I'm just too inexperienced in this tensor wrestling... Once again giving up in hope that someone with more clue can pick it up.
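For illustration, here is a tiny standalone sketch (toy numbers, plain loops, not ggml code) of the ordering difference described above: a tile-style repeat yields the "1,2,1,2" pattern, while the grouped-KV attention needs each KV head duplicated for its whole block of query heads ("1,1,2,2").

```cpp
// Toy illustration of the two repeat orderings: 2 KV heads shared by 4 query
// heads, so each KV head must be reused by 2 consecutive query heads.
#include <cstdio>

int main() {
    const int n_head    = 4;              // query heads
    const int n_head_kv = 2;              // KV heads
    const int kv[n_head_kv] = {1, 2};     // stand-ins for K[0], K[1]

    // tile-style repeat (the behavior described above): K0,K1,K0,K1 -> "1,2,1,2"
    printf("tiled      : ");
    for (int i = 0; i < n_head; ++i) printf("%d ", kv[i % n_head_kv]);
    printf("\n");

    // what the 40B attention needs (repeat2-style): K0,K0,K1,K1 -> "1,1,2,2"
    // query head i uses KV head i / (n_head / n_head_kv)
    printf("interleaved: ");
    for (int i = 0; i < n_head; ++i) printf("%d ", kv[i / (n_head / n_head_kv)]);
    printf("\n");
    return 0;
}
```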
As a further explanation of the code and where the complexity comes from here's a visualization of the fused_kqv weights format (from falcon40b-mini-shakespeare config): https://docs.google.com/spreadsheets/d/1FoM6pIUj23GMW4zO_G1hjmEnUacBxBKN/edit?usp=sharing&ouid=111096390735143611797&rtpof=true&sd=true
Maybe just make your own repeat operation that works the way you need? Seems like the repeat op is only implemented for float32, so there's just one version of the function required. You could create a new op and just cut-and-paste the existing implementation (line 8773 in f52d2a0). The function looks relatively simple also.
I added a new ggml_repeat2 function as suggested (3352043) - although the original ggml_repeat also has a backwards pass, and I'm not sure if it's the same for what I added. With some more tweaks (committed in 3bc786b) I now have a version which works with all falcon-mini-shakespeare models I have unleashed upon this world (both 7B and 40B configs). At least in 32-bit, haven't tested quantized yet. The (known) remaining problem is the for-loop-based splitting of query heads. I suspect it's gonna blow up with a real big model, either being slow or exceeding the max number of tensors (4096) allowed by GGML (or both). (Also it's possible that the implementation does some unnecessary operations like permutes or 4d instead of 3d, but that's minor.)
I tend to agree, that's almost what happened to me.
Ok I've been too afraid to ask, but how on earth are you doing these commits that aren't on any branch at all? I wanted to clone the repo and check out the commit but I have no idea how to.
Sorry for the confusion - these commits belong to the falcon40b branch of my fork: https://github.com/jploski/ggml/tree/falcon40b - apparently GitHub is not clever enough to indicate their source.
I was able to convert the real 40B model with my change here to reduce memory during HF conversion (only loads a single part into RAM at a time): jploski#1. It required some work to get inference to actually run. I had to increase
Also, uhh...
Although it didn't work, even with the crazy number of nodes it wasn't really that slow. It was about the same as a 65B Q4_K_M LLaMA model with llama.cpp. The mini-Shakespeare model seems fine:
Both models were quantized to Q5_0.
Thanks for checking! I was able to reproduce wrong output using an unquantized mini version trained with n_embd = 1024, n_head = 128, n_head_kv = 8. So there must still be a bug somewhere, which the previous three configs I used for testing did not catch.
If the problem is the complicated logic for dealing with the query heads, maybe the easiest way to deal with that is in the conversion tool, on the Torch or numpy side. It should be relatively easy to shuffle things around at that point. Reducing the complexity would make issues easier to debug too, I guess.
Yes, I agree that reshuffling the weights during conversion will perhaps be the final and most elegant/efficient solution. I just haven't wrapped my head around how changing the layout of the query_key_value tensor maps into fused_qkv, from which the qkv vectors are extracted (fused_qkv = self.query_key_value(hidden_states)). I'd also like to understand the current bug and have a working (if poorly implemented) version to improve on (even if the "improvement" will mean throwing away the overcomplicated code).
Understood and fixed in my falcon40b branch. Please recompile and try again.
It's alliiiiive!
Not a fan of the name it chose though. For reference, these are the changes I need to actually run it:

```diff
diff --git a/examples/falcon/main.cpp b/examples/falcon/main.cpp
index beac293..c77c610 100644
--- a/examples/falcon/main.cpp
+++ b/examples/falcon/main.cpp
@@ -198,6 +198,7 @@ bool falcon_model_load(const std::string & fname, falcon_model & model, gpt_voca
             ggml_type_sizef(GGML_TYPE_F32); // memory_v
         ctx_size += (5 + 10 * n_layer) * 256; // object overhead TODO:
+        ctx_size += ((size_t)3) * 1024 * 1024 * 1024;
         printf("%s: ggml ctx size = %6.2f MB\n", __func__, ctx_size/(1024.0*1024.0));
     }
diff --git a/include/ggml/ggml.h b/include/ggml/ggml.h
index e770603..83b0d84 100644
--- a/include/ggml/ggml.h
+++ b/include/ggml/ggml.h
@@ -194,7 +194,7 @@
 #define GGML_QNT_VERSION_FACTOR 1000 // do not change this
 #define GGML_MAX_DIMS          4
-#define GGML_MAX_NODES         4096
+#define GGML_MAX_NODES         262144
 #define GGML_MAX_PARAMS        256
 #define GGML_MAX_CONTEXTS      64
 #define GGML_MAX_OPT           4
```
Amazing work guys! So is https://github.com/jploski/ggml/tree/falcon40b the branch I should use to try converting and running GGMLs?
262144 nodes, wtf :-) Awesome to see it works so well!
I would suggest not converting them just yet - because if/when the qkv reshuffling during conversion is implemented, the binary format of the tensors would change again... which would make all the already published files incompatible.
OK fair enough!
As it turns out, GGML's API does have a Now testing a fix (jploski#2)
Happy to dive into this another day if there's other crashes!
I just copied that from the gpt-neox example - the common gpt-style tokenizer. An example of why that causes problems: fixing this unfortunately requires the file to also contain the tokenizer's "merges" list, which is not currently captured, in order to know what order to combine tokens in to match the original behavior - see #220 (comment).

(Well, some I've seen fail to decode non-English text, but that's because the converter script didn't store the tokens as raw bytes but did a UTF-8 decoding/encoding roundtrip on them first - token vocabulary items are bytes, not necessarily valid UTF-8 strings, and a Unicode character may be split across multiple tokens that do not individually contain complete valid UTF-8 sequences. This is just a problem with those convert scripts though, not the decoding logic, which is pretty simple.)
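To make the merge-order point concrete, here is a small hedged sketch with a made-up three-character vocabulary (nothing Falcon-specific): a greedy longest-match tokenizer and a merge-rank BPE tokenizer can split the same string differently, which is why the vocab list alone is not enough to reproduce the original tokenization.

```cpp
// Toy illustration (hypothetical vocab, not the Falcon tokenizer): greedy
// longest-match vs. BPE merge order can disagree on the same input.
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

int main() {
    // vocab: "a", "b", "c", "ab", "bc"; learned merge order: (b,c) before (a,b)
    const std::string text = "abc";

    // 1) greedy longest-prefix matching against the vocab (longest entries first)
    std::vector<std::string> vocab = {"ab", "bc", "a", "b", "c"};
    std::vector<std::string> greedy;
    for (size_t i = 0; i < text.size();) {
        for (const auto & tok : vocab) {
            if (text.compare(i, tok.size(), tok) == 0) {
                greedy.push_back(tok);
                i += tok.size();
                break;
            }
        }
    }

    // 2) BPE: start from single characters, apply merges in learned priority
    //    order (simplified loop; equivalent to rank-based BPE for this toy case)
    std::vector<std::string> bpe = {"a", "b", "c"};
    std::vector<std::pair<std::string, std::string>> merges = {{"b", "c"}, {"a", "b"}};
    for (const auto & m : merges) {
        for (size_t i = 0; i + 1 < bpe.size();) {
            if (bpe[i] == m.first && bpe[i + 1] == m.second) {
                bpe[i] += bpe[i + 1];
                bpe.erase(bpe.begin() + i + 1);
            } else {
                ++i;
            }
        }
    }

    printf("greedy:"); for (auto & t : greedy) printf(" %s", t.c_str()); printf("\n"); // ab c
    printf("bpe   :"); for (auto & t : bpe)    printf(" %s", t.c_str()); printf("\n"); // a bc
    return 0;
}
```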
No crash with any model / quantization after jploski#2.
Seems I have successfully compiled falcon.exe for Windows; trying to run the Falcon 7B version.
So there seems to be no 7B model version available, as @TheBloke so far only has the 40B versions (thanks @TheBloke for the great work though). So, while downloading (very slowly from the current PC) the original files from
but there seem to be two files,
This error indicates that the GGML file you downloaded is outdated and not compatible with the current implementation. As for DIY quantizing, you need to perform two steps:
I think something might be a tiny bit off with the new context size calculations:
I tried with other models, including the one I generated myself (which worked the other day). It would probably work on a machine with 6.8 exabytes of RAM, but sadly I don't have quite that much. (ignore, it was due to
edit:
It seems we both independently fixed it the same way (although I don't quite understand what was incorrect about the parameter passing).
Sorry, that wasn't a good explanation. I think the issue was auto choosing the wrong type, not necessarily the parameter passing itself. I'm pretty sure that auto was choosing a 32-bit type, which got passed as a pointer to something expecting a 64-bit value. So it would just start reading from the pointer address, and of course half of the 64-bit value would be basically random, so you'd get a crazy result.
Ah yes, good old C where you can read half a variable and be happy. ;)
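A minimal sketch of that bug class (made-up names, not the actual falcon loader code): an auto-deduced 32-bit local gets passed by address to something that reads 64 bits, so the upper half of the value is whatever happens to sit next to it in memory.

```cpp
// Hypothetical repro of the "auto picked a 32-bit type" bug class described
// above - not the actual falcon_model_load code.
#include <cstdint>
#include <cstdio>
#include <cstring>

// Imagine a helper that returns a 64-bit size read through a void pointer.
uint64_t read_size(const void * p) {
    uint64_t v;
    std::memcpy(&v, p, sizeof(uint64_t)); // reads 8 bytes no matter what p points to
    return v;
}

int main() {
    auto ctx_size = 1024 * 1024; // deduced as int (32 bits) - the subtle part

    // Passing a pointer to a 32-bit object where 64 bits are read is undefined
    // behavior: the low half is the real value, the high half is neighboring
    // stack garbage - hence absurd results like "6.8 exabytes".
    printf("size = %llu bytes\n", (unsigned long long) read_size(&ctx_size));
    return 0;
}
```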
CPU inference is working well here with the latest commits to |
Will be looking into merging this during the weekend. Wondering if #224 would be enough to avoid the extra ggml_repeat2 operator.
Most likely this is due to extra transpose and copies in the attention. Tensor overhead should be computed with ggml_tensor_overhead(). We should probably also think about an elegant way to plug this inference into llama.cpp
@jploski Convert the multiple pytorch*.bin checkpoint files to one GGML file
Okay, created a conda environment, installed pytorch, installed transformers, ran your line, getting:
what's missing?
@maddes8cht That's an issue with Transformers, not the conversion script specifically. You could possibly just comment out that line in
That would be amazing. I've wished for a long time that llama.cpp would be more like llm.cpp and support other model types. It feels to me like the gap between the capabilities of llama.cpp and non-Llama GGML is getting wider, just as more people are wishing to use non-Llama models, both because of licensing and because they want the special capabilities of other models, like StarCoder for coding. Anything that could be done to support more model types would be really appreciated by the community, I think. llama.cpp has become amazingly powerful and I'd love it if those features could become available for other models too.
I just gave it a quick try, but no luck. Merged in #224 locally, commented out the application of ggml_repeat2 for K... No more assertion failure, but unfortunately the result from this multiplication differs from the expected one, so the implicit broadcast must have produced something different than ggml_repeat2. But I do not understand #224 at this point, so maybe it can be achieved with some extra trickery.
@ggerganov Not sure if that is enough, though: the issue seems to be that ggml_tensor_overhead() + nelements * type_size is different from size_needed as calculated in ggml_new_tensor_impl.
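For reference, a rough sketch of the kind of per-tensor estimate being discussed (the shapes and types below are placeholders, not the actual falcon weights): each tensor costs its data size plus a fixed per-object overhead, and as noted above this may still not match what ggml_new_tensor_impl actually allocates.

```cpp
// Rough context-size estimate in the spirit of the discussion above.
// The tensor list is hypothetical; the point is the per-tensor formula
// ggml_tensor_overhead() + nelements * type_size.
#include "ggml/ggml.h"

#include <cstdio>

int main() {
    const int n_embd  = 4544;   // example hyperparameters, not read from a model
    const int n_vocab = 65024;
    const int n_layer = 32;

    size_t ctx_size = 0;

    // token embedding matrix: n_embd x n_vocab, f16
    ctx_size += ggml_tensor_overhead() +
                (size_t)(ggml_type_sizef(GGML_TYPE_F16) * n_embd * n_vocab);

    // two f32 norm weight vectors of size n_embd per layer (illustrative)
    ctx_size += (size_t)n_layer * 2 *
                (ggml_tensor_overhead() +
                 (size_t)(ggml_type_sizef(GGML_TYPE_F32) * n_embd));

    printf("estimated ctx size = %.2f MB\n", ctx_size / (1024.0 * 1024.0));
    return 0;
}
```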
Added falcon main and library based on llama.cpp. CPU inference works (getting ~260ms/token on 7B 16-bit falcon). Tested with 7B 16-bit and the two shakespeare models (both in 16-bit precision only).

TODO/WIP:
1) quantization runs and creates a ggjt 3 file, but something is wrong with the quantized model binary - even quantization from 16 -> 16 also fails, something is wrong in the tensors produced
2) mmap should work with quantized binaries once 1) is solved
3) CUDA support is mostly there, it's currently disabled (all CPU backend)
4) memory/context calculations are off, GPU memory calculations are wrong as well
5) the python conversion script is pre GGML 1 version (tokens without scores)
6) some stuff is still called "llama", some of it should be renamed to a generic name as it works for both
7) the GGML produced by the current python uses an old ftype method

Makefiles: cmake on windows with build tools works; the makefile for linux/msys was blind adjusted but not tested yet - possibly missed something

Changes to the codebase:
* repeat2 has been added to ggml (jploski - ggerganov/ggml#231) including the backward variant (untested, probably fails)
* minor changes to work with falcon (name length)
* libfalcon is the previous "llama.cpp" and falcon_main is the previous main.cpp
Oh, if you meant apage43:falcon, then it is out-of-date, as it only supports 7B (for which ggml_repeat2 was indeed not needed, because there is only one KV to repeat, so the ordering does not matter). apage43:falcon is where I forked my jploski/ggml (falcon40b) to implement 40B. And later the development moved over to cmp-nct/ggllm.cpp
Ok, got it. I'll postpone merging this PR then. Want to focus on some other things first. Ideally, I think we want to avoid the extra ggml_repeat2 operator.
Yes, I agree that we should remove repeat2. If it is done on the cmp-nct/ggllm.cpp branch, I will update my https://github.com/jploski/ggml/tree/falcon40b accordingly. I think it would be helpful if you could check whether the mat_mul broadcast could somehow do the trick, as you are most familiar with the broadcast implementation (I suppose). To understand how it needs to work, see: https://docs.google.com/spreadsheets/d/1FoM6pIUj23GMW4zO_G1hjmEnUacBxBKN/edit#gid=2097305276 What we need to come out of repeat/broadcast (and what repeat2 produces) is: N[0].K[0], N[0].K[0], N[0].K[1], N[0].K[1]
If by MQA you mean multi-query attention (sorry, you mentioned it earlier, but I did not manage to decipher it): in the original multi-query paper (which Falcon-7B adheres to), there is only one key vector and one value vector (n_head_kv=1), and this k/v vector is reused/shared by all queries (in the sense that each query vector is multiplied by the same key). Contrast this with the traditional GPT approach, where there is the same number of queries as keys/values. The motivation from the paper was to save (KV) memory while retaining approximately the same quality of inference. In the generalized n_head_kv>1 implementation, which Falcon-40B implements, and for which I found no paper, there are multiple "kv groups", each consisting of one KV pair and n queries that reuse/share that group's KV pair. This is somewhat of a compromise between having just one KV pair and one KV pair for each query. So the ordering issue in ggml_repeat(2) is to make sure that the right queries are matched to the right keys for the multiplication (and the resulting weights to the right values).
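A tiny sketch of that grouping (head counts taken from the 40B description below, but otherwise generic): with n_head query heads and n_head_kv KV pairs, query head i uses KV head i / (n_head / n_head_kv), which is exactly the consecutive-duplication ordering repeat2 produces.

```cpp
// Generic illustration of grouped multi-query attention head mapping.
// Head counts are just an example (40B-style: 128 query heads, 8 KV heads).
#include <cstdio>

int main() {
    const int n_head    = 128; // query heads
    const int n_head_kv = 8;   // shared KV heads (n_head_kv = 1 is classic MQA)
    const int group     = n_head / n_head_kv; // queries per KV pair (16 here)

    for (int i = 0; i < n_head; ++i) {
        // query head i attends with KV head i / group, so consecutive query
        // heads share the same KV pair: K[0] x16, K[1] x16, ...
        if (i % group == 0) {
            printf("query heads %3d..%3d -> KV head %d\n", i, i + group - 1, i / group);
        }
    }
    return 0;
}
```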
llama.cpp will have complete support for Falcon: #2717
#217
adapted from the gpt-neox example and from work started in ggerganov/llama.cpp#1602
only supports 7B right now - 40B multi-query attention gets hairier, as it's 128 query heads with 8 k and v heads, as opposed to 7B's 71 query heads with 1 k and v head