MPT : clone wte to output at load time #3626
Conversation
I don't think that it is a good idea to add what is essentially a hack to gguf to deal with a llama.cpp loader shortcoming. If the tensor really needs to be duplicated (and, to be fair, llama.cpp may need to in some cases for partial offloading with CUDA), then that should be handled in the application. The special case for MPT in

There may also be an unintended side effect of this, which is that it will prevent using different quantizations for the token embeddings and the output layer. From what I understand, the quantization of the output tensor has a large impact on generation quality, while the token embeddings have very little impact. For that reason the output tensor is usually quantized at a higher bit rate.
For models where they're the same data, it seems fine to just use the highest precision you'd choose for either, that is, look up the token embeddings in higher precision for models that do this.

Mostly I find the file-level hacks more regrettable than code-level ones. Before MQA/GQA were well supported, some implementations of models that used them would duplicate the data that would be broadcast in the file, which not only bloated the file unnecessarily but also meant the file format was going to change again later once those limitations were dealt with.
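As a rough illustration of the "use the highest you'd choose for either" suggestion, here is a hedged sketch in C++; `QuantType` and `bits_per_weight` are made-up stand-ins, not llama.cpp's actual quantization enums:

```cpp
// Hedged sketch: when the token embeddings and the output projection are the same
// tensor, quantize that single tensor with whichever of the two per-tensor choices
// is more precise. Types and values here are illustrative placeholders only.
enum class QuantType { Q4_0, Q5_K, Q6_K, F16 };

static int bits_per_weight(QuantType t) {
    switch (t) {
        case QuantType::Q4_0: return 4;
        case QuantType::Q5_K: return 5;
        case QuantType::Q6_K: return 6;
        case QuantType::F16:  return 16;
    }
    return 0;
}

// Pick the higher-precision of the two intended quantizations for a tied tensor.
QuantType pick_tied_quant(QuantType embd_choice, QuantType output_choice) {
    return bits_per_weight(embd_choice) >= bits_per_weight(output_choice)
         ? embd_choice : output_choice;
}
```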
Maybe this is where this should actually be solved? This is not the only conceivable case where the same tensor may be used more than once in a single model/graph.

Having the model code look like a model that doesn't do weight tying is potentially confusing for anyone comparing it against the original code who doesn't know about this workaround, and as I understand it, writing it like the original does work when not offloading.
@slaren Is there any better way you think we can accomplish this, or should I close this PR and convert the MPT-based GPT4All models to save the output tensor twice in the GGUF file?
The quantization issue could probably be solved by exporting the tensor as the output layer rather than as the token embeddings; then the higher-quality quantization would be chosen automatically.

Ideally, we would also want to keep only one copy of the tensor in memory, but there is no escaping the fact that with partial offloading to GPU, sometimes we would also need a copy of the tensor in VRAM. So we would need to check whether the backend of the token embeddings and the output layer is different, and then copy the tensor as required. However, supporting this cleanly in the llama model loader code, in a way that isn't just a hack for this specific case, would require significant changes. I think this may become somewhat easier to do in the future with
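A minimal sketch of the kind of check described above, assuming hypothetical `Tensor` and `Backend` types and a `copy_to_backend` helper (none of these are llama.cpp or ggml API): alias the token-embedding tensor as the output tensor when both land on the same backend, and only make a second copy when partial offloading puts them on different backends.

```cpp
// Hedged sketch of the backend check described above, not actual llama.cpp loader code.
// `Tensor`, `Backend`, and `copy_to_backend` are hypothetical placeholders.
struct Tensor;                       // opaque weight tensor
enum class Backend { CPU, GPU };

struct TiedOutputPlan {
    Tensor * output;                 // tensor to use for the output projection
    bool     owns_copy;              // true if a duplicate was allocated on another backend
};

TiedOutputPlan resolve_tied_output(Tensor * tok_embd,
                                   Backend  embd_backend,
                                   Backend  out_backend,
                                   Tensor * (*copy_to_backend)(Tensor *, Backend)) {
    if (embd_backend == out_backend) {
        // Same backend: alias the token embeddings, no extra memory needed.
        return { tok_embd, false };
    }
    // Different backends (e.g. partial GPU offload): a second copy is unavoidable.
    return { copy_to_backend(tok_embd, out_backend), true };
}
```

The `owns_copy` flag only exists so the caller knows whether it has to free the duplicate later.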
It seems like upstream isn't interested in this change for the time being [1], and we are going to break compatibility with Nomic's previous conversion of MPT because of changes to the BPE tokenizer [2], so let's remove this change to minimize the diff. This reverts commit 69c505e.

[1] ggml-org#3626
[2] ggml-org#3252
Previous attempt was ggml-org#3626 (cherry picked from commit 549fe80) Signed-off-by: Jared Van Bortel <jared@nomic.ai>
This change removes the need to duplicate the token_embd.weight tensor on disk. Instead, it is cloned as output.weight at load time.
The tensor is not cloned while quantizing because quantize does not call llm_load_tensors.
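To make the described behaviour concrete, here is an illustrative sketch of the load-time fallback, with hypothetical `get_tensor` and `dup_tensor` callbacks standing in for the real loader plumbing (this is not the actual llama.cpp code):

```cpp
// Illustrative sketch only; `get_tensor` and `dup_tensor` are hypothetical callbacks
// and `mpt_model` is heavily simplified.
struct ggml_tensor;                  // treated as opaque here

struct mpt_model {
    ggml_tensor * tok_embd = nullptr;
    ggml_tensor * output   = nullptr;
};

void load_output_tensor(mpt_model & model,
                        ggml_tensor * (*get_tensor)(const char * name),
                        ggml_tensor * (*dup_tensor)(ggml_tensor * src)) {
    model.tok_embd = get_tensor("token_embd.weight");
    model.output   = get_tensor("output.weight");    // may be absent in tied-embedding MPT files
    if (model.output == nullptr) {
        // Tied embeddings: clone the embedding weights at load time instead of
        // storing a second copy of them in the GGUF file.
        model.output = dup_tensor(model.tok_embd);
    }
}
```

Since quantize never calls this load path, the clone only exists in the loaded model and the file on disk keeps a single copy of the weights.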