
MPT : clone wte to output at load time #3626

Closed

Conversation

cebtenzzre (Collaborator) commented Oct 14, 2023

This change removes the need to duplicate the token_embd.weight tensor on disk. Instead, it is cloned as output.weight at load time.

The tensor is not cloned while quantizing because quantize does not call llm_load_tensors.
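
For illustration, here is a minimal sketch of what cloning the tied weight at load time could look like with the ggml API. This is not the actual code from the PR: the helper name `clone_as_output` is hypothetical, and it assumes both tensors live in an allocating CPU-side ggml context.

```cpp
// Minimal sketch, assuming an allocating ggml context on the CPU backend.
// clone_as_output is a hypothetical helper; the real change lives in llm_load_tensors.
#include "ggml.h"
#include <string.h>

static struct ggml_tensor * clone_as_output(struct ggml_context * ctx,
                                            struct ggml_tensor * tok_embd) {
    // Allocate a tensor with the same type and shape as token_embd.weight...
    struct ggml_tensor * output = ggml_dup_tensor(ctx, tok_embd);
    // ...and copy the embedding weights into it, so the rest of the code can
    // treat it as a distinct output.weight without a second copy on disk.
    memcpy(output->data, tok_embd->data, ggml_nbytes(tok_embd));
    return output;
}
```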

slaren (Member) commented Oct 14, 2023

I don't think that it is a good idea to add what is essentially a hack to gguf to deal with a llama.cpp loader shortcoming. If the tensor really needs to be duplicated (and to be fair, llama.cpp may need to in some cases for partial offloading with CUDA), then that should be handled in the application. The special case for MPT in llm_load_tensors is not great either; it will make it harder to refactor this code in the future.

There may also be an unintended side effect of this, which is that it will prevent using different quantizations for the token embeddings and the output layer. From what I understand, the quantization of the output tensor has a large impact on generation quality, while the token embeddings have very little impact. For that reason, the output tensor is usually quantized at a higher bit rate.
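
As a concrete illustration of that point, here is a simplified, hypothetical sketch of the kind of per-tensor override applied during quantization (the real logic in llama.cpp's quantization code is more involved):

```cpp
// Simplified, hypothetical sketch of choosing a higher-bit quantization for the
// output projection than for the rest of the model.
#include "ggml.h"
#include <string>

static enum ggml_type pick_quant_type(const std::string & name, enum ggml_type default_type) {
    if (name == "output.weight") {
        // The output tensor has a large impact on generation quality,
        // so it is kept at a higher bit rate.
        return GGML_TYPE_Q6_K;
    }
    // token_embd.weight and the remaining tensors use the requested default type.
    return default_type;
}
```

With the two tensors tied to a single copy, a per-name choice like this can no longer distinguish the embeddings from the output layer.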

apage43 (Contributor) commented Oct 14, 2023

> different quantizations for the token embeddings and the output layer

For models where they're the same data, it seems fine to just use the highest precision you'd choose for either; that is, look up the token embeddings in higher precision for models that do this.

Mostly I find file-level hacks more regrettable than code-level ones. For example, before MQA/GQA were well supported, some implementations of models that used them would duplicate the data that would be broadcast in the file, which not only bloated the file unnecessarily but also meant the file format was going to change again once those limitations were addressed.

> If the tensor really needs to be duplicated (and to be fair, llama.cpp may need to in some cases for partial offloading with CUDA)

Maybe this is where it should actually be solved? This is not the only conceivable case where the same tensor may be used more than once in a single model/graph.

Having the model code look like a model that doesn't tie its weights is potentially confusing for anyone comparing it against the original implementation who doesn't know about this workaround, and as I understand it, writing it the way the original does works when not offloading.

cebtenzzre (Collaborator, Author) commented:

@slaren Is there any better way you think we can accomplish this, or should I close this PR and convert the MPT-based GPT4All models to save the output tensor twice in the GGUF file?

slaren (Member) commented Oct 22, 2023

The quantization issue could probably be solved by exporting the tensor as the output layer rather than as the token embeddings; then the higher-quality quantization would be chosen automatically. Ideally, we would also want to keep only one copy of the tensor in memory, but there is no escaping the fact that with partial offloading to the GPU, we would sometimes also need a copy of the tensor in VRAM. So we would need to check whether the backends of the token embeddings and the output layer are different, and then copy the tensor as required.

However, supporting this cleanly in the llama model loader code in a way that isn't just a hack for this specific case would require significant changes. I think this may become somewhat easier to do in the future with ggml-backend, since there will be a common interface for copying tensors between different backends, but as it is now, I don't think it is worth adding all this code to llama.cpp just to save a few MB in the model file. That's just my opinion, though.
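
A rough sketch of what that backend check could look like once ggml-backend is in place; the helper name and surrounding loader plumbing are assumed here, with only ggml_backend_tensor_copy being the actual interface:

```cpp
// Rough sketch, assuming both tensors are already allocated in their respective
// backend buffers; maybe_mirror_tied_weight is a hypothetical helper.
#include "ggml-backend.h"

static void maybe_mirror_tied_weight(struct ggml_tensor * tok_embd,  // typically on the CPU backend
                                     struct ggml_tensor * output) {  // may live on a GPU backend
    if (tok_embd->buffer != output->buffer) {
        // Different backends: copy the tied weights into the output tensor's buffer (e.g. VRAM).
        ggml_backend_tensor_copy(tok_embd, output);
    }
    // Same backend: the graph could reference tok_embd directly and no extra copy is needed.
}
```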

cebtenzzre closed this Oct 29, 2023
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Dec 1, 2023
It seems like upstream isn't interested in this change for the time
being [1], and we are going to break compatibility with Nomic's previous
conversion of MPT because of changes to the BPE tokenizer [2], so let's
remove this change to minimize the diff.

This reverts commit 69c505e.

[1] ggml-org#3626
[2] ggml-org#3252
cebtenzzre added a commit that referenced this pull request Feb 22, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Feb 22, 2024
Previous attempt was ggml-org#3626

(cherry picked from commit 549fe80)

Signed-off-by: Jared Van Bortel <jared@nomic.ai>