
MPT : clone wte to output at load time #3626

Closed

Conversation

cebtenzzre (Collaborator) commented Oct 14, 2023

This change removes the need to duplicate the token_embd.weight tensor on disk. Instead, it is cloned as output.weight at load time.

The tensor is not cloned while quantizing because quantize does not call llm_load_tensors.
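
For illustration, here is a minimal sketch of what cloning the tied weight at load time could look like with the ggml API. This is not the actual code from the PR: the helper name `clone_as_output` is hypothetical, and it assumes both tensors live in an allocating CPU-side ggml context.

```cpp
// Minimal sketch, assuming an allocating ggml context on the CPU backend.
// clone_as_output is a hypothetical helper; the real change lives in llm_load_tensors.
#include "ggml.h"
#include <string.h>

static struct ggml_tensor * clone_as_output(struct ggml_context * ctx,
                                            struct ggml_tensor * tok_embd) {
    // Allocate a tensor with the same type and shape as token_embd.weight...
    struct ggml_tensor * output = ggml_dup_tensor(ctx, tok_embd);
    // ...and copy the embedding weights into it, so the rest of the code can
    // treat it as a distinct output.weight without a second copy on disk.
    memcpy(output->data, tok_embd->data, ggml_nbytes(tok_embd));
    return output;
}
```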

slaren (Member) commented Oct 14, 2023

I don't think that it is a good idea to add what is essentially a hack to gguf to deal with a llama.cpp loader shortcoming. If the tensor really needs to be duplicated (and to be fair, llama.cpp may need to in some cases for partial offloading with CUDA), then that should be handled in the application. The special case for MPT in llm_load_tensors is not great either; it will make it harder to refactor this code in the future.

There may also be an unintended side effect of this, which is that it will prevent using different quantizations for the token embeddings and the output layer. From what I understand, the quantization of the output tensor has a large impact on generation quality, while the token embeddings have very little impact. For that reason, the output tensor is usually quantized at a higher bit rate.
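
As a concrete illustration of that point, here is a simplified, hypothetical sketch of the kind of per-tensor override applied during quantization (the real logic in llama.cpp's quantization code is more involved):

```cpp
// Simplified, hypothetical sketch of choosing a higher-bit quantization for the
// output projection than for the rest of the model.
#include "ggml.h"
#include <string>

static enum ggml_type pick_quant_type(const std::string & name, enum ggml_type default_type) {
    if (name == "output.weight") {
        // The output tensor has a large impact on generation quality,
        // so it is kept at a higher bit rate.
        return GGML_TYPE_Q6_K;
    }
    // token_embd.weight and the remaining tensors use the requested default type.
    return default_type;
}
```

With the two tensors tied to a single copy, a per-name choice like this can no longer distinguish the embeddings from the output layer.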

apage43 (Contributor) commented Oct 14, 2023

> different quantizations for the token embeddings and the output layer

For models where they're the same data, it seems fine to just use the highest precision you'd choose for either; that is, look up the token embeddings in higher precision for models that do this.

Mostly I find file-level hacks more regrettable than code-level ones. For example, before MQA/GQA were well supported, some implementations of models that used them would duplicate the data that would be broadcast in the file, which not only bloated the file unnecessarily but also meant the file format was going to change again once those limitations were addressed.

> If the tensor really needs to be duplicated (and to be fair, llama.cpp may need to in some cases for partial offloading with CUDA)

Maybe this is where it should actually be solved? This is not the only conceivable case where the same tensor may be used more than once in a single model/graph.

Having the model code look like a model that doesn't tie its weights is potentially confusing for anyone comparing it against the original implementation who doesn't know about this workaround, and as I understand it, writing it the way the original does works when not offloading.

cebtenzzre (Collaborator, Author) commented:

@slaren Is there any better way you think we can accomplish this, or should I close this PR and convert the MPT-based GPT4All models to save the output tensor twice in the GGUF file?

slaren (Member) commented Oct 22, 2023

The quantization issue could probably be solved by exporting the tensor as the output layer rather than as the token embeddings; then the higher-quality quantization would be chosen automatically. Ideally, we would also want to keep only one copy of the tensor in memory, but there is no escaping the fact that with partial offloading to the GPU, we would sometimes also need a copy of the tensor in VRAM. So we would need to check whether the backends of the token embeddings and the output layer are different, and then copy the tensor as required.

However, supporting this cleanly in the llama model loader code in a way that isn't just a hack for this specific case would require significant changes. I think this may become somewhat easier to do in the future with ggml-backend, since there will be a common interface for copying tensors between different backends, but as it is now, I don't think it is worth adding all this code to llama.cpp just to save a few MB in the model file. That's just my opinion, though.
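
A rough sketch of what that backend check could look like once ggml-backend is in place; the helper name and surrounding loader plumbing are assumed here, with only ggml_backend_tensor_copy being the actual interface:

```cpp
// Rough sketch, assuming both tensors are already allocated in their respective
// backend buffers; maybe_mirror_tied_weight is a hypothetical helper.
#include "ggml-backend.h"

static void maybe_mirror_tied_weight(struct ggml_tensor * tok_embd,  // typically on the CPU backend
                                     struct ggml_tensor * output) {  // may live on a GPU backend
    if (tok_embd->buffer != output->buffer) {
        // Different backends: copy the tied weights into the output tensor's buffer (e.g. VRAM).
        ggml_backend_tensor_copy(tok_embd, output);
    }
    // Same backend: the graph could reference tok_embd directly and no extra copy is needed.
}
```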

cebtenzzre closed this Oct 29, 2023
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Dec 1, 2023
It seems like upstream isn't interested in this change for the time
being [1], and we are going to break compatibility with Nomic's previous
conversion of MPT because of changes to the BPE tokenizer [2], so let's
remove this change to minimize the diff.

This reverts commit 69c505e.

[1] ggml-org#3626
[2] ggml-org#3252
cebtenzzre added a commit that referenced this pull request Feb 22, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Feb 22, 2024
Previous attempt was ggml-org#3626

(cherry picked from commit 549fe80)

Signed-off-by: Jared Van Bortel <jared@nomic.ai>