
Merge exllama backend into united. #447

Merged (20 commits) on Sep 10, 2023
Conversation

@pi6am commented Aug 30, 2023

Add a new inference model backend based on exllama.
Most of the work on this backend was done by Occam. My main
contributions were discovering and working around a bug in
torch.multinomial, hooking up stoppers, configuring bad_words_ids,
and fixing some other minor bugs.

pi6am and others added 20 commits May 3, 2023 22:04
The end-of-sequence (</s>) token indicates the end of a generation.
When a token sequence containing </s> is decoded, an extra (wrong)
space is inserted at the beginning of the generation. To avoid this,
strip the eos token out of the result before returning it.
The eos token was getting stripped later, so this doesn't change
the output except to avoid the spurious leading space.
Strip the eos token from exllama generations.
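For illustration, a minimal sketch of the idea (not the PR's actual
code), assuming a Hugging Face-style tokenizer with an eos_token_id
attribute; decode_without_eos is a hypothetical helper:

```python
import torch

def decode_without_eos(tokenizer, token_ids: torch.Tensor) -> str:
    # Drop every eos token id before decoding so the decoder does
    # not insert a spurious leading space around </s>. The eos
    # token was stripped later anyway, so the visible output is
    # unchanged apart from the missing bogus space.
    kept = [t for t in token_ids.tolist() if t != tokenizer.eos_token_id]
    return tokenizer.decode(kept)
```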
Add stopper hooks support to exllama
There is a bug in PyTorch 2.0.1 that allows torch.multinomial to
sometimes choose elements that have zero probability. Since
this is uncommon, we can continue to use torch.multinomial as
long as we verify that the results are valid. If they aren't,
try again until the probability of each selected token is positive.
Resample to work around a bug in torch.multinomial
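A minimal sketch of the resampling workaround described above
(the PR's real code may structure this differently; safe_multinomial
is a hypothetical name):

```python
import torch

def safe_multinomial(probs: torch.Tensor, num_samples: int = 1) -> torch.Tensor:
    # torch.multinomial in PyTorch 2.0.1 can occasionally return
    # indices whose probability is exactly zero. Because this is
    # rare, just resample until every selected token has a
    # strictly positive probability.
    while True:
        samples = torch.multinomial(probs, num_samples)
        if (probs.gather(-1, samples) > 0).all():
            return samples
```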
The bos token was already hardcoded as a bad word id.
Store badwords in a list and iterate over them during generation.
Add the Llama eos token to the list of bad words.
Also support "single line mode", which adds newline (13) to badwords.
Add the eos token to exllama bad words.
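One common way to enforce a bad-words list during sampling is to
mask the banned logits each step. A sketch under that assumption
(apply_bad_words is a hypothetical helper, not the PR's code):

```python
import torch

def apply_bad_words(logits: torch.Tensor, bad_words_ids: list[int]) -> torch.Tensor:
    # Force each banned token's logit to -inf so it can never be
    # sampled; the backend iterates over the stored badwords list
    # on every generation step.
    for token_id in bad_words_ids:
        logits[..., token_id] = float("-inf")
    return logits
```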
Read config.json and enable exllama loading if the model has a
`quantization_config` with `quant_method` of `gptq`. Note that this
implementation is limited and only supports model.safetensors.
That said, this supports loading popular gptq quantized models
without renaming or symlinking the model file.
Modify exllama to load unrenamed gptq quantized models
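The detection logic amounts to reading the model's config.json and
checking the quantization block. A minimal sketch, assuming the
standard Hugging Face config layout (is_gptq_model is a hypothetical
name):

```python
import json
import os

def is_gptq_model(model_dir: str) -> bool:
    # Enable the exllama loader when config.json declares a gptq
    # quantization_config. This implementation is limited: only a
    # model.safetensors file is supported, but popular gptq models
    # load without renaming or symlinking.
    with open(os.path.join(model_dir, "config.json")) as f:
        config = json.load(f)
    quant = config.get("quantization_config", {})
    return quant.get("quant_method") == "gptq"
```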
Merge henk717/united into exllama
Merge branch henk717/united into exllama
Use the value of the use_default_badwordids setting to configure
bad_words_ids, and add square brackets to bad_words_ids when the
setting is True. Fix an issue with attempting to use the
tokenizer too early, and fix an exception when populating Lua
bridge data with zero generated tokens, which can now happen if
use_default_badwordids is False and the first token generated
is EOS.
Hook up use_default_badwordids in exllama
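Putting the pieces together, a hedged sketch of how the badwords
list might be assembled (build_bad_words_ids is a hypothetical
helper; encoding "[]" for the square-bracket ids is an assumption,
as is the use of the tokenizer's bos/eos attributes):

```python
def build_bad_words_ids(tokenizer, use_default_badwordids: bool,
                        single_line_mode: bool = False) -> list[int]:
    # With the setting enabled, ban bos (already hardcoded before
    # this PR), eos, and square brackets. With it disabled,
    # generation may legitimately emit EOS as the very first
    # token, producing zero generated tokens.
    bad_words: list[int] = []
    if use_default_badwordids:
        bad_words.append(tokenizer.bos_token_id)
        bad_words.append(tokenizer.eos_token_id)
        bad_words.extend(tokenizer.encode("[]", add_special_tokens=False))
    if single_line_mode:
        bad_words.append(13)  # newline token id in the Llama tokenizer
    return bad_words
```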
@henk717 merged commit 036db07 into henk717:united on Sep 10, 2023