-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Various script cleanups/fixes + convert merges and special token handling #2842
Various script cleanups/fixes + convert merges and special token handling #2842
Conversation
@ggerganov Anything you hate about these changes so far? Also: llama.cpp/gguf-py/gguf/gguf.py Lines 165 to 169 in 9cc1129
Think I figured it out and this approach is an improvement. Reduces duplicated code a fair bit. |
@@ -111,7 +111,7 @@ def count_model_parts(dir_model: str) -> int: | |||
|
|||
print("gguf: get tokenizer metadata") | |||
|
|||
tokens: List[str] = [] | |||
tokens: List[bytearray] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes shut up the type warnings but I'm not 100% sure they're correct. The alternative would be to leave it List[str]
and then convert the token text to a string. I assume it's already UTF-8 bytes so there probably isn't a functional difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The types are correct this way, because when we get to tokens.append(text)
, 'text' is explicitly a bytearray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The types are correct this way, because when we get to tokens.append(text), 'text' is explicitly a bytearray.
Right. I meant my change fixes the type warning but the part I wasn't sure about was whether text
actually is supposed to be a bytearray
there and what is supposed to get submitted to gguf
to write out the tokens. The type annotation for add_token_list
method is also just List
and doesn't specify what the element is supposed to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decision was made to accept several types as input to the token list depending on type of he tokenizer output (spm vs bpe)
llama.cpp/gguf-py/gguf/gguf.py
Lines 373 to 375 in dd0dc36
def get_type(val): | |
if isinstance(val, str) or isinstance(val, bytes) or isinstance(val, bytearray): | |
return GGUFValueType.STRING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decision was made to accept several types as input to the token list
So we'd want
def add_token_list(self, tokens: Union[List[str], List[bytes], List[bytearray]]):
Correct? Or would you want to allow the items to be non-homogenous like List[Union[str, bytes, bytearray]]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the differences of the python types str
, bytes
and bytearray
? If they all resolve to the same when written to disk then any of them could be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make it as simple it could be as long there is no difference in how the tokens are written to disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str is unicode text which will be encoded as utf-8 before being written to disk. bytes and bytearray are binary data. Those two are subclasses of ByteString, but we can't use that because it also allows memoryview. The least repetitive way to write this would be to use a TypeVar:
StrOrBytes = TypeVar('StrOrBytes', str, bytes, bytearray)
# ...
def add_token_list(self, tokens: Sequence[StrOrBytes]):
# ...
def add_token_merges(self, merges: Sequence[StrOrBytes]):
# ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should merges actually support all of those?
convert-falcon-hf-to-gguf.py
Outdated
scores: List[float] = [] | ||
toktypes: List[int] = [] | ||
merges: List[str] = [] | ||
|
||
|
||
if Path(dir_model + "/tokenizer.json").is_file(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no tokenizer.json
we just generate a model with no vocabulary at without a warning or anything? Is that behavior deliberate?
It's the same for most of the conversion scripts. If it's not actually supposed to work that way (and I don't understand why it would) I'd change that to bail out immediately if it doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, lets change. In general, the idea of these example convert scripts are to be as simple as possible and work mostly in the "happy scenario" where all your data is correctly placed. But some basic error handling would be useful
Btw, for the falcon script, it would be useful to have a --vocab-only
option since we can add extra tokenizer tests if it is available. Can be done in separate PR though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, the idea of these example convert scripts are to be as simple as possible and work mostly in the "happy scenario" where all your data is correctly placed.
Is there any plan to roll the functionality into the normal convert script? I was kind of thinking the the more common stuff I can refactor the easier doing something like that would be.
Ideally convert would also get refactored into something more modular than a gigantic monolithic script. (Probably not something I'd include in this pull since it's already pretty large and complicated.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, thank you!
convert-falcon-hf-to-gguf.py
Outdated
scores: List[float] = [] | ||
toktypes: List[int] = [] | ||
merges: List[str] = [] | ||
|
||
|
||
if Path(dir_model + "/tokenizer.json").is_file(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, lets change. In general, the idea of these example convert scripts are to be as simple as possible and work mostly in the "happy scenario" where all your data is correctly placed. But some basic error handling would be useful
Btw, for the falcon script, it would be useful to have a --vocab-only
option since we can add extra tokenizer tests if it is available. Can be done in separate PR though
I guess this could be used in the simpler conversion scripts (the import be moved to top?) Lines 338 to 339 in f55538c
instead of llama.cpp/convert-falcon-hf-to-gguf.py Lines 16 to 36 in dd0dc36
|
We should have one class for extracting the spm parts and another for the bpe parts? The special token mapping (bos eos..) are used by both spm and bpe. |
How about passing it a flag or something to tell it to leave that part out? Maybe it should be renamed to something more general than I was kind of thinking of making it a general class for the non-main vocab stuff. I also wanted to make it so you could access the loaded JSON stuff like |
I've implemented some of the suggested changes. I was pretty aggressive about resolving conversations just to try to keep track of the stuff I dealt with. If anyone doesn't like my approach to implementing the suggestions, please don't hesitate to re-open those resolved conversations. |
The merges are as important for the function of the bpe tokenizer as the scores are for the spm tokenizer. About 33% higher perplexity without them in both tokenizers. IMO the script should determinate the tokenizer type and call the toeknizer specific class that will extract only what is needed for that tokenizer to function properly. The special token mapping could be separate. |
With --check-untyped-defs, mypy doesn't like that lazy_rebuild_tensor_v2 and rebuild_from_type_v2 are not marked as static methods. We should revert the change made by #1327 and use |
There are still some missing type parameters according to mypy --disallow-any-generics:
|
Aquila model uses the LLaMA architecture and the BPE tokenizer. It should be a good test target for BPE in the |
edit2: Using ('torch._tensor', '_rebuild_from_type_v2'): rebuild_from_type_v2.__func__,
('torch._utils', '_rebuild_tensor_v2'): lazy_rebuild_tensor_v2.__func__, This runs but mypy hates it:
(I didn't forget to uncomment the decorators.) edit: python/mypy#11211 and python/mypy#14123 Open since 2021, so I wouldn't hold my breath. What do you suggest as a workaround? edit:
I'm not really sure what this one should be. I guess the simplest way to deal with it is just make it a boolean for using threadpool executor or not instead of being able to pass in the executor. |
It would be nice if I agree that using a boolean would be the simplest solution. |
ce00528 to fix #2865 (comment) |
Some naming issues for future model support: All references to BPE should be changed to |
It looks like mypy wants two more type annotations:
|
I think I'll leave that to a different pull unless it's unanimous that this should happen and clear what to do. Renaming (I also kind of just want to get this merged since it touches so much stuff. Pretty much any change to other scripts is going to create conflicts that I have to mess around with.)
Are you referring to the
If I get a chance, I'll look at that in a different pull. |
It's talking about type argument one - the single TypeVar that it's trying to resolve between the two file arguments. For |
This is why they call you Massive Brain Cebtenzzre. |
@KerfuffleV2 I'm wondering how we properly stage these changes vs rolling out a new gguf pip package release? As it is right now, e.g the
(because the conversion script doesn't use Can we publish that update ASAP? I saw @monatis added a workflow for this in #2896 |
FWIW, there is nothing stopping you from cd'ing to gguf-py and running |
@akawrykow I already made a manual release, and you can upgrade your gguf package with |
@KerfuffleV2 @monatis after upgrading, I get:
any ideas? |
Seems like my python version is outdated (currently on 3.8) and I guess this requires 3.9 and up? edit: working now. Thanks |
Python 3.8 isn't EOL yet. Should we add |
Does this fix the error? temp_file: Optional[tempfile.SpooledTemporaryFile] = None |
We should support python 3.8 definitely. I'll check type hints tomorrow and make sure that it works with 3.8. |
mypy won't like that because SpooledTemporaryFile is generic in the type stubs. As an alternative to the __future__ import we could do this: temp_file: 'Optional[tempfile.SpooledTemporaryFile[bytes]]' = None edit: another problem is that typing.TypeAlias does not exist in Python 3.8. We should make the import conditional on TYPE_CHECKING. |
The current behavior is pretty unintuitive in my behavior. I really feel like if I'm running these scripts from the repo directory it should use the What if we did something like import sys, os
if os.environ.get('NO_LOCAL_GGUF') is None and Path('gguf-py').is_dir():
sys.path.insert(1, str(Path('gguf-py') / 'gguf'))
import gguf in the scripts that use edit: #2927 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some things I noticed while doing some stuff with GGUF today.
@@ -108,7 +112,7 @@ class MODEL_TENSOR(IntEnum): | |||
MODEL_ARCH.MPT: "mpt", | |||
} | |||
|
|||
MODEL_TENSOR_NAMES = { | |||
MODEL_TENSOR_NAMES: Dict[MODEL_ARCH, Dict[MODEL_TENSOR, str]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a dict of dicts is confusing here because the values for a given key are always the same - this is guaranteed by the spec IIUC. Maybe a dict of lists, with a global dict for key->name lookup, would make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant to get back to you on these but it slipped my mind. Also, since I didn't design this stuff I'm not completely sure what to say.
For this one, I agree it's weird. It seems like the only thing you could do with this compared to a non-architecture specific list is check if a type of tensor is in the model. I'm not sure when that would ever be useful.
I only added the type annotation with this pull, and generally I was mostly focused on cleaning up some obvious stuff and adding annotations rather than looking at the design/logic of it.
Fixing it would be a breaking change though.
# Attention and feed-forward blocks | ||
for i in range(0, n_blocks): | ||
class TensorNameMap: | ||
mappings_cfg: Dict[MODEL_TENSOR, Tuple[str, ...]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we go through the trouble of listing out the tensors used per arch, why do we ignore the arch in this table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My impression is it's just to be able to throw in any tensor name and get the GGUF version out. That's why I argued for #3095 as well. This behavior I can actually see a use for, since general tools could use it without having to actually worry about the model type.
special_token_types: Tuple[str, ...] = tuple(('bos', 'eos', 'unk', 'sep', 'pad')) | ||
special_token_ids: Dict[str, int] = {} | ||
|
||
def __init__(self, path: Path, load_merges: bool = False, special_token_types: Optional[Tuple[str, ...]] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requiring a pathlib.Path in the API is not user-friendly. Generalizing it to os.PathLike[str] would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me.
@KerfuffleV2 What do you think about the comments I've made above? |
Changes
convert.py
gguf
Python package as having type info.convert.py
--vocab-only
inconvert.py
can work without the actual model tensors being available or in the correct format (i.e. git LFS links). (But guessed params can no longer be used in vocab only mode.)gguf.py
tensor name mapping refactored. This also allows simpler tensor skipping.--vocab-only
(Not a 100% complete list of changes.)
Status
Completed (unless someone speaks up).
ref: #2820