Various script cleanups/fixes + convert merges and special token handling #2842

KerfuffleV2 · 2023-08-27T18:34:41Z

Changes

Fix incorrect number of args to permute methods in convert.py
Misc fixes/cleanups for various scripts.
Fix/add type hints where possible.
Set the gguf Python package as having type info.
Handle merges and special tokens in convert.py
--vocab-only in convert.py can work without the actual model tensors being available or in the correct format (i.e. git LFS links). (But guessed params can no longer be used in vocab only mode.)
gguf.py tensor name mapping refactored. This also allows simpler tensor skipping.
falcon converter accepts --vocab-only

(Not a 100% complete list of changes.)

Status

Completed (unless someone speaks up).

ref: #2820

KerfuffleV2 · 2023-08-27T19:28:42Z

@ggerganov Anything you hate about these changes so far?

Also:

llama.cpp/gguf-py/gguf/gguf.py

Lines 165 to 169 in 9cc1129

    
           # TODO: the following helper functions should be removed 
        
           #       instead, get_tensor_name_map should return tuples of (name, MODEL_TENSOR) 
        
           #       however, my Python is very bad, and I couldn't figure out how to do this, hence these functions 
        
           # REMOVE 
        
           def should_skip_tensor_TMP(arch: MODEL_ARCH, n_blocks: int, name: str) -> bool:

I don't fully understand what this means, but if you explain a bit more I can probably do it. Do you just mean you want get_tensor_name_map to return Dict[str, Tuple[str, MODEL_TENSOR]] instead of Dict[str, str]?

Think I figured it out and this approach is an improvement. Reduces duplicated code a fair bit.

KerfuffleV2 · 2023-08-27T23:24:00Z

convert-gptneox-hf-to-gguf.py

@@ -111,7 +111,7 @@ def count_model_parts(dir_model: str) -> int:

 print("gguf: get tokenizer metadata")

-tokens: List[str] = []
+tokens: List[bytearray] = []


These changes shut up the type warnings but I'm not 100% sure they're correct. The alternative would be to leave it List[str] and then convert the token text to a string. I assume it's already UTF-8 bytes so there probably isn't a functional difference.

The types are correct this way, because when we get to tokens.append(text), 'text' is explicitly a bytearray.

The types are correct this way, because when we get to tokens.append(text), 'text' is explicitly a bytearray.

Right. I meant my change fixes the type warning but the part I wasn't sure about was whether text actually is supposed to be a bytearray there and what is supposed to get submitted to gguf to write out the tokens. The type annotation for add_token_list method is also just List and doesn't specify what the element is supposed to be.

The decision was made to accept several types as input to the token list depending on type of he tokenizer output (spm vs bpe)

llama.cpp/gguf-py/gguf/gguf.py

Lines 373 to 375 in dd0dc36

def get_type(val):

if isinstance(val, str) or isinstance(val, bytes) or isinstance(val, bytearray):

return GGUFValueType.STRING

The decision was made to accept several types as input to the token list

So we'd want

def add_token_list(self, tokens: Union[List[str], List[bytes], List[bytearray]]):

Correct? Or would you want to allow the items to be non-homogenous like List[Union[str, bytes, bytearray]]?

What is the differences of the python types str, bytes and bytearray? If they all resolve to the same when written to disk then any of them could be used.

Make it as simple it could be as long there is no difference in how the tokens are written to disk.

str is unicode text which will be encoded as utf-8 before being written to disk. bytes and bytearray are binary data. Those two are subclasses of ByteString, but we can't use that because it also allows memoryview. The least repetitive way to write this would be to use a TypeVar:

StrOrBytes = TypeVar('StrOrBytes', str, bytes, bytearray) # ... def add_token_list(self, tokens: Sequence[StrOrBytes]): # ... def add_token_merges(self, merges: Sequence[StrOrBytes]): # ...

Should merges actually support all of those?

convert.py

gguf-py/gguf/gguf.py

KerfuffleV2 · 2023-08-28T02:09:56Z

convert-falcon-hf-to-gguf.py

 scores: List[float] = []
 toktypes: List[int] = []
-merges: List[str] = []
-

 if Path(dir_model + "/tokenizer.json").is_file():


If there's no tokenizer.json we just generate a model with no vocabulary at without a warning or anything? Is that behavior deliberate?

It's the same for most of the conversion scripts. If it's not actually supposed to work that way (and I don't understand why it would) I'd change that to bail out immediately if it doesn't exist.

Ok, lets change. In general, the idea of these example convert scripts are to be as simple as possible and work mostly in the "happy scenario" where all your data is correctly placed. But some basic error handling would be useful

Btw, for the falcon script, it would be useful to have a --vocab-only option since we can add extra tokenizer tests if it is available. Can be done in separate PR though

In general, the idea of these example convert scripts are to be as simple as possible and work mostly in the "happy scenario" where all your data is correctly placed.

Is there any plan to roll the functionality into the normal convert script? I was kind of thinking the the more common stuff I can refactor the easier doing something like that would be.

Ideally convert would also get refactored into something more modular than a gigantic monolithic script. (Probably not something I'd include in this pull since it's already pretty large and complicated.)

gguf-py/gguf/gguf.py

convert.py

ggerganov

Nice work, thank you!

convert-llama-hf-to-gguf.py

convert.py

gguf-py/gguf/gguf.py

ggerganov · 2023-08-28T08:18:17Z

convert-falcon-hf-to-gguf.py

 scores: List[float] = []
 toktypes: List[int] = []
-merges: List[str] = []
-

 if Path(dir_model + "/tokenizer.json").is_file():


Ok, lets change. In general, the idea of these example convert scripts are to be as simple as possible and work mostly in the "happy scenario" where all your data is correctly placed. But some basic error handling would be useful

Btw, for the falcon script, it would be useful to have a --vocab-only option since we can add extra tokenizer tests if it is available. Can be done in separate PR though

klosax · 2023-08-28T08:24:53Z

I guess this could be used in the simpler conversion scripts (the import be moved to top?)

llama.cpp/convert.py

Lines 338 to 339 in f55538c

    
           from transformers.models.gpt2 import tokenization_gpt2 
        
           byte_encoder = tokenization_gpt2.bytes_to_unicode()

instead of

llama.cpp/convert-falcon-hf-to-gguf.py

Lines 16 to 36 in dd0dc36

    
           def bytes_to_unicode(): 
        
               # ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py 
        
               """ 
        
               Returns list of utf-8 byte and a corresponding list of unicode strings. 
        
               The reversible bpe codes work on unicode strings. 
        
               This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. 
        
               When you're at something like a 10B token dataset you end up needing around 5K for decent coverage. 
        
               This is a significant percentage of your normal, say, 32K bpe vocab. 
        
               To avoid that, we want lookup tables between utf-8 bytes and unicode strings. 
        
               And avoids mapping to whitespace/control characters the bpe code barfs on. 
        
               """ 
        
               bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1)) 
        
               cs = bs[:] 
        
               n = 0 
        
               for b in range(2**8): 
        
                   if b not in bs: 
        
                       bs.append(b) 
        
                       cs.append(2**8+n) 
        
                       n += 1 
        
               cs = [chr(n) for n in cs] 
        
               return dict(zip(bs, cs))

klosax · 2023-08-28T09:26:10Z

tokens are used by both spm and bpe tokenizers

toktypes are used only by spm tokenizer
scores are used only by spm tokenizer

merges are used only by bpe tokenizer

We should have one class for extracting the spm parts and another for the bpe parts?

The special token mapping (bos eos..) are used by both spm and bpe.
The class SpecialVocab should not include the bpe merges only the special token mapping?

KerfuffleV2 · 2023-08-28T14:02:49Z

The class SpecialVocab should not include the bpe merges only the special token mapping?

How about passing it a flag or something to tell it to leave that part out? Maybe it should be renamed to something more general than SpecialVocab also (any ideas?)

I was kind of thinking of making it a general class for the non-main vocab stuff. I also wanted to make it so you could access the loaded JSON stuff like tokenizer_config.json because there are some places where that gets loaded twice.

KerfuffleV2 · 2023-08-28T14:58:30Z

I've implemented some of the suggested changes. I was pretty aggressive about resolving conversations just to try to keep track of the stuff I dealt with. If anyone doesn't like my approach to implementing the suggestions, please don't hesitate to re-open those resolved conversations.

klosax · 2023-08-28T17:35:44Z

I was kind of thinking of making it a general class for the non-main vocab stuff.

The merges are as important for the function of the bpe tokenizer as the scores are for the spm tokenizer. About 33% higher perplexity without them in both tokenizers.

IMO the script should determinate the tokenizer type and call the toeknizer specific class that will extract only what is needed for that tokenizer to function properly. The special token mapping could be separate.

gguf-py/gguf/gguf.py

cebtenzzre · 2023-08-28T18:12:36Z

With --check-untyped-defs, mypy doesn't like that lazy_rebuild_tensor_v2 and rebuild_from_type_v2 are not marked as static methods. We should revert the change made by #1327 and use lazy_rebuild_tensor_v2.__func__ and rebuild_from_type_v2.__func__ in CLASSES to make sure we aren't using unbound methods.

cebtenzzre · 2023-08-28T18:15:32Z

There are still some missing type parameters according to mypy --disallow-any-generics:

convert.py:450: error: Missing type parameters for generic type "ndarray"  [type-arg]
    def bf16_to_fp32(bf16_arr: np.ndarray) -> NDArray:
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
convert.py:755: error: Missing type parameters for generic type "Callable"  [type-arg]
    ...n], Out], iterable: Iterable[In], concurrency: int, max_workers: Optional[int] = None, factory: Callable = ThreadPool...
convert-lora-to-ggml.py:63: error: Missing type parameters for generic type "dtype"  [type-arg]
        self, name: str, shape: Sequence[int], data_type: np.dtype
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

klosax · 2023-08-28T18:29:10Z

Aquila model uses the LLaMA architecture and the BPE tokenizer. It should be a good test target for BPE in the convert.pyscript:
https://huggingface.co/BAAI/Aquila-7B

KerfuffleV2 · 2023-08-28T20:41:11Z

@cebtenzzre

We should revert the change made by #1327 and use lazy_rebuild_tensor_v2.__func__ and rebuild_from_type_v2.__func__ in CLASSES to make sure we aren't using unbound methods.

edit2: Using gettattr seems okay as a workaround.

        ('torch._tensor', '_rebuild_from_type_v2'): rebuild_from_type_v2.__func__,
        ('torch._utils', '_rebuild_tensor_v2'): lazy_rebuild_tensor_v2.__func__,

This runs but mypy hates it:

convert.py:678: error: "Callable[[Any, Any, Any, Any], Any]" has no attribute "__func__"  [attr-defined]
convert.py:679: error: "Callable[[Any, Any, Any, Any, Any, Any, Any], LazyTensor]" has no attribute "__func__"  [attr-defined]

(I didn't forget to uncomment the decorators.)

edit: python/mypy#11211 and python/mypy#14123

Open since 2021, so I wouldn't hold my breath. What do you suggest as a workaround?

edit:

convert.py:755: error: Missing type parameters for generic type "Callable"  [type-arg]

I'm not really sure what this one should be. I guess the simplest way to deal with it is just make it a boolean for using threadpool executor or not instead of being able to pass in the executor.

cebtenzzre · 2023-08-28T22:04:47Z

I'm not really sure what this one should be.

It would be nice if Type[Executor] worked here, but the base class doesn't support max_workers. Strictly speaking, it's a Type[Union[ThreadPoolExecutor, ProcessPoolExecutor]], but that's a bit verbose.

I agree that using a boolean would be the simplest solution.

gguf-py/gguf/gguf.py

KerfuffleV2 · 2023-08-29T18:07:52Z

ce00528 to fix #2865 (comment)

klosax · 2023-08-29T19:04:17Z

Some naming issues for future model support: All references to BPE should be changed to gpt2 since both the SentencePiece tokenizer and the GPT2 tokenizer is in fact BPE aka Byte-Pair-Encoder tokenizers.The tokenizer the original LLaMA model uses is a special version of the original SentencePiece tokenizer, so it should be referenced as llama or llama-spm The Replit models uses another version of the SentencePiece tokenizer and when support for the replit model is added it should be referenced as replit or replit-spm.

cebtenzzre · 2023-08-29T19:40:40Z

mypy with --disallow-any-generics and --check-untyped-defs still has one problem with gguf.py:

I'm not sure what's up with that, the first argument seems to match SupportsRead[bytes] and the second argument seems to match SupportsWrite[bytes]. You can tack # type: ignore[misc] onto that line and I'll look into filing a mypy issue.
edit: mypy seems to be happy if you replace fout: BinaryIO with fout: BufferedWriter (io.BufferedWriter).
This issue is currently tracked at python/mypy#15031

It looks like mypy wants two more type annotations:

convert-llama-ggmlv3-to-gguf.py:90: error: Incompatible types in assignment (expression has type "tuple[Any, ...]", variable
has type "tuple[()]")  [assignment]
            self.dims = struct.unpack(f'<{n_dims}I', data[offset:offset + (4 * n_dims)])
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
convert-llama-ggmlv3-to-gguf.py:122: error: Need type annotation for "tensors" (hint: "tensors: List[<type>] = ...") 
[var-annotated]
            tensors = []
            ^~~~~~~

KerfuffleV2 · 2023-08-29T19:53:45Z

@klosax

Some naming issues for future model support: All references to BPE should be changed to gpt2 since both the SentencePiece tokenizer and the GPT2 tokenizer is in fact BPE aka Byte-Pair-Encoder tokenizers.

I think I'll leave that to a different pull unless it's unanimous that this should happen and clear what to do. Renaming BpeVocab to Gpt2Vocab (I assume) and changing the commandline argument to gpt2 wouldn't be an issue. I don't know enough about this to make the decision myself though.

(I also kind of just want to get this merged since it touches so much stuff. Pretty much any change to other scripts is going to create conflicts that I have to mess around with.)

@cebtenzzre

edit: mypy seems to be happy if you replace fout: BinaryIO with fout: BufferedWriter (io.BufferedWriter).

Are you referring to the shutil.copyfileobj thing? If so, the issue I saw was referring to argument 1, not argument 2 (fout)

convert-llama-ggmlv3-to-gguf.py:90: error: Incompatible types in assignment

convert-llama-ggmlv3-to-gguf.py just doesn't have type annotations at all. I saw those warnings but I don't think it really makes sense to add just one or two annotations in a script that has none. (Also might be opening a can of worms by doing that.)

If I get a chance, I'll look at that in a different pull.

cebtenzzre · 2023-08-29T19:55:22Z

Are you referring to the shutil.copyfileobj thing? If so, the issue I saw was referring to argument 1, not argument 2 (fout)

It's talking about type argument one - the single TypeVar that it's trying to resolve between the two file arguments. For SupportsRead[AnyStr] and SupportsWrite[AnyStr], it needs to resolve AnyStr to bytes.

KerfuffleV2 · 2023-08-29T20:02:03Z

It's talking about type argument one - the single TypeVar that it's trying to resolve between the two file arguments.

This is why they call you Massive Brain Cebtenzzre.

akawrykow · 2023-08-30T16:23:00Z

@KerfuffleV2 I'm wondering how we properly stage these changes vs rolling out a new gguf pip package release?

As it is right now, e.g the convert-falcon-hf-to-gguf.py expectedly fails with:

Traceback (most recent call last):
  File ".\convert-falcon-hf-to-gguf.py", line 169, in <module>
    special_vocab = gguf.SpecialVocab(dir_model, load_merges = True)
AttributeError: module 'gguf' has no attribute 'SpecialVocab'

(because the conversion script doesn't use gguf from this repo, rather it imports the package)

Can we publish that update ASAP? I saw @monatis added a workflow for this in #2896

cebtenzzre · 2023-08-30T17:18:45Z

because the conversion script doesn't use gguf from this repo

FWIW, there is nothing stopping you from cd'ing to gguf-py and running pip install -e .. That would use the gguf from this repo, and -e will make sure it automatically updates when you git pull.

monatis · 2023-08-30T17:34:21Z

@akawrykow I already made a manual release, and you can upgrade your gguf package with pip install --upgrade gguf. But you are advised to install it in editable mode as @cebtenzzre pointed out. With that said, any future change should be released with the workflow I described in gguf-py/README.md from now on.

akawrykow · 2023-08-30T17:49:43Z

@KerfuffleV2 @monatis after upgrading, I get:

Traceback (most recent call last):
  File ".\convert-falcon-hf-to-gguf.py", line 4, in <module>
    import gguf
  File "C:\Python38\lib\site-packages\gguf\__init__.py", line 1, in <module>
    from .gguf import *
  File "C:\Python38\lib\site-packages\gguf\gguf.py", line 425, in <module>
    class GGUFWriter:
  File "C:\Python38\lib\site-packages\gguf\gguf.py", line 435, in GGUFWriter
    temp_file: Optional[tempfile.SpooledTemporaryFile[bytes]] = None
TypeError: 'type' object is not subscriptable

any ideas?

akawrykow · 2023-08-30T17:52:20Z

Seems like my python version is outdated (currently on 3.8) and I guess this requires 3.9 and up?

edit: working now. Thanks

cebtenzzre · 2023-08-30T19:05:51Z

Python 3.8 isn't EOL yet. Should we add from __future__ import annotations to files with annotations so python doesn't try to interpret them at runtime?

monatis · 2023-08-30T19:19:05Z

Does this fix the error?

    temp_file: Optional[tempfile.SpooledTemporaryFile] = None

monatis · 2023-08-30T19:20:54Z

We should support python 3.8 definitely. I'll check type hints tomorrow and make sure that it works with 3.8.

cebtenzzre · 2023-08-30T19:21:52Z

Does this fix the error?

mypy won't like that because SpooledTemporaryFile is generic in the type stubs. As an alternative to the __future__ import we could do this:

temp_file: 'Optional[tempfile.SpooledTemporaryFile[bytes]]' = None

edit: another problem is that typing.TypeAlias does not exist in Python 3.8. We should make the import conditional on TYPE_CHECKING.

KerfuffleV2 · 2023-08-31T08:33:09Z

I'm wondering how we properly stage these changes vs rolling out a new gguf pip package release?

The current behavior is pretty unintuitive in my behavior. I really feel like if I'm running these scripts from the repo directory it should use the gguf module that's from the repo without having to mess around with virtual environments and stuff. Probably a lot of people also don't know about the pip install -e trick Cebtenzzre mentioned (I didn't).

What if we did something like

import sys, os
if os.environ.get('NO_LOCAL_GGUF') is None and Path('gguf-py').is_dir():
    sys.path.insert(1, str(Path('gguf-py') / 'gguf'))
import gguf

in the scripts that use gguf? That way it'll find the local one if running from the repo root and can also be disabled if necessary.

edit: #2927

cebtenzzre

Just some things I noticed while doing some stuff with GGUF today.

cebtenzzre · 2023-09-25T21:54:25Z

gguf-py/gguf/gguf.py

@@ -108,7 +112,7 @@ class MODEL_TENSOR(IntEnum):
    MODEL_ARCH.MPT:     "mpt",
 }

-MODEL_TENSOR_NAMES = {
+MODEL_TENSOR_NAMES: Dict[MODEL_ARCH, Dict[MODEL_TENSOR, str]] = {


Using a dict of dicts is confusing here because the values for a given key are always the same - this is guaranteed by the spec IIUC. Maybe a dict of lists, with a global dict for key->name lookup, would make more sense?

Sorry, I meant to get back to you on these but it slipped my mind. Also, since I didn't design this stuff I'm not completely sure what to say.

For this one, I agree it's weird. It seems like the only thing you could do with this compared to a non-architecture specific list is check if a type of tensor is in the model. I'm not sure when that would ever be useful.

I only added the type annotation with this pull, and generally I was mostly focused on cleaning up some obvious stuff and adding annotations rather than looking at the design/logic of it.

Fixing it would be a breaking change though.

cebtenzzre · 2023-09-25T21:56:25Z

gguf-py/gguf/gguf.py

-    # Attention and feed-forward blocks
-    for i in range(0, n_blocks):
+class TensorNameMap:
+    mappings_cfg: Dict[MODEL_TENSOR, Tuple[str, ...]] = {


Since we go through the trouble of listing out the tensors used per arch, why do we ignore the arch in this table?

My impression is it's just to be able to throw in any tensor name and get the GGUF version out. That's why I argued for #3095 as well. This behavior I can actually see a use for, since general tools could use it without having to actually worry about the model type.

cebtenzzre · 2023-09-25T21:56:59Z

gguf-py/gguf/gguf.py

+    special_token_types: Tuple[str, ...] = tuple(('bos', 'eos', 'unk', 'sep', 'pad'))
+    special_token_ids: Dict[str, int] = {}
+
+    def __init__(self, path: Path, load_merges: bool = False, special_token_types: Optional[Tuple[str, ...]] = None):


Requiring a pathlib.Path in the API is not user-friendly. Generalizing it to os.PathLike[str] would be better.

Makes sense to me.

cebtenzzre · 2023-09-28T23:55:44Z

@KerfuffleV2 What do you think about the comments I've made above?

KerfuffleV2 mentioned this pull request Aug 27, 2023

convert.py : handle special tokens #2820

Closed

KerfuffleV2 added the script Script related label Aug 27, 2023

KerfuffleV2 changed the title ~~Feat scripts improvements~~ Various script cleanups/fixes + convert merge and special token handling Aug 27, 2023

KerfuffleV2 changed the title ~~Various script cleanups/fixes + convert merge and special token handling~~ Various script cleanups/fixes + convert merges and special token handling Aug 27, 2023

KerfuffleV2 marked this pull request as ready for review August 27, 2023 23:22

KerfuffleV2 commented Aug 27, 2023

View reviewed changes

convert.py Show resolved Hide resolved

KerfuffleV2 commented Aug 27, 2023

View reviewed changes

convert.py Outdated Show resolved Hide resolved

KerfuffleV2 commented Aug 27, 2023

View reviewed changes

gguf-py/gguf/gguf.py Outdated Show resolved Hide resolved

KerfuffleV2 commented Aug 28, 2023

View reviewed changes