feat: Introduce new GGUFValueType.OBJ virtual type🌠 #5143
base: master
I'll need a bit of time to consider this change - I like the implementation, but I'm not yet convinced it is necessary. The way I'm thinking is that the Python writer can directly serialize complex dictionaries into array of KVs. In your example we would write straight up The composed keys (e.g. With the proposed approach here, I imagine that we would have to iteratively parse the OBJ KVs into
In your example:
{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "WhitespaceSplit"},
      {"type": "Metaspace","replacement": "▁", ...}
    ]
  }
}
How do you imagine the C++ code would function to query the Without the
Actually, at first, I just wanted to implement flat structure objects in Python using existing types, without introducing new types in CPP. However, I found that for two reasons, I had to add the
Great question! This is what I'm considering: how to add
Thank you for your question! I hope this explanation helps clarify the need for adding an OBJ type in GGUF arrays and how it can be implemented effectively with backward compatibility considerations. Please let me know if there's anything else I can help with. |
Uh, I don't know. Curious if other devs have opinion on this functionality.
I find this extremely complicated. Overall, I have a strong hesitation of supporting all these tokenizer options, templates, configs and what not in Let me think about this for a while, but right now I'd prefer if we just picked 1 or 2 items from the tokenizer options that are more important and useful and just support those with the existing GGUF types (like a boolean for whitespace split, etc.). |
To me this seems like too much effort to shoehorn a solution into the current implementation. We could just include the entire tokenizer JSON file as a string. We are not going to bundle a JSON parser in llama.cpp, but I think it is safe to assume that any application that wants to support templates has a JSON parser as well.
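A minimal sketch of the string-based alternative suggested above, assuming a plain Python dict stands in for the GGUF metadata store and `example.tokenizer_json` is a hypothetical key name (neither is part of gguf-py). The JSON is a trimmed copy of the `pre_tokenizer` example from earlier in the thread, with the elided fields left out:

```python
import json

# Trimmed copy of the pre_tokenizer example; elided fields are omitted.
tokenizer_json = {
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [
            {"type": "WhitespaceSplit"},
            {"type": "Metaspace", "replacement": "▁"},
        ],
    }
}

# Hypothetical stand-in for the metadata KV store (not the gguf-py API).
metadata: dict[str, str] = {}

# Writer side: the whole config becomes one string-typed value.
metadata["example.tokenizer_json"] = json.dumps(tokenizer_json, ensure_ascii=False)

# Reader side: any application with a JSON parser can recover the structure.
loaded = json.loads(metadata["example.tokenizer_json"])
pretokenizer_types = [p["type"] for p in loaded["pre_tokenizer"]["pretokenizers"]]
print(pretokenizer_types)
```

The trade-off is that the file format itself knows nothing about the structure: consumers without a JSON parser only see an opaque string.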
Therefore, the easiest option is to use the ready-made tokenizer-cpp library. Of course, it should not be difficult to implement it step by step in C++; after all, it has already been implemented in JS, including a simple Jinja template engine. The advantage of embedding the tokenizer, template, and other configurations is that they can be made available to JS, Python, and other consumers.
This PR now fully supports the JSON format, including mixed-type arrays and nested arrays. See the unit tests and the updated documentation.
llama_model_loader: loaded meta data with 29 key-value pairs and 3 tensors from tests/test_writer.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = "llama"
llama_model_loader: - kv 1: llama.block_count u32 = 12
llama_model_loader: - kv 2: answer u32 = 42
llama_model_loader: - kv 3: answer_in_float f32 = 42.000000
llama_model_loader: - kv 4: uint8 u8 = 1
llama_model_loader: - kv 5: nint8 i8 = 1
llama_model_loader: - kv 6: dict1 obj[str,3] = {"key1":2, "key2":"hi", "obj":{"k":1}}
llama_model_loader: - kv 11: oArray arr[obj,2] = [{"k":4, "o":{"o1":6}}, {"k":9}]
llama_model_loader: - kv 18: cArray arr[obj,3] = [3, "hi", [1, 2]]
llama_model_loader: - kv 22: arrayInArray arr[arr,2] = [[2, 3, 4], [5, 7, 8]]
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id str = "bos"
llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 27: tokenizer_config obj[str,2] = {"bos_token":"bos", "add_bos_token":t...
llama_model_loader: - kv 28: general.alignment u32 = 64
llama_model_loader: Dumping metadata keys/values Done. |
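The log prints `dict1` as a single JSON value, while the kv index jumps from 6 to 11, presumably because the subkeys are stored as separate flat entries. A rough sketch of how a reader could rebuild the nested value, assuming the flat key names follow the "." subkey convention of this PR. The `(kind, value)` tuples and the `unflatten_obj` helper are illustrative and not part of gguf-py; arrays and "/" references are not handled:

```python
import json
from typing import Any

def unflatten_obj(key: str, flat: dict[str, tuple[str, Any]]) -> Any:
    """Rebuild a nested dict from flat entries of the form {key: (kind, value)}."""
    def build(flat_key: str, path: str) -> Any:
        kind, value = flat[flat_key]
        if kind != "obj":
            return value
        # An OBJ entry lists its subkey names; each subkey lives at ".<path>.<name>".
        return {name: build(f".{path}.{name}", f"{path}.{name}") for name in value}
    return build(key, key)

# Assumed flat layout for the dict1 entry shown in the log.
flat = {
    "dict1": ("obj", ["key1", "key2", "obj"]),
    ".dict1.key1": ("value", 2),
    ".dict1.key2": ("value", "hi"),
    ".dict1.obj": ("obj", ["k"]),
    ".dict1.obj.k": ("value", 1),
}

print(json.dumps(unflatten_obj("dict1", flat)))
# prints {"key1": 2, "key2": "hi", "obj": {"k": 1}}
```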
The content of the OBJ type is actually a list of all key names of the object.

* Python
  * `gguf_writer.py`:
    * Added `def add_kv(self, key: str, val: Any) -> None`: Automatically determines the appropriate value type based on `val`.
    * Added `def add_dict(self, key: str, val: dict, excludes: Sequence[str] = []) -> None`: Adds object (dict) values. It will recursively add all subkeys.
    * Added `add_array_ex` to support nested and mixed-type arrays.
  * `constants.py`:
    * Added `GGUFValueType.get_type_ex(val)`: Added support for numpy's integers and floating-point numbers, selecting the number of digits according to the size of the integer.
  * `gguf_reader.py`:
    * Added functionality to retrieve values from specific fields using the `ReaderField.get()` method.
  * Unit test added
* CPP
  * `ggml`:
    * Added `GGUF_TYPE_OBJ` to the `gguf_type` enum type.
    * Use `gguf_get_arr_n` and `gguf_get_arr_str` to get the subkey names of `GGUF_TYPE_OBJ`.
    * Added `gguf_set_obj_str` function to set object subkey names
    * Added `gguf_set_arr_obj` function to set object array count
    * Added `gguf_set_arr_arr` function to set nested array count
  * `llama`:
    * Modified `gguf_kv_to_str`
    * Added `LLAMA_API char * gguf_kv_to_c_str` function to get the c_str value as JSON format.
      * Maybe this API should be moved into `ggml` as `gguf_get_val_json`. (The problem is that `ggml.c` is written in C, while this code makes heavy use of C++ features.)
    * Added basic support to `GGUF_TYPE_OBJ` and nested array
  * Unit test added

feat: add basic support to GGUF_TYPE_OBJ on cpp
feat(gguf.py): add OBJ and mixed-type array supports to GGUF ARRAY
feat: add OBJ and mixed-type array supports to GGUF ARRAY(CPP)
feat: add nested array supported
feat:
* Subkey name convention in OBJ types:
  * If the first letter of the subkey name is "/", it means referencing the full name of other keys.
  * If there is a ":" colon delimiter, it means that the string after the colon represents the subkey name in this object, otherwise the referencing subkey name is used.
feat: add LLAMA_API gguf_kv_to_c_str to llama.h
test: write test gguf file to tests folder directly(py)
test: add test-gguf-meta.cpp
feat: Key convention: "." indicates that the key is a subkey, not an independent key.
feat: add excludes argument to add_dict(gguf_write.py)
feat: add_array_ex to supports nested and mix-typed array, and keep the add_array to the same
fix(constant.py): rollback the get_type function and add the new get_type_ex
test: add test compatibility
fix: use GGML_MALLOC instead of malloc
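One possible reading of the subkey name convention described in this commit message, sketched as a small resolver. `parse_subkey` is a hypothetical helper, not part of the PR. The choice of the last dotted segment as the local name when no ":" alias is given is an assumption, as are the example subkey strings:

```python
def parse_subkey(name: str, parent_path: str) -> tuple[str, str]:
    """Return (local_name, full_key) for one subkey name stored in an OBJ value."""
    if name.startswith("/"):
        # "/" means the subkey references another key by its full name.
        target, sep, alias = name[1:].partition(":")
        # Assumption: without an alias, the last dotted segment is the local name.
        local = alias if sep else target.rsplit(".", 1)[-1]
        return local, target
    # Ordinary subkey: stored under the "."-prefixed full path.
    return name, f".{parent_path}.{name}"

print(parse_subkey("/tokenizer.ggml.bos_token_id:bos_token", "tokenizer_config"))
print(parse_subkey("/tokenizer.ggml.add_bos_token", "tokenizer_config"))
print(parse_subkey("k", "obj.subKey2"))
```

Under this reading, an object such as `tokenizer_config` can expose existing top-level keys under its own names without storing the values twice.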
The content of the `OBJ` type is actually a list of all key names of the object, designed to keep compatibility using the simplest flat structure. Here's an example demonstrating its usage:

e.g.,
* Key `obj` (type `OBJ`): Object Value is `["subKey1", "subKey2"]`
* Key `.obj.subKey1`: Simple Value is `1`
* Key `.obj.subKey2`: Object Value is `["k"]`
* Key `.obj.subKey2.k`: Simple Value is `2`
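The flat layout listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the PR's `add_dict` implementation. The `flatten_obj` helper, the `(kind, value)` tuples, and the input dict (inferred from the listing, with `k` placed under `subKey2`, the object that owns it) are assumptions, and arrays are not handled:

```python
from typing import Any

def flatten_obj(key: str, obj: dict[str, Any]) -> dict[str, tuple[str, Any]]:
    """Flatten a nested dict into {flat_key: (kind, value)} entries."""
    flat: dict[str, tuple[str, Any]] = {}

    def visit(flat_key: str, path: str, value: Any) -> None:
        if isinstance(value, dict):
            # An OBJ entry stores only the names of its subkeys.
            flat[flat_key] = ("obj", list(value))
            for name, child in value.items():
                child_path = f"{path}.{name}"
                # The leading "." marks a subkey rather than an independent key.
                visit("." + child_path, child_path, child)
        else:
            flat[flat_key] = ("value", value)

    visit(key, key, obj)
    return flat

flat = flatten_obj("obj", {"subKey1": 1, "subKey2": {"k": 2}})
for flat_key, (kind, value) in flat.items():
    print(flat_key, kind, value)
```

Because every entry is an ordinary key-value pair, a reader that does not know about `OBJ` can still load the simple values and skip the rest.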
All JSON structures are now supported:
The convention is as follows:
For example,
Convert the JSON example above into a flat structure as follows:
This change includes several improvements and additions to the codebase:

* Python
  * `gguf_writer.py`:
    * Added `def add_kv(self, key: str, val: Any) -> None`: Automatically determines the appropriate value type based on `val`.
    * Added `def add_dict(self, key: str, val: dict, excludes: Sequence[str] = []) -> None`: Adds object (dict) values. It will recursively add all subkeys.
    * Added `add_array_ex` to support nested and mixed-type arrays.
  * `constants.py`:
    * Added `GGUFValueType.get_type_ex(val)`: Added support for numpy's integers and floating-point numbers, selecting the number of digits according to the size of the integer.
  * `gguf_reader.py`:
    * Added functionality to retrieve values from specific fields using the `ReaderField.get()` method.
  * Unit test added
* CPP
  * `ggml`:
    * Added `GGUF_TYPE_OBJ` to the `gguf_type` enum type.
    * Use `gguf_get_arr_n` and `gguf_get_arr_str` to get the subkey names of `GGUF_TYPE_OBJ`.
    * Added `gguf_set_obj_str` function to set object subkey names
    * Added `gguf_set_arr_obj` function to set object array count
    * Added `gguf_set_arr_arr` function to set nested array count
  * `llama`:
    * Modified `gguf_kv_to_str`
    * Added `LLAMA_API char * gguf_kv_to_c_str` function to get the c_str value as JSON format.
      * Maybe this API should be moved into `ggml` as `gguf_get_val_json`.
    * Added basic support to `GGUF_TYPE_OBJ` and nested array
  * Unit test added

Related Issues: #4868, #2872