feat: Introduce new GGUFValueType.OBJ virtual type🌠 #5143

Open

snowyu wants to merge 1 commit into master from feat/obj_virtual_type
Conversation

@snowyu (Contributor) commented Jan 26, 2024

The content of the OBJ type is actually a list of all the object's key names, designed to preserve compatibility by using the simplest possible flat structure.

Here's an example demonstrating its usage:

{
  "obj": {
    "subKey1": 1,
    "subKey2": {"k": 2}
  }
}
  1. Write(OBJ): Key is obj, Object Value is ["subKey1", "subKey2"]
  2. Write the subkeys:
    1. Write(UINT8): Key is .obj.subKey1, Simple Value is 1
    2. Write(OBJ): Key is .obj.subKey2, Object Value is ["k"]
      1. Write(UINT8): Key is .obj.subKey2.k, Simple Value is 2
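
A minimal writer-side sketch in Python, using the add_dict helper this PR introduces (the call shape follows the PR description below; the exact behavior is illustrative, and the path/arch are placeholders):

from gguf import GGUFWriter

writer = GGUFWriter("example.gguf", "llama")  # hypothetical path and arch
# add_dict first writes the OBJ key "obj" with value ["subKey1", "subKey2"],
# then recurses into each subkey under its "."-prefixed full name,
# producing exactly the writes listed above.
writer.add_dict("obj", {
    "subKey1": 1,
    "subKey2": {"k": 2},
})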

All JSON structures are now supported:

  • objects
  • mixed-type arrays
  • nested arrays

The conventions are as follows:

  • Key convention: a leading "." indicates that the key is a subkey, not an independent key.
  • Subkey name convention in OBJ types:
    • If the subkey name starts with "/", it references the full name of another key.
    • If it contains a ":" delimiter, the string after the colon is the subkey name within this object; otherwise the referenced subkey name is used.

For example:

// tokenizer.json
{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "WhitespaceSplit"},
      {"type": "Metaspace", "replacement": "▁", ...}
    ]
  }
}

// tokenizer_config.json
{
  ...
  "bos_token": "<s>",
  "eos_token": "</s>",
  "add_bos_token": true,
}

Converting the JSON examples above into a flat structure gives:

// tokenizer.json
tokenizer = ["pre_tokenizer", ...], type OBJ
.tokenizer.pre_tokenizer.type = "Sequence", type STRING
.tokenizer.pre_tokenizer.pretokenizers = 2, type ARRAY, sub type OBJ
.tokenizer.pre_tokenizer.pretokenizers[0].type = "WhitespaceSplit", type STRING
.tokenizer.pre_tokenizer.pretokenizers[1].type = "Metaspace", type STRING
.tokenizer.pre_tokenizer.pretokenizers[1].replacement = "▁", type STRING

// tokenizer_config.json
// `tokenizer.ggml.bos_token_id`, `tokenizer.ggml.eos_token_id`, and
// `tokenizer.ggml.add_bos_token` already exist in GGUF, so just use them.
tokenizer_config = ["/tokenizer.ggml.bos_token_id:bos_token", 
  "/tokenizer.ggml.eos_token_id:eos_token", 
  "/tokenizer.ggml.add_bos_token",
  ...
], type OBJ
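
To make the reference convention concrete, here is a small hypothetical helper (illustrative only, not part of the patch) that resolves one OBJ entry into a (referenced full key, subkey name) pair:

def resolve_obj_entry(entry: str) -> tuple[str | None, str]:
    # Returns (referenced_full_key, subkey_name); the first element is
    # None when the entry is a plain subkey name rather than a reference.
    if not entry.startswith("/"):
        return None, entry                    # plain subkey name
    ref = entry[1:]
    if ":" in ref:
        full_key, subkey = ref.split(":", 1)  # "/full.key:alias" form
        return full_key, subkey
    return ref, ref.rsplit(".", 1)[-1]        # reuse the referenced subkey name

assert resolve_obj_entry("/tokenizer.ggml.bos_token_id:bos_token") == \
    ("tokenizer.ggml.bos_token_id", "bos_token")
assert resolve_obj_entry("/tokenizer.ggml.add_bos_token") == \
    ("tokenizer.ggml.add_bos_token", "add_bos_token")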

This change includes several improvements and additions to the codebase:

  • Python
    • gguf_writer.py:
      • Added def add_kv(self, key: str, val: Any) -> None: automatically determines the appropriate value type based on val.
      • Added def add_dict(self, key: str, val: dict, excludes: Sequence[str] = []) -> None: adds object (dict) values, recursively adding all subkeys.
      • Added add_array_ex to support nested and mixed-type arrays (see the usage sketch after this list).
    • constants.py:
      • Added GGUFValueType.get_type_ex(val): supports numpy integers and floating-point numbers, selecting the bit width according to the size of the integer.
    • gguf_reader.py:
      • Added functionality to retrieve values from specific fields via the new ReaderField.get() method.
    • Unit test added
  • CPP
    • ggml:
      • Added GGUF_TYPE_OBJ to the gguf_type enum type.
      • Use gguf_get_arr_n and gguf_get_arr_str to get the subKey names of GGUF_TYPE_OBJ.
      • Added gguf_set_obj_str function to set object subkey names
      • Added gguf_set_arr_obj function to set object array count
      • Added gguf_set_arr_arr function to set nested array count
    • llama:
      • Modified gguf_kv_to_str
      • Added the LLAMA_API char * gguf_kv_to_c_str function to get the value as a JSON-formatted C string.
        • Maybe this API should be moved into ggml as gguf_get_val_json.
      • Added basic support for GGUF_TYPE_OBJ and nested arrays
    • Unit test added
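
Here is a hedged usage sketch of the new writer helpers. The call shapes follow the descriptions above and the metadata dump later in this thread; the exact signatures in the patch may differ.

from gguf import GGUFWriter

w = GGUFWriter("tests/test_writer.gguf", "llama")
w.add_kv("answer", 42)                    # type inferred -> u32
w.add_kv("answer_in_float", 42.0)         # type inferred -> f32
w.add_dict("dict1", {"key1": 2, "key2": "hi", "obj": {"k": 1}})
w.add_array_ex("cArray", [3, "hi", [1, 2]])              # mixed-type array
w.add_array_ex("arrayInArray", [[2, 3, 4], [5, 7, 8]])   # nested array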

Related Issues: #4868, #2872

@ggerganov (Owner) commented

I'll need a bit of time to consider this change - I like the implementation, but I'm not yet convinced it is necessary.

The way I'm thinking is that the Python writer can directly serialize complex dictionaries into array of KVs. In your example we would write straight up obj.subkey1 and obj.subkey2.k without the OBJ lists.

The composed keys (e.g. obj.subkey1) would be mapped to llama-cpp-known keys like we already do for all KV pairs and tensor names (see gguf/constants.py and gguf/tensor_mapping.py). This way in the C++ world, we straight up read the KVs that we are interested in, and don't deal with parsing dictionaries and synchronizing the key strings (the synchronization already happened thanks to the mapping during writing the GGUF file).

With the proposed approach here, I imagine that we would have to iteratively parse the OBJ KVs into std::maps for example and do some extra work I guess.

In your example:

{
  ...,
  "pre_tokenizer": {
      "type": "Sequence",
       "pretokenizers": [
          {"type": "WhitespaceSplit"},
          {"type": "Metaspace","replacement": "", ...}
        ]
    }
}

How do you imagine the C++ code would function to query the Metaspace replacement?

Without the GGUF.OBJ extension, this should ideally map to a string KV called pretokenizer.sequences.metaspace_replacement: "_" and in C++ we simply get this KV as usual.

@snowyu (Contributor, Author) commented Jan 28, 2024

> I'll need a bit of time to consider this change - I like the implementation, but I'm not yet convinced it is necessary.
>
> The way I'm thinking is that the Python writer can directly serialize complex dictionaries into array of KVs. In your example we would write straight up obj.subkey1 and obj.subkey2.k without the OBJ lists.

Actually, at first I just wanted to implement flat-structure objects in Python using existing types, without introducing new types on the C++ side. However, I found that, for two reasons, I had to add the OBJ type:

  1. There is already some tokenizer.json and tokenizer_config.json data embedded in GGUF files, e.g. tokenizer.ggml.bos_token_id, tokenizer.chat_template, etc. Therefore, key indexing must be maintained for backward compatibility.
    • This allows us to make the following agreement to define "tokenizer_config": ["/tokenizer.ggml.bos_token_id:bos_token", "/tokenizer.ggml.add_bos_token"]
      • An initial slash (/) indicates a reference to the full name of another key.
      • A colon (:) separator indicates that the string after the colon is the name of a subkey in this object; otherwise the referenced subkey name is used.
    • If backward compatibility isn't considered, we can use the number of keys as the content of OBJ type, which requires the next specified number of keys to be the child keys of this object.
  2. The current GGUF ARRAY only supports simple-type arrays. To support object arrays, the OBJ type must be added; this also makes mixed-type arrays possible.

> How do you imagine the C++ code would function to query the Metaspace replacement?
>
> Without the GGUF.OBJ extension, this should ideally map to a string KV called pretokenizer.sequences.metaspace_replacement: "_" and in C++ we simply get this KV as usual.

Great question! This is what I'm considering: how to add OBJ support to GGUF ARRAY. The key lies in object arrays. If the pre_tokenizer type is Sequence, then the pretokenizers should be executed in array order. So the JSON example above converts to a flat structure as follows:

tokenizer.pre_tokenizer.type = "Sequence", type STRING
tokenizer.pre_tokenizer.pretokenizers = 2, type ARRAY, sub type OBJ
tokenizer.pre_tokenizer.pretokenizers[0].type = "WhitespaceSplit", type STRING
tokenizer.pre_tokenizer.pretokenizers[1].type = "Metaspace", type STRING
tokenizer.pre_tokenizer.pretokenizers[1].replacement = "▁", type STRING
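
A hedged writer-side sketch of how this flattening might be produced (add_dict and add_array_ex are the helpers from this PR; whether add_dict dispatches lists to add_array_ex internally is an assumption here):

from gguf import GGUFWriter

w = GGUFWriter("example.gguf", "llama")  # hypothetical path and arch
# Expected to emit, per the scheme above:
#   tokenizer.pre_tokenizer = ["type", "pretokenizers"], type OBJ
#   .tokenizer.pre_tokenizer.type = "Sequence"
#   .tokenizer.pre_tokenizer.pretokenizers = 2, type ARRAY, sub type OBJ
#   .tokenizer.pre_tokenizer.pretokenizers[i].* for each element
w.add_dict("tokenizer.pre_tokenizer", {
    "type": "Sequence",
    "pretokenizers": [
        {"type": "WhitespaceSplit"},
        {"type": "Metaspace", "replacement": "▁"},
    ],
})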

Thank you for your question! I hope this explanation helps clarify the need for adding an OBJ type in GGUF arrays and how it can be implemented effectively with backward compatibility considerations. Please let me know if there's anything else I can help with.

@ggerganov (Owner) commented

Uh, I don't know. Curious if other devs have opinion on this functionality.

> If pre_tokenizer type is Sequence, then pretokenizers should be executed in order of the array sequence.

I find this extremely complicated. Overall, I have a strong hesitation about supporting all these tokenizer options, templates, configs and whatnot in llama.cpp. It seems like an endless way of over-engineering something that should be very simple. (sorry for the rant, it's not towards this PR)

Let me think about this for a while, but right now I'd prefer if we just picked 1 or 2 items from the tokenizer options that are more important and useful and just support those with the existing GGUF types (like a boolean for whitespace split, etc.).

@slaren (Collaborator) commented Jan 29, 2024

To me this seems like too much effort to shoehorn a solution into the current implementation. We could just include the entire tokenizer JSON file as a string; we are not going to bundle a JSON parser in llama.cpp, but I think it is safe to assume that any application that wants to support templates has a JSON parser as well.

@snowyu (Contributor, Author) commented Jan 29, 2024

@ggerganov

> I find this extremely complicated. Overall, I have a strong hesitation about supporting all these tokenizer options, templates, configs and whatnot in llama.cpp. It seems like an endless way of over-engineering something that should be very simple. (sorry for the rant, it's not towards this PR)

Therefore, the easiest option is to use the ready-made tokenizer-cpp library. Of course, it should not be difficult to implement the pieces one by one in C++; after all, it has already been done in JS, including a simple Jinja template engine.

The advantage of embedding the tokenizer, template, and other configurations is that they can then be consumed from JS, Python, etc.

This PR now fully supports the JSON format, including mixed-type and nested arrays. See the unit test and the updated documentation.

llama_model_loader: loaded meta data with 29 key-value pairs and 3 tensors from tests/test_writer.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:         general.architecture str              = "llama"
llama_model_loader: - kv   1:            llama.block_count u32              = 12
llama_model_loader: - kv   2:                       answer u32              = 42
llama_model_loader: - kv   3:              answer_in_float f32              = 42.000000
llama_model_loader: - kv   4:                        uint8 u8               = 1
llama_model_loader: - kv   5:                        nint8 i8               = 1
llama_model_loader: - kv   6:                        dict1 obj[str,3]       = {"key1":2, "key2":"hi", "obj":{"k":1}}
llama_model_loader: - kv  11:                       oArray arr[obj,2]       = [{"k":4, "o":{"o1":6}}, {"k":9}]
llama_model_loader: - kv  18:                       cArray arr[obj,3]       = [3, "hi", [1, 2]]
llama_model_loader: - kv  22:                 arrayInArray arr[arr,2]       = [[2, 3, 4], [5, 7, 8]]
llama_model_loader: - kv  25:  tokenizer.ggml.bos_token_id str              = "bos"
llama_model_loader: - kv  26: tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  27:             tokenizer_config obj[str,2]       = {"bos_token":"bos", "add_bos_token":t...
llama_model_loader: - kv  28:            general.alignment u32              = 64
llama_model_loader: Dumping metadata keys/values Done.
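
For the reader side, a short sketch with the gguf Python package (GGUFReader and get_field exist upstream; ReaderField.get() is the accessor added by this PR, and its exact return shape is assumed here):

from gguf import GGUFReader

reader = GGUFReader("tests/test_writer.gguf")
field = reader.get_field("dict1")
if field is not None:
    # Assumed to return the reconstructed Python value,
    # e.g. {"key1": 2, "key2": "hi", "obj": {"k": 1}}
    print(field.get())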

@snowyu force-pushed the feat/obj_virtual_type branch 2 times, most recently from 84dc536 to fe25927 on February 3, 2024 at 09:41
The content of the OBJ type is actually a list of all key names of the object.

* Python
  * `gguf_writer.py`:
    * Added `def add_kv(self, key: str, val: Any) -> None`: Automatically determines the appropriate value type based on `val`.
    * Added `def add_dict(self, key: str, val: dict, excludes: Sequence[str] = []) -> None`: Adds object (dict) values, recursively adding all subkeys.
    * Added `add_array_ex` to support the nested and mixed-type array.
  * `constants.py`:
    * Added `GGUFValueType.get_type_ex(val)`: Added support for numpy's integers and floating-point numbers, selecting the number of digits according to the size of the integer.
  * `gguf_reader.py`:
    * Added functionality to retrieve values from specific fields using `ReaderField.get()` method.
  * Unit test added
* CPP
  * `ggml`:
    * Added `GGUF_TYPE_OBJ` to the `gguf_type` enum type.
    * Use `gguf_get_arr_n` and `gguf_get_arr_str` to get the subKey names of `GGUF_TYPE_OBJ`.
    * Added `gguf_set_obj_str` function to set object subkey names
    * Added `gguf_set_arr_obj` function to set object array count
    * Added `gguf_set_arr_arr` function to set nested array count
  * `llama`:
    * Modified `gguf_kv_to_str`
    * Added `LLAMA_API char * gguf_kv_to_c_str` function to get the c_str value as JSON format.
      * Maybe this API should be moved into `ggml` as `gguf_get_val_json`. (The problem is that ggml.c is written in C, while this code makes heavy use of C++ features.)
    * Added basic support to `GGUF_TYPE_OBJ` and nested array
  * Unit test added

feat: add basic support to GGUF_TYPE_OBJ on cpp
feat(gguf.py): add OBJ and mixed-type array supports to GGUF ARRAY
feat: add OBJ and mixed-type array supports to GGUF ARRAY(CPP)
feat: add nested array supported
feat: * Subkey name convention in OBJ types:
  * If the first letter of the subkey name is "/", it means referencing the full name of other keys.
  * If there is a ":" colon delimiter, it means that the string after the colon represents the subkey name in this object, otherwise the referencing subkey name is used.
feat: add LLAMA_API gguf_kv_to_c_str to llama.h
test: write test gguf file to tests folder directly(py)
test: add test-gguf-meta.cpp
feat: Key convention: "." indicates that the key is a subkey, not an independent key.
feat: add excludes argument to add_dict(gguf_write.py)
feat: add_array_ex to support nested and mixed-type arrays, keeping add_array the same
fix(constant.py): rollback the get_type function and add the new get_type_ex
test: add test compatibility
fix: use GGML_MALLOC instead of malloc