Clean up QK and file and tensor types #678
Conversation
    FTYPE_Q4_0 = 2,
    FTYPE_Q4_1 = 3,
};
static const char * ftype_str[] = { "f32", "f16", "q4_0", "q4_1" };
Unfortunately, pedantic ISO C++ does not allow designated initializers.
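(For context: array designated initializers are a C99 feature that ISO C++, and hence `-pedantic` builds, rejects. A minimal sketch of the two forms, assuming FTYPE_F32 and FTYPE_F16 take the values 0 and 1:)

```c
// Not valid in pedantic ISO C++ (array designated initializers are a C99 feature):
// static const char * ftype_str[] = {
//     [FTYPE_F32]  = "f32",
//     [FTYPE_F16]  = "f16",
//     [FTYPE_Q4_0] = "q4_0",
//     [FTYPE_Q4_1] = "q4_1",
// };

// Portable alternative: plain aggregate initialization, relying on the enum
// values being the consecutive indices 0..3 (assumed here for F32/F16).
static const char * ftype_str[] = { "f32", "f16", "q4_0", "q4_1" };
```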
@@ -100,7 +109,7 @@ struct llama_hparams {
    int32_t n_head = 32;
    int32_t n_layer = 32;
    int32_t n_rot = 64;
-   int32_t f16 = 1;
+   int32_t f16 = FTYPE_F16;
I'm not sure if this is the correct semantics - what's the intention behind `ftype` and `f16`?
`f16` is badly named; it dates from when I was only considering F16 and F32 and no other data types.
@@ -222,7 +179,7 @@ def copy_tensors(fin, fout, part_id, n_parts):
    # ensure tensor data is aligned
    tensor_data_offset = fout.tell()
-   while tensor_data_offset % QK != 0:
+   while tensor_data_offset % 32 != 0:
`llama_model_load` and `llama_model_quantize_internal` have 32 hard-coded as well. What matters is not the quantized block size, but whether the processor can efficiently load and store from a multiple of this number, especially with SIMD instructions.
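As an illustration (not code from this PR), padding a file offset to a 32-byte boundary before writing tensor data might look like this on the C side; the 32 is an I/O-alignment choice, independent of the quantization block size QK:

```c
#include <stdio.h>

// Pad fout with zero bytes until its offset is a multiple of `alignment`,
// so later loads of the tensor data start on a SIMD-friendly boundary.
static void pad_to_alignment(FILE * fout, long alignment) {
    long offset = ftell(fout);
    while (offset % alignment != 0) {
        fputc(0, fout);
        offset++;
    }
}

// e.g. pad_to_alignment(fout, 32); before writing tensor data
```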
I think it would be worthwhile to separate the more straightforward Python changes into a separate PR which can be reviewed and merged sooner than the more complex changes in C.
Agree, but I would like the Python definitions to be similar to C/C++. So I'm looking for opinions; once we have general consensus on whether and how this should be done, I will split the PR.
@@ -32,6 +33,7 @@ def write_header(f_out, header):
    if magic != 0x67676d6c:
Can you make the magic a constant too?
We need old magic constants anyway to detect older models.
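A sketch of what that could look like, with the old magic kept around for detection (the constant and function names here are illustrative, not necessarily what the code uses):

```c
#include <stdint.h>

// 'ggml' in ASCII: the original, unversioned file magic (value from the check above).
#define FILE_MAGIC_UNVERSIONED 0x67676d6c
// The current format would use a different, distinct magic (value omitted here).

// Older models can then be detected explicitly instead of comparing raw literals.
static int is_unversioned_model(uint32_t magic) {
    return magic == FILE_MAGIC_UNVERSIONED;
}
```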
These refactors and code maintenance changes are very helpful.

- `enum e_ftype` can be moved to `llama.h` and be called `enum llama_ftype`
- I think `GGML_FILE` -> `LLAMA_FTYPE`
@ggerganov I'm glad you like the overall direction. I'll make a separate PR with this enum, then we might sync it with the Python code in #545 before that gets merged.
(edit: the python changes will clash with #545)
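For reference, a sketch of what the suggested public enum in llama.h might look like; the F32/F16 values are assumptions, the Q4_0/Q4_1 values follow the diff at the top of this PR:

```c
// Sketch only: exact enumerator names/values would be settled in the follow-up PR.
enum llama_ftype {
    LLAMA_FTYPE_F32  = 0,   // assumed
    LLAMA_FTYPE_F16  = 1,   // assumed
    LLAMA_FTYPE_Q4_0 = 2,   // matches FTYPE_Q4_0 in the diff above
    LLAMA_FTYPE_Q4_1 = 3,   // matches FTYPE_Q4_1 in the diff above
};
```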
This PR has several goals:

For the python scripts, I introduce a new file `ggml.py` at the top level, which contains definitions of the file and tensor types equivalent to those in ggml.h. I have formatted that with `black`, in case #611 returns from the dead.

The changes to the python files on one hand and C/C++ on the other are technically independent, but the discussion will overlap, so I'm keeping this in one draft. I will split it later if it makes sense.
I have tested conversion from pth to ggml with identical outputs, but I have not tested the other conversion scripts.
Open questions:

- Should `enum e_ftype` be moved to `llama.h` (with sensible renaming)? This would allow us to eliminate the hard-coded 2, 3 in the usage string of `quantize.cpp` (see the sketch after this list), and maybe be useful elsewhere. ~~I could not find any other parts of the code that would use this.~~ I see now, it's for GPTQ models. Should I add `FTYPE_GPTQ`? (llama.cpp/llama.cpp, line 515 at 3525899)
- Is the `GGML_FILE` prefix ok or should it be `LLAMA_FILE`? ggml.c doesn't deal with that type.

Some more comments below, looking for your thoughts on this...
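On the first question, a hypothetical sketch of how quantize.cpp could derive its usage string from the enum instead of the hard-coded 2 and 3 (the helper function here is illustrative, not the PR's code):

```c
#include <stdio.h>

// Values as defined in the diff at the top of this PR.
enum e_ftype { FTYPE_Q4_0 = 2, FTYPE_Q4_1 = 3 };

// Hypothetical helper: the literals 2 and 3 no longer appear in the usage text.
static void print_usage(const char * prog) {
    fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type\n", prog);
    fprintf(stderr, "  type = %d - q4_0\n", FTYPE_Q4_0);
    fprintf(stderr, "  type = %d - q4_1\n", FTYPE_Q4_1);
}
```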