Add handling of max_gen_len from mlc-llm chat_config #64
Conversation
With the migration of required parameters to chat_config during the model build procedure.
Thank you, @elvin-n! A couple thoughts.
```diff
@@ -352,9 +352,8 @@ def _decode_last_output(self, state: RequestState) -> str:
         return full[len(prefix) :]

     def _should_stop_by_length(self, state: RequestState) -> bool:
-        # TODO: put to config
-        max_tokens = 4096
+        max_tokens = self.text_generator.model.chat_config.max_gen_len
```
Can we simply use the maximum context length in the HF config? This also matches the OpenAI API.
I do not fully understand what kind of simplification you are proposing. The OpenAI API refers to a parameter in the request. It is also handled in the _should_stop_by_length function: the minimum of the two values is selected. We cannot have only one of them. We have two parameters - one is defined by the model, the other is defined by the user in the request.
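For illustration, here is a minimal sketch of the stopping rule described above, assuming a hypothetical RequestState that exposes the generated token ids and the per-request limit; apart from max_gen_len, the attribute names are placeholders rather than the actual mlc-serve fields:

```python
from typing import Optional


def _should_stop_by_length(self, state: "RequestState") -> bool:
    # Model-side budget, taken from the chat config produced at build time.
    model_limit: int = self.text_generator.model.chat_config.max_gen_len
    # User-side budget from the request (OpenAI-style max_tokens); may be absent.
    request_limit: Optional[int] = state.stopping_criteria.max_tokens  # placeholder name
    # Both limits apply, so the effective budget is the smaller of the two.
    effective_limit = model_limit if request_limit is None else min(model_limit, request_limit)
    return len(state.token_ids) >= effective_limit
```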
> Note, different Llama models use different variable names to encode the maximum context length.
It does not matter, since we do not work with the Llama config parameters directly. We work with the parameter that is written into chat_config, and it is under our control to use the same parameter for any model. This parameter is indeed defined by PR1249.
> I do not fully understand what kind of simplification you are proposing. The OpenAI API refers to a parameter in the request. It is also handled in the _should_stop_by_length function: the minimum of the two values is selected. We cannot have only one of them. We have two parameters - one is defined by the model, the other is defined by the user in the request.
IIUC, self.text_generator.model.chat_config.max_gen_len here is actually equal to the max context length defined in the HF config?
https://github.com/mlc-ai/mlc-llm/pull/1249/files#diff-969f33952a1886f9f4b1b5c567cce40a4748938b7ad263f861142ecb1bb86551R828
> here is actually equal to the max context length defined in the HF config
Not necessarily. max_sequence_length may or may not be present in the original HF model. The max_sequence_length that you refer to is a field in the mlc-llm build data structure; it can be, and in practice is, initialized by max_position_embeddings from the HF model.

I.e. we have:
- max_position_embeddings or max_sequence_length in the original model
- max_sequence_length in the mlc-llm build flow, which can be initialized by the above parameters from the HF model or explicitly overridden by max_seq_len passed to the build.py script
- max_seq_len - the parameter that is only written to mlc_chat_config.json

If we want to have a standalone deploy flow, we need to have all the data together with the compiled model. This can be:
- mlc_chat_config.json, which is preferred for me; in this case max_seq_len is the only parameter that we can/need to use.
- Or we can introduce a new config file for the batch serve solution, which is less preferred for me. In this case we could name the parameter in this new config file max_sequence_length.
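For what it's worth, the precedence described in that list could be sketched roughly as below; this assumes the build flow sees the raw HF config as a dict plus an optional --max-seq-len override, and the helper name and lookup order are illustrative rather than the actual mlc-llm code:

```python
from typing import Optional


def resolve_max_sequence_length(hf_config: dict, max_seq_len_arg: Optional[int]) -> int:
    """Illustrative resolution order for the model's maximum context length."""
    if max_seq_len_arg is not None:
        # An explicit --max-seq-len passed to build.py wins over the HF config.
        return max_seq_len_arg
    # Otherwise fall back to whichever field the original HF model provides.
    for key in ("max_sequence_length", "max_position_embeddings"):
        if key in hf_config:
            return int(hf_config[key])
    raise ValueError(
        "HF config defines neither max_sequence_length nor max_position_embeddings"
    )
```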
@elvin-n, my main concern with mlc_chat_config.json is that it contains too much unnecessary info for mlc_serve. For example, ConvConfig, model_category, top_p, mean_gen_len, ... I think this is what @jroesch also meant in this PR.
I worked on this quickly yesterday for the latter option. Let me send the PR soon so that we can discuss over there.
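For context, the "latter option" might look something like a stripped-down, serve-only config that ignores the chat-oriented fields; the class and field names below are purely a hypothetical sketch, not the contents of the follow-up PR:

```python
import json
from dataclasses import dataclass


@dataclass
class MLCServeEngineConfig:
    """Hypothetical serve-side config carrying only what the engine needs."""
    model_name: str
    quantization: str
    num_shards: int
    max_gen_len: int          # model-side generation/context budget
    sliding_window: int = -1  # assumed convention: -1 means "disabled"


def load_serve_config(path: str) -> MLCServeEngineConfig:
    with open(path, "r") as f:
        raw = json.load(f)
    # Drop any extra fields (ConvConfig, top_p, mean_gen_len, ...) a chat config would carry.
    wanted = {k: raw[k] for k in MLCServeEngineConfig.__dataclass_fields__ if k in raw}
    return MLCServeEngineConfig(**wanted)
```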
```diff
@@ -335,13 +336,207 @@ def get_tvm_model(artifact_path, model, quantization, num_shards, dev):
    lib_path = os.path.join(model_artifact_path, f"{model}-{quantization}-cuda.so")

    chat_file = os.path.join(f"{model_artifact_path}/params", "mlc-chat-config.json")
```
mlc-chat-config is for cpp/llm_chat.cc, which is not used by mlc-serve.
But I'm wondering if we can introduce some mlc-serve configs to get rid of the dependency on the HF config. This dependency forces the endpoint to prepare a set of HF configs, such as the tokenizer, which seems unnecessary. Ideally, it would be great if we could put everything under f"{model_artifact_path}/mlc-serve-config".
> But I'm wondering if we can introduce some mlc-serve configs to get rid of the dependency on the HF config.
This is exactly what this PR does by switching to mlc_chat_config.json in paged_cache_model.py. The mlc_chat_config.json is originally created for the single-batch cpp/llm_chat.cc, but it would be nice to reuse it instead of introducing one more config file during the build of the model. But if you think it makes sense to create a separate config for mlc-serve, we can duplicate it during model compilation.
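As a rough illustration of that switch, reading the build-time config on the serve side could be as simple as the sketch below; the params path mirrors the diff earlier in this thread, and treating the result as a plain dict (rather than using get_chat_config) is just an assumption for brevity:

```python
import json
import os


def load_chat_config(model_artifact_path: str) -> dict:
    # Same location the diff above points at: <artifact>/params/mlc-chat-config.json
    chat_file = os.path.join(f"{model_artifact_path}/params", "mlc-chat-config.json")
    with open(chat_file, "r") as f:
        chat_config = json.load(f)
    # chat_config.get("max_gen_len") is then what replaces the hardcoded 4096 limit.
    return chat_config
```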
```python
            artifact_path, model_name, quant, num_shards, dev
        )
        self.chat_config = get_chat_config(self.config_file_path, None)
        self.chat_config.num_shards = num_shards
        self.chat_config.sliding_window = sliding_window  # this should migrate eventually into chat config
```
If we are going to migrate to a config-based approach, I'd like to finish it in this PR. Currently, as I'm working on the endpoint integration, these slightly different configurations are a huge pain.
My bad. Fixed.
Yes, this is one of the left-over pieces of work from #48. I think we should keep the sync engine for debugging purposes, so we should refactor it. You're welcome to do so if you are interested.
- Plugged in the chat_config created during the build of the model and started to use the existing required parameters from this config instead of the HF one.
- Got rid of the hardcoded context length and started to use the max_gen_len parameter from chat_config.
- NOTE: this fix requires a rebase on mlc-llm PR1249 for the end-to-end scenario to work properly.