
Add handling of max_gen_len from mlc-llm chat_config #64

Closed
wants to merge 3 commits into from

Conversation

@elvin-n commented Nov 14, 2023

Plugged in the chat_config created during the model build and started using the required parameters from this config instead of the HF one.

Got rid of the hardcoded context length and started using the max_gen_len parameter from chat_config.

NOTE: this fix requires a rebase on mlc-llm PR1249 for the end-to-end scenario to work properly.
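For illustration, a minimal sketch of the idea described above. The PR itself loads the config through mlc-llm's get_chat_config helper (see the diff hunks below); the raw-json fallback and the 4096 default here are assumptions, not the actual implementation.

```python
import json
import os


def load_max_gen_len(model_artifact_path: str) -> int:
    """Read max_gen_len from the mlc-chat-config.json produced at model build time.

    Sketch only: the PR uses mlc-llm's get_chat_config helper rather than raw json.
    The 4096 fallback mirrors the value that used to be hardcoded and is illustrative.
    """
    chat_file = os.path.join(model_artifact_path, "params", "mlc-chat-config.json")
    with open(chat_file, "r", encoding="utf-8") as f:
        chat_config = json.load(f)
    return int(chat_config.get("max_gen_len", 4096))
```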

With migration of required parameters to chat_config during model build procedure
@sunggg (Member) left a comment

Thank you, @elvin-n! A couple of thoughts.

@@ -352,9 +352,8 @@ def _decode_last_output(self, state: RequestState) -> str:
         return full[len(prefix) :]

     def _should_stop_by_length(self, state: RequestState) -> bool:
-        # TODO: put to config
-        max_tokens = 4096
+        max_tokens = self.text_generator.model.chat_config.max_gen_len
@sunggg (Member) Nov 14, 2023

Can we simply use the maximum context length in the HF config? This also matches the OpenAI API. (link)
Note, different llama models use different variable names to encode the maximum context length. See PR

@elvin-n (Author) Nov 14, 2023

> Can we simply use the maximum context length in the HF config? This also matches the OpenAI API.

I do not fully understand what kind of simplification you are proposing. The OpenAI API refers to a parameter in the request. It is also handled in the _should_stop_by_length function; the minimum value will be selected. We cannot have only one. We have two parameters: one is defined by the model, the other is defined by the user in the request.

> Note, different llama models use different variable names to encode the maximum context length.

It does not matter, since we do not work with the llama config parameters directly. We work with the parameter that is written into chat_config, and it is under our control to use the same parameter name for any model. This parameter is in fact defined by PR1249.
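A hedged sketch of the two-limit selection described above. The parameter names below (prompt_len, output_len, request_max_tokens) are illustrative stand-ins, not the actual RequestState fields, and the assumption that max_gen_len bounds the total context comes from this thread.

```python
from typing import Optional


def should_stop_by_length(
    prompt_len: int,
    output_len: int,
    model_max_gen_len: int,             # model-defined limit, from chat_config.max_gen_len
    request_max_tokens: Optional[int],  # user-defined limit from the OpenAI-style request
) -> bool:
    """Return True when either the model limit or the per-request limit is hit."""
    # Budget left under the model's limit, assuming max_gen_len bounds prompt + generation.
    model_budget = model_max_gen_len - prompt_len
    # If the user supplied max_tokens, the effective limit is the minimum of the two budgets.
    budget = model_budget if request_max_tokens is None else min(model_budget, request_max_tokens)
    return output_len >= budget
```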

@sunggg (Member) Nov 14, 2023

> I do not fully understand what kind of simplification you are proposing. The OpenAI API refers to a parameter in the request. It is also handled in the _should_stop_by_length function; the minimum value will be selected. We cannot have only one. We have two parameters: one is defined by the model, the other is defined by the user in the request.

IIUC, self.text_generator.model.chat_config.max_gen_len here is actually equal to the max context length defined in the HF config?
https://github.com/mlc-ai/mlc-llm/pull/1249/files#diff-969f33952a1886f9f4b1b5c567cce40a4748938b7ad263f861142ecb1bb86551R828

@elvin-n (Author)

> here is actually equal to the max context length defined in the HF config

Not necessarily. max_sequence_length may or may not be in the original HF model. The max_sequence_length that you refer to is a field in the mlc-llm build data structure. It can be, and in fact is, initialized from max_position_embeddings in the HF model.

I.e. we have (see the sketch right after this list):

  • max_position_embeddings or max_sequence_length in the original model
  • max_sequence_length in the mlc-llm build flow, which can be initialized from the above parameters in the HF model or explicitly overridden by max_seq_len passed to the build.py script
  • max_seq_len - the only parameter that is written to mlc_chat_config.json
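A sketch of that precedence. The HF-config keys and the build.py override come from the bullets above; the function name and the exact CLI plumbing are assumptions.

```python
from typing import Optional


def resolve_max_seq_len(hf_config: dict, cli_max_seq_len: Optional[int] = None) -> int:
    """Resolve the value that ends up written as max_seq_len in mlc_chat_config.json."""
    if cli_max_seq_len is not None:
        # An explicit max_seq_len passed to build.py overrides everything.
        return cli_max_seq_len
    # Otherwise use whichever field the original HF model provides.
    for key in ("max_sequence_length", "max_position_embeddings"):
        if key in hf_config:
            return int(hf_config[key])
    raise ValueError("HF config has no maximum context length field")
```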

If we want to have a standalone deploy flow, we need to have all the data together with the compiled model. This can be:

  • mlc_chat_config.json, which is my preference; in this case max_seq_len is the only parameter that we can/need to use.
  • Or we can introduce a new config file for the batch-serve solution, which I prefer less. In this case we could name the parameter in this new config file max_sequence_length.

Member

@elvin-n, my main concern with mlc_chat_config.json is that it contains too much unnecessary info for mlc_serve. For example, ConvConfig, model_category, top_p, mean_genlen, ... I think this is what @jroesch also meant in this PR.

I worked on this quickly yesterday for the latter option. Let me send the PR soon so that we can discuss it over there.

@@ -335,13 +336,207 @@ def get_tvm_model(artifact_path, model, quantization, num_shards, dev):

    lib_path = os.path.join(model_artifact_path, f"{model}-{quantization}-cuda.so")

    chat_file = os.path.join(f"{model_artifact_path}/params", "mlc-chat-config.json")
Member

mlc-chat-config is for cpp/llm_chat.cc, which is not used by mlc-serve.
But I'm wondering if we can introduce some mlc-serve configs to get rid of the dependency on the hf config. This dependency forces the endpoint to prepare a set of hf configs, such as the tokenizer, which seems unnecessary. Ideally, it would be great if we could put everything under f"{model_artifact_path}/mlc-serve-config".

@elvin-n (Author)

> But I'm wondering if we can introduce some mlc-serve configs to get rid of the dependency on the hf config.

This is exactly what this PR does by switching to mlc_chat_config.json in paged_cache_model.py. mlc_chat_config.json was originally created for the single-batch cpp/llm_chat.cc, but it would be nice to reuse it instead of introducing one more config file during the model build. But if you think it makes sense to create a separate config for mlc-serve, we can duplicate it during model compilation.
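If the separate-config route is preferred, the duplication step at build time could look roughly like this; the output file name and the retained fields are assumptions, following the f"{model_artifact_path}/mlc-serve-config" suggestion above.

```python
import json
import os


def write_mlc_serve_config(model_artifact_path: str) -> str:
    """Copy only the fields the batch-serving engine needs out of mlc-chat-config.json."""
    src = os.path.join(model_artifact_path, "params", "mlc-chat-config.json")
    with open(src, "r", encoding="utf-8") as f:
        chat_config = json.load(f)

    # Illustrative subset; drops ConvConfig, top_p, mean_genlen, etc.,
    # which mlc_serve does not need.
    keep = ("model_lib", "local_id", "max_gen_len", "sliding_window")
    serve_config = {k: chat_config[k] for k in keep if k in chat_config}

    dst = os.path.join(model_artifact_path, "mlc-serve-config.json")
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(serve_config, f, indent=2)
    return dst
```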

            artifact_path, model_name, quant, num_shards, dev
        )
        self.chat_config = get_chat_config(self.config_file_path, None)
        self.chat_config.num_shards = num_shards
        self.chat_config.sliding_window = sliding_window  # this should migrate eventually into chat config
Member

If we are going to migrate to a config-based approach, I'd like to finish it in this PR. Currently, while I'm working on the endpoint integration, these slightly different configurations are a huge pain.

@masahi (Member) commented Nov 14, 2023

We also need to update https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/staging_engine_worker.py#L296-L297.

@elvin-n (Author) commented Nov 15, 2023

> We also need to update https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/staging_engine_worker.py#L296-L297.

My bad. Fixed.
But we need to evaluate refactoring the synced and staged engines; they have several duplicated parts.

@masahi (Member) commented Nov 15, 2023

> My bad. Fixed. But we need to evaluate refactoring the synced and staged engines; they have several duplicated parts.

Yes, this is one of the left-over items from #48. I think we should keep the sync engine for debugging purposes, so we should refactor. You are welcome to do so if you are interested.
