Add handling of max_gen_len from mlc-llm chat_config #64
Conversation
With the migration of required parameters to chat_config during the model build procedure.
Thank you, @elvin-n! A couple thoughts.
```diff
@@ -352,9 +352,8 @@ def _decode_last_output(self, state: RequestState) -> str:
         return full[len(prefix) :]

     def _should_stop_by_length(self, state: RequestState) -> bool:
-        # TODO: put to config
-        max_tokens = 4096
+        max_tokens = self.text_generator.model.chat_config.max_gen_len
```
Can we simply use the maximum context length in the HF config? This also matches the OpenAI API.
I do not fully understand what kind of simplification you are proposing. The OpenAI API refers to a parameter in the request. It is also handled in the _should_stop_by_length function: the minimum of the two values is selected. We cannot have only one of them. We have two parameters - one is defined by the model, the other is defined by the user in the request.
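For illustration, here is a minimal sketch of the stopping rule described above, assuming a hypothetical RequestState that exposes the generated token ids and the per-request limit; apart from max_gen_len, the attribute names are placeholders rather than the actual mlc-serve fields:

```python
from typing import Optional


def _should_stop_by_length(self, state: "RequestState") -> bool:
    # Model-side budget, taken from the chat config produced at build time.
    model_limit: int = self.text_generator.model.chat_config.max_gen_len
    # User-side budget from the request (OpenAI-style max_tokens); may be absent.
    request_limit: Optional[int] = state.stopping_criteria.max_tokens  # placeholder name
    # Both limits apply, so the effective budget is the smaller of the two.
    effective_limit = model_limit if request_limit is None else min(model_limit, request_limit)
    return len(state.token_ids) >= effective_limit
```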
> Note, different Llama models use different variable names to encode the maximum context length.
It does not matter, since we do not work with the Llama config parameters directly. We work with the parameter that is written into chat_config, and it is under our control to use the same parameter for any model. This parameter is indeed defined by PR1249.
> I do not fully understand what kind of simplification you are proposing. The OpenAI API refers to a parameter in the request. It is also handled in the _should_stop_by_length function: the minimum of the two values is selected. We cannot have only one of them. We have two parameters - one is defined by the model, the other is defined by the user in the request.
IIUC, self.text_generator.model.chat_config.max_gen_len here is actually equal to the max context length defined in the HF config?
https://github.com/mlc-ai/mlc-llm/pull/1249/files#diff-969f33952a1886f9f4b1b5c567cce40a4748938b7ad263f861142ecb1bb86551R828
> here is actually equal to the max context length defined in the HF config
Not necessarily. max_sequence_length may or may not be present in the original HF model. The max_sequence_length that you refer to is a field in the mlc-llm build data structure; it can be, and in practice is, initialized by max_position_embeddings from the HF model.

I.e. we have:
- max_position_embeddings or max_sequence_length in the original model
- max_sequence_length in the mlc-llm build flow, which can be initialized by the above parameters from the HF model or explicitly overridden by max_seq_len passed to the build.py script
- max_seq_len - the parameter that is only written to mlc_chat_config.json

If we want to have a standalone deploy flow, we need to have all the data together with the compiled model. This can be:
- mlc_chat_config.json, which is preferred for me; in this case max_seq_len is the only parameter that we can/need to use.
- Or we can introduce a new config file for the batch serve solution, which is less preferred for me. In this case we could name the parameter in this new config file max_sequence_length.
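For what it's worth, the precedence described in that list could be sketched roughly as below; this assumes the build flow sees the raw HF config as a dict plus an optional --max-seq-len override, and the helper name and lookup order are illustrative rather than the actual mlc-llm code:

```python
from typing import Optional


def resolve_max_sequence_length(hf_config: dict, max_seq_len_arg: Optional[int]) -> int:
    """Illustrative resolution order for the model's maximum context length."""
    if max_seq_len_arg is not None:
        # An explicit --max-seq-len passed to build.py wins over the HF config.
        return max_seq_len_arg
    # Otherwise fall back to whichever field the original HF model provides.
    for key in ("max_sequence_length", "max_position_embeddings"):
        if key in hf_config:
            return int(hf_config[key])
    raise ValueError(
        "HF config defines neither max_sequence_length nor max_position_embeddings"
    )
```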
@elvin-n, my main concern with mlc_chat_config.json is that it contains too much unnecessary info for mlc_serve. For example, ConvConfig, model_category, top_p, mean_gen_len, ... I think this is what @jroesch also meant in this PR.
I worked on this quickly yesterday for the latter option. Let me send the PR soon so that we can discuss over there.
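For context, the "latter option" might look something like a stripped-down, serve-only config that ignores the chat-oriented fields; the class and field names below are purely a hypothetical sketch, not the contents of the follow-up PR:

```python
import json
from dataclasses import dataclass


@dataclass
class MLCServeEngineConfig:
    """Hypothetical serve-side config carrying only what the engine needs."""
    model_name: str
    quantization: str
    num_shards: int
    max_gen_len: int          # model-side generation/context budget
    sliding_window: int = -1  # assumed convention: -1 means "disabled"


def load_serve_config(path: str) -> MLCServeEngineConfig:
    with open(path, "r") as f:
        raw = json.load(f)
    # Drop any extra fields (ConvConfig, top_p, mean_gen_len, ...) a chat config would carry.
    wanted = {k: raw[k] for k in MLCServeEngineConfig.__dataclass_fields__ if k in raw}
    return MLCServeEngineConfig(**wanted)
```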
```diff
@@ -335,13 +336,207 @@ def get_tvm_model(artifact_path, model, quantization, num_shards, dev):
    lib_path = os.path.join(model_artifact_path, f"{model}-{quantization}-cuda.so")

    chat_file = os.path.join(f"{model_artifact_path}/params", "mlc-chat-config.json")
```
mlc-chat-config is for cpp/llm_chat.cc, which is not used by mlc-serve.
But I'm wondering if we can introduce some mlc-serve configs to get rid of the dependency on the HF config. This dependency forces the endpoint to prepare a set of HF configs, such as the tokenizer, which seems unnecessary. Ideally, it would be great if we could put everything under f"{model_artifact_path}/mlc-serve-config".
> But I'm wondering if we can introduce some mlc-serve configs to get rid of the dependency on the HF config.
This is exactly what this PR does by switching to mlc_chat_config.json in paged_cache_model.py. The mlc_chat_config.json is originally created for the single-batch cpp/llm_chat.cc, but it would be nice to reuse it instead of introducing one more config file during the build of the model. But if you think it makes sense to create a separate config for mlc-serve, we can duplicate it during model compilation.
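As a rough illustration of that switch, reading the build-time config on the serve side could be as simple as the sketch below; the params path mirrors the diff earlier in this thread, and treating the result as a plain dict (rather than using get_chat_config) is just an assumption for brevity:

```python
import json
import os


def load_chat_config(model_artifact_path: str) -> dict:
    # Same location the diff above points at: <artifact>/params/mlc-chat-config.json
    chat_file = os.path.join(f"{model_artifact_path}/params", "mlc-chat-config.json")
    with open(chat_file, "r") as f:
        chat_config = json.load(f)
    # chat_config.get("max_gen_len") is then what replaces the hardcoded 4096 limit.
    return chat_config
```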
```python
            artifact_path, model_name, quant, num_shards, dev
        )
        self.chat_config = get_chat_config(self.config_file_path, None)
        self.chat_config.num_shards = num_shards
        self.chat_config.sliding_window = sliding_window  # this should migrate eventually into chat config
```
If we are going to migrate to a config-based approach, I'd like to finish it in this PR. Currently, as I'm working on the endpoint integration, these slightly different configurations are a huge pain.
My bad. Fixed.
Yes, this is one of the left-over pieces of work from #48. I think we should keep the sync engine for debugging purposes, so we should refactor it. You're welcome to do so if you are interested.
- Plugged in the chat_config created during the build of the model and started to use the existing required parameters from this config instead of the HF one.
- Got rid of the hardcoded context length and started to use the max_gen_len parameter from chat_config.
- NOTE: this fix requires a rebase on mlc-llm PR1249 for the end-to-end scenario to work properly.