[Refactor] Clean-up Management of Model/Artifact/Engine Info #66

Merged

@sunggg sunggg commented Nov 15, 2023

Problem

Currently, several factors make the deployment flow tricky:

  • No separation between compile-time info (e.g., build options, model info) and deployment info (e.g., engine config). We often mix the two, so the current engine needs compile-time info such as num_shards in addition to the model artifact. This is unnecessary.
  • Unnecessary duplication of info management.
  • Dependency on the HF model config for model and tokenizer info. This also requires the artifact to keep such info in a specific path structure, so the deployment flow has to copy it separately after every compilation.
  • No way to check which build options were used for a given artifact.
  • Implicit name deduction. Since the deduction rule has changed over time, ollm had to be updated accordingly.
  • Disco compilation requires two build steps.

Changes

To overcome these issues, this PR lands the following changes:

  • Explicit separation between compile-time info and deployment info. Compile-time info is managed by ModelArtifactConfig, while deployment info is managed by MLCServeEngineConfig.
  • Removed redundant info management. All necessary info is managed by two structs: ModelArtifactConfig and MLCServeEngineConfig.
  • Removed the dependency on HF configs. The build script dumps the artifact as follows:
`model_artifact_path` (`asset` in ollm) has the following structure:
|- compiled artifact (.so)
|- `build_config.json`: stores compile-time info, such as `num_shards`, `quantization`, and the full set of build flags used.
|- params/ : stores weights in mlc format and `ndarray-cache.json`.
|            `ndarray-cache.json` is especially important for Disco.
|- model/ : stores info from HF model cards, such as max context length and tokenizer
  • Build options used to produce the artifact can be found in build_config.json.
  • Name deduction is no longer required. The previous deduction rule is still respected, but the model artifact name can also be specified directly.
model artifact name: llama-2-13b-chat-hf-q0f16-presharded-1gpu
before: `--local-id llama-2-13b-chat-hf-q0f16`
after: `--local-id llama-2-13b-chat-hf-q0f16-presharded-1gpu`  or `--local-id llama-2-13b-chat-hf-q0f16`
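
For reviewers who want a concrete picture of the layout and the name handling described above, here is a minimal sketch. It is not the code in this PR: `BuildInfo`, `load_build_info`, `resolve_artifact_dir`, and `make_demo_artifact` are hypothetical names, the JSON keys are assumed to match the `num_shards` and `quantization` names mentioned above, and the `-presharded-<N>gpu` suffix rule is inferred only from the single example given here.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BuildInfo:
    """Hypothetical subset of the compile-time info kept in build_config.json."""
    num_shards: int
    quantization: str


def load_build_info(artifact_dir: Path) -> BuildInfo:
    """Read compile-time info from <artifact_dir>/build_config.json."""
    with open(artifact_dir / "build_config.json") as f:
        raw = json.load(f)
    # Assumption: the keys are spelled exactly as in the PR description.
    # Any other build flags stored in the file are ignored here.
    return BuildInfo(num_shards=raw["num_shards"], quantization=raw["quantization"])


def resolve_artifact_dir(root: Path, local_id: str, num_shards: int = 1) -> Path:
    """Prefer a directly specified artifact name, then fall back to a deduced one."""
    exact = root / local_id
    if exact.is_dir():
        return exact
    # Assumed deduction rule, inferred from the one example in the description.
    deduced = root / f"{local_id}-presharded-{num_shards}gpu"
    if deduced.is_dir():
        return deduced
    raise FileNotFoundError(f"no model artifact for {local_id!r} under {root}")


def make_demo_artifact(root: Path, name: str) -> Path:
    """Create an empty stand-in artifact with the directory layout listed above."""
    artifact = root / name
    (artifact / "params").mkdir(parents=True)  # weights and ndarray-cache.json
    (artifact / "model").mkdir()               # tokenizer and model-card info
    (artifact / "build_config.json").write_text(
        json.dumps({"num_shards": 1, "quantization": "q0f16"})
    )
    return artifact
```

With a stand-in artifact named `llama-2-13b-chat-hf-q0f16-presharded-1gpu`, both `resolve_artifact_dir(root, "llama-2-13b-chat-hf-q0f16")` and `resolve_artifact_dir(root, "llama-2-13b-chat-hf-q0f16-presharded-1gpu")` resolve to the same directory in this sketch, which mirrors the before/after `--local-id` behavior.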

Todo

  • High: ollm integration
  • High: determine default engine config
  • Low: mlc_serve's dependency on mlc_llm makes packaging tricky. Better to remove it.

cc. @jroesch @elvin-n @masahi


@jroesch jroesch left a comment


Overall LGTM, couple small follow up nits/questions.


@elvin-n elvin-n left a comment


LGTM

@sunggg sunggg merged commit 858a444 into octoml:batch-serving Nov 16, 2023
5 checks passed
4 participants