[Docs] Write the `Adding a New Model` section #138

WoosukKwon · 2023-06-04T02:03:42Z

Closes #65
This PR adds the `Adding a New Model` section to the docs.

Because the process of adding a model is complex and depends heavily on the model architecture, I provided only high-level guidance. We can improve this section further later.

zhuohan123 · 2023-06-04T10:21:02Z

docs/source/models/adding_model.rst

 Adding a New Model
 ==================

+This document provides a high-level guide on the process of adding a new model into CacheFlow.


We can provide an example for this doc:

Suggested change

This document provides a high-level guide on the process of adding a new model into CacheFlow.

This document provides a high-level guide on the process of adding a new model into CacheFlow. For example, how to add the `OPT model in huggingface <https://github.com/huggingface/transformers/blob/v4.29.1/src/transformers/models/opt/modeling_opt.py#L812>`_ to `CacheFlow <https://github.com/WoosukKwon/cacheflow/blob/62ec38ea4148bb8147f346f7e01cab6b8a2ec7b6/cacheflow/model_executor/models/opt.py#L248>`_

Changed to:

This document provides a high-level guide on integrating a `HuggingFace Transformers <https://github.com/huggingface/transformers>`_ model into CacheFlow.

I didn't add the example here because a similar example (for LLaMA) is already provided in Section 1.

WoosukKwon · 2023-06-05T09:54:35Z

@zhuohan123 I've replied to your comments and polished the writing. PTAL.

zhuohan123 · 2023-06-05T15:48:26Z

docs/source/models/adding_model.rst

+------------------------
+
+Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the `cacheflow/model_executor/models <https://github.com/WoosukKwon/cacheflow/tree/main/cacheflow/model_executor/models>`_ directory.
+For instance, you can use the code from the HuggingFace's `modeling_llama.py <https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py>`_ file for the LLaMA models.


Maybe mention the modified LLaMA model in CacheFlow (https://github.com/WoosukKwon/cacheflow/blob/main/cacheflow/model_executor/models/llama.py) here as a reference?

Added and changed the model to OPT.

zhuohan123

LGTM! My general comment is that we could include an actual code example, which would let users directly compare the HuggingFace model with the CacheFlow model. The current text is very detailed but still a bit abstract. I initially suggested OPT as the example because we have many custom kernels for LLaMA, which makes a direct comparison between HuggingFace LLaMA and CacheFlow LLaMA difficult.

WoosukKwon · 2023-06-06T03:01:14Z

> LGTM! My general comment is that we could include an actual code example, which would let users directly compare the HuggingFace model with the CacheFlow model. The current text is very detailed but still a bit abstract. I initially suggested OPT as the example because we have many custom kernels for LLaMA, which makes a direct comparison between HuggingFace LLaMA and CacheFlow LLaMA difficult.

Agreed. I actually wanted to have a walk-through example, but found it too complicated to explain. Please feel free to add any example to the doc.
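
A minimal sketch of the first step discussed above (copying the HuggingFace model file into CacheFlow's `cacheflow/model_executor/models` directory), using OPT as the example. It assumes local checkouts of both repositories. The helper name `copy_hf_model_file` is hypothetical, and the destination file name (`<model>.py`, mirroring the linked `opt.py`) is an assumption rather than a documented CacheFlow convention. The copied file would still need the model-specific rewrites that the doc describes.

```python
import shutil
from pathlib import Path


def copy_hf_model_file(transformers_root: Path, cacheflow_root: Path, model: str) -> Path:
    """Copy HuggingFace's modeling_<model>.py into CacheFlow's models directory.

    Hypothetical helper: both roots are assumed to be local repository checkouts.
    Returns the path of the copied file.
    """
    # Source layout follows the HuggingFace link in the thread:
    #   src/transformers/models/opt/modeling_opt.py
    src = (
        transformers_root / "src" / "transformers" / "models" / model
        / f"modeling_{model}.py"
    )
    # Destination layout follows the CacheFlow link in the thread:
    #   cacheflow/model_executor/models/opt.py
    dst_dir = cacheflow_root / "cacheflow" / "model_executor" / "models"
    dst = dst_dir / f"{model}.py"

    if not src.is_file():
        raise FileNotFoundError(f"HuggingFace model file not found: {src}")
    if dst.exists():
        # Refuse to overwrite a model that has already been ported.
        raise FileExistsError(f"Model file already exists: {dst}")

    dst_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return dst


# Example usage (paths are placeholders for local checkouts):
#   copy_hf_model_file(Path("transformers"), Path("cacheflow"), "opt")
```

The helper refuses to overwrite an existing file so that a model already adapted to CacheFlow is not clobbered by the unmodified HuggingFace source.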

WoosukKwon added 10 commits May 25, 2023 01:28

Add book theme as dependency

e32fb16

Shorten the command

e3bb084

Add Supported models doc

ef7c859

[WIP] empty doc for adding a new model

8268c79

Minor fix for convenience

41aa1ad

Address comments

a6854a5

[WIP]

ab2c176

Merge branch 'main' into add-new-model

2e81dc3

Wrire adding_a_new_model

b24709f

Minor

422acf7

WoosukKwon requested a review from zhuohan123 June 4, 2023 02:13

zhuohan123 reviewed Jun 4, 2023

WoosukKwon added 2 commits June 5, 2023 09:26

Address comments

5782dc6

Polish

81cee6b

WoosukKwon requested a review from zhuohan123 June 5, 2023 09:54

zhuohan123 reviewed Jun 5, 2023

zhuohan123 approved these changes Jun 5, 2023

Address comment

7bdfd97

WoosukKwon merged commit 456941c into main Jun 6, 2023

WoosukKwon deleted the add-new-model branch June 6, 2023 03:01

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

[Docs] Write the Adding a New Model section (vllm-project#138)

9c47eb4

Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request Aug 15, 2024

Add functools.wraps decorator to with_mark_steps (vllm-project#138)

bc1af91

* Add functools.wraps decorator to with_mark_steps * i cant use functools.wraps properly it seems

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Sep 30, 2024

Merge pull request vllm-project#138 from opendatahub-io/main

8fdcb03

Cherry-pick : Disable usage tracking
