forked from mlc-ai/mlc-llm
Merge upstream nov11 #59
Merged
Fix two bugs in kv-cache pop loop. Bug 1: the old code stopped early because `output_ids` was shortened in place during the loop. Bug 2: off-by-one in the backoff size due to the `break`.
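The two bugs can be illustrated with a small sketch. This is not the code from the commit: the function names, the `is_clean` predicate, and the loop shape are assumptions made only to show how a bound that shrinks during the loop ends it early, and how a `break` can skip the final count.

```python
from typing import Callable, List


def pop_until_buggy(output_ids: List[int], is_clean: Callable[[List[int]], bool]) -> int:
    """Hypothetical reconstruction of the two failure modes."""
    backoff = 0
    i = 0
    while i < len(output_ids):  # Bug 1: the bound shrinks with every pop
        output_ids.pop()
        if is_clean(output_ids):
            break  # Bug 2: leaves before this pop is counted
        i += 1
        backoff += 1
    return backoff


def pop_until_fixed(output_ids: List[int], is_clean: Callable[[List[int]], bool]) -> int:
    """Pop trailing tokens until the predicate holds; count every pop."""
    backoff = 0
    while output_ids and not is_clean(output_ids):
        output_ids.pop()
        backoff += 1
    return backoff
```

With six tokens and a predicate that needs five pops, the buggy loop stops after three; with a predicate that needs two pops, it reports a backoff of one.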
…1017) This commit adds an optional `--pdb` flag to the `build.py` script. If passed, any exception raised that would otherwise terminate the script will first enter a pdb post-mortem, allowing the error to be inspected.
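A minimal sketch of how such a flag can be wired up, assuming an `argparse`-based script. The helper names `parse_args` and `run_with_optional_pdb` are hypothetical, and whether `build.py` re-raises after the post-mortem is an assumption here.

```python
import argparse
import pdb
import sys


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pdb",
        action="store_true",
        help="Enter a pdb post-mortem on any uncaught exception.",
    )
    return parser.parse_args(argv)


def run_with_optional_pdb(main, use_pdb, post_mortem=pdb.post_mortem):
    """Run main(); on failure, optionally open the debugger at the raise site."""
    try:
        return main()
    except Exception:
        if use_pdb:
            # sys.exc_info()[2] is the traceback of the exception being handled.
            post_mortem(sys.exc_info()[2])
        raise  # assumption: the script still terminates with the original error
```

The `post_mortem` parameter exists only so the behaviour can be exercised without an interactive session.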
…ai#1040) Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs, build model
llama2 q4f160
fix permission issue
Support for the stablelm-3b-4e1t model
* Iterate model prebuilts docs * small fix
This PR separates out the tokenizer creation function, the random number generator out from `llm_chat.cc` as a preparation step for batching inference support, since these functions/modules are also used in the same way in batching inference.
Update README.md
* add verbose stats to mlc-chat REST API * update docs
* [Transform] Apply split_rotary optimization on prefill
  Prior to this commit, the `transform.fuse_split_rotary_embedding` function was only applicable to the `decode` function of a Llama-type model. This was due to the sequence length being restricted to one, both in the pattern-match rule and in the `split_rotary` function, and the function being restricted to operate only on the `decode` function.
  This commit updates the `transform.fuse_split_rotary_embedding` pass to be a `tvm.ir.transform.Pass`, operating on all applicable matches in the `IRModule`. The `split_rotary` function is now produced as a fully-generic function, with static parameters substituted in afterwards. At this stage, the sequence length is retained as a dynamic parameter, such that it can be used by the `prefill` function.
* Avoid multiple kernel launches for split_rotary
…i#1055) Co-authored-by: Junru Shao <junrushao1994@gmail.com>
…i#1033)" (mlc-ai#1058) This reverts commit b9179cf as elaborated here mlc-ai#1033 (comment)
…ma-2 families (mlc-ai#1032) * fix * reflect feedback --------- Co-authored-by: “Sunghyun <sunggg@umich.com>
`--force-reinstall` will reinstall all dependencies to a python package, which is unnecessary. `-U` is a better choice in this case.
This PR introduces the initial batched input support for llama models. To keep the code manageable, we keep both the single-sequence handling flow and the batching handling flow in the Llama modeling. Now, with `--enable-batching` as a build argument, we build Llama for the batched version. NOTE: The paged attention kernel/TIR func are not included in this PR, so currently the built library with batching enabled is not runnable. We will follow up with the attention kernel in the future. This PR guarantees that the existing single-sequence inference (Python API, CLI, etc.) is not broken. P.S. The batching flow is subject to bug fixes as we integrate with the attention function and run the e2e flow in the future.
* [stablelm 3b] Rename dynamic vocab size from "v" to "vocab_size" * Add get_num_key_value_heads method to StableLM3bConfig
This commit removes the `if`/`elif` chain in `core.py`, where the body of each conditional assigns the same `mod, param_manager, params, model_config`, and is identical except for the choice of model being built.
This commit replaces the single-parameter `relax_model.param_manager.create_quantize_func` function with a method on the `ParamManager`, `create_parameter_transformation`. This avoids potential typos between `param_manager` as the imported Python module `mlc_llm.relax_model.param_manager` and an instance of the `ParamManager` class named `param_manager`, and makes the functionality easier to find. This function also takes an optional `optimize_parameter_order` flag, defaulting to `True`, which applies the `ReorderTransformFunc` pass. Since the `ReorderTransformFunc` is intended to be used with several configuration objects owned by `ParamManager`, this simplifies the common path of producing an optimally-ordered parameter transformation module.
PR mlc-ai#1048 updated the signature of softmax in the built model library and changed the temperature buffer shape in ChatModule. This left some existing demos unable to run, since we did not do a round of model library updates. This PR reverts the ChatModule change and adds back the softmax function in the non-batching case. With this PR, the regression should be fixed.
…ai#1074) This PR lifts the device string parsing (just a few lines) into a standalone function, so that the serving side can use it as well. Tested with the Python API; it does not seem to introduce a regression.
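For illustration, a standalone device-string parser could look like the sketch below. The function name, the `name:index` format, and the default index of 0 are assumptions; the real helper may return a TVM device object rather than a plain tuple.

```python
from typing import Tuple


def parse_device_str(device: str) -> Tuple[str, int]:
    """Split a string such as "cuda:1" into (device_name, device_id).

    Hypothetical sketch: a bare name like "metal" is assumed to mean device 0.
    """
    parts = device.split(":")
    if len(parts) == 1:
        name, index = parts[0], 0
    elif len(parts) == 2:
        name, index = parts[0], int(parts[1])  # int() raises ValueError on junk
    else:
        raise ValueError(f"Invalid device string: {device!r}")
    if not name:
        raise ValueError(f"Invalid device string: {device!r}")
    return name, index
```

Keeping this as a free function lets both a chat module and a serving entry point share the same validation.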
The `fuse-split-rotary` pass assumes the compute dtype is fp16, which is usually the case, but in certain configurations, e.g. `q0f32` and `q4f32_1`, the computation uses fp32 instead. This PR strengthens the guard check.
This PR establishes the compiler components in MLC-Chat Python API, which currently includes two primary components: models and parameters.

The models are `nn.Module`-based definitions of an LLM, which, as the very first stab, contain only `LlamaForCasualLM`. It is decomposed into three files:
- `llama_config.py`: common configurations for Llama, where we define relevant configurations of its architecture, as well as include standard config files for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like `nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and PyTorch parameters.

The parameters component contains the basic functionality of parameter mapping, and the loaders that convert parameters from PyTorch to MLC according to the specified mapping. Currently, only `HFTorchLoader` is implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite straightforward according to the existing design.

On top of this PR, on-the-fly quantization could be defined as a loading-time transformation on MLC parameters, while pre-quantized parameter loading is effectively parameter loading after MLC's `nn.Module` is quantized.

Two unit tests exemplify how the infrastructure works:
- `./tests/python/model/test_llama.py` shows how to create an `nn.Module` using the new infra, and then convert it to a TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load parameters from the HuggingFace PyTorch format.

Besides, `mlc_chat.support` is established for utility functions, which now contains two utils:
- `config.py`, which supports reading configurations into dataclasses from a JSON file or a Python dict. On top of Python dataclasses, it throws irrelevant fields into `cls.kwargs`, which is helpful when loading a HuggingFace configuration file;
- `tqdm.py`, which contains tqdm-related utilities, primarily redirecting logging and printing to work nicely with tqdm.
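The `config.py` behaviour described above (known fields go into the dataclass, everything else lands in `kwargs`) can be sketched roughly as follows. `ToyConfig`, its fields, and `from_dict` are hypothetical names for illustration, not the actual `mlc_chat.support` API.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class ToyConfig:
    """Hypothetical config; only two architecture fields are declared."""

    hidden_size: int
    num_attention_heads: int
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    @classmethod
    def from_dict(cls, source: Dict[str, Any]) -> "ToyConfig":
        # Declared fields, excluding the catch-all itself.
        known_names = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
        known = {k: v for k, v in source.items() if k in known_names}
        extra = {k: v for k, v in source.items() if k not in known_names}
        return cls(**known, kwargs=extra)


cfg = ToyConfig.from_dict(
    {"hidden_size": 4096, "num_attention_heads": 32, "torch_dtype": "float16"}
)
```

Here `torch_dtype` is not a declared field, so it is kept in `cfg.kwargs` instead of raising a `TypeError` in the dataclass constructor.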
Update zstd installation
Add docs for RestAPI Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>
This PR enables `llm-vscode` extension API support for Copilot-like code completion, following [HF's LSP](https://github.com/huggingface/llm-ls). Fully compatible with `CodeLlama` and `starcoder` on mlc-llm. - huggingface/llm-vscode#103 enhances the extension user experience when used with the mlc-llm REST API. Thanks @ pacman100, who came up with this in his latest blog post: https://huggingface.co/blog/personal-copilot
PR mlc-ai#1203 introduces some unnecessary and redundant logging messages. This PR gets them removed.
The error message below
```
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h: In member function 'std::string picojson::value::to_str() const':
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:494:37: error: expected ')' before 'PRId64'
  494 | SNPRINTF(buf, sizeof(buf), "%" PRId64, u_.int64_);
      | ~ ^~~~~~~
      | )
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:81:1: note: 'PRId64' is defined in header '<cinttypes>'; did you forget to '#include <cinttypes>'?
   80 | #include <errno.h>
  +++ |+#include <cinttypes>
   81 | #include <inttypes.h>
```
indicates that the `__STDC_FORMAT_MACROS` flag is not turned on for some reason.
Try fix macOS build with picojson
The breakage resulted from newer syntax being used for type annotations, as part of mlc-ai#592. So long as `mlc_chat.interface.openai_api` wasn't imported, the breaking changes were not encountered. In mlc-ai#1107, the addition of `from .interface.openai_api import ChatMessage` caused this module to be imported, breaking compatibility of `mlc_chat.ChatModule` with Python 3.8. This commit updates the type annotations to the supported syntax.
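As a hedged illustration of this kind of incompatibility (the exact annotations changed by the commit are not shown here, and `join_messages` is a made-up function): built-in generics such as `list[str]` require Python 3.9, and the `X | None` union syntax requires Python 3.10, so annotations evaluated at import time fail on Python 3.8. The `typing` spellings work on all of these versions.

```python
from typing import List, Optional

# Newer syntax, evaluated when the function is defined, fails on Python 3.8:
#     def join_messages(messages: list[str] | None = None) -> str: ...
# Python 3.8-compatible spelling of the same signature:


def join_messages(messages: Optional[List[str]] = None) -> str:
    """Join chat messages with newlines; None is treated as no messages."""
    return "\n".join(messages or [])
```

Because the failure happens at import time, it only surfaces once some code path imports the affected module, which matches the sequence of events described above.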
* [SLM] Enable loading from AWQ pre-quantized weight.
* remove awq_loader.py
* Update to the latest commit
* Delete llama_parameter.py
* update unittest
* fix lint
* upd
* add Llama-2-7B-AWQ
This PR supports int3 and float32 group quantization, and fixes some minor issues in the quantization implementation and tests.
Add sliding window to metadata, make small changes to invariants in runtime
…2) (mlc-ai#956) * added support for chatml format conversation * added template to factory
This PR introduces Rust language support for the MLC-LLM project, specifically targeting supporting the `ChatModule` interface. It utilizes the existing C++ implementation of MLC-LLM and leverages both TVM's C API and its Rust bindings. The `rust/examples/mlc_chat.rs` gives an example of how to create a `chat_module` and serve user prompts in Rust. The primary goal of this PR is to enrich the MLC-LLM ecosystem by offering a Rust interface that aligns with the current Python API. This enhancement will empower Rust developers to integrate MLC-LLM into their codebase and applications. **Followup PRs**: - Extend the feature set to achieve parity with the C++/Python interface. - Refine the Rust API, ensuring robustness. - Set up Rust CI if needed.
Remove dependency on openai_api
With this PR, the metadata in a DSO file, obtained using `vm["_metadata"]()`, now has information about the upper-bound RAM estimate for each function. As an example, the JSON string now is:
```json
{
  "quantization": "q4f16_1",
  "model_type": "llama",
  "memory_usage": {
    "_initialize_effect": 0,
    "prefill": 136192,
    "softmax_with_temperature": 0,
    "decode": 218624
  },
  "params": [
    {"name": "model.embed_tokens.q_weight", "shape": [32000, 512], "dtype": "uint32"},
    {"name": "model.embed_tokens.q_scale", "shape": [32000, 128], "dtype": "float16"},
    ...
  ]
}
```
This helps the MLC runtime better determine whether a method is going to OOM and plan ahead accordingly, e.g. plan a pre-allocated KVCache. The idea originates from Ruihang's ancient PR that prints the memory usage estimate as debugging information for demo purposes, and this PR promotes it to an IRModule-level attribute that can be used by the runtime.
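A rough sketch of how a consumer might read these estimates, using only the `memory_usage` values from the example above. The helper names and the budget check are hypothetical, and treating the numbers as bytes is an assumption; the metadata string is inlined here rather than fetched from a built library.

```python
import json

# Trimmed copy of the example metadata; "params" is omitted for brevity.
METADATA_JSON = """
{
  "quantization": "q4f16_1",
  "model_type": "llama",
  "memory_usage": {
    "_initialize_effect": 0,
    "prefill": 136192,
    "softmax_with_temperature": 0,
    "decode": 218624
  }
}
"""


def peak_estimate(metadata_json: str):
    """Return (function_name, estimate) for the largest per-function estimate."""
    usage = json.loads(metadata_json)["memory_usage"]
    name = max(usage, key=usage.get)
    return name, usage[name]


def fits_budget(metadata_json: str, budget: int) -> bool:
    """Hypothetical check: does every function's upper bound fit the budget?"""
    return peak_estimate(metadata_json)[1] <= budget
```

For the example values, the largest estimate belongs to `decode`.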
…lc-ai#1225) Now it shows a more reasonable upper bound for sequence length = 4096.
```json
{
  "_initialize_effect": 0,
  "prefill": 3479311360,
  "softmax_with_temperature": 0,
  "decode": 34531840
}
```
Thanks Ruihang for helping with the fix!
* [Bugfix] Correct input shape for shard info function
  Prior to this commit, the sharding functions converted the sharded axis from `orig_size * num_shards` to `orig_size // num_shards`. This commit updates the sharding functions to instead convert from `orig_size` to `orig_size // num_shards`.
* [Bugfix] Include LegalizeOps in utils.convert_weights
  Prior to this commit, `utils.convert_weights` assumed that the parameter transformation module was already legalized and used no relax operations that require legalization. This commit adds a call to `relax.transform.LegalizeOps` to remove this assumption.
* [MultiGPU] Cleanup create_shard_info_func
  - De-duplicate the `if param.shard_strategy == foo` if/else chain
  - Return a `tvm.IRModule` instead of modifying an existing module
* Extract a ParamManager.optimize_transform_param_order method
* Extract ParamManager.create_parameter_transformation call from convert_weights
* Support writing of pre-sharded weights
* Support execution using pre-sharded weights
* Updating for review comments
* fix typo
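The shape arithmetic in the first bugfix can be sketched as below. `sharded_shape` is a hypothetical helper operating on plain tuples, not the TVM shard-info function; requiring the axis to divide evenly is an assumption of this sketch.

```python
from typing import Sequence, Tuple


def sharded_shape(orig_shape: Sequence[int], axis: int, num_shards: int) -> Tuple[int, ...]:
    """Per-shard shape: the input axis has orig_size, each shard gets orig_size // num_shards."""
    orig_size = orig_shape[axis]
    if orig_size % num_shards != 0:
        raise ValueError(
            f"axis {axis} of size {orig_size} is not divisible by {num_shards} shards"
        )
    shape = list(orig_shape)
    shape[axis] = orig_size // num_shards
    return tuple(shape)
```

For instance, a `(4096, 11008)` weight sharded four ways along axis 1 yields a per-shard shape of `(4096, 2752)`; the input shape is the unsharded one, not `orig_size * num_shards`.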
Doing another merge now that mlc-ai#1096 has been merged.