Cherry pick build-time sharding PR from upstream #53

masahi · 2023-11-07T00:32:52Z

This merges mlc-ai#1096 into our branch.

commit 44f4cbdfed7941e7ed060d74b23d870d026a57c9
Author: Eric Lunderberg <elunderberg@octoml.ai>
Date: Fri Oct 20 13:41:40 2023 +0000

Support execution using pre-sharded weights

commit 35644870e2daf829c29ebb5431d357ceaa4e2793
Author: Eric Lunderberg <elunderberg@octoml.ai>
Date: Fri Oct 13 20:35:39 2023 +0000

Support writing of pre-sharded weights

commit 97572226d331ebd9ef49e4a2c1dad42344d09bac
Author: Eric Lunderberg <elunderberg@octoml.ai>
Date: Fri Oct 13 20:55:34 2023 +0000

Extract ParamManager.create_parameter_transformation call from convert_weights

commit e1d3217f7b0c87c49ba3721567cb24df436618a3
Author: Eric Lunderberg <elunderberg@octoml.ai>
Date: Fri Oct 13 22:39:00 2023 +0000

Extract a ParamManager.optimize_transform_param_order method

commit b2a9e1c7e83c0886e3a0ebed02f4c8416dfbfb5f
Author: Eric Lunderberg <elunderberg@octoml.ai>
Date: Fri Oct 13 18:30:59 2023 +0000

[MultiGPU] Cleanup create_shard_info_func

- De-duplicate the `if param.shard_strategy == foo` if/else chain
- Return a `tvm.IRModule` instead of modifying an existing module

commit f67d47a57fbca21df48867a9dfc10e430a3a3b04
Author: Eric Lunderberg <elunderberg@octoml.ai>
Date: Mon Oct 16 16:59:43 2023 +0000

[Bugfix] Include LegalizeOps in utils.convert_weights

Prior to this commit, `utils.convert_weights` assumed that the parameter transformation module was already legalized and used no relax operations that require legalization. This commit adds a call to `relax.transform.LegalizeOps` to remove this assumption.

commit a98f9cf45a4cc7a2412f68a4f4306d97e7239a13
Author: Eric Lunderberg <elunderberg@octoml.ai>
Date: Mon Oct 16 16:49:44 2023 +0000

[Bugfix] Correct input shape for shard info function

Prior to this commit, the sharding functions converted the sharded axis from `orig_size * num_shards` to `orig_size // num_shards`. This commit updates the sharding functions to instead convert from `orig_size` to `orig_size // num_shards`.
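
The shape convention fixed by the last commit above can be illustrated with a minimal sketch. `shard_shape` is a hypothetical helper written for this note, not a function from mlc-llm or TVM; it only shows the intended arithmetic, where the input carries the full `orig_size` on the sharded axis and each shard receives `orig_size // num_shards`.

```python
def shard_shape(shape, axis, num_shards):
    """Return the per-shard shape when `axis` is split evenly across shards.

    Hypothetical helper for illustration only; the real shard info
    functions are generated as TVM IR, not computed like this.
    """
    orig_size = shape[axis]
    if orig_size % num_shards != 0:
        raise ValueError(
            f"axis {axis} of size {orig_size} is not divisible by {num_shards}"
        )
    per_shard = list(shape)
    # The input already has size `orig_size` (not `orig_size * num_shards`).
    per_shard[axis] = orig_size // num_shards
    return tuple(per_shard)


full_shape = (4096, 11008)
per_shard_shape = shard_shape(full_shape, axis=1, num_shards=4)
```

With these example numbers, each of the four shards holds a `(4096, 2752)` slice of the original tensor.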

commit 4042626
Author: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Date: Mon Nov 6 15:43:21 2023 -0800

[Slim-LM] Enable loading from AWQ pre-quantized weight. (mlc-ai#1114)

* [SLM] Enable loading from AWQ pre-quantized weight.
* remove awq_loader.py
* Update to the latest commit
* Delete llama_parameter.py
* update unittest
* fix lint
* upd
* add Llama-2-7B-AWQ

commit 9869ca6
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Mon Nov 6 16:03:12 2023 -0600

Fix Python3.8 compatibility breakage (mlc-ai#1210)

The breakage was resulting from newer syntax being used for type annotations, as part of mlc-ai#592. So long as `mlc_chat.interface.openai_api` wasn't imported, the breaking changes were not encountered. In mlc-ai#1107, the addition of `from .interface.openai_api import ChatMessage` caused this module to be imported, breaking compatibility of `mlc_chat.ChatModule` with Python3.8.

This commit updates the type annotations to the supported syntax.

commit e00220c
Author: Junru Shao <junrushao@apache.org>
Date: Mon Nov 6 13:04:36 2023 -0800

Detect `mtriple` via LLVM (mlc-ai#1211)

commit e2c99a8
Author: Junru Shao <junrushao@apache.org>
Date: Mon Nov 6 12:01:51 2023 -0800

[Fix] Keep up-to-date with upstream API change (mlc-ai#1209)

commit a7f1183
Author: Git bot <bot@noreply.github.com>
Date: Mon Nov 6 18:53:07 2023 +0000

Auto updated submodule references

commit 51d6f9c
Author: Junru Shao <junrushao@apache.org>
Date: Mon Nov 6 09:28:57 2023 -0800

Try fix macOS build with picojson again (mlc-ai#1207)

Try fix macOS build with picojson

commit 01d4339
Author: Junru Shao <junrushao@apache.org>
Date: Mon Nov 6 09:08:58 2023 -0800

Try fix macOS build with picojson (mlc-ai#1206)

The error message below

```
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h: In member function 'std::string picojson::value::to_str() const':
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:494:37: error: expected ')' before 'PRId64'
  494 |       SNPRINTF(buf, sizeof(buf), "%" PRId64, u_.int64_);
      |               ~                      ^~~~~~~
      |                                      )
/usr/share/miniconda/envs/mlc-llm-build/conda-bld/mlc-chat-cli-nightly-package_1699286394016/work/3rdparty/tvm/3rdparty/picojson/picojson.h:81:1: note: 'PRId64' is defined in header '<cinttypes>'; did you forget to '#include <cinttypes>'?
   80 | #include <errno.h>
  +++ |+#include <cinttypes>
   81 | #include <inttypes.h>
```

indicates that the `__STDC_FORMAT_MACROS` flag is not turned on for some reason.

commit 65478c8
Author: Junru Shao <junrushao@apache.org>
Date: Sun Nov 5 19:52:53 2023 -0800

[Fix] Remove Redundant Warnings (mlc-ai#1204)

PR mlc-ai#1203 introduces some unnecessary and redundant logging messages. This PR gets them removed.

commit 7ccb51a
Author: Junru Shao <junrushao@apache.org>
Date: Sun Nov 5 18:33:49 2023 -0800

Integrating MLC runtime with the new compilation workflow (mlc-ai#1203)

commit 3413d17
Author: Junru Shao <junrushao@apache.org>
Date: Sun Nov 5 12:03:33 2023 -0800

[Fix] Use `fabs` as floating point abs function in C++ (mlc-ai#1202)

commit 145a984
Author: David Pissarra <61968959+davidpissarra@users.noreply.github.com>
Date: Sun Nov 5 06:18:47 2023 +0000

[API] `llm-vscode` extension support (mlc-ai#1198)

This PR enables `llm-vscode` extension API support for copilot-like code completion, following [HF's LSP](https://github.com/huggingface/llm-ls). Fully compatible with `CodeLlama` and `starcoder` on mlc-llm.

- huggingface/llm-vscode#103 enhances extension user experience when used with mlc-llm rest api.

Thanks @pacman100, who came up with this on his latest blogpost: https://huggingface.co/blog/personal-copilot

commit 0e08845
Author: Animesh Bohara <ani.bohara@gmail.com>
Date: Sun Nov 5 01:01:26 2023 -0400

[RestAPI] Added docs (mlc-ai#1193)

Add docs for RestAPI

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>

commit 3417505
Author: Junru Shao <junrushao@apache.org>
Date: Sat Nov 4 19:44:25 2023 -0700

Support overriding `--max-sequence-length` in command line (mlc-ai#1197)

commit 5d63f7e
Author: Junru Shao <junrushao@apache.org>
Date: Sat Nov 4 15:42:19 2023 -0700

[Docs] Clarify zstd installation on Windows (mlc-ai#1196)

Update zstd installation

commit 78424f0
Author: Junru Shao <junrushao@apache.org>
Date: Sat Nov 4 02:13:21 2023 -0700

[Docs] Clarify zstd installation on Windows (mlc-ai#1191)

commit 4832c2f
Author: Junru Shao <junrushao@apache.org>
Date: Sat Nov 4 01:58:55 2023 -0700

Add CodeLlama as part of model presets (mlc-ai#1190)

commit 5d1dc34
Author: Junru Shao <junrushao@apache.org>
Date: Sat Nov 4 01:46:19 2023 -0700

Merge llama_config.py into llama_model.py (mlc-ai#1189)

commit 9d20575
Author: Junru Shao <junrushao@apache.org>
Date: Sat Nov 4 01:30:15 2023 -0700

Merge `llama_config.CONFIG` into `MODEL_PRESETS` (mlc-ai#1188)

commit 4716704
Author: Junru Shao <junrushao@apache.org>
Date: Sat Nov 4 01:20:43 2023 -0700

Add Python API for Weight Conversion (mlc-ai#1182)

This PR primarily does a major refactoring to introduce Python API that is consistent with the CLI API.

Besides, it includes the following fixes and enhancements:

- More info provided to `isort` for better formatting in `pyproject.toml`;
- Print out the default value of all arguments in argparse command line;
- Ensure `--device` is always available locally when doing weight conversion;
- Add argument echoing in weight conversion to be consistent with its counterpart in compilation;
- Add a consistency checker to make sure the shapes/dtypes of all tensors from weight conversion are consistent with compilation;
- Echo the total size of parameters;
- Better logging of each parameter's shape and dtype, and whether or not it is quantized;
- More structure robustification, renaming `parameter/` to `loader/` to be more explicit about its intention;
- Inline and remove `ParamQuantizer` into the loader to improve logging and the logic flow;
- Always add instructions "Use `--xxx` to override" for any options that are auto detected to be more informative to end users;
- Fix wrong shape calculation when quantizing `nn.Embedding`;
- Fix wrong dtype calculation in group quantization when the input dtype is different from model dtype (e.g. "float32" in torch, but the model dtype in quantization is fp16 in `q4f16_1`);
- Fix inconsistent param names in layers such as `GroupQuantizeLinear`;
- Fix dtype inconsistency when a parameter is not quantized;
- Fix existing unittests.
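
One item in the list above, printing the default value of every argument in the argparse help, can be sketched with the standard library's `argparse.ArgumentDefaultsHelpFormatter`. This is only one way to get that behaviour and the PR may do it differently; the program name `convert_weight_demo` and the default values shown are assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a small parser whose --help output lists each default value."""
    parser = argparse.ArgumentParser(
        prog="convert_weight_demo",  # hypothetical program name
        # Standard-library formatter that appends "(default: ...)" to help text.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # Flag names appear in the log above; the defaults are illustrative.
    parser.add_argument("--quantization", default="q4f16_1", help="quantization mode")
    parser.add_argument("--device", default="auto", help="device used for conversion")
    return parser


help_text = build_parser().format_help()
args = build_parser().parse_args(["--device", "cuda"])
```

Note that the formatter only annotates arguments that have a `help` string, so each option needs one for its default to show up.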

commit 6ae02dd
Author: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
Date: Fri Nov 3 15:34:29 2023 -0400

[Model Support][SWA] Add support for sliding window attention for Mistral (mlc-ai#1087)

* mistral base
* Add sliding window mask making and its tests
* Small changes for sliding window mask
* Clean up mask making
* Remove kv_seq_len
* Add prefill chunking, handle max window size in SWA
* Add interleave kv
* Temporary fix for kv seq len
* Pass in more shapes to SWA prefill and decode in runtime
* mistral var fix
* Small changes regarding shape passing
* Small fix on chunk size
* Add build args, fix mlc chat config dump
* mistral system prompt

---------

Co-authored-by: David Pissarra <david.pissarra@tecnico.ulisboa.pt>
Co-authored-by: David Pissarra <61968959+davidpissarra@users.noreply.github.com>

commit 2dc8183
Author: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Date: Fri Nov 3 00:36:52 2023 -0700

[Fix][SLM] Update q4f16 quantization with the new mutator name rule (mlc-ai#1178)

[Fix] Update q4f16 quantization with the new mutator name rule

commit 53060af
Author: Xiyou Zhou <xiyou.zhou@gmail.com>
Date: Thu Nov 2 13:08:11 2023 -0700

[SLM][AutoLLM] Enable Command Line Weight Conversion (mlc-ai#1170)

This PR enables weight conversion in command line. Sample command: `python3 -m mlc_chat.cli.convert_weight --config dist/models/llama-2-13b-chat-hf/ --quantization "q4f16_1" --output dist/test/`

commit 2ca7d15
Author: Junru Shao <junrushao@apache.org>
Date: Thu Nov 2 11:30:28 2023 -0700

[Fix] TIR block name of dequantization (mlc-ai#1177)

commit 1757777
Author: Yaxing Cai <caiyaxing666@gmail.com>
Date: Wed Nov 1 15:52:19 2023 -0700

[SLM] Fix group quantization (mlc-ai#1172)

This PR fixes the group quantization and adds related unit tests.

commit 9831135
Author: Animesh Bohara <ani.bohara@gmail.com>
Date: Wed Nov 1 15:16:09 2023 -0400

Fix Android app Permission denied error on Android 10 (mlc-ai#1175)

Use scoped storage instead of Downloads directory

Co-authored-by: Animesh Bohara <abohara@cs.cmu.edu>

commit 200653a
Author: Git bot <bot@noreply.github.com>
Date: Wed Nov 1 14:53:54 2023 +0000

Auto updated submodule references

commit f5b2e88
Author: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Date: Wed Nov 1 12:23:40 2023 +0800

Fix RWKV Support (mlc-ai#1136)

I successfully ran the rwkv-world-3b fp16 model on my Xiaomi phone. This PR fixes a bug on the main branch where the rwkv model outputs only one word and then stops.

![image](https://github.com/mlc-ai/mlc-llm/assets/35585791/6514d6ef-c93c-4ad2-8e76-8ffa0663080f)

commit e0cd3f6
Author: Junru Shao <junrushao@apache.org>
Date: Tue Oct 31 12:56:28 2023 -0700

[Bugfix] Cannot find global function `mlc.llm_chat_create` (mlc-ai#1167)

commit 02d1e57
Author: Junru Shao <junrushao@apache.org>
Date: Tue Oct 31 12:43:17 2023 -0700

Support CUDA Multi-Arch Compilation (mlc-ai#1166)

commit 8438b27
Author: Junru Shao <junrushao@apache.org>
Date: Tue Oct 31 12:09:16 2023 -0700

Misc Cleanups of Compilation Pipeline (mlc-ai#1165)

commit b5bfa5b
Author: Yaxing Cai <caiyaxing666@gmail.com>
Date: Tue Oct 31 11:39:44 2023 -0700

Enable group quant transform with nn.Module (mlc-ai#1154)

* Enable group quant transform with nn.Module

This PR completes the group quantization support for `nn.Module` based model.

* remove deprecated tests
* Update
* wip
* remove deprecated test
* fix lint
* fix lint
* fix lint

---------

Co-authored-by: Junru Shao <junrushao@apache.org>

commit 9076d01
Author: Yuchen Jin <yuchenj@cs.washington.edu>
Date: Mon Oct 30 22:40:53 2023 -0700

[Rest] Document emoji handling (mlc-ai#1160)

Followup PR of mlc-ai#1142 to document the emoji handling.

commit 425a2cb
Author: Junru Shao <junrushao@apache.org>
Date: Mon Oct 30 15:00:24 2023 -0700

[Fix][REST] Use lowered-cased "app" (mlc-ai#1159)

commit 0a9d6c7
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Mon Oct 30 14:44:44 2023 -0500

[Utils] Remove conversion to numpy array in utils.save_params (mlc-ai#1083)

Prior to this commit, each parameter was converted to a numpy-owned array as part of a total size computation. This commit computes the size directly, removing the conversion.

commit 3cf5605
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Mon Oct 30 14:43:51 2023 -0500

[Utility] Check for isinstance(exc, Exception) before entering pdb (mlc-ai#1095)

This is a follow-up to mlc-ai#1017, which added a `--pdb` flag to enter a debugger on exit. This commit checks the type of the raised exception, and only enters the debugger if it is a subclass of `Exception`. This ensures that implementation-details, such as a thrown `SystemExit` or `KeyboardInterrupt`, do not cause an erroneous entry to pdb.

commit 8ca0176
Author: Yuchen Jin <yuchenj@cs.washington.edu>
Date: Mon Oct 30 12:26:58 2023 -0700

[Rest] Fix emoji handling in Rest API. (mlc-ai#1142)

commit b190578
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Mon Oct 30 13:58:18 2023 -0500

Apply rewrite for normal attention and MQA (mlc-ai#1138)

Fixes a bug introduced in mlc-ai#1052, where use of the `--use-flash-attn-mqa` flag on a model that doesn't use MQA would prevent the use of CUTLASS attention at all.

commit ece97b1
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Mon Oct 30 13:58:08 2023 -0500

[Transform][Redo] Apply split_rotary optimization on prefill (mlc-ai#1125)

Prior to this commit, the `transform.fuse_split_rotary_embedding` function was only applicable to the `decode` function of a Llama-type model.
This was due to the sequence length being restricted to one, both in the pattern-match rule and in the `split_rotary` function, and the function being restricted to operate only on the `decode` function.

This commit updates the `transform.fuse_split_rotary_embedding` pass to be a `tvm.ir.transform.Pass`, operating on all applicable matches in the `IRModule`. The `split_rotary` function is now produced as a fully-generic function, with static parameters substituted in afterwards. At this stage, the sequence length is retained as a dynamic parameter, such that it can be used by the `prefill` function.

This commit reapplies the reverted commit mlc-ai#1033. The error in the previous implementation was in the definition of `rotary_embedding_offset`, which provided the `query_sequence_length` instead of `kv_sequence_length`. This was able to pass the validity tests described [here](mlc-ai#1058 (comment)), as these two sequence lengths are identical for the first call.

commit fee2cb5
Author: masahi <masahi129@gmail.com>
Date: Tue Oct 31 01:32:06 2023 +0900

Add batched Llama model definition using vLLM paged attention (mlc-ai#1134)

* Add batched Llama model with vllm paged attention
* update core.py
* doc
* minor
* add e2e test
* mv file
* clean
* Check if TVM has been built with USE_VLLM
* update BuildArgs docstring

commit ba67835
Author: Junru Shao <junrushao@apache.org>
Date: Sun Oct 29 23:54:12 2023 -0700

Update attention layer (mlc-ai#1153)

Existing dlight optimization only works for NT matmul, but not NN. As a result, the new `nn.Module`-based implementation, which uses NN matmul, fails compilation at HEAD for now. This PR fixes this issue by tweaking `k` to the preferred layout.

The following commands now work with the new compilation pipeline:

```bash
python -m mlc_chat.cli.compile --config llama2_7b --quantization q4f16_1 -o /tmp/1.so
python -m mlc_chat.cli.compile --config llama2_13b --quantization q4f16_1 -o /tmp/1.so
python -m mlc_chat.cli.compile --config llama2_70b --quantization q4f16_1 -o /tmp/1.so
```

Note that the quantization algorithm per se, `q4f16_1`, has not been implemented yet, meaning this code path is not yet ready for use.

commit 1a79a53
Author: Junru Shao <junrushao@apache.org>
Date: Sun Oct 29 21:51:36 2023 -0700

Compile Model Preset without External `config.json` (mlc-ai#1151)

This PR adds support for compiling a preset of models without having to provide a `config.json` on disk using the commands below:

```diff
 python -m mlc_chat.cli.compile \
     --quantization q4f16_1 -o /tmp/1.so \
-    --config /models/Llama-2-7b-chat-hf
+    --config llama2_7b
```

This allows easier testing and binary distribution without having to depend on external model directory.

commit 0a25374
Author: Junru Shao <junrushao@apache.org>
Date: Sun Oct 29 21:17:38 2023 -0700

Migrate Compiler Passes (mlc-ai#1150)

commit 2193767
Author: Junru Shao <junrushao@apache.org>
Date: Sun Oct 29 16:35:07 2023 -0700

Enable Mypy and Pylint in mlc_chat Python Package (mlc-ai#1149)

commit c0c3a8d
Author: Xiyou Zhou <xiyou.zhou@gmail.com>
Date: Sun Oct 29 13:16:46 2023 -0700

[Slim-LM] Enable Group Quant (mlc-ai#1129)

* Enable group quant via new interface.
* Minor fix.
* Linting.
* Fix isort.
* Fix mypy.
* TE compute working.
* Skip embed.
* Support cpu+gpu quantization.
* Add target option to tests.
* Linting.

commit 878ae84
Author: Junru Shao <junrushao@apache.org>
Date: Sun Oct 29 00:19:20 2023 -0700

Support parameter packing (mlc-ai#1146)

commit 2b6d832
Author: fennecJ <hwahwa649@gmail.com>
Date: Sun Oct 29 14:59:10 2023 +0800

Make the help info consistent with program name (mlc-ai#1137)

When a user runs the command `mlc_chat_cli --help`, the output will be something like

Usage: mlc_chat [--help] ...

That's because the program name specified in `cli_main.cc` is "mlc_chat". It will be less confusing if the output of help info shows

Usage: mlc_chat_cli [--help] ...

commit 27ac5ac
Author: DavidSharma <68979667+David-Sharma@users.noreply.github.com>
Date: Sat Oct 28 20:18:16 2023 -0400

Updating tvm install docs (mlc-ai#1143)

Updating the tvm install docs to assist a user in finding and copying zstd.dll to the correct folder.

commit 2ec0cc8
Author: Yuchen Jin <yuchenj@cs.washington.edu>
Date: Sat Oct 28 15:13:48 2023 -0700

Minor enhancements to `ChatModule` (mlc-ai#1132)

Some minor enhancements to `ChatModule`: mainly, device parsing is now handled solely in `_parse_device_str` instead of in both the member function and the `__init__` function, to avoid redundancy; plus some type annotation fixes.

commit 2c492e5
Author: S A G A R <110724849+tmsagarofficial@users.noreply.github.com>
Date: Sun Oct 29 03:43:15 2023 +0530

Grammatical and Typographical improvements (mlc-ai#1139)

* Update faq.rst
* Update guideline.rst
* Update compile_models.rst
* Update distribute_compiled_models.rst
* Update get-vicuna-weight.rst
* Update python.rst
* Update android.rst
* Update cli.rst
* Update ios.rst
* Update javascript.rst
* Update python.rst
* Update rest.rst

commit 24f795e
Author: Goutham Tamilselvan <goutham2688@gmail.com>
Date: Fri Oct 27 03:25:59 2023 -0400

added details to windows installation (mlc-ai#1133)

The 32-bit version of the zstd.dll library was causing issues, so the doc was updated to be more specific and to download the 64-bit version.

commit 973f9fc
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Wed Oct 25 10:14:46 2023 -0500

[ParamManager][Redo] Use BundleModelParams for transform_dequantize (mlc-ai#1127)

Prior to this commit, `ParamManager.transform_quantize` function took as input functions with separate parameters for each weight tensor, and produced output functions with a tuple parameter for all weights. Because `LiftTransformParams` had the same convention, neither could be applied as part of the same build flow.

This commit updates `ParamManager.transform_quantize` pass to produce outputs with separate tensor parameters, using the `BundleModelParams` transform to later combine them into a single tuple parameter. The analogous change was also performed for `LiftTransformParams` as part of apache/tvm#15657.

In addition, prior to this commit, the `ParamManager.transform_dequantize` function operated directly on an `IRModule` object. As a result, any debug instrumentation (e.g. before/after printouts for each pass, before/after verification with `relax.analysis.well_formed`, etc.) did not apply to this `transform_dequantize`. This commit updates `ParamManager.transform_dequantize` to return an `ir.transform.Pass`.

This commit is a repeat of the reverted PR mlc-ai#1056. This PR resolves the bug in the earlier implementation by removing the call to `.without_attr("num_input")` in `ParamReplacer.rewrite_func`. This follows an analogous update in `LiftTransformParams`, preserving the `"num_input"` attribute for use in `BundleModelParams`.

commit a4279e3
Author: Junru Shao <junrushao@apache.org>
Date: Tue Oct 24 21:05:24 2023 -0700

Add --opt flag parsing to CLI (mlc-ai#1123)

commit 9166edb
Author: Kartik Khandelwal <kartikkhandelwal1998@gmail.com>
Date: Tue Oct 24 15:07:23 2023 -0400

[REST] OpenAI compatible Rest API (mlc-ai#1107)

* add presence and frequency penalty
* Added support for passing conversation history in /v1/chat/completions endpoint
* Added support for RestAPI parameters max_gen_len, n, and stop_str
* add presence and frequency penalty to generation config
* refactor generation config
* Added documentation for parameters
* replace lib_path with model_lib_path in rest.py
* fixed black isort issues
* fix lib_path

commit 9cb8e8e
Author: Junru Shao <junrushao@apache.org>
Date: Tue Oct 24 09:04:45 2023 -0700

Remove inaccurate warning message (mlc-ai#1121)

This PR removes an inaccurate warning from mlc-ai#1086, which warns about `model_lib` overriding regardless of whether or not it's actually overridden. With this commit, we only warn if its value is not None.

commit 2aa6809
Author: Junru Shao <junrushao@apache.org>
Date: Tue Oct 24 09:03:38 2023 -0700

Revert "[ParamManager] Use BundleModelParams for transform_dequantize" (mlc-ai#1120)

Revert "[ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)"

This reverts commit e5927ce.

This causes a regression impacting all MLC LLM nightlies as it violates the existing calling convention in MLC Chat runtime.
An example: mlc-ai#1060 (comment)

commit 206103b
Author: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
Date: Tue Oct 24 11:54:01 2023 -0400

[Docs] Add doc for max and mean gen len, shift factor; and buildArgs (mlc-ai#1119)

* Add doc for max and mean gen len, shift factor
* Update python docs for BuildArgs

commit 488017d
Author: SingLi <Sing-Li@users.noreply.github.com>
Date: Tue Oct 24 08:19:31 2023 -0500

fix mismatched argument name (mlc-ai#1117)

fix error introduced by recent code changes

fixes mlc-ai#1116

commit 8ce7793
Author: Git bot <bot@noreply.github.com>
Date: Tue Oct 24 07:30:53 2023 +0000

Auto updated submodule references

commit 61179a0
Author: Junru Shao <junrushao@apache.org>
Date: Mon Oct 23 23:58:01 2023 -0700

Add CLI commands for compilation (mlc-ai#1109)

commit 5a7dcd8
Author: Tianqi Chen <tqchen@users.noreply.github.com>
Date: Tue Oct 24 00:00:41 2023 -0400

[WINDOWS] reduce noise in windows build (mlc-ai#1115)

commit 7ae8c6d
Author: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Date: Mon Oct 23 15:33:00 2023 -0700

[Slim-LM] Introduce HFLoad for loading Pytorch and SafeTensor weights (mlc-ai#1113)

commit e5927ce
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Mon Oct 23 13:31:44 2023 -0500

[ParamManager] Use BundleModelParams for transform_dequantize (mlc-ai#1056)

* [ParamManager] Use BundleModelParams for transform_quantize

Prior to this commit, `ParamManager.transform_quantize` function took as input functions with separate parameters for each weight tensor, and produced output functions with a tuple parameter for all weights. Because `LiftTransformParams` had the same convention, neither could be applied as part of the same build flow.

This commit updates `ParamManager.transform_quantize` pass to produce outputs with separate tensor parameters, using the `BundleModelParams` transform to later combine them into a single tuple parameter.
The analogous change was also performed for `LiftTransformParams` as part of apache/tvm#15657.

In addition, prior to this commit, the `ParamManager.transform_dequantize` function operated directly on an `IRModule` object. As a result, any debug instrumentation (e.g. before/after printouts for each pass, before/after verification with `relax.analysis.well_formed`, etc.) did not apply to this `transform_dequantize`. This commit updates `ParamManager.transform_dequantize` to return an `ir.transform.Pass`.

* Correct type annotation

commit f57c9c9
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Mon Oct 23 13:31:24 2023 -0500

[Transform] Provide IRModule transform for rewrite_attention (mlc-ai#1052)

Prior to this commit, `mlc_llm.transform.rewrite_attention` updated a single function. This commit modifies it to instead be a transform operating on any pattern matches within an `IRModule`.

commit 16dd2ae
Author: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Date: Sun Oct 22 19:51:10 2023 -0700

[Slim-LM] Smart path finding for config and weight (mlc-ai#1088)

commit 6159cc4
Author: Junru Shao <junrushao@apache.org>
Date: Sun Oct 22 02:22:55 2023 -0700

[CI] Add clang-format (mlc-ai#1103)

commit 46d11e6
Author: Junru Shao <junrushao@apache.org>
Date: Fri Oct 20 23:39:28 2023 -0700

Add Basic Pylint and Mypy Tooling (mlc-ai#1100)

Add pylint/mypy tooling into pyproject.toml

This PR establishes the initial Python tooling infra with Pylint and Mypy. Currently only the newest modules, i.e. `mlc_chat.support` and `mlc_chat.compiler`, are covered, and we expect to cover the entire package, as tracked in mlc-ai#1101.

commit 03c641a
Author: Junru Shao <junrushao@apache.org>
Date: Fri Oct 20 21:51:51 2023 -0700

Enable Python Linter (mlc-ai#1098)

This PR enables two Python formatters "black" and "isort" on the following directory:

- `./python/`
- `./tests/python/`

Enabling pylint and mypy is left for future work

commit e9b85ce
Author: Junru Shao <junrushao@apache.org>
Date: Fri Oct 20 21:36:28 2023 -0700

More formatting (mlc-ai#1099)

commit cf39bf6
Author: Junru Shao <junrushao@apache.org>
Date: Fri Oct 20 15:38:49 2023 -0700

[Format] Apply isort and black for `python/` (mlc-ai#1097)

[Format] Apply isort and black on `python/`

The commands I am using are:

```
isort --profile black python/
black python/
```

It is always recommended to format the code before submission, given we don't have a linter CI yet.

commit 62d0c03
Author: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Date: Fri Oct 20 15:33:51 2023 -0700

Disable Disco for q4f16_ft and q8f16_ft quantization (mlc-ai#1094)

commit 9bf5723
Author: Junru Shao <junrushao@apache.org>
Date: Thu Oct 19 15:49:40 2023 -0700

Update `benchmark.py` according to mlc-ai#1086 (mlc-ai#1091)

Update `benchmark.py`

commit 830656f
Author: Varshith Bathini <varshith15@gmail.com>
Date: Fri Oct 20 00:40:14 2023 +0530

StreamIterator (mlc-ai#1057)

Co-authored-by: Varshith <varshith.bathini@sprinklr.com>

commit b0373d1
Author: Rick Zhou <riczhou@linkedin.com>
Date: Thu Oct 19 12:09:12 2023 -0700

Support lib_path override in C++. Improvements on docs and error messages (mlc-ai#1086)

* Support lib_path option in C++ CLI. Disable ChatConfig.model_lib override in Python API.
Improvements on helper messages and error messages
* Update docs
* Rename lib_path -> model_lib_path

commit 56a8004
Author: Junru Shao <junrushao@apache.org>
Date: Thu Oct 19 10:37:24 2023 -0700

Update README.md for Multi-GPU (mlc-ai#1090)

commit 2625945
Author: Junru Shao <junrushao@apache.org>
Date: Thu Oct 19 08:57:50 2023 -0700

Establish `mlc_chat.compiler` (mlc-ai#1082)

This PR establishes the compiler components in MLC-Chat Python API, which currently includes two primary components: models and parameters.

The models are `nn.Module`-based definitions of an LLM, which, as the very first stab, contain only `LlamaForCasualLM`. It is decomposed into three files:

- `llama_config.py`: common configurations for Llama, where we define relevant configurations of its architecture, as well as include standard config file for Llama2-7B/13B/70B for convenient testing;
- `llama.py`: the model architecture of Llama, based on the PyTorch-like `nn.Module` API;
- `llama_parameter.py`: defines the mapping between MLC parameters and pytorch parameters.

The parameters component contains the basic functionality of parameter mapping, and the loaders that effectively convert parameters from PyTorch to MLC according to the mapping specified. Currently, only `HFTorchLoader` is implemented, but loaders like SafeTensor, GPTQ or AWQ should be quite straightforward according to the existing design.

On top of this PR, on-the-fly quantization could be defined as a loading time transformation on MLC parameters, while pre-quantized parameter loading is effectively parameter loading after MLC's `nn.Module` is quantized.

Two unittests exemplify how the infrastructure works:

- `./tests/python/model/test_llama.py` shows how to create an `nn.Module` using the new infra, and then convert it to TVM IRModule;
- `./tests/python/parameter/hf_torch_loader.py` shows how to load parameters from HuggingFace PyTorch format.

Besides, `mlc_chat.support` is established for utility functions, which now contains two utils:

- `config.py` which supports reading configurations into dataclasses from JSON file or Python dict. On top of Python dataclass, it throws irrelevant fields into `cls.kwargs`, which is helpful when loading HuggingFace configuration file;
- `tqdm.py` which contains tqdm-related utilities, primarily redirecting logging and printing to work nicely with tqdm.

commit 3aefd9f
Author: Junru Shao <junrushao@apache.org>
Date: Mon Oct 16 21:16:27 2023 -0700

[Bugfix] Compilation Error in q4f32_1 (mlc-ai#1078)

The pass `fuse-split-rotary` assumes the compute dtype is fp16, which it usually is, but in certain cases, e.g. `q0f32` and `q4f32_1`, the compute is based on fp32 instead. This PR strengthens the check guard.

commit 9872c48
Author: Ruihang Lai <ruihangl@cs.cmu.edu>
Date: Mon Oct 16 14:56:24 2023 -0400

[Python] Extract common device str parse function in ChatModule (mlc-ai#1074)

This PR lifts the device string parsing (just a few lines) to a standalone function, so that the serving side can make use of this function as well.

Tested Python API and it does not seem to incur regression.

commit d202077
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Mon Oct 16 08:06:26 2023 -0500

[ParamManager] Added progress bar for get_item/set_item (mlc-ai#1063)

commit 204860b
Author: Ruihang Lai <ruihangl@cs.cmu.edu>
Date: Sun Oct 15 14:02:12 2023 -0400

[Fix] ChatModule incorrect temperature buffer shape (mlc-ai#1070)

PR mlc-ai#1048 updated the signature of softmax in the built model library and changed the temperature buffer shape in ChatModule. This left some existing demos unable to run, since we did not do a round of model library updates.

This PR reverts the ChatModule change, and adds back the softmax function in non-batching case. With this PR, the regression should be fixed.
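
The `config.py` behaviour described under mlc-ai#1082 above, where fields that the dataclass does not declare are collected into `kwargs`, can be sketched as follows. This is a minimal illustration under assumptions: `DemoConfig` and `from_dict` are hypothetical names made up for this note, and the actual `mlc_chat.support` helper may be structured differently.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class DemoConfig:
    """Hypothetical config; only two fields are declared explicitly."""

    hidden_size: int
    num_hidden_layers: int
    # Everything not declared above ends up here instead of raising.
    kwargs: Dict[str, Any] = dataclasses.field(default_factory=dict)

    @classmethod
    def from_dict(cls, source: Dict[str, Any]) -> "DemoConfig":
        declared = {f.name for f in dataclasses.fields(cls)} - {"kwargs"}
        known = {k: v for k, v in source.items() if k in declared}
        extra = {k: v for k, v in source.items() if k not in declared}
        return cls(**known, kwargs=extra)


# A HuggingFace-style dict usually carries more keys than the model needs.
hf_like = {"hidden_size": 4096, "num_hidden_layers": 32, "torch_dtype": "float16"}
config = DemoConfig.from_dict(hf_like)
```

Keeping the unknown keys rather than discarding them means nothing from the original configuration file is lost, while the declared fields stay strictly typed.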

commit b0bfc88
Author: Junru Shao <junrushao@apache.org>
Date: Sun Oct 15 00:24:24 2023 -0700

Add links to Python API Reference (mlc-ai#1068)

commit 9010d48
Author: Jeethu Rao <jeethu@jeethurao.com>
Date: Sun Oct 15 06:42:24 2023 +0100

Minor typo fix (mlc-ai#1064)

commit 8184431
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Sat Oct 14 00:33:15 2023 -0500

[ParamManager] Cleanup creation of quantization IRModule (mlc-ai#1053)

This commit replaces the single-parameter `relax_model.param_manager.create_quantize_func` function with a method on the `ParamManager`, `create_parameter_transformation`. This avoids potential typos between `param_manager` as the imported Python module `mlc_llm.relax_model.param_manager` and an instance of the `ParamManager` class named `param_manager`, and makes the functionality easier to find.

This function also takes an optional `optimize_parameter_order` flag, defaulting to `True`, which applies the `ReorderTransformFunc` pass. Since the `ReorderTransformFunc` is intended to be used with several configuration objects owned by `ParamManager`, this simplifies the common path of producing an optimally-ordered parameter transformation module.

commit 481cd92
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Sat Oct 14 00:32:36 2023 -0500

[Core] Remove duplication in MODEL.get_model calls (mlc-ai#1054)

This commit removes the `if`/`elif` chain in `core.py`, where the body of each conditional assigns the same `mod, param_manager, params, model_config`, and is identical except for the choice of model being built.
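
The kind of de-duplication described in mlc-ai#1054 above can be sketched with a lookup table in place of an `if`/`elif` chain. This is a minimal illustration under assumptions: `MODEL_REGISTRY`, `build_model`, and the stand-in model modules are hypothetical names for this note, and the actual change in `core.py` may be written differently.

```python
from types import SimpleNamespace


def _make_model_module(name):
    """Create a hypothetical stand-in for a model-definition module."""

    def get_model(args, config):
        # Every model returns the same four-element tuple.
        return (f"{name}-mod", f"{name}-param_manager", f"{name}-params", dict(config))

    return SimpleNamespace(get_model=get_model)


# One table entry per model replaces one `elif` branch per model.
MODEL_REGISTRY = {
    "llama": _make_model_module("llama"),
    "gpt_neox": _make_model_module("gpt_neox"),
}


def build_model(model_category, args, config):
    """Look up the model once, then run the shared call a single time."""
    try:
        model_module = MODEL_REGISTRY[model_category]
    except KeyError:
        raise ValueError(f"Model {model_category!r} is not supported") from None
    return model_module.get_model(args, config)


mod, param_manager, params, model_config = build_model(
    "llama", args=None, config={"vocab_size": 32000}
)
```

Adding a model then only requires a new table entry, and the shared assignment of `mod, param_manager, params, model_config` lives in one place.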
commit c2b8cbc
Author: Jeethu Rao <jeethu@jeethurao.com>
Date: Sat Oct 14 06:32:05 2023 +0100

Fix Stable LM 3B build (mlc-ai#1061)

* [stablelm 3b] Rename dynamic vocab size from "v" to "vocab_size"
* Add get_num_key_value_heads method to StableLM3bConfig

commit d854105
Author: Ruihang Lai <ruihangl@cs.cmu.edu>
Date: Fri Oct 13 20:45:58 2023 -0400

[Model] Initial batching support for Llama (mlc-ai#1048)

This PR introduces the initial batched input support for Llama models. To keep the code manageable, we keep both the single-sequence handling flow and the batching handling flow in the Llama modeling.

Now, with `--enable-batching` as a build argument, we build Llama for the batched version.

NOTE: The paged attention kernel/TIR func are not included in this PR, so currently the built library with batching enabled is not runnable. We will follow up with the attention kernel in the future.

This PR guarantees that the existing single-sequence inference (Python API, CLI, etc.) is not broken.

P.S. The batching flow is subject to bug fixes as we integrate with the attention function and run the e2e flow in the future.

commit edab9b5
Author: Junru Shao <junrushao@apache.org>
Date: Fri Oct 13 09:57:46 2023 -0700

[Doc] Use -U instead of --force-reinstall (mlc-ai#1062)

`--force-reinstall` will reinstall all dependencies of a Python package, which is unnecessary. `-U` is a better choice in this case.
commit ca8c11b
Author: Sunghyun Park <sunggg@umich.edu>
Date: Fri Oct 13 09:00:21 2023 -0700

[BugFix] Set the right `max_sequence_length` for both Llama-1 and Llama-2 families (mlc-ai#1032)

* fix
* reflect feedback

---------

Co-authored-by: “Sunghyun <sunggg@umich.com>

commit bfaa5b9
Author: Ruihang Lai <ruihangl@cs.cmu.edu>
Date: Thu Oct 12 17:40:54 2023 -0400

Revert "[Transform] Apply split_rotary optimization on prefill (mlc-ai#1033)" (mlc-ai#1058)

This reverts commit b9179cf as elaborated here mlc-ai#1033 (comment)

commit 98ebd28
Author: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Date: Thu Oct 12 13:24:10 2023 -0700

[Docs] Add `mlc.ai/package` to `DEPENDENCY INSTALLATION` group (mlc-ai#1055)

Co-authored-by: Junru Shao <junrushao1994@gmail.com>

commit b9179cf
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Thu Oct 12 12:15:44 2023 -0500

[Transform] Apply split_rotary optimization on prefill (mlc-ai#1033)

* [Transform] Apply split_rotary optimization on prefill

  Prior to this commit, the `transform.fuse_split_rotary_embedding` function was only applicable to the `decode` function of a Llama-type model. This was due to the sequence length being restricted to one, both in the pattern-match rule and in the `split_rotary` function, and the function being restricted to operate only on the `decode` function.

  This commit updates the `transform.fuse_split_rotary_embedding` pass to be a `tvm.ir.transform.Pass`, operating on all applicable matches in the `IRModule`. The `split_rotary` function is now produced as a fully-generic function, with static parameters substituted in afterwards. At this stage, the sequence length is retained as a dynamic parameter, such that it can be used by the `prefill` function.
* Avoid multiple kernel launches for split_rotary

commit 1e6fb11
Author: Denise Kutnick <boca.denise@gmail.com>
Date: Wed Oct 11 00:06:46 2023 -0700

add verbose stats to mlc-chat REST API (mlc-ai#1049)

* add verbose stats to mlc-chat REST API
* update docs

commit 20131fb
Author: Junru Shao <junrushao@apache.org>
Date: Mon Oct 9 16:53:56 2023 -0700

Update README.md (mlc-ai#1045)

Update README.md

commit bdd9d9b
Author: Ruihang Lai <ruihangl@cs.cmu.edu>
Date: Mon Oct 9 19:08:14 2023 -0400

[CPP] Separate common utils out from llm_chat.cc (mlc-ai#1044)

This PR separates the tokenizer creation function and the random number generator out of `llm_chat.cc` as a preparation step for batching inference support, since these functions/modules are also used in the same way in batching inference.

commit a58605f
Author: Junru Shao <junrushao@apache.org>
Date: Mon Oct 9 15:05:34 2023 -0700

Update README.md

commit a032d40
Author: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
Date: Mon Oct 9 18:03:24 2023 -0400

[Docs] Iterate model prebuilts docs (mlc-ai#1043)

* Iterate model prebuilts docs
* small fix

commit 85001ed
Author: Jeethu Rao <jeethu@jeethurao.com>
Date: Mon Oct 9 20:40:52 2023 +0100

Support for the Stable LM 3B model (mlc-ai#1008)

Support for the stablelm-3b-4e1t model

commit c02fdaf
Author: yongjer <54315206+yongjer@users.noreply.github.com>
Date: Tue Oct 10 00:58:51 2023 +0800

Update compile_models.rst (mlc-ai#1038)

fix permission issue

commit bed9e60
Author: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
Date: Mon Oct 9 12:58:36 2023 -0400

[Docs] Model prebuilts tracking page revamp (mlc-ai#1000)

commit 3a9849a
Author: Bohan Hou <bohanhou@andrew.cmu.edu>
Date: Mon Oct 9 12:27:58 2023 -0400

[Android] Add Llama2 q4f16_0 (mlc-ai#1041)

llama2 q4f160

commit b44f679
Author: Charlie Ruan <53290280+CharlieFRuan@users.noreply.github.com>
Date: Mon Oct 9 11:35:58 2023 -0400

Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs (mlc-ai#1040)

Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs, build model

commit bae37b3
Author: Yaxing Cai <caiyaxing666@gmail.com>
Date: Sun Oct 8 16:55:16 2023 -0700

[Android] Use `AlertDialog` instead of `Toast` (mlc-ai#1039)

commit 6e40c21
Author: Eric Lunderberg <Lunderberg@users.noreply.github.com>
Date: Sat Oct 7 22:07:09 2023 -0500

[Build] Added --pdb flag to build.py, drop into pdb on error (mlc-ai#1017)

This commit adds an optional `--pdb` flag to the `build.py` script. If passed, any exception raised that would otherwise terminate the script will first enter a pdb post-mortem, allowing the error to be inspected.

commit ad3a6b9
Author: Roee Shenberg <shenberg@gmail.com>
Date: Sun Oct 8 04:59:15 2023 +0200

Fix two bugs in kv-cache backtrack loop (mlc-ai#856)

Fix two bugs in kv-cache pop loop

Bug 1: old code would stop early because output_ids was shortened in-place during the loop
Bug 2: off-by-one in backoff size due to break

commit 898db76
Author: David Pissarra <61968959+davidpissarra@users.noreply.github.com>
Date: Sun Oct 8 03:36:19 2023 +0100

[API] Add GenerationConfig (mlc-ai#1024)
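The `--pdb` behaviour described for `build.py` in mlc-ai#1017 above (enter a post-mortem debugger before an exception terminates the script) can be sketched with the standard library alone. This is an assumption-laden illustration rather than the upstream implementation: `run_with_optional_pdb` and its `post_mortem` parameter are hypothetical names, and the injectable hook exists only so the behaviour can be exercised without opening an interactive debugger.

```python
import argparse
import pdb
import sys
from typing import Any, Callable, List, Optional, Sequence


def run_with_optional_pdb(
    main: Callable[[List[str]], Any],
    argv: Optional[Sequence[str]] = None,
    post_mortem: Callable[[Any], Any] = pdb.post_mortem,
) -> Any:
    """Run `main`; if `--pdb` was passed, inspect any exception before re-raising."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--pdb", action="store_true")
    # Unrecognized arguments are handed on to `main` untouched.
    args, remaining = parser.parse_known_args(argv)
    try:
        return main(remaining)
    except Exception:
        if args.pdb:
            # sys.exc_info()[2] is the traceback of the exception being handled.
            post_mortem(sys.exc_info()[2])
        raise  # the script still terminates with the original error
```

A script would call `run_with_optional_pdb(main)` from its `if __name__ == "__main__":` block; with `argv=None`, `argparse` reads `sys.argv[1:]`.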

masahi added 2 commits November 7, 2023 00:23

clean

19d768d

masahi merged commit c222b9f into octoml:batch-serving Nov 7, 2023
3 checks passed
