Upstream merge Nov20 (includes ft_group quantization support) #71

Merged
161 commits merged on Nov 22, 2023
Commits
898db76
[API] Add GenerationConfig (#1024)
davidpissarra Oct 8, 2023
ad3a6b9
Fix two bugs in kv-cache backtrack loop (#856)
shenberg Oct 8, 2023
6e40c21
[Build] Added --pdb flag to build.py, drop into pdb on error (#1017)
Lunderberg Oct 8, 2023
bae37b3
[Android] Use `AlertDialog` instead of `Toast` (#1039)
cyx-6 Oct 8, 2023
b44f679
Add doc for ChatConfig, ConvConfig, GenerationConfig, BuildArgs (#1040)
CharlieFRuan Oct 9, 2023
3a9849a
[Android] Add Llama2 q4f16_0 (#1041)
spectrometerHBH Oct 9, 2023
bed9e60
[Docs] Model prebuilts tracking page revamp (#1000)
CharlieFRuan Oct 9, 2023
c02fdaf
Update compile_models.rst (#1038)
yongjer Oct 9, 2023
85001ed
Support for the Stable LM 3B model (#1008)
jeethu Oct 9, 2023
a032d40
[Docs] Iterate model prebuilts docs (#1043)
CharlieFRuan Oct 9, 2023
a58605f
Update README.md
junrushao Oct 9, 2023
bdd9d9b
[CPP] Separate common utils out from llm_chat.cc (#1044)
MasterJH5574 Oct 9, 2023
20131fb
Update README.md (#1045)
junrushao Oct 9, 2023
1e6fb11
add verbose stats to mlc-chat REST API (#1049)
denise-k Oct 11, 2023
b9179cf
[Transform] Apply split_rotary optimization on prefill (#1033)
Lunderberg Oct 12, 2023
98ebd28
[Docs] Add `mlc.ai/package` to `DEPENDENCY INSTALLATION` group (#1055)
LeshengJin Oct 12, 2023
bfaa5b9
Revert "[Transform] Apply split_rotary optimization on prefill (#1033…
MasterJH5574 Oct 12, 2023
ca8c11b
[BugFix] Set the right `max_sequence_length` for both Llama-1 and Lla…
sunggg Oct 13, 2023
edab9b5
[Doc] Use -U instead of --force-reinstall (#1062)
junrushao Oct 13, 2023
d854105
[Model] Initial batching support for Llama (#1048)
MasterJH5574 Oct 14, 2023
c2b8cbc
Fix Stable LM 3B build (#1061)
jeethu Oct 14, 2023
481cd92
[Core] Remove duplication in MODEL.get_model calls (#1054)
Lunderberg Oct 14, 2023
8184431
[ParamManager] Cleanup creation of quantization IRModule (#1053)
Lunderberg Oct 14, 2023
9010d48
Minor typo fix (#1064)
jeethu Oct 15, 2023
b0bfc88
Add links to Python API Reference (#1068)
junrushao Oct 15, 2023
204860b
[Fix] ChatModule incorrect temperature buffer shape (#1070)
MasterJH5574 Oct 15, 2023
d202077
[ParamManager] Added progress bar for get_item/set_item (#1063)
Lunderberg Oct 16, 2023
9872c48
[Python] Extract common device str parse function in ChatModule (#1074)
MasterJH5574 Oct 16, 2023
3aefd9f
[Bugfix] Compilation Error in q4f32_1 (#1078)
junrushao Oct 17, 2023
2625945
Establish `mlc_chat.compiler` (#1082)
junrushao Oct 19, 2023
56a8004
Update README.md for Multi-GPU (#1090)
junrushao Oct 19, 2023
b0373d1
Support lib_path override in C++. Improvements on docs and error mess…
rickzx Oct 19, 2023
830656f
StreamIterator (#1057)
varshith15 Oct 19, 2023
9bf5723
Update `benchmark.py` according to #1086 (#1091)
junrushao Oct 19, 2023
62d0c03
Disable Disco for q4f16_ft and q8f16_ft quantization (#1094)
LeshengJin Oct 20, 2023
cf39bf6
[Format] Apply isort and black for `python/` (#1097)
junrushao Oct 20, 2023
e9b85ce
More formatting (#1099)
junrushao Oct 21, 2023
03c641a
Enable Python Linter (#1098)
junrushao Oct 21, 2023
46d11e6
Add Basic Pylint and Mypy Tooling (#1100)
junrushao Oct 21, 2023
6159cc4
[CI] Add clang-format (#1103)
junrushao Oct 22, 2023
16dd2ae
[Slim-LM] Smart path finding for config and weight (#1088)
LeshengJin Oct 23, 2023
f57c9c9
[Transform] Provide IRModule transform for rewrite_attention (#1052)
Lunderberg Oct 23, 2023
e5927ce
[ParamManager] Use BundleModelParams for transform_dequantize (#1056)
Lunderberg Oct 23, 2023
7ae8c6d
[Slim-LM] Introduce HFLoad for loading Pytorch and SafeTensor weights…
LeshengJin Oct 23, 2023
5a7dcd8
[WINDOWS] reduce noise in windows build (#1115)
tqchen Oct 24, 2023
61179a0
Add CLI commands for compilation (#1109)
junrushao Oct 24, 2023
8ce7793
Auto updated submodule references
Oct 24, 2023
488017d
fix mismatched argument name (#1117)
Sing-Li Oct 24, 2023
206103b
[Docs] Add doc for max and mean gen len, shift factor; and buildArgs …
CharlieFRuan Oct 24, 2023
2aa6809
Revert "[ParamManager] Use BundleModelParams for transform_dequantize…
junrushao Oct 24, 2023
9cb8e8e
Remove inaccurate warning message (#1121)
junrushao Oct 24, 2023
9166edb
[REST] OpenAI compatible Rest API (#1107)
Kartik14 Oct 24, 2023
a4279e3
Add --opt flag parsing to CLI (#1123)
junrushao Oct 25, 2023
973f9fc
[ParamManager][Redo] Use BundleModelParams for transform_dequantize (…
Lunderberg Oct 25, 2023
24f795e
added details to windows installation (#1133)
goutham2688 Oct 27, 2023
2c492e5
Grammatical and Typographical improvements (#1139)
tmsagarofficial Oct 28, 2023
2ec0cc8
Minor enhancements to `ChatModule` (#1132)
YuchenJin Oct 28, 2023
27ac5ac
Updating tvm install docs (#1143)
David-Sharma Oct 29, 2023
2b6d832
Make the help info consistent with program name (#1137)
fennecJ Oct 29, 2023
878ae84
Support parameter packing (#1146)
junrushao Oct 29, 2023
c0c3a8d
[Slim-LM] Enable Group Quant (#1129)
zxybazh Oct 29, 2023
2193767
Enable Mypy and Pylint in mlc_chat Python Package (#1149)
junrushao Oct 29, 2023
0a25374
Migrate Compiler Passes (#1150)
junrushao Oct 30, 2023
1a79a53
Compile Model Preset without External `config.json` (#1151)
junrushao Oct 30, 2023
ba67835
Update attention layer (#1153)
junrushao Oct 30, 2023
fee2cb5
Add batched Llama model definition using vLLM paged attention (#1134)
masahi Oct 30, 2023
ece97b1
[Transform][Redo] Apply split_rotary optimization on prefill (#1125)
Lunderberg Oct 30, 2023
b190578
Apply rewrite for normal attention and MQA (#1138)
Lunderberg Oct 30, 2023
8ca0176
[Rest] Fix emoji handling in Rest API. (#1142)
YuchenJin Oct 30, 2023
3cf5605
[Utility] Check for isinstance(exc, Exception) before entering pdb (#…
Lunderberg Oct 30, 2023
0a9d6c7
[Utils] Remove conversion to numpy array in utils.save_params (#1083)
Lunderberg Oct 30, 2023
425a2cb
[Fix][REST] Use lowered-cased "app" (#1159)
junrushao Oct 30, 2023
9076d01
[Rest] Document emoji handling (#1160)
YuchenJin Oct 31, 2023
b5bfa5b
Enable group quant transform with nn.Module (#1154)
cyx-6 Oct 31, 2023
8438b27
Misc Cleanups of Compilation Pipeline (#1165)
junrushao Oct 31, 2023
02d1e57
Support CUDA Multi-Arch Compilation (#1166)
junrushao Oct 31, 2023
e0cd3f6
[Bugfix] Cannot find global function `mlc.llm_chat_create` (#1167)
junrushao Oct 31, 2023
f5b2e88
Fix RWKV Support (#1136)
BBuf Nov 1, 2023
200653a
Auto updated submodule references
Nov 1, 2023
9831135
Fix Android app Permission denied error on Android 10 (#1175)
anibohara2000 Nov 1, 2023
1757777
[SLM] Fix group quantization (#1172)
cyx-6 Nov 1, 2023
2ca7d15
[Fix] TIR block name of dequantization (#1177)
junrushao Nov 2, 2023
53060af
[SLM][AutoLLM] Enable Command Line Weight Conversion (#1170)
zxybazh Nov 2, 2023
2dc8183
[Fix][SLM] Update q4f16 quantization with the new mutator name rule (…
LeshengJin Nov 3, 2023
6ae02dd
[Model Support][SWA] Add support for sliding window attention for Mis…
CharlieFRuan Nov 3, 2023
4716704
Add Python API for Weight Conversion (#1182)
junrushao Nov 4, 2023
9d20575
Merge `llama_config.CONFIG` into `MODEL_PRESETS` (#1188)
junrushao Nov 4, 2023
5d1dc34
Merge llama_config.py into llama_model.py (#1189)
junrushao Nov 4, 2023
4832c2f
Add CodeLlama as part of model presets (#1190)
junrushao Nov 4, 2023
78424f0
[Docs] Clarify zstd installation on Windows (#1191)
junrushao Nov 4, 2023
5d63f7e
[Docs] Clarify zstd installation on Windows (#1196)
junrushao Nov 4, 2023
3417505
Support overriding `--max-sequence-length` in command line (#1197)
junrushao Nov 5, 2023
0e08845
[RestAPI] Added docs (#1193)
anibohara2000 Nov 5, 2023
145a984
[API] ```llm-vscode``` extension support (#1198)
davidpissarra Nov 5, 2023
3413d17
[Fix] Use `fabs` as floating point abs function in C++ (#1202)
junrushao Nov 5, 2023
7ccb51a
Integrating MLC runtime with the new compilation workflow (#1203)
junrushao Nov 6, 2023
65478c8
[Fix] Remove Redundant Warnings (#1204)
junrushao Nov 6, 2023
01d4339
Try fix macOS build with picojson (#1206)
junrushao Nov 6, 2023
51d6f9c
Try fix macOS build with picojson again (#1207)
junrushao Nov 6, 2023
a7f1183
Auto updated submodule references
Nov 6, 2023
e2c99a8
[Fix] Keep update-to-date with upstream API change (#1209)
junrushao Nov 6, 2023
e00220c
Detect `mtriple` via LLVM (#1211)
junrushao Nov 6, 2023
9869ca6
Fix Python3.8 compatibility breakage (#1210)
Lunderberg Nov 6, 2023
4042626
[Slim-LM] Enable loading from AWQ pre-quantized weight. (#1114)
LeshengJin Nov 6, 2023
be1c18b
[Bugfix] Fix Cannot import name '_LIB' from 'mlc_chat.base' (#1214)
CharlieFRuan Nov 7, 2023
1015aae
[SLM] Support `q3f16_1` and `q4f32_1` (#1215)
cyx-6 Nov 8, 2023
1a6fadd
Make the Compilation Working E2E (#1218)
junrushao Nov 8, 2023
616ca42
[Mistral][SWA] Add sliding window to metadata (#1217)
CharlieFRuan Nov 8, 2023
e52f449
Support for `chatml` format conversation (for TinyLlama-1.1B-Chat-v0.…
acalatrava Nov 8, 2023
fbe75e3
Add Rust Support for MLC-LLM (#1213)
YuchenJin Nov 8, 2023
beca2ab
[Bugfix] Remove dependency on openai_api in chat module (#1222)
CharlieFRuan Nov 8, 2023
9ee5705
Bake in RAM Usage in the Generated DSO (#1224)
junrushao Nov 8, 2023
069181c
[Fix] ChatModule python messages and offset types (#1220)
YuchenJin Nov 8, 2023
f1bc951
[Fix] Variable Upperbound Should be Injected before Build Pipeline (#…
junrushao Nov 8, 2023
834811f
[MultiGPU] Support pre-sharded model weights (#1096)
Lunderberg Nov 9, 2023
45bf1c5
[AWQ] e2e awq-quantized model (#1229)
LeshengJin Nov 10, 2023
d08b009
[SLM] Support `q0f16` and `q0f32` (#1228)
cyx-6 Nov 10, 2023
fab4486
[Core][Llama] Argument `max_vocab_size` and `max_batch_size` (#1076)
MasterJH5574 Nov 11, 2023
cd71665
[Llama] Support batched prefill (#1233)
MasterJH5574 Nov 11, 2023
a21c759
[Core] Skip PrimExpr index int32 downcasting for batching (#1234)
MasterJH5574 Nov 11, 2023
cb68e7b
Auto updated submodule references
Nov 12, 2023
1400cd9
Update index.rst (#1236)
a7k3 Nov 12, 2023
c2082d8
Update android.rst (#1237)
a7k3 Nov 12, 2023
26fd019
Correct typo in cuda device name for rust chat model (#1241)
malramsay64 Nov 13, 2023
ab2a05b
Generating mlc-chat-config.json (#1238)
junrushao Nov 13, 2023
d24379c
Rename `--config` to `--model` and Consolidate CLI Messages (#1244)
junrushao Nov 13, 2023
4021785
Specify argument "dest" in argparse (#1245)
junrushao Nov 13, 2023
5005772
Add more stats during quantization (#1246)
junrushao Nov 13, 2023
34c15f2
ensure that max_gen_len is set properly in mlc_chat_config (#1249)
denise-k Nov 13, 2023
7da81a4
[Fix] Memory usage statistics (#1252)
LeshengJin Nov 13, 2023
cd4a8ed
Introduce mlc_chat subcommands (#1251)
junrushao Nov 13, 2023
8305b22
Update mlc-chat-config.json (#1254)
junrushao Nov 14, 2023
5e02cac
[Rust] Support multiple prompts (#1253)
YuchenJin Nov 14, 2023
77a4b69
[UI] Correct "convert_weight_only" to "convert_weights_only" (#1227)
Lunderberg Nov 14, 2023
12efd45
Add a downloader from HuggingFace (#1258)
junrushao Nov 14, 2023
1dbfac5
[Fix] Add prefix_tokens to `ConvConfig` in Python to match C++ implem…
YuchenJin Nov 14, 2023
8d9effe
[nn.Module] Mistral implementation (#1230)
davidpissarra Nov 15, 2023
8304d4c
Add `mlc_chat.__main__` as command line entrypoint (#1263)
junrushao Nov 15, 2023
64e3410
[Rust] Improve ergonomics of `generate` function in `ChatModule` (#1…
YuchenJin Nov 15, 2023
2c00373
[Fix] mistral `max_gen_len` (#1264)
davidpissarra Nov 15, 2023
ceb27d5
Rename `max-sequence-length` to `context-window-size` (#1265)
junrushao Nov 15, 2023
17aa5bf
Auto updated submodule references
Nov 16, 2023
fde2e85
Fix group quantization shape infer (#1273)
cyx-6 Nov 16, 2023
4a137d3
Continuous Model Delivery (#1272)
junrushao Nov 16, 2023
2600b9a
Auto updated submodule references
Nov 17, 2023
31910dd
Enhance Model Delivery (#1283)
junrushao Nov 17, 2023
fb7a224
add python, rest api test (#1278)
Kartik14 Nov 18, 2023
d3b7aad
Enable Jenkins CI (#1292)
Hzfengsy Nov 19, 2023
ad1933a
Merge remote-tracking branch 'mlc-ai/main' into merge-nov20
Nov 19, 2023
cf67fb7
fix
Nov 19, 2023
5fac856
Update android.rst (#1289)
a7k3 Nov 19, 2023
56ec8bd
more fix
Nov 20, 2023
49f75d2
Consolidate Logics for GPU Detection (#1297)
junrushao Nov 20, 2023
01daa64
[CI] Fix lint concurrent clone issue (#1299)
MasterJH5574 Nov 20, 2023
418b9a9
Auto updated submodule references
Nov 20, 2023
b4ba7ca
[Feature] Prefill chunking for non-SWA models (#1280)
davidpissarra Nov 20, 2023
488f65d
Compatible with chatglm (#979)
qc903113684 Nov 20, 2023
2fd1bf5
Add q4/q8_ft_group quantization mode (#1284)
vinx13 Nov 21, 2023
e75736c
Merge remote-tracking branch 'mlc-ai/main' into merge-nov20
Nov 21, 2023
bbed8cf
fix
Nov 21, 2023
aed3412
restore multi gpu support for FT quant
Nov 21, 2023
87 changes: 0 additions & 87 deletions .github/workflows/lint.yml

This file was deleted.

91 changes: 91 additions & 0 deletions ci/jenkinsfile.groovy
@@ -0,0 +1,91 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

import org.jenkinsci.plugins.pipeline.modeldefinition.Utils

image = 'mlcaidev/ci-cpu:caab922'
docker_run = "bash ci/bash.sh ${image}"

def per_exec_ws(folder) {
return "workspace/exec_${env.EXECUTOR_NUMBER}/" + folder
}

def init_git(submodule = false) {
checkout scm
if (submodule) {
retry(5) {
timeout(time: 2, unit: 'MINUTES') {
sh(script: 'git submodule update --init --recursive -f', label: 'Update git submodules')
}
}
}
}

stage('Lint') {
parallel(
'isort': {
node('CPU-SMALL') {
ws(per_exec_ws('mlc-llm-lint-isort')) {
init_git()
sh(script: "ls", label: 'debug')
sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
sh(script: "${docker_run} bash ci/task/isort.sh", label: 'Lint')
}
}
},
'black': {
node('CPU-SMALL') {
ws(per_exec_ws('mlc-llm-lint-black')) {
init_git()
sh(script: "ls", label: 'debug')
sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
sh(script: "${docker_run} bash ci/task/black.sh", label: 'Lint')
}
}
},
'mypy': {
node('CPU-SMALL') {
ws(per_exec_ws('mlc-llm-lint-mypy')) {
init_git()
sh(script: "ls", label: 'debug')
sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
sh(script: "${docker_run} bash ci/task/mypy.sh", label: 'Lint')
}
}
},
'pylint': {
node('CPU-SMALL') {
ws(per_exec_ws('mlc-llm-lint-pylint')) {
init_git()
sh(script: "ls", label: 'debug')
sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
sh(script: "${docker_run} bash ci/task/pylint.sh", label: 'Lint')
}
}
},
'clang-format': {
node('CPU-SMALL') {
ws(per_exec_ws('mlc-llm-lint-clang-format')) {
init_git()
sh(script: "ls", label: 'debug')
sh(script: "${docker_run} conda env export --name ci-lint", label: 'Checkout version')
sh(script: "${docker_run} bash ci/task/clang-format.sh", label: 'Lint')
}
}
},
)
}
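
For reference, the five stages above all run the same pattern, `bash ci/bash.sh ${image} bash ci/task/<task>.sh`, inside the `mlcaidev/ci-cpu:caab922` container. A minimal local sketch of the same steps (assuming Docker and this repository's ci/ scripts are available; not part of the PR itself):

import subprocess

IMAGE = "mlcaidev/ci-cpu:caab922"  # same CI image the Jenkinsfile pins

# Mirror the Jenkins lint stages: bash ci/bash.sh ${image} bash ci/task/<task>.sh
for task in ("isort", "black", "mypy", "pylint", "clang-format"):
    subprocess.run(
        ["bash", "ci/bash.sh", IMAGE, "bash", f"ci/task/{task}.sh"],
        check=True,
    )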
2 changes: 1 addition & 1 deletion ci/task/mypy.sh
@@ -8,4 +8,4 @@ export PYTHONPATH="./python:$PYTHONPATH"

set -x

mypy ./python/ ./tests/python/
mypy --install-types --non-interactive ./python/ ./tests/python/
2 changes: 1 addition & 1 deletion ci/task/pylint.sh
@@ -9,7 +9,7 @@ export PYTHONPATH="./python:$PYTHONPATH"
set -x

# TVM Unity is a dependency to this testing
pip install --quiet --pre -U -f https://mlc.ai/wheels mlc-ai-nightly
pip install --quiet --pre -U -f https://mlc.ai/wheels mlc-ai-nightly requests

pylint --jobs $NUM_THREADS ./python/
pylint --jobs $NUM_THREADS --recursive=y ./tests/python/
59 changes: 28 additions & 31 deletions cpp/llm_chat.cc
@@ -317,25 +317,33 @@ class LLMChat {
return os.str();
}

bool UpdateMaxWindowSizeFromMetadata() {
void UpdateConfigFromMetadata() {
if (ft_.use_disco) {
return false;
}
if (this->sliding_window_ != -1) {
return false;
return;
}

PackedFunc fget_metadata = ft_.mod_get_func("get_metadata");
if (fget_metadata == nullptr) {
return false;
return;
}
ObjectRef ret = fget_metadata();
std::string metadata_str = std::string(Downcast<String>(ret));
picojson::value metadata_info;
picojson::parse(metadata_info, std::string(metadata_str));
auto metadata = metadata_info.get<picojson::object>();

ICHECK(metadata["max_window_size"].is<int64_t>());
max_window_size_ = std::min(max_window_size_, metadata["max_window_size"].get<int64_t>());
return true;

if (metadata.count("prefill_chunk_size")) {
ICHECK(metadata["prefill_chunk_size"].is<int64_t>());
prefill_chunk_size_ =
std::min(prefill_chunk_size_, metadata["prefill_chunk_size"].get<int64_t>());
}
if (metadata.count("sliding_window")) {
ICHECK(metadata["sliding_window"].is<int64_t>());
sliding_window_ = std::min(sliding_window_, metadata["sliding_window"].get<int64_t>());
}
}

/*!
@@ -410,21 +418,12 @@ class LLMChat {
<< "Cannot specify both sliding_window and max_window_size.";
this->sliding_window_ = config["sliding_window"].get<int64_t>();
CHECK(this->sliding_window_ > 0) << "Sliding window size needs to be positive";
CHECK(config.count("sliding_window_chunk_size"))
CHECK(config.count("prefill_chunk_size"))
<< "Need to specify chunk size if using sliding window attention.";
}
if (config.count("sliding_window_chunk_size")) {
CHECK(config["sliding_window_chunk_size"].is<int64_t>());
this->sliding_window_chunk_size_ = config["sliding_window_chunk_size"].get<int64_t>();
CHECK(this->sliding_window_chunk_size_ > 0)
<< "Sliding window chunk size needs to be positive";
CHECK(config.count("sliding_window")) << "Need to specify sliding window size.";
}
if (config.count("model_name")) {
CHECK(config["model_name"].is<std::string>());
this->model_name_ = config["model_name"].get<std::string>();
} else {
CHECK(partial_update) << "Key \"model_name\" not found.";
if (config.count("prefill_chunk_size")) {
CHECK(config["prefill_chunk_size"].is<int64_t>());
this->prefill_chunk_size_ = config["prefill_chunk_size"].get<int64_t>();
}
if (config.count("top_p")) {
CHECK(config["top_p"].is<double>());
@@ -513,8 +512,8 @@ class LLMChat {
// so there is no explicit abi dependency on these extra
// classes other than basic tvm runtime.
this->ft_.Init(reload_lib, device_, this->num_shards_);
UpdateConfigFromMetadata();
if (this->sliding_window_ == -1) {
UpdateMaxWindowSizeFromMetadata();
CHECK(max_window_size_ != std::numeric_limits<int64_t>::max())
<< "Key \"max_window_size\" not found.";
}
@@ -807,9 +806,8 @@ class LLMChat {
if (ft_.use_disco) {
LOG(FATAL) << "NotImplementedError: Distributed inference is not supported for this model";
}
if (this->sliding_window_ != -1) {
LOG(FATAL)
<< "NotImplementedError: Sliding window attention does not support separate embedding";
if (this->prefill_chunk_size_ != -1) {
LOG(FATAL) << "NotImplementedError: Separate embedding does not support chunking";
}
NDArray embedding = Downcast<NDArray>(
EmbedStep(inp, append_conversation, place_in_prompt, generation_config_str));
@@ -832,10 +830,10 @@ class LLMChat {

int32_t new_seq_len = total_seq_len_;
NDArray logits_on_device;
if (this->sliding_window_ != -1) {
// Use chunking if we use sliding window attention (see Mistral paper figure 3).
for (int64_t begin = 0; begin < token_len; begin += this->sliding_window_chunk_size_) {
int64_t end = std::min(token_len, begin + this->sliding_window_chunk_size_);
if (this->prefill_chunk_size_ > 0) {
// Perform chunking.
for (int64_t begin = 0; begin < token_len; begin += this->prefill_chunk_size_) {
int64_t end = std::min(token_len, begin + this->prefill_chunk_size_);
std::vector<int32_t> chunk =
std::vector<int32_t>(prompt_tokens.begin() + begin, prompt_tokens.begin() + end);
new_seq_len += static_cast<int64_t>(chunk.size());
Expand All @@ -844,6 +842,7 @@ class LLMChat {
ICHECK_EQ(new_seq_len, total_seq_len_ + token_len) << "Expect chunking process all tokens";
} else {
// Otherwise, prefill entire prompt at once.
CHECK(sliding_window_ == -1) << "Expect chunking with sliding window attention";
new_seq_len += token_len;
logits_on_device = this->ForwardTokens(prompt_tokens, new_seq_len);
}
@@ -1356,16 +1355,14 @@ class LLMChat {
//----------------------------
// Conversation
//----------------------------
// model name
std::string model_name_;
// conversation
Conversation conversation_;
// total sequence len,
int64_t total_seq_len_{0};
// max window size, mean and max generation length, sliding window
// If we use sliding window, max window size is its default max() value
int64_t max_window_size_{std::numeric_limits<int64_t>::max()}, mean_gen_len_{128},
max_gen_len_{512}, sliding_window_{-1}, sliding_window_chunk_size_{-1};
max_gen_len_{512}, sliding_window_{-1}, prefill_chunk_size_{-1};
// size of the vocab table
int64_t vocab_size_;
// number of shards in distributed inference
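
The change above renames `sliding_window_chunk_size` to `prefill_chunk_size` and lets any model that declares a chunk size prefill its prompt in fixed-size pieces, not only sliding-window (Mistral-style) models. A minimal Python sketch of the same control flow, for illustration only (the real implementation is the C++ shown above):

from typing import Callable, List, Optional, Tuple

def chunked_prefill(
    prompt_tokens: List[int],
    prefill_chunk_size: int,   # -1 means "no chunking", as in llm_chat.cc
    sliding_window: int,       # -1 means "no sliding window"
    forward_tokens: Callable[[List[int], int], object],
    total_seq_len: int = 0,
) -> Tuple[Optional[object], int]:
    """Feed prompt_tokens to forward_tokens in chunks of prefill_chunk_size."""
    logits = None
    new_seq_len = total_seq_len
    if prefill_chunk_size > 0:
        # Process the prompt chunk by chunk (see Mistral paper, figure 3).
        for begin in range(0, len(prompt_tokens), prefill_chunk_size):
            chunk = prompt_tokens[begin : begin + prefill_chunk_size]
            new_seq_len += len(chunk)
            logits = forward_tokens(chunk, new_seq_len)
        assert new_seq_len == total_seq_len + len(prompt_tokens)
    else:
        # No chunking: sliding-window models are expected to always chunk.
        assert sliding_window == -1, "Expect chunking with sliding window attention"
        new_seq_len += len(prompt_tokens)
        logits = forward_tokens(prompt_tokens, new_seq_len)
    return logits, new_seq_len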
6 changes: 3 additions & 3 deletions docs/deploy/android.rst
@@ -33,7 +33,7 @@ Prerequisite
TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang
# Example on Windows
ANDROID_NDK: $HOME/Library/Android/sdk/ndk/25.2.9519653
TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/darwin-x86_64/bin/aarch64-linux-android24-clang
TVM_NDK_CC: $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android24-clang

**JDK**, such as OpenJDK >= 17, to compile Java bindings of TVM Unity runtime. It could be installed via Homebrew on macOS, apt on Ubuntu or other package managers. Set up the following environment variable:

@@ -164,6 +164,6 @@ Instructions have been provided to build an Android App with MLC LLM in previous
.. code-block:: bash

adb install android/MLCChat/app/release/app-release.apk
adb push dist/${MODEL_NAME}-${QUANTIZATION}/params /data/local/tmp/${MODEL_NAME}/
adb push dist/${MODEL_NAME}-${QUANTIZATION}/params /data/local/tmp/${MODEL_NAME}-${QUANTIZATION}/
adb shell "mkdir -p /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/"
adb shell "mv /data/local/tmp/${MODEL_NAME} /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/${MODEL_NAME}"
adb shell "mv /data/local/tmp/${MODEL_NAME}-${QUANTIZATION} /storage/emulated/0/Android/data/ai.mlc.mlcchat/files/"
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -151,7 +151,7 @@ It is recommended to have at least 6GB free VRAM to run it.
- Redmi Note 12 Pro with Snapdragon 685
- Google Pixel phones

**Tutorial and source code**. The source code of the iOS app is fully `open source <https://github.com/mlc-ai/mlc-llm/tree/main/android>`__,
**Tutorial and source code**. The source code of the android app is fully `open source <https://github.com/mlc-ai/mlc-llm/tree/main/android>`__,
and a :doc:`tutorial <deploy/android>` is included in documentation.

.. figure:: https://blog.mlc.ai/img/android/android-recording.gif
13 changes: 7 additions & 6 deletions mlc_llm/build.py
@@ -40,17 +40,18 @@ def main():
# Post processing of arguments
parsed_args = core._parse_args(parsed_args) # pylint: disable=protected-access

# if num_shard>1 without -convert-weight-only or --build-model-only, we implicitly run it sequentially
if parsed_args.num_shards > 1 and not (parsed_args.build_model_only or parsed_args.convert_weight_only):
# if num_shard>1 without -convert-weight-only or --build-model-only, we implicitly run it sequentially
if parsed_args.num_shards > 1 and not (parsed_args.build_model_only or parsed_args.convert_weights_only):
parsed_args.build_model_only = True
parsed_args.convert_weight_only = False # just to be explicit
parsed_args.convert_weights_only = False # just to be explicit
core.build_model_from_args(parsed_args)

parsed_args.build_model_only = False
parsed_args.convert_weight_only = True
parsed_args.convert_weights_only = True
core.build_model_from_args(parsed_args)
else:
core.build_model_from_args(parsed_args)



if __name__ == "__main__":
main()
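
Taken together with commits 2fd1bf5 ("Add q4/q8_ft_group quantization mode") and aed3412 ("restore multi gpu support for FT quant"), a hypothetical sketch of driving this build entrypoint with the new mode follows. The mode name `q4f16_ft_group` and the exact flag spellings are assumptions inferred from the existing q4f16_ft mode; verify against `python3 -m mlc_llm.build --help`:

import sys
from mlc_llm import build

# Hypothetical invocation only; argument names are assumptions, not part of this PR.
sys.argv = [
    "build.py",
    "--model", "Llama-2-7b-chat-hf",
    "--quantization", "q4f16_ft_group",  # assumed name of the new mode (#1284)
    "--target", "cuda",
    "--num-shards", "2",  # multi-GPU FT-quant path restored in commit aed3412
]
build.main()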