
Main #2

Merged: 65 commits into develop on Sep 15, 2024

Conversation

zyearw1024
Owner

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation for this PR and the goal you want to achieve with it.

Modification

Please briefly describe the modifications made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break backward compatibility with downstream repositories?
If so, please describe how compatibility is broken and how downstream projects should modify their code to stay compatible with this PR.

Use cases (Optional)

If this PR introduces a new feature, please list some use cases here and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification depends on a newer version of downstream projects, this PR should be tested with all supported versions of those projects.
  4. The documentation has been modified accordingly, such as docstrings or example tutorials.

RunningLeon and others added 30 commits August 9, 2024 14:48
…nternLM#2245)

* support vlm custom parameters in openai input format

* remove flash_attn deps

* update

* update

* update
…ig (InternLM#2275)

* fix side-effect: failed to update tm model config with tm engine config

* fix
* Support send tool_calls back to internlm2

* update documents

* condition
* fix template

* add max_dynamic_patch custom setting

* fix test

* update docs

* update docs

* remove unnecessary process

* update link
…M#2240)

* fix the issue missing dependencies in the Dockerfile and pip

* reset dependencies

* reset compose images tag

* add InternVL_Dockerfile

* nvidia/cuda image should be as low as possible; as of nccl 2.22.3, the minimum supported CUDA version is 12.2. If the nvidia driver version on the host machine is higher than the image's, torch in the image will not work

* remove a line

* fix the apt error in InternVL_Dockerfile

* apt add -y

* change InternVL_Dockerfile base image tag

* roll back .\requirements\test.txt to 7c4e75b

* add rust build tools in the Dockerfile to fix the bug that, for gradio>4.40.0, its dependency orjson must be built with cargo

* run pre-commit to fix; and add internvl docs

* move rust install command to where installing sys packages

* fix internVL docs layout

* remove tritonclient[grpc] from serve.txt

* fix docs about InternVL

* fix en docs about internVL markdown format

* fix en docs about internVL markdown format

* update

* update docs of internVL by H.Lyu

* fix supported inference engine info for InternVL

* remove nccl installation since it is already in the docker image

* remove nccl install and change base image tag

* change base image to 12.4.1

---------

Co-authored-by: lvhan028 <lvhan_028@163.com>
* preprocess for kv-int8

* working kv-int8

* minor

* working kv-int4

* optimize kv-int4

* optimize kv-int4

* optimized SIMT f16/u8/u4 decoding

* fix tc decoding

* int8 tc decoding

* int4 tc decoding

* minor

* optimize

* optimize tc kv-int4/int8

* fix `sm_75`/`sm_70`

* simplify

* bf16+kv4/8

* support more mma instruction

* refactor

* dispatching

* integration

* remove offline kv params

* fix msvc build

* fix msvc build

* fix lint

* fix lint

* fix cmake

* fix lint

* fix lint

* minor

* refactor

* gemm baseline

* optimize

* minor

* tb swizzle

* minor

* tune

* minor

* wip

* minor

* fp16 transcription

* optimize

* tune

* adjust layout

* optimize

* tune

* refactor

* refactor

* f16xs4/8 gemm

* refactor

* dequant

* fix Q

* fix Q

* end-to-end test

* optimize Q

* pack Q

* tune

* split-k

* sliced-k

* fix Q

* add `transpose_m8n8_b32`

* tune gemm

* predicate support

* tune

* dispatch

* dispatch v2

* automatic tuning

* nvbench

* better API

* GPU metrics

* update cost model

* add simt impl

* add 16816

* add 884

* refactor

* smem copy

* minor

* NT & NN

* transformation

* refactor

* refactor

* add UV

* refactor testbed

* working fp16 packing

* update

* use `(m, k)` & `(n, k)`

* simplify

* dispatch for conversion

* refactor

* refactor

* refactor

* simplify

* refactor quantization

* quantization

* fix `pack_cnt_m/k`

* `uint8_t`

* `uint4_t`

* symmetry

* refactor

* large pack

* fix `SmemCopy` for packed inputs

* tune

* SIMT

* SIMT packing

* SIMT int8

* SIMT int4

* fix group size

* mma.m8n8k4

* clean-up

* refactor epilogue

* fix smem layout for C

* tune epilogue

* TN

* optimize

* fix `_src_step_k`

* use raked partition

* fix `Tiled_MMA_v2` & optimize smem copy

* working w4a16

* add missing

* fuse up and gate

* fused silu

* `sm75` and `sm70`

* cache policy

* remove unused

* col major output

* fix tiling of C

* wip

* wip

* wip

* fix iterator

* update

* update kernel signature

* fix packing

* update

* refactor

* update

* update

* update

* alpha beta

* set beta

* fix & clean-up

* check max splits & add qwen

* add tp

* refactor `LlamaLinear`

* share linear layer

* tuning interface

* update

* skip nvbench for MSVC

* define `uint` when needed

* fix

* fix

* fix

* update

* disable large kernels

* fix

* refactor model conversion

* fix lint

* simplify target model

* refactor model import

* minor

* pad `inter_size` for tp

* refactor

* skip `sm_80` and `sm_90` on MSVC

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix cu12 / sm90 build

* fix

* fix ut

* fix missing include

* support GPTQ models

* fix ut

* parse tuning args

* minor

* minor

* add missing header

* add missing headers

* fix converter

* fix internvl reader initializer

* fix

* tuning

* remove unused

* tuning

* minor

* fix lint

* fix lint

* fix lint

* minor

* fix lint

* fix baichuan2-7b, deepseek-vl and xcomposer2d5-4bit

* tune sm_70

* optimize sm70 & fix converter

* optimize v100

* fix lint

* RTX 4090

* fix lint

* refactor & batch_dim support

* A100

* `TuningParams`

* lint

* lint

* minor

* switch to m-major MMA for sm70

* recognize GPTQ models

* RTX 2080 & GTX 1660

* fix missing return

* fix cu12 build for sm90

* fix ptr of operand C

* disable cache eviction policy on sm_90

* fix lint

* add refs

* fix lint

* lint
* Update error status_code to raise error in openai client

* remove strict
* remove device-type in cli

* remove device arg from lite cli
* add test utils

* lint

* fix msvc build

* fix msvc build

* lint
…nternLM#2325)

* fix getting quantization config

* recursive get quant_config

* minor fix

* check 128
* Fix hidden size and support mistral nemo

* fix lint

* comments
* Support custom logits processors

* support logit_bias for pytorch engine

* mv to fused logits_processors

* replace input_ids with all_ids

* type hint
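
The custom logits processor and `logit_bias` commits above describe a per-step hook that rewrites the logits before sampling. Below is a minimal conceptual sketch; the `(all_ids, scores)` callable signature and the way such a processor would be registered with the engine are assumptions for illustration, not the actual lmdeploy interface.

```python
# Conceptual sketch only: a fused logit-bias processor in the spirit of the
# commits above. The (all_ids, scores) signature and any registration hook
# are assumptions, not the real lmdeploy API.
import numpy as np

def logit_bias_processor(all_ids: np.ndarray, scores: np.ndarray,
                         bias: dict) -> np.ndarray:
    """Add a fixed bias to the logits of selected token ids."""
    for token_id, value in bias.items():
        scores[..., token_id] += value
    return scores

# Example: penalize token 13 and boost token 42 for the next decoding step.
logits = np.zeros((1, 32000), dtype=np.float32)
logits = logit_bias_processor(np.array([[1, 2, 3]]), logits,
                              {13: -100.0, 42: 5.0})
```
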
* optimize prefill

* merge main, update attention implementation

* check window

* request no response

* fix response

* staged attention

* optimize prefill
* refactor turbomind

* minor

* use `size_t` for size type

* fix

* minor

* print split-fuse param
* support model convert

* update template and vision model

* update docs

* update README
…nternLM#2353)

* feat(server): enable `seed` parameter for openai compatible server.

* refactor: fix format issue
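
A minimal sketch of using the new `seed` parameter through the OpenAI-compatible server; the server URL (default port 23333), API key, and served model name below are placeholders, not values taken from this PR.

```python
# Sketch: fixing the sampling seed via the OpenAI-compatible endpoint.
# Server URL, API key and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
resp = client.chat.completions.create(
    model='internlm2_5-7b-chat',        # served model name (placeholder)
    messages=[{'role': 'user', 'content': 'Tell me a joke.'}],
    temperature=0.8,
    seed=42,                            # same seed -> reproducible sampling
)
print(resp.choices[0].message.content)
```
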
lvhan028 and others added 29 commits August 28, 2024 18:09
* move get_started and installation to get_started directory

* update
* build(ascend): add Dockerfile for ascend aarch64 910B

* ci(ascend): skip ascend dockerfile codespell check

* update lmdeploy tag and transformers version in Dockerfile_aarch_910B
* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

---------
* support ascend using infer_ext

* fix(ascend): make infer_ext use TND format q,k,v in paged_token_attention

* support ascend using infer_ext

* feat: support ascend moe_gating_topk_softmax

* feat: change infer_ext ops function param order (#2)

* ascend: align attention mask to 32bytes (InternLM#7)

* fix attn args (InternLM#9)

* fix: expand shape of attn_mask (InternLM#10)

* feat: update infer_ext ops interface (InternLM#13)

* rename infer_ext to dlinfer

* format code

* Support internlm 2.5 (InternLM#14)

* refactor ascend pagedattention

* fix ascend apply_rotary_pos_emb

* fix import dlinfer (InternLM#16)

* fix: fix rms_norm params (InternLM#18)

* fix sync on ascend

---------

Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
Co-authored-by: CyCle1024 <ccy_justin@163.com>
Co-authored-by: Wei Tao <1136862851@qq.com>
Co-authored-by: jinminxi104 <jinminxi104@hotmail.com>
Co-authored-by: pdx1989 <pdx1989@gmail.com>
* support do_sample parameter

* merge GenerationConfig & EngineGenerationConfig

* align gen_config with logic with transformers

* add comments

* fix comments

* fix comments

* rename stop_words_ids -> stop_token_ids

* update tests
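
These commits merge `GenerationConfig` and `EngineGenerationConfig` and rename `stop_words_ids` to `stop_token_ids`. A minimal pipeline sketch using the merged config follows; the model path and stop token id are placeholders, and the exact field set is assumed from the commit messages above.

```python
# Sketch of the merged GenerationConfig; model path and stop token id are
# placeholders, field names follow the renames described in these commits.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')   # placeholder model
gen_config = GenerationConfig(
    do_sample=True,         # sampling toggle aligned with transformers
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=256,
    stop_token_ids=[2],     # renamed from stop_words_ids
)
print(pipe(['Hello, who are you?'], gen_config=gen_config))
```
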
* use yaml config instead of ini config

* remove comments

* update

* remove unused parameters in TurbomindModelConfig

* fix test error

* fetch yaml-cpp instead of find_package(yaml-cpp)

* remove INIReader and change config to yaml format

* update converter

* remove use_logn_attn from TurbomindEngineConfig since it is rarely used

* update

* update

* not save engine_config to yaml

* fix _from_workspace

* fix ut

* fix _from_hf

* _postprocess_config

* fix awq model inference

* update

* update

* fix chat

* fix lint

* fix lint

* config_to_dict

* fix lint

* minor

* minor

* remove allow_none

* minor

* fix lint

* typo

* fix

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
* Add guided decoding

* miss a file

* update server

* add documents

* RegexFSM -> RegexGuide

* fix internlm tokenizer

* lint

* return if guided_input_ids is None

* update doc

* update

* remove stream=False

* Add outlines to requirements
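
The guided decoding commits build on outlines' `RegexGuide` and expose structured output through the server. A heavily hedged request sketch is below; whether the server accepts exactly this `response_format` payload, and the field names inside it, are assumptions for illustration.

```python
# Hedged sketch: asking the OpenAI-compatible server for schema-constrained
# output. The response_format payload shape is an assumption.
from openai import OpenAI

schema = {
    'type': 'object',
    'properties': {'name': {'type': 'string'}, 'age': {'type': 'integer'}},
    'required': ['name', 'age'],
}
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
resp = client.chat.completions.create(
    model='internlm2_5-7b-chat',   # served model name (placeholder)
    messages=[{'role': 'user', 'content': 'Describe a person as JSON.'}],
    response_format={'type': 'json_schema',
                     'json_schema': {'name': 'person', 'schema': schema}},
)
print(resp.choices[0].message.content)
```
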
* import dlinfer before imageencoding

* import dlinfer if device_type is set to ascend

* fix isort error
* add ascend_readme_en

* add ascend_readme_zh_cn

* fix ascend_readme typo

* fix typo

* fix indent&typo

* fix indent again

* fix indent again
…M#2419)

* update actions/download-artifact to 4.1.7 to fix security issue

* update

* debug

* upgrade upload-artifact v4

* update
* ignore *.pth when downloading models from the model hub

* update
* handle invalid images

* fix test

* update encode_image_base64

* use image instead

* update tests
* split lm_head

* split token_embed

* fix memory malloc and load from converted model

* support tie_word_embeddings

* split tok_embeddings along hidden dim

* Revert "support tie_word_embeddings"

This reverts commit b7dc61b.

* remove unused

* fix windows build

* fix loading from workspace

* remove the use of ConcateSlice

* remove ConcateSlice class

* use context_decoder_output_buf_ buffer

* use tp in config.yaml

* add sync_check_cuda_error

* update

* remove unused header

* update check

* remove unused
* build: update ascend dockerfile

* add description of run file in ascend dockerfile

* use multi-stage build to copy lmdeploy without run file
* fix modelscope

* fix llava model when input images have size (x, 1)

* larger interval

* skip get_font for xcomposer2d5

* fix custom image token position

* fix potential mismatching issues

* update docs
* support pytorch backend min_p

* support turbomind backend min_p

* fix comments

* remove unused header

* remove inplace

* use _filter_minp_sorted_

* remove end_ids from sampling

* use larger grid size in invokeTopPSortInitialize

* skip softmax for topk request

* use const

* use nullptr

* use eps

* fix pr test
…ot specified (InternLM#2434)

* automatically set max_batch_size

* update

* fix

* update

* update
* attn layer

* move to backend

* add base layer

* finish llama base

* add lora and w8a8

* support awq

* add add_rms_norm kernel

* optimize step context

* attn meta as input

* add cuda graph support

* disable one of mha kernel

* share graph pool

* del graph

* update docstring

* awq cudagraph

* merge main

* support llava for llama

* fix support cudagraph flag

* support lora cudagraph

* support logit softcapping

* support transformers 4.43

* fix ut

* fix dynamic ntk cudagraph

* add moe support

* add custom module support

* optimize awq kernel

* optimize attention

* fix graph runner

* optimize prefill

* fix response

* optimize prefill

* adjust grid of paged attention

* add attention stages

* support llama3

* optimize apply rotary

* rename

* fix sampling

* remove print

* prepare for new weight loader

* refactor add model

* optimize nn

* fix linear device

* support baichuan 7b 13b

* support deepseekv2 no-tp

* support deepseek v2 tp

* add log

* fix ut

* support chatglm

* support llava

* add falcon

* add internlm2 and mistral

* add gemma/gemma2

* add deepseek, qwen1

* remove request timeout

* add qwen2, qwen-moe

* add starcoder2 phi-3 phi-3 vision

* support phi3 moe

* support dbrx

* support internvl

* support merged awq weight

* add cogvlm

* update docs

* fused layernorm

* add gelu and mul

* support triton==3.0.0

* update names

* fix

* cogvlm2

* fix

* fix

* fix internlm2 awq

* rename

* fix a hang on exit when using cli serve mode with device ascend

* raise -> return

* optimize moe

* fix linear awq bias, default awq kernel

* fix

* optimize default awq

* fix llama rope, add internlm

* optimize decoding

* recovery attention

* fix fill kv cache

* fix internlm oom

* fix llama3 memory usage

* remove float deepseekv2

* fix llama3

* update smooth quant flag

* fix w8a8

* fix w8a8 tp

---------

Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
* add support for llama ascend 910b using torch layers

* feat(ascend): modify the usage of ascend kernels to fit torch.compile

* refactor for llama

* format code

* add AscendSoftmaxTopKBuilder

* feat: support ascend mixtral

* refactor step_context for fusion attention

* fix ascend op_backend after rebase

* add comment

* format code

* remove unused param in rms_norm
* bump version to 0.6.0

* update readme

* update supported models

* update get_started on ascend platform
zyearw1024 merged commit 4063b26 into develop on Sep 15, 2024
10 of 15 checks passed