forked from InternLM/lmdeploy
Main #2
Merged
Conversation
…nternLM#2245) * support vlm custom parameters in openai input format * remove flash_attn deps * update * update * update
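The commits above add support for passing VLM-specific options through the OpenAI-style input format. Below is a minimal sketch of such a request against an lmdeploy OpenAI-compatible endpoint; the endpoint URL, model name, and the exact placement of the `max_dynamic_patch` extra parameter are assumptions for illustration, not the project's documented interface.

```python
# Hedged sketch: an OpenAI-format chat request carrying an image plus an
# assumed VLM extra parameter. Endpoint, model name, and parameter placement
# are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
resp = client.chat.completions.create(
    model="internvl-chat",  # hypothetical model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.png",
                           "max_dynamic_patch": 12}},  # assumed custom VLM parameter
        ],
    }],
)
print(resp.choices[0].message.content)
```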
…ig (InternLM#2275) * fix side-effect: failed to update tm model config with tm engine config * fix
* fix: follow up InternLM#2303 * upd
* Support send tool_calls back to internlm2 * update documents * condition
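For the tool_calls support above, a hedged sketch of a function-calling request through the OpenAI-compatible server is shown below; the tool definition, endpoint, and model name are placeholders for illustration only.

```python
# Hedged sketch: request a function call and read tool_calls from the response.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="internlm2-chat-7b",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Shanghai?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```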
* fix template * add max_dynamic_patch custom setting * fix test * update docs * update docs * remove unnecessary process * update link
…M#2240) * fix the issue of missing dependencies in the Dockerfile and pip * reset dependencies * reset compose images tag * add InternVL_Dockerfile * nvidia/cuda image should be as low as possible; as of nccl 2.22.3, the minimum supported CUDA version is 12.2; if the nvidia device version on the host machine is higher than the image's, torch in the image will not work * remove a line * fix the apt error in InternVL_Dockerfile * apt add -y * change InternVL_Dockerfile base image tag * roll back .\requirements\test.txt to 7c4e75b * add rust build tools in Dockerfile to fix the bug where gradio>4.40.0 depends on orjson, which must be built with cargo * run pre-commit to fix; and add internvl docs * move rust install command to where sys packages are installed * fix internVL docs layout * remove tritonclient[grpc] from serve.txt * fix docs about InternVL * fix en docs about internVL markdown format * fix en docs about internVL markdown format * update * update docs of internVL by H.Lyu * fix supported inference engine info for InternVL * remove nccl installation since it is already in the docker image * remove nccl install and change base image tag * change base image to 12.4.1 --------- Co-authored-by: lvhan028 <lvhan_028@163.com>
* preprocess for kv-int8 * working kv-int8 * minor * working kv-int4 * optimize kv-int4 * optimize kv-int4 * optimized SIMT f16/u8/u4 decoding * fix tc decoding * int8 tc decoding * int4 tc decoding * minor * optimize * optimize tc kv-int4/int8 * fix `sm_75`/`sm_70` * simplify * bf16+kv4/8 * support more mma instruction * refactor * dispatching * integration * remove offline kv params * fix msvc build * fix msvc build * fix lint * fix lint * fix cmake * fix lint * fix lint * minor * refactor * gemm baseline * optimize * minor * tb swizzle * minor * tune * minor * wip * minor * fp16 transcription * optimize * tune * adjust layout * optimize * tune * refactor * refactor * f16xs4/8 gemm * refactor * dequant * fix Q * fix Q * end-to-end test * optimize Q * pack Q * tune * split-k * sliced-k * fix Q * add `transpose_m8n8_b32` * tune gemm * predicate support * tune * dispatch * dispatch v2 * automatic tuning * nvbench * better API * GPU metrics * update cost model * add simt impl * add 16816 * add 884 * refactor * smem copy * minor * NT & NN * transformation * refactor * refactor * add UV * refactor testbed * working fp16 packing * update * use `(m, k)` & `(n, k)` * simplify * dispatch for conversion * refactor * refactor * refactor * simplify * refactor quantization * quantization * fix `pack_cnt_m/k` * `uint8_t` * `uint4_t` * symmetry * refactor * large pack * fix `SmemCopy` for packed inputs * tune * SIMT * SIMT packing * SIMT int8 * SIMT int4 * fix group size * mma.m8n8k4 * clean-up * refactor epilogue * fix smem layout for C * tune epilogue * TN * optimize * fix `_src_step_k` * use raked partition * fix `Tiled_MMA_v2` & optimize smem copy * working w4a16 * add missing * fuse up and gate * fused silu * `sm75` and `sm70` * cache policy * remove unused * col major output * fix tiling of C * wip * wip * wip * fix iterator * update * update kernel signature * fix packing * update * refactor * update * update * update * alpha beta * set beta * fix & clean-up * check max splits & add qwen * add tp * refactor `LlamaLinear` * share linear layer * tuning interface * update * skip nvbench for MSVC * define `uint` when needed * fix * fix * fix * update * disable large kernels * fix * refactor model conversion * fix lint * simplify target model * refactor model import * minor * pad `inter_size` for tp * refactor * skip `sm_80` and `sm_90` on MSVC * fix msvc build * fix msvc build * fix msvc build * fix msvc build * fix msvc build * fix cu12 / sm90 build * fix * fix ut * fix missing include * support GPTQ models * fix ut * parse tuning args * minor * minor * add missing header * add missing headers * fix converter * fix internvl reader initializer * fix * tuning * remove unused * tuning * minor * fix lint * fix lint * fix lint * minor * fix lint * fix baichuan2-7b, deepseek-vl and xcomposer2d5-4bit * tune sm_70 * optimize sm70 & fix converter * optimize v100 * fix lint * RTX 4090 * fix lint * refactor & batch_dim support * A100 * `TuningParams` * lint * lint * minor * switch to m-major MMA for sm70 * recognize GPTQ models * RTX 2080 & GTX 1660 * fix missing return * fix cu12 build for sm90 * fix ptr of operand C * disable cache eviction policy on sm_90 * fix lint * add refs * fix lint * lint
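The kernel work above fuses int8/int4 kv-cache quantization and mixed-precision GEMMs into CUDA attention and linear kernels. As a rough conceptual illustration only (not the fused kernels themselves), per-vector online int8 quantization of the kv cache can be sketched as:

```python
# Hedged conceptual sketch of online kv int8 quantization: store a per-vector
# scale/zero point alongside the quantized cache and dequantize on load.
import torch

def quant_kv_int8(kv: torch.Tensor):
    """kv: [..., head_dim]; returns int8 data plus scale and zero point."""
    kv_min = kv.amin(dim=-1, keepdim=True)
    kv_max = kv.amax(dim=-1, keepdim=True)
    scale = (kv_max - kv_min).clamp_min(1e-6) / 255.0
    zero = kv_min
    q = ((kv - zero) / scale).round().clamp(0, 255) - 128
    return q.to(torch.int8), scale, zero

def dequant_kv_int8(q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor):
    return (q.float() + 128) * scale + zero
```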
* Update error status_code to raise error in openai client * remove strict
* remove device-type in cli * remove device arg from lite cli
* add test utils * lint * fix msvc build * fix msvc build * lint
…nternLM#2325) * fix getting quantization config * recursive get quant_config * minor fix * check 128
* Fix hidden size and support mistral nemo * fix lint * comments
* Support custom logits processors * support logit_bias for pytorch engine * mv to fused logits_processors * replace input_ids with all_ids * type hint
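A conceptual sketch of what the fused logit_bias processing above amounts to: per-token offsets added to the raw logits before sampling. This mirrors the idea described in the commits; it is not the engine's actual implementation.

```python
# Hedged conceptual sketch of a logit_bias processor.
import torch

def apply_logit_bias(logits: torch.Tensor, logit_bias: dict[int, float]) -> torch.Tensor:
    """logits: [batch, vocab_size]; logit_bias maps token id -> additive bias."""
    if not logit_bias:
        return logits
    ids = torch.tensor(list(logit_bias.keys()), device=logits.device)
    bias = torch.tensor(list(logit_bias.values()), device=logits.device, dtype=logits.dtype)
    logits[:, ids] += bias
    return logits
```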
* optimize prefill * merge main, update attention implementation * check window * request no response * fix response * staged attention * optimize prefill
* refactor turbomind * minor * use `size_t` for size type * fix * minor * print split-fuse param
* support model convert * update template and vision model * update docs * update README
…nternLM#2353) * feat(server): enable `seed` parameter for openai compatible server. * refactor: fix format issue
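A hedged usage sketch of the `seed` parameter enabled above, sent through the OpenAI-compatible server; the endpoint and model name are placeholders.

```python
# Hedged sketch: pass a fixed seed so repeated sampled generations can be reproduced.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
resp = client.chat.completions.create(
    model="internlm2-chat-7b",  # placeholder model name
    messages=[{"role": "user", "content": "Tell me a joke."}],
    temperature=0.8,
    seed=42,  # same seed + same request should give the same sample
)
print(resp.choices[0].message.content)
```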
* move get_started and installation to get_started directory * update
* build(ascend): add Dockerfile for ascend aarch64 910B * ci(ascend): skip ascend dockerfile codespell check * update lmdeploy tag and transformers version in Dockerfile_aarch_910B
* update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update ---------
* support ascend using infer_ext * fix(ascend): make infer_ext use TND format q,k,v in paged_token_attention * support ascend using infer_ext * feat: support ascend moe_gating_topk_softmax * feat: change infer_ext ops function param order (#2) * ascend: align attention mask to 32bytes (InternLM#7) * fix attn args (InternLM#9) * fix: expand shape of attn_mask (InternLM#10) * feat: update infer_ext ops interface (InternLM#13) * rename infer_ext to dlinfer * format code * Support internlm 2.5 (InternLM#14) * refactor ascend pagedattention * fix ascend apply_rotary_pos_emb * fix import dlinfer (InternLM#16) * fix: fix rms_norm params (InternLM#18) * fix sync on ascend --------- Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn> Co-authored-by: CyCle1024 <ccy_justin@163.com> Co-authored-by: Wei Tao <1136862851@qq.com> Co-authored-by: jinminxi104 <jinminxi104@hotmail.com> Co-authored-by: pdx1989 <pdx1989@gmail.com>
* support do_sample parameter * merge GenerationConfig & EngineGenerationConfig * align gen_config with logic with transformers * add comments * fix comments * fix comments * rename stop_words_ids -> stop_token_ids * update tests
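After the merge of GenerationConfig and EngineGenerationConfig described above, `do_sample` and the renamed `stop_token_ids` live on a single config. A hedged usage sketch follows; the field names come from the commit messages, while the model name and defaults are assumptions.

```python
# Hedged sketch of the unified sampling config; not a definitive reference.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm2-chat-7b")  # placeholder model
gen_config = GenerationConfig(
    do_sample=True,       # greedy decoding when False
    temperature=0.8,
    top_p=0.95,
    stop_token_ids=[2],   # renamed from stop_words_ids per the commits
)
print(pipe("Hello", gen_config=gen_config))
```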
* use yaml config instead of ini config * remove comments * update * remove unused parameters in TurbomindModelConfig * fix test error * fetch yaml-cpp instead of find_package(yaml-cpp) * remove INIReader and change config to yaml format * update converter * remove use_logn_attn from TurbomindEngineConfig since it is rarely used * update * update * not save engine_config to yaml * fix _from_workspace * fix ut * fix _from_hf * _postprocess_config * fix awq model inference * update * update * fix chat * fix lint * fix lint * config_to_dict * fix lint * minor * minor * remove allow_none * minor * fix lint * typo * fix --------- Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
* Add guided decoding * miss a file * update server * add documents * RegexFSM -> RegexGuide * fix internlm tokenizer * lint * return if guided_input_ids is None * update doc * update * remove stream=False * Add outlines to requirements
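A conceptual sketch of the guided decoding added above: at each step the guide exposes the set of token ids that keep the output inside the target regex or schema, and every other logit is masked out before sampling. This illustrates the idea only; the actual integration uses the outlines RegexGuide mentioned in the commits.

```python
# Hedged conceptual sketch of logit masking for guided decoding.
import torch

def mask_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    """logits: [vocab_size]; keep only tokens permitted by the guide."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask
```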
* import dlinfer before imageencoding * import dlinfer if device_type is set to ascend * fix isort error
* add ascend_readme_en * add ascend_readme_zh_cn * fix ascend_readme typo * fix typo * fix indent&typo * fix indent again * fix indent again
…M#2419) * update actions/download-artifact to 4.1.7 to fix security issue * update * debug * upgrade upload-artifact v4 * update
* ignore *.pth when download model from model hub * update
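A hedged sketch of skipping `*.pth` files when pulling a model from a hub, mirroring the behavior described above; the repo id is a placeholder.

```python
# Hedged sketch: download a model snapshot while ignoring *.pth checkpoints.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    "internlm/internlm2-chat-7b",  # placeholder repo id
    ignore_patterns=["*.pth"],     # safetensors/bin weights suffice for inference
)
print(local_dir)
```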
* handle invalid images * fix test * update encode_image_base64 * use image instead * update tests
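Below is a hedged sketch of base64-encoding an image for an OpenAI-style `image_url` message, with a guard for unreadable files similar in spirit to the "handle invalid images" fix. It is not the project's actual helper.

```python
# Hedged sketch: encode an image to base64, returning None for invalid files.
import base64
from io import BytesIO

from PIL import Image

def encode_image_base64(path: str) -> str | None:
    try:
        image = Image.open(path).convert("RGB")
    except OSError:
        return None  # invalid or truncated image
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```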
* split lm_head * split token_embed * fix memory malloc and load from converted model * support tie_word_embeddings * split tok_embeddings along hidden dim * Revert "support tie_word_embeddings" This reverts commit b7dc61b. * remove unused * fix windows build * fix loading from workspace * remove the use of ConcateSlice * remove ConcateSlice class * use context_decoder_output_buf_ buffer * use tp in config.yaml * add sync_check_cuda_error * update * remove unused header * update check * remove unused
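A conceptual sketch of the weight splitting described above: lm_head is split along the vocab dimension and tok_embeddings along the hidden dimension, one shard per tensor-parallel rank. This is illustration only, not the converter's actual code.

```python
# Hedged sketch of tensor-parallel weight sharding.
import torch

def split_lm_head(weight: torch.Tensor, tp: int, rank: int) -> torch.Tensor:
    """weight: [vocab_size, hidden_size]; shard rows (vocab dim)."""
    return weight.chunk(tp, dim=0)[rank]

def split_tok_embeddings(weight: torch.Tensor, tp: int, rank: int) -> torch.Tensor:
    """weight: [vocab_size, hidden_size]; shard columns (hidden dim)."""
    return weight.chunk(tp, dim=1)[rank]
```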
* build: update ascend dockerfile * add description of run file in ascend dockerfile * use multi-stage build to copy lmdeploy without run file
* fix modelscope * fix llava model when input images have size (x, 1) * larger interval * skip get_font for xcomposer2d5 * fix custom image token position * fix potential mismatching issues * update docs
* support pytorch backend min_p * support turbomind backend min_p * fix comments * remove unused header * remove inplace * use _filter_minp_sorted_ * remove end_ids from sampling * use larger grid size in invokeTopPSortInitialize * skip softmax for topk request * use const * use nullptr * use eps * fix pr test
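A hedged sketch of the min_p filtering added above: after softmax, tokens whose probability falls below `min_p` times the probability of the most likely token are discarded. This is the standard definition of min_p, not the engine's CUDA kernel.

```python
# Hedged sketch of min_p filtering on a batch of logits.
import torch

def filter_min_p(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    """logits: [batch, vocab_size]."""
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))
```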
…ot specified (InternLM#2434) * automatically set max_batch_size * update * fix * update * update
* attn layer * move to backend * add base layer * finish llama base * add lora and w8a8 * support awq * add add_rms_norm kernel * optimize step context * attn meta as input * add cuda graph support * disable one of mha kernel * share graph pool * del graph * update docstring * awq cudagraph * merge main * support llava for llama * fix support cudagraph flag * support lora cudagraph * support logit softcapping * support transformers 4.43 * fix ut * fix dynamic ntk cudagraph * add moe support * add custom module support * optimize awq kernel * optimize attention * fix graph runner * optimize prefill * fix response * optimize prefill * adjust grid of paged attention * add attention stages * support llama3 * optimize apply rotary * rename * fix sampling * remove print * prepare for new weight loader * refactor add model * optimize nn * fix linear device * support baichuan 7b 13b * support deepseekv2 no-tp * support deepseek v2 tp * add log * fix ut * support chatglm * support llava * add falcon * add internlm2 and mistral * add gemma/gemma2 * add deepseek, qwen1 * remove request timeout * add qwen2, qwen-moe * add starcoder2 phi-3 phi-3 vision * support phi3 moe * support dbrx * support internvl * support merged awq weight * add cogvlm * update docs * fused layernorm * add gelu and mul * support triton==3.0.0 * update names * fix * cogvlm2 * fix * fix * fix internlm2 awq * rename * fix a hanging problem when using cli serve mode and device ascend on exit * raise -> return * optimize moe * fix linear awq bias, default awq kernel * fix * optimize default awq * fix llama rope, add internlm * optimize decoding * recovery attention * fix fill kv cache * fix internlm oom * fix llama3 memory usage * remove float deepseekv2 * fix llama3 * update smooth quant flag * fix w8a8 * fix w8a8 tp --------- Co-authored-by: grimoire <yaoqian@pjlab.org.cn> Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
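The CUDA graph support mentioned above follows the usual capture-and-replay pattern: capture one decode step into a graph and replay it with new inputs copied into static buffers. The sketch below is the generic torch.cuda.CUDAGraph recipe, not the engine's graph runner.

```python
# Hedged sketch of CUDA graph capture and replay for a fixed-shape decode step.
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()  # stand-in for a decode step
static_in = torch.zeros(1, 4096, device="cuda")

# warm up on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

static_in.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()  # rerun the captured kernels on the new input
print(static_out.shape)
```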
* add support for llama ascend 910b using torch layers * feat(ascend): modify the usage of ascend kernels to fit torch.compile * refactor for llama * format code * add AscendSoftmaxTopKBuilder * feat: support ascend mixtral * refactor step_context for fusion attention * fix ascend op_backend after rebase * add comment * format code * remove unused param in rms_norm
…nsor parallelism (InternLM#2454) * fix pp * fix lint
* fix minp * better test params
* bump version to 0.6.0 * update readme * update supported models * update get_started on ascend platform
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through it.
Modification
Please briefly describe the modifications made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist