
Main #2

Merged: 65 commits into develop on Sep 15, 2024

Conversation

zyearw1024
Owner

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and easier to review. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation for this PR and the goal you want to achieve with it.

Modification

Please briefly describe the modifications made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break backward compatibility with downstream repositories?
If so, please describe how compatibility is broken and how downstream projects should modify their code to stay compatible with this PR.

Use cases (Optional)

If this PR introduces a new feature, please list some use cases here and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification depends on a newer version of downstream projects, this PR should be tested with all supported versions of those projects.
  4. The documentation has been modified accordingly, such as docstrings or example tutorials.

RunningLeon and others added 30 commits August 9, 2024 14:48
…nternLM#2245)

* support vlm custom parameters in openai input format

* remove flash_attn deps

* update

* update

* update
…ig (InternLM#2275)

* fix side-effect: failed to update tm model config with tm engine config

* fix
* Support send tool_calls back to internlm2

* update documents

* condition
* fix template

* add max_dynamic_patch custom setting

* fix test

* update docs

* update docs

* remove unnecessary process

* update link
…M#2240)

* fix the issue missing dependencies in the Dockerfile and pip

* reset dependencies

* reset compose images tag

* add InternVL_Dockerfile

* nvidia/cuda image should be as low as possible; as of nccl 2.22.3, the minimum supported CUDA version is 12.2. If the nvidia driver version on the host machine is higher than the image's, torch in the image will not work

* remove a line

* fix the apt error in InternVL_Dockerfile

* apt add -y

* change InternVL_Dockerfile base image tag

* roll back .\requirements\test.txt to 7c4e75b

* add rust build tools in the Dockerfile to fix the bug that, for gradio>4.40.0, its dependency orjson must be built with cargo

* run pre-commit to fix; and add internvl docs

* move rust install command to where installing sys packages

* fix internVL docs layout

* remove tritonclient[grpc] from serve.txt

* fix docs about InternVL

* fix en docs about internVL markdown format

* fix en docs about internVL markdown format

* update

* update docs of internVL by H.Lyu

* fix supported inference engine info for InternVL

* remove nccl installation since it is already in the docker image

* remove nccl install and change base image tag

* change base image to 12.4.1

---------

Co-authored-by: lvhan028 <lvhan_028@163.com>
* preprocess for kv-int8

* working kv-int8

* minor

* working kv-int4

* optimize kv-int4

* optimize kv-int4

* optimized SIMT f16/u8/u4 decoding

* fix tc decoding

* int8 tc decoding

* int4 tc decoding

* minor

* optimize

* optimize tc kv-int4/int8

* fix `sm_75`/`sm_70`

* simplify

* bf16+kv4/8

* support more mma instruction

* refactor

* dispatching

* integration

* remove offline kv params

* fix msvc build

* fix msvc build

* fix lint

* fix lint

* fix cmake

* fix lint

* fix lint

* minor

* refactor

* gemm baseline

* optimize

* minor

* tb swizzle

* minor

* tune

* minor

* wip

* minor

* fp16 transcription

* optimize

* tune

* adjust layout

* optimize

* tune

* refactor

* refactor

* f16xs4/8 gemm

* refactor

* dequant

* fix Q

* fix Q

* end-to-end test

* optimize Q

* pack Q

* tune

* split-k

* sliced-k

* fix Q

* add `transpose_m8n8_b32`

* tune gemm

* predicate support

* tune

* dispatch

* dispatch v2

* automatic tuning

* nvbench

* better API

* GPU metrics

* update cost model

* add simt impl

* add 16816

* add 884

* refactor

* smem copy

* minor

* NT & NN

* transformation

* refactor

* refactor

* add UV

* refactor testbed

* working fp16 packing

* update

* use `(m, k)` & `(n, k)`

* simplify

* dispatch for conversion

* refactor

* refactor

* refactor

* simplify

* refactor quantization

* quantization

* fix `pack_cnt_m/k`

* `uint8_t`

* `uint4_t`

* symmetry

* refactor

* large pack

* fix `SmemCopy` for packed inputs

* tune

* SIMT

* SIMT packing

* SIMT int8

* SIMT int4

* fix group size

* mma.m8n8k4

* clean-up

* refactor epilogue

* fix smem layout for C

* tune epilogue

* TN

* optimize

* fix `_src_step_k`

* use raked partition

* fix `Tiled_MMA_v2` & optimize smem copy

* working w4a16

* add missing

* fuse up and gate

* fused silu

* `sm75` and `sm70`

* cache policy

* remove unused

* col major output

* fix tiling of C

* wip

* wip

* wip

* fix iterator

* update

* update kernel signature

* fix packing

* update

* refactor

* update

* update

* update

* alpha beta

* set beta

* fix & clean-up

* check max splits & add qwen

* add tp

* refactor `LlamaLinear`

* share linear layer

* tuning interface

* update

* skip nvbench for MSVC

* define `uint` when needed

* fix

* fix

* fix

* update

* disable large kernels

* fix

* refactor model conversion

* fix lint

* simplify target model

* refactor model import

* minor

* pad `inter_size` for tp

* refactor

* skip `sm_80` and `sm_90` on MSVC

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix msvc build

* fix cu12 / sm90 build

* fix

* fix ut

* fix missing include

* support GPTQ models

* fix ut

* parse tuning args

* minor

* minor

* add missing header

* add missing headers

* fix converter

* fix internvl reader initializer

* fix

* tuning

* remove unused

* tuning

* minor

* fix lint

* fix lint

* fix lint

* minor

* fix lint

* fix baichuan2-7b, deepseek-vl and xcomposer2d5-4bit

* tune sm_70

* optimize sm70 & fix converter

* optimize v100

* fix lint

* RTX 4090

* fix lint

* refactor & batch_dim support

* A100

* `TuningParams`

* lint

* lint

* minor

* switch to m-major MMA for sm70

* recognize GPTQ models

* RTX 2080 & GTX 1660

* fix missing return

* fix cu12 build for sm90

* fix ptr of operand C

* disable cache eviction policy on sm_90

* fix lint

* add refs

* fix lint

* lint
* Update error status_code to raise error in openai client

* remove strict
* remove device-type in cli

* remove device arg from lite cli
* add test utils

* lint

* fix msvc build

* fix msvc build

* lint
…nternLM#2325)

* fix getting quantization config

* recursive get quant_config

* minor fix

* check 128
* Fix hidden size and support mistral nemo

* fix lint

* comments
* Support custom logits processors

* support logit_bias for pytorch engine

* mv to fused logits_processors

* replace input_ids with all_ids

* type hint
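
The custom logits processor and `logit_bias` commits above describe a per-step hook that rewrites the logits before sampling. Below is a minimal conceptual sketch; the `(all_ids, scores)` callable signature and the way such a processor would be registered with the engine are assumptions for illustration, not the actual lmdeploy interface.

```python
# Conceptual sketch only: a fused logit-bias processor in the spirit of the
# commits above. The (all_ids, scores) signature and any registration hook
# are assumptions, not the real lmdeploy API.
import numpy as np

def logit_bias_processor(all_ids: np.ndarray, scores: np.ndarray,
                         bias: dict) -> np.ndarray:
    """Add a fixed bias to the logits of selected token ids."""
    for token_id, value in bias.items():
        scores[..., token_id] += value
    return scores

# Example: penalize token 13 and boost token 42 for the next decoding step.
logits = np.zeros((1, 32000), dtype=np.float32)
logits = logit_bias_processor(np.array([[1, 2, 3]]), logits,
                              {13: -100.0, 42: 5.0})
```
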
* optimize prefill

* merge main, update attention implementation

* check window

* request no response

* fix response

* staged attention

* optimize prefill
* refactor turbomind

* minor

* use `size_t` for size type

* fix

* minor

* print split-fuse param
* support model convert

* update template and vision model

* update docs

* update README
…nternLM#2353)

* feat(server): enable `seed` parameter for openai compatible server.

* refactor: fix format issue
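
A minimal sketch of using the new `seed` parameter through the OpenAI-compatible server; the server URL (default port 23333), API key, and served model name below are placeholders, not values taken from this PR.

```python
# Sketch: fixing the sampling seed via the OpenAI-compatible endpoint.
# Server URL, API key and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
resp = client.chat.completions.create(
    model='internlm2_5-7b-chat',        # served model name (placeholder)
    messages=[{'role': 'user', 'content': 'Tell me a joke.'}],
    temperature=0.8,
    seed=42,                            # same seed -> reproducible sampling
)
print(resp.choices[0].message.content)
```
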
lvhan028 and others added 29 commits August 28, 2024 18:09
* move get_started and installation to get_started directory

* update
* build(ascend): add Dockerfile for ascend aarch64 910B

* ci(ascend): skip ascend dockerfile codespell check

* update lmdeploy tag and transformers version in Dockerfile_aarch_910B
* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

---------
* support ascend using infer_ext

* fix(ascend): make infer_ext use TND format q,k,v in paged_token_attention

* support ascend using infer_ext

* feat: support ascend moe_gating_topk_softmax

* feat: change infer_ext ops function param order (#2)

* ascend: align attention mask to 32bytes (InternLM#7)

* fix attn args (InternLM#9)

* fix: expand shape of attn_mask (InternLM#10)

* feat: update infer_ext ops interface (InternLM#13)

* rename infer_ext to dlinfer

* format code

* Support internlm 2.5 (InternLM#14)

* refactor ascend pagedattention

* fix ascend apply_rotary_pos_emb

* fix import dlinfer (InternLM#16)

* fix: fix rms_norm params (InternLM#18)

* fix sync on ascend

---------

Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
Co-authored-by: CyCle1024 <ccy_justin@163.com>
Co-authored-by: Wei Tao <1136862851@qq.com>
Co-authored-by: jinminxi104 <jinminxi104@hotmail.com>
Co-authored-by: pdx1989 <pdx1989@gmail.com>
* support do_sample parameter

* merge GenerationConfig & EngineGenerationConfig

* align gen_config with logic with transformers

* add comments

* fix comments

* fix comments

* rename stop_words_ids -> stop_token_ids

* update tests
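
These commits merge `GenerationConfig` and `EngineGenerationConfig` and rename `stop_words_ids` to `stop_token_ids`. A minimal pipeline sketch using the merged config follows; the model path and stop token id are placeholders, and the exact field set is assumed from the commit messages above.

```python
# Sketch of the merged GenerationConfig; model path and stop token id are
# placeholders, field names follow the renames described in these commits.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')   # placeholder model
gen_config = GenerationConfig(
    do_sample=True,         # sampling toggle aligned with transformers
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=256,
    stop_token_ids=[2],     # renamed from stop_words_ids
)
print(pipe(['Hello, who are you?'], gen_config=gen_config))
```
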
* use yaml config instead of ini config

* remove comments

* update

* remove unused parameters in TurbomindModelConfig

* fix test error

* fetch yaml-cpp instead of find_package(yaml-cpp)

* remove INIReader and change config to yaml format

* update converter

* remove use_logn_attn from TurbomindEngineConfig since it is rarely used

* update

* update

* not save engine_config to yaml

* fix _from_workspace

* fix ut

* fix _from_hf

* _postprocess_config

* fix awq model inference

* update

* update

* fix chat

* fix lint

* fix lint

* config_to_dict

* fix lint

* minor

* minor

* remove allow_none

* minor

* fix lint

* typo

* fix

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
* Add guided decoding

* miss a file

* update server

* add documents

* RegexFSM -> RegexGuide

* fix internlm tokenizer

* lint

* return if guided_input_ids is None

* update doc

* update

* remove stream=False

* Add outlines to requirements
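
The guided decoding commits build on outlines' `RegexGuide` and expose structured output through the server. A heavily hedged request sketch is below; whether the server accepts exactly this `response_format` payload, and the field names inside it, are assumptions for illustration.

```python
# Hedged sketch: asking the OpenAI-compatible server for schema-constrained
# output. The response_format payload shape is an assumption.
from openai import OpenAI

schema = {
    'type': 'object',
    'properties': {'name': {'type': 'string'}, 'age': {'type': 'integer'}},
    'required': ['name', 'age'],
}
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
resp = client.chat.completions.create(
    model='internlm2_5-7b-chat',   # served model name (placeholder)
    messages=[{'role': 'user', 'content': 'Describe a person as JSON.'}],
    response_format={'type': 'json_schema',
                     'json_schema': {'name': 'person', 'schema': schema}},
)
print(resp.choices[0].message.content)
```
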
* import dlinfer before imageencoding

* import dlinfer if device_type is set to ascend

* fix isort error
* add ascend_readme_en

* add ascend_readme_zh_cn

* fix ascend_readme typo

* fix typo

* fix indent&typo

* fix indent again

* fix indent again
…M#2419)

* update actions/download-artifact to 4.1.7 to fix security issue

* update

* debug

* upgrade upload-artifact v4

* update
* ignore *.pth when downloading models from the model hub

* update
* handle invalid images

* fix test

* update encode_image_base64

* use image instead

* update tests
* split lm_head

* split token_embed

* fix memory malloc and load from converted model

* support tie_word_embeddings

* split tok_embeddings along hidden dim

* Revert "support tie_word_embeddings"

This reverts commit b7dc61b.

* remove unused

* fix windows build

* fix loading from workspace

* remove the use of ConcateSlice

* remove ConcateSlice class

* use context_decoder_output_buf_ buffer

* use tp in config.yaml

* add sync_check_cuda_error

* update

* remove unused header

* update check

* remove unused
* build: update ascend dockerfile

* add description of run file in ascend dockerfile

* use multi-stage build to copy lmdeploy without run file
* fix modelscope

* fix llava model when input images have size (x, 1)

* larger interval

* skip get_font for xcomposer2d5

* fix custom image token position

* fix potential mismatching issues

* update docs
* support pytorch backend min_p

* support turbomind backend min_p

* fix comments

* remove unused header

* remove inplace

* use _filter_minp_sorted_

* remove end_ids from sampling

* use larger grid size in invokeTopPSortInitialize

* skip softmax for topk request

* use const

* use nullptr

* use eps

* fix pr test
…ot specified (InternLM#2434)

* automatically set max_batch_size

* update

* fix

* update

* update
* attn layer

* move to backend

* add base layer

* finish llama base

* add lora and w8a8

* support awq

* add add_rms_norm kernel

* optimize step context

* attn meta as input

* add cuda graph support

* disable one of mha kernel

* share graph pool

* del graph

* update docstring

* awq cudagraph

* merge main

* support llava for llama

* fix support cudagraph flag

* support lora cudagraph

* support logit softcapping

* support transformers 4.43

* fix ut

* fix dynamic ntk cudagraph

* add moe support

* add custom module support

* optimize awq kernel

* optimize attention

* fix graph runner

* optimize prefill

* fix response

* optimize prefill

* adjust grid of paged attention

* add attention stages

* support llama3

* optimize apply rotary

* rename

* fix sampling

* remove print

* prepare for new weight loader

* refactor add model

* optimize nn

* fix linear device

* support baichuan 7b 13b

* support deepseekv2 no-tp

* support deepseek v2 tp

* add log

* fix ut

* support chatglm

* support llava

* add falcon

* add internlm2 and mistral

* add gemma/gemma2

* add deepseek, qwen1

* remove request timeout

* add qwen2, qwen-moe

* add starcoder2 phi-3 phi-3 vision

* support phi3 moe

* support dbrx

* support internvl

* support merged awq weight

* add cogvlm

* update docs

* fused layernorm

* add gelu and mul

* support triton==3.0.0

* update names

* fix

* cogvlm2

* fix

* fix

* fix internlm2 awq

* rename

* fix a hang on exit when using cli serve mode with device ascend

* raise -> return

* optimize moe

* fix linear awq bias, default awq kernel

* fix

* optimize default awq

* fix llama rope, add internlm

* optimize decoding

* recovery attention

* fix fill kv cache

* fix internlm oom

* fix llama3 memory usage

* remove float deepseekv2

* fix llama3

* update smooth quant flag

* fix w8a8

* fix w8a8 tp

---------

Co-authored-by: grimoire <yaoqian@pjlab.org.cn>
Co-authored-by: chenchiyu <chenchiyu@pjlab.org.cn>
* add support for llama ascend 910b using torch layers

* feat(ascend): modify the usage of ascend kernels to fit torch.compile

* refactor for llama

* format code

* add AscendSoftmaxTopKBuilder

* feat: support ascend mixtral

* refactor step_context for fusion attention

* fix ascend op_backend after rebase

* add comment

* format code

* remove unused param in rms_norm
* bump version to 0.6.0

* update readme

* update supported models

* update get_started on ascend platform
zyearw1024 merged commit 4063b26 into develop on Sep 15, 2024
10 of 15 checks passed