Adjust the flashinfer llama model to accommodate the baichuan model #40
Merged
Conversation
alfredgui2 added a commit that referenced this pull request on Jul 6, 2024:
Adjust the flashinfer llama model to accommodate the baichuan model
tjluyao added a commit that referenced this pull request on Jul 7, 2024:
Init fix: cleanup Add load testing Refactored gRPC interface Added validation logic ValidationError was not correctly handled Use axum feat: Docker image feat: Add AML deployment Update aml deployment feat: Improve error handling feat: Add arguments to CLI v0.1.0 fix(validation): Fix error messages feat(router): Add max_waiting_tokens Create LICENSE (#2) feat(server): Use safetensors Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(client): Simplify sharded logic feat(server): Support bitsandbytes feat(server): Support all AutoModelForCausalLM on a best effort basis feat: Use json formatter by default in docker image fix(models): Revert buggy support for AutoModel feat(server): Support generic AutoModelForCausalLM feat(server): Support AutoModelForSeq2SeqLM feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard feat(server): Improved doc fix(server): Fix Transformers fork version feat(server): Clarify CausalLMBatch concatenate method feat(rust): Update to 1.65 fix(router): Fix HTTP status codes fix(readme): Typo fix(router): Handle tokenizer errors feat(server): Support Galactica (#4) fix(batching): Avoid theoretical hang in batcher loop (#5) - Avoid theoretical hang in batcher loop - Avoid a couple of clones in the router generate method - Keep attention mask tensors as integers - Remove num_heads attribute Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com> feat(server): Add model tests (#6) fix(server): Only pad to multiple of 8 on GPUs feat: Support stop sequences (#7) feat: Return logprobs (#8) feat(launcher): Add integration tests (#9) fix(server): Fix stop sequences (#11) fix(server): Check for device type correctly when determining initial padding (#16) AFAIK there is no torch device type called "gpu". fix(router): Include special tokens when tokenizing (#14) There's currently a discrepancy in the tokenization between the router and python server code. The latter includes special tokens but former does not. This results in a token count mismatch for seq2seq models such as mt0 where the tokenizer emits an EOS token at the end. This in turn results in some unexpected/incorrect output, in particular when batch concatenation is involved, because the python code uses the input length passed from the router for each row. As far as I can tell, it is better to include this token in the encoder `input_ids`, so I guess it's best to just adjust on the router side. feat(router): Add const parameters to validation logic (#15) I noticed some opportunity to collapse some of the logic, in case you are interested. fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13) Fixes #12 in the easiest way I could think of. 
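The lossless-decoding fix quoted above comes down to a single decode flag. A minimal sketch of the difference, assuming `transformers` is installed and using `bigscience/mt0-small` purely as an illustrative checkpoint (the keyword in current `transformers` is spelled `clean_up_tokenization_spaces`):

```python
from transformers import AutoTokenizer  # assumes transformers is available

# Illustrative model only; mt0 is the seq2seq family mentioned above.
tok = AutoTokenizer.from_pretrained("bigscience/mt0-small")
ids = tok("Hello , world !", add_special_tokens=True).input_ids

# With cleanup enabled, decode() may merge the spaces around punctuation,
# so decode(encode(x)) is not guaranteed to round-trip the original text.
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))
# With cleanup disabled, the decoded text stays closer to the raw tokens.
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
```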
feat(launcher): Log server stdout (#19) Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): Minor refactorization using new_zeros (#24) - Fix some type hints, in particular base tokenizer class - Make use of `tensor.new_zero/empty` methods - Simplify env var string parsing in launcher fix(router): Obey max batch size (#23) feat(server): Support SantaCoder (#26) fix(server): Fix position ids (#28) feat(docker): Make the image compatible with api-inference (#29) fix(docker): fix api-inference deployment (#30) fix(router): fix api-inference deployment (#31) fix(dockerfile): fix docker build (#32) feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33) feat(router): Remove second lock from batcher hot path (#27) @njhill feat: Support sampling seeding (#37) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com> feat: Add token streaming using ServerSideEvents support (#36) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is: ```rust struct Details { finish_reason: String, generated_tokens: u32, seed: Option<u64>, } struct StreamResponse { token: Token, generated_text: Option<String>, details: Option<Details>, } struct ErrorResponse { error: String, } ``` Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36 fix(server): fix seeding on gpu (#42) fix(server): fix seeding with multiple shards (#44) feat: Add token streaming using ServerSideEvents support (#41) fix(server): fix quantization for sharded models (#45) feat(server): Support GPT-Neox (#39) feat(ci): Docker build and push (#46) feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48) feat(server): support repetition penalty (#47) feat(server): allow the server to use a local weight cache (#49) fix(server): allow greedy repetition penalty (#51) feat(router): use background task to manage request queue (#52) Co-authored-by: Nick Hill <nickhill@us.ibm.com> breaking(router): modify /generate API to only return generated text (#50) @njhill, @yk FYI generated_text was concatenated to the user prompt for legacy reason. We want to remove this behaviour as we don't think it is useful and even detrimonial to usability. We also remove the unused Vec. 
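Given the `StreamResponse` shape quoted above, the stream can be consumed as ordinary server-sent events. A minimal Python sketch, assuming a server on localhost:8080, the `/generate_stream` route named elsewhere in this log, and that `Token` exposes a `text` field:

```python
import json
import requests  # any HTTP client that can stream the response body works

# Hypothetical local deployment; adjust host/port to your setup.
url = "http://localhost:8080/generate_stream"
payload = {"inputs": "hey llama", "parameters": {"max_new_tokens": 16}}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue  # skip keep-alives and non-data SSE lines
        event = json.loads(line[len(b"data:"):])
        print(event["token"]["text"], end="", flush=True)
        if event.get("generated_text") is not None:
            # The final event carries the full text plus the Details payload.
            print("\n---", event.get("details"))
```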
feat(router): refactor API and add openAPI schemas (#53) feat(docs): Clarify installation steps (#54) Adds some bits for first-time users (like me 😄 ) feat(ci): push to AML registry (#56) fix(server): better handling of inference mode (#57) V0.2.1 (#58) feat(server): support t5 (#59) fix(docker): increase shm size (#60) fixed SSE naming (#61) https://en.wikipedia.org/wiki/Server-sent_events feat: add distributed tracing (#62) feat: add safetensors conversion (#63) feat(server): improve download logging (#66) feat(launcher): add disable_custom_kernels arg (#67) feat(router): add max_total_tokens and empty_input validation (#68) closes #65 fix(launcher): copy current env vars to subprocesses (#70) closes #69 feat(router): add prometheus metrics scrape endpoint (#71) v0.3.0 (#72) feat(router): add cors allow origin options (#73) feat(server): enable hf-transfer (#76) fix(server): remove position_ids from galactica forward (#82) closes #80 feat(server): pre-allocate max attention mask (#75) v0.3.1 (#84) feat(server): add special token bool (#85) fix(docs): fix openapi schema (#86) fix(server): fix token_is_special (#87) feat(router): add legacy route for api-inference support (#88) feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89) feat(router): add api-inference headers (#91) feat(server): add logits watermark (#90) feat(server): update to hf_transfer==0.1.2 (#93) feat(ci): improve CI speed (#94) fix(launcher): add router parameters to launcher (#95) feat(server): fix transformers commit (#96) v0.3.2 (#97) fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100) feat: allow local models (#101) closes #99 feat: add supported models (#102) feat(clients): Python client (#103) fix(server): fix galactica batch (#106) closes #105 feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107) feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108) fix(python-client): stream not set on the sync client (#109) fix(server): fix index out of range for watermarking (#110) feat: support typical sampling (#114) closes #112 fix(server): do not warp prefill logits (#116) feat(router): support left truncation (#115) closes #111 feat(router): add best_of parameter (#117) feat(python-client): add new parameters (#118) v0.4.0 (#119) feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122) …ed models fix(server): revert gpt-neox optims (#123) fix(server): add position ids to neox (#126) fix(server): use server tokenizer as gt (#128) fix(python-client): relax dependencies (#129) feat(python-client): add cookies to Client constructors and requests (#132) I have a use case where we need to pass cookies (for auth reasons) to an internally hosted server. Note: I couldn't get the client tests to pass - do you need to have an HF token? 
```python FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid ``` feat(ci): add ci paths (#134) feat: Add note about NVIDIA drivers (#64) Co-authored-by: OlivierDehaene <olivier@huggingface.co> feat(python-client): release v0.4.0 (#135) feat(python-client): add CI (#136) feat(server): flash neoX (#133) fix(server): fix flash-neox scores warping (#137) feat(server): cleanup flash neox loading (#139) v0.4.1 (#140) fix(server): Avoid using try/except to determine kind of AutoModel (#142) feat(server): Add mypy-protobuf (#141) Generates .pyi files for protobuf stubs which provide strong typing information. Very helpful for IDE auto-completion, etc. feat(server): clear cache on error (#143) feat(server): reduce mlp and attn in one op for flash neox (#145) feat: aws sagemaker compatible image (#147) The only difference is that now it pushes to registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:... instead of registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-... --------- Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com> fix(ci): fix sagemaker action (#148) feat(benchmark): tui based benchmarking tool (#149) fix(server): fix flash neox rotary embeddings (#150) v0.4.2 (#151) v0.4.3 (#152) feat(server): flash santacoder (#153) docs(readme): provide link Logits Warper README (#154) fix(server): fix escape characters in stop sequence (#155) feat(docker): improve flash_attention caching (#160) feat(launcher): allow disabling hf_transfer (#161) fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162) fix(router): use buckets for metrics histograms (#163) feat(router): make router input validation optional (#164) feat(server): add flash attention llama (#144) feat(server): support OPT models (#55) OPT models do not all have a `tokenizer.json` file on the hub at the moment. Can't merge for now. 
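The change in #142 above swaps try/except probing of the Auto classes for an explicit check. A generic sketch of that idea (not the repository's actual loader), using only standard `transformers` Auto classes:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForSeq2SeqLM

def load_auto_model(model_id: str):
    # Read the config first and dispatch on it, instead of attempting
    # AutoModelForCausalLM and falling back to Seq2Seq on an exception.
    config = AutoConfig.from_pretrained(model_id)
    model_cls = AutoModelForSeq2SeqLM if config.is_encoder_decoder else AutoModelForCausalLM
    return model_cls.from_pretrained(model_id)
```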
v0.5.0 (#168) feat(server): optimize decode for sane tokenizers (#170) feat(server): support sharded santacoder (#167) fix(launcher): revert change on shard errors (#173) fix(ci): fix CVE in github-slug-action (#174) feat(ci): add image signing with cosign (#175) feat(ci): add Trivy and scan docker image (#178) feat(ci): use large runners (#179) feat(ci): faster scanning (#180) fix(ci): fix ci permissions (#181) fea(dockerfile): better layer caching (#159) fix(ci): fix cosign error (#183) fix(docker): fix docker image (#184) fix(docker): fix image (#185) fix(docker): revert dockerfile changes (#186) fix(docker): fix docker image dependencies (#187) fix(router): fix truncation (#190) closes #189 feat(python-client): get list of currently deployed tgi models using the inference API (#191) feat(router): add info route (#196) close #125 feat(server): support quantization for flash models (#200) closes #197 feat(server): check cuda capability when importing flash models (#201) close #198 fix(server): fix hf_transfer issue with private repos (#203) fix(docker): remove unused dependencies (#205) fix(router): add auth token to get model info (#207) feat(router): add git sha to info route (#208) feat(router): drop requests when client closes the channel (#202) fix(ci): fix sha in docker image (#212) feat(server): flash attention past key value optimizations (#213) feat(router): add device and dtype info (#215) fix(server): fix past key values logic (#216) @njhill fyi fix(server): cleanup new flash past_key_values logic (#217) fix(server): fix flash causal (#218) fix(server): fix flash causal (#219) fix(server): fix flash batch filtering (#220) misc: update to rust 1.69 (#221) v0.6.0 (#222) feat(server): reduce memory requirement (#214) chore(server): update huggingface-hub (#227) feat(router): use number of tokens in batch as input for dynamic batching (#226) Co-authored-by: Nick Hill <nickhill@us.ibm.com> feat(router): add endpoint info to /info route (#228) chore(server): update safetensors version (#235) fix(python-client): add auth headers to is supported requests (#234) Starting some routing tests. (#233) fix(benchmarking): fix benchmarking tool chore(launcher): refactor logic (#242) Hopefully it's cleaner feat(router): add tests to validation (#237) feat(router): new healthcheck that skips the queue (#244) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): fix reshaping of bloom past_key_values in concatenate() (#252) Introduced in #214 Fixes #249 fix(server): Small tidy of code from recent changes (#251) remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter() chore(server): update transformers (#250) feat(server): add watermarking tests (#248) feat(docker): add nvidia env vars (#255) doc(launcher): add more docs to the `launcher` itself and link in the README (#257) feat(benchmark): add support for private tokenizers (#262) Adding docs on how dynamic batching works. (#258) This PR starts the minimal possible amount of explanation I could think of. It tries to explain how dynamic batching occurs, the interactions with past key values and ignores the padding problem. Maybe some drawings could help too but I kept it to text for now. 
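The dynamic-batching behaviour described above, driven by token counts since #226, can be summarised as a toy admission check; this is an illustrative sketch rather than the router's Rust implementation, and `max_batch_total_tokens` is borrowed from the launcher option of the same name that appears later in this log:

```python
from collections import deque

def admit_waiting_requests(running_tokens: int,
                           waiting: "deque[int]",
                           max_batch_total_tokens: int) -> list:
    """Pop waiting requests (given as token counts) while the batch-wide
    token budget allows it; whatever does not fit stays queued."""
    admitted = []
    while waiting and running_tokens + waiting[0] <= max_batch_total_tokens:
        tokens = waiting.popleft()
        running_tokens += tokens
        admitted.append(tokens)
    return admitted

# Example: a running batch already using 900 tokens against a 1024-token budget.
print(admit_waiting_requests(900, deque([100, 50, 200]), 1024))  # -> [100]
```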
chore(github): add templates (#264) fix(server): fix typo in tokenizers decode (#269) closes #268 feat(server): support hf endpoint weight layout (#266) fix(launcher): pass weights cache override to the download process (#274) closes #273 fix(launcher): handle hub branches (#278) fix(server): Removes the parallelism in file convertion (during download) (#275) feat(launcher): Improve error message when download process fails. (#276) fix(server): fix convert (#284) chore: add `flash-attention` to docker ignore (#287) included when building docker locally. (Where the local dirs might have the flash-attention folder.) fea(server): decrease convert RAM requirements (#286) fix(dockerfile): fix nvidia env vars (#297) Fixes #291 feat(router): Adding response schema for compat_generate (#292) feat(docker): add benchmarking tool to docker image (#298) fix(docker): fix docker build (#299) feat(server): optim flash causal lm decode_token (#285) fix(docker): fix nvidia env vars (#305) fix(docker): remove nvidia require cuda env (#310) feat(server): shard token decode (#303) feat(server): use float16 (#304) fix(docker): remove CUDA_VERSION feat(server): use cuda graph in logits warping (#302) fix(server): fix multinomial implem in Sampling feat(server): GPTQ quantization (step1) (#277) Changes only the type from `bool` to `Option<Enum>` pretty much everywhere. - Use `Optional[str]` in Python (easier to manage than importing type everywhere). Except for the cli to get proper validation - Updated all models to handle gracefully new values. (Error out if unknown value, or gptq since not implemented). chore(docker): use nvidia base image (#318) fix(docker): remove quantize default fix(docker): use ubuntu20.04 Hotfixes for santacoder/bigcode. (#294) Hotfixes: - Uses `model_type`=`gpt_bigcode` for more general usage. - Hotfixes linked lm_head vs wte_embedding (safetensors file do not contain the key, correctly when the file is sharded, where as pytorch copies the tensor) --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Lifting check_unitialized. (#325) Lifting check_unitialized. Removing dead variables. (#327) feat(ci): custom gpu runners (#328) Single place for TP layers + Dropout Layer Norm + FastLinear (#329)
<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat: add snapshot testing (#282) feat(integration-tests): improve comparison and health checks (#336) fix(server): fix decode token (#334) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> fix: set MODEL_ID in sagemaker-entrypoint script (#343) feat(server): Support BLOOMChat-176B (#348) (#351) @njhill, temporary workaround to be able to run our CI as secrets are not available to runners run by external contributors. I will ask around to see if there is a better way. Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): fix init for flash causal lm (#352) Fixes #347 fix(server): t5 cannot run in f16 (#356) Fix #349 fix(ci): fix security group (#359) Switch security group used for ci (open outbound rules) Signed-off-by: Raphael <oOraph@users.noreply.github.com> Co-authored-by: Raphael <oOraph@users.noreply.github.com> feat: add nightly load testing (#358) chore(sever): update requirements (#357) Fixes #338 feat(server): support fp16 for t5 (#360) Fixes #349 feat(server): do not use device_map auto on single GPU (#362) feat(server): support trust_remote_code (#363) feat(router): log input/ouput at debug level (#364) @njhill FYI v0.7.0 (#353) feat: decrease IPC proto size (#367) Closes #307 #308 feat(benchmarker): add summary tables (#368) feat(server): support vectorized warpers in flash causal lm (#317) Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com> Fix issue when load AutoModelForSeq2SeqLM model (#370) fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(server): fix quantization feat(server): support RefinedWeb models (#379) v0.8.0 increase health checks feat(server): add retry on download (#384) fix(server): fix bnb quantization for CausalLM models (#385) v0.8.1 fix(server): fix has_position_ids (#395) Fix #389 feat(server): remove trust_remote_code requirement for falcon models (#396) feat(server): load santacoder/starcoder models with safetensors (#393) Fix #366 v0.8.2 feat(sagemaker): add trust remote code to entrypoint (#394) feat(launcher): parse oom signal (#404) feat(server): only compute prefill logprobs when asked (#406) Close #288 feat(server): batch tokenization for flash causal lm (#411) chore: update openapi schema feat(server): Rework model loading (#344) Reworked the loading logic. Idea is to use cleaner loading code: - Remove need for `no_init_weights` - Remove all weird `bnb_linear` and `load_weights` and `post_load_weights`. New code layout: - New class `Weights` in charge of handling loading the weights from multiple files into appropiate tensors (potentially sharded) - TP layers now are "shells", they contain the code to know what kind of sharding we need + eventual `all_reduce`. They do not inherit from linear, but they contain some kind of Linear instead - the contained linear can be either FastLinear, BnbLinear or GPTq Linear next. 
- All modeling code is explictly made for sharding, process group is just no-ops for non sharded code (removes a lot of test cases) ![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f) --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net> Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(server): optimize dist ops (#434) docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441) It solves a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelling references to this variable have been found in code logic leading to undefined behaviour or bugs. This PR is not expected to perform any code logic modification. fix(makefile): Fix typo and use POSIX comparison in the makefile (#443) This PR fixes: - The usage of non posix comparison which may fail depending on the shell used (`=` will always work, `==` only with bash) - Typo in the env variable name displayed in the error message `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS` <!-- Remove if not applicable --> Fixes #422 feat(server): pre-allocate past key values for flash causal LM (#412) feat(router): add ngrok integration (#453) feat(server): improve flash attention import errors (#465) @lewtun, is this enough? Closes #458 Closes #456 fix(server): fix warpers on CPU (#472) Closes #471 fix(server): Fixing T5 in case the names are mixed up. (#475) feat(server): Update convert logic. (#483) Should be more robust to shared tensors (ok when using `from_pretrained). But forcing us to add new checks in our loading code (since the chosen key to keep might be different from `transformers`). --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> feat(server): Adding new ignore_rule for conversion. (#485) fix(router): add timeout on flume sends (#488) feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438) Let's start discussing implementation. - Need to expose the quantization scripts (either included here or add doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa) - Make sure GPTQ works for multiple models (priority to Falcon). Currently it means that every place we use `get_{tensor|sharded}` to check for quantization. My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic. This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` should be in a single place, and independant of bias presence. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. 
If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): Do not init process group if already initialized (#388) feat(router): add header option to disable buffering for the generate_stream response (#498) generate_stream endpoint response stream. Problem: If a model is run behind a proxy server such as nginx that has buffering enabled then the response stream from generate_stream gets aggregated into a single response which basically disables streaming. Instead of getting a chunked response where each token is presented over time the response presents everything all at once. Solution: This change adds the `X-Accel-Buffering` http header which disables buffering for the generate_stream response, allowing the response to stream properly. feat(server): add paged attention to flash models (#516) Closes #478 feat(router): arg validation (#519) feat: Add the option to force another dtype than `f16`. (#513) fix(launcher): fix issue where launcher does not properly report shard failures (#522) v0.9.0 (#525) feat(server): Add Non flash MPT. (#514) This adds a non flash version of MPT. Flash is harder because we need to create a bias ready cuda kernel of flash attention. Fixes https://github.com/huggingface/text-generation-inference/issues/361 Fixes https://github.com/huggingface/text-generation-inference/issues/491 Fixes https://github.com/huggingface/text-generation-inference/issues/290 fix: Update server/Makefile to include Makefile-vllm (#520) For consistency and ease of use (you can just run `make` to install vllm without any extra steps). <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" 
below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462) fix(server): Handle loading from local files for MPT (#534) This PR allows the MPT model to be loaded from local files. Without this change, an exception will be thrown by `hf_hub_download` function if `model_id` is a local path. fix(server): avoid errors for very small top_p values (#544) See https://github.com/huggingface/transformers/pull/24111 I didn't add validation to the `__init__` method since it's not done for other values/warpers. feat(server): use latest flash attention commit (#543) @njhill FYI feat(router): add argument for hostname in router (#545) (#550) In title. Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with ```commandline cargo run -- --port 8080 --hostname :: curl -I -X GET 'http://[::1]:8080/health' # failed before this commit ``` Trigger CI --------- Co-authored-by: Phil Chen <philchen2000@gmail.com> fix(server): decrease memory fragmentation (#557) v0.9.1 (#558) fix(server): harden the weights choice to save on disk. (#561) - Look at `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights` which are the standard attributes to select the keys to NOT save on disk (since they are ignored) - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with : https://github.com/huggingface/text-generation-inference/issues/555 and : https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593 feat: better errors for warmup and TP (#575) Close #571 fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Fixes #555 feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. 
(#580) Some models are already converted, and do not have those values in the file, this enables users to use them with less friction. Went for pure env based because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitation to do: those flags would be errors if not used in conjuction with `--quantize gptq`. Then the flags need to exist in the launcher and the server passing them all throughout all function calls. This PR is intended as an easy escape hatch, not the defacto method to use gptq in TGI. Fixes #500 chore: migrate ci region for more availability. (#581) fix(server): T5 weights names. (#582) Fixes #541 fix(server): Adding logger import to t5_modeling.py (#585) Logger is referenced during the apex importing but is not imported, causing a NameError fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) This fixes a typo and extends the GPTP_BITS environment variables through to the second method which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix. feat(server): Implements sharding for non divisible `vocab_size`. (#583) - The code is relatively easy (just disable the checks on Embedding and Head) This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization another quite hairy piece of code. feat(server): empty cache on errors GPTQ Env vars: catch correct type of error (#596) When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct. feat(launcher): add arg validation and drop subprocess (#595) feat(router): explicit warning if revision is not set (#608) docs: README: Add logo + baseline (#611) ![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984) fix(server): blacklist local files (#609) Close #589 #602 v0.9.2 (#616) fix(server): empty_cache when stopped fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621) fea(launcher): debug logs (#623) feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) but should work on more configurations (no need for 2 GPUs, less RAM usage). Reworking the quantization script so it's still universal (not llama specific) but should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. 
If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat(server): flash attention v2 (#624) feat(server): add support for llamav2 (#633) v0.9.3 (#634) fix(server): fix llamav2 config (#635) feat(server): auto max_batch_total_tokens for flash att models (#630) feat(router): ngrok edge (#642) docs: Update README.md (#639) docs: Update README.md (#643) Add trust_remote_code to quantize script (#647) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes a bug appeared with MR #587 fixing issue #552. See the discussion in #552. With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, but is found in the function signature. This prevents models like falcon to be quantized, because trust_remote_code is required. This MR fixes the issue. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [X] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ --> fix(server): llama v2 GPTQ (#648) As per title & reported https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5 Test it: ``` GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq ``` & ``` curl 127.0.0.1:8080/generate \ -X POST \ -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \ -H 'Content-Type: application/json' ``` fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661) fix(server): use mem_get_info to get kv cache size (#664) Close https://github.com/huggingface/text-generation-inference/issues/649 Close https://github.com/huggingface/text-generation-inference/issues/651 Close https://github.com/huggingface/text-generation-inference/issues/653 Close #636 feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) Just trying to get the integration tests to pass. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com> Directly load GPTBigCode to specified device (#618) This PR directly load GPTBigCode to specified device, avoiding moving model between devices. This PR directly load GPTBigCode to specified device, avoiding moving model between devices. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @OlivierDehaene OR @Narsil feat(server): add local prom and health routes if running w/ ngrok feat: add cuda memory fraction (#659) Close #673 fix(server): fix exllama buffers (#689) Close #683 feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) - Current PR is not great because we're side stepping the `Weights.__init__` but Weights shouldn't requires anything related to the config or the model_id as it aims to be a simple Wrapper over multi file loading. - Ideal solution would be to use something like Rust enum ``` enum Quantize{ Bitandbytes(Bitsandbytes), GPTQ(bits: usize, groupsize: usize) ``` And passing that around during load. Unfortunately we don't have access to this, so for now, side-stepping seems easier. - Re-enabling groupsize<0 with exllama (confirmed it works.) Helps #601 In next steps we should make sure our quantization script uses that format and make it standard. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? 
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs(README): update readme fix(server): fix quantization python requirements (#708) fix(server): fix missing datasets in quantize feat(server): support new falcon config (#712) v0.9.4 (#713) Add section about TGI on other AI hardware accelerators in README (#715) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> As per title. - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs: Add hardware section to TOC in README (#721) feat(server): update vllm version (#723) chore: update license to HFOIL (#725) v1.0.0 (#727) Local gptq support. (#738) Redoes #719 <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. 
Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> Fix typing in `Model.generate_token` (#733) This PR fixes a minor type annotation issue in the signature of `Model.generate_token`. All existing overrides of `Model.generate_token` return `Tuple[List[Generation], Optional[B]]`: https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591 I suspect that back in 017a2a8c when `GeneratedText` and `Generation` were separated, the function signature was not updated. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? CC @OlivierDehaene Adding Rope scaling. (#741) - Adds Rope NTK scaling. 
Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed Took some code from https://github.com/huggingface/transformers/pull/24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something line ("linear:4.0" , or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512 <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> chore: fix typo in mpt_modeling.py (#737) Fixed typo. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> implemetation -> implementation - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update…
tjluyao added a commit that referenced this pull request Jul 7, 2024
Init fix: cleanup Add load testing Refactored gRPC interface Added validation logic ValidationError was not correctly handled Use axum feat: Docker image feat: Add AML deployment Update aml deployment feat: Improve error handling feat: Add arguments to CLI v0.1.0 fix(validation): Fix error messages feat(router): Add max_waiting_tokens Create LICENSE (#2) feat(server): Use safetensors Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(client): Simplify sharded logic feat(server): Support bitsandbytes feat(server): Support all AutoModelForCausalLM on a best effort basis feat: Use json formatter by default in docker image fix(models): Revert buggy support for AutoModel feat(server): Support generic AutoModelForCausalLM feat(server): Support AutoModelForSeq2SeqLM feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard feat(server): Improved doc fix(server): Fix Transformers fork version feat(server): Clarify CausalLMBatch concatenate method feat(rust): Update to 1.65 fix(router): Fix HTTP status codes fix(readme): Typo fix(router): Handle tokenizer errors feat(server): Support Galactica (#4) fix(batching): Avoid theoretical hang in batcher loop (#5) - Avoid theoretical hang in batcher loop - Avoid a couple of clones in the router generate method - Keep attention mask tensors as integers - Remove num_heads attribute Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com> feat(server): Add model tests (#6) fix(server): Only pad to multiple of 8 on GPUs feat: Support stop sequences (#7) feat: Return logprobs (#8) feat(launcher): Add integration tests (#9) fix(server): Fix stop sequences (#11) fix(server): Check for device type correctly when determining initial padding (#16) AFAIK there is no torch device type called "gpu". fix(router): Include special tokens when tokenizing (#14) There's currently a discrepancy in the tokenization between the router and python server code. The latter includes special tokens but former does not. This results in a token count mismatch for seq2seq models such as mt0 where the tokenizer emits an EOS token at the end. This in turn results in some unexpected/incorrect output, in particular when batch concatenation is involved, because the python code uses the input length passed from the router for each row. As far as I can tell, it is better to include this token in the encoder `input_ids`, so I guess it's best to just adjust on the router side. feat(router): Add const parameters to validation logic (#15) I noticed some opportunity to collapse some of the logic, in case you are interested. fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13) Fixes #12 in the easiest way I could think of. 
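For illustration, here is a minimal Python sketch of the lossless-decoding behaviour referenced in #13 above. It assumes only the `transformers` tokenizer API; the model name and sample string are arbitrary, and the parameter is spelled `clean_up_tokenization_spaces` in that API.

```python
# Minimal sketch, not the server's code: shows why decoding with
# clean_up_tokenization_spaces=False keeps decode(encode(text)) lossless.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model only

text = "Hello , world !"
ids = tokenizer(text)["input_ids"]

# Default cleanup re-joins punctuation ("Hello, world!"), altering the text.
cleaned = tokenizer.decode(ids, clean_up_tokenization_spaces=True)
# Disabling cleanup reproduces the original spacing.
lossless = tokenizer.decode(ids, clean_up_tokenization_spaces=False)

print(cleaned)
print(lossless)
```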
feat(launcher): Log server stdout (#19) Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): Minor refactorization using new_zeros (#24) - Fix some type hints, in particular base tokenizer class - Make use of `tensor.new_zero/empty` methods - Simplify env var string parsing in launcher fix(router): Obey max batch size (#23) feat(server): Support SantaCoder (#26) fix(server): Fix position ids (#28) feat(docker): Make the image compatible with api-inference (#29) fix(docker): fix api-inference deployment (#30) fix(router): fix api-inference deployment (#31) fix(dockerfile): fix docker build (#32) feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33) feat(router): Remove second lock from batcher hot path (#27) @njhill feat: Support sampling seeding (#37) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com> feat: Add token streaming using ServerSideEvents support (#36) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is: ```rust struct Details { finish_reason: String, generated_tokens: u32, seed: Option<u64>, } struct StreamResponse { token: Token, generated_text: Option<String>, details: Option<Details>, } struct ErrorResponse { error: String, } ``` Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36 fix(server): fix seeding on gpu (#42) fix(server): fix seeding with multiple shards (#44) feat: Add token streaming using ServerSideEvents support (#41) fix(server): fix quantization for sharded models (#45) feat(server): Support GPT-Neox (#39) feat(ci): Docker build and push (#46) feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48) feat(server): support repetition penalty (#47) feat(server): allow the server to use a local weight cache (#49) fix(server): allow greedy repetition penalty (#51) feat(router): use background task to manage request queue (#52) Co-authored-by: Nick Hill <nickhill@us.ibm.com> breaking(router): modify /generate API to only return generated text (#50) @njhill, @yk FYI generated_text was concatenated to the user prompt for legacy reason. We want to remove this behaviour as we don't think it is useful and even detrimonial to usability. We also remove the unused Vec. 
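The `StreamResponse`/`Details`/`ErrorResponse` structs above describe the SSE payload for token streaming (#36/#41). A hypothetical client consuming such a stream might look like the sketch below; the `/generate_stream` path and request payload shape follow other examples in this log, while the `data:` SSE framing is an assumption, not a guaranteed detail.

```python
# Hypothetical client sketch for the SSE token stream described above.
import json
import requests  # third-party; used only for this illustration

def stream_generate(base_url: str, prompt: str):
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 20}}
    with requests.post(f"{base_url}/generate_stream", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue  # skip keep-alives and non-data SSE lines
            event = json.loads(line[len(b"data:"):])
            if "error" in event:   # shaped like ErrorResponse
                raise RuntimeError(event["error"])
            yield event            # shaped like StreamResponse: token, generated_text, details

for event in stream_generate("http://127.0.0.1:8080", "hello"):
    print(event["token"], event.get("generated_text"))
```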
feat(router): refactor API and add openAPI schemas (#53) feat(docs): Clarify installation steps (#54) Adds some bits for first-time users (like me 😄 ) feat(ci): push to AML registry (#56) fix(server): better handling of inference mode (#57) V0.2.1 (#58) feat(server): support t5 (#59) fix(docker): increase shm size (#60) fixed SSE naming (#61) https://en.wikipedia.org/wiki/Server-sent_events feat: add distributed tracing (#62) feat: add safetensors conversion (#63) feat(server): improve download logging (#66) feat(launcher): add disable_custom_kernels arg (#67) feat(router): add max_total_tokens and empty_input validation (#68) closes #65 fix(launcher): copy current env vars to subprocesses (#70) closes #69 feat(router): add prometheus metrics scrape endpoint (#71) v0.3.0 (#72) feat(router): add cors allow origin options (#73) feat(server): enable hf-transfer (#76) fix(server): remove position_ids from galactica forward (#82) closes #80 feat(server): pre-allocate max attention mask (#75) v0.3.1 (#84) feat(server): add special token bool (#85) fix(docs): fix openapi schema (#86) fix(server): fix token_is_special (#87) feat(router): add legacy route for api-inference support (#88) feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89) feat(router): add api-inference headers (#91) feat(server): add logits watermark (#90) feat(server): update to hf_transfer==0.1.2 (#93) feat(ci): improve CI speed (#94) fix(launcher): add router parameters to launcher (#95) feat(server): fix transformers commit (#96) v0.3.2 (#97) fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100) feat: allow local models (#101) closes #99 feat: add supported models (#102) feat(clients): Python client (#103) fix(server): fix galactica batch (#106) closes #105 feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107) feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108) fix(python-client): stream not set on the sync client (#109) fix(server): fix index out of range for watermarking (#110) feat: support typical sampling (#114) closes #112 fix(server): do not warp prefill logits (#116) feat(router): support left truncation (#115) closes #111 feat(router): add best_of parameter (#117) feat(python-client): add new parameters (#118) v0.4.0 (#119) feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122) …ed models fix(server): revert gpt-neox optims (#123) fix(server): add position ids to neox (#126) fix(server): use server tokenizer as gt (#128) fix(python-client): relax dependencies (#129) feat(python-client): add cookies to Client constructors and requests (#132) I have a use case where we need to pass cookies (for auth reasons) to an internally hosted server. Note: I couldn't get the client tests to pass - do you need to have an HF token? 
```python FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid ``` feat(ci): add ci paths (#134) feat: Add note about NVIDIA drivers (#64) Co-authored-by: OlivierDehaene <olivier@huggingface.co> feat(python-client): release v0.4.0 (#135) feat(python-client): add CI (#136) feat(server): flash neoX (#133) fix(server): fix flash-neox scores warping (#137) feat(server): cleanup flash neox loading (#139) v0.4.1 (#140) fix(server): Avoid using try/except to determine kind of AutoModel (#142) feat(server): Add mypy-protobuf (#141) Generates .pyi files for protobuf stubs which provide strong typing information. Very helpful for IDE auto-completion, etc. feat(server): clear cache on error (#143) feat(server): reduce mlp and attn in one op for flash neox (#145) feat: aws sagemaker compatible image (#147) The only difference is that now it pushes to registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:... instead of registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-... --------- Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com> fix(ci): fix sagemaker action (#148) feat(benchmark): tui based benchmarking tool (#149) fix(server): fix flash neox rotary embeddings (#150) v0.4.2 (#151) v0.4.3 (#152) feat(server): flash santacoder (#153) docs(readme): provide link Logits Warper README (#154) fix(server): fix escape characters in stop sequence (#155) feat(docker): improve flash_attention caching (#160) feat(launcher): allow disabling hf_transfer (#161) fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162) fix(router): use buckets for metrics histograms (#163) feat(router): make router input validation optional (#164) feat(server): add flash attention llama (#144) feat(server): support OPT models (#55) OPT models do not all have a `tokenizer.json` file on the hub at the moment. Can't merge for now. 
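As a usage sketch for the cookies support described in #132 above, assuming the `text_generation` Python client and the `cookies` keyword that the change adds (the URL and cookie value are placeholders):

```python
# Sketch only: forwarding an auth cookie to an internally hosted endpoint,
# using the cookies support that #132 describes for the text_generation Client.
from text_generation import Client

client = Client(
    "https://tgi.internal.example.com",          # hypothetical internal deployment
    cookies={"session": "<auth-cookie-value>"},  # sent with every request
    timeout=30,
)

response = client.generate("What is Deep Learning?", max_new_tokens=20)
print(response.generated_text)
```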
v0.5.0 (#168) feat(server): optimize decode for sane tokenizers (#170) feat(server): support sharded santacoder (#167) fix(launcher): revert change on shard errors (#173) fix(ci): fix CVE in github-slug-action (#174) feat(ci): add image signing with cosign (#175) feat(ci): add Trivy and scan docker image (#178) feat(ci): use large runners (#179) feat(ci): faster scanning (#180) fix(ci): fix ci permissions (#181) fea(dockerfile): better layer caching (#159) fix(ci): fix cosign error (#183) fix(docker): fix docker image (#184) fix(docker): fix image (#185) fix(docker): revert dockerfile changes (#186) fix(docker): fix docker image dependencies (#187) fix(router): fix truncation (#190) closes #189 feat(python-client): get list of currently deployed tgi models using the inference API (#191) feat(router): add info route (#196) close #125 feat(server): support quantization for flash models (#200) closes #197 feat(server): check cuda capability when importing flash models (#201) close #198 fix(server): fix hf_transfer issue with private repos (#203) fix(docker): remove unused dependencies (#205) fix(router): add auth token to get model info (#207) feat(router): add git sha to info route (#208) feat(router): drop requests when client closes the channel (#202) fix(ci): fix sha in docker image (#212) feat(server): flash attention past key value optimizations (#213) feat(router): add device and dtype info (#215) fix(server): fix past key values logic (#216) @njhill fyi fix(server): cleanup new flash past_key_values logic (#217) fix(server): fix flash causal (#218) fix(server): fix flash causal (#219) fix(server): fix flash batch filtering (#220) misc: update to rust 1.69 (#221) v0.6.0 (#222) feat(server): reduce memory requirement (#214) chore(server): update huggingface-hub (#227) feat(router): use number of tokens in batch as input for dynamic batching (#226) Co-authored-by: Nick Hill <nickhill@us.ibm.com> feat(router): add endpoint info to /info route (#228) chore(server): update safetensors version (#235) fix(python-client): add auth headers to is supported requests (#234) Starting some routing tests. (#233) fix(benchmarking): fix benchmarking tool chore(launcher): refactor logic (#242) Hopefully it's cleaner feat(router): add tests to validation (#237) feat(router): new healthcheck that skips the queue (#244) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): fix reshaping of bloom past_key_values in concatenate() (#252) Introduced in #214 Fixes #249 fix(server): Small tidy of code from recent changes (#251) remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter() chore(server): update transformers (#250) feat(server): add watermarking tests (#248) feat(docker): add nvidia env vars (#255) doc(launcher): add more docs to the `launcher` itself and link in the README (#257) feat(benchmark): add support for private tokenizers (#262) Adding docs on how dynamic batching works. (#258) This PR starts the minimal possible amount of explanation I could think of. It tries to explain how dynamic batching occurs, the interactions with past key values and ignores the padding problem. Maybe some drawings could help too but I kept it to text for now. 
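The dynamic-batching notes above (#226, #258) describe admitting queued requests based on the number of tokens already in the batch. The following is a simplified, illustrative Python sketch of that idea, not the router's actual Rust implementation; the field names and the token budget are assumptions.

```python
# Illustrative token-budget batching: admit waiting requests while the
# projected total (prompt tokens + requested new tokens) fits the budget.
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    input_tokens: int
    max_new_tokens: int

def admit_requests(queue: List[Request], max_batch_total_tokens: int = 16_000) -> List[Request]:
    """Pull requests from the queue while the projected token total stays under budget."""
    batch: List[Request] = []
    total = 0
    while queue:
        candidate = queue[0]
        projected = total + candidate.input_tokens + candidate.max_new_tokens
        if projected > max_batch_total_tokens:
            break  # leave it queued; a later decode step may free room in the batch
        total = projected
        batch.append(queue.pop(0))
    return batch

# Example: a 12k-token budget admits the first two requests but defers the third.
pending = [Request(1000, 512), Request(4000, 1024), Request(8000, 2048)]
print(len(admit_requests(pending, max_batch_total_tokens=12_000)))  # -> 2
```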
chore(github): add templates (#264) fix(server): fix typo in tokenizers decode (#269) closes #268 feat(server): support hf endpoint weight layout (#266) fix(launcher): pass weights cache override to the download process (#274) closes #273 fix(launcher): handle hub branches (#278) fix(server): Removes the parallelism in file convertion (during download) (#275) feat(launcher): Improve error message when download process fails. (#276) fix(server): fix convert (#284) chore: add `flash-attention` to docker ignore (#287) included when building docker locally. (Where the local dirs might have the flash-attention folder.) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> fea(server): decrease convert RAM requirements (#286) fix(dockerfile): fix nvidia env vars (#297) Fixes #291 feat(router): Adding response schema for compat_generate (#292) feat(docker): add benchmarking tool to docker image (#298) fix(docker): fix docker build (#299) feat(server): optim flash causal lm decode_token (#285) fix(docker): fix nvidia env vars (#305) fix(docker): remove nvidia require cuda env (#310) feat(server): shard token decode (#303) feat(server): use float16 (#304) fix(docker): remove CUDA_VERSION feat(server): use cuda graph in logits warping (#302) fix(server): fix multinomial implem in Sampling feat(server): GPTQ quantization (step1) (#277) Changes only the type from `bool` to `Option<Enum>` pretty much everywhere. - Use `Optional[str]` in Python (easier to manage than importing type everywhere). Except for the cli to get proper validation - Updated all models to handle gracefully new values. 
(Error out if unknown value, or gptq since not implemented). <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> chore(docker): use nvidia base image (#318) fix(docker): remove quantize default fix(docker): use ubuntu20.04 Hotfixes for santacoder/bigcode. (#294) Hotfixes: - Uses `model_type`=`gpt_bigcode` for more general usage. - Hotfixes linked lm_head vs wte_embedding (safetensors file do not contain the key, correctly when the file is sharded, where as pytorch copies the tensor) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Lifting check_unitialized. (#325) Lifting check_unitialized. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> Removing dead variables. (#327) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). 
Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat(ci): custom gpu runners (#328) Single place for TP layers + Dropout Layer Norm + FastLinear (#329) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat: add snapshot testing (#282) feat(integration-tests): improve comparison and health checks (#336) fix(server): fix decode token (#334) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> fix: set MODEL_ID in sagemaker-entrypoint script (#343) feat(server): Support BLOOMChat-176B (#348) (#351) @njhill, temporary workaround to be able to run our CI as secrets are not available to runners run by external contributors. I will ask around to see if there is a better way. Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): fix init for flash causal lm (#352) Fixes #347 fix(server): t5 cannot run in f16 (#356) Fix #349 fix(ci): fix security group (#359) Switch security group used for ci (open outbound rules) Signed-off-by: Raphael <oOraph@users.noreply.github.com> Co-authored-by: Raphael <oOraph@users.noreply.github.com> feat: add nightly load testing (#358) chore(sever): update requirements (#357) Fixes #338 feat(server): support fp16 for t5 (#360) Fixes #349 feat(server): do not use device_map auto on single GPU (#362) feat(server): support trust_remote_code (#363) feat(router): log input/ouput at debug level (#364) @njhill FYI v0.7.0 (#353) feat: decrease IPC proto size (#367) Closes #307 #308 feat(benchmarker): add summary tables (#368) feat(server): support vectorized warpers in flash causal lm (#317) Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com> Fix issue when load AutoModelForSeq2SeqLM model (#370) fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(server): fix quantization feat(server): support RefinedWeb models (#379) v0.8.0 increase health checks feat(server): add retry on download (#384) fix(server): fix bnb quantization for CausalLM models (#385) v0.8.1 fix(server): fix has_position_ids (#395) Fix #389 feat(server): remove trust_remote_code requirement for falcon models (#396) feat(server): load santacoder/starcoder models with safetensors (#393) Fix #366 v0.8.2 feat(sagemaker): add trust remote code to entrypoint (#394) feat(launcher): parse oom signal (#404) feat(server): only compute prefill logprobs when asked (#406) Close #288 feat(server): batch tokenization for flash causal lm (#411) chore: update openapi schema feat(server): Rework model loading (#344) Reworked the loading logic. Idea is to use cleaner loading code: - Remove need for `no_init_weights` - Remove all weird `bnb_linear` and `load_weights` and `post_load_weights`. New code layout: - New class `Weights` in charge of handling loading the weights from multiple files into appropiate tensors (potentially sharded) - TP layers now are "shells", they contain the code to know what kind of sharding we need + eventual `all_reduce`. They do not inherit from linear, but they contain some kind of Linear instead - the contained linear can be either FastLinear, BnbLinear or GPTq Linear next. 
- All modeling code is explictly made for sharding, process group is just no-ops for non sharded code (removes a lot of test cases) ![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f) --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net> Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(server): optimize dist ops (#434) docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441) It solves a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelling references to this variable have been found in code logic leading to undefined behaviour or bugs. This PR is not expected to perform any code logic modification. fix(makefile): Fix typo and use POSIX comparison in the makefile (#443) This PR fixes: - The usage of non posix comparison which may fail depending on the shell used (`=` will always work, `==` only with bash) - Typo in the env variable name displayed in the error message `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS` <!-- Remove if not applicable --> Fixes #422 feat(server): pre-allocate past key values for flash causal LM (#412) feat(router): add ngrok integration (#453) feat(server): improve flash attention import errors (#465) @lewtun, is this enough? Closes #458 Closes #456 fix(server): fix warpers on CPU (#472) Closes #471 fix(server): Fixing T5 in case the names are mixed up. (#475) feat(server): Update convert logic. (#483) Should be more robust to shared tensors (ok when using `from_pretrained). But forcing us to add new checks in our loading code (since the chosen key to keep might be different from `transformers`). --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> feat(server): Adding new ignore_rule for conversion. (#485) fix(router): add timeout on flume sends (#488) feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438) Let's start discussing implementation. - Need to expose the quantization scripts (either included here or add doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa) - Make sure GPTQ works for multiple models (priority to Falcon). Currently it means that every place we use `get_{tensor|sharded}` to check for quantization. My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic. This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` should be in a single place, and independant of bias presence. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. 
If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): Do not init process group if already initialized (#388) feat(router): add header option to disable buffering for the generate_stream response (#498) generate_stream endpoint response stream. Problem: If a model is run behind a proxy server such as nginx that has buffering enabled then the response stream from generate_stream gets aggregated into a single response which basically disables streaming. Instead of getting a chunked response where each token is presented over time the response presents everything all at once. Solution: This change adds the `X-Accel-Buffering` http header which disables buffering for the generate_stream response, allowing the response to stream properly. feat(server): add paged attention to flash models (#516) Closes #478 feat(router): arg validation (#519) feat: Add the option to force another dtype than `f16`. (#513) fix(launcher): fix issue where launcher does not properly report shard failures (#522) v0.9.0 (#525) feat(server): Add Non flash MPT. (#514) This adds a non flash version of MPT. Flash is harder because we need to create a bias ready cuda kernel of flash attention. Fixes https://github.com/huggingface/text-generation-inference/issues/361 Fixes https://github.com/huggingface/text-generation-inference/issues/491 Fixes https://github.com/huggingface/text-generation-inference/issues/290 fix: Update server/Makefile to include Makefile-vllm (#520) For consistency and ease of use (you can just run `make` to install vllm without any extra steps). <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" 
below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462) fix(server): Handle loading from local files for MPT (#534) This PR allows the MPT model to be loaded from local files. Without this change, an exception will be thrown by `hf_hub_download` function if `model_id` is a local path. fix(server): avoid errors for very small top_p values (#544) See https://github.com/huggingface/transformers/pull/24111 I didn't add validation to the `__init__` method since it's not done for other values/warpers. feat(server): use latest flash attention commit (#543) @njhill FYI feat(router): add argument for hostname in router (#545) (#550) In title. Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with ```commandline cargo run -- --port 8080 --hostname :: curl -I -X GET 'http://[::1]:8080/health' # failed before this commit ``` Trigger CI --------- Co-authored-by: Phil Chen <philchen2000@gmail.com> fix(server): decrease memory fragmentation (#557) v0.9.1 (#558) fix(server): harden the weights choice to save on disk. (#561) - Look at `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights` which are the standard attributes to select the keys to NOT save on disk (since they are ignored) - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with : https://github.com/huggingface/text-generation-inference/issues/555 and : https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593 feat: better errors for warmup and TP (#575) Close #571 fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Fixes #555 feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. 
(#580) Some models are already converted, and do not have those values in the file, this enables users to use them with less friction. Went for pure env based because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitation to do: those flags would be errors if not used in conjuction with `--quantize gptq`. Then the flags need to exist in the launcher and the server passing them all throughout all function calls. This PR is intended as an easy escape hatch, not the defacto method to use gptq in TGI. Fixes #500 chore: migrate ci region for more availability. (#581) fix(server): T5 weights names. (#582) Fixes #541 fix(server): Adding logger import to t5_modeling.py (#585) Logger is referenced during the apex importing but is not imported, causing a NameError fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) This fixes a typo and extends the GPTP_BITS environment variables through to the second method which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix. feat(server): Implements sharding for non divisible `vocab_size`. (#583) - The code is relatively easy (just disable the checks on Embedding and Head) This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization another quite hairy piece of code. feat(server): empty cache on errors GPTQ Env vars: catch correct type of error (#596) When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct. feat(launcher): add arg validation and drop subprocess (#595) feat(router): explicit warning if revision is not set (#608) docs: README: Add logo + baseline (#611) ![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984) fix(server): blacklist local files (#609) Close #589 #602 v0.9.2 (#616) fix(server): empty_cache when stopped fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621) fea(launcher): debug logs (#623) feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) but should work on more configurations (no need for 2 GPUs, less RAM usage). Reworking the quantization script so it's still universal (not llama specific) but should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. 
If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat(server): flash attention v2 (#624) feat(server): add support for llamav2 (#633) v0.9.3 (#634) fix(server): fix llamav2 config (#635) feat(server): auto max_batch_total_tokens for flash att models (#630) feat(router): ngrok edge (#642) docs: Update README.md (#639) docs: Update README.md (#643) Add trust_remote_code to quantize script (#647) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes a bug appeared with MR #587 fixing issue #552. See the discussion in #552. With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, but is found in the function signature. This prevents models like falcon to be quantized, because trust_remote_code is required. This MR fixes the issue. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [X] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ --> fix(server): llama v2 GPTQ (#648) As per title & reported https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5 Test it: ``` GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq ``` & ``` curl 127.0.0.1:8080/generate \ -X POST \ -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \ -H 'Content-Type: application/json' ``` fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661) fix(server): use mem_get_info to get kv cache size (#664) Close https://github.com/huggingface/text-generation-inference/issues/649 Close https://github.com/huggingface/text-generation-inference/issues/651 Close https://github.com/huggingface/text-generation-inference/issues/653 Close #636 feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) Just trying to get the integration tests to pass. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com> Directly load GPTBigCode to specified device (#618) This PR directly load GPTBigCode to specified device, avoiding moving model between devices. This PR directly load GPTBigCode to specified device, avoiding moving model between devices. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @OlivierDehaene OR @Narsil feat(server): add local prom and health routes if running w/ ngrok feat: add cuda memory fraction (#659) Close #673 fix(server): fix exllama buffers (#689) Close #683 feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) - Current PR is not great because we're side stepping the `Weights.__init__` but Weights shouldn't requires anything related to the config or the model_id as it aims to be a simple Wrapper over multi file loading. - Ideal solution would be to use something like Rust enum ``` enum Quantize{ Bitandbytes(Bitsandbytes), GPTQ(bits: usize, groupsize: usize) ``` And passing that around during load. Unfortunately we don't have access to this, so for now, side-stepping seems easier. - Re-enabling groupsize<0 with exllama (confirmed it works.) Helps #601 In next steps we should make sure our quantization script uses that format and make it standard. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? 
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs(README): update readme fix(server): fix quantization python requirements (#708) fix(server): fix missing datasets in quantize feat(server): support new falcon config (#712) v0.9.4 (#713) Add section about TGI on other AI hardware accelerators in README (#715) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> As per title. - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs: Add hardware section to TOC in README (#721) feat(server): update vllm version (#723) chore: update license to HFOIL (#725) v1.0.0 (#727) Local gptq support. (#738) Redoes #719 <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. 
Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> Fix typing in `Model.generate_token` (#733) This PR fixes a minor type annotation issue in the signature of `Model.generate_token`. All existing overrides of `Model.generate_token` return `Tuple[List[Generation], Optional[B]]`: https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591 I suspect that back in 017a2a8c when `GeneratedText` and `Generation` were separated, the function signature was not updated. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? CC @OlivierDehaene Adding Rope scaling. (#741) - Adds Rope NTK scaling. 
Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed Took some code from https://github.com/huggingface/transformers/pull/24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something line ("linear:4.0" , or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512 <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> chore: fix typo in mpt_modeling.py (#737) Fixed typo. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> implemetation -> implementation - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update…
tjluyao added a commit that referenced this pull request Jul 7, 2024
feat(launcher): Log server stdout (#19) Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): Minor refactorization using new_zeros (#24) - Fix some type hints, in particular base tokenizer class - Make use of `tensor.new_zero/empty` methods - Simplify env var string parsing in launcher fix(router): Obey max batch size (#23) feat(server): Support SantaCoder (#26) fix(server): Fix position ids (#28) feat(docker): Make the image compatible with api-inference (#29) fix(docker): fix api-inference deployment (#30) fix(router): fix api-inference deployment (#31) fix(dockerfile): fix docker build (#32) feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33) feat(router): Remove second lock from batcher hot path (#27) @njhill feat: Support sampling seeding (#37) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com> feat: Add token streaming using ServerSideEvents support (#36) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is: ```rust struct Details { finish_reason: String, generated_tokens: u32, seed: Option<u64>, } struct StreamResponse { token: Token, generated_text: Option<String>, details: Option<Details>, } struct ErrorResponse { error: String, } ``` Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36 fix(server): fix seeding on gpu (#42) fix(server): fix seeding with multiple shards (#44) feat: Add token streaming using ServerSideEvents support (#41) fix(server): fix quantization for sharded models (#45) feat(server): Support GPT-Neox (#39) feat(ci): Docker build and push (#46) feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48) feat(server): support repetition penalty (#47) feat(server): allow the server to use a local weight cache (#49) fix(server): allow greedy repetition penalty (#51) feat(router): use background task to manage request queue (#52) Co-authored-by: Nick Hill <nickhill@us.ibm.com> breaking(router): modify /generate API to only return generated text (#50) @njhill, @yk FYI generated_text was concatenated to the user prompt for legacy reason. We want to remove this behaviour as we don't think it is useful and even detrimonial to usability. We also remove the unused Vec. 
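To make the SSE contract from the token-streaming change (#36/#41) above concrete, here is a rough client-side sketch. It assumes a local server on 127.0.0.1:8080 exposing `/generate_stream` and emitting `data:` lines whose JSON matches the `StreamResponse` shape shown earlier; the `token.text` field and the request payload are assumptions for illustration, not a documented client.

```python
# Rough sketch of consuming the SSE token stream described in #36/#41.
# Host, port, endpoint path, payload shape and the token.text field are
# assumptions for illustration.
import json
import requests

def stream_tokens(prompt: str, max_new_tokens: int = 32):
    with requests.post(
        "http://127.0.0.1:8080/generate_stream",
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines():
            if raw.startswith(b"data:"):
                # Each event carries a StreamResponse: token, optional
                # generated_text (only on the final event) and details.
                yield json.loads(raw[len(b"data:"):])

if __name__ == "__main__":
    for event in stream_tokens("hey llama"):
        print(event.get("token", {}).get("text", ""), end="", flush=True)
        if event.get("generated_text") is not None:
            print()  # final event: full text plus finish_reason/seed details
```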
feat(router): refactor API and add openAPI schemas (#53) feat(docs): Clarify installation steps (#54) Adds some bits for first-time users (like me 😄 ) feat(ci): push to AML registry (#56) fix(server): better handling of inference mode (#57) V0.2.1 (#58) feat(server): support t5 (#59) fix(docker): increase shm size (#60) fixed SSE naming (#61) https://en.wikipedia.org/wiki/Server-sent_events feat: add distributed tracing (#62) feat: add safetensors conversion (#63) feat(server): improve download logging (#66) feat(launcher): add disable_custom_kernels arg (#67) feat(router): add max_total_tokens and empty_input validation (#68) closes #65 fix(launcher): copy current env vars to subprocesses (#70) closes #69 feat(router): add prometheus metrics scrape endpoint (#71) v0.3.0 (#72) feat(router): add cors allow origin options (#73) feat(server): enable hf-transfer (#76) fix(server): remove position_ids from galactica forward (#82) closes #80 feat(server): pre-allocate max attention mask (#75) v0.3.1 (#84) feat(server): add special token bool (#85) fix(docs): fix openapi schema (#86) fix(server): fix token_is_special (#87) feat(router): add legacy route for api-inference support (#88) feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89) feat(router): add api-inference headers (#91) feat(server): add logits watermark (#90) feat(server): update to hf_transfer==0.1.2 (#93) feat(ci): improve CI speed (#94) fix(launcher): add router parameters to launcher (#95) feat(server): fix transformers commit (#96) v0.3.2 (#97) fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100) feat: allow local models (#101) closes #99 feat: add supported models (#102) feat(clients): Python client (#103) fix(server): fix galactica batch (#106) closes #105 feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107) feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108) fix(python-client): stream not set on the sync client (#109) fix(server): fix index out of range for watermarking (#110) feat: support typical sampling (#114) closes #112 fix(server): do not warp prefill logits (#116) feat(router): support left truncation (#115) closes #111 feat(router): add best_of parameter (#117) feat(python-client): add new parameters (#118) v0.4.0 (#119) feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122) …ed models fix(server): revert gpt-neox optims (#123) fix(server): add position ids to neox (#126) fix(server): use server tokenizer as gt (#128) fix(python-client): relax dependencies (#129) feat(python-client): add cookies to Client constructors and requests (#132) I have a use case where we need to pass cookies (for auth reasons) to an internally hosted server. Note: I couldn't get the client tests to pass - do you need to have an HF token? 
```python FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid ``` feat(ci): add ci paths (#134) feat: Add note about NVIDIA drivers (#64) Co-authored-by: OlivierDehaene <olivier@huggingface.co> feat(python-client): release v0.4.0 (#135) feat(python-client): add CI (#136) feat(server): flash neoX (#133) fix(server): fix flash-neox scores warping (#137) feat(server): cleanup flash neox loading (#139) v0.4.1 (#140) fix(server): Avoid using try/except to determine kind of AutoModel (#142) feat(server): Add mypy-protobuf (#141) Generates .pyi files for protobuf stubs which provide strong typing information. Very helpful for IDE auto-completion, etc. feat(server): clear cache on error (#143) feat(server): reduce mlp and attn in one op for flash neox (#145) feat: aws sagemaker compatible image (#147) The only difference is that now it pushes to registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:... instead of registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-... --------- Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com> fix(ci): fix sagemaker action (#148) feat(benchmark): tui based benchmarking tool (#149) fix(server): fix flash neox rotary embeddings (#150) v0.4.2 (#151) v0.4.3 (#152) feat(server): flash santacoder (#153) docs(readme): provide link Logits Warper README (#154) fix(server): fix escape characters in stop sequence (#155) feat(docker): improve flash_attention caching (#160) feat(launcher): allow disabling hf_transfer (#161) fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162) fix(router): use buckets for metrics histograms (#163) feat(router): make router input validation optional (#164) feat(server): add flash attention llama (#144) feat(server): support OPT models (#55) OPT models do not all have a `tokenizer.json` file on the hub at the moment. Can't merge for now. 
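Regarding the cookies support added to the Python client in #132 above: assuming the `text_generation.Client` constructor accepts a cookies mapping as that PR describes, usage against an internally hosted server could look roughly like this (the hostname and cookie values are placeholders).

```python
# Hypothetical usage of the cookies support described in #132. The internal
# hostname and cookie value are placeholders, and the exact constructor
# signature is assumed from the PR description.
from text_generation import Client

client = Client(
    "http://tgi.internal.example:8080",
    cookies={"session": "<auth-cookie-value>"},  # forwarded on every request
)

response = client.generate("What is Deep Learning?", max_new_tokens=20)
print(response.generated_text)
```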
v0.5.0 (#168) feat(server): optimize decode for sane tokenizers (#170) feat(server): support sharded santacoder (#167) fix(launcher): revert change on shard errors (#173) fix(ci): fix CVE in github-slug-action (#174) feat(ci): add image signing with cosign (#175) feat(ci): add Trivy and scan docker image (#178) feat(ci): use large runners (#179) feat(ci): faster scanning (#180) fix(ci): fix ci permissions (#181) fea(dockerfile): better layer caching (#159) fix(ci): fix cosign error (#183) fix(docker): fix docker image (#184) fix(docker): fix image (#185) fix(docker): revert dockerfile changes (#186) fix(docker): fix docker image dependencies (#187) fix(router): fix truncation (#190) closes #189 feat(python-client): get list of currently deployed tgi models using the inference API (#191) feat(router): add info route (#196) close #125 feat(server): support quantization for flash models (#200) closes #197 feat(server): check cuda capability when importing flash models (#201) close #198 fix(server): fix hf_transfer issue with private repos (#203) fix(docker): remove unused dependencies (#205) fix(router): add auth token to get model info (#207) feat(router): add git sha to info route (#208) feat(router): drop requests when client closes the channel (#202) fix(ci): fix sha in docker image (#212) feat(server): flash attention past key value optimizations (#213) feat(router): add device and dtype info (#215) fix(server): fix past key values logic (#216) @njhill fyi fix(server): cleanup new flash past_key_values logic (#217) fix(server): fix flash causal (#218) fix(server): fix flash causal (#219) fix(server): fix flash batch filtering (#220) misc: update to rust 1.69 (#221) v0.6.0 (#222) feat(server): reduce memory requirement (#214) chore(server): update huggingface-hub (#227) feat(router): use number of tokens in batch as input for dynamic batching (#226) Co-authored-by: Nick Hill <nickhill@us.ibm.com> feat(router): add endpoint info to /info route (#228) chore(server): update safetensors version (#235) fix(python-client): add auth headers to is supported requests (#234) Starting some routing tests. (#233) fix(benchmarking): fix benchmarking tool chore(launcher): refactor logic (#242) Hopefully it's cleaner feat(router): add tests to validation (#237) feat(router): new healthcheck that skips the queue (#244) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): fix reshaping of bloom past_key_values in concatenate() (#252) Introduced in #214 Fixes #249 fix(server): Small tidy of code from recent changes (#251) remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter() chore(server): update transformers (#250) feat(server): add watermarking tests (#248) feat(docker): add nvidia env vars (#255) doc(launcher): add more docs to the `launcher` itself and link in the README (#257) feat(benchmark): add support for private tokenizers (#262) Adding docs on how dynamic batching works. (#258) This PR starts the minimal possible amount of explanation I could think of. It tries to explain how dynamic batching occurs, the interactions with past key values and ignores the padding problem. Maybe some drawings could help too but I kept it to text for now. 
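Since the dynamic batching doc (#258) above stays at the prose level, here is a toy sketch of the scheduling idea it describes, combined with the token-budget admission rule from #226. This is a conceptual illustration, not the router's implementation; all names and the budget rule are simplified.

```python
# Toy illustration of dynamic batching as described in #258/#226: waiting
# requests join the running batch while a token budget holds, every request
# in the batch advances by one decode step, and finished requests are
# filtered out. Purely conceptual; not the router's actual logic.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0

def tokens(requests):
    # The budget counts prompt plus already generated tokens, as in #226.
    return sum(r.prompt_tokens + r.generated for r in requests)

def decode_step(batch):
    # Stand-in for one generate_token call over the whole batch.
    for r in batch:
        r.generated += 1
    finished = [r for r in batch if r.generated >= r.max_new_tokens]
    running = [r for r in batch if r.generated < r.max_new_tokens]
    return finished, running

def serve(queue: deque, max_batch_total_tokens: int):
    batch = []
    while queue or batch:
        # Admit newcomers (each would trigger a prefill) while budget allows.
        while queue and tokens(batch + [queue[0]]) <= max_batch_total_tokens:
            batch.append(queue.popleft())
        if not batch and queue:
            batch.append(queue.popleft())  # never starve an oversized request
        finished, batch = decode_step(batch)
        for r in finished:
            print(f"request finished after {r.generated} new tokens")

serve(deque([Request(5, 3), Request(7, 2), Request(4, 4)]), max_batch_total_tokens=32)
```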
chore(github): add templates (#264) fix(server): fix typo in tokenizers decode (#269) closes #268 feat(server): support hf endpoint weight layout (#266) fix(launcher): pass weights cache override to the download process (#274) closes #273 fix(launcher): handle hub branches (#278) fix(server): Removes the parallelism in file convertion (during download) (#275) feat(launcher): Improve error message when download process fails. (#276) fix(server): fix convert (#284) chore: add `flash-attention` to docker ignore (#287) included when building docker locally. (Where the local dirs might have the flash-attention folder.) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> fea(server): decrease convert RAM requirements (#286) fix(dockerfile): fix nvidia env vars (#297) Fixes #291 feat(router): Adding response schema for compat_generate (#292) feat(docker): add benchmarking tool to docker image (#298) fix(docker): fix docker build (#299) feat(server): optim flash causal lm decode_token (#285) fix(docker): fix nvidia env vars (#305) fix(docker): remove nvidia require cuda env (#310) feat(server): shard token decode (#303) feat(server): use float16 (#304) fix(docker): remove CUDA_VERSION feat(server): use cuda graph in logits warping (#302) fix(server): fix multinomial implem in Sampling feat(server): GPTQ quantization (step1) (#277) Changes only the type from `bool` to `Option<Enum>` pretty much everywhere. - Use `Optional[str]` in Python (easier to manage than importing type everywhere). Except for the cli to get proper validation - Updated all models to handle gracefully new values. 
(Error out if unknown value, or gptq since not implemented). <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> chore(docker): use nvidia base image (#318) fix(docker): remove quantize default fix(docker): use ubuntu20.04 Hotfixes for santacoder/bigcode. (#294) Hotfixes: - Uses `model_type`=`gpt_bigcode` for more general usage. - Hotfixes linked lm_head vs wte_embedding (safetensors file do not contain the key, correctly when the file is sharded, where as pytorch copies the tensor) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Lifting check_unitialized. (#325) Lifting check_unitialized. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> Removing dead variables. (#327) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). 
Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat(ci): custom gpu runners (#328) Single place for TP layers + Dropout Layer Norm + FastLinear (#329) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat: add snapshot testing (#282) feat(integration-tests): improve comparison and health checks (#336) fix(server): fix decode token (#334) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> fix: set MODEL_ID in sagemaker-entrypoint script (#343) feat(server): Support BLOOMChat-176B (#348) (#351) @njhill, temporary workaround to be able to run our CI as secrets are not available to runners run by external contributors. I will ask around to see if there is a better way. Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): fix init for flash causal lm (#352) Fixes #347 fix(server): t5 cannot run in f16 (#356) Fix #349 fix(ci): fix security group (#359) Switch security group used for ci (open outbound rules) Signed-off-by: Raphael <oOraph@users.noreply.github.com> Co-authored-by: Raphael <oOraph@users.noreply.github.com> feat: add nightly load testing (#358) chore(sever): update requirements (#357) Fixes #338 feat(server): support fp16 for t5 (#360) Fixes #349 feat(server): do not use device_map auto on single GPU (#362) feat(server): support trust_remote_code (#363) feat(router): log input/ouput at debug level (#364) @njhill FYI v0.7.0 (#353) feat: decrease IPC proto size (#367) Closes #307 #308 feat(benchmarker): add summary tables (#368) feat(server): support vectorized warpers in flash causal lm (#317) Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com> Fix issue when load AutoModelForSeq2SeqLM model (#370) fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(server): fix quantization feat(server): support RefinedWeb models (#379) v0.8.0 increase health checks feat(server): add retry on download (#384) fix(server): fix bnb quantization for CausalLM models (#385) v0.8.1 fix(server): fix has_position_ids (#395) Fix #389 feat(server): remove trust_remote_code requirement for falcon models (#396) feat(server): load santacoder/starcoder models with safetensors (#393) Fix #366 v0.8.2 feat(sagemaker): add trust remote code to entrypoint (#394) feat(launcher): parse oom signal (#404) feat(server): only compute prefill logprobs when asked (#406) Close #288 feat(server): batch tokenization for flash causal lm (#411) chore: update openapi schema feat(server): Rework model loading (#344) Reworked the loading logic. Idea is to use cleaner loading code: - Remove need for `no_init_weights` - Remove all weird `bnb_linear` and `load_weights` and `post_load_weights`. New code layout: - New class `Weights` in charge of handling loading the weights from multiple files into appropiate tensors (potentially sharded) - TP layers now are "shells", they contain the code to know what kind of sharding we need + eventual `all_reduce`. They do not inherit from linear, but they contain some kind of Linear instead - the contained linear can be either FastLinear, BnbLinear or GPTq Linear next. 
- All modeling code is explictly made for sharding, process group is just no-ops for non sharded code (removes a lot of test cases) ![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f) --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net> Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(server): optimize dist ops (#434) docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441) It solves a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelling references to this variable have been found in code logic leading to undefined behaviour or bugs. This PR is not expected to perform any code logic modification. fix(makefile): Fix typo and use POSIX comparison in the makefile (#443) This PR fixes: - The usage of non posix comparison which may fail depending on the shell used (`=` will always work, `==` only with bash) - Typo in the env variable name displayed in the error message `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS` <!-- Remove if not applicable --> Fixes #422 feat(server): pre-allocate past key values for flash causal LM (#412) feat(router): add ngrok integration (#453) feat(server): improve flash attention import errors (#465) @lewtun, is this enough? Closes #458 Closes #456 fix(server): fix warpers on CPU (#472) Closes #471 fix(server): Fixing T5 in case the names are mixed up. (#475) feat(server): Update convert logic. (#483) Should be more robust to shared tensors (ok when using `from_pretrained). But forcing us to add new checks in our loading code (since the chosen key to keep might be different from `transformers`). --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> feat(server): Adding new ignore_rule for conversion. (#485) fix(router): add timeout on flume sends (#488) feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438) Let's start discussing implementation. - Need to expose the quantization scripts (either included here or add doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa) - Make sure GPTQ works for multiple models (priority to Falcon). Currently it means that every place we use `get_{tensor|sharded}` to check for quantization. My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic. This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` should be in a single place, and independant of bias presence. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. 
If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): Do not init process group if already initialized (#388) feat(router): add header option to disable buffering for the generate_stream response (#498) generate_stream endpoint response stream. Problem: If a model is run behind a proxy server such as nginx that has buffering enabled then the response stream from generate_stream gets aggregated into a single response which basically disables streaming. Instead of getting a chunked response where each token is presented over time the response presents everything all at once. Solution: This change adds the `X-Accel-Buffering` http header which disables buffering for the generate_stream response, allowing the response to stream properly. feat(server): add paged attention to flash models (#516) Closes #478 feat(router): arg validation (#519) feat: Add the option to force another dtype than `f16`. (#513) fix(launcher): fix issue where launcher does not properly report shard failures (#522) v0.9.0 (#525) feat(server): Add Non flash MPT. (#514) This adds a non flash version of MPT. Flash is harder because we need to create a bias ready cuda kernel of flash attention. Fixes https://github.com/huggingface/text-generation-inference/issues/361 Fixes https://github.com/huggingface/text-generation-inference/issues/491 Fixes https://github.com/huggingface/text-generation-inference/issues/290 fix: Update server/Makefile to include Makefile-vllm (#520) For consistency and ease of use (you can just run `make` to install vllm without any extra steps). <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" 
below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462) fix(server): Handle loading from local files for MPT (#534) This PR allows the MPT model to be loaded from local files. Without this change, an exception will be thrown by `hf_hub_download` function if `model_id` is a local path. fix(server): avoid errors for very small top_p values (#544) See https://github.com/huggingface/transformers/pull/24111 I didn't add validation to the `__init__` method since it's not done for other values/warpers. feat(server): use latest flash attention commit (#543) @njhill FYI feat(router): add argument for hostname in router (#545) (#550) In title. Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with ```commandline cargo run -- --port 8080 --hostname :: curl -I -X GET 'http://[::1]:8080/health' # failed before this commit ``` Trigger CI --------- Co-authored-by: Phil Chen <philchen2000@gmail.com> fix(server): decrease memory fragmentation (#557) v0.9.1 (#558) fix(server): harden the weights choice to save on disk. (#561) - Look at `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights` which are the standard attributes to select the keys to NOT save on disk (since they are ignored) - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with : https://github.com/huggingface/text-generation-inference/issues/555 and : https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593 feat: better errors for warmup and TP (#575) Close #571 fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Fixes #555 feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. 
(#580) Some models are already converted and do not have those values in the file; this enables users to use them with less friction. Went for pure env based because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitization to do: those flags would be errors if not used in conjunction with `--quantize gptq`. Then the flags need to exist in the launcher and the server, passing them throughout all function calls. This PR is intended as an easy escape hatch, not the de facto method to use gptq in TGI. Fixes #500 chore: migrate ci region for more availability. (#581) fix(server): T5 weights names. (#582) Fixes #541 fix(server): Adding logger import to t5_modeling.py (#585) Logger is referenced during the apex importing but is not imported, causing a NameError. fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) This fixes a typo and extends the GPTQ_BITS environment variables through to the second method which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix. feat(server): Implements sharding for non divisible `vocab_size`. (#583) - The code is relatively easy (just disable the checks on Embedding and Head). This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA), but it would make the other models (MQA) much more complex, and GPTQ quantization is another quite hairy piece of code. feat(server): empty cache on errors GPTQ Env vars: catch correct type of error (#596) When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct. feat(launcher): add arg validation and drop subprocess (#595) feat(router): explicit warning if revision is not set (#608) docs: README: Add logo + baseline (#611) ![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984) fix(server): blacklist local files (#609) Close #589 #602 v0.9.2 (#616) fix(server): empty_cache when stopped fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621) fea(launcher): debug logs (#623) feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) but should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results. feat(server): flash attention v2 (#624) feat(server): add support for llamav2 (#633) v0.9.3 (#634) fix(server): fix llamav2 config (#635) feat(server): auto max_batch_total_tokens for flash att models (#630) feat(router): ngrok edge (#642) docs: Update README.md (#639) docs: Update README.md (#643) Add trust_remote_code to quantize script (#647) Fixes a bug that appeared with MR #587, which fixed issue #552. See the discussion in #552. With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, but it is found in the function signature. This prevents models like falcon from being quantized, because trust_remote_code is required. This MR fixes the issue. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [X] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes?
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ --> fix(server): llama v2 GPTQ (#648) As per title & reported https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5 Test it: ``` GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq ``` & ``` curl 127.0.0.1:8080/generate \ -X POST \ -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \ -H 'Content-Type: application/json' ``` fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661) fix(server): use mem_get_info to get kv cache size (#664) Close https://github.com/huggingface/text-generation-inference/issues/649 Close https://github.com/huggingface/text-generation-inference/issues/651 Close https://github.com/huggingface/text-generation-inference/issues/653 Close #636 feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) Just trying to get the integration tests to pass. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com> Directly load GPTBigCode to specified device (#618) This PR directly loads GPTBigCode to the specified device, avoiding moving the model between devices. feat(server): add local prom and health routes if running w/ ngrok feat: add cuda memory fraction (#659) Close #673 fix(server): fix exllama buffers (#689) Close #683 feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) - Current PR is not great because we're side-stepping the `Weights.__init__`, but Weights shouldn't require anything related to the config or the model_id as it aims to be a simple wrapper over multi-file loading. - Ideal solution would be to use something like a Rust enum

```rust
enum Quantize {
    Bitsandbytes(Bitsandbytes),
    GPTQ { bits: usize, groupsize: usize },
}
```

And passing that around during load. Unfortunately we don't have access to this, so for now, side-stepping seems easier. - Re-enabling groupsize<0 with exllama (confirmed it works). Helps #601. In next steps we should make sure our quantization script uses that format and make it standard. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs(README): update readme fix(server): fix quantization python requirements (#708) fix(server): fix missing datasets in quantize feat(server): support new falcon config (#712) v0.9.4 (#713) Add section about TGI on other AI hardware accelerators in README (#715) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> As per title. - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> docs: Add hardware section to TOC in README (#721) feat(server): update vllm version (#723) chore: update license to HFOIL (#725) v1.0.0 (#727) Local gptq support. (#738) Redoes #719 <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. 
Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> Fix typing in `Model.generate_token` (#733) This PR fixes a minor type annotation issue in the signature of `Model.generate_token`. All existing overrides of `Model.generate_token` return `Tuple[List[Generation], Optional[B]]`: https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591 I suspect that back in 017a2a8c when `GeneratedText` and `Generation` were separated, the function signature was not updated. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? CC @OlivierDehaene Adding Rope scaling. (#741) - Adds Rope NTK scaling. 
Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed. Took some code from https://github.com/huggingface/transformers/pull/24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something like ("linear:4.0", or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512 chore: fix typo in mpt_modeling.py (#737) Fixed typo. implemetation -> implementation - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update…
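To make the Rope scaling change (#741) described above more concrete: the two modes exposed by `--rope-scaling` and `--rope-factor` correspond to the linear and dynamic NTK approaches from the transformers PR referenced there. The sketch below is purely illustrative (function names are invented here, and it is not the code added in #741):

```python
import torch

def rope_inv_freq(dim, base=10000.0, scaling="linear", factor=4.0,
                  seq_len=None, max_position_embeddings=2048):
    """Inverse frequencies for rotary embeddings, optionally NTK-scaled."""
    if scaling == "dynamic" and seq_len is not None and seq_len > max_position_embeddings:
        # Dynamic NTK: grow the rotary base with the current sequence length so
        # long contexts still fit, as in the transformers rope-scaling PR.
        base = base * ((factor * seq_len / max_position_embeddings) - (factor - 1)) ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def rope_positions(seq_len, scaling="linear", factor=4.0):
    """Position indices fed to the rotary embedding."""
    t = torch.arange(seq_len).float()
    if scaling == "linear":
        t = t / factor  # linear scaling simply compresses positions by the factor
    return t
```

Roughly speaking, `--rope-scaling linear --rope-factor 4.0` divides positions by 4, while `dynamic` leaves positions alone and instead enlarges the base once the sequence exceeds the model's original context length.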
tjluyao added a commit that referenced this pull request Jul 7, 2024
Init fix: cleanup Add load testing Refactored gRPC interface Added validation logic ValidationError was not correctly handled Use axum feat: Docker image feat: Add AML deployment Update aml deployment feat: Improve error handling feat: Add arguments to CLI v0.1.0 fix(validation): Fix error messages feat(router): Add max_waiting_tokens Create LICENSE (#2) feat(server): Use safetensors Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(client): Simplify sharded logic feat(server): Support bitsandbytes feat(server): Support all AutoModelForCausalLM on a best effort basis feat: Use json formatter by default in docker image fix(models): Revert buggy support for AutoModel feat(server): Support generic AutoModelForCausalLM feat(server): Support AutoModelForSeq2SeqLM feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard feat(server): Improved doc fix(server): Fix Transformers fork version feat(server): Clarify CausalLMBatch concatenate method feat(rust): Update to 1.65 fix(router): Fix HTTP status codes fix(readme): Typo fix(router): Handle tokenizer errors feat(server): Support Galactica (#4) fix(batching): Avoid theoretical hang in batcher loop (#5) - Avoid theoretical hang in batcher loop - Avoid a couple of clones in the router generate method - Keep attention mask tensors as integers - Remove num_heads attribute Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com> feat(server): Add model tests (#6) fix(server): Only pad to multiple of 8 on GPUs feat: Support stop sequences (#7) feat: Return logprobs (#8) feat(launcher): Add integration tests (#9) fix(server): Fix stop sequences (#11) fix(server): Check for device type correctly when determining initial padding (#16) AFAIK there is no torch device type called "gpu". fix(router): Include special tokens when tokenizing (#14) There's currently a discrepancy in the tokenization between the router and python server code. The latter includes special tokens but former does not. This results in a token count mismatch for seq2seq models such as mt0 where the tokenizer emits an EOS token at the end. This in turn results in some unexpected/incorrect output, in particular when batch concatenation is involved, because the python code uses the input length passed from the router for each row. As far as I can tell, it is better to include this token in the encoder `input_ids`, so I guess it's best to just adjust on the router side. feat(router): Add const parameters to validation logic (#15) I noticed some opportunity to collapse some of the logic, in case you are interested. fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13) Fixes #12 in the easiest way I could think of. 
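As an illustration of the lossless-decoding fix above (#13): the change boils down to not letting the tokenizer "clean up" spaces when decoding. A minimal sketch using transformers (the model id is just an example, and note the kwarg in transformers is spelled `clean_up_tokenization_spaces`):

```python
from transformers import AutoTokenizer

# Example model id; any Hugging Face tokenizer behaves the same way here.
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok("Hello   world !").input_ids

# With cleanup enabled, spacing around punctuation can be altered, so the
# decoded text is not guaranteed to round-trip exactly.
lossless = tok.decode(ids, clean_up_tokenization_spaces=False)
cleaned = tok.decode(ids, clean_up_tokenization_spaces=True)
print(repr(lossless))
print(repr(cleaned))
```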
feat(launcher): Log server stdout (#19) Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): Minor refactorization using new_zeros (#24) - Fix some type hints, in particular base tokenizer class - Make use of `tensor.new_zero/empty` methods - Simplify env var string parsing in launcher fix(router): Obey max batch size (#23) feat(server): Support SantaCoder (#26) fix(server): Fix position ids (#28) feat(docker): Make the image compatible with api-inference (#29) fix(docker): fix api-inference deployment (#30) fix(router): fix api-inference deployment (#31) fix(dockerfile): fix docker build (#32) feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33) feat(router): Remove second lock from batcher hot path (#27) @njhill feat: Support sampling seeding (#37) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com> feat: Add token streaming using ServerSideEvents support (#36) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is: ```rust struct Details { finish_reason: String, generated_tokens: u32, seed: Option<u64>, } struct StreamResponse { token: Token, generated_text: Option<String>, details: Option<Details>, } struct ErrorResponse { error: String, } ``` Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36 fix(server): fix seeding on gpu (#42) fix(server): fix seeding with multiple shards (#44) feat: Add token streaming using ServerSideEvents support (#41) fix(server): fix quantization for sharded models (#45) feat(server): Support GPT-Neox (#39) feat(ci): Docker build and push (#46) feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48) feat(server): support repetition penalty (#47) feat(server): allow the server to use a local weight cache (#49) fix(server): allow greedy repetition penalty (#51) feat(router): use background task to manage request queue (#52) Co-authored-by: Nick Hill <nickhill@us.ibm.com> breaking(router): modify /generate API to only return generated text (#50) @njhill, @yk FYI generated_text was concatenated to the user prompt for legacy reason. We want to remove this behaviour as we don't think it is useful and even detrimonial to usability. We also remove the unused Vec. 
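The sampling-seeding work mentioned above (#37, plus the GPU and multi-shard fixes in #42 and #44) comes down to giving each request its own RNG rather than relying on global state. A rough sketch of the idea, not the server's actual classes:

```python
from typing import Optional

import torch

def sample_next_token(logits: torch.Tensor, seed: Optional[int] = None) -> int:
    """Sample one token id from a [vocab_size] logits tensor, optionally seeded."""
    generator = None
    if seed is not None:
        # A per-request generator keeps results reproducible even when requests
        # from different users are batched together, and works on GPU tensors too.
        generator = torch.Generator(device=logits.device)
        generator.manual_seed(seed)
    probs = torch.softmax(logits, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1, generator=generator)
    return int(next_id.item())
```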
feat(router): refactor API and add openAPI schemas (#53) feat(docs): Clarify installation steps (#54) Adds some bits for first-time users (like me 😄 ) feat(ci): push to AML registry (#56) fix(server): better handling of inference mode (#57) V0.2.1 (#58) feat(server): support t5 (#59) fix(docker): increase shm size (#60) fixed SSE naming (#61) https://en.wikipedia.org/wiki/Server-sent_events feat: add distributed tracing (#62) feat: add safetensors conversion (#63) feat(server): improve download logging (#66) feat(launcher): add disable_custom_kernels arg (#67) feat(router): add max_total_tokens and empty_input validation (#68) closes #65 fix(launcher): copy current env vars to subprocesses (#70) closes #69 feat(router): add prometheus metrics scrape endpoint (#71) v0.3.0 (#72) feat(router): add cors allow origin options (#73) feat(server): enable hf-transfer (#76) fix(server): remove position_ids from galactica forward (#82) closes #80 feat(server): pre-allocate max attention mask (#75) v0.3.1 (#84) feat(server): add special token bool (#85) fix(docs): fix openapi schema (#86) fix(server): fix token_is_special (#87) feat(router): add legacy route for api-inference support (#88) feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89) feat(router): add api-inference headers (#91) feat(server): add logits watermark (#90) feat(server): update to hf_transfer==0.1.2 (#93) feat(ci): improve CI speed (#94) fix(launcher): add router parameters to launcher (#95) feat(server): fix transformers commit (#96) v0.3.2 (#97) fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100) feat: allow local models (#101) closes #99 feat: add supported models (#102) feat(clients): Python client (#103) fix(server): fix galactica batch (#106) closes #105 feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107) feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108) fix(python-client): stream not set on the sync client (#109) fix(server): fix index out of range for watermarking (#110) feat: support typical sampling (#114) closes #112 fix(server): do not warp prefill logits (#116) feat(router): support left truncation (#115) closes #111 feat(router): add best_of parameter (#117) feat(python-client): add new parameters (#118) v0.4.0 (#119) feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122) …ed models fix(server): revert gpt-neox optims (#123) fix(server): add position ids to neox (#126) fix(server): use server tokenizer as gt (#128) fix(python-client): relax dependencies (#129) feat(python-client): add cookies to Client constructors and requests (#132) I have a use case where we need to pass cookies (for auth reasons) to an internally hosted server. Note: I couldn't get the client tests to pass - do you need to have an HF token? 
```python FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid ``` feat(ci): add ci paths (#134) feat: Add note about NVIDIA drivers (#64) Co-authored-by: OlivierDehaene <olivier@huggingface.co> feat(python-client): release v0.4.0 (#135) feat(python-client): add CI (#136) feat(server): flash neoX (#133) fix(server): fix flash-neox scores warping (#137) feat(server): cleanup flash neox loading (#139) v0.4.1 (#140) fix(server): Avoid using try/except to determine kind of AutoModel (#142) feat(server): Add mypy-protobuf (#141) Generates .pyi files for protobuf stubs which provide strong typing information. Very helpful for IDE auto-completion, etc. feat(server): clear cache on error (#143) feat(server): reduce mlp and attn in one op for flash neox (#145) feat: aws sagemaker compatible image (#147) The only difference is that now it pushes to registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:... instead of registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-... --------- Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com> fix(ci): fix sagemaker action (#148) feat(benchmark): tui based benchmarking tool (#149) fix(server): fix flash neox rotary embeddings (#150) v0.4.2 (#151) v0.4.3 (#152) feat(server): flash santacoder (#153) docs(readme): provide link Logits Warper README (#154) fix(server): fix escape characters in stop sequence (#155) feat(docker): improve flash_attention caching (#160) feat(launcher): allow disabling hf_transfer (#161) fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162) fix(router): use buckets for metrics histograms (#163) feat(router): make router input validation optional (#164) feat(server): add flash attention llama (#144) feat(server): support OPT models (#55) OPT models do not all have a `tokenizer.json` file on the hub at the moment. Can't merge for now. 
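On the stop-sequence fixes above (notably the escape-character fix in #155): the usual pitfall is building a regex from a user-provided stop string without escaping it first. A hedged sketch of the idea in Python, not the project's actual stopping-criteria code:

```python
import re
from typing import Optional, Sequence

def find_stop(generated: str, stop_sequences: Sequence[str]) -> Optional[str]:
    """Return the first stop sequence found at the end of `generated`, if any."""
    for stop in stop_sequences:
        # re.escape matters: stop strings such as "**" or "?" would otherwise be
        # interpreted as regex metacharacters and silently never match.
        if re.search(re.escape(stop) + r"$", generated):
            return stop
    return None

assert find_stop("Hello world**", ["**"]) == "**"
assert find_stop("Hello world", ["**"]) is None
```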
v0.5.0 (#168) feat(server): optimize decode for sane tokenizers (#170) feat(server): support sharded santacoder (#167) fix(launcher): revert change on shard errors (#173) fix(ci): fix CVE in github-slug-action (#174) feat(ci): add image signing with cosign (#175) feat(ci): add Trivy and scan docker image (#178) feat(ci): use large runners (#179) feat(ci): faster scanning (#180) fix(ci): fix ci permissions (#181) fea(dockerfile): better layer caching (#159) fix(ci): fix cosign error (#183) fix(docker): fix docker image (#184) fix(docker): fix image (#185) fix(docker): revert dockerfile changes (#186) fix(docker): fix docker image dependencies (#187) fix(router): fix truncation (#190) closes #189 feat(python-client): get list of currently deployed tgi models using the inference API (#191) feat(router): add info route (#196) close #125 feat(server): support quantization for flash models (#200) closes #197 feat(server): check cuda capability when importing flash models (#201) close #198 fix(server): fix hf_transfer issue with private repos (#203) fix(docker): remove unused dependencies (#205) fix(router): add auth token to get model info (#207) feat(router): add git sha to info route (#208) feat(router): drop requests when client closes the channel (#202) fix(ci): fix sha in docker image (#212) feat(server): flash attention past key value optimizations (#213) feat(router): add device and dtype info (#215) fix(server): fix past key values logic (#216) @njhill fyi fix(server): cleanup new flash past_key_values logic (#217) fix(server): fix flash causal (#218) fix(server): fix flash causal (#219) fix(server): fix flash batch filtering (#220) misc: update to rust 1.69 (#221) v0.6.0 (#222) feat(server): reduce memory requirement (#214) chore(server): update huggingface-hub (#227) feat(router): use number of tokens in batch as input for dynamic batching (#226) Co-authored-by: Nick Hill <nickhill@us.ibm.com> feat(router): add endpoint info to /info route (#228) chore(server): update safetensors version (#235) fix(python-client): add auth headers to is supported requests (#234) Starting some routing tests. (#233) fix(benchmarking): fix benchmarking tool chore(launcher): refactor logic (#242) Hopefully it's cleaner feat(router): add tests to validation (#237) feat(router): new healthcheck that skips the queue (#244) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): fix reshaping of bloom past_key_values in concatenate() (#252) Introduced in #214 Fixes #249 fix(server): Small tidy of code from recent changes (#251) remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter() chore(server): update transformers (#250) feat(server): add watermarking tests (#248) feat(docker): add nvidia env vars (#255) doc(launcher): add more docs to the `launcher` itself and link in the README (#257) feat(benchmark): add support for private tokenizers (#262) Adding docs on how dynamic batching works. (#258) This PR starts the minimal possible amount of explanation I could think of. It tries to explain how dynamic batching occurs, the interactions with past key values and ignores the padding problem. Maybe some drawings could help too but I kept it to text for now. 
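A toy sketch of the token-budget idea behind the dynamic-batching work above (#226 uses the number of tokens in the batch, and #258 documents the behaviour): requests are admitted from the queue while their worst-case token footprint still fits the batch budget. All names and numbers here are invented for illustration:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    request_id: int
    input_tokens: int
    max_new_tokens: int

def admit(queue: deque, running_tokens: int, max_batch_total_tokens: int):
    """Pop requests from the queue while the worst-case token budget allows it."""
    admitted = []
    while queue:
        req = queue[0]
        worst_case = req.input_tokens + req.max_new_tokens
        if running_tokens + worst_case > max_batch_total_tokens:
            break  # leave the request queued; retry after running requests finish
        queue.popleft()
        running_tokens += worst_case
        admitted.append(req)
    return admitted, running_tokens

queue = deque([Request(1, 512, 128), Request(2, 2048, 1024), Request(3, 2048, 2048)])
batch, used = admit(queue, running_tokens=0, max_batch_total_tokens=4000)
print([r.request_id for r in batch], used)  # requests 1 and 2 fit; request 3 stays queued
```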
chore(github): add templates (#264) fix(server): fix typo in tokenizers decode (#269) closes #268 feat(server): support hf endpoint weight layout (#266) fix(launcher): pass weights cache override to the download process (#274) closes #273 fix(launcher): handle hub branches (#278) fix(server): Removes the parallelism in file convertion (during download) (#275) feat(launcher): Improve error message when download process fails. (#276) fix(server): fix convert (#284) chore: add `flash-attention` to docker ignore (#287) included when building docker locally. (Where the local dirs might have the flash-attention folder.) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> fea(server): decrease convert RAM requirements (#286) fix(dockerfile): fix nvidia env vars (#297) Fixes #291 feat(router): Adding response schema for compat_generate (#292) feat(docker): add benchmarking tool to docker image (#298) fix(docker): fix docker build (#299) feat(server): optim flash causal lm decode_token (#285) fix(docker): fix nvidia env vars (#305) fix(docker): remove nvidia require cuda env (#310) feat(server): shard token decode (#303) feat(server): use float16 (#304) fix(docker): remove CUDA_VERSION feat(server): use cuda graph in logits warping (#302) fix(server): fix multinomial implem in Sampling feat(server): GPTQ quantization (step1) (#277) Changes only the type from `bool` to `Option<Enum>` pretty much everywhere. - Use `Optional[str]` in Python (easier to manage than importing type everywhere). Except for the cli to get proper validation - Updated all models to handle gracefully new values. 
(Error out if unknown value, or gptq since not implemented). <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> chore(docker): use nvidia base image (#318) fix(docker): remove quantize default fix(docker): use ubuntu20.04 Hotfixes for santacoder/bigcode. (#294) Hotfixes: - Uses `model_type`=`gpt_bigcode` for more general usage. - Hotfixes linked lm_head vs wte_embedding (safetensors file do not contain the key, correctly when the file is sharded, where as pytorch copies the tensor) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). 
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Lifting check_unitialized. (#325) Lifting check_unitialized. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> Removing dead variables. (#327) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). 
Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat(ci): custom gpu runners (#328) Single place for TP layers + Dropout Layer Norm + FastLinear (#329) <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. 
<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> feat: add snapshot testing (#282) feat(integration-tests): improve comparison and health checks (#336) fix(server): fix decode token (#334) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> fix: set MODEL_ID in sagemaker-entrypoint script (#343) feat(server): Support BLOOMChat-176B (#348) (#351) @njhill, temporary workaround to be able to run our CI as secrets are not available to runners run by external contributors. I will ask around to see if there is a better way. Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): fix init for flash causal lm (#352) Fixes #347 fix(server): t5 cannot run in f16 (#356) Fix #349 fix(ci): fix security group (#359) Switch security group used for ci (open outbound rules) Signed-off-by: Raphael <oOraph@users.noreply.github.com> Co-authored-by: Raphael <oOraph@users.noreply.github.com> feat: add nightly load testing (#358) chore(sever): update requirements (#357) Fixes #338 feat(server): support fp16 for t5 (#360) Fixes #349 feat(server): do not use device_map auto on single GPU (#362) feat(server): support trust_remote_code (#363) feat(router): log input/ouput at debug level (#364) @njhill FYI v0.7.0 (#353) feat: decrease IPC proto size (#367) Closes #307 #308 feat(benchmarker): add summary tables (#368) feat(server): support vectorized warpers in flash causal lm (#317) Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com> Fix issue when load AutoModelForSeq2SeqLM model (#370) fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(server): fix quantization feat(server): support RefinedWeb models (#379) v0.8.0 increase health checks feat(server): add retry on download (#384) fix(server): fix bnb quantization for CausalLM models (#385) v0.8.1 fix(server): fix has_position_ids (#395) Fix #389 feat(server): remove trust_remote_code requirement for falcon models (#396) feat(server): load santacoder/starcoder models with safetensors (#393) Fix #366 v0.8.2 feat(sagemaker): add trust remote code to entrypoint (#394) feat(launcher): parse oom signal (#404) feat(server): only compute prefill logprobs when asked (#406) Close #288 feat(server): batch tokenization for flash causal lm (#411) chore: update openapi schema feat(server): Rework model loading (#344) Reworked the loading logic. Idea is to use cleaner loading code: - Remove need for `no_init_weights` - Remove all weird `bnb_linear` and `load_weights` and `post_load_weights`. New code layout: - New class `Weights` in charge of handling loading the weights from multiple files into appropiate tensors (potentially sharded) - TP layers now are "shells", they contain the code to know what kind of sharding we need + eventual `all_reduce`. They do not inherit from linear, but they contain some kind of Linear instead - the contained linear can be either FastLinear, BnbLinear or GPTq Linear next. 
- All modeling code is explictly made for sharding, process group is just no-ops for non sharded code (removes a lot of test cases) ![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f) --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net> Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(server): optimize dist ops (#434) docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441) It solves a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelling references to this variable have been found in code logic leading to undefined behaviour or bugs. This PR is not expected to perform any code logic modification. fix(makefile): Fix typo and use POSIX comparison in the makefile (#443) This PR fixes: - The usage of non posix comparison which may fail depending on the shell used (`=` will always work, `==` only with bash) - Typo in the env variable name displayed in the error message `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS` <!-- Remove if not applicable --> Fixes #422 feat(server): pre-allocate past key values for flash causal LM (#412) feat(router): add ngrok integration (#453) feat(server): improve flash attention import errors (#465) @lewtun, is this enough? Closes #458 Closes #456 fix(server): fix warpers on CPU (#472) Closes #471 fix(server): Fixing T5 in case the names are mixed up. (#475) feat(server): Update convert logic. (#483) Should be more robust to shared tensors (ok when using `from_pretrained). But forcing us to add new checks in our loading code (since the chosen key to keep might be different from `transformers`). --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> feat(server): Adding new ignore_rule for conversion. (#485) fix(router): add timeout on flume sends (#488) feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438) Let's start discussing implementation. - Need to expose the quantization scripts (either included here or add doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa) - Make sure GPTQ works for multiple models (priority to Falcon). Currently it means that every place we use `get_{tensor|sharded}` to check for quantization. My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic. This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` should be in a single place, and independant of bias presence. <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. 
If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): Do not init process group if already initialized (#388) feat(router): add header option to disable buffering for the generate_stream response (#498) generate_stream endpoint response stream. Problem: If a model is run behind a proxy server such as nginx that has buffering enabled then the response stream from generate_stream gets aggregated into a single response which basically disables streaming. Instead of getting a chunked response where each token is presented over time the response presents everything all at once. Solution: This change adds the `X-Accel-Buffering` http header which disables buffering for the generate_stream response, allowing the response to stream properly. feat(server): add paged attention to flash models (#516) Closes #478 feat(router): arg validation (#519) feat: Add the option to force another dtype than `f16`. (#513) fix(launcher): fix issue where launcher does not properly report shard failures (#522) v0.9.0 (#525) feat(server): Add Non flash MPT. (#514) This adds a non flash version of MPT. Flash is harder because we need to create a bias ready cuda kernel of flash attention. Fixes https://github.com/huggingface/text-generation-inference/issues/361 Fixes https://github.com/huggingface/text-generation-inference/issues/491 Fixes https://github.com/huggingface/text-generation-inference/issues/290 fix: Update server/Makefile to include Makefile-vllm (#520) For consistency and ease of use (you can just run `make` to install vllm without any extra steps). <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" 
docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)
fix(server): Handle loading from local files for MPT (#534) This PR allows the MPT model to be loaded from local files. Without this change, an exception is thrown by the `hf_hub_download` function if `model_id` is a local path.
fix(server): avoid errors for very small top_p values (#544) See https://github.com/huggingface/transformers/pull/24111 I didn't add validation to the `__init__` method since it's not done for other values/warpers.
feat(server): use latest flash attention commit (#543) @njhill FYI
feat(router): add argument for hostname in router (#545) (#550) In title. Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with
```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health' # failed before this commit
```
Trigger CI
---------
Co-authored-by: Phil Chen <philchen2000@gmail.com>
fix(server): decrease memory fragmentation (#557)
v0.9.1 (#558)
fix(server): harden the weights choice to save on disk. (#561) - Look at the `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights`, which are the standard attributes used to select the keys NOT to save on disk (since they are ignored). - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with https://github.com/huggingface/text-generation-inference/issues/555, https://github.com/huggingface/text-generation-inference/pull/501, https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
feat: better errors for warmup and TP (#575) Close #571
fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Fixes #555
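The "very small top_p values" fix above (#544) is about nucleus sampling masking out every token when `top_p` is tiny. Below is a hedged, generic sketch of top-p filtering with a "keep at least one token" guard; the function name is illustrative and this is not TGI's actual warper code.

```python
# Hypothetical sketch of top-p (nucleus) filtering with a minimum-tokens guard,
# illustrating the failure mode behind #544; not TGI's actual implementation.
import torch

def top_p_filter(logits: torch.Tensor, top_p: float, min_tokens_to_keep: int = 1) -> torch.Tensor:
    """Mask out logits outside the top-p nucleus, always keeping at least one token."""
    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)

    # Remove tokens whose cumulative probability exceeds the threshold.
    sorted_indices_to_remove = cumulative_probs > top_p
    # Shift right so the first token crossing the threshold is still kept;
    # without this, a tiny top_p (e.g. 1e-9) would mask every token.
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., :min_tokens_to_keep] = False

    indices_to_remove = sorted_indices_to_remove.scatter(-1, sorted_indices, sorted_indices_to_remove)
    return logits.masked_fill(indices_to_remove, float("-inf"))

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
print(top_p_filter(logits, top_p=1e-9))  # only the argmax survives
```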
feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580) Some models are already converted and do not have those values in the file; this enables users to use them with less friction. Went for a pure env-based approach because adding flags would end up (imo) very tedious to maintain: there's a lot of sanitation to do (those flags would be errors if not used in conjunction with `--quantize gptq`), and the flags would need to exist in the launcher and be passed by the server throughout all function calls. This PR is intended as an easy escape hatch, not the de facto method to use gptq in TGI. Fixes #500
chore: migrate ci region for more availability. (#581)
fix(server): T5 weights names. (#582) Fixes #541
fix(server): Adding logger import to t5_modeling.py (#585) Logger is referenced during the apex importing but is not imported, causing a NameError.
fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) This fixes a typo and extends the GPTQ_BITS environment variables through to the second method which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix.
feat(server): Implements sharding for non divisible `vocab_size`. (#583) - The code is relatively easy (just disable the checks on Embedding and Head). This cannot be done in the same easy fashion for hidden_dim/head_dim: it's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization is another quite hairy piece of code.
feat(server): empty cache on errors
GPTQ Env vars: catch correct type of error (#596) When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct.
feat(launcher): add arg validation and drop subprocess (#595)
feat(router): explicit warning if revision is not set (#608)
docs: README: Add logo + baseline (#611) ![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984)
fix(server): blacklist local files (#609) Close #589 #602
v0.9.2 (#616)
fix(server): empty_cache when stopped
fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621)
feat(launcher): debug logs (#623)
feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) It should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results.
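The GPTQ env-variable entries above (#580, #590, #596) revolve around reading `GPTQ_BITS` / `GPTQ_GROUPSIZE` from the environment and handling the failure cases cleanly. A minimal, hedged sketch of that idea follows; the helper name and error messages are illustrative, not TGI's exact code.

```python
# Hedged sketch of env-based GPTQ parameters (see #580 / #590 / #596 above);
# the function name and messages are illustrative only.
import os
from typing import Tuple

def gptq_params_from_env() -> Tuple[int, int]:
    """Read GPTQ parameters from the environment, handling missing and malformed values."""
    try:
        bits = int(os.environ["GPTQ_BITS"])
        groupsize = int(os.environ["GPTQ_GROUPSIZE"])
    except KeyError as err:    # variable not set at all
        raise RuntimeError(f"{err.args[0]} must be set when using GPTQ") from err
    except ValueError as err:  # set, but not an integer
        raise RuntimeError("GPTQ_BITS and GPTQ_GROUPSIZE must be integers") from err
    return bits, groupsize

# Example invocation: GPTQ_BITS=4 GPTQ_GROUPSIZE=128 python serve.py
```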
feat(server): flash attention v2 (#624)
feat(server): add support for llamav2 (#633)
v0.9.3 (#634)
fix(server): fix llamav2 config (#635)
feat(server): auto max_batch_total_tokens for flash att models (#630)
feat(router): ngrok edge (#642)
docs: Update README.md (#639)
docs: Update README.md (#643)
Add trust_remote_code to quantize script (#647) Fixes a bug that appeared with MR #587 fixing issue #552. See the discussion in #552. With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, although it is present in the function signature. This prevents models like falcon from being quantized, because trust_remote_code is required. This MR fixes the issue.
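The trust_remote_code fix above (#647) is just about forwarding a flag that was already in the signature. A hedged sketch of the pattern, with a placeholder entry point rather than TGI's actual quantize script:

```python
# Hedged sketch of forwarding trust_remote_code in a quantize entry point (#647);
# the function and defaults are placeholders, not the actual TGI script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_for_quantization(model_id: str, trust_remote_code: bool = False):
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=trust_remote_code)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        # The bug: accepting trust_remote_code in the signature but never
        # forwarding it, which breaks custom-code models such as Falcon.
        trust_remote_code=trust_remote_code,
    )
    return model, tokenizer
```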
fix(server): llama v2 GPTQ (#648) As per title & reported in https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 and https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5 Test it:
```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```
fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661)
fix(server): use mem_get_info to get kv cache size (#664) Close https://github.com/huggingface/text-generation-inference/issues/649 Close https://github.com/huggingface/text-generation-inference/issues/651 Close https://github.com/huggingface/text-generation-inference/issues/653 Close #636
feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) Just trying to get the integration tests to pass.
---------
Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>
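The `mem_get_info` entry above (#664) sizes the KV cache from the memory that is actually free. Below is a hedged sketch of that calculation; the memory fraction, block size, shapes and helper name are illustrative assumptions, not TGI's exact code.

```python
# Hedged sketch of sizing a paged KV cache from free GPU memory (#664);
# fraction, block size and shapes are illustrative assumptions only.
import torch

def num_kv_cache_blocks(
    num_layers: int,
    num_heads: int,
    head_dim: int,
    block_size: int = 16,
    dtype: torch.dtype = torch.float16,
    memory_fraction: float = 0.9,
) -> int:
    free_bytes, _total_bytes = torch.cuda.mem_get_info()  # bytes free right now
    dtype_bytes = torch.finfo(dtype).bits // 8
    # One block stores `block_size` tokens of keys *and* values for every layer.
    bytes_per_block = 2 * num_layers * num_heads * head_dim * block_size * dtype_bytes
    return int(free_bytes * memory_fraction) // bytes_per_block
```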
Directly load GPTBigCode to specified device (#618) This PR directly loads GPTBigCode to the specified device, avoiding moving the model between devices.
feat(server): add local prom and health routes if running w/ ngrok
feat: add cuda memory fraction (#659) Close #673
fix(server): fix exllama buffers (#689) Close #683
feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) - The current PR is not great because we're side-stepping `Weights.__init__`, but Weights shouldn't require anything related to the config or the model_id as it aims to be a simple wrapper over multi-file loading. - The ideal solution would be to use something like a Rust enum
```
enum Quantize {
    Bitsandbytes(Bitsandbytes),
    Gptq { bits: usize, groupsize: usize },
}
```
and pass that around during load. Unfortunately we don't have access to this, so for now side-stepping seems easier. - Re-enabling groupsize<0 with exllama (confirmed it works.) Helps #601 In next steps we should make sure our quantization script uses that format and make it standard.
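The `quantize_config.json` entry above (#671) replaces the environment variables with per-repo metadata. A hedged sketch of reading it follows; the field names follow the common AutoGPTQ convention and the env-var fallback is an assumption for illustration, not necessarily TGI's exact loader.

```python
# Hedged sketch of reading GPTQ parameters from quantize_config.json (#671), with a
# fallback to the older env vars from #580; field names follow the usual AutoGPTQ
# convention and the helper itself is illustrative.
import json
import os
from pathlib import Path
from typing import Tuple

def load_gptq_params(model_path: str) -> Tuple[int, int]:
    config_file = Path(model_path) / "quantize_config.json"
    if config_file.exists():
        data = json.loads(config_file.read_text())
        return int(data["bits"]), int(data["group_size"])
    # Fallback: the legacy environment variables.
    return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```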
docs(README): update readme
fix(server): fix quantization python requirements (#708)
fix(server): fix missing datasets in quantize
feat(server): support new falcon config (#712)
v0.9.4 (#713)
Add section about TGI on other AI hardware accelerators in README (#715) As per title.
docs: Add hardware section to TOC in README (#721)
feat(server): update vllm version (#723)
chore: update license to HFOIL (#725)
v1.0.0 (#727)
Local gptq support. (#738) Redoes #719
Fix typing in `Model.generate_token` (#733) This PR fixes a minor type annotation issue in the signature of `Model.generate_token`. All existing overrides of `Model.generate_token` return `Tuple[List[Generation], Optional[B]]`: https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591 I suspect that back in 017a2a8c when `GeneratedText` and `Generation` were separated, the function signature was not updated. CC @OlivierDehaene
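For reference, a minimal sketch of the corrected annotation described in #733; `Batch` and `Generation` here stand in for TGI's actual types and the method body is elided.

```python
# Minimal sketch of the corrected return annotation from #733; Batch/Generation
# are placeholders for TGI's actual classes.
from typing import Generic, List, Optional, Tuple, TypeVar

class Generation: ...
class Batch: ...

B = TypeVar("B", bound=Batch)

class Model(Generic[B]):
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        """Run one decoding step; return the generations plus the next batch (or None when done)."""
        raise NotImplementedError
```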
Adding Rope scaling. (#741) - Adds Rope NTK scaling. Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed. Took some code from https://github.com/huggingface/transformers/pull/24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something like ("linear:4.0", or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512
chore: fix typo in mpt_modeling.py (#737) Fixed typo: implemetation -> implementation
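The rope scaling entry above (#741) distinguishes a linear and a dynamic NTK mode. The sketch below loosely follows the transformers PR referenced there, using the commonly cited formulas for the two variants; it is a hedged illustration, not TGI's exact implementation.

```python
# Rough sketch of linear vs. dynamic NTK rope scaling (#741), loosely following the
# transformers PR referenced above; not TGI's exact code.
import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

def scaled_rope(dim, seq_len, max_position, rope_scaling=None, rope_factor=1.0, base=10000.0):
    positions = torch.arange(seq_len, dtype=torch.float32)
    if rope_scaling == "linear":
        # Stretch positions: equivalent to dividing position ids by the factor.
        positions = positions / rope_factor
        inv_freq = rope_inv_freq(dim, base)
    elif rope_scaling == "dynamic" and seq_len > max_position:
        # Dynamic NTK: grow the rotary base with the overshoot instead of the positions.
        adjusted_base = base * (
            (rope_factor * seq_len / max_position) - (rope_factor - 1)
        ) ** (dim / (dim - 2))
        inv_freq = rope_inv_freq(dim, adjusted_base)
    else:
        inv_freq = rope_inv_freq(dim, base)
    freqs = torch.outer(positions, inv_freq)
    return torch.cos(freqs), torch.sin(freqs)
```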
tjluyao
added a commit
that referenced
this pull request
Jul 7, 2024
commit 6adf97815ef6828e0aa06f2a4635370b4ad7476e Author: Alfred Gui <alfredzqgui@gmail.com> Date: Sat Jul 6 13:18:16 2024 -0400 Fix the decoding logic in test_local_grpc.py (#44) * fix the test_local_grpc script * lint fix
commit f355733482f4ebc15916df151ad00ad9d64d451d Author: Yao Lu <fdyaolu@gmail.com> Date: Sat Jul 6 07:50:55 2024 -0700 bug fixes
commit 466b0a65429d339a1c004c5991749e6f9cb1230b Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jul 1 22:48:56 2024 -0400 Add the batch concatenation functionality for flashinfer server (#43) * refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint
commit b9838c5c4720ff09f946e7fce8dd328aab57dc16 Author: NovTi <yx2432@nyu.edu> Date: Tue Jul 2 00:07:24 2024 +0800 Add ChatGLM and refactor Qwen2
commit 9fafffcfacb8ded0d0d5aefac2cf38ae3a44876f Author: PeterYaoNYU <yy4108@nyu.edu> Date: Mon Jul 1 10:30:21 2024 +0800 update mistral flashinfer
commit d099bbbbeeaf638220696b5c9f94cf9634f8c221 Author: Yao Lu <fdyaolu@gmail.com> Date: Sun Jun 30 18:39:44 2024 -0700 update submodules
commit 4edacd568d064cb834597d8cf2f24bf1bef20683 Author: Yao Lu <fdyaolu@gmail.com> Date: Sun Jun 30 18:29:34 2024 -0700 update submodules
commit 9da076dc488140273ab17773ae642e8ac3edb119 Author: Yao Lu <fdyaolu@gmail.com> Date: Sun Jun 30 18:17:41 2024 -0700 minor fix in makefile
commit fa213e263fd86ec41d033cb8d46dea07076720bd Author: MichaelYuan2 <hy2203@nyu.edu> Date: Tue Jun 25 10:41:09 2024 +0800 update FlashinferAttentionWrapper to flashinfer 0.0.6
commit 8d3dd4898a26f89d82233640123aad90e2477bb6 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 24 11:25:08 2024 -0400 Fix the server CLI issue with use_flashinfer flag (#42) * fix refactor * empty * fix lint
commit 23118727bdf000d87115df9ac6a6ccf3aee7a2ef Author: Alfred Gui <alfredzqgui@gmail.com> Date: Sat Jun 22 17:22:51 2024 -0400 decouple flashinfer files from flash attention (#41)
commit 9b3c09850ddfdd8141601ee9b1b027e4aa2d4b83 Merge: 4a40c64 f0d3664 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Thu Jun 20 11:13:14 2024 -0400 Merge pull request #40 from mlsys-io/add_baichuan Adjust the flashinfer llama model to accommodate the baichuan model
commit f0d3664f34acae5020f045fabca15aa310ce60ec Author: Alfred Gui <alfredzqgui@gmail.com> Date: Thu Jun 20 10:46:12 2024 -0400 adjust the flashinfer llama model to accommodate baichuan
commit 4a40c6415cd7f1d29bab6de9907ca8ac66833863 Merge: 0ba0ac9 6aaab88 Author: Yao Lu <fdyaolu@gmail.com> Date: Mon Jun 17 10:15:42 2024 -0700 Merge branch 'master' of github.com:mlsys-io/kv.run
commit 0ba0ac9dd8825cef92cd7b92fef49ab0efcb8fbd Author: Yao Lu <fdyaolu@gmail.com> Date: Mon Jun 17 10:01:44 2024 -0700 minor fix in output example
commit 6aaab883fb154b960de9ad501de74ad90f447725 Merge: 7a93d84 08fde0f Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 17 12:46:13 2024 -0400 Merge pull request #38 from mlsys-io/flash_attn_rotary Use Flash attention for rotary embedding and layer normalization for Phi2 and Phi3
commit 08fde0f9ab74fd54fe59bbca5020448a862c1188 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 17 12:43:19 2024 -0400 revert test file
commit c51e36e3a3bf60f5e23f3a1fee5fe6b116fcc362 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 17 12:42:16 2024 -0400 fix lint
commit 7dfa57d5ca29e366c1d7c6de01ef6e81840fd7d5 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 17 12:40:40 2024 -0400 empty
commit b45e8968e75976ac506dc25b467e41520c457d48 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 17 14:17:55 2024 +0000 fix phi2 and phi3 modeling
commit 31ad6bd942293ce18addf79944da5d742518f900 Merge: 1e2bf10 7a93d84 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 17 08:55:24 2024 -0400 merge master
commit 1e2bf1026420e298cb7fbed4d73166baefcbf615 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 17 06:43:51 2024 -0400 fix the flashinfer adapter
commit da84f6bcce038029714f48916510964d5b00d757 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Mon Jun 17 00:51:55 2024 +0000 fixes
commit e0feabb012e8d82d6265dc85811a47fff44c1c65 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Sun Jun 16 20:20:59 2024 -0400 fix rotary bug
commit 7a93d8413fbfb62e8ae6646a12aaed55b36afaa1 Author: Yao Lu <fdyaolu@gmail.com> Date: Sat Jun 15 22:50:16 2024 -0700 update to rust 1.79
commit 6c4fa6effac801c7c4a30479eca30a7c5ecb057d Author: Yao Lu <fdyaolu@gmail.com> Date: Sat Jun 15 22:15:41 2024 -0700 minor fixes
commit ad40a1752d5964554814261754e63a1122829ce9 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Sun Jun 16 01:49:28 2024 +0000 flash attn rotary
commit 868d3f2fa74a07178806eadc79a2f23f59bafa77 Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 22:57:09 2024 -0700 minor router-server fix
commit b8a47854a60d21347d6e4f66a507d1a4d2580c30 Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 16:43:32 2024 -0700 finalize docker build workflow
commit fa2f2f2c8d5249e151cb51c35ef2952cf937b98c Merge: 93edec5 85f34cb Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 14:16:29 2024 -0700 Merge branch 'master' of github.com:mlsys-io/kv.run
commit 93edec51ef1714c95b56699ad0b284f6c0b7a916 Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 14:16:18 2024 -0700 dependency and rust toolchain fix
commit 85f34cb1147265e3d13080d032c92b7d81d09895 Merge: de58365 e263ba8 Author: Alfred Gui <alfredzqgui@gmail.com> Date: Fri Jun 14 15:16:44 2024 -0400 Merge pull request #36 from mlsys-io/fix_warm Fix the warm-up issue
commit de5836558a56c3541ec9be3b1d41dde51d08969a Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 12:06:42 2024 -0700 fix in workflow
commit 83fc271da0ef6c0580d5d8491605b582c2d730cc Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 11:32:29 2024 -0700 build workflow update
commit 66d272347539741c6750841938123b5522abb144 Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 09:10:00 2024 -0700 docker workflow
commit e8f9ff4f2be08421219acc6d2b611e2c4ba87768 Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 09:08:55 2024 -0700 docker workflow
commit e49f754e1fb33af4b9bf33bcc08a6d23d4cacb56 Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 00:04:32 2024 -0700 remove tgi build workflow
commit a4802b7867e766e492cb1f99877f386148962c3a Author: Yao Lu <fdyaolu@gmail.com> Date: Fri Jun 14 00:01:15 2024 -0700 docker build workflow; remove submodules (#35) * test docker * docker * remove submodule * updates
commit e263ba802023d45ee5b26df0d90f8401ee0f87aa Author: Alfred Gui <alfredzqgui@gmail.com> Date: Thu Jun 13 20:32:48 2024 -0400 fix warm up issue
commit c7613eb887ac10ba8d38b00ab26b85ff395ecdc6 Author: Yao Lu <fdyaolu@gmail.com> Date: Thu Jun 13 17:01:27 2024 -0700 test docker (#34)
commit e61ea779f8dffacab0a161aa13135999d6ec3ee7 Author: Yao Lu <fdyaolu@gmail.com> Date: Thu Jun 13 09:47:33 2024 -0700 minor fixes and rename tests.xml
commit 8ae802cb8848df58fc9c0c279044f5b50309044e Author: Yao Lu <fdyaolu@gmail.com> Date: Tue Jun 11 14:09:50 2024 -0700 fix dtype bugs in flashinfer model def
commit b821d68f4120951bbde7f57ca0ad9ba914d33354 Author: Yao Lu <fdyaolu@gmail.com> Date: Tue Jun 11 11:30:51 2024 -0700 bug fix in layers/__init__.py
commit b7c8735c77cb76446ba30efbb20f19067289fcab Author: Yao Lu <fdyaolu@gmail.com> Date: Tue Jun 11 10:33:50 2024 -0700 minor typo
commit 6010fad087f477174766981acc162322e1d767da Author: Yao Lu <fdyaolu@gmail.com> Date: Tue Jun 11 10:30:45 2024 -0700 critical output bug (#25) * output debug * update minor
commit b599cc65ecb8215cfcc8a9db6daa0d88450b9cc5 Author: Alfred Gui <zgui@flexport.com> Date: Tue Jun 11 10:34:24 2024 -0400 Decouple flashinfer code paths from flash attention library dependencies (#33) * decouple flash attn dependency from flashinfer code paths * follow up
commit e0cd4a67f7cffdc620baa5d1ae22a32a3be94d4e Author: Alfred Gui <zgui@flexport.com> Date: Tue Jun 11 09:47:06 2024 -0400 reformat the llama files (#32)
commit 6c96fddcbbe4c16f97fe391ef3387702234f4f65 Author: Alfred Gui <zgui@flexport.com> Date: Mon Jun 10 21:02:42 2024 -0400 Llama rewrite (#31) * write llama in tgi style * fixes * fix the runtime issues
commit 9dd3b75af84cb0d3411bd43fc0414e4592193037 Author: Yao Lu <fdyaolu@gmail.com> Date: Mon Jun 10 17:10:20 2024 -0700 Kv.run test workflows (#30) * python 3.10 * python 3.10.14 * update doc * dispatch * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow * update python workflow
commit 9ec483dae3eb34f594511b649370af354d5d0923 Author: Yao Lu <fdyaolu@gmail.com> Date: Mon Jun 10 15:15:35 2024 -0700 kv.run test workflows (#29) * python 3.10 * python 3.10.14 * update doc * dispatch
commit 4757af8b6bb5b5548e17c5aeee767f5650607aed Author: Yao Lu <fdyaolu@gmail.com> Date: Mon Jun 10 14:53:52 2024 -0700 kv.run test workflow
commit d58a35ed4694a18b1d3028b79cab9b3227ccdafc Author: Yao Lu <fdyaolu@gmail.com> Date: Mon Jun 10 11:41:13 2024 -0700 Compliant for pre-commit configs
commit a8144374aa50e85016c19fa6f4a45c7f7c724d46 Author: Alfred Gui <zgui@flexport.com> Date: Mon Jun 10 06:45:29 2024 -0400 Introduce the flashinfer attention wrapper abstraction and use it for Llama and Gemma models (#28) * abstract the attention layer * fix the bugs
commit 3956e467fd043e8218462e475d71892784ad5907 Author: Alfred Gui <zgui@flexport.com> Date: Sun Jun 9 06:36:01 2024 -0400 Refactor the Flashinfer models (#27) * refactor the flashinfer models * fixes
commit 7dda533b23d548bff8c569370daff203699a6e60 Author: Alfred Gui <zgui@flexport.com> Date: Sat Jun 8 08:40:55 2024 -0400 Support Flashinfer based Phi2 and Phi3 models (#26) * add phi model * fix phi integration errors * padding for phi * fix modeling for phi * workarounds for phi * use flash attn's position rotary embedding * support phi3 and baichuan * fix position encoding * clean up
commit 482ef988e2c2ef59743aeaff01d79b72e0546baa Author: NovTi <yx2432@nyu.edu> Date: Wed Jun 5 22:04:14 2024 +0800 Add qwen2 1.8b and 72b base inference
commit 5935ccedd980669c1366d70f20b5c3739184815f Author: Yao Lu <fdyaolu@gmail.com> Date: Tue Jun 4 21:30:52 2024 -0700 add lora functions to python client; test llama-3-70b AWQ
commit 48b505376376f01e36b69bb0026f9a6af7e95676 Author: Yao Lu <fdyaolu@gmail.com> Date: Tue Jun 4 13:28:18 2024 -0700 testing llama-3-70b-gptq
commit 80d4a605347f60c6d12958a577182b27ec413def Author: NovTi <yx2432@nyu.edu> Date: Tue Jun 4 22:03:11 2024 +0800 Fix minor typos
commit e6af233933f9709e7da606409151c0802520f6ef Author: NovTi <yx2432@nyu.edu> Date: Mon Jun 3 22:33:17 2024 +0800 Integrate qwen2
commit 72d74cf82d1976457881318ae035b956fde3f220 Author: Yao Lu <fdyaolu@gmail.com> Date: Sun Jun 2 20:42:44 2024 -0700 Update Makefile to include punica kernels
commit e7fb9b9dc6651aeb68e9e793d0d25381a14e12b5 Author: PeterYaoNYU <yy4108@nyu.edu> Date: Mon Jun 3 10:51:16 2024 +0800 integrate lora into mistral
commit 47f4685004ac7db295c46ec9a69f62a783fe07a6 Author: Alfred Gui <zgui@flexport.com> Date: Sun Jun 2 08:34:24 2024 -0400 add placeholder for flashinfer phi modeling (#24)
commit 40a70bcc369c6b61f486dc273ab0fd4330e21d58 Author: Yao Lu <fdyaolu@gmail.com> Date: Sat Jun 1 22:06:30 2024 -0700 Update README.md
commit f125e73ade681ac4e60cd48488a59f2bab162f97 Merge: 79402fb 7243638 Author: Yao Lu <fdyaolu@gmail.com> Date: Sat Jun 1 21:22:58 2024 -0700 Merge pull request #23 from mlsys-io/reorder-codebase Reorder code base
commit 72436388e230d6778a6303fd656befa19632dbba Author: rainj-me <rain-jiang@outlook.com> Date: Sat Jun 1 19:10:39 2024 -0700 fix the lora-id parameter in the benchmark
commit 650c743e1572b35c0c304edcba8afb3b8865935d Merge: 79402fb 799a193 Author: rainj-me <rain-jiang@outlook.com> Date: Sat Jun 1 18:58:38 2024 -0700 directly merge from tgi
commit 799a193b109662743bed1b18a09af1fdcd508c8b Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Sat Jun 1 08:47:00 2024 +0000 Fixing Phi3.
commit 79402fb10d115a1ebe19ad97dd1482bd03479c80 Author: Yao Lu <fdyaolu@gmail.com> Date: Fri May 31 16:02:53 2024 -0700 Rest API to download lora adapter on router
commit 08b3eac2ce54e25bec12088fd7e69ee3c07adaf5 Author: Nicholas Broad <nbroad94@gmail.com> Date: Fri May 31 09:42:14 2024 -0700 single char ` addition for docs (#1989) I think this will fix the docs from being weirdly formatted. All the sections after MAX_TOP_N_TOKENS don't show up in the bar on the right (https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtopntokens) @merveenoyan Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
commit 5ab4cef67ef6326429a0e4e3d44b9710d9f26c53 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Fri May 31 18:01:43 2024 +0200 Fixing exl2 scratch buffer. (#1990)
commit 06edde94910594eef86988934cbbc43d775eb965 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Fri May 31 17:57:01 2024 +0200 Purely refactors paged/attention into `layers/attention` and make hardware differences more obvious with 1 file per hardware. (#1986)
commit 659bd67fec0a874e325fc2a2afd0c2ed2af692f0 Author: fxmarty <9808326+fxmarty@users.noreply.github.com> Date: Fri May 31 07:03:24 2024 -0700 Update documentation version to 2.0.4 (#1980) As per title cc @Narsil
commit 967ced2ff4565a5358d45a1372d32fbab113700b Author: Daniël de Kok <me@danieldk.eu> Date: Thu May 30 07:10:10 2024 +0000 Gemma GPTQ checks: skip logprob checks. This test fails somewhat regularly due to non-determinism, and it is primarily there to verify that we correctly load a model which doesn't have `float16` as the default dtype.
commit 36dd16017c7211b7760d1daa188172bb902e486f Author: Daniël de Kok <me@danieldk.eu> Date: Tue May 28 09:51:31 2024 +0000 Add support for exl2 quantization. Mostly straightforward, changes to existing code: * Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict. * Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.
commit cbced7f0f9ca0b62216223859b82a2632d1c7a1f Author: drbh <david.richard.holtz@gmail.com> Date: Wed May 29 12:42:11 2024 -0400 feat: adjust attn weight loading logic (#1975) This PR updates `load_attention` to prefer loading specific attention based on the model type. Additionally, there were two cases where `TensorParallelColumnLinear.load_multi` was called; this reduces it to a single path.
commit 612bc483b6f5029918039e684982fc1bfbe1b502 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Tue May 28 16:55:36 2024 +0200 Fixing the text part from tokenizer endpoint. (#1967)
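The exl2 commit above describes wrapping quantizer parameters instead of passing untyped tuples around. A hedged sketch of that idea follows; the dataclass and field names are illustrative, not TGI's actual types.

```python
# Hedged sketch of the "wrap quantizer parameters" idea from the exl2 commit above;
# the class and field names are illustrative only.
from dataclasses import dataclass
import torch

@dataclass
class GPTQParams:
    qweight: torch.Tensor
    qzeros: torch.Tensor
    scales: torch.Tensor
    g_idx: torch.Tensor
    bits: int
    groupsize: int

def make_quantized_linear(params: GPTQParams):
    # Downstream code receives one typed object instead of an untyped tuple
    # that has to be repacked into a dict at every call site.
    ...
```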
commit f20463e4e3a994fbcbc836cd315c14b766c72205 Author: Daniël de Kok <me@danieldk.eu> Date: Tue May 28 07:25:14 2024 +0000 Fix (non-container) pytest stdout buffering-related lock-up. Two issues: 1. When one of the stdout/stderr pipe buffers of a process started with `subprocess.Popen` is full, the process can get blocked until the buffer is drained. 2. Calling `Popen.wait` can deadlock when called before draining the pipe buffers (if they are full). This avoids the issue altogether by giving the child process a temporary file to write to.
commit e76b9824ae965e95923dbcf50aa30efb633a1974 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Tue May 28 14:52:17 2024 +0200 Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959) - Axum upgraded to hyper 1.0 and most of the ecosystem has switched, so it's our time now. - [ngrok-rust](https://github.com/ngrok/ngrok-rust/pull/137/files) hasn't yet, and hasn't for several months now, so let's disable the feature for the time being.
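The pytest commit above explains the deadlock and its fix. Below is a hedged, generic sketch of the temp-file approach; the command is a placeholder and this is not the actual test fixture.

```python
# Hedged sketch of the temp-file approach described above: writing the child's output
# to a file avoids both the full-pipe stall and the Popen.wait deadlock.
import subprocess
import tempfile

with tempfile.TemporaryFile("w+") as log:
    proc = subprocess.Popen(
        ["text-generation-launcher", "--help"],  # placeholder command
        stdout=log,
        stderr=subprocess.STDOUT,
    )
    proc.wait()        # safe: no pipe buffer can fill up and block the child
    log.seek(0)
    print(log.read())  # inspect the captured output afterwards
```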
commit b7ffa287f228e065c45a99684e73b862a5166fac Author: Moritz Laurer <41862082+MoritzLaurer@users.noreply.github.com> Date: Mon May 27 17:31:06 2024 +0200 fix small typo and broken link (#1958) Fix a typo; fix a broken link; add one sentence in the guidance docs to make the word "grammar" less abstract. @drbh
commit 0732b9d2f0fb9a4dd9753bdabe3ddb7d452c49cf Author: drbh <david.richard.holtz@gmail.com> Date: Mon May 27 10:03:16 2024 -0400 Processor config chat template (#1954) This PR loads the `processor_config` similar to the `tokenizer_config` and uses the processor_config's chat_template if the tokenizer_config does not include one. These changes enable chat with idefics2.
commit a401c83c355d3b66ad158f4798b58bb5c696caac Author: Daniël de Kok <me@danieldk.eu> Date: Mon May 27 14:41:28 2024 +0200 Fix GPTQ for models which do not have float16 at the default dtype (simpler) (#1953) Before this change, GPTQ models would not work if the model's default data type is not `float16`. For example, Gemma GPTQ models would fail because the default dtype of Gemma is `bfloat16`. There are two issues: if the default `dtype` is not `float16`, the quantizer's `float16` parameters get converted to that dtype, and the kernels cannot deal with non-`float16` types; the same applies to inputs of quantized ops. This is resolved by setting the dtype of gptq/awq-quantized models to `float16`. Simpler version of #1951.
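The #1953 commit above boils down to a single dtype rule. A hedged sketch of that rule; the helper and its arguments are illustrative, not TGI's exact code.

```python
# Hedged sketch of the dtype rule from #1953: gptq/awq-quantized checkpoints are
# loaded as float16 even when the model's configured default dtype is bfloat16.
import torch

def resolve_dtype(config_dtype: torch.dtype, quantize: str | None) -> torch.dtype:
    if quantize in ("gptq", "awq"):
        # The quantization kernels only support float16 weights and activations.
        return torch.float16
    return config_dtype

assert resolve_dtype(torch.bfloat16, "gptq") is torch.float16
assert resolve_dtype(torch.bfloat16, None) is torch.bfloat16
```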
commit 9231098f3a9b2f0fe7f6652f10f02f4d8f551143 Author: Daniël de Kok <me@danieldk.eu> Date: Fri May 24 15:34:42 2024 +0000 Fix (flash) Gemma prefix and enable tests
commit d32e33bd489f2419e579f5d423073791ee19f789 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Fri May 24 15:36:13 2024 +0200 Fix seeded output. (#1949)
commit cff472ba2b9147015ffd005aace282481d489695 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Fri May 24 12:40:39 2024 +0200 Fixing codellama loads by using purely `AutoTokenizer`. (#1947) - The need for the slow tokenizer default stems from back when llama 1 was introduced and all the flags were not yet supported in `tokenizers`. - Fixes #1891
commit 954653466d24a9b3435988136983398bdf788a2f
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Thu May 23 15:40:40 2024 +0200

    Improving the logging system. (#1938)

    - Added a debug log for speculated ids (helps to see the quality of a speculator in the logs).
    - Remove newlines from child-process logs when re-emitting them in non-JSON mode.
    - Made the standard level closer to what's expected (only our binaries' level).
    - Propagate that level correctly to the shard (it was forced to INFO).
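As a rough illustration of the newline cleanup mentioned in #1938 (the real launcher is not written in Python; the names below are hypothetical):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")


def reemit_child_log(raw_line: bytes) -> None:
    # Strip the trailing newline before re-emitting, so each child-process
    # record stays on a single line in non-JSON mode.
    logging.info(raw_line.decode(errors="replace").rstrip("\n"))


reemit_child_log(b"Server started at unix:///tmp/text-generation-server-0\n")
```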
commit 629047cb82d2ff97a8f0d0446ed7a3a68bed63a7
Author: Thomas Schillaci <thomas.schillaci@gmail.com>
Date: Thu May 23 15:37:09 2024 +0200

    Add completion route to client and add stop parameter where it's missing (#1869)

    - Add the stop parameter to the completion route.
    - Add the completion method to the python client.
    - Add the stop parameter to the python client's chat method.

    Tagged reviewer: @Narsil.

    Co-authored-by: Thomas SCHILLACI <tschilla@px101.prod.exalead.com>
    Co-authored-by: Thomas Schillaci <thomas.schillaci@3ds.com>
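A hedged usage sketch for the stop parameter added in #1869. The payload below targets an OpenAI-style completion route on a locally running server; the exact route path and field names are assumptions based on the commit summary, not a verbatim copy of the client API it adds.

```python
import requests

# Ask a locally running TGI instance to stop generating at the first blank line.
response = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "tgi",
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
        "stop": ["\n\n"],
    },
    timeout=60,
)
print(response.json())
```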
commit f4a073ae6d2cbcf6ee353b4e27ea90586893fe8b
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Thu May 23 14:39:38 2024 +0200

    Fixing some legacy behavior (big swapout of serverless on legacy stuff). (#1937)

    Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

commit f41d644a903d179915e122896aba6bc77821795a
Author: Wang, Yi <yi.a.wang@intel.com>
Date: Thu May 23 20:11:08 2024 +0800

    reenable xpu for tgi (#1939)

    Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
commit a103e3e9e2041add8bd83a8b5b35c497784b9722
Author: drbh <david.richard.holtz@gmail.com>
Date: Thu May 23 05:34:18 2024 -0400

    feat: add train medusa head tutorial (#1934)

    Adds a tutorial on self-distilling and training medusa heads for a specific model.

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit efb73fcb598fbb93c6cae7d6667a58b373b0de96
Author: drbh <david.richard.holtz@gmail.com>
Date: Wed May 22 14:46:29 2024 -0400

    fix: use path inside of speculator config (#1935)

    Accesses the path on the speculator similar to `MLPSpeculatorHead.load` and `MedusaHeadV1.load`. These changes resolve the following error when loading a `MedusaHeadV2` locally:

    ```
    TypeError: expected str, bytes or os.PathLike object, not dict
    ```
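An illustrative sketch of the fix described in #1935. The helper name and the `"path"` key below are hypothetical; the point is that a config dict must be unwrapped to its path before being handed to filesystem APIs.

```python
from typing import Union


def resolve_speculator_path(speculator: Union[str, dict]) -> str:
    # MedusaHeadV2 previously received the raw config dict, which triggered
    # "TypeError: expected str, bytes or os.PathLike object, not dict".
    # Mirror MLPSpeculatorHead.load / MedusaHeadV1.load and read the path
    # stored inside the config instead.
    if isinstance(speculator, dict):
        return speculator["path"]
    return speculator


assert resolve_speculator_path({"path": "/models/medusa-heads"}) == "/models/medusa-heads"
assert resolve_speculator_path("/models/medusa-heads") == "/models/medusa-heads"
```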
commit 2f243a1a150da40fc71cbdd08cd07e314cf7098e
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Wed May 22 16:22:57 2024 +0200

    Creating doc automatically for supported models. (#1929)

commit fc0eaffc81fafcc0fb554692f32efbed1c4b2683
Author: drbh <david.richard.holtz@gmail.com>
Date: Wed May 22 03:58:26 2024 -0400

    feat: include token in client test like server tests (#1932)

    Includes the HF token in the client tests, similar to how it is included in the server tests. This helps avoid CI failures due to rate limiting.

commit 904ff36917e100047669bd6168d7138045469bbe
Author: Junlin Zhou <jameszhou2108@hotmail.com>
Date: Wed May 22 01:12:14 2024 +0800

    docs: Fix grafana dashboard url (#1925)

    Fixes an incorrect url in the monitoring doc.

commit 293b8125e7a6ebd3eff65b55699e9386d1c1abf5
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Mon May 20 02:44:48 2024 +0200

    ROCm: make CK FA2 default instead of Triton (#1924)

    As per title. Triton autotune overhead is prohibitive, as it needs to be done for each different prompt length.
commit f871f114ca5f5a18a2a4a2c7658aed87440d381f
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Sat May 18 13:31:24 2024 +0200

    Fixing the download strategy for ibm-fms (#1917)

commit 5dad0c0b29cf31271c01948653ac164649a3ac78
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Fri May 17 19:50:52 2024 +0200

    Fix TGI issues with ROCm (#1921)

    Not all models were tested in https://github.com/huggingface/text-generation-inference/pull/1764. This fixes some more issues (notably starcoder2); the full CI will come shortly once we split `build.yml` in two.

commit b5f1c9de06ad00bbdeec0348c47f53bee271cedc
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Fri May 17 18:21:51 2024 +0200

    Fix TunableOp bug (#1920)

    cc @Narsil

commit 422bf1f9866e99ef287d6280e8236d22173ee709
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Fri May 17 17:37:23 2024 +0200

    Update grafana template (#1918)

    As per title, there was a mistake; credit to @Narsil. Updated https://huggingface.co/docs/text-generation-inference/basic_tutorials/monitoring as well.

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit c4cf8b49d1ecce2353935c2497bd8c028cb25320
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Fri May 17 16:34:44 2024 +0200

    Add TGI monitoring guide through Grafana and Prometheus (#1908)

    As per title. It is very useful.
commit 232e8d522713f43834d48ae45d1330b0e6dd367e
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Fri May 17 15:30:47 2024 +0200

    MI300 compatibility (#1764)

    Adds support for AMD Instinct MI300 in TGI. Most changes are:

    * Support PyTorch TunableOp to pick the GEMM/GEMV kernels for decoding (https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable). TunableOp is disabled by default and can be enabled with `PYTORCH_TUNABLEOP_ENABLED=1`.
    * Update the ROCm dockerfile to PyTorch 2.3 (actually patched with changes from https://github.com/pytorch/pytorch/pull/124362).
    * Support SILU & Linear custom kernels contributed by AMD.
    * Update vLLM paged attention to https://github.com/fxmarty/rocm-vllm/, branching out of a much more recent commit https://github.com/ROCm/vllm/commit/3489ce7936c5de588916ae3047c44c23c0b0c308.
    * Support the FA2 Triton kernel as recommended by AMD. Can be used by specifying `ROCM_USE_FLASH_ATTN_V2_TRITON=1`.
    * Update the dockerfile to ROCm 6.1.

    By default, TunableOp tuning results are saved in `/data` (e.g. `/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) to avoid having to rerun the tuning at each `docker run`. Example:

    ```
    Validator,PT_VERSION,2.3.0
    Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
    Validator,HIPBLASLT_VERSION,0.7.0-1549b021
    Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
    Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
    GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
    GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
    GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
    GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
    GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
    GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
    GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
    GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
    GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
    GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
    GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
    GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
    GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
    GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
    GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
    GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
    GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
    GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
    GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
    GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
    GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
    GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
    GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
    GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
    GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
    GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
    GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
    GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
    GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
    GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
    GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
    ```

    Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
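A hedged launch sketch tying the #1764 toggles together. The environment variable names come from the commit message; the launcher invocation and model id are only examples, not a prescribed setup.

```python
import os
import subprocess

env = dict(os.environ)
env["PYTORCH_TUNABLEOP_ENABLED"] = "1"       # opt into TunableOp GEMM/GEMV tuning (off by default)
env["ROCM_USE_FLASH_ATTN_V2_TRITON"] = "1"   # use the Triton FA2 kernel instead of CK

# Tuning results are cached under /data, e.g.
# /data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv,
# so subsequent runs skip the tuning phase.
subprocess.run(
    ["text-generation-launcher", "--model-id", "meta-llama/Llama-2-70b-chat-hf"],
    env=env,
    check=True,
)
```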
commit a60fa8406abd98d41e2bfafaf6f81f3dd6044b15
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Fri May 17 11:35:49 2024 +0200

    Removing some unused code. (#1915)

commit 3b5d93e68d22f5db7950175b5210ce6390df8172
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Thu May 16 21:40:10 2024 +0200

    Fixing signals. (#1910)

    Takes the signal handles later, so that regular signal handling is done during loads; we only need to handle SIGINT and SIGTERM during real loads to get more graceful shutdowns when queries are in flight. Fixes #1842.
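A minimal Python sketch of the ordering described in #1910 (the real change lives in the Rust launcher; the function names here are invented):

```python
import signal


def install_graceful_handlers(shutdown) -> None:
    # Only once the model is loaded do we intercept SIGINT/SIGTERM, so that
    # in-flight queries can be drained instead of killed mid-generation.
    signal.signal(signal.SIGINT, lambda *_: shutdown())
    signal.signal(signal.SIGTERM, lambda *_: shutdown())


def serve(load_model, run_server, shutdown) -> None:
    load_model()                          # default signal handling while loading
    install_graceful_handlers(shutdown)   # graceful handling once serving starts
    run_server()
```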
commit b3dd3902e76df777d28ee76993800f4baf73c40c
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Thu May 16 17:21:00 2024 +0200

    Types. (#1909)

…
tjluyao pushed a commit that referenced this pull request on Jul 7, 2024: Adjust the flashinfer llama model to accommodate the baichuan model
alfredgui2 added a commit that referenced this pull request on Jul 7, 2024: Adjust the flashinfer llama model to accommodate the baichuan model
tjluyao pushed a commit that referenced this pull request on Jul 8, 2024: Adjust the flashinfer llama model to accommodate the baichuan model
tjluyao pushed a commit that referenced this pull request on Jul 8, 2024: Adjust the flashinfer llama model to accommodate the baichuan model
tjluyao pushed a commit that referenced this pull request on Jul 8, 2024: Adjust the flashinfer llama model to accommodate the baichuan model
tjluyao pushed a commit that referenced this pull request on Jul 8, 2024: Adjust the flashinfer llama model to accommodate the baichuan model
alfredgui2 added a commit that referenced this pull request on Jul 8, 2024: Adjust the flashinfer llama model to accommodate the baichuan model
What does this PR do?

Fixes # (issue)

Before submitting

- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.