Add the batch concatenation functionality for flashinfer server #43

alfredgui2 · 2024-07-02T02:48:45Z

What does this PR do?

This PR adds the ability to concatenate multiple batches, which is needed when processing a batch decode request. It also has the following side effects:

* Simplified the flashinfer batch management
* Simplified the flashinfer KV cache management
* Moved the batch cache from the controller to flashinfer_causal_lm
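
For reviewers who want a concrete picture of what "concatenating multiple batches" means, here is a minimal sketch. It is not the code in this PR: `DecodeBatch`, its fields, and the paged KV layout are hypothetical names chosen for illustration, and the real flashinfer batch carries more state.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DecodeBatch:
    """Hypothetical stand-in for a flashinfer batch (not the PR's actual class)."""

    batch_id: int
    request_ids: List[int]
    # Tokens generated so far, keyed by request id.
    token_ids: Dict[int, List[int]] = field(default_factory=dict)
    # KV cache page indices owned by each request (assumed paged layout).
    kv_pages: Dict[int, List[int]] = field(default_factory=dict)

    @classmethod
    def concatenate(cls, batches: List["DecodeBatch"]) -> "DecodeBatch":
        """Merge several batches into one, keeping request order."""
        if not batches:
            raise ValueError("need at least one batch to concatenate")
        # Assumption: the merged batch reuses the id of the first batch.
        merged = cls(batch_id=batches[0].batch_id, request_ids=[])
        for batch in batches:
            for rid in batch.request_ids:
                if rid in merged.token_ids:
                    raise ValueError(f"duplicate request id {rid}")
                merged.request_ids.append(rid)
                # Copy the lists so the source batches stay untouched.
                merged.token_ids[rid] = list(batch.token_ids[rid])
                merged.kv_pages[rid] = list(batch.kv_pages[rid])
        return merged


a = DecodeBatch(0, [1, 2], {1: [10], 2: [20, 21]}, {1: [0], 2: [1, 2]})
b = DecodeBatch(1, [3], {3: [30]}, {3: [3]})
merged = DecodeBatch.concatenate([a, b])
print(merged.request_ids)  # [1, 2, 3]
```

Keeping the per-request KV page lists inside the batch object is one way the model class, rather than the controller, could own the batch cache. Whether the PR does exactly this is not stated above.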

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

* refactor flashinfer causal lm
* modify test_local_api
* fixes
* fixes
* lint

commit 6adf97815ef6828e0aa06f2a4635370b4ad7476e
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Sat Jul 6 13:18:16 2024 -0400

    Fix the decoding logic in test_local_grpc.py (#44)

    * fix the test_local_grpc script
    * lint fix

commit f355733482f4ebc15916df151ad00ad9d64d451d
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sat Jul 6 07:50:55 2024 -0700

    bug fixes

commit 466b0a65429d339a1c004c5991749e6f9cb1230b
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jul 1 22:48:56 2024 -0400

    Add the batch concatenation functionality for flashinfer server (#43)

    * refactor flashinfer causal lm
    * modify test_local_api
    * fixes
    * fixes
    * lint

commit b9838c5c4720ff09f946e7fce8dd328aab57dc16
Author: NovTi <yx2432@nyu.edu>
Date: Tue Jul 2 00:07:24 2024 +0800

    Add ChatGLM and refactor Qwen2

commit 9fafffcfacb8ded0d0d5aefac2cf38ae3a44876f
Author: PeterYaoNYU <yy4108@nyu.edu>
Date: Mon Jul 1 10:30:21 2024 +0800

    update mistral flashinfer

commit d099bbbbeeaf638220696b5c9f94cf9634f8c221
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sun Jun 30 18:39:44 2024 -0700

    update submodules

commit 4edacd568d064cb834597d8cf2f24bf1bef20683
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sun Jun 30 18:29:34 2024 -0700

    update submodules

commit 9da076dc488140273ab17773ae642e8ac3edb119
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sun Jun 30 18:17:41 2024 -0700

    minor fix in makefile

commit fa213e263fd86ec41d033cb8d46dea07076720bd
Author: MichaelYuan2 <hy2203@nyu.edu>
Date: Tue Jun 25 10:41:09 2024 +0800

    update FlashinferAttentionWrapper to flashinfer 0.0.6

commit 8d3dd4898a26f89d82233640123aad90e2477bb6
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 24 11:25:08 2024 -0400

    Fix the server CLI issue with use_flashinfer flag (#42)

    * fix refactor
    * empty
    * fix lint

commit 23118727bdf000d87115df9ac6a6ccf3aee7a2ef
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Sat Jun 22 17:22:51 2024 -0400

    decouple flashinfer files from flash attention (#41)

commit 9b3c09850ddfdd8141601ee9b1b027e4aa2d4b83
Merge: 4a40c64 f0d3664
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Thu Jun 20 11:13:14 2024 -0400

    Merge pull request #40 from mlsys-io/add_baichuan

    Adjust the flashinfer llama model to accommodate the baichuan model

commit f0d3664f34acae5020f045fabca15aa310ce60ec
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Thu Jun 20 10:46:12 2024 -0400

    adjust the flashinfer llama model to accomodate baichuan

commit 4a40c6415cd7f1d29bab6de9907ca8ac66833863
Merge: 0ba0ac9 6aaab88
Author: Yao Lu <fdyaolu@gmail.com>
Date: Mon Jun 17 10:15:42 2024 -0700

    Merge branch 'master' of github.com:mlsys-io/kv.run

commit 0ba0ac9dd8825cef92cd7b92fef49ab0efcb8fbd
Author: Yao Lu <fdyaolu@gmail.com>
Date: Mon Jun 17 10:01:44 2024 -0700

    minor fix in output example

commit 6aaab883fb154b960de9ad501de74ad90f447725
Merge: 7a93d84 08fde0f
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 17 12:46:13 2024 -0400

    Merge pull request #38 from mlsys-io/flash_attn_rotary

    Use Flash attention for rotary embedding and layer normalization for Phi2 and Phi3

commit 08fde0f9ab74fd54fe59bbca5020448a862c1188
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 17 12:43:19 2024 -0400

    revert test file

commit c51e36e3a3bf60f5e23f3a1fee5fe6b116fcc362
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 17 12:42:16 2024 -0400

    fix lint

commit 7dfa57d5ca29e366c1d7c6de01ef6e81840fd7d5
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 17 12:40:40 2024 -0400

    empty

commit b45e8968e75976ac506dc25b467e41520c457d48
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 17 14:17:55 2024 +0000

    fix phi2 and phi3 modeling

commit 31ad6bd942293ce18addf79944da5d742518f900
Merge: 1e2bf10 7a93d84
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 17 08:55:24 2024 -0400

    merge master

commit 1e2bf1026420e298cb7fbed4d73166baefcbf615
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 17 06:43:51 2024 -0400

    fix the flashinfer adapter

commit da84f6bcce038029714f48916510964d5b00d757
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Mon Jun 17 00:51:55 2024 +0000

    fixes

commit e0feabb012e8d82d6265dc85811a47fff44c1c65
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Sun Jun 16 20:20:59 2024 -0400

    fix rotary bug

commit 7a93d8413fbfb62e8ae6646a12aaed55b36afaa1
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sat Jun 15 22:50:16 2024 -0700

    update to rust 1.79

commit 6c4fa6effac801c7c4a30479eca30a7c5ecb057d
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sat Jun 15 22:15:41 2024 -0700

    minor fixes

commit ad40a1752d5964554814261754e63a1122829ce9
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Sun Jun 16 01:49:28 2024 +0000

    flash attn rotary

commit 868d3f2fa74a07178806eadc79a2f23f59bafa77
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 22:57:09 2024 -0700

    minor router-server fix

commit b8a47854a60d21347d6e4f66a507d1a4d2580c30
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 16:43:32 2024 -0700

    finalize docker build workflow

commit fa2f2f2c8d5249e151cb51c35ef2952cf937b98c
Merge: 93edec5 85f34cb
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 14:16:29 2024 -0700

    Merge branch 'master' of github.com:mlsys-io/kv.run

commit 93edec51ef1714c95b56699ad0b284f6c0b7a916
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 14:16:18 2024 -0700

    dependency and rust toolchain fix

commit 85f34cb1147265e3d13080d032c92b7d81d09895
Merge: de58365 e263ba8
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Fri Jun 14 15:16:44 2024 -0400

    Merge pull request #36 from mlsys-io/fix_warm

    Fix the warm-up issue

commit de5836558a56c3541ec9be3b1d41dde51d08969a
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 12:06:42 2024 -0700

    fix in workflow

commit 83fc271da0ef6c0580d5d8491605b582c2d730cc
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 11:32:29 2024 -0700

    build workflow update

commit 66d272347539741c6750841938123b5522abb144
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 09:10:00 2024 -0700

    docker workflow

commit e8f9ff4f2be08421219acc6d2b611e2c4ba87768
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 09:08:55 2024 -0700

    docker workflow

commit e49f754e1fb33af4b9bf33bcc08a6d23d4cacb56
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 00:04:32 2024 -0700

    remove tgi build workflow

commit a4802b7867e766e492cb1f99877f386148962c3a
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri Jun 14 00:01:15 2024 -0700

    docker build workflow; remove submodules (#35)

    * test docker
    * docker
    * remove submodule
    * updates

commit e263ba802023d45ee5b26df0d90f8401ee0f87aa
Author: Alfred Gui <alfredzqgui@gmail.com>
Date: Thu Jun 13 20:32:48 2024 -0400

    fix warm up issue

commit c7613eb887ac10ba8d38b00ab26b85ff395ecdc6
Author: Yao Lu <fdyaolu@gmail.com>
Date: Thu Jun 13 17:01:27 2024 -0700

    test docker (#34)

commit e61ea779f8dffacab0a161aa13135999d6ec3ee7
Author: Yao Lu <fdyaolu@gmail.com>
Date: Thu Jun 13 09:47:33 2024 -0700

    minor fixes and rename tests.xml

commit 8ae802cb8848df58fc9c0c279044f5b50309044e
Author: Yao Lu <fdyaolu@gmail.com>
Date: Tue Jun 11 14:09:50 2024 -0700

    fix dtype bugs in flashinfer model def

commit b821d68f4120951bbde7f57ca0ad9ba914d33354
Author: Yao Lu <fdyaolu@gmail.com>
Date: Tue Jun 11 11:30:51 2024 -0700

    bug fix in layers/__init__.py

commit b7c8735c77cb76446ba30efbb20f19067289fcab
Author: Yao Lu <fdyaolu@gmail.com>
Date: Tue Jun 11 10:33:50 2024 -0700

    minor typo

commit 6010fad087f477174766981acc162322e1d767da
Author: Yao Lu <fdyaolu@gmail.com>
Date: Tue Jun 11 10:30:45 2024 -0700

    critical output bug (#25)

    * output debug
    * update minor

commit b599cc65ecb8215cfcc8a9db6daa0d88450b9cc5
Author: Alfred Gui <zgui@flexport.com>
Date: Tue Jun 11 10:34:24 2024 -0400

    Decouple flashinfer code paths from flash attention library dependencies (#33)

    * decouple flash attn dependency from flashinfer code paths
    * follow up

commit e0cd4a67f7cffdc620baa5d1ae22a32a3be94d4e
Author: Alfred Gui <zgui@flexport.com>
Date: Tue Jun 11 09:47:06 2024 -0400

    reformat the llama files (#32)

commit 6c96fddcbbe4c16f97fe391ef3387702234f4f65
Author: Alfred Gui <zgui@flexport.com>
Date: Mon Jun 10 21:02:42 2024 -0400

    Llama rewrite (#31)

    * write llama in tgi style
    * fixes
    * fix the runtime issues

commit 9dd3b75af84cb0d3411bd43fc0414e4592193037
Author: Yao Lu <fdyaolu@gmail.com>
Date: Mon Jun 10 17:10:20 2024 -0700

    Kv.run test workflows (#30)

    * python 3.10
    * python 3.10.14
    * update doc
    * dispatch
    * update python workflow (repeated)

commit 9ec483dae3eb34f594511b649370af354d5d0923
Author: Yao Lu <fdyaolu@gmail.com>
Date: Mon Jun 10 15:15:35 2024 -0700

    kv.run test workflows (#29)

    * python 3.10
    * python 3.10.14
    * update doc
    * dispatch

commit 4757af8b6bb5b5548e17c5aeee767f5650607aed
Author: Yao Lu <fdyaolu@gmail.com>
Date: Mon Jun 10 14:53:52 2024 -0700

    kv.run test workflow

commit d58a35ed4694a18b1d3028b79cab9b3227ccdafc
Author: Yao Lu <fdyaolu@gmail.com>
Date: Mon Jun 10 11:41:13 2024 -0700

    Compliant for pre-commit configs

commit a8144374aa50e85016c19fa6f4a45c7f7c724d46
Author: Alfred Gui <zgui@flexport.com>
Date: Mon Jun 10 06:45:29 2024 -0400

    Introduce the flashinfer attention wrapper abstraction and use it for Llama and Gemma models (#28)

    * abstract the attention layer
    * fix the bugs

commit 3956e467fd043e8218462e475d71892784ad5907
Author: Alfred Gui <zgui@flexport.com>
Date: Sun Jun 9 06:36:01 2024 -0400

    Refactor the Flashinfer models (#27)

    * refactor the flashinfer models
    * fixes

commit 7dda533b23d548bff8c569370daff203699a6e60
Author: Alfred Gui <zgui@flexport.com>
Date: Sat Jun 8 08:40:55 2024 -0400

    Support Flashinfer based Phi2 and Phi3 models (#26)

    * add phi model
    * fix phi integration errors
    * padding for phi
    * fix modeling for phi
    * workarounds for phi
    * use flash attn's position rotary embedding
    * support phi3 and baichuan
    * fix position encoding
    * clean up

commit 482ef988e2c2ef59743aeaff01d79b72e0546baa
Author: NovTi <yx2432@nyu.edu>
Date: Wed Jun 5 22:04:14 2024 +0800

    Add qwen2 1.8b and 72b base inference

commit 5935ccedd980669c1366d70f20b5c3739184815f
Author: Yao Lu <fdyaolu@gmail.com>
Date: Tue Jun 4 21:30:52 2024 -0700

    add lora functions to python client; test llama-3-70b AWQ

commit 48b505376376f01e36b69bb0026f9a6af7e95676
Author: Yao Lu <fdyaolu@gmail.com>
Date: Tue Jun 4 13:28:18 2024 -0700

    testing llama-3-70b-gptq

commit 80d4a605347f60c6d12958a577182b27ec413def
Author: NovTi <yx2432@nyu.edu>
Date: Tue Jun 4 22:03:11 2024 +0800

    Fix minor typos

commit e6af233933f9709e7da606409151c0802520f6ef
Author: NovTi <yx2432@nyu.edu>
Date: Mon Jun 3 22:33:17 2024 +0800

    Integrate qwen2

commit 72d74cf82d1976457881318ae035b956fde3f220
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sun Jun 2 20:42:44 2024 -0700

    Update Makefile to include punica kernels

commit e7fb9b9dc6651aeb68e9e793d0d25381a14e12b5
Author: PeterYaoNYU <yy4108@nyu.edu>
Date: Mon Jun 3 10:51:16 2024 +0800

    integrate lora intommistral

commit 47f4685004ac7db295c46ec9a69f62a783fe07a6
Author: Alfred Gui <zgui@flexport.com>
Date: Sun Jun 2 08:34:24 2024 -0400

    add placeholder for flashinfer phi modeling (#24)

commit 40a70bcc369c6b61f486dc273ab0fd4330e21d58
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sat Jun 1 22:06:30 2024 -0700

    Update README.md

commit f125e73ade681ac4e60cd48488a59f2bab162f97
Merge: 79402fb 7243638
Author: Yao Lu <fdyaolu@gmail.com>
Date: Sat Jun 1 21:22:58 2024 -0700

    Merge pull request #23 from mlsys-io/reorder-codebase

    Reorder code base

commit 72436388e230d6778a6303fd656befa19632dbba
Author: rainj-me <rain-jiang@outlook.com>
Date: Sat Jun 1 19:10:39 2024 -0700

    fix the lora-id parameter in the benchmark

commit 650c743e1572b35c0c304edcba8afb3b8865935d
Merge: 79402fb 799a193
Author: rainj-me <rain-jiang@outlook.com>
Date: Sat Jun 1 18:58:38 2024 -0700

    directly merge from tgi

commit 799a193b109662743bed1b18a09af1fdcd508c8b
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Sat Jun 1 08:47:00 2024 +0000

    Fixing Phi3.

commit 79402fb10d115a1ebe19ad97dd1482bd03479c80
Author: Yao Lu <fdyaolu@gmail.com>
Date: Fri May 31 16:02:53 2024 -0700

    Rest API to download lora adapter on router

commit 08b3eac2ce54e25bec12088fd7e69ee3c07adaf5
Author: Nicholas Broad <nbroad94@gmail.com>
Date: Fri May 31 09:42:14 2024 -0700

    single char ` addition for docs (#1989)

    I think this will fix the docs from being weirdly formatted. All the sections after MAX_TOP_N_TOKENS don't show up in the bar on the right (https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtopntokens)

    Who can review? @merveenoyan

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit 5ab4cef67ef6326429a0e4e3d44b9710d9f26c53
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Fri May 31 18:01:43 2024 +0200

    Fixing exl2 scratch buffer. (#1990)

commit 06edde94910594eef86988934cbbc43d775eb965
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Fri May 31 17:57:01 2024 +0200

    Purely refactors paged/attention into `layers/attention` and make hardware differences more obvious with 1 file per hardware. (#1986)

commit 659bd67fec0a874e325fc2a2afd0c2ed2af692f0
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date: Fri May 31 07:03:24 2024 -0700

    Update documentation version to 2.0.4 (#1980)

    As per title

    cc @Narsil

commit 967ced2ff4565a5358d45a1372d32fbab113700b
Author: Daniël de Kok <me@danieldk.eu>
Date: Thu May 30 07:10:10 2024 +0000

    Gemma GPTQ checks: skip logprob checks

    This test fails somewhat regularly due to non-determinism and this test is primarily to verify that we are loading a model which doesn't have `float16` as the default dtype correctly.

commit 36dd16017c7211b7760d1daa188172bb902e486f
Author: Daniël de Kok <me@danieldk.eu>
Date: Tue May 28 09:51:31 2024 +0000

    Add support for exl2 quantization

    Mostly straightforward, changes to existing code:

    * Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict.
    * Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.

commit cbced7f0f9ca0b62216223859b82a2632d1c7a1f
Author: drbh <david.richard.holtz@gmail.com>
Date: Wed May 29 12:42:11 2024 -0400

    feat: adjust attn weight loading logic (#1975)

    This PR updates `load_attention` to prefer loading specific attention based on the model type. Additionally there were two cases where `TensorParallelColumnLinear.load_multi` was called and this reduces it to a single path

commit 612bc483b6f5029918039e684982fc1bfbe1b502
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Tue May 28 16:55:36 2024 +0200

    Fixing the text part from tokenizer endpoint. (#1967)

commit f20463e4e3a994fbcbc836cd315c14b766c72205
Author: Daniël de Kok <me@danieldk.eu>
Date: Tue May 28 07:25:14 2024 +0000

    Fix (non-container) pytest stdout buffering-related lock-up

    Two issues:

    1. When one of the stdout/stderr pipe buffers of a process started with `subprocess.Popen` is full, the process can get blocked until the buffer is drained.
    2. Calling `Popen.wait` can deadlock when called before draining the pipe buffers (if they are full).

    This avoids the issue altogether by giving the child process a temporary file to write to.

commit e76b9824ae965e95923dbcf50aa30efb633a1974
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Tue May 28 14:52:17 2024 +0200

    Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959)

    - Axum upgraded to hyper 1.0 and most of the ecosystem switched so it's our time now
    - [ngrok-rust](https://github.com/ngrok/ngrok-rust/pull/137/files) hasn't yet, and hasn't for several months now, so let's disabled the feature for the time being.

commit b7ffa287f228e065c45a99684e73b862a5166fac
Author: Moritz Laurer <41862082+MoritzLaurer@users.noreply.github.com>
Date: Mon May 27 17:31:06 2024 +0200

    fix small typo and broken link (#1958)

    Fix a typo; fix a broken link; add one sentence in the guidance docs to make the word "grammar" less abstract

    Who can review? @drbh

commit 0732b9d2f0fb9a4dd9753bdabe3ddb7d452c49cf
Author: drbh <david.richard.holtz@gmail.com>
Date: Mon May 27 10:03:16 2024 -0400

    Processor config chat template (#1954)

    This PR loads the `processor_config` similar to the `tokenizer_config` and uses the processor_config's chat_template if the tokenizer_config does not include one. These changes enable chat with idefics2

commit a401c83c355d3b66ad158f4798b58bb5c696caac
Author: Daniël de Kok <me@danieldk.eu>
Date: Mon May 27 14:41:28 2024 +0200

    Fix GPTQ for models which do not have float16 at the default dtype (simpler) (#1953)

    Fix GPTQ for models which do not have float16 at the default dtype

    Before this change GPTQ models would not work if the model's default data type is not `float16`. For example, Gemma GPTQ models would fail because the default dtype of Gemma is `bfloat16`. There are two issues:

    If the default `dtype` is not `float16`, the quantizer's `float16` parameters get converted to that dtype. The kernels cannot deal with non-`float16` types. The same applies to inputs of quantized ops.

    This is resolved by setting the dtype of gptq/awq-quantized models to `float16`.

    Simpler version of #1951.

    **Draft:** just testing...

commit 9231098f3a9b2f0fe7f6652f10f02f4d8f551143
Author: Daniël de Kok <me@danieldk.eu>
Date: Fri May 24 15:34:42 2024 +0000

    Fix (flash) Gemma prefix and enable tests

commit d32e33bd489f2419e579f5d423073791ee19f789
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Fri May 24 15:36:13 2024 +0200

    Fix seeded output. (#1949)

commit cff472ba2b9147015ffd005aace282481d489695
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Fri May 24 12:40:39 2024 +0200

    Fixing codellama loads by using purely `AutoTokenizer`. (#1947)

    - The need for the slow tokenizer default stems from back when llama 1 was introduced and all the flags where not supported in `tokenizers`.
    - Fixes #1891

commit 954653466d24a9b3435988136983398bdf788a2f
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Thu May 23 15:40:40 2024 +0200

    Improving the logging system. (#1938)

    - Added a debug log for speculated ids (helps seeing in logs quality of a speculator).
    - Remove newlines from child process logs when re-emitting in non JSON mode.
    - Made standard level be closer to what's expected (only our binaries level).
    - Propagate that level correctly to the shard (was forced into INFO).

commit 629047cb82d2ff97a8f0d0446ed7a3a68bed63a7
Author: Thomas Schillaci <thomas.schillaci@gmail.com>
Date: Thu May 23 15:37:09 2024 +0200

    Add completion route to client and add stop parameter where it's missing (#1869)

    - Add the stop parameter to the completion route
    - Add the completion method to the python client
    - Add the stop parameter to the python client's chat method

    Who can review? @Narsil

    Co-authored-by: Thomas SCHILLACI <tschilla@px101.prod.exalead.com>
    Co-authored-by: Thomas Schillaci <thomas.schillaci@3ds.com>

commit f4a073ae6d2cbcf6ee353b4e27ea90586893fe8b
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Thu May 23 14:39:38 2024 +0200

    Fixing some legacy behavior (big swapout of serverless on legacy stuff). (#1937)

    Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

commit f41d644a903d179915e122896aba6bc77821795a
Author: Wang, Yi <yi.a.wang@intel.com>
Date: Thu May 23 20:11:08 2024 +0800

    reenable xpu for tgi (#1939)

    Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

commit a103e3e9e2041add8bd83a8b5b35c497784b9722
Author: drbh <david.richard.holtz@gmail.com>
Date: Thu May 23 05:34:18 2024 -0400

    feat: add train medusa head tutorial (#1934)

    This PR adds a tutorial to self distill and train medusa heads for a specific model

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit efb73fcb598fbb93c6cae7d6667a58b373b0de96
Author: drbh <david.richard.holtz@gmail.com>
Date: Wed May 22 14:46:29 2024 -0400

    fix: use path inside of speculator config (#1935)

    This PR access the path on the speculator similar to `MLPSpeculatorHead.load` and `MedusaHeadV1.load`

    these changes resolves this error locally when loading a `MedusaHeadV2`

    ```
    TypeError: expected str, bytes or os.PathLike object, not dict
    ```

commit 2f243a1a150da40fc71cbdd08cd07e314cf7098e
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date: Wed May 22 16:22:57 2024 +0200

    Creating doc automatically for supported models. (#1929)
Feel free to tag members/contributors who may be interested in your PR.  commit fc0eaffc81fafcc0fb554692f32efbed1c4b2683 Author: drbh <david.richard.holtz@gmail.com> Date: Wed May 22 03:58:26 2024 -0400 feat: include token in client test like server tests (#1932) This PR simply includes the HF token in the client tests similar to how it's included in the server tests. This helps avoid CI failure due to rate limiting commit 904ff36917e100047669bd6168d7138045469bbe Author: Junlin Zhou <jameszhou2108@hotmail.com> Date: Wed May 22 01:12:14 2024 +0800 docs: Fix grafana dashboard url (#1925) # What does this PR do?   Fixes an incorrect url in monitoring doc. ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  commit 293b8125e7a6ebd3eff65b55699e9386d1c1abf5 Author: fxmarty <9808326+fxmarty@users.noreply.github.com> Date: Mon May 20 02:44:48 2024 +0200 ROCm: make CK FA2 default instead of Triton (#1924) As per title. Triton autotune overhead is prohibitive, as it needs to be done for each different prompt length. 
commit f871f114ca5f5a18a2a4a2c7658aed87440d381f Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Sat May 18 13:31:24 2024 +0200 Fixing the download strategy for ibm-fms (#1917) # What does this PR do?   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  commit 5dad0c0b29cf31271c01948653ac164649a3ac78 Author: fxmarty <9808326+fxmarty@users.noreply.github.com> Date: Fri May 17 19:50:52 2024 +0200 Fix TGI issues with ROCm (#1921) Not all models were tested in https://github.com/huggingface/text-generation-inference/pull/1764. 
Fixing some more issues (notably starcoder2) here, the full CI will come shortly once we split `build.yml` in two commit b5f1c9de06ad00bbdeec0348c47f53bee271cedc Author: fxmarty <9808326+fxmarty@users.noreply.github.com> Date: Fri May 17 18:21:51 2024 +0200 Fix TunableOp bug (#1920) cc @Narsil commit 422bf1f9866e99ef287d6280e8236d22173ee709 Author: fxmarty <9808326+fxmarty@users.noreply.github.com> Date: Fri May 17 17:37:23 2024 +0200 Update grafana template (#1918) As per title, there was a mistake credit to @Narsil updated https://huggingface.co/docs/text-generation-inference/basic_tutorials/monitoring as well Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> commit c4cf8b49d1ecce2353935c2497bd8c028cb25320 Author: fxmarty <9808326+fxmarty@users.noreply.github.com> Date: Fri May 17 16:34:44 2024 +0200 Add TGI monitoring guide through Grafana and Prometheus (#1908) As per title. It is very useful. commit 232e8d522713f43834d48ae45d1330b0e6dd367e Author: fxmarty <9808326+fxmarty@users.noreply.github.com> Date: Fri May 17 15:30:47 2024 +0200 MI300 compatibility (#1764) Adds support for AMD Instinct MI300 in TGI. Most changes are: * Support PyTorch TunableOp to pick the GEMM/GEMV kernels for decoding https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable. TunableOp is disabled by default, and can be enabled with `PYTORCH_TUNABLEOP_ENABLED=1`. * Update ROCm dockerfile to PyTorch 2.3 (actually patched with changes from https://github.com/pytorch/pytorch/pull/124362) * Support SILU & Linear custom kernels contributed by AMD * Update vLLM paged attention to https://github.com/fxmarty/rocm-vllm/, branching out of a much more recent commit https://github.com/ROCm/vllm/commit/3489ce7936c5de588916ae3047c44c23c0b0c308 * Support FA2 Triton kernel as recommended by AMD. Can be used by specifying `ROCM_USE_FLASH_ATTN_V2_TRITON=1`. * Update dockerfile to ROCm 6.1 By default, TunableOp tuning results are saved in `/data` (e.g. 
`/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) in order to avoid to have to rerun the tuning at each `docker run`. Example: ``` Validator,PT_VERSION,2.3.0 Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c Validator,HIPBLASLT_VERSION,0.7.0-1549b021 Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack- Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098 GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431 GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546 GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119 GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645 GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971 GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694 GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522 GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671 GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834 GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622 GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122 GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191 GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514 GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914 GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516 GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953 GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043 GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497 GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895 GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716 GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731 GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816 GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701 GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159 
GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524 GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074 GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045 GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582 GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705 GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489 ``` --------- Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com> commit a60fa8406abd98d41e2bfafaf6f81f3dd6044b15 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Fri May 17 11:35:49 2024 +0200 Removing some unused code. (#1915) # What does this PR do?   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  commit 3b5d93e68d22f5db7950175b5210ce6390df8172 Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Thu May 16 21:40:10 2024 +0200 Fixing signals. (#1910) Taking the signal handles later, so during loads, regular signal handling is done, we only need to handle SIGINT and SIGTERM during real loads to get more graceful shutdowns when queries are in flight. 
Fixes #1842 # What does this PR do?   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  commit b3dd3902e76df777d28ee76993800f4baf73c40c Author: Nicolas Patry <patry.nicolas@protonmail.com> Date: Thu May 16 17:21:00 2024 +0200 Types. (#1909) # What does this PR do?   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting d…
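
For readers who want to inspect a TunableOp results file like the one quoted in the MI300 compatibility commit message above, the following is a minimal Python sketch. It assumes only what the quoted example shows: `Validator` rows carry a name and a value, and every other row has four comma-separated fields. The function name `parse_tunableop_results`, the dictionary keys, and the reading of the fourth field as a measured duration are illustrative assumptions, not part of TGI or PyTorch.

```python
import csv
import io
from collections import Counter

# A few rows copied from the example in the commit message above.
# A real results file would be read from disk instead.
SAMPLE = """\
Validator,PT_VERSION,2.3.0
Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
"""


def parse_tunableop_results(text):
    """Split a results file into validator metadata and tuned-kernel rows.

    Hypothetical helper: the field names below are labels chosen for this
    sketch, and "duration" is an assumed interpretation of the last column.
    """
    validators = {}
    kernels = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if row[0] == "Validator":
            # Validator rows: Validator,<name>,<value>
            validators[row[1]] = row[2]
        else:
            # Kernel rows: <op>,<shape signature>,<chosen solution>,<number>
            op, shape, solution, duration = row
            kernels.append(
                {
                    "op": op,
                    "shape": shape,
                    "solution": solution,
                    "duration": float(duration),
                }
            )
    return validators, kernels


validators, kernels = parse_tunableop_results(SAMPLE)

# Count how often each solution was selected across the tuned shapes.
solution_counts = Counter(k["solution"] for k in kernels)
print(validators)
print(solution_counts)
```

To use it on a saved file, read the file's text and pass it to the function, for example `parse_tunableop_results(open(path).read())`, where `path` points at a CSV under `/data`.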

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

alfredgui2 added 6 commits June 27, 2024 02:26

refactor flashinfer causal lm

04b1959

modify test_local_api

2ed3b13

fixes

ef06622

merge master

cdfc48b

fixes

78ec9a6

lint

6c620a1

alfredgui2 merged commit 466b0a6 into master Jul 2, 2024
1 of 3 checks passed

alfredgui2 added a commit that referenced this pull request Jul 6, 2024

Add the batch concatenation functionality for flashinfer server (#43)

7ddcac2

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 7, 2024

Add the batch concatenation functionality for flashinfer server (#43)

6283868

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 7, 2024

Add the batch concatenation functionality for flashinfer server (#43)

68f0cab

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 7, 2024

Add the batch concatenation functionality for flashinfer server (#43)

e3b765a

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 7, 2024

Add the batch concatenation functionality for flashinfer server (#43)

a457b6f

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 7, 2024

Add the batch concatenation functionality for flashinfer server (#43)

5b2721f

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 7, 2024

Add the batch concatenation functionality for flashinfer server (#43)

23b36a2

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

alfredgui2 added a commit that referenced this pull request Jul 7, 2024

Add the batch concatenation functionality for flashinfer server (#43)

518be9e

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 8, 2024

Add the batch concatenation functionality for flashinfer server (#43)

01554a5

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 8, 2024

Add the batch concatenation functionality for flashinfer server (#43)

4849f0b

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 8, 2024

Add the batch concatenation functionality for flashinfer server (#43)

c0212c8

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 8, 2024

Add the batch concatenation functionality for flashinfer server (#43)

7a39fca

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

alfredgui2 added a commit that referenced this pull request Jul 8, 2024

Add the batch concatenation functionality for flashinfer server (#43)

9b0d231

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 8, 2024

Add the batch concatenation functionality for flashinfer server (#43)

feabd0b

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 8, 2024

Add the batch concatenation functionality for flashinfer server (#43)

778877d

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 8, 2024

Add the batch concatenation functionality for flashinfer server (#43)

a176915

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

tjluyao pushed a commit that referenced this pull request Jul 9, 2024

Add the batch concatenation functionality for flashinfer server (#43)

4440030

* refactor flashinfer causal lm * modify test_local_api * fixes * fixes * lint

alfredgui2 deleted the test_batch_switch branch July 11, 2024 17:17
