Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding #529

iantbutler01 · 2023-07-03T04:12:48Z

What does this PR do?

Implements NTK-Aware scaled and dynamically scaled RoPE for the PositionRotaryEmbedding to allow models to scale beyond their default max_tokens.

https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/

Fixes #512

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

iantbutler01 · 2023-07-03T04:15:04Z

I've tested fixed NTK-Aware scaling on a project I'm working on and was successfully generating at 2400 tokens which is about the limit my RTX 6000 Ada can handle from VRAM with falcon 40BN instruct, but it was entirely coherent generation above the original 2048 token context.

I still need to test dynamic scaling and clean up the PR further to comply with guidelines and the checklist, but wanted to open this up in the meantime.

ssmi153 · 2023-07-14T05:47:53Z

Just a note that Huggingface Transformers natively supports this now: huggingface/transformers@34d9409. Does this make it easier to implement here?

iantbutler01 · 2023-07-17T01:37:00Z

@ssmi153 Not particularly, most of the attention modules in this repo are custom to support flash attention. The work in transformers is good to review for my implementation and that's about it from what I see.

Narsil

Thanks for the PR.

We need only 1 new CLI argument. There's already a LOT of arguments, so let's try to keep them to a bare minimum for new features.
Overall, can we remove a lot of the complexity ?
From what I read, dynamic scaling seems just better than static scaling, so let's just use dynamic scaling, no ?
The current code has a lot of pathways, can we keep them to a minimum ?
Keep the code as close to the original as possible
Nothing should be directly in custom_modeling file. This behavior it seems should be entirely agnostic of modeling code.

This can go in the config for instance (like quantize) and be in models/flash_llama.py for instance (this is not modeling code, but wrapping the model itself, this will probably be factored away at some point, but here would be a good place for now).

I'm happy to make those changes if you want, as they are mostly stylistic choices rather than business logic.

Narsil · 2023-07-18T09:11:20Z

router/src/main.rs

+    #[clap(default_value = "2048", long, env)]
    max_batch_prefill_tokens: u32,
-    #[clap(default_value = "16000", long, env)]
+    #[clap(default_value = "8192", long, env)]


This does not belong in this PR.

We can discuss changing the defaults, but it's a separate concerns.

Oh yup fair, I don't want to change them I meant to clean this out. I'll remove!

Narsil · 2023-07-18T09:11:57Z

server/text_generation_server/models/custom_modeling/flash_llama_modeling.py

+if os.getenv("ROPE_DYNAMIC_SCALING", False).lower() == "true":
+    ROPE_DYNAMIC_SCALING = True
+else:
+    ROPE_DYNAMIC_SCALING = False


Nothing should be model specific.

Narsil · 2023-07-18T10:01:23Z

server/text_generation_server/utils/layers.py

@@ -369,7 +369,7 @@ def forward(self, hidden_states, residual=None):
    import rotary_emb

    class PositionRotaryEmbedding(nn.Module):
-        def __init__(self, inv_freq):
+        def __init__(self, inv_freq, scale_factor=1, dynamic_scaling=False, max_seq_len=2048, dim=None, base=None):


Can we have at most 1 extra argument.

A lot of information should be extractable directly from inv_freq.

Yup, I can try to simplify this

Narsil · 2023-07-18T10:01:50Z

server/text_generation_server/utils/layers.py

+            if self.dynamic_scaling:
+                scale_factor = (self.scale_factor * length / self.original_max_seq_len) - (self.scale_factor - 1)
+                max_seq_len = self.original_max_seq_len * scale_factor
+                self.inv_freq = self._get_inv_freq(self.dim, self.base, inv_freq.device, scale_factor)


This is not really OK I think.

You ditching entirely the original self.inv_freq which unfortunately for us is sometimes different from the calculation proposed (that's why not all models are static and some are load.

Llama most notably has different saved inv_freq (not sure why but it's indeed the case).

Part of dynamic scaling is calculating the new inv_freq, looking at the dynamic scaling implementation in Transformers I don't see them preserving this value either.

What would you suggest alternatively?

I was thinking interpolation when I wrote this.

Now that I reflect more it would make the code even more complex, which is not the desired effect.

Can we maybe move out the scaling factor out of get_inv_freq and keep it directly here (since it just seems to be rescaling the base)

And so let's keep rewriting inv_freq. It has some indesirable effects on those models, but the other way is even worse.

That sounds reasonable, I'll make this change after work.

iantbutler01 · 2023-07-18T11:47:10Z

@Narsil thanks for the review!

So my only reason for suggesting we keep static scaling is that it's much easier to consider VRAM usage of a statically scaled context window. If you have a model with 2048 context default, and scale it by 4 to a max of 8192 you can much more easily consider the VRAM consumption of that max.

Otherwise I agree from what I've read as well dynamic is better performing.

That said it's like you mentioned, having both adds complexity! If you still feel that's not enough of a reason to keep static I'll remove it 👍

M-Chris · 2023-07-19T03:36:32Z

Would love to see this exposed now that the huggingface/transformers#24653 merge is complete.

I understand there are various complexities with flash attention, and if flash attention 2 will be implemented..
but maybe this can still follow the same implementation of transformers merge to reduce confusion and use the "rope_scaling={"type": "dynamic", "factor": 2.0}," args? or at least a variation there of?

--rope_scaling {"type": "dynamic", "factor": 2.0} or however seen fit by you guys ?

🙌

gante · 2023-07-19T10:54:26Z

@iantbutler01 @Narsil some information for decision-making 🤗

Overall, can we remove a lot of the complexity ?
From what I read, dynamic scaling seems just better than static scaling, so let's just use dynamic scaling, no ?

The current state of scaling techniques in transformers:

dynamic scaling is the best for scaling without fine-tuning
linear scaling is the best for scaling with fine-tuning

What's out there that I'm adding next in transformers:

The folks behind dynamic scaling have created a technique that is the best in both regimes, before and after fine-tuning (see this comment)

So... perhaps we can jump straight into the best technique in TGI? :D It should only need one flag in practice, the scale

iantbutler01 · 2023-07-20T02:25:30Z

@gante That sounds good to me! I'll work on this over the next few days.

M-Chris · 2023-07-21T23:12:05Z

FWIW I tested against Llama via a simple overwrite on the Docker image on the latest TGI 0.9.3, using a flash attn v2 compatible GPU and works good 😄

I pulled against this pr branch and layered it into the overwrite, with a few adjustments and assumptions.

Assumed SCALE is the only ENV
Assumed dynamic as the only option since this is only inference.
Applied changes to FlashLlamaAttention since rotary_emb has moved since the original pr

Couple thoughts//concerns..

The user will be required to set their --max-total-tokens,--max-input-length & --max-batch-prefill-tokens maybe something you have already a plan to address iantbutler01? also maybe its fine if a user manipulates these as seen fit.
Not sure if there is a specific reason to be holding back the current version of transformers? (deployed that the docker as a test, doesn't seem to have any affect from testing )

Here is the docker used to quickly overwrite for testing if its helpful:

FROM ghcr.io/huggingface/text-generation-inference:latest

WORKDIR /usr/src

RUN apt-get update -y

RUN pip install transformers==4.31.0
RUN pip install scipy
RUN pip install bitsandbytes==0.40.2
RUN pip install --upgrade accelerate

COPY ./app_overwrites/server/text_generation_server/. /opt/conda/lib/python3.9/site-packages/text_generation_server

Id be happy to share anything else if needed 🍻

iantbutler01 · 2023-07-24T23:30:31Z

@gante Looking at @jquesnelle's repo, and the comment you linked to it looks like there is actually both a standard by parts as well as a dynamic by parts method. So it looks like the improvement you were talking about applies to both types of NTK aware scaling? In that case I'm inclined to make this PR just the dynamic by parts method to save on some of the complexity.

iantbutler01 · 2023-07-25T02:21:41Z

@Narsil @gante I spent some time tonight working to implement the dynamic parts by method I mentioned in my last comment. I'm coming to realize, that with this new method and the comment here: #512 (comment) suggesting there are now models that have been fine tuned with scaling that the complexity here has the chance to really be ballooning.

Even just the parts by method itself is more gnarly and requires supporting a whole bunch of parameters on the attention module.

Before I continue, at the risk of the complexity putting this in review hell, I'd like some guidance on what you all think I should proceed with. Personally if I add the dynamic parts by method linked in my previous comment it will have effectively set up the ability to implement the other methods here anyway, but maybe a follow up PR for those makes sense.

gante · 2023-07-25T12:13:32Z

@iantbutler01 you raised good points: as users fine-tune their models with rope scaling, they may lose compatibility with TGI (depending on how we decide to do things here). And yes, let's settle on a path that avoids review hell!

I'd suggest separating the two use cases and making two separate decisions/PRs:

loading models with rope_scaling, such as the one linked here.
- Would NOT be in this PR
- Should consist in reading model.config and loading the right RoPE class.
- Support for each scaling strategy could be added progressively, depending on demand. For instance, all current relevant cases of fine-tuned RoPE scaling rely on linear scaling, so there is only 1 small modification needed in practice.
enabling dynamic RoPE scaling
- Would be in this PR
- Triggered by (e.g.) a simple env var
- Implementing the NTK-by-parts, since it is the best-performing dynamic method so far

@Narsil @iantbutler01 WDYT?

iantbutler01 · 2023-07-25T19:07:46Z

@gante I am fine with that approach, that's basically what I started last night but I wanted to make sure that's what everyone had in mind.

iantbutler01 · 2023-07-28T23:17:56Z

Given the license change I am no longer comfortable contributing my work.

OliverFM · 2023-07-31T09:17:04Z

@iantbutler01 would you be willing to contribute this change to a fork?
I was planning on adding support for Speculative Decoding, see the issue #729, but will no longer do that with the license changes.

I am strongly considering maintaining a fork of the repo from the commit before the license change. I would be adding support for speculative decoding there.

@OlivierDehaene

# What does this PR do? - Adds Rope NTK scaling. Done because #529 was closed Took some code from huggingface/transformers#24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something line ("linear:4.0" , or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@OlivierDehaene

# What does this PR do? - Adds Rope NTK scaling. Done because huggingface/text-generation-inference#529 was closed Took some code from huggingface/transformers#24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something line ("linear:4.0" , or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@OlivierDehaene

# What does this PR do? - Adds Rope NTK scaling. Done because huggingface/text-generation-inference#529 was closed Took some code from huggingface/transformers#24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something line ("linear:4.0" , or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@OlivierDehaene

# What does this PR do? - Adds Rope NTK scaling. Done because huggingface/text-generation-inference#529 was closed Took some code from huggingface/transformers#24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something line ("linear:4.0" , or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Init fix: cleanup Add load testing Refactored gRPC interface Added validation logic ValidationError was not correctly handled Use axum feat: Docker image feat: Add AML deployment Update aml deployment feat: Improve error handling feat: Add arguments to CLI v0.1.0 fix(validation): Fix error messages feat(router): Add max_waiting_tokens Create LICENSE (#2) feat(server): Use safetensors Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(client): Simplify sharded logic feat(server): Support bitsandbytes feat(server): Support all AutoModelForCausalLM on a best effort basis feat: Use json formatter by default in docker image fix(models): Revert buggy support for AutoModel feat(server): Support generic AutoModelForCausalLM feat(server): Support AutoModelForSeq2SeqLM feat(launcher): Pass CUDA_VISIBLE_DEVICES to the shard feat(server): Improved doc fix(server): Fix Transformers fork version feat(server): Clarify CausalLMBatch concatenate method feat(rust): Update to 1.65 fix(router): Fix HTTP status codes fix(readme): Typo fix(router): Handle tokenizer errors feat(server): Support Galactica (#4) fix(batching): Avoid theoretical hang in batcher loop (#5) - Avoid theoretical hang in batcher loop - Avoid a couple of clones in the router generate method - Keep attention mask tensors as integers - Remove num_heads attribute Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com> feat(server): Add model tests (#6) fix(server): Only pad to multiple of 8 on GPUs feat: Support stop sequences (#7) feat: Return logprobs (#8) feat(launcher): Add integration tests (#9) fix(server): Fix stop sequences (#11) fix(server): Check for device type correctly when determining initial padding (#16) AFAIK there is no torch device type called "gpu". fix(router): Include special tokens when tokenizing (#14) There's currently a discrepancy in the tokenization between the router and python server code. The latter includes special tokens but former does not. This results in a token count mismatch for seq2seq models such as mt0 where the tokenizer emits an EOS token at the end. This in turn results in some unexpected/incorrect output, in particular when batch concatenation is involved, because the python code uses the input length passed from the router for each row. As far as I can tell, it is better to include this token in the encoder `input_ids`, so I guess it's best to just adjust on the router side. feat(router): Add const parameters to validation logic (#15) I noticed some opportunity to collapse some of the logic, in case you are interested. fix(server): Use cleanup_tokenization_spaces=False for lossless decoding (#13) Fixes #12 in the easiest way I could think of. feat(launcher): Log server stdout (#19) Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): Minor refactorization using new_zeros (#24) - Fix some type hints, in particular base tokenizer class - Make use of `tensor.new_zero/empty` methods - Simplify env var string parsing in launcher fix(router): Obey max batch size (#23) feat(server): Support SantaCoder (#26) fix(server): Fix position ids (#28) feat(docker): Make the image compatible with api-inference (#29) fix(docker): fix api-inference deployment (#30) fix(router): fix api-inference deployment (#31) fix(dockerfile): fix docker build (#32) feat(bloom): use torch.nn.Linear and torch.nn.GELU (#33) feat(router): Remove second lock from batcher hot path (#27) @njhill feat: Support sampling seeding (#37) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com> feat: Add token streaming using ServerSideEvents support (#36) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is: ```rust struct Details { finish_reason: String, generated_tokens: u32, seed: Option<u64>, } struct StreamResponse { token: Token, generated_text: Option<String>, details: Option<Details>, } struct ErrorResponse { error: String, } ``` Revert "feat: Add token streaming using ServerSideEvents support" (#40) Reverts huggingface/text-generation-inference#36 fix(server): fix seeding on gpu (#42) fix(server): fix seeding with multiple shards (#44) feat: Add token streaming using ServerSideEvents support (#41) fix(server): fix quantization for sharded models (#45) feat(server): Support GPT-Neox (#39) feat(ci): Docker build and push (#46) feat(server): allow gpt-neox models with odd vocab sizes to be sharded (#48) feat(server): support repetition penalty (#47) feat(server): allow the server to use a local weight cache (#49) fix(server): allow greedy repetition penalty (#51) feat(router): use background task to manage request queue (#52) Co-authored-by: Nick Hill <nickhill@us.ibm.com> breaking(router): modify /generate API to only return generated text (#50) @njhill, @yk FYI generated_text was concatenated to the user prompt for legacy reason. We want to remove this behaviour as we don't think it is useful and even detrimonial to usability. We also remove the unused Vec. feat(router): refactor API and add openAPI schemas (#53) feat(docs): Clarify installation steps (#54) Adds some bits for first-time users (like me 😄 ) feat(ci): push to AML registry (#56) fix(server): better handling of inference mode (#57) V0.2.1 (#58) feat(server): support t5 (#59) fix(docker): increase shm size (#60) fixed SSE naming (#61) https://en.wikipedia.org/wiki/Server-sent_events feat: add distributed tracing (#62) feat: add safetensors conversion (#63) feat(server): improve download logging (#66) feat(launcher): add disable_custom_kernels arg (#67) feat(router): add max_total_tokens and empty_input validation (#68) closes #65 fix(launcher): copy current env vars to subprocesses (#70) closes #69 feat(router): add prometheus metrics scrape endpoint (#71) v0.3.0 (#72) feat(router): add cors allow origin options (#73) feat(server): enable hf-transfer (#76) fix(server): remove position_ids from galactica forward (#82) closes #80 feat(server): pre-allocate max attention mask (#75) v0.3.1 (#84) feat(server): add special token bool (#85) fix(docs): fix openapi schema (#86) fix(server): fix token_is_special (#87) feat(router): add legacy route for api-inference support (#88) feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89) feat(router): add api-inference headers (#91) feat(server): add logits watermark (#90) feat(server): update to hf_transfer==0.1.2 (#93) feat(ci): improve CI speed (#94) fix(launcher): add router parameters to launcher (#95) feat(server): fix transformers commit (#96) v0.3.2 (#97) fix(server): fix generate_stream by forcing tokens to be decoded correctly (#100) feat: allow local models (#101) closes #99 feat: add supported models (#102) feat(clients): Python client (#103) fix(server): fix galactica batch (#106) closes #105 feat(launcher): allow parsing num_shard from CUDA_VISIBLE_DEVICES (#107) feat(launcher): default num_shard to CUDA_VISIBLE_DEVICES if possible (#108) fix(python-client): stream not set on the sync client (#109) fix(server): fix index out of range for watermarking (#110) feat: support typical sampling (#114) closes #112 fix(server): do not warp prefill logits (#116) feat(router): support left truncation (#115) closes #111 feat(router): add best_of parameter (#117) feat(python-client): add new parameters (#118) v0.4.0 (#119) feat: add OpenAssistant/oasst-sft-1-pythia-12b to the list of supported models (#122) …ed models fix(server): revert gpt-neox optims (#123) fix(server): add position ids to neox (#126) fix(server): use server tokenizer as gt (#128) fix(python-client): relax dependencies (#129) feat(python-client): add cookies to Client constructors and requests (#132) I have a use case where we need to pass cookies (for auth reasons) to an internally hosted server. Note: I couldn't get the client tests to pass - do you need to have an HF token? ```python FAILED tests/test_client.py::test_generate - text_generation.errors.BadRequestError: Authorization header is correct, but the token seems invalid ``` feat(ci): add ci paths (#134) feat: Add note about NVIDIA drivers (#64) Co-authored-by: OlivierDehaene <olivier@huggingface.co> feat(python-client): release v0.4.0 (#135) feat(python-client): add CI (#136) feat(server): flash neoX (#133) fix(server): fix flash-neox scores warping (#137) feat(server): cleanup flash neox loading (#139) v0.4.1 (#140) fix(server): Avoid using try/except to determine kind of AutoModel (#142) feat(server): Add mypy-protobuf (#141) Generates .pyi files for protobuf stubs which provide strong typing information. Very helpful for IDE auto-completion, etc. feat(server): clear cache on error (#143) feat(server): reduce mlp and attn in one op for flash neox (#145) feat: aws sagemaker compatible image (#147) The only difference is that now it pushes to registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:... instead of registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-... --------- Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com> fix(ci): fix sagemaker action (#148) feat(benchmark): tui based benchmarking tool (#149) fix(server): fix flash neox rotary embeddings (#150) v0.4.2 (#151) v0.4.3 (#152) feat(server): flash santacoder (#153) docs(readme): provide link Logits Warper README (#154) fix(server): fix escape characters in stop sequence (#155) feat(docker): improve flash_attention caching (#160) feat(launcher): allow disabling hf_transfer (#161) fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162) fix(router): use buckets for metrics histograms (#163) feat(router): make router input validation optional (#164) feat(server): add flash attention llama (#144) feat(server): support OPT models (#55) OPT models do not all have a `tokenizer.json` file on the hub at the moment. Can't merge for now. v0.5.0 (#168) feat(server): optimize decode for sane tokenizers (#170) feat(server): support sharded santacoder (#167) fix(launcher): revert change on shard errors (#173) fix(ci): fix CVE in github-slug-action (#174) feat(ci): add image signing with cosign (#175) feat(ci): add Trivy and scan docker image (#178) feat(ci): use large runners (#179) feat(ci): faster scanning (#180) fix(ci): fix ci permissions (#181) fea(dockerfile): better layer caching (#159) fix(ci): fix cosign error (#183) fix(docker): fix docker image (#184) fix(docker): fix image (#185) fix(docker): revert dockerfile changes (#186) fix(docker): fix docker image dependencies (#187) fix(router): fix truncation (#190) closes #189 feat(python-client): get list of currently deployed tgi models using the inference API (#191) feat(router): add info route (#196) close #125 feat(server): support quantization for flash models (#200) closes #197 feat(server): check cuda capability when importing flash models (#201) close #198 fix(server): fix hf_transfer issue with private repos (#203) fix(docker): remove unused dependencies (#205) fix(router): add auth token to get model info (#207) feat(router): add git sha to info route (#208) feat(router): drop requests when client closes the channel (#202) fix(ci): fix sha in docker image (#212) feat(server): flash attention past key value optimizations (#213) feat(router): add device and dtype info (#215) fix(server): fix past key values logic (#216) @njhill fyi fix(server): cleanup new flash past_key_values logic (#217) fix(server): fix flash causal (#218) fix(server): fix flash causal (#219) fix(server): fix flash batch filtering (#220) misc: update to rust 1.69 (#221) v0.6.0 (#222) feat(server): reduce memory requirement (#214) chore(server): update huggingface-hub (#227) feat(router): use number of tokens in batch as input for dynamic batching (#226) Co-authored-by: Nick Hill <nickhill@us.ibm.com> feat(router): add endpoint info to /info route (#228) chore(server): update safetensors version (#235) fix(python-client): add auth headers to is supported requests (#234) Starting some routing tests. (#233) fix(benchmarking): fix benchmarking tool chore(launcher): refactor logic (#242) Hopefully it's cleaner feat(router): add tests to validation (#237) feat(router): new healthcheck that skips the queue (#244) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): fix reshaping of bloom past_key_values in concatenate() (#252) Introduced in #214 Fixes #249 fix(server): Small tidy of code from recent changes (#251) remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter() chore(server): update transformers (#250) feat(server): add watermarking tests (#248) feat(docker): add nvidia env vars (#255) doc(launcher): add more docs to the `launcher` itself and link in the README (#257) feat(benchmark): add support for private tokenizers (#262) Adding docs on how dynamic batching works. (#258) This PR starts the minimal possible amount of explanation I could think of. It tries to explain how dynamic batching occurs, the interactions with past key values and ignores the padding problem. Maybe some drawings could help too but I kept it to text for now. chore(github): add templates (#264) fix(server): fix typo in tokenizers decode (#269) closes #268 feat(server): support hf endpoint weight layout (#266) fix(launcher): pass weights cache override to the download process (#274) closes #273 fix(launcher): handle hub branches (#278) fix(server): Removes the parallelism in file convertion (during download) (#275) feat(launcher): Improve error message when download process fails. (#276) fix(server): fix convert (#284) chore: add `flash-attention` to docker ignore (#287) included when building docker locally. (Where the local dirs might have the flash-attention folder.)   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  fea(server): decrease convert RAM requirements (#286) fix(dockerfile): fix nvidia env vars (#297) Fixes #291 feat(router): Adding response schema for compat_generate (#292) feat(docker): add benchmarking tool to docker image (#298) fix(docker): fix docker build (#299) feat(server): optim flash causal lm decode_token (#285) fix(docker): fix nvidia env vars (#305) fix(docker): remove nvidia require cuda env (#310) feat(server): shard token decode (#303) feat(server): use float16 (#304) fix(docker): remove CUDA_VERSION feat(server): use cuda graph in logits warping (#302) fix(server): fix multinomial implem in Sampling feat(server): GPTQ quantization (step1) (#277) Changes only the type from `bool` to `Option<Enum>` pretty much everywhere. - Use `Optional[str]` in Python (easier to manage than importing type everywhere). Except for the cli to get proper validation - Updated all models to handle gracefully new values. (Error out if unknown value, or gptq since not implemented).   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  chore(docker): use nvidia base image (#318) fix(docker): remove quantize default fix(docker): use ubuntu20.04 Hotfixes for santacoder/bigcode. (#294) Hotfixes: - Uses `model_type`=`gpt_bigcode` for more general usage. - Hotfixes linked lm_head vs wte_embedding (safetensors file do not contain the key, correctly when the file is sharded, where as pytorch copies the tensor)   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Lifting check_unitialized. (#325) Lifting check_unitialized.   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  Removing dead variables. (#327)   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  feat(ci): custom gpu runners (#328) Single place for TP layers + Dropout Layer Norm + FastLinear (#329)   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  feat: add snapshot testing (#282) feat(integration-tests): improve comparison and health checks (#336) fix(server): fix decode token (#334) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> fix: set MODEL_ID in sagemaker-entrypoint script (#343) feat(server): Support BLOOMChat-176B (#348) (#351) @njhill, temporary workaround to be able to run our CI as secrets are not available to runners run by external contributors. I will ask around to see if there is a better way. Co-authored-by: Nick Hill <nickhill@us.ibm.com> fix(server): fix init for flash causal lm (#352) Fixes #347 fix(server): t5 cannot run in f16 (#356) Fix #349 fix(ci): fix security group (#359) Switch security group used for ci (open outbound rules) Signed-off-by: Raphael <oOraph@users.noreply.github.com> Co-authored-by: Raphael <oOraph@users.noreply.github.com> feat: add nightly load testing (#358) chore(sever): update requirements (#357) Fixes #338 feat(server): support fp16 for t5 (#360) Fixes #349 feat(server): do not use device_map auto on single GPU (#362) feat(server): support trust_remote_code (#363) feat(router): log input/ouput at debug level (#364) @njhill FYI v0.7.0 (#353) feat: decrease IPC proto size (#367) Closes #307 #308 feat(benchmarker): add summary tables (#368) feat(server): support vectorized warpers in flash causal lm (#317) Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com> Fix issue when load AutoModelForSeq2SeqLM model (#370) fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES fix(server): fix quantization feat(server): support RefinedWeb models (#379) v0.8.0 increase health checks feat(server): add retry on download (#384) fix(server): fix bnb quantization for CausalLM models (#385) v0.8.1 fix(server): fix has_position_ids (#395) Fix #389 feat(server): remove trust_remote_code requirement for falcon models (#396) feat(server): load santacoder/starcoder models with safetensors (#393) Fix #366 v0.8.2 feat(sagemaker): add trust remote code to entrypoint (#394) feat(launcher): parse oom signal (#404) feat(server): only compute prefill logprobs when asked (#406) Close #288 feat(server): batch tokenization for flash causal lm (#411) chore: update openapi schema feat(server): Rework model loading (#344) Reworked the loading logic. Idea is to use cleaner loading code: - Remove need for `no_init_weights` - Remove all weird `bnb_linear` and `load_weights` and `post_load_weights`. New code layout: - New class `Weights` in charge of handling loading the weights from multiple files into appropiate tensors (potentially sharded) - TP layers now are "shells", they contain the code to know what kind of sharding we need + eventual `all_reduce`. They do not inherit from linear, but they contain some kind of Linear instead - the contained linear can be either FastLinear, BnbLinear or GPTq Linear next. - All modeling code is explictly made for sharding, process group is just no-ops for non sharded code (removes a lot of test cases) ![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f) --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net> Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> feat(server): optimize dist ops (#434) docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441) It solves a typo in the comment sections referencing the environment variable `CUDA_VISIBLE_DEVICES`. No misspelling references to this variable have been found in code logic leading to undefined behaviour or bugs. This PR is not expected to perform any code logic modification. fix(makefile): Fix typo and use POSIX comparison in the makefile (#443) This PR fixes: - The usage of non posix comparison which may fail depending on the shell used (`=` will always work, `==` only with bash) - Typo in the env variable name displayed in the error message `BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`  Fixes #422 feat(server): pre-allocate past key values for flash causal LM (#412) feat(router): add ngrok integration (#453) feat(server): improve flash attention import errors (#465) @lewtun, is this enough? Closes #458 Closes #456 fix(server): fix warpers on CPU (#472) Closes #471 fix(server): Fixing T5 in case the names are mixed up. (#475) feat(server): Update convert logic. (#483) Should be more robust to shared tensors (ok when using `from_pretrained). But forcing us to add new checks in our loading code (since the chosen key to keep might be different from `transformers`). --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> feat(server): Adding new ignore_rule for conversion. (#485) fix(router): add timeout on flume sends (#488) feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438) Let's start discussing implementation. - Need to expose the quantization scripts (either included here or add doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa) - Make sure GPTQ works for multiple models (priority to Falcon). Currently it means that every place we use `get_{tensor|sharded}` to check for quantization. My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic. This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` should be in a single place, and independant of bias presence.   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> fix(server): Do not init process group if already initialized (#388) feat(router): add header option to disable buffering for the generate_stream response (#498) generate_stream endpoint response stream. Problem: If a model is run behind a proxy server such as nginx that has buffering enabled then the response stream from generate_stream gets aggregated into a single response which basically disables streaming. Instead of getting a chunked response where each token is presented over time the response presents everything all at once. Solution: This change adds the `X-Accel-Buffering` http header which disables buffering for the generate_stream response, allowing the response to stream properly. feat(server): add paged attention to flash models (#516) Closes #478 feat(router): arg validation (#519) feat: Add the option to force another dtype than `f16`. (#513) fix(launcher): fix issue where launcher does not properly report shard failures (#522) v0.9.0 (#525) feat(server): Add Non flash MPT. (#514) This adds a non flash version of MPT. Flash is harder because we need to create a bias ready cuda kernel of flash attention. Fixes https://github.com/huggingface/text-generation-inference/issues/361 Fixes https://github.com/huggingface/text-generation-inference/issues/491 Fixes https://github.com/huggingface/text-generation-inference/issues/290 fix: Update server/Makefile to include Makefile-vllm (#520) For consistency and ease of use (you can just run `make` to install vllm without any extra steps).   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462) fix(server): Handle loading from local files for MPT (#534) This PR allows the MPT model to be loaded from local files. Without this change, an exception will be thrown by `hf_hub_download` function if `model_id` is a local path. fix(server): avoid errors for very small top_p values (#544) See https://github.com/huggingface/transformers/pull/24111 I didn't add validation to the `__init__` method since it's not done for other values/warpers. feat(server): use latest flash attention commit (#543) @njhill FYI feat(router): add argument for hostname in router (#545) (#550) In title. Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with ```commandline cargo run -- --port 8080 --hostname :: curl -I -X GET 'http://[::1]:8080/health' # failed before this commit ``` Trigger CI --------- Co-authored-by: Phil Chen <philchen2000@gmail.com> fix(server): decrease memory fragmentation (#557) v0.9.1 (#558) fix(server): harden the weights choice to save on disk. (#561) - Look at `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights` which are the standard attributes to select the keys to NOT save on disk (since they are ignored) - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with : https://github.com/huggingface/text-generation-inference/issues/555 and : https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593 feat: better errors for warmup and TP (#575) Close #571 fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579) Fixes #555 feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580) Some models are already converted, and do not have those values in the file, this enables users to use them with less friction. Went for pure env based because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitation to do: those flags would be errors if not used in conjuction with `--quantize gptq`. Then the flags need to exist in the launcher and the server passing them all throughout all function calls. This PR is intended as an easy escape hatch, not the defacto method to use gptq in TGI. Fixes #500 chore: migrate ci region for more availability. (#581) fix(server): T5 weights names. (#582) Fixes #541 fix(server): Adding logger import to t5_modeling.py (#585) Logger is referenced during the apex importing but is not imported, causing a NameError fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590) This fixes a typo and extends the GPTP_BITS environment variables through to the second method which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix. feat(server): Implements sharding for non divisible `vocab_size`. (#583) - The code is relatively easy (just disable the checks on Embedding and Head) This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization another quite hairy piece of code. feat(server): empty cache on errors GPTQ Env vars: catch correct type of error (#596) When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct. feat(launcher): add arg validation and drop subprocess (#595) feat(router): explicit warning if revision is not set (#608) docs: README: Add logo + baseline (#611) ![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984) fix(server): blacklist local files (#609) Close #589 #602 v0.9.2 (#616) fix(server): empty_cache when stopped fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621) fea(launcher): debug logs (#623) feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587) but should work on more configurations (no need for 2 GPUs, less RAM usage). Reworking the quantization script so it's still universal (not llama specific) but should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results.   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  feat(server): flash attention v2 (#624) feat(server): add support for llamav2 (#633) v0.9.3 (#634) fix(server): fix llamav2 config (#635) feat(server): auto max_batch_total_tokens for flash att models (#630) feat(router): ngrok edge (#642) docs: Update README.md (#639) docs: Update README.md (#643) Add trust_remote_code to quantize script (#647)   Fixes a bug appeared with MR #587 fixing issue #552. See the discussion in #552. With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, but is found in the function signature. This prevents models like falcon to be quantized, because trust_remote_code is required. This MR fixes the issue. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [X] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil  fix(server): llama v2 GPTQ (#648) As per title & reported https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5 Test it: ``` GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq ``` & ``` curl 127.0.0.1:8080/generate \ -X POST \ -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \ -H 'Content-Type: application/json' ``` fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661) fix(server): use mem_get_info to get kv cache size (#664) Close https://github.com/huggingface/text-generation-inference/issues/649 Close https://github.com/huggingface/text-generation-inference/issues/651 Close https://github.com/huggingface/text-generation-inference/issues/653 Close #636 feat(server): Add exllama GPTQ CUDA kernel support #553 (#666) Just trying to get the integration tests to pass.   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  --------- Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com> Directly load GPTBigCode to specified device (#618) This PR directly load GPTBigCode to specified device, avoiding moving model between devices. This PR directly load GPTBigCode to specified device, avoiding moving model between devices. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @OlivierDehaene OR @Narsil feat(server): add local prom and health routes if running w/ ngrok feat: add cuda memory fraction (#659) Close #673 fix(server): fix exllama buffers (#689) Close #683 feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671) - Current PR is not great because we're side stepping the `Weights.__init__` but Weights shouldn't requires anything related to the config or the model_id as it aims to be a simple Wrapper over multi file loading. - Ideal solution would be to use something like Rust enum ``` enum Quantize{ Bitandbytes(Bitsandbytes), GPTQ(bits: usize, groupsize: usize) ``` And passing that around during load. Unfortunately we don't have access to this, so for now, side-stepping seems easier. - Re-enabling groupsize<0 with exllama (confirmed it works.) Helps #601 In next steps we should make sure our quantization script uses that format and make it standard.   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  docs(README): update readme fix(server): fix quantization python requirements (#708) fix(server): fix missing datasets in quantize feat(server): support new falcon config (#712) v0.9.4 (#713) Add section about TGI on other AI hardware accelerators in README (#715)   As per title. - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  docs: Add hardware section to TOC in README (#721) feat(server): update vllm version (#723) chore: update license to HFOIL (#725) v1.0.0 (#727) Local gptq support. (#738) Redoes #719   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  Fix typing in `Model.generate_token` (#733) This PR fixes a minor type annotation issue in the signature of `Model.generate_token`. All existing overrides of `Model.generate_token` return `Tuple[List[Generation], Optional[B]]`: https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/causal_lm.py#L535-L537 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/flash_causal_lm.py#L802-L804 https://github.com/huggingface/text-generation-inference/blob/3ef5ffbc6400370ff2e1546550a6bad3ac61b079/server/text_generation_server/models/seq2seq_lm.py#L589-L591 I suspect that back in 017a2a8c when `GeneratedText` and `Generation` were separated, the function signature was not updated. - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? CC @OlivierDehaene Adding Rope scaling. (#741) - Adds Rope NTK scaling. Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed Took some code from https://github.com/huggingface/transformers/pull/24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something line ("linear:4.0" , or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512   Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  chore: fix typo in mpt_modeling.py (#737) Fixed typo.   implemetation -> implementation - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update…

iantbutler01 mentioned this pull request Jul 3, 2023

NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. #512

Closed

iantbutler01 added 2 commits July 17, 2023 01:16

Implement scaled and dynamically scaled RoPE

f01c11b

Update conditionals for dynamic scaling

0ec4d81

iantbutler01 force-pushed the fork/main branch from abc4e31 to 0ec4d81 Compare July 17, 2023 01:18

iantbutler01 changed the title ~~[Draft] Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding~~ Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding Jul 17, 2023

Fix env vars

b4ce728

Narsil reviewed Jul 18, 2023

View reviewed changes

arnocandel mentioned this pull request Jul 20, 2023

long context h2oai/h2ogpt#360

Open

iantbutler01 closed this Jul 28, 2023

iantbutler01 deleted the fork/main branch July 29, 2023 02:44

OlivierDehaene mentioned this pull request Jul 29, 2023

Text-Generation-Inference v1.0+ new license: HFOIL 1.0 #726

Closed

Narsil mentioned this pull request Jul 31, 2023

Adding Rope scaling. #741

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding #529

Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding #529

iantbutler01 commented Jul 3, 2023 •

edited

Loading

iantbutler01 commented Jul 3, 2023 •

edited

Loading

ssmi153 commented Jul 14, 2023

iantbutler01 commented Jul 17, 2023 •

edited

Loading

Narsil left a comment •

edited

Loading

Narsil Jul 18, 2023

iantbutler01 Jul 18, 2023

Narsil Jul 18, 2023

Narsil Jul 18, 2023

iantbutler01 Jul 18, 2023

Narsil Jul 18, 2023

iantbutler01 Jul 18, 2023 •

edited

Loading

iantbutler01 Jul 18, 2023

Narsil Jul 18, 2023

Narsil Jul 18, 2023

iantbutler01 Jul 18, 2023

iantbutler01 commented Jul 18, 2023 •

edited

Loading

M-Chris commented Jul 19, 2023

gante commented Jul 19, 2023

iantbutler01 commented Jul 20, 2023

M-Chris commented Jul 21, 2023 •

edited

Loading

iantbutler01 commented Jul 24, 2023

iantbutler01 commented Jul 25, 2023 •

edited

Loading

gante commented Jul 25, 2023

iantbutler01 commented Jul 25, 2023

iantbutler01 commented Jul 28, 2023

OliverFM commented Jul 31, 2023

Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding #529

Implement NTK-Aware scaled and dynamically scaled RoPE for PositionRotaryEmbedding #529

Conversation

iantbutler01 commented Jul 3, 2023 • edited Loading

What does this PR do?

Before submitting

Who can review?

iantbutler01 commented Jul 3, 2023 • edited Loading

ssmi153 commented Jul 14, 2023

iantbutler01 commented Jul 17, 2023 • edited Loading

Narsil left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iantbutler01 Jul 18, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iantbutler01 commented Jul 18, 2023 • edited Loading

M-Chris commented Jul 19, 2023

gante commented Jul 19, 2023

iantbutler01 commented Jul 20, 2023

M-Chris commented Jul 21, 2023 • edited Loading

Couple thoughts//concerns..

iantbutler01 commented Jul 24, 2023

iantbutler01 commented Jul 25, 2023 • edited Loading

gante commented Jul 25, 2023

iantbutler01 commented Jul 25, 2023

iantbutler01 commented Jul 28, 2023

OliverFM commented Jul 31, 2023

iantbutler01 commented Jul 3, 2023 •

edited

Loading

iantbutler01 commented Jul 3, 2023 •

edited

Loading

iantbutler01 commented Jul 17, 2023 •

edited

Loading

Narsil left a comment •

edited

Loading

iantbutler01 Jul 18, 2023 •

edited

Loading

iantbutler01 commented Jul 18, 2023 •

edited

Loading

M-Chris commented Jul 21, 2023 •

edited

Loading

iantbutler01 commented Jul 25, 2023 •

edited

Loading