
LLaMA support #506

Open
michaelroyzen opened this issue Mar 16, 2023 · 176 comments
Labels
enhancement New feature or request

Comments

@michaelroyzen

michaelroyzen commented Mar 16, 2023

Given existing support for GPT-J and its rotary embeddings, is LLaMA supported as well? Huggingface just shipped their implementation: huggingface/transformers@464d420

@byshiue

@byshiue
Collaborator

byshiue commented Mar 17, 2023

Can you explain the differences between GPT-J and LLaMA?

@teknium1

+1 for this

@michaelroyzen
Author

They look very similar. On Hugging Face's doc page, they say the implementation is based on the GPT-NeoX codebase, which seems to be supported by FasterTransformer: https://huggingface.co/docs/transformers/main/model_doc/llama.

Do you think it'll work?

@yuikns

yuikns commented Mar 24, 2023

+1

@byshiue According to our investigation, it is not difficult to port this model to Megatron as well, but I am not sure whether a single conversion script will work.

@byshiue
Collaborator

byshiue commented Mar 24, 2023

Thank you for the suggestion and discussion. We may not have time to work on this issue right now. If you are interested, you can try to add support for it yourself.
You are welcome to ask questions if you run into any problems, and to merge your work back into our repo if you get it working.

@byshiue byshiue added the enhancement New feature or request label Mar 24, 2023
@Hap-Zhang

+1 for this

@michaelroyzen
Author

michaelroyzen commented Apr 7, 2023

It seems to be quite a simple implementation @byshiue. All that needs to be done is to implement RMS layer norm in GPT-NeoX and to support the SiLU activation. It seems that both of these features are already implemented elsewhere in FasterTransformer.

I'd be happy to take the lead if you can help me with the general steps.

@ZZR0

ZZR0 commented Apr 11, 2023

+1 for this

@troycheng

+1 for this

@moonscar

I compared the GPT-J and LLaMA models in Hugging Face; they have the same attention layer. There are some differences in the FFN: LLaMA uses three weight matrices, and the forward function is as follows:

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

I checked the relevant code of the FFN layer in the FasterTransformer source, and there doesn't seem to be a similar structure. Or maybe such a layer already exists in the current code and I just haven't found it; I'd appreciate any tips.
@byshiue

@byshiue
Collaborator

byshiue commented Apr 12, 2023

I compared the GPT-J and LLaMA models in Hugging Face; they have the same attention layer. There are some differences in the FFN: LLaMA uses three weight matrices, and the forward function is as follows:

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

I checked the relevant code of the FFN layer in the FasterTransformer source, and there doesn't seem to be a similar structure. Or maybe such a layer already exists in the current code and I just haven't found it; I'd appreciate any tips. @byshiue

It looks like a standard gated SiLU. Can you explain what difference you see?
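
For reference, a minimal plain-PyTorch sketch of the equivalence (illustrative names only, not FT's actual API): LLaMA's w1/w3/w2 are simply the gate/up/down projections of a gated-SiLU FFN.

    import torch
    import torch.nn.functional as F

    def llama_ffn(x, w1, w2, w3):
        # LLaMA's FFN exactly as in the snippet above: w2(silu(w1(x)) * w3(x)).
        return (F.silu(x @ w1.T) * (x @ w3.T)) @ w2.T

    def gated_silu_ffn(x, w_gate, w_up, w_down):
        # The "standard" gated-SiLU FFN layout: gate, up and down projections.
        return (F.silu(x @ w_gate.T) * (x @ w_up.T)) @ w_down.T

    torch.manual_seed(0)
    x = torch.randn(2, 8)
    w1, w3, w2 = torch.randn(16, 8), torch.randn(16, 8), torch.randn(8, 16)
    # Identical outputs: w1 is the gate projection, w3 the up projection, w2 the down projection.
    assert torch.allclose(llama_ffn(x, w1, w2, w3), gated_silu_ffn(x, w1, w3, w2))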

@moonscar

Thanks for the reminder, I missed this part.
I will try to make this work

@michaelroyzen
Author

Wow, thank you @moonscar. Want any help? What's the status of your PR?

@AnShengqiang

need this too

@Anychnn

Anychnn commented Apr 17, 2023

@moonscar have you started this work? Or I can help with it.

@michaelroyzen
Author

Don't think it's been started yet @Anychnn

@michaelroyzen
Author

Given the interest and activity here, I'd like to offer a bounty of $2,500 USD to whoever can get Llama implemented in FT. Please email me at michael@phind.com if you're interested. @moonscar @AnShengqiang @Anychnn @byshiue

It seems that all that needs to be done is to copy T5's RMS layer norm (already implemented in FT) and UL2's gated SiLU (also already implemented elsewhere in FT) into GPT-NeoX. As per Hugging Face's implementation of Llama, it is otherwise identical to GPT-NeoX (which is already implemented in FT).
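
For anyone picking this up, here is a minimal PyTorch sketch of the RMS layer norm LLaMA uses (the same normalization as T5's; the function name and eps default are illustrative only):

    import torch

    def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Scale by the root mean square over the last dimension; unlike LayerNorm,
        # there is no mean subtraction and no bias term.
        variance = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(variance + eps) * weight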

@michaelroyzen
Author

The bounty will be $3,000 if a correct and working PR is opened by the end of Friday, April 21st (Pacific Time).

@jinluyang

Would be glad to help with part of the work, for example converting the weights to FT.

@cameronfr

Made a lot of progress on this, but my current FT model is outputting seemingly random tokens, so there's something wrong with my weight conversion or maybe even the layer implementation itself. If someone wants to pick up the torch (I am done for now 😞), the next step would probably be to compare the output of the Huggingface model vs. this FT model layer by layer:

Weights conversion: https://github.com/cameronfr/FasterTransformer/blob/main/examples/cpp/llama/huggingface_llama_convert.py
FT Model:
https://github.com/cameronfr/FasterTransformer/tree/main/src/fastertransformer/models/llama
Testing:
https://github.com/cameronfr/FasterTransformer/tree/main/examples/cpp/llama

Everything is modified from the respective GPTNeoX versions. LlamaContextDecoder and LlamaDecoder essentially just have the changes of Gelu -> Gated Silu and LayerNorm -> LayerNormT5. LlamaDecoderLayerWeight and LlamaWeight set the parameters of these layers.

@Anychnn

Anychnn commented Apr 22, 2023

@cameronfr The default layernorm_eps_ in llama.h is set to 1e-5, but llama-7b-torch defaults to 1e-6. The attention module output is also incorrect; I am fixing this.

@jinluyang

@cameronfr I think the reshape of qkv here might not be correct: https://github.com/cameronfr/FasterTransformer/blob/45d48f9d06713cd006f7d95d4b2f99a4bd3abb11/examples/cpp/llama/huggingface_llama_convert.py#L97
The Hugging Face-format q/k projections are permuted to suit its rotary embedding implementation: https://github.com/huggingface/transformers/blob/d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c/src/transformers/models/llama/convert_llama_weights_to_hf.py#L101
So I tried something like:

    qkvArr[:, 0, :, :] = qArr.reshape(n_heads,2, head_size//2, hidden_size).transpose((3,0,2,1)).reshape(hidden_size,n_heads,head_size)

and fixed the layernorm_eps, but the output tokens are still seemingly incorrect, not a sentence.
I also changed start_ids.csv not to use the one from gptneox, since the two models may not share the same token ids.
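
To make the inverse permutation explicit, here is a minimal NumPy sketch (n_heads, head_size, and hidden_size are assumed to come from the checkpoint config; the FT-specific [hidden_size, n_heads, head_size] reordering that the convert script performs afterwards is left out):

    import numpy as np

    def unpermute_qk(w, n_heads, head_size, hidden_size):
        # convert_llama_weights_to_hf.py reorders each head's rows from
        # (head_size//2, 2) to (2, head_size//2) for HF's rotary embedding;
        # this undoes that so the weights match the original Meta layout.
        # w has shape (n_heads * head_size, hidden_size).
        return (w.reshape(n_heads, 2, head_size // 2, hidden_size)
                 .transpose(0, 2, 1, 3)
                 .reshape(n_heads * head_size, hidden_size))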

@michaelroyzen
Author

Great progress @cameronfr @Anychnn @jinluyang. I'm doubling the bounty to $6k to whoever can get this working and merged in.

@void-main

Hey @michaelroyzen @cameronfr @Anychnn @jinluyang, I got a self-tested working version and opened a pull request with it. Could you guys please take a look? Any chance we could get it merged?

@michaelroyzen
Author

Nice! Works well so far in limited tests and is consistent with the Huggingface output using beam_size 1. One comment is that it should support max_position_embeddings (max_pos_seq_len in FT), but this is likely a simple change. Will continue testing and post the updates here.

@ZhuYuJin

@michaelroyzen Does FT support LLaMA fine-tuned with LoRA? The training code is here: https://github.com/tloen/alpaca-lora/blob/main/finetune.py

@ZhuYuJin

@michaelroyzen Does FT support LLaMA fine-tuned with LoRA? The training code is here: https://github.com/tloen/alpaca-lora/blob/main/finetune.py

Using the merge_adapter interface, you can merge the LoRA weights into the original linear weights: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py#L279
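
A minimal sketch of that flow with PEFT (paths are placeholders; assumes a LoRA adapter trained on a Hugging Face LLaMA base model):

    from transformers import LlamaForCausalLM
    from peft import PeftModel

    base = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf")   # placeholder path
    model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
    merged = model.merge_and_unload()  # folds W + (alpha/r) * B @ A into the base Linear weights
    merged.save_pretrained("path/to/merged-llama")  # then convert with huggingface_llama_convert.py as usual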

@void-main

Hey community, here are some updates:

  • supported bf16
  • supported triton decouple mode
  • verified that Llama 65B is working

@lucasjinreal

Is llama able to run correctly with FT now?

@Anychnn

Anychnn commented Jul 15, 2023 via email

@realgump

Does llama support dynamic batching?
I enabled dynamic batching in the config file, but the server still runs inference serially. Following https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt to enable ragged input, I can batch requests successfully, but the output contains a lot of garbled text.

@pai4451

pai4451 commented Jul 19, 2023

Llama 2 released: https://ai.meta.com/resources/models-and-libraries/llama/

Is it possible to serve it with Triton?

@SamuraiBUPT

SamuraiBUPT commented Jul 19, 2023

Llama 2 released: https://ai.meta.com/resources/models-and-libraries/llama/

Is it possible to serve it with Triton?

Worth trying; if there is no structural change, Llama 2 may be supported.

From the Meta AI blog and their paper, they seem to have trained the new models with different methods.

@SamuraiBUPT

The primary architectural differences from Llama 1 include increased context length and grouped-query attention (GQA).

Structural changes may exist.

@CN-COTER

The primary architectural differences from Llama 1 include increased context length and grouped-query attention (GQA).

Structural changes may exist.


If we use Llama2-7B or Llama2-13B, which do not use GQA, maybe we can apply the current llama FT inference architecture.

@chuanzhao0626

chuanzhao0626 commented Jul 21, 2023

llama-2-70B: The model configuration has one more parameter:
'num_key_value_heads': 8

Converting the model weights using huggingface_llama_convert.py produces an error:
ValueError: all input arrays must have the same shape

Comparing the two versions of the llama model, the weight dimensions of q, k and v are different.
llama-2-70B:
q_proj.weight: (8192, 8192)
k_proj.weight: (1024, 8192)
v_proj.weight: (1024, 8192)
Given the dimensional differences, is np.vstack() the right way to concatenate the parameters? (See the shape sketch at the end of this comment.)

llama-65B:
k_proj.weight: (8192, 8192)
q_proj.weight: (8192, 8192)
v_proj.weight: (8192, 8192)

Can anyone suggest how the model conversion should change? And with this dimension change, doesn't the inference implementation code in FT also need to be modified accordingly?
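
To make the shape mismatch concrete, here is a small NumPy illustration using the 70B numbers above (illustration only; concatenating the arrays does not by itself make FT's llama kernels GQA-aware):

    import numpy as np

    hidden_size, n_heads, n_kv_heads = 8192, 64, 8
    head_size = hidden_size // n_heads                    # 128
    q = np.zeros((n_heads * head_size, hidden_size))      # (8192, 8192)
    k = np.zeros((n_kv_heads * head_size, hidden_size))   # (1024, 8192)
    v = np.zeros((n_kv_heads * head_size, hidden_size))   # (1024, 8192)

    # Stacking into one (3, ...) array requires equal shapes, hence the
    # ValueError from the convert script. Concatenating along the output
    # dimension works shape-wise:
    qkv = np.concatenate([q, k, v], axis=0)               # (10240, 8192)

    # ...but the attention kernels still need to know that only n_kv_heads
    # key/value heads exist, so the FT inference code has to change as well.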

@Dimensionzw

llama-2-70B: The model configuration has one more parameter: 'num_key_value_heads': 8

Converting the model weights using huggingface_llama_convert.py produces an error: ValueError: all input arrays must have the same shape

Comparing the two versions of the llama model, the weight dimensions of q, k and v are different. llama-2-70B: q_proj.weight: (8192, 8192) k_proj.weight: (1024, 8192) v_proj.weight: (1024, 8192) Given the dimensional differences, is np.vstack() the right way to concatenate the parameters?

llama-65B: k_proj.weight: (8192, 8192) q_proj.weight: (8192, 8192) v_proj.weight: (8192, 8192)

Can anyone suggest how the model conversion should change? And with this dimension change, doesn't the inference implementation code in FT also need to be modified accordingly?

Hello, I would like to know whether there are any structural changes in Llama 2 7B and 13B, and whether they can be directly converted and deployed using the FT framework?

@CN-COTER

CN-COTER commented Jul 21, 2023

llama-2-70B: The model configuration has one more parameter: 'num_key_value_heads': 8
Converting the model weights using huggingface_llama_convert.py produces an error: ValueError: all input arrays must have the same shape
Comparing the two versions of the llama model, the weight dimensions of q, k and v are different. llama-2-70B: q_proj.weight: (8192, 8192) k_proj.weight: (1024, 8192) v_proj.weight: (1024, 8192) Given the dimensional differences, is np.vstack() the right way to concatenate the parameters?
llama-65B: k_proj.weight: (8192, 8192) q_proj.weight: (8192, 8192) v_proj.weight: (8192, 8192)
Can anyone suggest how the model conversion should change? And with this dimension change, doesn't the inference implementation code in FT also need to be modified accordingly?

Hello, I would like to know whether there are any structural changes in Llama 2 7B and 13B, and whether they can be directly converted and deployed using the FT framework?

I have tested Llama 2 13B with the FT framework + int8 and did not encounter any errors.

@void-main

Hey guys, don't try to use Llama 2 with the current Llama implementation.

The current implementation doesn't implement MultiQueryAttention (the num_key_value_heads field), so it is expected not to work.

If you are in a hurry to use Llama 2, I highly recommend you turn to vllm, which now supports Llama 2.

@fmac2000

@void-main will there be work done to implement MQA?

@void-main

Hey @fmac2000, I'd like to try implementing MQA based on FlashAttention2, but I can't promise when this feature will be ready.

@Dimensionzw

@fmac2000, I'd like to try implementing MQA based on FlashAttention2, but I can't promise when this feature will be ready.

@void-main Maybe you can refer to this project's implementation: it also uses the FT framework and recently added support for the GQA used by llama2-70B. For the 7B and 13B Llama 2 models, the existing implementation should be directly usable.
https://github.com/InternLM/lmdeploy

@AnyangAngus

llama-2-70B: The model configuration has one more parameter: 'num_key_value_heads': 8
Converting the model weights using huggingface_llama_convert.py produces an error: ValueError: all input arrays must have the same shape
Comparing the two versions of the llama model, the weight dimensions of q, k and v are different. llama-2-70B: q_proj.weight: (8192, 8192) k_proj.weight: (1024, 8192) v_proj.weight: (1024, 8192) Given the dimensional differences, is np.vstack() the right way to concatenate the parameters?
llama-65B: k_proj.weight: (8192, 8192) q_proj.weight: (8192, 8192) v_proj.weight: (8192, 8192)
Can anyone suggest how the model conversion should change? And with this dimension change, doesn't the inference implementation code in FT also need to be modified accordingly?

Hello, I would like to know whether there are any structural changes in Llama 2 7B and 13B, and whether they can be directly converted and deployed using the FT framework?

I have tested Llama 2 13B with the FT framework + int8 and did not encounter any errors.

@CN-COTER
Hi:
Are the Llama 2 output tokens from FT consistent with the HF transformers output?
Thank you

@fmac2000

Hey @fmac2000, I'd like to try implementing MQA based on FlashAttention2, but I can't promise when this feature will be ready.

@void-main - that’s great news, thank you for all the work you’ve put in so far - it’s extremely appreciated. Let us know 👍

@realgump

realgump commented Jul 31, 2023

Are there any bugs in batched inference? The model's responses always contain garbled characters when a batch of inputs is requested, like:
01.01.0395153939222e0 for 3010 for a neutral.tt.tt.222201401.01.01.5p.1.91.91.a01.20 with the first pitch-1.10 with01.1.10 with1.10 with01.1.20 with1.1.0 with1.20 with1.20 with1.22222222222201.1.1.1.1.133300 with1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1d1.1.1d1.1.1d1.1.1d1.
������������

@double-vin

@cameronfr I think the reshape of qkv here might not be correct: https://github.com/cameronfr/FasterTransformer/blob/45d48f9d06713cd006f7d95d4b2f99a4bd3abb11/examples/cpp/llama/huggingface_llama_convert.py#L97 The Hugging Face-format q/k projections are permuted for rotary embeddings: https://github.com/huggingface/transformers/blob/d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c/src/transformers/models/llama/convert_llama_weights_to_hf.py#L101 So I tried something like: qkvArr[:, 0, :, :] = qArr.reshape(n_heads,2, head_size//2, hidden_size).transpose((3,0,2,1)).reshape(hidden_size,n_heads,head_size) and fixed the layernorm_eps, but the output tokens are still seemingly incorrect, not a sentence. I also changed start_ids.csv not to use the one from gptneox, since the two models may not share the same token ids.

I have also encountered this issue. There is a problem with the output token ids. Have you resolved it?

@double-vin

update the inference speed:

  • 38ms per token on A6000, 13B llama model with FP16 precision.
  • 18ms per token on A800, 13B llama model with FP16 precision.
[1685187041.895424] [ee6d00936280:22964:f]        vfs_fuse.c:424  UCX  WARN  failed to connect to vfs socket '': Invalid argument
Total ranks: 1.
Device NVIDIA RTX A6000
P0 is running with GPU #0.
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
after allocation    : free: 22.86 GB, total: 47.54 GB, used: 24.67 GB
d_sequence_lengths 91 elements
Writing 1036 elements
    1 12968 29901 29896 29974 29896 29922 29973    13  7900
zeroCount = 946
[INFO] request_batch_size 1 beam_width 1 head_num 40 size_per_head 128 total_output_len 1036 decoder_layers 40 vocab_size 32000 FT-CPP-decoding-beamsearch-time 3052.38 ms
[INFO] batch 0: input_token_len 12, gen_token_len 79, total_token_len 91, ave 38.64 ms/token
Total ranks: 1.
Device NVIDIA A800-SXM4-80GB
P0 is running with GPU #0.
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
after allocation    : free: 54.51 GB, total: 79.32 GB, used: 24.82 GB
d_sequence_lengths 91 elements
Writing 1036 elements
    1 12968 29901 29896 29974 29896 29922 29973    13  7900
zeroCount = 946
[INFO] request_batch_size 1 beam_width 1 head_num 40 size_per_head 128 total_output_len 1036 decoder_layers 40 vocab_size 32000 FT-CPP-decoding-beamsearch-time 1471.19 ms
[INFO] batch 0: input_token_len 12, gen_token_len 79, total_token_len 91, ave 18.62 ms/token

Can you share your code? My output token ids are incorrect, and I would like to compare. Thank you!

@CN-COTER

CN-COTER commented Aug 3, 2023

Sorry for the late reply.

I have tested Llama2-13b-chat with HF transformers and FT.

The input_id is

    '1, 12968, 29901, 29871, 30406, 4691, 31479, 30287, 30502, 232, 194, 174, 31859, 233, 145, 149, 31463, 29871, 13, 13, 7900, 22137, 29901, 29871'
  • FT
    I used llama_example.cc and saved the input_id to start_ids.csv, then got the following output in the out file:
1 12968 29901 29871 30406 4691 31479 30287 30502 232 194 174 31859 233 145 149 31463 29871 13 13 7900 22137 29901 29871 18585 29991 2266 338 385 1342 310 263 4996 6605 5687 297 5132 29901 13 28956 13 1753 4996 6605 29898 2749 1125 13 1678 565 7431 29898 2749 29897 5277 29871 29896 29901 13 4706 736 3948 13 1678 24438 353 3948 29961 29900 29962 13 1678 3109 353 518 29916 363 921 297 3948 29961 29896 17531 565 921 5277 24438 29962 13 1678 7621 353 518 29916 363 921 297 3948 29961 29896 17531 565 921 1405 24438 29962 13 1678 736 4996 6605 29898 2222 29897 718 518 29886 11002 29962 718 4996 6605 29898 7979 
  • Hf-Transformer
    input_id = '1, 12968, 29901, 29871, 30406, 4691, 31479, 30287, 30502, 232, 194, 174, 31859, 233, 145, 149, 31463, 29871, 13, 13, 7900, 22137, 29901, 29871'
    input_id = [[int(i) for i in input_id.split(', ')]]
    input_id = torch.tensor(input_id)
    generate_ids = model.generate(input_id, max_new_tokens=100, do_sample = True, top_k =1, top_p=0.95, temperature = 1, repetition_penalty=1.0, eos_token_id=2, bos_token_id=1, pad_token_id=0)
generate_ids is:
tensor([[    1, 12968, 29901, 29871, 30406,  4691, 31479, 30287, 30502,   232,
           194,   174, 31859,   233,   145,   149, 31463, 29871,    13,    13,
          7900, 22137, 29901, 29871, 18585, 29991,  2266,   338,   385,  1342,
           310,   263,  4996,  6605,  5687,   297,  5132, 29901,    13, 28956,
            13,  1753,  4996,  6605, 29898,  2749,  1125,    13,  1678,   565,
          7431, 29898,  2749, 29897,  5277, 29871, 29896, 29901,    13,  4706,
           736,  3948,    13,  1678, 24438,   353,  3948, 29961, 29900, 29962,
            13,  1678,  3109,   353,   518, 29916,   363,   921,   297,  3948,
         29961, 29896, 17531,   565,   921,  5277, 24438, 29962,    13,  1678,
          7621,   353,   518, 29916,   363,   921,   297,  3948, 29961, 29896,
         17531,   565,   921,  1405, 24438, 29962,    13,  1678,   736,  4996,
          6605, 29898,  2222, 29897,   718,   518, 29886, 11002, 29962,   718,
          4996,  6605, 29898,  7979]])

So, according to this example, the output from FT is consistent with HF transformers.
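
A quick way to check this programmatically (a sketch; "out" is the FT output file mentioned above, and generate_ids is the tensor from the HF run):

    # Compare the FT-generated ids with the HF reference ids up to the shorter length.
    ft_ids = [int(t) for t in open("out").read().split()]
    hf_ids = generate_ids[0].tolist()
    n = min(len(ft_ids), len(hf_ids))
    print("consistent:", ft_ids[:n] == hf_ids[:n])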

@double-vin

Thank you very much for your reply. When I used the commit on July 2nd, I received the correct results, but there was a problem with using the commit on April 23rd. I will use a new version to solve this problem.

@realgump

realgump commented Aug 8, 2023

Does llama support dynamic batching? I enabled dynamic batching in the config file, but the server still runs inference serially. Following https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt to enable ragged input, I can batch requests successfully, but the output contains a lot of garbled text.

Yes, dynamic batching works well with latest commits. 33B, both decoupled true and false.

Thank you for your reply. However, even after updating to the latest commits, my 13B model still produces garbled output when multiple requests are made concurrently.

@shixianc

shixianc commented Aug 15, 2023

Do we have a working implementation for Llama1 using FlashAttention?

I tried to set $FMHA_ENABLE=ON but did not observe any difference in the output or the performance. I'm wondering if anyone has tested this feature and would like to share some more details?

@efwfe

efwfe commented Aug 23, 2023

llama-2-70B: The model configuration has one more parameter: 'num_key_value_heads': 8

Converting the model weights using huggingface_llama_convert.py produces an error: ValueError: all input arrays must have the same shape

Comparing the two versions of the llama model, the weight dimensions of q, k and v are different. llama-2-70B: q_proj.weight: (8192, 8192) k_proj.weight: (1024, 8192) v_proj.weight: (1024, 8192) Given the dimensional differences, is np.vstack() the right way to concatenate the parameters?

llama-65B: k_proj.weight: (8192, 8192) q_proj.weight: (8192, 8192) v_proj.weight: (8192, 8192)

Can anyone suggest how the model conversion should change? And with this dimension change, doesn't the inference implementation code in FT also need to be modified accordingly?

Same error.

@RobotGF

RobotGF commented Sep 8, 2023

Does llama support dynamic batching? I enabled dynamic batching in the config file, but the server still runs inference serially. Following https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt to enable ragged input, I can batch requests successfully, but the output contains a lot of garbled text.

Yes, dynamic batching works well with latest commits. 33B, both decoupled true and false.

Thank you for your reply. However, even after updating to the latest commits, my 13B model still produces garbled output when multiple requests are made concurrently.

#716
#742
These two pull requests may help. Both work well; select whichever you need.

@double-vin

Does llama support dynamic batching? I enabled dynamic batching in the config file, but the server still runs inference serially. Following https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt to enable ragged input, I can batch requests successfully, but the output contains a lot of garbled text.

Yes, dynamic batching works well with latest commits. 33B, both decoupled true and false.

Thank you for your reply. However, even after updating to the latest commits, my 13B model still produces garbled output when multiple requests are made concurrently.

#716 #742 These two pull requests may help. Both work well; select whichever you need.

I have the same problem, but these two pull requests did not solve the garbled output problem.

@HuaYZhao

HuaYZhao commented Oct 11, 2023

I built FasterTransformer following the llama_guide. Building with cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release .. succeeds, and running ./bin/llama_example from the build directory works. But when I build in debug mode with cmake -DSM=80 -DCMAKE_BUILD_TYPE=Debug .. and run the llama_example executable the same way, I get the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: too many resources requested for launch /ft_workspace/FasterTransformer/src/fastertransformer/layers/FfnLayer.cc:311 

The model is llama-7b-hf and the device is an A100-80G.
Please help me figure this out, thank you!

@Anychnn

Anychnn commented Oct 11, 2023 via email

@CN-COTER

CN-COTER commented Oct 20, 2023

Hi, FYI:
TensorRT-LLM is publicly available: https://github.com/NVIDIA/TensorRT-LLM/tree/main. As described in the docs, it integrates FasterTransformer and supports many of the latest features. Meanwhile, the TensorRT-LLM backend is now available in the Triton Inference Server (https://github.com/triton-inference-server/tensorrtllm_backend/blob/e514b4af5ec87477b095d3ba6fe63cc7b797055f/README.md#L31).
