mit-han-lab / qserve Public

Issues: mit-han-lab/qserve

28 Open 9 Closed

Issues list

A question about the parameter “--group-size” in qserve_benchmark.py

#48 opened Dec 16, 2024 by oasis-Linmi

qserve with tensorrt-llm is slower and awq int4 for llama2-7b

#46 opened Nov 28, 2024 by anaivebird

Is an OpenAI-compatible server supported?

#43 opened Oct 31, 2024 by anaivebird

How to test the accuracy?

#42 opened Oct 30, 2024 by lisuying214

Some questions about VLM quant

#40 opened Oct 23, 2024 by hanhanpp

Question about pagedattention

#36 opened Sep 6, 2024 by SherrySwift

How to add new models?

#33 opened Aug 23, 2024 by NicolasDrapier

RMSNorm implemented as LayerNorm

#32 opened Aug 21, 2024 by jason-huang03

LLama-3-8B model dumped by LMQuant in 4w8a set raises errors when running e2e benchmark in QServe.

#29 opened Aug 12, 2024 by Patrick-Lew

[New Feature] Will MLA Be Supported?

#28 opened Aug 8, 2024 by RanchiZhao

How can we reproduce Tables 2 and 3? (PPL and zero-shot Acc)

#25 opened Jul 12, 2024 by kriskrisliu

Question about dequantization overhead

#23 opened Jul 6, 2024 by DD-DuDa

Circular import error

#22 opened Jul 5, 2024 by LuckyLYM

The output of the given model (mit-han-lab/Llama-3-8B-QServe-g128) is incorrect

#21 opened Jul 2, 2024 by haichuan1221

Expected speed for llama3-70b-instruct

#18 opened Jun 4, 2024 by ethxnp

Is the Table.3 accuracy tested with dequantized weights, or tested on real accelerated quantized kernels?

#17 opened Jun 3, 2024 by vovoluck

has anyone tried to HIPify this for AMD/ROCm

#16 opened Jun 2, 2024 by ehartford

support tp

#14 opened May 24, 2024 by cyLi-Tiger

activation quantization

#13 opened May 24, 2024 by hanhanpp

Any performance comparison with vLLM?

#12 opened May 21, 2024 by MuYu-zhi

Llama-2-7B-QServe model doesn't give the expected output

#11 opened May 21, 2024 by MuYu-zhi

Question about the paper

#10 opened May 18, 2024 by jameswu2014

Couldn't instantiate the backend tokenizer

#8 opened May 16, 2024 by Rudin6

lmquant for QoQ quantization and fake-quantized model dumping

#7 opened May 15, 2024 by SimpleTheoryOfTypes

Would this work on consumer hardware and integrated in frameworks like llama.cpp or others?

#5 opened May 11, 2024 by Mayorc1978
