2 x RTX A5000 performance #32
I take it this is a 65B model, given the two 24GB GPUs? |
Yes. Guanaco 65B GPTQ. More details:
|
If it doesn't infer act-order, it's because the model doesn't have a group index, which is what matters for how you'd use the model. I'm not sure if there's some other sense of being act-order that TheBloke is referring to with his conversions, but all the recent ones I've seen just have sequential groups (or no groups), even though they're labeled as act-order.
Not really. As long as the model fits. It can take a little trial and error because cuda:0 will typically also be used by the system, and on top of the usage that gets reported, Torch will allocate a bunch for itself, and it won't always be split equally. If one of your GPUs is faster than the other you'll want as much of the model on that one as possible, otherwise the balance has no effect on performance. |
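Roughly what I mean by trial and error, as a sketch: query the free VRAM on each card first and leave some headroom for Torch's own allocations. The helper name and the 1.5 GB reserve below are my own guesses, not anything ExLlama does itself.

```python
import torch

# Rough helper for picking a starting GPU split: check free VRAM per card
# and subtract some headroom for Torch's/CUDA's own allocations.
# (Hypothetical helper; the reserve value is a guess.)
def suggest_split(reserve_gb=1.5):
    split = []
    for dev in range(torch.cuda.device_count()):
        free_bytes, _total_bytes = torch.cuda.mem_get_info(dev)
        usable = free_bytes / 1024**3 - reserve_gb
        split.append(round(max(usable, 0.0), 1))
    return split

print(suggest_split())  # e.g. [21.8, 23.2] on 2 x 24 GB cards; use as a first guess for the split
```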
Got it, thank you. Here's my understanding of GPTQ heuristics (not sure if useful, skip if obvious):
|
That's pretty much my understanding as well. Except the activation order is what's stored in the group index. From gptq.py:
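(The exact excerpt isn't reproduced here; the following is a from-memory paraphrase of the idea, not the verbatim gptq.py code.)

```python
import torch

# Paraphrased sketch: how a GPTQ-style quantizer ends up storing the
# activation order in the group index. Names are approximate.
columns, groupsize = 4096, 128

# Without act-order, columns are quantized left to right, so the group
# index is just sequential:
g_idx = torch.tensor([i // groupsize for i in range(columns)])

# With act-order, columns are first sorted by activation magnitude
# (the diagonal of the Hessian), quantized in that order, then mapped
# back to their original positions:
H_diag = torch.rand(columns)                   # stand-in for the real Hessian diagonal
perm = torch.argsort(H_diag, descending=True)  # the "activation order"
invperm = torch.argsort(perm)
g_idx_actorder = g_idx[invperm]
# Each original column can now point at any group, so the index is no longer
# sequential -- that permutation is what gets stored as g_idx.
```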
So ExLlama determines whether a model is act-order or not by checking if there's a group index, since any model converted with act-order is going to have the g_idx tensor. It's all very unclear, especially if act-order is meant to affect the format of the converted model, so that it conflicts somehow with group size. Unless it means (as I've been assuming) that the "groupsize" models just effectively have a sequential group index, whereas the "act-order" models can have every row (or column, in the state/activations) index a different group, i.e. a different set of quantization parameters. It would also imply that act-order doesn't make sense without a group size, since then there's effectively only one group to index to. |
The situation with act-order is a bit confusing to me. |
Act-order + group size together causes serious performance loss on other GPTQ CUDA kernels, so usually people use either group size alone or act-order alone. There is still a perf hit with act-order alone, but it's small. There is definitely a perplexity difference between using act-order and using nothing. Group size on 30B models causes OOM with full context in 24 GB. This is why people are doing what they're doing: they want some perplexity gain without all the performance/compatibility issues. |
This still doesn't make sense. If you haven't got a group size, what is it that's being reordered with the act-order feature? There are four separate cases, I think:
Case 4 is kind of nonsensical to me. It should be functionally identical to case 1, except it would include a group index. Except that index is just a list of zeros, so I guess GPTQ just doesn't bother to save it, since it doesn't actually affect anything? Unless I'm completely misunderstanding how act-order works, those models that claim to be act-order without groupsize are in fact just not act-order models. They may have been converted with the act-order option, but that option doesn't do anything in the absence of groupsize, or at least it doesn't do anything that requires you to use the model as you would normally use an act-order model. |
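To make case 4 concrete under my reading (a toy illustration, not project code):

```python
import torch

# Case 4 (act-order but no group size): with a single group covering the
# whole row, the group index is all zeros, so permuting it by activation
# order changes nothing.
columns = 4096
groupsize = columns                      # "no groupsize" == one group for everything
g_idx = torch.tensor([i // groupsize for i in range(columns)])
perm = torch.randperm(columns)           # whatever activation order was used
assert torch.equal(g_idx, g_idx[perm])   # identical either way: all zeros
```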
Well, 128 is not exactly small: on a 30B model I can't use it at full context anymore. It's fine if you just ask it one-off questions, but it's completely unusable for chat. I've not checked whether you solve this problem, but for every other GPTQ kernel 30B + 128g == death. I'm not 100% sure on this, but there is never no grouping, AFAIK. Without a group size it uses full rows and ranks them by size. There is a perplexity difference between act_order and no act_order on full-rank quantizations; that wouldn't be the case if it did nothing. Also, did the model get quantized by AutoGPTQ or original GPTQ? There may be a difference there too. I'm not an expert on the math, but if your case 4 is true, both the original GPTQ author and the AutoGPTQ author added a useless option and nobody has brought it up for months. |
Idk, maybe in the original. My implementation does just fine on 128g because it's more memory-efficient overall. I can get up to 2516 tokens of context before it OoMs, with 128g. Of course it starts outputting garbage around 2048 because that's all Llama is trained on, but it's doing the computations for the full sequence length.
But this ranking isn't saved anywhere, it would seem. Or at least, if the model registers as no-act-order, it's because the g_idx tensor is missing from the weights. I've also tested quite a few models by now (incomplete list) and not one of them has a g_idx without a groupsize. It's quite possible that the quantization works out to slightly different (better) values when act-order is used. The code is very dense and hard to parse. Perhaps that's why I'm not seeing where the g_idx would get discarded before the quantized model is saved. I'll continue to investigate. |
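For what it's worth, this is roughly how I've been poking at checkpoints to see whether g_idx is present and whether it's actually permuted. The file name and key pattern are placeholders; a sketch, not ExLlama's actual detection code.

```python
from safetensors import safe_open

# Look for g_idx tensors in a converted checkpoint and check whether the
# first one is sequential (plain groupsize) or permuted (true act-order).
with safe_open("gptq_model-4bit-128g.safetensors", framework="pt") as f:
    g_idx_keys = [k for k in f.keys() if k.endswith(".g_idx")]
    print(f"{len(g_idx_keys)} g_idx tensors found")
    if g_idx_keys:
        g_idx = f.get_tensor(g_idx_keys[0])
        groupsize = int((g_idx == 0).sum())  # size of group 0
        sequential = all(int(g_idx[i]) == i // groupsize for i in range(len(g_idx)))
        print("sequential groups" if sequential else "permuted (true act-order) groups")
```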
I'll give the 30b-128 model I have a shot and see if I can go full context on it. There are some models now trained past 2048. Haven't attempted one of those either yet. |
What models are those? I'd be curious to check them out as well. |
Bluemoon RP has longer context and there is a 30B now (with 128 groups). Also the MPT-StoryWriter merge: https://huggingface.co/TehVenom/MPT-7b-WizardLM_Uncensored-Storywriter-Merge/tree/main I never got Bluemoon to stay coherent in the 13B past 2500 or so tokens, but StoryWriter goes strong; it's not quantized though. The MPT uses ALiBi and I haven't tried to quantize it, but I assume other people with less VRAM have. |
Cleaning up some stale issues. |
10 t/s vs. 6 t/s on text-generation-webui.
Great project.