
2 x RTX A5000 performance #32

Closed
alain40 opened this issue Jun 5, 2023 · 14 comments

Comments

@alain40

alain40 commented Jun 5, 2023

10 t/s here vs. 6 t/s on text-generation-webui.

Great project.

@turboderp
Owner

I take it this is a 65B model, given the two 24GB GPUs?

@alain40
Author

alain40 commented Jun 5, 2023

Yes. Guanaco 65B GPTQ.

More details:

  • Act-order was not inferred, but is present in the above model. I hard-wired it in the code. Not sure if it makes a difference.
  • gs 18,20 leads to an imbalanced memory map with 23GB on GPU0. gs 14,18 also works and leads to a more balanced memory map. Not sure if it makes a difference.

@turboderp
Owner

> Act-order was not inferred, but is present in the above model. I hard-wired it in the code. Not sure if it makes a difference.

If it doesn't infer act-order, it's because the model doesn't have a group index, which is what matters for how you'd use the model. I'm not sure if there's some other sense of being act-order that TheBloke is referring to with his conversions, but all the recent ones I've seen just have sequential groups (or no groups), even though they're labeled as act-order.

> gs 18,20 leads to an imbalanced memory map with 23GB on GPU0. gs 14,18 also works and leads to a more balanced memory map. Not sure if it makes a difference.

Not really. As long as the model fits. It can take a little trial and error because cuda:0 will typically also be used by the system, and on top of the usage that gets reported, Torch will allocate a bunch for itself, and it won't always be split equally. If one of your GPUs is faster than the other you'll want as much of the model on that one as possible, otherwise the balance has no effect on performance.
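
For illustration, a quick way to see how a given split actually lands is to ask PyTorch's allocator after the model is loaded; this is plain PyTorch, nothing ExLlama-specific:

    import torch

    # Print how much memory Torch has allocated/reserved on each GPU, which is
    # what determines whether a given split will fit alongside whatever else
    # the system is already using on cuda:0.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"cuda:{i}: {alloc:.2f} GiB allocated, {reserved:.2f} GiB reserved")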

@alain40
Author

alain40 commented Jun 5, 2023

Got it, thank you.

Here's my understanding of GPTQ heuristics (not sure if useful, skip if obvious):

  • true-sequential (sequential quantization even within a single transformer block). All models do this.
  • act-order (quantizing columns in order of decreasing activation size). Mutually exclusive with group-size when using the standard CUDA version of the GPTQ code, which only works with one or the other but not both. So most models choose one or the other to work with the standard CUDA-based GPTQ. There is a Triton-based version that supports both heuristics.
  • group-size (sharing scaling factors among a smaller group). Gives better results if a model must choose between act-order and group-size, but uses more memory.
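
For illustration, here is a rough sketch of what these options mean for the tensors stored in a quantized checkpoint; the tensor names (scales, qzeros, g_idx) and the 4-bit packing follow common GPTQ conversions and are assumed here:

    # Rough shape arithmetic for one quantized linear layer (assumed 4-bit GPTQ).
    # k = input features (height of the original weight matrix), n = output features.
    def quant_param_shapes(k: int, n: int, groupsize: int, bits: int = 4):
        groups = k // groupsize if groupsize > 0 else 1   # no groupsize -> one group
        scales = (groups, n)                 # one scaling factor per group and column
        qzeros = (groups, n * bits // 32)    # zero-points, bit-packed into int32
        g_idx = (k,)                         # per-row group id; only meaningful when
                                             #   act-order has reordered the rows
        return scales, qzeros, g_idx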

@turboderp
Owner

That's pretty much my understanding as well. Except the activation order is what's stored in the group index. From gptq.py:

    if actorder:
        # perm is the activation-order permutation applied to the weight columns
        # before quantization; invert it so the quantized columns return to their
        # original order, and keep the per-column group assignment aligned in g_idx.
        invperm = torch.argsort(perm)
        Q = Q[:, invperm]
        g_idx = g_idx[invperm]

So ExLlama determines whether a model is act-order or not by checking if there's a group index, since any model converted with act-order is going to have the g_idx tensors, and a model converted without act-order won't.
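
A minimal sketch of that check, assuming the converted weights are loaded into an ordinary state dict and that act-order conversions store the group index under keys ending in g_idx, as in the snippet above:

    # Heuristic: a model converted with act-order carries a g_idx tensor per
    # quantized linear layer; a no-act-order conversion does not.
    def looks_like_act_order(state_dict) -> bool:
        return any(key.endswith(".g_idx") for key in state_dict)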

It's all very unclear, especially if act-order is meant to affect the format of the converted model, so that it conflicts somehow with group size. Unless it means (as I've been assuming) that the "groupsize" models just effectively have a sequential group index, whereas the "act-order" models can have every row (or column, in the state/activations) index a different group, i.e. a different set of quantization parameters.

It would also imply that act-order doesn't make sense without a group size, since then there's effectively only one group to index to.

@alain40
Author

alain40 commented Jun 6, 2023

The situation with act-order is a bit confusing to me.
But we have an experimental data point: this model works with your current code.

@Ph0rk0z
Contributor

Ph0rk0z commented Jun 6, 2023

Act-order + group size together causes a serious performance loss on other GPTQ CUDA kernels, so usually people use either group size alone or act-order alone. There is still a perf hit with act-order alone, but it's small.

There is definitely a perplexity difference between using act-order and using nothing at all. Group size on 30B models causes OOM with full context in 24GB. This is why people are doing what they're doing.

They want some perplexity gain without all the performance/compatibility issues.

@turboderp
Owner

This still doesn't make sense. If you haven't got a group size, what is it that's being reordered with the act-order feature?

There are four separate cases, I think:

  1. no act-order, no groupsize: Here GPTQ would have a single set of quantization parameters for the whole matrix. qzeros and scales have a height of 1
  2. no act-order, groupsize n: There is a new set of quantization parameters for every n rows in the matrix. So the rows are effectively grouped, sequentially. qzeros and scales have a height of k/n where k is the height of the original weight matrix before quantization. This will always be a little slower, and the smaller the groupsize the slower and more VRAM-heavy it's going to be. 128g has a fairly small impact on both, though.
  3. act-order, groupsize n: Same as the previous, except there is an additional group index that specifies which group each row belongs to. They're no longer sequential, so you can't just process n rows at a time; you re-fetch the quant params for every row. This is obviously more flexible and allows for better accuracy, but also considerably slower. I get around it by shuffling all the matrices at load-time so they work like case 2 instead (sketched after this list), with the caveat that the left-hand side of the matmul needs some shuffling also, so you get the correct result in the end, but this ends up being relatively cheap.
  4. act-order, no groupsize: In this case, there is only one set of quantization parameters, but also each row can be quantized with any set of quant params, except there's only one option to choose from...?
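
To make the case-3 reshuffle concrete, here is a rough sketch, ignoring bit-packing and assuming the weight has already been unpacked to one row per input feature:

    import torch

    def make_groups_sequential(weight, g_idx):
        # Sort rows by their group id so rows of the same group become contiguous;
        # after this the matrix behaves like case 2 and can be processed a group at a time.
        perm = torch.argsort(g_idx)
        return weight[perm, :], perm

    # At inference, x @ weight becomes x[:, perm] @ weight_seq: permuting the input
    # columns with the same perm is the cheap left-hand-side shuffle mentioned above.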

Case 4 is kind of nonsensical to me. It should be functionally identical to case 1, except it would include a group index. Except that index is just a list of zeros, so I guess GPTQ just doesn't bother to save it, since it doesn't actually affect anything?

Unless I'm completely misunderstanding how act-order works, those models that claim to be act-order without groupsize are in fact just not act-order models. They may have been converted with the act-order option, but that option doesn't do anything in the absence of groupsize, or at least it doesn't do anything that requires you to use the model as you would normally use an act-order model.

@Ph0rk0z
Contributor

Ph0rk0z commented Jun 6, 2023

Well... 128 is not fairly small. On a 30B model I can't use it at full context anymore. It's fine if you just ask it one-off questions, but it's completely unusable for chat. I've not checked whether you solve this problem, but for every other GPTQ kernel, 30B + 128g == death.

I'm not 100% sure on this, but there is never no grouping, AFAIK. Without a group size it uses full rows and ranks them by size.

There is a perplexity difference between act_order and no act_order on full-rank quantizations. That wouldn't be the case if it did nothing. Also, did the model get quantized by AutoGPTQ or the original GPTQ? There may be a difference there too.

I'm not an expert on the math, but if your case 4 is true, both the original GPTQ author and the AutoGPTQ author added a useless option and nobody has brought it up for months.

@turboderp
Owner

> Well... 128 is not fairly small. On a 30B model I can't use it at full context anymore. It's fine if you just ask it one-off questions, but it's completely unusable for chat. I've not checked whether you solve this problem, but for every other GPTQ kernel, 30B + 128g == death.

Idk, maybe in the original. My implementation does just fine on 128g because it's more memory-efficient overall. I can get up to 2516 tokens of context before it OoMs, with 128g. Of course it starts outputting garbage around 2048 because that's all Llama is trained on, but it's doing the computations for the full sequence length.

> I'm not 100% sure on this, but there is never no grouping, AFAIK. Without a group size it uses full rows and ranks them by size.

But this ranking isn't saved anywhere, it would seem. Or at least, if the model registers as no-act-order, it's because the g_idx tensor is missing from the weights. I've also tested quite a few models by now (incomplete list) and not one of them has a g_idx without a groupsize.

It's quite possible that the quantization works out to slightly different (better) values when act-order is used. The code is very dense and hard to parse. Perhaps that's why I'm not seeing where the g_idx would get discarded before the quantized model is saved.

I'll continue to investigate.

@Ph0rk0z
Contributor

Ph0rk0z commented Jun 7, 2023

I'll give the 30b-128 model I have a shot and see if I can go full context on it. There are some models now trained past 2048. Haven't attempted one of those either yet.

@turboderp
Owner

What models are those? I'd be curious to check them out as well.

@Ph0rk0z
Contributor

Ph0rk0z commented Jun 7, 2023

Bluemoon RP has longer context and there is a 30B now (with 128 groups). There is also an MPT-Storywriter merge: https://huggingface.co/TehVenom/MPT-7b-WizardLM_Uncensored-Storywriter-Merge/tree/main

I never got Bluemoon to stay coherent in the 13B past 2500 or so tokens, but Storywriter goes strong; it's not quantized though. The MPT models use ALiBi and I have not tried to quantize one, but I assume other people with less VRAM have.

@turboderp
Owner

Cleaning up some stale issues.

turboderp closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 17, 2023