Splitting model on multiple GPUs produces RuntimeError #10
When attempting to split the model on multiple GPUs, I get the following error:
This only happens if the model is split between GPUs using the `-gs` option.
I pushed an update that might fix it. I messed something up at one point so the attention mask wasn't copied from the first device, which might explain that error. It seems to be working now at least, with... yep, the 33B model works, and presumably the 65B version is quantized with the same parameters.
And also, you probably shouldn't use
Hmm, I'm still getting the error:
If I run with the
I am running the latest commit:
It's odd. Could you try this?
Also, what GPUs are you using?
I get the same error eventually, after some
My GPUs are both RTX 4090:
For comparison, if the model is not split, the test completes successfully, with normal perplexity scores.
Huh... this is very odd indeed. It seems like it can move the state from GPU to GPU (since that last example is running on cuda:1). And either GPU works, I take it? I.e. if you run it with `-gs 20,0` do you get the same result as `-gs 0,20`? If that works then it has to come down to something particular that happens when the state is transferred in between decoder blocks. Possibly a synchronization issue with PyTorch. What version are you using? Also, could you try `nvcc --version` for good measure?
I'll try a RunPod instance in a bit with two 4090s to see if it's maybe a timing issue that's masked by one of my GPUs being slower than the other.
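As an aside (not part of the original exchange), the version information being asked for here can also be collected from inside Python; this sketch uses only standard PyTorch calls:

```python
# Collect the PyTorch build, its bundled CUDA version, and the visible GPUs.
import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}:", torch.cuda.get_device_name(i))
```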
Earlier I had been using PyTorch 2.0.1, but I just switched to the 2.1 nightly and am still getting the same error:
If I do
Thank you very much for going the extra mile to repro on RunPod! In case it helps, here's a bunch more info on my system:
Oh, and here's the
RunPod doesn't seem to provide the latest drivers for their 4090 servers, which means no compute_89, so I can't precisely replicate your setup. They won't let you update drivers as far as I know (?)... It does compile for compute_86 though (with cu117), and this is running fine. It's a bit slow for the smaller models, probably because of the CPU bottleneck (1500 MHz server cores), but the performance is not that bad on 65B:
It works with at least a couple of different versions of Torch and CUDA, though I still can't replicate your setup because of the driver version. 525 only supports up to 12.0 as far as I can tell. But I'm wondering if it could have something to do with the AMD stuff in your device list. Maybe it's confusing Torch? You could try setting
I think later I'll add a debug mode that dumps some
I'd need to figure out exactly where the hidden state is getting corrupted.
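The debug mode mentioned here wasn't captured in the thread; as a rough illustration of the idea, a per-layer dump might look something like this, where `layers` and `hidden_states` are stand-in names rather than ExLlama's actual internals:

```python
# Hypothetical sketch: print a checksum of the hidden state after every layer,
# so a zeroed-out cross-device copy shows up immediately as sum == 0.
import torch

def forward_with_debug(layers, hidden_states):
    for i, layer in enumerate(layers):
        target = next(layer.parameters()).device
        if hidden_states.device != target:
            hidden_states = hidden_states.to(target)
        hidden_states = layer(hidden_states)
        checksum = hidden_states.float().abs().sum().item()
        print(f"layer {i:3d}  {str(target):10s}  sum={checksum:.4f}")
    return hidden_states
```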
I added the debug mode. If you can try it with
Thanks! Here you go:
Okay, so the hidden states are just disappearing when jumping from GPU to GPU? That's super weird. Could you try printing out the hidden state before and after the move? In model.py on line 1103, replace this:
with this:
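The two snippets referenced here weren't preserved when the thread was exported. Assuming the move in question is a plain `.to()` call on the hidden state (an assumption, since line 1103 isn't shown), the patched version would look roughly like:

```python
# Print device and a checksum before and after the cross-device move;
# `next_device` is an illustrative name for the target device.
print("before:", hidden_states.device, hidden_states.float().abs().sum().item())
hidden_states = hidden_states.to(next_device)
print("after: ", hidden_states.device, hidden_states.float().abs().sum().item())
```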
If this is where the contents disappear I really don't know what to think...
So it would seem... I just made that patch to model.py and here is the result:
And the relevant section:
I'm going to go see if IOMMU is enabled and disable it if so...
Moving a tensor across CUDA devices gets zero tensor, CUDA 11.0 #87363
Yes, that sounds like the same issue. Since it only seems to affect transfers between GPUs, you could probably work around it by copying via system RAM like this, instead of having to disable IOMMU.
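The snippet itself wasn't captured here; a minimal sketch of the idea, assuming the transfer is otherwise a direct `hidden_states.to(next_device)` call:

```python
# Stage the tensor through system RAM instead of copying GPU-to-GPU directly.
hidden_states = hidden_states.cpu().to(next_device)
```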
There would be a (very small) performance cost, but I could add it as a fallback at least. If it works.
So... I tried disabling IOMMU, and that didn't seem to have any effect (it was set to "auto" in the BIOS, and I did not have it enabled in the kernel command line, so ¯\_(ツ)_/¯). But the passing of the hidden_states to CPU first did seem to fix the specific issue of the hidden state getting zeroed out:
After doing a bit of research, it looks like the 40xx series explicitly does not support "P2P". I tried using
However, there does still seem to be an issue with GPU splitting. The LLM gives seemingly incoherent results with huge or nan perplexity scores still:
...
Attaching a full debug output:
Note: I get expected perplexity results and coherent output if
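As a side note (not from the thread itself): PyTorch can report whether it believes direct peer-to-peer access between two devices is available, which is one way to confirm the 40-series P2P limitation mentioned above:

```python
# Expected to print False in both directions on consumer 40-series cards.
import torch

print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))
```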
There is one other place where it moves data from GPU to GPU, but it's a little more subtle: the position embeddings, which would end up being all zeros on one GPU if the issue is that it just can't move data across that way. And that would explain the output being garbage. In fact it fits nicely with a perplexity in the hundreds rather than nan. It is weird that it works between my 4090 and 3070-Ti, and I also tested it on two 4090s on RunPod, so there must be something else in your setup causing it, maybe not IOMMU but something related to it. Some kernel parameter or something?
Anyway, I pushed a new update with an extra option to force all the transfers (hopefully) to go via system RAM. I can't actually measure any difference in performance, so maybe I'll just make it the default, but for now you can try running with
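The name of the new option is cut off above. Conceptually, it just routes every cross-device copy through host memory; a hypothetical helper along those lines (the function and parameter names here are invented for illustration, not ExLlama's actual API):

```python
import torch

def move_to_device(tensor, device, force_via_cpu=True):
    device = torch.device(device)
    if tensor.device == device:
        return tensor
    if force_via_cpu and tensor.is_cuda and device.type == "cuda":
        # Stage through system RAM so the direct GPU-to-GPU path is never used.
        # This covers the position embeddings (sin/cos) as well as the hidden
        # state that was patched by hand earlier in the thread.
        return tensor.cpu().to(device)
    return tensor.to(device)
```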
Fantastic! That did the trick :) Thank you!