Multi-GPU issues #281

Open
nktice opened this issue Sep 9, 2023 · 9 comments

@nktice

nktice commented Sep 9, 2023

Here's another unresolved bug from Oobabooga's project:
oobabooga/text-generation-webui#2923
I realized that the ExLlama team may have a solution,
so I thought I'd cross-post the issue here, in case you've not seen it.

Here's the guide I wrote to get everything working on AMD kit...
https://github.com/nktice/AMD-AI
Models load fine when they are on only one card; here are some results:
https://github.com/nktice/AMD-AI/blob/main/SallyAIRiddle.md

Loading across multiple cards only produces gibberish; here's an example:

pha golden Riv. Jcatred (ProcSN proc Dre -:// Mindly means for the and in a Nich říct Forest Rav Rav fran fran fran gaz Agrcastle castleasiacliordinate advers Mem advers Basibenkooor paste Singapore refugeermeanny intellectualsafe Shakespe contempor Mallmanual Quantmousektr Ge Mil shadownehfdzekADmobile Und Euenf Next Dominbuchcock Infoengo‭ Hann NAT ]] Ferr' -.-- -,-

    ason, rang,-, –-

(,,

--,.,

alter

,-

(

-on,-.

I,- .

1

V

V. film-

N

    –on.,on,.

(, for.

and of- is. . and –on, –,. and

In in

film school and I on and with and I ":

.

` andon util –


@Ph0rk0z
Contributor

Ph0rk0z commented Sep 10, 2023

This is a bug in HIP or ROCm; splitting works on NVIDIA. The other multi-GPU bug is OOM, which happens if you can't dispatch the model properly so that it doesn't run out of memory during inference.

@nktice
Author

nktice commented Sep 13, 2023

> Bug in hip or rocm. On nvidia it's working to split. Other bug is OOM if you can't properly dispatch the model so it doesn't run out during inference.

Thanks for your reply... I've raised the issue on HIP's GitHub issue tracker:
ROCm/HIP#3331

@turboderp
Owner

Just in case you haven't tried it yet, the --gpu_peer_fix argument (corresponding entry in ExLlamaConfig) might help. Maybe? It prevents direct inter-device copying even when the driver reports that the capability is there, and copies everything via system RAM instead. There have been some issues with that on NVIDIA at least.
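
As a rough diagnostic along the same lines, the sketch below compares a direct device-to-device copy with one staged through system RAM. It is not ExLlama's implementation of --gpu_peer_fix; the helper names `staged_copy` and `peer_copy_is_consistent` are made up for illustration, and it assumes a PyTorch build where the GPUs show up through the `torch.cuda` API (which is also how ROCm builds expose them).

```python
import torch


def staged_copy(t: torch.Tensor, dst: str) -> torch.Tensor:
    """Copy a tensor to `dst` via host memory rather than directly between devices."""
    return t.to("cpu").to(dst)


def peer_copy_is_consistent(src: str, dst: str, n: int = 1 << 20) -> bool:
    """Return True if a direct copy and a host-staged copy yield identical data."""
    x = torch.randn(n, device=src)
    direct = x.to(dst)            # may use a direct inter-device transfer
    staged = staged_copy(x, dst)  # always goes through system RAM
    return torch.equal(direct, staged)


if torch.cuda.device_count() >= 2:
    print(peer_copy_is_consistent("cuda:0", "cuda:1"))
else:
    print("Fewer than two GPUs visible; nothing to compare.")
```

If this prints False on a two-card setup, the direct copy path is a likely suspect and worth reporting upstream.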

@nktice
Author

nktice commented Sep 14, 2023

Thanks for your reply, and your excellent coding, it's great when it works...

I looked into this, but had trouble finding how to set such an option...
Since it would be good to have options like this exposed through their interface, I have asked Oobabooga to add them:
oobabooga/text-generation-webui#3912

I have been looking for ( but have yet to find again ) a page that I came across earlier...
It discussed similar issues that came from torch.empty,
which does not clear the allocated memory, so leftover data sometimes caused problems.
It suggested using torch.zeros instead, which helped some people.
I went through your code and tried that for my issue, to little avail,
but thought I'd mention it in case you've not heard of it and it helps others.
[ If I find that page, I'll update this to include a proper link. ]
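
For anyone unfamiliar with the distinction, here is a minimal sketch of when torch.empty is safe and when torch.zeros is needed. The variable names are only illustrative, and this is not taken from ExLlama's code.

```python
import torch

# torch.empty only allocates; the contents are whatever was already in that memory.
scratch = torch.empty(4, dtype=torch.float32)

# Safe use: every element is overwritten before anything reads it.
scratch.copy_(torch.arange(4, dtype=torch.float32))

# An accumulator needs a defined starting value, so it must come from torch.zeros.
acc = torch.zeros(4, dtype=torch.float32)
acc += scratch
print(acc)
```

Swapping torch.empty for torch.zeros only changes results where a buffer is read before being fully written.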

@turboderp
Owner

Yep, torch.empty isn't supposed to clear the data, which could cause problems if you're incorrectly assuming that an empty tensor is the same as a zeros tensor, but I think I've been mindful enough of the distinction.

--gpu_peer_fix is only a kludge to work around a particular bug in Torch (or CUDA, or the NVIDIA driver, or whatever the case may be). So it's not really a solution or anything, more a diagnostic tool, and the solution would be filing a bug report upstream if that flag fixes something that shouldn't be broken.

I'm thinking another thing to explore would be the use of at::cuda::OptionalCUDAGuard to ensure that the correct CUDA device is selected on entry to each of the extension functions. If that doesn't get properly HIPified, it could lead to ROCm working correctly on single-GPU setups but failing (perhaps even sporadically) on multi-GPU setups.

@nktice
Author

nktice commented Sep 14, 2023

I got a reply to my Oobabooga post about passing
parameters such as the one you suggest: "It's on by default"
oobabooga/text-generation-webui#3912

Thanks for your replies... I have been thinking about this, so I'll mention it.
Another issue, somewhat related to model loading,
is the cache and other memory involved in handling models...
For example, forum commenters suggest using split settings
that leave lots of room for caching and memory use around the model,
as tokens, caching, and management data consume lots of space.
[ The bigger the model, the more is used for tokens and index info... ]
I am wondering if there is a loader option that reports these
[ command line option, benchmark tool parameter, or something like that... ]
so one can predict the whole memory footprint a model will use?

Related to this, instead of splitting the model across GPUs,
is it practical to put the supporting features on another card?
This would allow maximizing the model size loaded on one card,
avoiding an issue like the one I'm having when splitting the model,
and using the 2nd card ( or perhaps system RAM ) for caching / tokens.
As token counts go up, supplemental memory could be more helpful.

@turboderp
Owner

Cache and state have to reside on the same device as the associated weights. You can't do CUDA operations across devices, and while you could store just the cache on a separate device, it would be slower than just swapping it to system RAM, which is still slow enough to be kind of useless.
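
Since each device holds the cache for its own layers, a back-of-the-envelope estimate per card is possible. The sketch below is not an ExLlama tool: the function names are hypothetical, and it assumes an FP16 cache with one key and one value tensor of shape [seq_len, hidden_size] per layer, batch size 1, and no grouped-query attention. The 160 MiB of weights per layer is an invented round number, and activation and scratch buffers are ignored, so treat the result as a lower bound.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_element=2):
    """Keys plus values for `num_layers` layers (FP16 by default)."""
    return 2 * num_layers * seq_len * hidden_size * bytes_per_element


def per_device_gib(layers_per_device, weight_bytes_per_layer, hidden_size, seq_len):
    """Estimated GiB per device: its layers' weights plus the cache for those layers."""
    return [
        (n * weight_bytes_per_layer + kv_cache_bytes(n, hidden_size, seq_len)) / GIB
        for n in layers_per_device
    ]


# Hypothetical 40-layer model with hidden size 5120, split 24/16 across two cards,
# with a 4096-token context and an assumed 160 MiB of quantized weights per layer.
estimate = per_device_gib([24, 16], 160 * MIB, hidden_size=5120, seq_len=4096)
print(estimate)
```

Comparing these figures with each card's VRAM gives a starting point for choosing a split that leaves headroom for the cache.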

@ardfork
Contributor

ardfork commented Sep 29, 2023

I guess I forgot to answer here. This is the same issue as #173, which was fixed upstream; the fix will be available in the next ROCm version.

Note that exllama v2 is also affected. This could easily have been fixed locally in exllama with a small hack, as was done in llama.cpp, but I didn't have the hardware to test.

@nktice
Author

nktice commented Jan 23, 2024

I can now report that, with the latest drivers, it seems to work.
That is, I can load a model across GPUs and it's responsive.
[ ROCm 6.0 , torch==2.3.0.dev20240118+rocm6.0 ]
