Enabling model parallelism (training 30b on 2x 3090s and beyond) #131
Conversation
Perhaps it doesn't work on Windows?
Here are some relevant libs (transformers is also installed from the correct fork).
Hm, maybe the RPC bit isn't supported on Windows. Keep
Each GPU should have roughly (size of model) / (number of GPUs). Since you're on Windows, use MSI Afterburner or something similar to monitor the VRAM usage and make sure the model is loaded across both GPUs before training begins. For anyone on Linux, use nvtop.
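If you'd rather check from inside the training process instead of with Afterburner or nvtop, here's a minimal sketch using plain PyTorch memory queries (run it after the model finishes loading):

```python
import torch

# Rough check that the model really is split: each visible GPU should hold
# roughly (model size) / (number of GPUs) once loading finishes.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```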
Yeah, that's what you should expect to see. Model parallelism can't keep both GPUs busy all the time; it just goes back and forth between them, but it means you can load larger models or use larger batch sizes. Some newer frameworks have fancy tricks to reduce the inefficiency, but this is just the simplest case. See here for more info: https://pytorch.org/docs/stable/pipeline.html
The Pipeline should be one of those tricks, but I don't think I fully implemented it. I'll need to play with that some more later. But for now, yeah, it looks like it's working as expected for you.
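To illustrate the "back and forth": a toy sketch of naive model parallelism (not this repo's code), where each half of a model lives on its own GPU and activations hop between them, so only one GPU computes at a time:

```python
import torch
import torch.nn as nn

class NaiveTwoGpuNet(nn.Module):
    """Toy model split across two GPUs (hypothetical layer sizes)."""
    def __init__(self):
        super().__init__()
        self.first_half = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.second_half = nn.Linear(512, 512).to("cuda:1")

    def forward(self, x):
        x = self.first_half(x.to("cuda:0"))
        # While cuda:1 computes, cuda:0 sits idle (and vice versa); pipelining
        # with micro-batches is the trick that overlaps the two.
        return self.second_half(x.to("cuda:1"))

out = NaiveTwoGpuNet()(torch.randn(4, 512))
```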
I found the root cause of the device_map auto discrepancy in the transformers repo, so I'm going to draft this until I get that fixed and merged.
Is it possible to support DeepSpeed stage 3 to do parameter partitioning and fit a large model onto multiple GPUs? Would it be faster than the current naive (non-overlapping) pipeline parallelism?
I'm not sure, I'll need to look into DeepSpeed more. I played with it for a minute and I think it didn't support 8-bit. I'll add it to my list of things to look at, because better parallelism would be awesome. I mostly know how to get full pipelining working, but DeepSpeed would be more valuable.
How are you planning to implement the full pipeline? I searched for examples and docs, and I think they all lead to modifying the implementation of llama in transformers, which I would treat as a last resort.
Correct. It's not trivial, but it's not terrible either. I stopped when I realized I also needed to batch the inputs into micro-batches within a single tensor. I was also using the pipeline for a little more than it was designed for, so it was breaking in weird ways. Please do take a look if you're interested, though; be warned it's very hacky and broken.
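On the micro-batching point, the underlying idea is just splitting one batch tensor into chunks that flow through the pipeline stages one after another; a trivial illustration:

```python
import torch

batch = torch.randn(32, 512)     # one full training batch
micro_batches = batch.chunk(4)   # 4 micro-batches of 8 samples each

# A pipeline feeds these in sequence so that, e.g., micro-batch 2 can be in
# stage 1 while micro-batch 1 is already in stage 2, keeping both GPUs busy.
for mb in micro_batches:
    print(mb.shape)  # torch.Size([8, 512])
```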
Hi, thanks for this work! I'm experimenting with multiple configs to find the best match for my use cases. I'm now trying to train the 30b, but I keep getting OOM.
Still, while "Loading checkpoint shards:" it breaks with OOM, having filled the first GPU up; the second one is almost unused. Any idea what I could be doing wrong?
@AngainorDev how did you force max_memory? I edited finetune.py line 78 to be
And I can train 30B on 2080Ti 22G x 3 with
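(The exact edit isn't shown above, but forcing max_memory when loading the model generally looks like the sketch below; the model path and per-GPU limits are placeholders, not the values from that setup.)

```python
from transformers import AutoModelForCausalLM

# Hypothetical caps for a 3x 22 GiB setup; leave headroom for activations.
max_memory = {0: "20GiB", 1: "20GiB", 2: "20GiB"}

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-30b",        # placeholder base model
    load_in_8bit=True,
    device_map="auto",
    max_memory=max_memory,
)
```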
@AngainorDev I just pushed a change that references my fork of transformers. I was hoping they would merge the PR quickly, but since they're a company, it seems like they won't get to it until Monday. To install it,
With that, you won't need a hard-coded max_memory, and you can just use the "auto" device map for a perfect distribution.
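With the patched transformers, the idea is that the hard-coded limits above become unnecessary; roughly (a sketch with a placeholder model path):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-30b",   # placeholder
    load_in_8bit=True,
    device_map="auto",     # let accelerate balance the layers across all GPUs
)

# Shows which device each module landed on; useful for confirming the split.
print(model.hf_device_map)
```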
I find DeepSpeed pipeline parallelism very promising: you just need to change the input and output of each layer to a tuple of tensors, and DeepSpeed can do the rest for you, including micro-batching, etc. It has much more relaxed constraints than the PyTorch Pipe: you don't need to express the model as an nn.Sequential (just a list of Python callables), each layer does not need to be an nn.Module (any Python callable works), and the input/output can be a tuple of tensors, not limited to one tensor.
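For reference, the DeepSpeed pipeline API being described looks roughly like this; a sketch with toy layers, not something tested against this repo's model, and it assumes the script is launched with the `deepspeed` launcher so distributed state is set up:

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# Stages are just a list of callables mapping tensors (or tuples of tensors)
# to tensors; no nn.Sequential required.
layers = [nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)]

pipe_model = PipelineModule(layers=layers, num_stages=2)  # split over 2 GPUs
engine, _, _, _ = deepspeed.initialize(
    model=pipe_model,
    model_parameters=[p for p in pipe_model.parameters() if p.requires_grad],
    config="ds_config.json",  # hypothetical DeepSpeed config file
)

# DeepSpeed handles micro-batching and stage scheduling internally.
loss = engine.train_batch(data_iter=iter(train_loader))  # train_loader: your own DataLoader
```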
I just updated to git+https://github.com/kooshi/transformers.git@balanced_memory_8bit
I used it, but this seems to have no effect: GPU 0 takes it all and OOMs at 70% of model loading.
Sounds like it's not even seeing the second gpu as available or something. Make sure CUDA_VISIBLE_DEVICES is set correctly. |
Yeah. But torch.cuda.device_count() correctly detects the 2 GPUs. The second one gets a bit of VRAM used when running, around 1 GB.
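A quick sanity check for this situation (just standard PyTorch and environment calls):

```python
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device_count =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}: {torch.cuda.get_device_name(i)}")
```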
My second PR for transformers was merged in, so now the only thing required to use model parallelism is reinstalling transformers and merging the few lines left.
I'm not sure what's going on with @AngainorDev's setup, because in that case it's behaving as if the proven fixes (the manual max_memory, or the updated load_in_8bit logic in transformers) are being ignored. I have to imagine something is configured incorrectly or is somehow overriding the correct behavior. @AngainorDev, my next suggestion would be to attempt a clean slate: set up a brand new conda environment, install the latest supported libraries, and run this code unmodified, just to see if it can work at all before changes.
This PR is ready to be merged.
Thanks for the follow up. |
Already used this update to train with MP, and it works well!
LGTM — thanks for this!
I successfully finetuned the 30b model on multiple GPUs with pipeline parallelism
I hope for help, thanks!
Can you fit the fp16 model in your VRAM? It seems you don't have enough VRAM and some layers are being put on the CPU.
Make sure you have something like model.parallized = True (check the changed files) and that this error is not caused by OOM.
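For context, the standard transformers attributes for this look like the sketch below; the exact names in the changed files may differ, so treat this as a guess and check the diff:

```python
import torch

# `model` is the model loaded earlier in finetune.py. These flags tell the
# Trainer that the model is already spread across GPUs, so it should not try
# to move it to a single device.
if torch.cuda.device_count() > 1:
    model.is_parallelizable = True
    model.model_parallel = True
```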
Are you using torchrun? |
Could you provide a command line example that uses model parallelism on multiple GPUs? I have tried
The model was split across the two GPUs about evenly, but I got the error "../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [30,0,0], thread: [96,0,0] Assertion". If I train a smaller model on a single GPU, the error doesn't show up. DDP also works well on 2 GPUs. I have also tried different versions of CUDA, nvidia-drivers, transformers, bitsandbytes, and llama models (even converted from the original weights), but the error is still there.
Yeah... this is new. It was also reported here: huggingface/transformers#22546. One guy there noticed the only difference was his driver version: huggingface/transformers#22546 (comment). I haven't seen it yet, but I haven't been training recently. I may have some time to check it out this weekend, but it's likely beyond my knowledge.
Thanks for pointing me to that thread. I forgot to mention that I was using 4090s. I have also checked that thread earlier and tried his driver version, but no luck on 4090s. MP works well on 2x 3090s though. |
I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine. Thanks a lot for your effort in this project!
So it's faster training (DDP) "OR" bigger models (MP). I have been searching for ways to get DDP "AND" MP together, but no luck so far. Neither DeepSpeed nor torchrun gives a clear clue.
DeepSpeed does support MP, but it seems to be only for the inference part. Hope someone can correct me if I'm wrong.
Doesn't work on Ubuntu 18.04. Is 20.04 some magic version? I doubt it. What version of CUDA do you have?
My system is Ubuntu 20.04, but I still hit this problem. Have you solved it? @vans163
My 4090 does not work, but the A100 works. |
Nope, I did not get it working.
Python 3.8, CUDA 12.0, Driver 525.105.17
Does what it says on the tin.
Now multi-GPU users can choose to use them for faster training (DDP) or bigger models (MP).
This required a minor change to the transformers library. It has been merged: PR. Just update by reinstalling the transformers module.
This also serves as a workaround for #8
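For reference, the kind of selection logic involved (a sketch, not necessarily the exact code in this PR): torchrun sets WORLD_SIZE, so its presence picks DDP with one GPU per process; otherwise a single process spreads the model across every visible GPU.

```python
import os
from transformers import AutoModelForCausalLM

world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1  # torchrun sets WORLD_SIZE; a plain `python finetune.py` does not

if ddp:
    # DDP: each process gets a full copy of the model on its own GPU.
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
else:
    # MP: one process, layers balanced across all visible GPUs.
    device_map = "auto"

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-30b",  # placeholder base model
    load_in_8bit=True,
    device_map=device_map,
)
```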