Segfault when training large GPT2 models on single GPU #679
Comments
I did a bit more digging around, and I think I have an idea of what's going on. The loop in Adam_Optimizer::Step is violating the bounds of _doubled_buffer because, when training the 1558M model, _param_size (1557611200) exceeds TILE (1073741824). If I edit the code to change the value of TILE to 1557611200, then I am able to train the 1558M model using the CPU Adam optimizer with no apparent problem. Is there a particular reason why this value is hardcoded? I'm new to DeepSpeed, so forgive me if I'm misunderstanding something. [EDIT: I originally noted that it was running slower than I expected, but that seems to have resulted from some weird problem with my installation. After a clean install, it runs at around 2.19s/it.]
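To make the bounds problem concrete, here is a minimal C++ sketch of that failure mode. It is a hypothetical illustration (it assumes the overrun comes from indexing a TILE-sized staging buffer with the absolute parameter index), not the actual code in cpu_adam.cpp, and the function and variable names are made up:

```cpp
#include <cstddef>

// 1,073,741,824 -- the hardcoded tile size from cpu_adam.h
constexpr size_t TILE = 1024UL * 1024UL * 1024UL;

// Hypothetical copy loop: `staging` is assumed to hold exactly TILE floats.
void copy_params_tiled(const float* params, size_t param_size, float* staging)
{
    for (size_t t = 0; t < param_size; t += TILE) {
        size_t copy_size = (t + TILE > param_size) ? (param_size - t) : TILE;
        for (size_t k = t; k < t + copy_size; ++k) {
            // With param_size = 1,557,611,200 (> TILE), the second tile starts
            // at k = TILE, so this write already lands past the end of `staging`.
            staging[k] = params[k];
        }
    }
}
```

With gpt2-xl, param_size is 1,557,611,200, so the loop enters a second tile and any absolute-indexed write into the staging buffer goes out of bounds, which would match a segfault that only appears once the model exceeds roughly a billion parameters.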
Your report looks similar to a segfault I'm getting with DeepSpeed and transformers/t5-11b: #726 (comment). See if you still get the segfault if you use the parent of e51e4d7.
@stas00
Watching. However, if I switch to using
Then, adding that, I get the following error where I can't start the training:
I have 45GB of CPU RAM.
It's 17 because that's when your CPU Adam optimizer actually runs for the first time, and that's when it crashes. Because of overflow it skips the first dozen or so steps. I was hitting steps 22 and 25 in other cases before segfaulting. Most likely it has to do with you running on AMD, correct? For a temp solution, please see: huggingface/transformers#9996 (comment). The DeepSpeed team is aware of the problem and working on solving this.
Yes, I'm running on AMD. Since I get the allocator issue with AdamW and have to use the "zero_allow_untested_optimizer" flag, does that mean I just have to wait until this is fixed? If I use Adam with the rollback you mentioned, I still get the same segfault.
Thank you for confirming that it's indeed an issue on the AMD platform, @collinarnett. I'm just a messenger at the moment and will let the DeepSpeed folks chime in on the specifics. But it looks like there is some bug in DeepSpeed's CPU Adam that happens only on the AMD platform. So until it's fixed, using the torch optimizer should work, that is:
It will be much slower, though. The specific commit at which the breakage was first detected is the one where the DeepSpeed AdamW was made directly available. That's all I know as of this moment. I will keep posting updates in this thread, huggingface/transformers#9996, as I get more info, so you may choose to track it.
For the record, when I finally got it to run, the CPU Adam optimizer used almost 54GB of CPU memory to train gpt2-xl. I'm on AMD, too.
As I shared here: huggingface/transformers#9996 (comment), my CPU RAM usage stats are:
What's the exact change that you made, @jeffbinder? Perhaps that would be useful to the developers. For some reason the core dump I get has its backtrace completely corrupted, and the backtrace would be very helpful. The DeepSpeed folks say they have an AMD machine to test this on, so hopefully they will sort it out. If you manage to get a backtrace and post it here, that would help. Just in case you don't know how, here is one such guide: https://jvns.ca/blog/2018/04/28/debugging-a-segfault-on-linux/
Here is a diff:

```diff
--- a/csrc/includes/cpu_adam.h
+++ b/csrc/includes/cpu_adam.h
@@ -20,7 +20,7 @@
     }  \
 }
-#define TILE (1024 * 1024 * 1024)
+#define TILE 1557611200
 #if defined(__AVX512__)
 #define SIMD_STORE(a, d) _mm512_storeu_ps(a, d)
```

1557611200 is the number of parameters in gpt2-xl. In itself, this patch isn't really a solution because it's specific to one model; if you're training another large model, you'd have to change it to a different number. I'll see if I can get a proper stack trace when I have the time. The crash was, I believe, occurring in Adam_Optimizer::Step at cpu_adam.cpp:134.
I read the code a bit more carefully and I think I can see why this segfault is only happening on AMD systems. If AVX512 or AVX256 instructions are available, the vectorized loop is used, and the way this loop works looks like it results in a difference in behavior between Intel and AMD in the AVX code. My hacky solution of changing the value of TILE only works around this for one particular model. There also appears to be an issue with how the availability of AVX instructions is being determined. I have a Zen 2 processor that is supposed to have AVX256, but it doesn't seem to be detected. I'm not sure if that's an issue with DeepSpeed or a configuration problem on my end, though.
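On the detection point: Zen 2 does support AVX2 (256-bit) but not AVX-512, so if the build-time check behind the __AVX512__ macro (and whatever corresponds to AVX256) misses it, presumably the non-vectorized fallback is the path that runs. A quick standalone check of what the CPU itself reports (assuming GCC or Clang on x86-64; this is a diagnostic sketch, not part of DeepSpeed) is:

```cpp
// Build with, e.g.:  g++ -O2 check_avx.cpp -o check_avx
#include <cstdio>

int main()
{
    __builtin_cpu_init();  // initialize CPU feature detection
    std::printf("avx     : %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    std::printf("avx2    : %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    std::printf("avx512f : %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}
```

If this reports avx2 as available while the DeepSpeed build still takes the scalar path, that would point at the build's feature detection rather than the hardware.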
Hi @jeffbinder, @stas00, and @collinarnett,
Thank you for the constructive discussion on this issue and how to solve it.
Reza
Hi @RezaYazdaniAminabadi, |
I'm trying to use DeepSpeed to finetune GPT2 models on a single RTX 3090 GPU. Using the scripts included with huggingface-transformers, I have been able to get it working up through the 774M model, and the ZeRO optimizations enable me to double the batch size. However, the CPU Adam optimizer is segfaulting when I try to train the 1558M model. I am using Ubuntu 20.04, CUDA 11.2, Nvidia drivers 460.32.03, and current git master versions of PyTorch, Transformers, and DeepSpeed.
Here is the script I used:
pofo-corpus.txt is the Poetry Foundation collection in a single text file (around 18MB). Here is the config file:
I've messed around with a bunch of the settings, but none of them seem to affect the issue. Here is the output:
The program then exits abruptly. The segfault is reported in dmesg:
I tried training similar models using the DeepSpeed version of Megatron-LM instead of huggingface-transformers, and the same thing happens: it works correctly up through a certain number of parameters, but it segfaults with sufficiently large models.