Error: must forward with targets before backward #19
Did you check if you have OpenMP installed?

```
# on macOS
$ brew install libomp
# on ubuntu
$ sudo apt-get install libomp-dev
```

After installing, build it again.
We get this error when mean_loss equals -1.0f, which is its initial value; this attribute is only updated in the gpt2_forward function when the targets parameter is not NULL. So somewhere in the training loop or in the dataloader, gpt2_backward is being called before gpt2_forward.
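For context, here is a minimal self-contained sketch of the sentinel pattern described above. The names mirror train_gpt2.c, but the struct and function bodies are stand-ins, not the repo code:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    float mean_loss; // initialized to -1.0f when the model is built
} GPT2;

void gpt2_forward(GPT2 *model, const int *targets) {
    if (targets != NULL) {
        model->mean_loss = 5.35f; // stand-in for the real averaged loss
    } else {
        model->mean_loss = -1.0f; // inference-only forward: no loss available
    }
}

void gpt2_backward(GPT2 *model) {
    // backward requires a prior forward pass that had targets
    if (model->mean_loss == -1.0f) {
        printf("Error: must forward with targets before backward\n");
        exit(1);
    }
    // ... the real code would backpropagate here ...
}

int main(void) {
    GPT2 model = { .mean_loss = -1.0f };
    gpt2_forward(&model, NULL); // forward without targets: sentinel stays in place
    gpt2_backward(&model);      // -> "Error: must forward with targets before backward"
    return 0;
}
```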
I got the same issue.

Does anyone understand how -O3 can possibly be causing this issue?
-Ofast already enables all the optimizations from -O3 plus other aggressive optimizations; maybe it's better to use either -O3 or -Ofast, not both?
@karpathy or maybe the optimizations could be custom-made LLVM passes targeting train_gpt2.c rather than ones generated automatically from the -O flags.
The -Ofast flag enables the -ffast-math optimizations, which relax strict IEEE floating-point semantics.
OK, I guess I found the problem: -Ofast and -O3 now work together when enabled; the issue was caused by -ffast-math messing up the floating-point arithmetic.
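To illustrate why -ffast-math can change behavior like this: it implies -ffinite-math-only, which lets the compiler assume no NaNs or infinities ever occur, so NaN checks and NaN propagation can be optimized away. A minimal standalone illustration (not from the repo; the exact output depends on the compiler and version):

```c
// fastmath_demo.c
// gcc -O3 fastmath_demo.c -lm              -> typically prints "x is NaN"
// gcc -O3 -ffast-math fastmath_demo.c -lm  -> may print "x is finite",
// because -ffinite-math-only allows isnan(x) to be folded to 0.
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile float zero = 0.0f;  // volatile so the division isn't constant-folded
    float x = zero / zero;       // 0/0 produces a NaN at runtime
    if (isnan(x)) {
        printf("x is NaN\n");
    } else {
        printf("x is finite: %f\n", x);
    }
    return 0;
}
```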
I still get the "backward before forward" error when using:

```makefile
# CFLAGS = -O3 -Ofast -Wno-unused-result
CFLAGS = -O3 -Wno-unused-result
LDFLAGS =
LDLIBS = -lm
INCLUDES =
```
@Infatoshi Yes, if …
Disabling the fast-math flag seems to solve the issue for me as well.
OK, so from what I saw, to recap: on Ubuntu machines (I am on 22.04), CFLAGS = -O3 -Ofast -fno-fast-math -Wno-unused-result will work. If that doesn't work, just remove -Ofast and everything should be fine: CFLAGS = -O3 -Wno-unused-result
I can't repro this issue on my computer. Would it maybe work to change the check to if (model->mean_loss < 0.0f), as the loss can never be negative?
OK, I will try this change in train_gpt2.c and then build it with your original Makefile, since I also had the issue like a lot of other Ubuntu machines:

```c
if (model->mean_loss < 0) {
    printf("Error: must forward with targets before backward\n");
    exit(1);
}
```

I used the original Makefile with -Ofast and -O3 enabled (without disabling -ffast-math) and it throws nan:

```
step 0: train loss 5.356172 (took 11253.211517 ms)
step 1: train loss nan (took 9569.404843 ms)
step 2: train loss nan (took 9525.026318 ms)
step 3: train loss nan (took 9518.282173 ms)
step 4: train loss nan (took 9924.498914 ms)
```
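A side note on the check itself (my observation, not from the thread): if -ffast-math drives the loss to NaN, a `< 0` comparison won't flag it either, because every ordered comparison with NaN evaluates to false. A tiny self-contained illustration:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float mean_loss = nanf("");  // what the fast-math build ends up with
    printf("mean_loss < 0      -> %d\n", mean_loss < 0.0f);   // 0: NaN compares false
    printf("mean_loss == -1.0f -> %d\n", mean_loss == -1.0f); // 0: also false
    printf("isnan(mean_loss)   -> %d\n", isnan(mean_loss));   // nonzero, at least when
                                                              // built without -ffast-math
    return 0;
}
```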
The previously suggested fix works for me on Ubuntu 22.04.4 LTS (I have OpenMP installed):

CFLAGS = -O3 -Ofast -fno-fast-math -Wno-unused-result

I had no joy on macOS due to the complexity of achieving a consistent Python environment. However, I suppose the Python stages could be run on another machine and just the results copied to macOS to be used as the input to the C phase. With this change my final output was the following two poems:

<|endoftext|>I was so frightened with your face: to come and though they would not do it any more than as
<|endoftext|>CLAUSE:

Not precisely Booker Prize material, but it nevertheless begins to give an insight into the work involved in producing useful research results.
Using WSL, only CFLAGS = -O3 -Ofast -fno-fast-math -Wno-unused-result works.
So @karpathy, the issue seems to exist only on Ubuntu systems (or Linux in general); on macOS it works fine.
Stumbled on the same error. And the thing is that … Some info on my machine/compiler for others (I believe it also depends on the specific compiler used):
@bexcite Yes, apparently the -ffast-math flag messes up the floating-point arithmetic. But does it work on your machine with -Ofast enabled and -fno-fast-math added to disable -ffast-math?
@ent0n29 Yes, disabling -ffast-math works for me. On my machine I've tried these combos:
Yes, this is what I mentioned here for Ubuntu machines:
I added a comment in README for now. I don't have a good sense of when the code works or does not work, so it feels hard to change the Makefile generically atm. @ent0n29 does removing these flags make the code slower? (I'd expect yes)
Yes, removing -O3 or -Ofast will make the code slower. I don't know how much -ffast-math influences the speed. Also, -Ofast is just a wrapper around -O3 with more aggressive opts; I would use either one or the other.
I've tried both options on my laptop (i7-6600U CPU @ 2.60GHz, 16GB RAM) running Debian Bookworm, and both ended up with the same results. Each step took ~40 seconds. Meanwhile, this guy used a Raspberry Pi 5 and it took him ~13 seconds per step.
[#21] What I have is:

```
step 0: train loss 5.356082 (took 19942.187432 ms)
step 1: train loss 4.300639 (took 20333.230115 ms)
step 2: train loss 4.623087 (took 20157.211989 ms)
step 3: train loss 4.599362 (took 19900.653274 ms)
step 4: train loss 4.616664 (took 19071.652778 ms)
step 5: train loss 4.231427 (took 19976.814785 ms)
step 6: train loss 3.753161 (took 20760.803743 ms)
step 7: train loss 3.650458 (took 19308.340202 ms)
step 8: train loss 4.182242 (took 19559.064261 ms)
step 9: train loss 4.199580 (took 18556.248236 ms)
```

-Ofast without -O3 and with -ffast-math disabled averages at:

```
step 0: train loss 5.356082 (took 28424.368136 ms)
step 1: train loss 4.300639 (took 20445.511701 ms)
step 2: train loss 4.623087 (took 22656.468311 ms)
step 3: train loss 4.599362 (took 19115.014434 ms)
step 4: train loss 4.616664 (took 19833.797978 ms)
step 5: train loss 4.231427 (took 18573.217460 ms)
step 6: train loss 3.753161 (took 18102.854112 ms)
step 7: train loss 3.650458 (took 18000.311629 ms)
step 8: train loss 4.182242 (took 28836.764671 ms)
step 9: train loss 4.199580 (took 24153.199814 ms)
```

Only -O3 without -Ofast is a bit slower but still around:

```
step 0: train loss 5.356082 (took 17718.687714 ms)
step 1: train loss 4.300639 (took 17256.573805 ms)
step 2: train loss 4.623087 (took 16764.518172 ms)
step 3: train loss 4.599362 (took 16864.526737 ms)
step 4: train loss 4.616664 (took 16765.048234 ms)
step 5: train loss 4.231427 (took 16944.676229 ms)
step 6: train loss 3.753161 (took 20110.965357 ms)
step 7: train loss 3.650458 (took 18992.590776 ms)
step 8: train loss 4.182242 (took 19528.572922 ms)
step 9: train loss 4.199580 (took 17612.805042 ms)
```

Every step is around …
```makefile
CFLAGS = -O3 -Wno-unused-result
CFLAGS = -Ofast -Wno-unused-result
CFLAGS = -O3 -Ofast -fno-fast-math -Wno-unused-result
```
Guys wake up, new optimization just dropped: …
I'm working on Windows and see the same with MSVC. I use /O2 and /fp:fast. I think I've nailed it down to the GELU layer. Something is being over-optimized here, which results in blown-out grads. So first I've disabled the optimization just there for proof:

```c
#pragma optimize( "", off )
#define GELU_SCALING_FACTOR sqrtf(2.0f / M_PI)
void gelu_forward(float* out, float* inp, int N) {
    // (approximate) GeLU elementwise non-linearity in the MLP block of Transformer
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        out[i] = 0.5f * x * (1.0f + tanhf(GELU_SCALING_FACTOR * (x + cube)));
    }
}

void gelu_backward(float* dinp, float* inp, float* dout, int N) {
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        float tanh_arg = GELU_SCALING_FACTOR * (x + cube);
        float tanh_out = tanhf(tanh_arg);
        float coshf_out = coshf(tanh_arg);
        float sech_out = 1.0f / (coshf_out * coshf_out);
        float local_grad = 0.5f * (1.0f + tanh_out) + x * 0.5f * sech_out * GELU_SCALING_FACTOR * (1.0f + 3.0f * 0.044715f * x * x);
        dinp[i] += local_grad * dout[i];
    }
}
#pragma optimize( "", on )
```

And it passed. [GPT-2]

I've rearranged the code a bit to trick the compiler. Would someone please try this on their end and see if it solves the issue for you as well on Linux or Mac? If so, we can set up a PR.

```c
#define GELU_SCALING_FACTOR 0.7978845608 // sqrtf(2.0f / M_PI)
void gelu_forward(float* out, float* inp, int N) {
    // (approximate) GeLU elementwise non-linearity in the MLP block of Transformer
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        out[i] = 0.5 * x * (1 + tanhf(x * 0.7978845608 * (1 + 0.044715 * x * x)));
    }
}

float gelu_grad(float x) {
    float square = 0.044715f * x * x;
    float cube = square * x;
    float tanh_arg = GELU_SCALING_FACTOR * (x + cube);
    float tanh_out = tanhf(tanh_arg);
    float coshf_out = coshf(tanh_arg);
    float sech_out = 1.0 / (coshf_out * coshf_out);
    float local_grad = 0.5 * (1.0 + tanh_out) + x * 0.5f * sech_out * GELU_SCALING_FACTOR * (1.0 + 3.0 * square);
    return local_grad;
}

void gelu_backward(float* dinp, float* inp, float* dout, int N) {
    for (int i = 0; i < N; i++) {
        dinp[i] += gelu_grad(inp[i]) * dout[i];
    }
}
```

[GPT-2]
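If it helps anyone reproducing this, one way to sanity-check a rearranged gradient independently of PyTorch is a central finite-difference comparison. This is a hypothetical standalone harness, not part of the repo; it re-implements a scalar forward pass and the same local-gradient formula used above:

```c
// gelu_check.c -- compare the analytic GELU gradient against a central
// finite difference of the forward pass.
#include <math.h>
#include <stdio.h>

#define GELU_SCALING_FACTOR 0.7978845608f // sqrtf(2.0f / M_PI)

static float gelu(float x) {
    float cube = 0.044715f * x * x * x;
    return 0.5f * x * (1.0f + tanhf(GELU_SCALING_FACTOR * (x + cube)));
}

// same formula as the local_grad in gelu_backward above
static float gelu_grad(float x) {
    float square = 0.044715f * x * x;
    float cube = square * x;
    float tanh_arg = GELU_SCALING_FACTOR * (x + cube);
    float tanh_out = tanhf(tanh_arg);
    float coshf_out = coshf(tanh_arg);
    float sech_out = 1.0f / (coshf_out * coshf_out);
    return 0.5f * (1.0f + tanh_out)
         + x * 0.5f * sech_out * GELU_SCALING_FACTOR * (1.0f + 3.0f * square);
}

int main(void) {
    float xs[] = {-4.0f, -1.0f, -0.1f, 0.0f, 0.5f, 2.0f, 4.0f};
    float h = 1e-3f;
    for (int i = 0; i < 7; i++) {
        float x = xs[i];
        float numeric = (gelu(x + h) - gelu(x - h)) / (2.0f * h);
        float analytic = gelu_grad(x);
        printf("x=%6.2f  numeric=%9.6f  analytic=%9.6f  diff=%9.6f\n",
               x, numeric, analytic, fabsf(numeric - analytic));
    }
    return 0;
}
```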
Another thing to note is that when compiling without fast fp, one of the check tensors fails compared to the PyTorch version. It's a very small delta, but it definitely goes away when compiling with fast fp. OK (LOGITS)

```c
int check_tensor(float* a, float* b, int n, char* label) {
    int print_upto = 5;
    int ok = 1;
    int labelPrinted = 0;
    for (int i = 0; i < n; i++) {
        if (fabsf(a[i] - b[i]) > 1e-2) {
            // only report if mismatch
            if (!labelPrinted) {
                printf("%s: NOT OK\n", label);
                labelPrinted = 1;
            }
            if (print_upto-- > 0) {
                printf("\t[%d] %f %f, DIFF: %f\n", i, a[i], b[i], fabsf(a[i] - b[i]));
            }
            ok = 0;
        }
    }
    return ok;
}
```
@azret These changes make no difference on Linux. FYI compiling with …
Thank you for trying!

Thank you! Can you please also try to wrap just this block with #pragma optimize( "", off/on )? If this works for you, then we can be sure that it is in fact this function that is being over-optimized. It would be of much help, as I am trying to get /fp:fast working on Windows as well.

#pragma optimize( "", off )
-fno-finite-math-only for almost 2x speed up, fix for [#19]
Good idea @azret! The optimizations seem to work on … I looked at this again tonight, and after finding no NaNs returned from any … Someone can still optimize further; I couldn't figure out a way to get all the optimizations except on that single … In this process I noticed a lot of …
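For anyone who wants to keep -Ofast globally but exempt just the GELU backward pass, GCC has a per-function optimize attribute that can turn fast-math off for a single function. This is a sketch under the assumption you're building with GCC (Clang largely ignores this attribute with a warning), and I haven't measured whether it performs better than the -fno-finite-math-only fix:

```c
// GCC-specific: build this one function as if -fno-fast-math were passed,
// while the rest of the translation unit keeps the Makefile's -O3/-Ofast flags.
// Intended as a drop-in for the function in train_gpt2.c; compiles with `gcc -c`.
#include <math.h>

#define GELU_SCALING_FACTOR 0.7978845608f // sqrtf(2.0f / M_PI)

__attribute__((optimize("no-fast-math")))
void gelu_backward(float* dinp, float* inp, float* dout, int N) {
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        float tanh_arg = GELU_SCALING_FACTOR * (x + cube);
        float tanh_out = tanhf(tanh_arg);
        float coshf_out = coshf(tanh_arg);
        float sech_out = 1.0f / (coshf_out * coshf_out);
        float local_grad = 0.5f * (1.0f + tanh_out)
                         + x * 0.5f * sech_out * GELU_SCALING_FACTOR
                           * (1.0f + 3.0f * 0.044715f * x * x);
        dinp[i] += local_grad * dout[i];
    }
}
```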
@jrrk2 re "complexity of achieving Python environment": at risk of stating the obvious, this should work if your Python version is good:

Also, Python compiles really easily from source and is surprisingly small, with minimal dependencies, last I checked. On Linux at least, so it's worth trying if just installing a new binary version seems too mundane 😅
@ent0n29: You prompted me to look deeper into the compiler flags; I guess this is where you found and tested …

TIL 🙄 And of course the last one on the command line takes precedence 😅 The only difference between …
Sounds like …

The only difference between …
@dagelf Yeah, lately I'm deep into compilers, writing optimization passes directly in the LLVM source code; that's why I opened this too: #18 (comment). Why use flags like -Ofast or -O3 when we could write our own optimizations as LLVM passes targeting all the functions and operations in train_gpt2.c?
Let's continue the discussion about performance improvement efforts on CPU here: #253
When I run ./train_gpt2.c, I get the error above.