fix for Error: must forward with targets before backward [#19] #21

Closed · wants to merge 9 commits

Conversation

@ent0n29 (Contributor) commented Apr 9, 2024

fixes #19

@ent0n29 changed the title from "fix for issue #19" to "fix for Error: must forward with targets before backward [#19]" on Apr 9, 2024
@DongbinNie

The fix works on my machine running Ubuntu 22.04 with an Intel 13th-gen Core CPU, thanks.

@ent0n29 (Contributor, Author) commented Apr 9, 2024

Little update: the problem seems to be with -Ofast. Using -O3 and -Ofast together throws error #19, and using only -Ofast also throws it. With only -O3 it works, on Ubuntu 22.04.

@chsasank commented Apr 9, 2024

Can confirm this makes tests pass.

@chsasank commented Apr 9, 2024

Honestly weird this works. Did you just find a compiler bug in gcc?!

@ent0n29 (Contributor, Author) commented Apr 9, 2024

I found out that -Ofast enables a lot of aggressive optimizations that can alter the behaviour of the program; in particular it enables the flag -ffast-math, which can mess with floating-point arithmetic.
So keeping -O3 and -Ofast but disabling just the -ffast-math optimization works.
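As an illustration of the kind of semantic change involved, here is a minimal hypothetical sketch (not code from this repo): -ffast-math implies -ffinite-math-only, under which gcc may assume NaN and Inf never occur and fold NaN checks away entirely.

/* sketch.c - hypothetical illustration, not llm.c code */
#include <math.h>
#include <stdio.h>

int main(void) {
    float loss = nanf("");  // a NaN produced at runtime
    // Under -ffinite-math-only (implied by -ffast-math, and hence by
    // -Ofast), gcc may assume NaNs cannot exist and compile both checks
    // below to "false", silently changing any control flow that
    // depends on them.
    if (isnan(loss))  printf("loss is NaN\n");
    if (loss != loss) printf("NaN self-comparison detected\n");
    return 0;
}

Compiled with plain -O3 this prints both lines; with -O3 -ffast-math, both checks may be optimized out and nothing is printed.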

@karpathy (Owner)

@ent0n29 does this make the code slower for you?
I wonder if I should merge to master and potentially get rid of the README comment.

@ent0n29 (Contributor, Author) commented Apr 10, 2024

I don't think it makes the code slower, since -O3 and -Ofast stay enabled and only -ffast-math is disabled. I don't know how much -ffast-math contributes in terms of speed, but since we are working on floats it is important to maintain machine precision, and this flag creates problems with floating-point arithmetic.
I can't test the speed difference myself, since the build seems to work with -ffast-math enabled only on macOS machines; maybe you could check whether you see any notable differences?

more tests here -> #19 (comment)

@mc-cat-tty commented Apr 11, 2024

(quoting @ent0n29's comment above)

I have collected some data by running OMP_NUM_THREADS=8 ./train_gpt2 on an Apple Silicon M2. While not the best possible benchmark, it provides a clear picture of fast-math's contribution.

As a side note: both of the following runs were executed under roughly the same overall machine load.

Enabling -O3 -Ofast -Wno-unused-result (the default CFLAGS at the moment), I got the following result:

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.252026
step 0: train loss 5.356190 (took 3303.293000 ms)
step 1: train loss 4.301069 (took 3191.455000 ms)
step 2: train loss 4.623322 (took 2881.769000 ms)
step 3: train loss 4.600470 (took 2835.179000 ms)
step 4: train loss 4.616786 (took 2947.070000 ms)
step 5: train loss 4.231483 (took 2944.147000 ms)
step 6: train loss 3.754234 (took 3088.123000 ms)
step 7: train loss 3.652349 (took 2940.905000 ms)
step 8: train loss 4.183590 (took 2938.175000 ms)
step 9: train loss 4.199315 (took 3039.444000 ms)
val loss 4.425458

Disabling fast-math (with -fno-fast-math), performance drops, roughly doubling the time for each step:

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.251911
step 0: train loss 5.356084 (took 6769.919000 ms)
step 1: train loss 4.300642 (took 6584.231000 ms)
step 2: train loss 4.623087 (took 6278.325000 ms)
step 3: train loss 4.599361 (took 6234.404000 ms)
step 4: train loss 4.616666 (took 6789.834000 ms)
step 5: train loss 4.231430 (took 7304.708000 ms)
step 6: train loss 3.753160 (took 6959.676000 ms)
step 7: train loss 3.650456 (took 6762.026000 ms)
step 8: train loss 4.182242 (took 7499.524000 ms)
step 9: train loss 4.199580 (took 7104.202000 ms)
val loss 4.426363

Some stats to better evaluate the consistency of these measurements:

  • fast-math: mean := 3010.956 ms and std-dev := 138.254 ms
  • no-fast-math: mean := 6828.685 ms and std-dev := 386.642 ms

@ent0n29 (Contributor, Author) commented Apr 11, 2024

Yes, -Ofast with -ffast-math enabled seems to be the best option, but it looks like it only works on macOS;
-Ofast -fno-fast-math is a compromise, but it should work for everyone.

@mc-cat-tty

What about enabling fast-math conditionally, depending on the host OS?

uname -s could provide a means of detecting the OS, along the lines of the sketch below.
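A minimal sketch of what that could look like in a GNU Makefile (illustrative only, not the repo's actual build logic; it assumes the current default flags):

# Illustrative sketch: keep fast-math on macOS (Darwin), where it is
# reported to work, and disable it everywhere else.
UNAME_S := $(shell uname -s)

CFLAGS = -O3 -Ofast -Wno-unused-result
ifneq ($(UNAME_S),Darwin)
  CFLAGS += -fno-fast-math
endif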

@DongbinNie

After some investigation, it seems the error is caused by the gcc options "-fno-math-errno -funsafe-math-optimizations -ffinite-math-only", which are introduced in combination by "-Ofast" / "-ffast-math", and it can be solved by adding "-fno-finite-math-only" instead of "-fno-fast-math", for better performance.

Btw, if I change cc to clang on my Ubuntu 22.04, it works well without modifying any other options and achieves the best running speed.
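To make the flag relationship explicit: per gcc's documentation, -Ofast implies -ffast-math, which in turn enables -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only and several related sub-flags. The suggestion therefore amounts to something like this (a sketch, not the repo's Makefile):

# Keep -Ofast's speed but restore IEEE NaN/Inf handling by turning off
# only the problematic sub-flag instead of all of fast-math.
CFLAGS = -Ofast -fno-finite-math-only -Wno-unused-result

To try clang instead, something like make CC=clang should work, assuming the Makefile honors a CC override.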

@bexcite commented Apr 12, 2024

@DongbinNie wow, it's amazing: on my ThinkPad, using -Ofast with -fno-finite-math-only (and keeping cc), performance is almost 2x!!! (from 9309 ms to 4966 ms per step)

@ent0n29 (Contributor, Author) commented Apr 12, 2024

@DongbinNie I can confirm that performance is 2x with -fno-finite-math-only instead of -fno-fast-math, thanks for the help!
I was doing roughly the same thing, testing the optimization flags enabled by the -Ofast / -ffast-math combo.

@ent0n29 (Contributor, Author) commented Apr 15, 2024

@karpathy have you tried compiling with -fno-finite-math-only instead of -fno-fast-math?

@karpathy (Owner)

Sorry @ent0n29, what is the final recommendation here right now? Is it

CFLAGS = -Ofast -fno-finite-math-only -Wno-unused-result -march=native

@ent0n29 (Contributor, Author) commented Apr 16, 2024

Yes @karpathy, -fno-finite-math-only instead of -fno-fast-math, for an almost 2x improvement.

@ent0n29 (Contributor, Author) commented Apr 16, 2024

I went from ~17 seconds per step with -fno-fast-math, to this with -fno-finite-math-only & -march=native:

step 0: train loss 5.356086 (took 9611.016869 ms)
step 1: train loss 4.300644 (took 8780.770364 ms)
step 2: train loss 4.623083 (took 7137.313333 ms)
step 3: train loss 4.599365 (took 6426.557283 ms)
step 4: train loss 4.616659 (took 6864.495874 ms)
step 5: train loss 4.231428 (took 6635.326672 ms)
step 6: train loss 3.753164 (took 6516.244371 ms)
step 7: train loss 3.650456 (took 6362.145571 ms)
step 8: train loss 4.182243 (took 6333.873539 ms)
step 9: train loss 4.199581 (took 6407.929416 ms)

@ent0n29 (Contributor, Author) commented Apr 16, 2024

I'm closing this to open a new, synced one.

@ent0n29 closed this on Apr 16, 2024

Successfully merging this pull request may close these issues.

Error: must forward with targets before backward
6 participants