fix for Error: must forward with targets before backward [#19] #21

Closed · wants to merge 9 commits

Conversation

@ent0n29 (Contributor) commented Apr 9, 2024

fixes #19

@ent0n29 changed the title from "fix for issue #19" to "fix for Error: must forward with targets before backward [#19]" on Apr 9, 2024
@DongbinNie

The fix works on my machine running Ubuntu 22.04 with an Intel 13th-gen Core CPU, thanks.

@ent0n29 (Contributor, Author) commented Apr 9, 2024

Little update: the problem seems to be with -Ofast. Using -O3 and -Ofast together throws error #19, and using only -Ofast also throws it. With only -O3 it works, on Ubuntu 22.04.

@chsasank commented Apr 9, 2024

Can confirm this makes tests pass.

@chsasank commented Apr 9, 2024

Honestly weird this works. Did you just find a compiler bug in gcc?!

@ent0n29 (Contributor, Author) commented Apr 9, 2024

I found out that -Ofast enables a lot of aggressive optimizations that can alter the behaviour of the program; in particular it enables the flag -ffast-math, which can mess with floating-point arithmetic.
So keeping -O3 and -Ofast but disabling just the -ffast-math optimization works.
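As an illustration of the kind of semantic change involved, here is a minimal hypothetical sketch (not code from this repo): -ffast-math implies -ffinite-math-only, under which gcc may assume NaN and Inf never occur and fold NaN checks away entirely.

/* sketch.c - hypothetical illustration, not llm.c code */
#include <math.h>
#include <stdio.h>

int main(void) {
    float loss = nanf("");  // a NaN produced at runtime
    // Under -ffinite-math-only (implied by -ffast-math, and hence by
    // -Ofast), gcc may assume NaNs cannot exist and compile both checks
    // below to "false", silently changing any control flow that
    // depends on them.
    if (isnan(loss))  printf("loss is NaN\n");
    if (loss != loss) printf("NaN self-comparison detected\n");
    return 0;
}

Compiled with plain -O3 this prints both lines; with -O3 -ffast-math, both checks may be optimized out and nothing is printed.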

@karpathy (Owner)

@ent0n29 does this make the code slower for you?
I wonder if I should merge to master and potentially get rid of the README comment.

@ent0n29 (Contributor, Author) commented Apr 10, 2024

I don't think it makes the code slower, since -O3 and -Ofast stay enabled and only -ffast-math is disabled. I don't know how much -ffast-math contributes in terms of speed, but since we are working on floats it is important to maintain machine precision, and this flag creates problems with floating-point arithmetic.
I can't test the speed difference myself, since the build seems to work with -ffast-math enabled only on macOS machines; maybe you could check whether you see any notable differences?

more tests here -> #19 (comment)

@mc-cat-tty commented Apr 11, 2024

(quoting @ent0n29's comment above)

I have collected some data by running OMP_NUM_THREADS=8 ./train_gpt2 on an Apple Silicon M2. While not the best possible benchmark, it provides a clear picture of fast-math's contribution.

As a side note: both of the following runs were executed under roughly the same overall machine load.

Enabling -O3 -Ofast -Wno-unused-result (the default CFLAGS at the moment), I got the following result:

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.252026
step 0: train loss 5.356190 (took 3303.293000 ms)
step 1: train loss 4.301069 (took 3191.455000 ms)
step 2: train loss 4.623322 (took 2881.769000 ms)
step 3: train loss 4.600470 (took 2835.179000 ms)
step 4: train loss 4.616786 (took 2947.070000 ms)
step 5: train loss 4.231483 (took 2944.147000 ms)
step 6: train loss 3.754234 (took 3088.123000 ms)
step 7: train loss 3.652349 (took 2940.905000 ms)
step 8: train loss 4.183590 (took 2938.175000 ms)
step 9: train loss 4.199315 (took 3039.444000 ms)
val loss 4.425458

Disabling fast-math (with -fno-fast-math), performance drops, roughly doubling the time for each step:

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.251911
step 0: train loss 5.356084 (took 6769.919000 ms)
step 1: train loss 4.300642 (took 6584.231000 ms)
step 2: train loss 4.623087 (took 6278.325000 ms)
step 3: train loss 4.599361 (took 6234.404000 ms)
step 4: train loss 4.616666 (took 6789.834000 ms)
step 5: train loss 4.231430 (took 7304.708000 ms)
step 6: train loss 3.753160 (took 6959.676000 ms)
step 7: train loss 3.650456 (took 6762.026000 ms)
step 8: train loss 4.182242 (took 7499.524000 ms)
step 9: train loss 4.199580 (took 7104.202000 ms)
val loss 4.426363

Some stats to better evaluate the consistency of these measurements:

  • fast-math: mean := 3010.956 ms and std-dev := 138.254 ms
  • no-fast-math: mean := 6828.685 ms and std-dev := 386.642 ms

@ent0n29 (Contributor, Author) commented Apr 11, 2024

Yes, -Ofast with -ffast-math enabled seems to be the best option, but it looks like it only works on macOS;
-Ofast -fno-fast-math is a compromise, but it should work for everyone.

@mc-cat-tty

What about enabling fast-math conditionally, depending on the host OS?

uname -s could provide a means of detecting the OS, along the lines of the sketch below.
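A minimal sketch of what that could look like in a GNU Makefile (illustrative only, not the repo's actual build logic; it assumes the current default flags):

# Illustrative sketch: keep fast-math on macOS (Darwin), where it is
# reported to work, and disable it everywhere else.
UNAME_S := $(shell uname -s)

CFLAGS = -O3 -Ofast -Wno-unused-result
ifneq ($(UNAME_S),Darwin)
  CFLAGS += -fno-fast-math
endif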

@DongbinNie

After some investigation, it seems the error is caused by the gcc options "-fno-math-errno -funsafe-math-optimizations -ffinite-math-only", which are introduced in combination by "-Ofast" / "-ffast-math", and it can be solved by adding "-fno-finite-math-only" instead of "-fno-fast-math", for better performance.

Btw, if I change cc to clang on my Ubuntu 22.04, it works well without modifying any other options and achieves the best running speed.
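To make the flag relationship explicit: per gcc's documentation, -Ofast implies -ffast-math, which in turn enables -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only and several related sub-flags. The suggestion therefore amounts to something like this (a sketch, not the repo's Makefile):

# Keep -Ofast's speed but restore IEEE NaN/Inf handling by turning off
# only the problematic sub-flag instead of all of fast-math.
CFLAGS = -Ofast -fno-finite-math-only -Wno-unused-result

To try clang instead, something like make CC=clang should work, assuming the Makefile honors a CC override.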

@bexcite commented Apr 12, 2024

@DongbinNie wow, it's amazing: on my ThinkPad, using -Ofast with -fno-finite-math-only (and keeping cc), performance is almost 2x!!! (from 9309 ms to 4966 ms per step)

@ent0n29 (Contributor, Author) commented Apr 12, 2024

@DongbinNie I can confirm that performance is 2x with -fno-finite-math-only instead of -fno-fast-math, thanks for the help!
I was doing roughly the same thing, testing the optimization flags enabled by the -Ofast / -ffast-math combo.

@ent0n29 (Contributor, Author) commented Apr 15, 2024

@karpathy have you tried compiling with -fno-finite-math-only instead of -fno-fast-math?

@karpathy (Owner)

Sorry @ent0n29, what is the final recommendation here right now? Is it

CFLAGS = -Ofast -fno-finite-math-only -Wno-unused-result -march=native

@ent0n29 (Contributor, Author) commented Apr 16, 2024

Yes @karpathy, -fno-finite-math-only instead of -fno-fast-math, for an almost 2x improvement.

@ent0n29 (Contributor, Author) commented Apr 16, 2024

I went from ~17 seconds per step with -fno-fast-math, to this with -fno-finite-math-only & -march=native:

step 0: train loss 5.356086 (took 9611.016869 ms)
step 1: train loss 4.300644 (took 8780.770364 ms)
step 2: train loss 4.623083 (took 7137.313333 ms)
step 3: train loss 4.599365 (took 6426.557283 ms)
step 4: train loss 4.616659 (took 6864.495874 ms)
step 5: train loss 4.231428 (took 6635.326672 ms)
step 6: train loss 3.753164 (took 6516.244371 ms)
step 7: train loss 3.650456 (took 6362.145571 ms)
step 8: train loss 4.182243 (took 6333.873539 ms)
step 9: train loss 4.199581 (took 6407.929416 ms)

@ent0n29 (Contributor, Author) commented Apr 16, 2024

I'm closing this to open a new, synced one.

@ent0n29 closed this on Apr 16, 2024

Successfully merging this pull request may close these issues.

Error: must forward with targets before backward
6 participants