test: use TestExtras #1099
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Lux Benchmarks
Benchmark suite | Current: a4029fc | Previous: 132619c | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4083 ns | 3875 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4375 ns | 4208 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5292 ns | 5250 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4104.5 ns | 4333 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 60940 ns | 61892.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10500 ns | 10542 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10291 ns | 10209 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11541 ns | 10459 ns | 1.10 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10291 ns | 10417 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 430924 ns | 433097 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1000 ns | 1084 ns | 0.92 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1250 ns | 1291 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1208 ns | 1292 ns | 0.93 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1084 ns | 1209 ns | 0.90 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18637 ns | 18531 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4125 ns | 4167 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4083 ns | 3917 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4250 ns | 4250 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4083 ns | 4083 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 111606 ns | 111975 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57500 ns | 57583 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46334 ns | 46292 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46417 ns | 38042 ns | 1.22 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81791 ns | 83125 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38028 ns | 37370 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2043104 ns | 2031625 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2081625 ns | 2085958 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2093375 ns | 2088333.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1997603.5 ns | 2005041 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198186 ns | 198108 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 143708 ns | 143750 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 143541 ns | 146063 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 145083 ns | 145209 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 169146 ns | 144583.5 ns | 1.17 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166323.5 ns | 166112.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1109042 ns | 1118042 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1107083.5 ns | 1114250 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1120500 ns | 1153000 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1117812.5 ns | 1068770.5 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 533808 ns | 533468 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3667 ns | 3584 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4125 ns | 3750 ns | 1.10 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4667 ns | 4417 ns | 1.06 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3584 ns | 3958 ns | 0.91 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 68637 ns | 72081 ns | 0.95 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9500 ns | 9000 ns | 1.06 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9625 ns | 8542 ns | 1.13 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10458 ns | 9041 ns | 1.16 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9084 ns | 8916 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 489949 ns | 503190.5 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 14500 ns | 15000 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17875 ns | 15250 ns | 1.17 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 16833.5 ns | 16708 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15417 ns | 15542 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 54785 ns | 55903 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219791 ns | 214187.5 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213125 ns | 213604.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214959 ns | 215395.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213291 ns | 212917 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 271573 ns | 278881 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 834 ns | 500 ns | 1.67 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 542 ns | 1.15 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 750 ns | 0.89 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 583 ns | 583 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17528 ns | 17733 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1667 ns | 1625 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1500 ns | 1500 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1625 ns | 1.15 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1625 ns | 1583 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 102518 ns | 105125.5 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7250 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5916 ns | 5833 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5917 ns | 5250 ns | 1.13 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9958 ns | 10084 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23868 ns | 24106 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225916.5 ns | 220750 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230020.5 ns | 228084 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 230895.5 ns | 230459 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214062.5 ns | 213708.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 169436 ns | 169707.5 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3917 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3917 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23681 ns | 23637 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 17000 ns | 16708 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16667 ns | 16834 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16875 ns | 16875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16667 ns | 16625 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 161775 ns | 161602 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 571041 ns | 578416.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 572042 ns | 569958 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 581375 ns | 579292 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 572750 ns | 578291 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113713 ns | 113009 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1412416 ns | 1417979.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1415000 ns | 1419167 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1419895.5 ns | 1424875 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1420709 ns | 1426416 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 210302 ns | 210883 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1082979.5 ns | 1067000 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 951125 ns | 958417 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1349083 ns | 1336917 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1300542 ns | 1304396 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 273530.5 ns | 271759 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5927917 ns | 5795104.5 ns | 1.02 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4593708 ns | 4601125 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4969959 ns | 4929084 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5668104.5 ns | 5750083 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1072557 ns | 1068932 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 583 ns | 500 ns | 1.17 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 583 ns | 542 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23820 ns | 23274 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2125 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2250 ns | 2166 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2208 ns | 2167 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2250 ns | 2208 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 169440 ns | 171283 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4604.5 ns | 4333 ns | 1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4833 ns | 4125 ns | 1.17 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5292 ns | 5083 ns | 1.04 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4395.5 ns | 4292 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 65255 ns | 66130 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11416.5 ns | 11625 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11625 ns | 11458 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12167 ns | 12458 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11875 ns | 11709 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 450103 ns | 452684.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6958 ns | 6375 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6958 ns | 6959 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7917 ns | 8229.5 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7000 ns | 6916 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 52300 ns | 52019 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17500 ns | 16875 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17542 ns | 17000 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 19229.5 ns | 18166 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17917 ns | 17542 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 302036 ns | 301500.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 667 ns | 542 ns | 1.23 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 667 ns | 542 ns | 1.23 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 750 ns | 666 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 750 ns | 667 ns | 1.12 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 33102 ns | 32512 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8791 ns | 8500 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8937.5 ns | 8750 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 9500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9645.5 ns | 8959 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 158238 ns | 157915 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64709 ns | 64542 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64459 ns | 64625 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64500 ns | 64750 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64458 ns | 64875 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111486 ns | 111658.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 281042 ns | 279708 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 279542 ns | 283750 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 274292 ns | 293250 ns | 0.94 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 283875 ns | 284521 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 183452 ns | 185586.5 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3368667 ns | 3282500 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3082959 ns | 3076875 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3041917 ns | 2795834 ns | 1.09 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 3930875 ns | 4063541.5 ns | 0.97 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 568345 ns | 567714 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7658875 ns | 7638583 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7434250 ns | 7366000 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7447833 ns | 7289042 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8211854.5 ns | 8172916 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1321696 ns | 1335450 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 17560104.5 ns | 17555833 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 17559000 ns | 17413291.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 17525708 ns | 17640417 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 14127062.5 ns | 14085667 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23564834 ns | 23644667 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33525417 ns | 33391375 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37150250 ns | 40912708 ns | 0.91 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35094333.5 ns | 35048479 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1859060 ns | 1855237.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 189921250 ns | 189754584 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 232647750 ns | 232353000 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 193793291 ns | 201284750 ns | 0.96 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 434990500 ns | 435226125 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13941036 ns | 13860033 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 292673333 ns | 290571042 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 335185917 ns | 334832916 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 296085479.5 ns | 303703583 ns | 0.97 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 396336125 ns | 393811604 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22146 ns | 21541 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23666 ns | 22375 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25812.5 ns | 23354 ns | 1.11 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24417 ns | 24500 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 97629 ns | 95582 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 104416 ns | 103250 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 104020.5 ns | 115312.5 ns | 0.90 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 105250 ns | 104625 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103250 ns | 102667 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 517099 ns | 503695.5 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6250 ns | 5750 ns | 1.09 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6167 ns | 5791 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8208 ns | 7666 ns | 1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6437.5 ns | 6250 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 70044 ns | 68642 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14875 ns | 14875 ns | 1 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15250 ns | 14625 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16166 ns | 16250 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14958 ns | 14833 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 490718.5 ns | 478112.5 ns | 1.03 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3006729 ns | 3019792 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2070041 ns | 2069896 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2284396.5 ns | 2279000 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4891187 ns | 4750917 ns | 1.03 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 584210 ns | 583001 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23526125 ns | 23604770.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18048250 ns | 18003875 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18140125 ns | 18293125 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35506875 ns | 35919729.5 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3110700 ns | 3106744 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33335958 ns | 33297687 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27594458 ns | 27474958 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28556354 ns | 29070229.5 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41469000 ns | 41830959 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72333 ns | 73396 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75375 ns | 75125 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75083 ns | 74875 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72770.5 ns | 72959 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 103461 ns | 103514 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 206770.5 ns | 274208 ns | 0.75 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 297562.5 ns | 205959 ns | 1.44 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221041 ns | 255333 ns | 0.87 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205479.5 ns | 296916 ns | 0.69 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 551540 ns | 554316 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12167 ns | 11167 ns | 1.09 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12458 ns | 11875 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13500 ns | 13458 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12104.5 ns | 12458 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 72247.5 ns | 72256.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26458 ns | 26583.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27645.5 ns | 26833 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27917 ns | 28084 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26979.5 ns | 26708 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 475419.5 ns | 483481.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12500 ns | 11520.5 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12917 ns | 13041 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13604.5 ns | 13750 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12416.5 ns | 12875 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 53143 ns | 52959.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25584 ns | 25500 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26292 ns | 25542 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26041 ns | 26375 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26395.5 ns | 26542 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 305462.5 ns | 310926 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 179500 ns | 179125 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 180687.5 ns | 182625 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 183812.5 ns | 183958 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182020.5 ns | 182416 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 57465 ns | 58111 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 584750 ns | 582958 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 585541 ns | 583209 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 591584 ns | 610042 ns | 0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 583084 ns | 582000 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 288858 ns | 286370 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6167 ns | 5729.5 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6833 ns | 6334 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7042 ns | 7500 ns | 0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6041 ns | 6083 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 71108.5 ns | 71136.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14000 ns | 14167 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15375 ns | 14500 ns | 1.06 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15209 ns | 15667 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14125 ns | 14667 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 466371.5 ns | 468005 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1215270.5 ns | 1186749.5 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1243541 ns | 1247334 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1265104 ns | 1282666.5 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 998125 ns | 841729 ns | 1.19 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301486 ns | 301667 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4108708 ns | 4101771 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4434771 ns | 4417458 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4556542 ns | 4790916 ns | 0.95 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 3704916 ns | 3731833.5 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1037137 ns | 1043818 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1833 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1875 ns | 1792 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1833 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1834 ns | 1834 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23824.5 ns | 23460 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4917 ns | 4875 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4958 ns | 4834 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4917 ns | 4917 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4917 ns | 4958 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 192406.5 ns | 189873 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6208 ns | 5792 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6458 ns | 6125 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7208 ns | 7187.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6208 ns | 6208 ns | 1 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 56624 ns | 55970.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10834 ns | 10625 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11916 ns | 11083 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11250 ns | 11584 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11875 ns | 11500 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 338922 ns | 332298.5 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 375 ns | 0.89 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23247 ns | 22660 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2958 ns | 2708 ns | 1.09 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2833 ns | 2750 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3041 ns | 3000 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2792 ns | 2709 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 161850.5 ns | 159360 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11542 ns | 11292 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11875 ns | 11792 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12750 ns | 13250 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11334 ns | 12229.5 ns | 0.93 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 58925 ns | 57130.5 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24583 ns | 24708 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24916 ns | 24167 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25208 ns | 25854 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24917 ns | 24916.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 299387 ns | 300198 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4208 ns | 4208 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4250 ns | 4125 ns | 1.03 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4208 ns | 4250 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4208 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 25565 ns | 24574 ns | 1.04 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16333 ns | 16166 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16167 ns | 16000 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16250 ns | 16042 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16125 ns | 16375 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 199746.5 ns | 201392 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5750 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5750 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5958 ns | 5875 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 5916 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34596 ns | 33153 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20875 ns | 20333 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20792 ns | 20792 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21291 ns | 20917 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20979 ns | 21375 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 178480.5 ns | 175780 ns | 1.02 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 423625 ns | 417417 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 384354 ns | 378854.5 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 484333 ns | 487270.5 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 102791 ns | 103917 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67358 ns | 66399.5 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 937687.5 ns | 877583 ns | 1.07 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 970875 ns | 949562.5 ns | 1.02 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1172499.5 ns | 1206625 ns | 0.97 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 443166.5 ns | 469167 ns | 0.94 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 191211.5 ns | 191112 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80583 ns | 85417 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81875 ns | 81083 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82041 ns | 84625 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80792 ns | 85417 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193663 ns | 193239.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1929125 ns | 1913750 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1696625 ns | 1913542 ns | 0.89 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1921709 ns | 1943083.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1867812.5 ns | 1906896 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 402663 ns | 406558 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 291 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22091 ns | 22047.5 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1834 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1916 ns | 1875 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1834 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 174094.5 ns | 171306.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6520.5 ns | 6209 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7792 ns | 6625 ns | 1.18 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7875 ns | 8542 ns | 0.92 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6333 ns | 7125 ns | 0.89 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 63458.5 ns | 60422 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9541.5 ns | 9000 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9417 ns | 8958 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9417 ns | 9584 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9375 ns | 9416 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 318478.5 ns | 313100.5 ns | 1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119562958 ns | 119013624.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174205375 ns | 174073709 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147918417 ns | 154836458 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 103338083 ns | 106465208 ns | 0.97 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5458392 ns | 5473107.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 615458021 ns | 615549000 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 554375542 ns | 555627500 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 448791791.5 ns | 469486625 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 752574437.5 ns | 758488604 ns | 0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34940940 ns | 34956527 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 653342791 ns | 650955333 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 664179520.5 ns | 665997520.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 583748437.5 ns | 596311875 ns | 0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 741232458 ns | 746344250 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59000 ns | 59041 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46125 ns | 47750 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47042 ns | 39041 ns | 1.20 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83333 ns | 84708.5 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38314 ns | 36941 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1904187.5 ns | 1922166 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1972750 ns | 1978041 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1980584 ns | 1990167 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1881166.5 ns | 1920167 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 177156 ns | 173728 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 268458 ns | 282041.5 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 278208.5 ns | 266458 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 287959 ns | 273853.5 ns | 1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267958.5 ns | 270333 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 137529 ns | 135453.5 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 695604 ns | 674666 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 684270.5 ns | 684354 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 666959 ns | 676145.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 585000 ns | 596375 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 743997.5 ns | 752272.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2252979 ns | 2253417 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2225916.5 ns | 2217895.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2183667 ns | 2190479 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2169792 ns | 2202416.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133681 ns | 133169 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5530291.5 ns | 5479500 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5499000 ns | 5506916 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5501458 ns | 5588312.5 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5511125 ns | 5564021 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 785850 ns | 794371.5 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 640709 ns | 646958 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 634375 ns | 656500 ns | 0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 647958 ns | 640416 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 644875 ns | 657291 ns | 0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46990 ns | 47817 ns | 0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1822667 ns | 1822375 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1724334 ns | 1719708 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1723187.5 ns | 1665541 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2100083 ns | 2108083 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 221327 ns | 227850 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58209 ns | 58458 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46042 ns | 45083 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45083 ns | 38041 ns | 1.19 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83208 ns | 84958 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28909 ns | 28842 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2044937.5 ns | 2030375 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2074937.5 ns | 2084312.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2098334 ns | 1787459 ns | 1.17 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1983208 ns | 2014583.5 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 189957.5 ns | 192397.5 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13300125 ns | 13382625 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12441249.5 ns | 12433458.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12494750 ns | 12571375 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 14866208 ns | 15143562.5 ns | 0.98 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 513748.5 ns | 514602 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47395083 ns | 47546916 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41816041 ns | 41875708 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41060375 ns | 41161020.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58402979 ns | 58396167 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3253507 ns | 3251545 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97062041.5 ns | 75047125 ns | 1.29 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 90812917 ns | 67897459 ns | 1.34 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90806458 ns | 90940166.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 98905645.5 ns | 99460667 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58875 ns | 58750 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47375 ns | 46875 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46833 ns | 38333 ns | 1.22 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83292 ns | 80334 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48465 ns | 46475 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1918416 ns | 1921416 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1970334 ns | 1976416 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1953375.5 ns | 1721708.5 ns | 1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1886542 ns | 1905000 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196063.5 ns | 190253.5 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 333 ns | 1.25 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 416 ns | 417 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 33464 ns | 31709.5 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6459 ns | 6125 ns | 1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6208 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6584 ns | 6583 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6792 ns | 6854.5 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 181531.5 ns | 176344 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 291 ns | 1.14 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31935 ns | 31144 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2792 ns | 2625 ns | 1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2875 ns | 2625 ns | 1.10 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2916 ns | 2833 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 2750 ns | 1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 168237.5 ns | 164923.5 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 286871562.5 ns | 285479083.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340292959 ns | 340672292 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 313683291.5 ns | 320528833.5 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 271677583 ns | 267627833 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7111375 ns | 7061953.5 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 998128375 ns | 1000752000 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 940217833 ns | 941508917 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 835201958 ns | 849741542 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1154673125 ns | 1162624583 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 34036515 ns | 33972568.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1669749000 ns | 1314224145.5 ns | 1.27 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1693232458 ns | 1312834041.5 ns | 1.29 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1603932000 ns | 1621294583 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1667049542 ns | 1681368042 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1412500 ns | 1461562.5 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1464666 ns | 1416958 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1418792 ns | 1414750 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1414333 ns | 1412375 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128139 ns | 127713.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5033791 ns | 5020125 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5012167 ns | 5027042 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5022688 ns | 4740833 ns | 1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5031750 ns | 5044042 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 594340 ns | 510137 ns | 1.17 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 166795333 ns | 171071812.5 ns | 0.98 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 127722249.5 ns | 126739625 ns | 1.01 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 125070854.5 ns | 146147041 ns | 0.86 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 153378167 ns | 168329334 ns | 0.91 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4886780 ns | 4881506 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 630808333 ns | 622612209 ns | 1.01 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 538148708 ns | 538980667 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 495624667 ns | 504257334 ns | 0.98 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 657078416 ns | 656863250 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 15762771 ns | 16684647 ns | 0.94 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8958041.5 ns | 8964583 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8947833.5 ns | 8900333 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7932750 ns | 7993333 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9741916 ns | 9790312.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1610530 ns | 1594468.5 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36113292 ns | 36115750.5 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37084584 ns | 36971083.5 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33617771 ns | 34444208 ns | 0.98 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38201125 ns | 37794834 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6453792 ns | 6465190.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47375 ns | 47292 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47583 ns | 47542 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47500 ns | 47584 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47500 ns | 47500 ns | 1 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18959 ns | 18793 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50417 ns | 50291.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50750 ns | 50417 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50583 ns | 50833 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50791 ns | 50750 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 219022 ns | 231220 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7084 ns | 6291 ns | 1.13 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7292 ns | 7084 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8416 ns | 7792 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6791 ns | 7542 ns | 0.90 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 103670 ns | 106604.5 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9895.5 ns | 10209 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10209 ns | 9833 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 10270.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10125 ns | 10459 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 624636.5 ns | 619990 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6084 ns | 5792 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6250 ns | 6416 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7917 ns | 7958 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6125 ns | 6042 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 151041 ns | 121725 ns | 1.24 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12958.5 ns | 13375 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13500 ns | 13000 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13458 ns | 13584 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13520.5 ns | 13375 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 607821.5 ns | 528027 ns | 1.15 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1125 ns | 959 ns | 1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1125 ns | 1083 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1125 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 33684 ns | 31705 ns | 1.06 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8084 ns | 7792 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8125 ns | 7667 ns | 1.06 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8375 ns | 8209 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8167 ns | 8666 ns | 0.94 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 251588.5 ns | 204125.5 ns | 1.23 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23250 ns | 23000 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23500 ns | 23084 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23583 ns | 23584 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23417 ns | 23500 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18914 ns | 18461 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52625 ns | 52458 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52625 ns | 52291 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52708.5 ns | 52791 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52625 ns | 52458 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 362580.5 ns | 286087.5 ns | 1.27 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1396458 ns | 1397209 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1398583 ns | 1395917 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1399229 ns | 1400209 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1447520.5 ns | 1398500 ns | 1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 197200.5 ns | 195540.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5036041 ns | 5008458.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5013875 ns | 5018750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5014270.5 ns | 4722750 ns | 1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5017417 ns | 4703042 ns | 1.07 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 701302 ns | 626852.5 ns | 1.12 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3044084 ns | 3063416 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2082520.5 ns | 2063875 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2291250 ns | 2311417 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4557292 ns | 4823500 ns | 0.94 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 583956 ns | 580360 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24426437.5 ns | 24332959 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18927375 ns | 18875458 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18983062.5 ns | 18989334 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36883604.5 ns | 36748479.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3211711 ns | 3188758 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34120104 ns | 34048562.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28355000 ns | 28257854 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28027812.5 ns | 28468541.5 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41664437.5 ns | 41851021 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 143414750 ns | 144123292 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 146649291 ns | 147912291 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 126495354.5 ns | 128219729 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173120645.5 ns | 175666645.5 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22568973 ns | 22797470 ns | 0.99 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1965865458 ns | 1274551333 ns | 1.54 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 831780792 ns | 1209986250 ns | 0.69 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 682570062.5 ns | 717258459 ns | 0.95 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 669462084 ns | 669341542 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118557806 ns | 118134658 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73500 ns | 75042 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 84958 ns | 73833 ns | 1.15 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76666 ns | 75813 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 85083 ns | 74125 ns | 1.15 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 287272.5 ns | 248024.5 ns | 1.16 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 294875 ns | 202750 ns | 1.45 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 286958 ns | 283250 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 260250 ns | 194000 ns | 1.34 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 229875 ns | 189583 ns | 1.21 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1480876 ns | 1272660.5 ns | 1.16 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
35495375 ns |
35542000 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
36303146 ns |
36428479 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
32522062.5 ns |
32734792 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
40428166.5 ns |
40941958 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5852098 ns |
5852888 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
147916959 ns |
147574354 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
153042041.5 ns |
154842271 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
139521896 ns |
142249771 ns |
0.98 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
283555833 ns |
285430916 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
34913814 ns |
34907859 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
120479333 ns |
119543458.5 ns |
1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
173824750 ns |
173916625 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
147602583 ns |
155928584 ns |
0.95 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
102571708 ns |
103545938 ns |
0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5467443 ns |
5470774 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
469778500 ns |
471171395.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
466895958 ns |
467366000 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
439357750 ns |
456719729 ns |
0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
740288000 ns |
738831458 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
32287610 ns |
32277660 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
640237334 ns |
709159062 ns |
0.90 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
654668125 ns |
654555208.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
572991708.5 ns |
585803354.5 ns |
0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
728632208 ns |
726547959 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) |
1317291 ns |
1242646 ns |
1.06 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) |
985354 ns |
968625.5 ns |
1.02 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) |
954875 ns |
674709 ns |
1.42 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) |
2090334 ns |
1941770.5 ns |
1.08 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA |
574648.5 ns |
569058 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) |
2968667 ns |
2969916 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) |
2607125 ns |
2603708 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) |
2618542 ns |
1985166.5 ns |
1.32 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) |
3693166 ns |
3729625 ns |
0.99 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA |
1918588 ns |
1762089 ns |
1.09 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) |
5834437.5 ns |
5801458 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) |
5783458 ns |
5780958 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) |
5803563 ns |
5645834 ns |
1.03 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) |
2886625 ns |
2921042 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7459 ns |
7250 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6125 ns |
5958 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6166 ns |
5333 ns |
1.16 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10083 ns |
10083 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
26589 ns |
25119 ns |
1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213854 ns |
215750 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221042 ns |
258458 ns |
0.86 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221125 ns |
221291.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
219959 ns |
207146 ns |
1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
292261.5 ns |
264756 ns |
1.10 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) |
309512521 ns |
308377104 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) |
229306208 ns |
231656291 ns |
0.99 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) |
199201146 ns |
224042396 ns |
0.89 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) |
307319792 ns |
307881333 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA |
7680668.5 ns |
7678620 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) |
1085399937 ns |
1097604312.5 ns |
0.99 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) |
903621916.5 ns |
920148521 ns |
0.98 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) |
817729000 ns |
858485833.5 ns |
0.95 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) |
1148602500 ns |
1150798750 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA |
26633251 ns |
26497955 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5500 ns |
4958.5 ns |
1.11 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5708 ns |
5583 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6500 ns |
6916.5 ns |
0.94 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5708 ns |
5541 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
182488.5 ns |
171524 ns |
1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7500 ns |
7542 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7625 ns |
6750 ns |
1.13 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7500 ns |
7458 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7667 ns |
7875 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
688950.5 ns |
670577.5 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
584 ns |
541 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
541 ns |
1.16 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
666 ns |
625 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
708 ns |
625 ns |
1.13 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
24411 ns |
23778 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9083 ns |
8708 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9375 ns |
8541.5 ns |
1.10 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9167 ns |
9458 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9291.5 ns |
9541.5 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
235750.5 ns |
233071 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
351895.5 ns |
353250 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
351750 ns |
353208 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
353062.5 ns |
352667 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
354479 ns |
352125 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
21359 ns |
21348 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
829208 ns |
822333 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
800208 ns |
774854 ns |
1.03 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
824146 ns |
777042 ns |
1.06 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
823125 ns |
825999.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
300583.5 ns |
286748 ns |
1.05 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
332458 ns |
336833 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
334875 ns |
335917 ns |
1.00 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
451125 ns |
445708 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
10584 ns |
10917 ns |
0.97 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
17729 ns |
17559 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
717292 ns |
713499.5 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
734604 ns |
730834 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
1006333 ns |
1027167 ns |
0.98 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
26917 ns |
26500 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
277239.5 ns |
260521.5 ns |
1.06 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
378792 ns |
371375 ns |
1.02 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
344042 ns |
346250 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
442042 ns |
445812.5 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
31416 ns |
30479 ns |
1.03 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
22416 ns |
22136 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
731209 ns |
734062.5 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
779563 ns |
773750.5 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
1031875 ns |
1061729 ns |
0.97 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
102333 ns |
98521 ns |
1.04 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
221693.5 ns |
220018.5 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3500 ns |
3375 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3542 ns |
3542 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
3708 ns |
3687.5 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3458 ns |
3583 ns |
0.97 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
17824 ns |
17780 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4291 ns |
4125 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4334 ns |
4167 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4334 ns |
4375 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4541 ns |
4500 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
291531 ns |
258504 ns |
1.13 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
3791 ns |
3750 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4208.5 ns |
3500 ns |
1.20 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4833 ns |
4917 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4166 ns |
4083 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
223915 ns |
200777 ns |
1.12 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8708 ns |
8417 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8625 ns |
8000 ns |
1.08 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8750 ns |
8625 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8792 ns |
8604.5 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1270382 ns |
1183716 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
203167 ns |
205708 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
208375 ns |
210125 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
210209 ns |
210375 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
201333 ns |
200375 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
35222.5 ns |
34375 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
601812.5 ns |
650916 ns |
0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
621604 ns |
666959 ns |
0.93 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
670333 ns |
624167 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
645479 ns |
632458 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
359781 ns |
343648 ns |
1.05 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
1011834 ns |
1000479 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
1004625 ns |
1007958 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
957062 ns |
974396 ns |
0.98 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
871125 ns |
894770.5 ns |
0.97 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
207547 ns |
207021.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
4541875 ns |
4512146 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
4686937.5 ns |
4708729.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
4492709 ns |
4609875 ns |
0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
5165833 ns |
5171208.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
940561 ns |
947853.5 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3792 ns |
3333 ns |
1.14 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3666 ns |
3083 ns |
1.19 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
3917 ns |
4333 ns |
0.90 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3500 ns |
3917 ns |
0.89 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
240134.5 ns |
218377.5 ns |
1.10 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7208 ns |
7375 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7375 ns |
6833 ns |
1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7625 ns |
7458 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7625 ns |
7459 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
1041039.5 ns |
1012916 ns |
1.03 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1634770.5 ns |
1641584 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1174688 ns |
1193979 ns |
0.98 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1359875 ns |
1342687.5 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2471583 ns |
2486625.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
214655.5 ns |
214048 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12346021 ns |
12366291.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9572458.5 ns |
9556958 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9258041.5 ns |
9332500 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18025666 ns |
18065166.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
1953259 ns |
1946882 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17404166 ns |
17346750 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14403583 ns |
14347000 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14381917 ns |
14486917 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21181270.5 ns |
21148167 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
88687.5 ns |
134750 ns |
0.66 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
90167 ns |
88584 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
93084 ns |
92042 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
89125 ns |
89042 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
126156 ns |
126624 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2038104.5 ns |
2031958 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2011625 ns |
2023083.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2026917 ns |
1756000 ns |
1.15 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2023229 ns |
2029583 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1056276 ns |
1029084 ns |
1.03 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
1791.5 ns |
1750 ns |
1.02 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
2791 ns |
2833 ns |
0.99 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
3500 ns |
2458 ns |
1.42 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
2000 ns |
2166.5 ns |
0.92 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15460 ns |
16055 ns |
0.96 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2792 ns |
2583 ns |
1.08 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2792 ns |
2500 ns |
1.12 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
2792 ns |
2750 ns |
1.02 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2750 ns |
2750 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
195419.5 ns |
191618 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7375 ns |
7416 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5458 ns |
5917 ns |
0.92 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5959 ns |
5125 ns |
1.16 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10083 ns |
10166 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34429 ns |
33917 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
246417 ns |
226396.5 ns |
1.09 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221667 ns |
222521 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
233584 ns |
221584 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
218792 ns |
207458 ns |
1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
349888.5 ns |
311723.5 ns |
1.12 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3708 ns |
3708 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3667 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3708 ns |
3750 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3709 ns |
3667 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22388 ns |
22860 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14500 ns |
14458 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14458 ns |
14291 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14334 ns |
14250 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14417 ns |
14667 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
496018 ns |
472859.5 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
93000 ns |
137417 ns |
0.68 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
94333.5 ns |
96458.5 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
96916.5 ns |
95833 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
138021 ns |
93125 ns |
1.48 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
125381 ns |
125940 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1942875 ns |
1921458.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1882875 ns |
1918166.5 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1927750 ns |
1817687.5 ns |
1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1922500 ns |
1914458 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1003127 ns |
951464 ns |
1.05 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) |
878833 ns |
869042 ns |
1.01 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) |
816854.5 ns |
815167 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) |
1222604 ns |
1175833 ns |
1.04 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) |
970625 ns |
967562.5 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA |
277917 ns |
276671 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) |
2747604 ns |
2830583 ns |
0.97 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) |
2503916 ns |
2508062.5 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) |
3367063 ns |
3332875 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) |
3412875 ns |
3328000 ns |
1.03 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA |
1628640 ns |
1576106.5 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
15396 ns |
16000 ns |
0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18000 ns |
15625 ns |
1.15 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
20000 ns |
16458 ns |
1.22 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17875 ns |
16417 ns |
1.09 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
143796.5 ns |
143900.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
227875 ns |
255875.5 ns |
0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
215417 ns |
254271 ns |
0.85 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
258584 ns |
216250 ns |
1.20 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
256229.5 ns |
258021 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
646947.5 ns |
637843.5 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
220416 ns |
220792 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
220542 ns |
220667 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
224125 ns |
221208 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
222292 ns |
222208.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
273029 ns |
270997 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
510395.5 ns |
504458 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
506708 ns |
507416.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
562146 ns |
499833.5 ns |
1.12 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
510020.5 ns |
498875.5 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1446681 ns |
1304306.5 ns |
1.11 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
4292 ns |
3459 ns |
1.24 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
3750 ns |
3854.5 ns |
0.97 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
4791 ns |
5375 ns |
0.89 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
3645.5 ns |
4042 ns |
0.90 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
16893 ns |
16660 ns |
1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
7292 ns |
7166 ns |
1.02 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
7292 ns |
6458 ns |
1.13 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
7250 ns |
7209 ns |
1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
7500 ns |
7541.5 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
199189.5 ns |
194930.5 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17645.5 ns |
17666 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19437.5 ns |
17125 ns |
1.14 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21333 ns |
19729 ns |
1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19229.5 ns |
18000 ns |
1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
146525 ns |
146357.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
250125 ns |
244562 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
211541.5 ns |
237417 ns |
0.89 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
230750 ns |
214500 ns |
1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
253917 ns |
225208 ns |
1.13 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
967908.5 ns |
894981 ns |
1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
4437.5 ns |
4416 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4750 ns |
3917 ns |
1.21 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5083.5 ns |
5334 ns |
0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4562.5 ns |
4833 ns |
0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
248063 ns |
187684 ns |
1.32 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10333 ns |
10500 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10666 ns |
9708 ns |
1.10 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10834 ns |
11167 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10541.5 ns |
11250 ns |
0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
1105049.5 ns |
1024651 ns |
1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3625 ns |
3209 ns |
1.13 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3875 ns |
3250 ns |
1.19 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4583 ns |
4687.5 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3333 ns |
3791 ns |
0.88 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
251061.5 ns |
218725.5 ns |
1.15 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7792 ns |
7833 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7667 ns |
7291 ns |
1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7709 ns |
7625 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7916.5 ns |
7917 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
1114681 ns |
1043721.5 ns |
1.07 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23672334 ns |
23437104.5 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
35178458 ns |
35045979.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37815563 ns |
41490500 ns |
0.91 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34900167 ns |
34913479 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1833918 ns |
2126334.5 ns |
0.86 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
183930417 ns |
184798459 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
159549208 ns |
159330000 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
146565333.5 ns |
151477459 ns |
0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
410971708 ns |
411547250 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
16513855 ns |
16524151 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
427252041 ns |
427197208 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
253199000 ns |
252723645.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
297022499.5 ns |
305721250 ns |
0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
480585042 ns |
481095166 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
184458 ns |
182854.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
183874.5 ns |
182791.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
186250 ns |
185292 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
185042 ns |
185750 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
230772 ns |
173677.5 ns |
1.33 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
617333 ns |
629833 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
590750 ns |
631375 ns |
0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
637562.5 ns |
590542 ns |
1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
637792 ns |
630770.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1100938 ns |
1010062 ns |
1.09 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
3873083 ns |
3848041.5 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
3922104.5 ns |
4009000 ns |
0.98 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
3586041.5 ns |
3525583 ns |
1.02 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
4556916.5 ns |
4614917 ns |
0.99 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
533150 ns |
536882 ns |
0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
17436562.5 ns |
17371917 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
17829084 ns |
17740624.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
16603000 ns |
16856312.5 ns |
0.98 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
20181833 ns |
20403334 ns |
0.99 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2634997 ns |
2613028 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
583 ns |
500 ns |
1.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
500 ns |
1.25 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
708 ns |
625 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
708 ns |
667 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
32617 ns |
31917 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
8958 ns |
9334 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9583 ns |
8708 ns |
1.10 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10208 ns |
9875 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9792 ns |
9417 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
265095 ns |
260614 ns |
1.02 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) |
497943000 ns |
503086958 ns |
0.99 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) |
392660917 ns |
424620083.5 ns |
0.92 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) |
423842791 ns |
462339520.5 ns |
0.92 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) |
676715125 ns |
673052062 ns |
1.01 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA |
12483998.5 ns |
12478664.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) |
1879019416.5 ns |
1872018104.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) |
1625360250 ns |
1625413500 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) |
1495225542 ns |
1546440125 ns |
0.97 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) |
2203151229.5 ns |
2200566458.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA |
49422036 ns |
49139909 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1647417 ns |
1647791.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1190770.5 ns |
1202542 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1385833 ns |
1365999.5 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2341166 ns |
2393042 ns |
0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215752.5 ns |
215162 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12693416 ns |
12703083.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9888874.5 ns |
9880000 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9681500 ns |
9761146 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18308500 ns |
18559417 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2050707 ns |
2005712 ns |
1.02 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17696271 ns |
17693854 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14736375 ns |
14669187.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14587709 ns |
14767500 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21440292 ns |
21469542 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
26292 ns |
26250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
26334 ns |
26208 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
26167 ns |
26292 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
26250 ns |
26292 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
24734 ns |
23799 ns |
1.04 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
67375 ns |
66666 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
68125 ns |
66750 ns |
1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
67125 ns |
67209 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66833 ns |
67500 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
411788 ns |
380551.5 ns |
1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204208 ns |
203917 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
209666 ns |
209750 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
209917 ns |
210000 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199375 ns |
199958 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
27528 ns |
25800 ns |
1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
600791.5 ns |
648229.5 ns |
0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
630250 ns |
661271 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
670458 ns |
622750 ns |
1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
631312 ns |
586375 ns |
1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
357112 ns |
308724.5 ns |
1.16 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
683417 ns |
600291 ns |
1.14 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
639083 ns |
594125 ns |
1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
643125 ns |
544666 ns |
1.18 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
655937.5 ns |
652208 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
132998 ns |
131751 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2277042 ns |
2235000 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2212708 ns |
2235625 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2231208 ns |
2300854 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2235625 ns |
2253125 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1250744 ns |
1127758 ns |
1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17541 ns |
17541 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19333 ns |
16958 ns |
1.14 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21416.5 ns |
19917 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19604.5 ns |
17958 ns |
1.09 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
144909 ns |
145385 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
229625 ns |
261583 ns |
0.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
227666 ns |
260812.5 ns |
0.87 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
260500 ns |
220937.5 ns |
1.18 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
231084 ns |
230896 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1065824.5 ns |
982925 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
667 ns |
625 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
625 ns |
667 ns |
0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
23542 ns |
23015 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9625 ns |
9479.5 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
10000 ns |
9042 ns |
1.11 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10208.5 ns |
10292 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9875 ns |
9625 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
258222.5 ns |
257388 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6083.5 ns |
5458 ns |
1.11 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5917 ns |
5417 ns |
1.09 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6500 ns |
6625 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5708 ns |
6083 ns |
0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
236089 ns |
233603.5 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7500 ns |
7083 ns |
1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7542 ns |
7041 ns |
1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7666 ns |
7833 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7583.5 ns |
7375 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
800761.5 ns |
800650 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
2084 ns |
2000 ns |
1.04 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
2375 ns |
2125 ns |
1.12 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2458 ns |
2458 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
2209 ns |
2459 ns |
0.90 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
18200 ns |
17988 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
6958 ns |
6500 ns |
1.07 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
6750 ns |
6291 ns |
1.07 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6834 ns |
6708 ns |
1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
6500 ns |
6542 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
329767 ns |
330671 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
749000 ns |
749709 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
746792 ns |
747104 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
752000 ns |
749208 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
746875 ns |
751791.5 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
21064 ns |
21045 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
791375 ns |
791000 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
774979.5 ns |
791062.5 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
809646 ns |
775875 ns |
1.04 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
791333.5 ns |
775250 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
297256 ns |
294695 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7417 ns |
7208 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6042 ns |
5958 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5959 ns |
5291 ns |
1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10208 ns |
10208 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33581 ns |
32534 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
233792 ns |
233291 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
228229 ns |
267375 ns |
0.85 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
268812.5 ns |
227812.5 ns |
1.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
221000 ns |
213583 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
365038 ns |
361573 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10208 ns |
10020.5 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
10604.5 ns |
10042 ns |
1.06 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
10916 ns |
11625 ns |
0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10542 ns |
10208 ns |
1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
253826 ns |
248981.5 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
25375 ns |
26791 ns |
0.95 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24791 ns |
24292 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25167 ns |
24750 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
24792 ns |
25000 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
1115526 ns |
1132389 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
106176167 ns |
107227250 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
118171749.5 ns |
117058791.5 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
120350854.5 ns |
124034229 ns |
0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
117488979 ns |
117545541.5 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
2653192 ns |
2659866 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
393750042 ns |
393155000 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
366379167 ns |
366597250 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
355573791 ns |
357674666 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
484763208 ns |
490403667 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
15215468 ns |
15157994 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
937318521 ns |
758865499.5 ns |
1.24 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
757115375 ns |
580033084 ns |
1.31 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
744755937 ns |
748265062.5 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
943378104 ns |
948608916.5 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7375 ns |
6916.5 ns |
1.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7500 ns |
7000 ns |
1.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7416 ns |
8042 ns |
0.92 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7375 ns |
7625 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
241924 ns |
242461.5 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14250 ns |
14084 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14666 ns |
13500 ns |
1.09 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
13750 ns |
14208 ns |
0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14333 ns |
14333 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
1074204 ns |
1085062 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5791 ns |
5541 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6375 ns |
6563 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
6875 ns |
7666 ns |
0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6250 ns |
6291 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
235095 ns |
235371.5 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12292 ns |
12542 ns |
0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13084 ns |
12104.5 ns |
1.08 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12584 ns |
13042 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12583 ns |
12750 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
783718 ns |
793450.5 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
5750 ns |
5125 ns |
1.12 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
5625 ns |
5750 ns |
0.98 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
6459 ns |
6333 ns |
1.02 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
5541 ns |
5625 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
16882 ns |
16571 ns |
1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
15458 ns |
15792 ns |
0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
15542 ns |
15417 ns |
1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
15625 ns |
15625 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
15709 ns |
15750 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
199445.5 ns |
200110.5 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
417 ns |
292 ns |
1.43 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
416 ns |
0.90 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
417 ns |
417 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
23913 ns |
23594.5 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6459 ns |
5959 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6416 ns |
6083 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6500 ns |
6666 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6854.5 ns |
6834 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
240304 ns |
242427.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6000 ns |
5833 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5959 ns |
5834 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
5959 ns |
6000 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6000 ns |
6041 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
24668 ns |
24342.5 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
21187.5 ns |
20875 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
21042 ns |
21042 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
21209 ns |
21666 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
21459 ns |
21875 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
261526.5 ns |
262727.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
147583.5 ns |
185833 ns |
0.79 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
145666 ns |
144916.5 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
148333 ns |
146875 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
144500 ns |
144416.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
167437.5 ns |
167734 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1329042 ns |
1323750 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1319625 ns |
1312209 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1323417 ns |
1332875 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1327500 ns |
1333770.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1350011 ns |
1339118 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
22166 ns |
24041.5 ns |
0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
24833 ns |
22312.5 ns |
1.11 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
25208 ns |
24833 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
24542 ns |
24667 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
353253 ns |
351890.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
117709 ns |
170708 ns |
0.69 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
130208 ns |
177875 ns |
0.73 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
178999.5 ns |
118625 ns |
1.51 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
177750 ns |
120020.5 ns |
1.48 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1465805 ns |
1461877 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
416 ns |
0.90 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
23144 ns |
22590 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6458 ns |
6250 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6417 ns |
6250 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6625 ns |
6750 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6625 ns |
6583 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
255534 ns |
255552.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5166 ns |
4291 ns |
1.20 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
5000 ns |
4417 ns |
1.13 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
5104.5 ns |
5708 ns |
0.89 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4396 ns |
5292 ns |
0.83 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
256740.5 ns |
256272 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10500 ns |
10042 ns |
1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10250 ns |
9833 ns |
1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10500 ns |
10417 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10500 ns |
10333 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
1354747 ns |
1354208 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1666 ns |
1583 ns |
1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1625 ns |
1625 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1625 ns |
1666 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1584 ns |
1625 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
23049 ns |
22798 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
6000 ns |
5833 ns |
1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
5750 ns |
5709 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6083 ns |
6000 ns |
1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
5708 ns |
5916 ns |
0.96 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
272998.5 ns |
274328 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
6816125 ns |
6866624.5 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
6402167 ns |
6433708 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
6496250 ns |
6554499.5 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
7515896 ns |
7548875 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
214842.5 ns |
213149 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
24119875 ns |
24100417 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
21297750 ns |
21294521 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
21015750 ns |
21070125 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
29643833 ns |
29826667 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2104284 ns |
2116806 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
48636417 ns |
37336834 ns |
1.30 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
45606625 ns |
34197292 ns |
1.33 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
45739292 ns |
45794042 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
49289166.5 ns |
49624208 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6583 ns |
5750 ns |
1.14 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6375 ns |
5625 ns |
1.13 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6729.5 ns |
6791 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
6125 ns |
6667 ns |
0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
235503 ns |
236202.5 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8209 ns |
8084 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8959 ns |
7875 ns |
1.14 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8583 ns |
8667 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8667 ns |
9167 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1060587 ns |
1060405 ns |
1.00 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) |
1504292 ns |
1553542 ns |
0.97 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) |
1256167 ns |
1263041.5 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) |
1625333 ns |
1622041 ns |
1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) |
2157209 ns |
2175916 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA |
276714.5 ns |
272178 ns |
1.02 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) |
7846542 ns |
7902375 ns |
0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) |
6594750 ns |
6258292 ns |
1.05 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) |
7163917 ns |
7165958 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) |
10480542 ns |
10478104.5 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA |
1863181 ns |
1852121.5 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
364583.5 ns |
361584 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
371083 ns |
370750 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
459000 ns |
456417 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
25166 ns |
24999.5 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
46384 ns |
46439.5 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
742584 ns |
738895.5 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
806833 ns |
809958 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
1063250 ns |
1082542 ns |
0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
95708 ns |
76708 ns |
1.25 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
310356 ns |
301861.5 ns |
1.03 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
397625 ns |
397459 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
288083 ns |
288084 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
288333 ns |
212208 ns |
1.36 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
751750 ns |
755209 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
43878 ns |
43701 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
667417 ns |
665625 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
532042 ns |
530417 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
531875 ns |
473750 ns |
1.12 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
972625 ns |
974458 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
188552 ns |
189749 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
665500 ns |
649583 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
644145.5 ns |
641833 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
640042 ns |
545458.5 ns |
1.17 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
609542 ns |
653167 ns |
0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
132007 ns |
131877 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2499500 ns |
2454834 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2439604.5 ns |
2460271 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2452479 ns |
2500666 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2453583 ns |
2518479 ns |
0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1281567 ns |
1202049 ns |
1.07 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) |
3000 ns |
3000 ns |
1.00 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) |
2917 ns |
3500 ns |
0.83 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) |
4250 ns |
3500 ns |
1.21 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) |
3166.5 ns |
2708 ns |
1.17 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA |
16209 ns |
15904 ns |
1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) |
5625 ns |
5375 ns |
1.05 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) |
5542 ns |
5292 ns |
1.05 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) |
5583 ns |
5666 ns |
0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) |
5583 ns |
5750 ns |
0.97 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA |
197427.5 ns |
196388 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1460625 ns | 1465625 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1499500 ns | 1502708 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1500542 ns | 1496875 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1439750 ns | 1444792 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40557 ns | 40558 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5144333.5 ns | 5125396 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5277208 ns | 5286583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5290292 ns | 5312375 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4992083 ns | 4974792 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196743 ns | 195790.5 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3708 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3709 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3708 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 32882 ns | 32748 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15375 ns | 15083 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15375 ns | 15083 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15375 ns | 15167 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15416 ns | 15375 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 376061.5 ns | 375651.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71333 ns | 71125 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 70833 ns | 71167 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71000 ns | 71208 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71125 ns | 71083 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 112803 ns | 112958 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 319792 ns | 323791 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 318500 ns | 320458 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 318500 ns | 326875 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 318125 ns | 323000 ns | 0.98 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 192539.5 ns | 193747 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1042 ns | 1000 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 958 ns | 1.13 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1125 ns | 1042 ns | 1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1084 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23898 ns | 23358 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8166 ns | 7875 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8167 ns | 7834 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8708 ns | 8458 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8208 ns | 8833 ns | 0.93 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 259317.5 ns | 259209 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 509937.5 ns | 505375 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 487208.5 ns | 484292 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 562375 ns | 564542 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 212250 ns | 215062.5 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130057 ns | 128754 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1385062.5 ns | 1371334 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1461166.5 ns | 1393812.5 ns | 1.05 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1724562 ns | 1732333 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 867750 ns | 870083.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 273418 ns | 276302 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 417 ns | 292 ns | 1.43 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31926 ns | 31400 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6334 ns | 6167 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6292 ns | 6000 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7042 ns | 6500 ns | 1.08 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 6958 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 262567.5 ns | 263074.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1725167 ns | 1767042 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1722500 ns | 1725208 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1727542 ns | 1727292 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1723666 ns | 1726271 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 168619.5 ns | 168554 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4367833 ns | 4357521 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4376291 ns | 4359541 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4353604 ns | 4379875 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4366625 ns | 4377583 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1235288 ns | 1157059 ns | 1.07 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6833 ns | 6666 ns | 1.03 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6667 ns | 6666 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 9875 ns | 6916 ns | 1.43 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6917 ns | 7041.5 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 19683 ns | 20567 ns | 0.96 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51833 ns | 32834 ns | 1.58 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 52375 ns | 51229.5 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 72645.5 ns | 33541.5 ns | 2.17 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 34708 ns | 51062.5 ns | 0.68 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 209572 ns | 209739.5 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17708 ns | 17250 ns | 1.03 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 17875 ns | 17812.5 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 18667 ns | 18292 ns | 1.02 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17812.5 ns | 17708 ns | 1.01 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18259 ns | 17907 ns | 1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53333 ns | 53208 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53334 ns | 52959 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53417 ns | 53541 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53458 ns | 53291 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 343000.5 ns | 344400 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75333 ns | 75333 ns | 1 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75250 ns | 74959 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75167 ns | 75292 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75084 ns | 75000 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46885 ns | 47022 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 327500 ns | 325292 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 328708 ns | 324417 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 325125 ns | 343042 ns | 0.95 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 325875 ns | 327084 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 208870.5 ns | 210359 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1487167 ns | 1488333 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1528250 ns | 1527917 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1527208 ns | 1521042 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1464208 ns | 1466167 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52106 ns | 51138 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5135229 ns | 5120375 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5243250 ns | 5285750 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5280000 ns | 5309459 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4984000 ns | 4973917 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 203243.5 ns | 202631 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28208 ns | 28167 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28292 ns | 28125 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28167 ns | 28208 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28167 ns | 28209 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24493 ns | 24478 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66791 ns | 66208 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67583 ns | 66167 ns | 1.02 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66667 ns | 66250 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66458 ns | 66959 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 533204 ns | 533201 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1505333 ns | 1463833 ns | 1.03 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1127417 ns | 1144583 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1115333.5 ns | 832188 ns | 1.34 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2167937.5 ns | 2217792 ns | 0.98 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 580240.5 ns | 576305 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3100541.5 ns | 3077958.5 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2730979 ns | 2733167 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2737604 ns | 2620334 ns | 1.04 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3813083 ns | 3782000 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2047885 ns | 2001343 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 7941458 ns | 7887749.5 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7941125 ns | 7887771 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7897312 ns | 7989000 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4823416.5 ns | 4832458 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 79250 ns | 134958 ns | 0.59 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81916 ns | 78917 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83708 ns | 82625 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82375 ns | 81250 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194078 ns | 193237.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030188 ns | 2017354.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2011333.5 ns | 2006750 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2016458 ns | 2041167 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2010375 ns | 2018875 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 809832 ns | 797402 ns | 1.02 |
This comment was automatically generated by workflow using github-action-benchmark.
world age issue for the LuxLib tests
fixes #1098