
test: use TestExtras #1099

Open
wants to merge 4 commits into base: main

avik-pal
Member

fixes #1098

Contributor

github-actions bot commented Nov 22, 2024

Benchmark Results (ASV)

| Benchmark | main | a4029fc... | main/a4029fcfff83ff... |
|---|---|---|---|
| basics/overhead | 0.155 ± 0.0015 μs | 0.127 ± 0.0014 μs | 1.22 |
| time_to_load | 1.26 ± 0.024 s | 1.29 ± 0.028 s | 0.978 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
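
For readers checking the ASV numbers by hand, the sketch below recomputes the last column under the assumption that it is the `main` timing divided by the PR timing (0.155 μs / 0.127 μs ≈ 1.22 is consistent with that reading). The helper names `ratio` and `classify` and the 10% tolerance are hypothetical, chosen only for illustration; they are not part of the benchmark bot. Because the reported timings are rounded, a recomputed ratio may differ from the reported one in the last digit.

```python
# Hypothetical helpers: the names and the 10% tolerance are illustrative
# assumptions, not something the benchmark workflow defines.

def ratio(main_time: float, pr_time: float) -> float:
    """Return main / PR, the assumed definition of the ratio column."""
    return main_time / pr_time


def classify(r: float, tolerance: float = 0.10) -> str:
    """Label a main/PR ratio: above 1 means the PR took less time than main."""
    if r > 1 + tolerance:
        return "PR faster"
    if r < 1 - tolerance:
        return "PR slower"
    return "within tolerance"


# Rounded central values copied from the ASV results (uncertainties ignored).
rows = {
    "basics/overhead": (0.155, 0.127),  # microseconds
    "time_to_load": (1.26, 1.29),       # seconds
}

for name, (main_time, pr_time) in rows.items():
    r = ratio(main_time, pr_time)
    print(f"{name}: {r:.3f} ({classify(r)})")
```

This comparison ignores the ± uncertainties; a stricter check would treat a difference as meaningful only when it exceeds the reported spread.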

Contributor

@github-actions github-actions bot left a comment

Lux Benchmarks

Benchmark suite Current: a4029fc Previous: 132619c Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4083 ns 3875 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4375 ns 4208 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5292 ns 5250 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4104.5 ns 4333 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60940 ns 61892.5 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10500 ns 10542 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10291 ns 10209 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11541 ns 10459 ns 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10291 ns 10417 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 430924 ns 433097 ns 0.99
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1000 ns 1084 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1250 ns 1291 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1208 ns 1292 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1084 ns 1209 ns 0.90
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18637 ns 18531 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4125 ns 4167 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4083 ns 3917 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4250 ns 4250 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4083 ns 4083 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 111606 ns 111975 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57500 ns 57583 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46334 ns 46292 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46417 ns 38042 ns 1.22
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81791 ns 83125 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38028 ns 37370 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2043104 ns 2031625 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2081625 ns 2085958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093375 ns 2088333.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1997603.5 ns 2005041 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198186 ns 198108 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 143708 ns 143750 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 143541 ns 146063 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145083 ns 145209 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 169146 ns 144583.5 ns 1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166323.5 ns 166112.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1109042 ns 1118042 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1107083.5 ns 1114250 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1120500 ns 1153000 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1117812.5 ns 1068770.5 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 533808 ns 533468 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3667 ns 3584 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4125 ns 3750 ns 1.10
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4667 ns 4417 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3584 ns 3958 ns 0.91
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 68637 ns 72081 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9500 ns 9000 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9625 ns 8542 ns 1.13
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10458 ns 9041 ns 1.16
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9084 ns 8916 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 489949 ns 503190.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 14500 ns 15000 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17875 ns 15250 ns 1.17
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16833.5 ns 16708 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15417 ns 15542 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54785 ns 55903 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219791 ns 214187.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213125 ns 213604.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214959 ns 215395.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213291 ns 212917 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 271573 ns 278881 ns 0.97
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 834 ns 500 ns 1.67
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 750 ns 0.89
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 583 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17528 ns 17733 ns 0.99
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1667 ns 1625 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1500 ns 1500 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1625 ns 1.15
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1625 ns 1583 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 102518 ns 105125.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7250 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5916 ns 5833 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5917 ns 5250 ns 1.13
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9958 ns 10084 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23868 ns 24106 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225916.5 ns 220750 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230020.5 ns 228084 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230895.5 ns 230459 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214062.5 ns 213708.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 169436 ns 169707.5 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23681 ns 23637 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17000 ns 16708 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16667 ns 16834 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16875 ns 16875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16667 ns 16625 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 161775 ns 161602 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 571041 ns 578416.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 572042 ns 569958 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 581375 ns 579292 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 572750 ns 578291 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113713 ns 113009 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1412416 ns 1417979.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1415000 ns 1419167 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1419895.5 ns 1424875 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1420709 ns 1426416 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 210302 ns 210883 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1082979.5 ns 1067000 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 951125 ns 958417 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1349083 ns 1336917 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1300542 ns 1304396 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 273530.5 ns 271759 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5927917 ns 5795104.5 ns 1.02
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4593708 ns 4601125 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4969959 ns 4929084 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5668104.5 ns 5750083 ns 0.99
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1072557 ns 1068932 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 500 ns 1.17
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23820 ns 23274 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2250 ns 2166 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2208 ns 2167 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2250 ns 2208 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 169440 ns 171283 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4604.5 ns 4333 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4833 ns 4125 ns 1.17
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5292 ns 5083 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4395.5 ns 4292 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65255 ns 66130 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11416.5 ns 11625 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11625 ns 11458 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12167 ns 12458 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11875 ns 11709 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 450103 ns 452684.5 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6958 ns 6375 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6958 ns 6959 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7917 ns 8229.5 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7000 ns 6916 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52300 ns 52019 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17500 ns 16875 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17542 ns 17000 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 19229.5 ns 18166 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17917 ns 17542 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 302036 ns 301500.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 667 ns 542 ns 1.23
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 667 ns 542 ns 1.23
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 750 ns 666 ns 1.13
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 750 ns 667 ns 1.12
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33102 ns 32512 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8791 ns 8500 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8937.5 ns 8750 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9500 ns 9500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9645.5 ns 8959 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 158238 ns 157915 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64709 ns 64542 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64459 ns 64625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64500 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64458 ns 64875 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111486 ns 111658.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 281042 ns 279708 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 279542 ns 283750 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 274292 ns 293250 ns 0.94
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 283875 ns 284521 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 183452 ns 185586.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3368667 ns 3282500 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3082959 ns 3076875 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3041917 ns 2795834 ns 1.09
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3930875 ns 4063541.5 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 568345 ns 567714 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7658875 ns 7638583 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7434250 ns 7366000 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7447833 ns 7289042 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8211854.5 ns 8172916 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1321696 ns 1335450 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17560104.5 ns 17555833 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17559000 ns 17413291.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17525708 ns 17640417 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14127062.5 ns 14085667 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23564834 ns 23644667 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33525417 ns 33391375 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37150250 ns 40912708 ns 0.91
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35094333.5 ns 35048479 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1859060 ns 1855237.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189921250 ns 189754584 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 232647750 ns 232353000 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 193793291 ns 201284750 ns 0.96
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 434990500 ns 435226125 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13941036 ns 13860033 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 292673333 ns 290571042 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 335185917 ns 334832916 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 296085479.5 ns 303703583 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 396336125 ns 393811604 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22146 ns 21541 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23666 ns 22375 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25812.5 ns 23354 ns 1.11
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24417 ns 24500 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 97629 ns 95582 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 104416 ns 103250 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104020.5 ns 115312.5 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 105250 ns 104625 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103250 ns 102667 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 517099 ns 503695.5 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6250 ns 5750 ns 1.09
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6167 ns 5791 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8208 ns 7666 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6437.5 ns 6250 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 70044 ns 68642 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14875 ns 14875 ns 1
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15250 ns 14625 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16166 ns 16250 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14958 ns 14833 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 490718.5 ns 478112.5 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3006729 ns 3019792 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2070041 ns 2069896 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2284396.5 ns 2279000 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4891187 ns 4750917 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 584210 ns 583001 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23526125 ns 23604770.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18048250 ns 18003875 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18140125 ns 18293125 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35506875 ns 35919729.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3110700 ns 3106744 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33335958 ns 33297687 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27594458 ns 27474958 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28556354 ns 29070229.5 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41469000 ns 41830959 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72333 ns 73396 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75375 ns 75125 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75083 ns 74875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72770.5 ns 72959 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 103461 ns 103514 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 206770.5 ns 274208 ns 0.75
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 297562.5 ns 205959 ns 1.44
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221041 ns 255333 ns 0.87
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205479.5 ns 296916 ns 0.69
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 551540 ns 554316 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12167 ns 11167 ns 1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12458 ns 11875 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13500 ns 13458 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12104.5 ns 12458 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 72247.5 ns 72256.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26458 ns 26583.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27645.5 ns 26833 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27917 ns 28084 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26979.5 ns 26708 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 475419.5 ns 483481.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12500 ns 11520.5 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12917 ns 13041 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13604.5 ns 13750 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12416.5 ns 12875 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 53143 ns 52959.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25584 ns 25500 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26292 ns 25542 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26041 ns 26375 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26395.5 ns 26542 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 305462.5 ns 310926 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179500 ns 179125 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180687.5 ns 182625 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183812.5 ns 183958 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 182020.5 ns 182416 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 57465 ns 58111 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 584750 ns 582958 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 585541 ns 583209 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 591584 ns 610042 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583084 ns 582000 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 288858 ns 286370 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6167 ns 5729.5 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6833 ns 6334 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7042 ns 7500 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6041 ns 6083 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 71108.5 ns 71136.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14000 ns 14167 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15375 ns 14500 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15209 ns 15667 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14125 ns 14667 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 466371.5 ns 468005 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1215270.5 ns 1186749.5 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1243541 ns 1247334 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1265104 ns 1282666.5 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 998125 ns 841729 ns 1.19
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301486 ns 301667 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4108708 ns 4101771 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4434771 ns 4417458 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4556542 ns 4790916 ns 0.95
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3704916 ns 3731833.5 ns 0.99
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1037137 ns 1043818 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1792 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23824.5 ns 23460 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4958 ns 4834 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4917 ns 4917 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4917 ns 4958 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 192406.5 ns 189873 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6208 ns 5792 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6458 ns 6125 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7208 ns 7187.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6208 ns 6208 ns 1
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 56624 ns 55970.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10834 ns 10625 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11916 ns 11083 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11250 ns 11584 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11875 ns 11500 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 338922 ns 332298.5 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 23247 ns 22660 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2958 ns 2708 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2833 ns 2750 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3041 ns 3000 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2792 ns 2709 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 161850.5 ns 159360 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11542 ns 11292 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11875 ns 11792 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12750 ns 13250 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11334 ns 12229.5 ns 0.93
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 58925 ns 57130.5 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24583 ns 24708 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24916 ns 24167 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25208 ns 25854 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24917 ns 24916.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 299387 ns 300198 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4250 ns 4125 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 25565 ns 24574 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16333 ns 16166 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16167 ns 16000 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16250 ns 16042 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16125 ns 16375 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 199746.5 ns 201392 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5875 ns 5750 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5750 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5958 ns 5875 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5875 ns 5916 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34596 ns 33153 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20875 ns 20333 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20792 ns 20792 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21291 ns 20917 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20979 ns 21375 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 178480.5 ns 175780 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 423625 ns 417417 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 384354 ns 378854.5 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 484333 ns 487270.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 102791 ns 103917 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA 67358 ns 66399.5 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 937687.5 ns 877583 ns 1.07
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 970875 ns 949562.5 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1172499.5 ns 1206625 ns 0.97
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 443166.5 ns 469167 ns 0.94
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 191211.5 ns 191112 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80583 ns 85417 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81875 ns 81083 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82041 ns 84625 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80792 ns 85417 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193663 ns 193239.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1929125 ns 1913750 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1696625 ns 1913542 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1921709 ns 1943083.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1867812.5 ns 1906896 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 402663 ns 406558 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 291 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22091 ns 22047.5 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1916 ns 1875 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 174094.5 ns 171306.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6520.5 ns 6209 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7792 ns 6625 ns 1.18
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7875 ns 8542 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6333 ns 7125 ns 0.89
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 63458.5 ns 60422 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9541.5 ns 9000 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9417 ns 8958 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9417 ns 9584 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9375 ns 9416 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 318478.5 ns 313100.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 119562958 ns 119013624.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174205375 ns 174073709 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147918417 ns 154836458 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 103338083 ns 106465208 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5458392 ns 5473107.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 615458021 ns 615549000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 554375542 ns 555627500 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 448791791.5 ns 469486625 ns 0.96
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 752574437.5 ns 758488604 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34940940 ns 34956527 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 653342791 ns 650955333 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 664179520.5 ns 665997520.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 583748437.5 ns 596311875 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 741232458 ns 746344250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59000 ns 59041 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46125 ns 47750 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47042 ns 39041 ns 1.20
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83333 ns 84708.5 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38314 ns 36941 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1904187.5 ns 1922166 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1972750 ns 1978041 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1980584 ns 1990167 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1881166.5 ns 1920167 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 177156 ns 173728 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 268458 ns 282041.5 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 278208.5 ns 266458 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 287959 ns 273853.5 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267958.5 ns 270333 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 137529 ns 135453.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 695604 ns 674666 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 684270.5 ns 684354 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 666959 ns 676145.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 585000 ns 596375 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 743997.5 ns 752272.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2252979 ns 2253417 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2225916.5 ns 2217895.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2183667 ns 2190479 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2169792 ns 2202416.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133681 ns 133169 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5530291.5 ns 5479500 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5499000 ns 5506916 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5501458 ns 5588312.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5511125 ns 5564021 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 785850 ns 794371.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 640709 ns 646958 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 634375 ns 656500 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 647958 ns 640416 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 644875 ns 657291 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46990 ns 47817 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1822667 ns 1822375 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1724334 ns 1719708 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1723187.5 ns 1665541 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2100083 ns 2108083 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 221327 ns 227850 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58209 ns 58458 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46042 ns 45083 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45083 ns 38041 ns 1.19
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83208 ns 84958 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28909 ns 28842 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2044937.5 ns 2030375 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2074937.5 ns 2084312.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2098334 ns 1787459 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1983208 ns 2014583.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 189957.5 ns 192397.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13300125 ns 13382625 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12441249.5 ns 12433458.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12494750 ns 12571375 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 14866208 ns 15143562.5 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 513748.5 ns 514602 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47395083 ns 47546916 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41816041 ns 41875708 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41060375 ns 41161020.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58402979 ns 58396167 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3253507 ns 3251545 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 97062041.5 ns 75047125 ns 1.29
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 90812917 ns 67897459 ns 1.34
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90806458 ns 90940166.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 98905645.5 ns 99460667 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58875 ns 58750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47375 ns 46875 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46833 ns 38333 ns 1.22
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83292 ns 80334 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48465 ns 46475 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1918416 ns 1921416 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1970334 ns 1976416 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1953375.5 ns 1721708.5 ns 1.13
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1886542 ns 1905000 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196063.5 ns 190253.5 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 416 ns 417 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 33464 ns 31709.5 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6459 ns 6125 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6208 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6584 ns 6583 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6792 ns 6854.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 181531.5 ns 176344 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 291 ns 1.14
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31935 ns 31144 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2792 ns 2625 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2625 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2916 ns 2833 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 2750 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 168237.5 ns 164923.5 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 286871562.5 ns 285479083.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 340292959 ns 340672292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313683291.5 ns 320528833.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 271677583 ns 267627833 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7111375 ns 7061953.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 998128375 ns 1000752000 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 940217833 ns 941508917 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 835201958 ns 849741542 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1154673125 ns 1162624583 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34036515 ns 33972568.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1669749000 ns 1314224145.5 ns 1.27
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1693232458 ns 1312834041.5 ns 1.29
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1603932000 ns 1621294583 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1667049542 ns 1681368042 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1412500 ns 1461562.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1464666 ns 1416958 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1418792 ns 1414750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1414333 ns 1412375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128139 ns 127713.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5033791 ns 5020125 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5012167 ns 5027042 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5022688 ns 4740833 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5031750 ns 5044042 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 594340 ns 510137 ns 1.17
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 166795333 ns 171071812.5 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 127722249.5 ns 126739625 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 125070854.5 ns 146147041 ns 0.86
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 153378167 ns 168329334 ns 0.91
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4886780 ns 4881506 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 630808333 ns 622612209 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 538148708 ns 538980667 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 495624667 ns 504257334 ns 0.98
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 657078416 ns 656863250 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 15762771 ns 16684647 ns 0.94
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8958041.5 ns 8964583 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8947833.5 ns 8900333 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7932750 ns 7993333 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9741916 ns 9790312.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1610530 ns 1594468.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36113292 ns 36115750.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37084584 ns 36971083.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33617771 ns 34444208 ns 0.98
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38201125 ns 37794834 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6453792 ns 6465190.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47375 ns 47292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47583 ns 47542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47500 ns 47584 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47500 ns 47500 ns 1
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18959 ns 18793 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50417 ns 50291.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50750 ns 50417 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50583 ns 50833 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50791 ns 50750 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 219022 ns 231220 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7084 ns 6291 ns 1.13
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7292 ns 7084 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8416 ns 7792 ns 1.08
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6791 ns 7542 ns 0.90
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 103670 ns 106604.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9895.5 ns 10209 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10209 ns 9833 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10125 ns 10270.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10125 ns 10459 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 624636.5 ns 619990 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6084 ns 5792 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6250 ns 6416 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7917 ns 7958 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6125 ns 6042 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 151041 ns 121725 ns 1.24
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12958.5 ns 13375 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13500 ns 13000 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13458 ns 13584 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13520.5 ns 13375 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 607821.5 ns 528027 ns 1.15
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1125 ns 959 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1083 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1125 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 33684 ns 31705 ns 1.06
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8084 ns 7792 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8125 ns 7667 ns 1.06
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8209 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8167 ns 8666 ns 0.94
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 251588.5 ns 204125.5 ns 1.23
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23250 ns 23000 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23500 ns 23084 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23583 ns 23584 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23417 ns 23500 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18914 ns 18461 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52625 ns 52458 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52625 ns 52291 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52708.5 ns 52791 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52625 ns 52458 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 362580.5 ns 286087.5 ns 1.27
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1396458 ns 1397209 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1398583 ns 1395917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1399229 ns 1400209 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1447520.5 ns 1398500 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 197200.5 ns 195540.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5036041 ns 5008458.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5013875 ns 5018750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5014270.5 ns 4722750 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5017417 ns 4703042 ns 1.07
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 701302 ns 626852.5 ns 1.12
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3044084 ns 3063416 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2082520.5 ns 2063875 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2291250 ns 2311417 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4557292 ns 4823500 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 583956 ns 580360 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24426437.5 ns 24332959 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18927375 ns 18875458 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18983062.5 ns 18989334 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36883604.5 ns 36748479.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3211711 ns 3188758 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34120104 ns 34048562.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28355000 ns 28257854 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28027812.5 ns 28468541.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41664437.5 ns 41851021 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 143414750 ns 144123292 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 146649291 ns 147912291 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126495354.5 ns 128219729 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173120645.5 ns 175666645.5 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22568973 ns 22797470 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1965865458 ns 1274551333 ns 1.54
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 831780792 ns 1209986250 ns 0.69
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 682570062.5 ns 717258459 ns 0.95
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 669462084 ns 669341542 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118557806 ns 118134658 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 73500 ns 75042 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 84958 ns 73833 ns 1.15
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76666 ns 75813 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 85083 ns 74125 ns 1.15
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 287272.5 ns 248024.5 ns 1.16
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 294875 ns 202750 ns 1.45
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 286958 ns 283250 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 260250 ns 194000 ns 1.34
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 229875 ns 189583 ns 1.21
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1480876 ns 1272660.5 ns 1.16
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35495375 ns 35542000 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36303146 ns 36428479 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32522062.5 ns 32734792 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40428166.5 ns 40941958 ns 0.99
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5852098 ns 5852888 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 147916959 ns 147574354 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 153042041.5 ns 154842271 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 139521896 ns 142249771 ns 0.98
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 283555833 ns 285430916 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34913814 ns 34907859 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120479333 ns 119543458.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173824750 ns 173916625 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147602583 ns 155928584 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 102571708 ns 103545938 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5467443 ns 5470774 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 469778500 ns 471171395.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466895958 ns 467366000 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 439357750 ns 456719729 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 740288000 ns 738831458 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32287610 ns 32277660 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 640237334 ns 709159062 ns 0.90
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 654668125 ns 654555208.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 572991708.5 ns 585803354.5 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 728632208 ns 726547959 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1317291 ns 1242646 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 985354 ns 968625.5 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 954875 ns 674709 ns 1.42
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2090334 ns 1941770.5 ns 1.08
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 574648.5 ns 569058 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2968667 ns 2969916 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2607125 ns 2603708 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2618542 ns 1985166.5 ns 1.32
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3693166 ns 3729625 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1918588 ns 1762089 ns 1.09
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5834437.5 ns 5801458 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5783458 ns 5780958 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5803563 ns 5645834 ns 1.03
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2886625 ns 2921042 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7459 ns 7250 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 5958 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6166 ns 5333 ns 1.16
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10083 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26589 ns 25119 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213854 ns 215750 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221042 ns 258458 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221125 ns 221291.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219959 ns 207146 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 292261.5 ns 264756 ns 1.10
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 309512521 ns 308377104 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 229306208 ns 231656291 ns 0.99
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 199201146 ns 224042396 ns 0.89
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 307319792 ns 307881333 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7680668.5 ns 7678620 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1085399937 ns 1097604312.5 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 903621916.5 ns 920148521 ns 0.98
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 817729000 ns 858485833.5 ns 0.95
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1148602500 ns 1150798750 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26633251 ns 26497955 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5500 ns 4958.5 ns 1.11
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5708 ns 5583 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6500 ns 6916.5 ns 0.94
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5708 ns 5541 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 182488.5 ns 171524 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7542 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 6750 ns 1.13
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7458 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7667 ns 7875 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 688950.5 ns 670577.5 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 541 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 541 ns 1.16
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 666 ns 625 ns 1.07
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 708 ns 625 ns 1.13
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24411 ns 23778 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9083 ns 8708 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9375 ns 8541.5 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9167 ns 9458 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9291.5 ns 9541.5 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 235750.5 ns 233071 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351895.5 ns 353250 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351750 ns 353208 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 353062.5 ns 352667 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 354479 ns 352125 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21359 ns 21348 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 829208 ns 822333 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 800208 ns 774854 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 824146 ns 777042 ns 1.06
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 823125 ns 825999.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 300583.5 ns 286748 ns 1.05
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 332458 ns 336833 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 334875 ns 335917 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 451125 ns 445708 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10584 ns 10917 ns 0.97
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17729 ns 17559 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 717292 ns 713499.5 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 734604 ns 730834 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1006333 ns 1027167 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26917 ns 26500 ns 1.02
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 277239.5 ns 260521.5 ns 1.06
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 378792 ns 371375 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 344042 ns 346250 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 442042 ns 445812.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 31416 ns 30479 ns 1.03
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22416 ns 22136 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 731209 ns 734062.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 779563 ns 773750.5 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1031875 ns 1061729 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 102333 ns 98521 ns 1.04
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 221693.5 ns 220018.5 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3500 ns 3375 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3542 ns 3542 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3708 ns 3687.5 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3458 ns 3583 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17824 ns 17780 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4291 ns 4125 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4334 ns 4167 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4334 ns 4375 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4541 ns 4500 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 291531 ns 258504 ns 1.13
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3791 ns 3750 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4208.5 ns 3500 ns 1.20
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4833 ns 4917 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4166 ns 4083 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 223915 ns 200777 ns 1.12
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8708 ns 8417 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8625 ns 8000 ns 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8750 ns 8625 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8792 ns 8604.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1270382 ns 1183716 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203167 ns 205708 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 208375 ns 210125 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210209 ns 210375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 201333 ns 200375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35222.5 ns 34375 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 601812.5 ns 650916 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 621604 ns 666959 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 670333 ns 624167 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 645479 ns 632458 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 359781 ns 343648 ns 1.05
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1011834 ns 1000479 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1004625 ns 1007958 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 957062 ns 974396 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 871125 ns 894770.5 ns 0.97
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207547 ns 207021.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4541875 ns 4512146 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4686937.5 ns 4708729.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4492709 ns 4609875 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 5165833 ns 5171208.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 940561 ns 947853.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3792 ns 3333 ns 1.14
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3666 ns 3083 ns 1.19
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 3917 ns 4333 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3500 ns 3917 ns 0.89
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 240134.5 ns 218377.5 ns 1.10
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7208 ns 7375 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 6833 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7458 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7625 ns 7459 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1041039.5 ns 1012916 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1634770.5 ns 1641584 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1174688 ns 1193979 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1359875 ns 1342687.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2471583 ns 2486625.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214655.5 ns 214048 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12346021 ns 12366291.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9572458.5 ns 9556958 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9258041.5 ns 9332500 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18025666 ns 18065166.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1953259 ns 1946882 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17404166 ns 17346750 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14403583 ns 14347000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14381917 ns 14486917 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21181270.5 ns 21148167 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 88687.5 ns 134750 ns 0.66
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 90167 ns 88584 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 93084 ns 92042 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 89125 ns 89042 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126156 ns 126624 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2038104.5 ns 2031958 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2011625 ns 2023083.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2026917 ns 1756000 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2023229 ns 2029583 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1056276 ns 1029084 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 1791.5 ns 1750 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 2791 ns 2833 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3500 ns 2458 ns 1.42
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 2000 ns 2166.5 ns 0.92
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15460 ns 16055 ns 0.96
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2792 ns 2583 ns 1.08
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2792 ns 2500 ns 1.12
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2792 ns 2750 ns 1.02
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2750 ns 2750 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 195419.5 ns 191618 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7416 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5458 ns 5917 ns 0.92
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5959 ns 5125 ns 1.16
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10166 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34429 ns 33917 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 246417 ns 226396.5 ns 1.09
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221667 ns 222521 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 233584 ns 221584 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218792 ns 207458 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 349888.5 ns 311723.5 ns 1.12
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3667 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22388 ns 22860 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14500 ns 14458 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14458 ns 14291 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14334 ns 14250 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14417 ns 14667 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 496018 ns 472859.5 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 93000 ns 137417 ns 0.68
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 94333.5 ns 96458.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96916.5 ns 95833 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 138021 ns 93125 ns 1.48
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125381 ns 125940 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1942875 ns 1921458.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1882875 ns 1918166.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1927750 ns 1817687.5 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1922500 ns 1914458 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1003127 ns 951464 ns 1.05
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 878833 ns 869042 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 816854.5 ns 815167 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1222604 ns 1175833 ns 1.04
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 970625 ns 967562.5 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 277917 ns 276671 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2747604 ns 2830583 ns 0.97
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2503916 ns 2508062.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3367063 ns 3332875 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3412875 ns 3328000 ns 1.03
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1628640 ns 1576106.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15396 ns 16000 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18000 ns 15625 ns 1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20000 ns 16458 ns 1.22
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17875 ns 16417 ns 1.09
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 143796.5 ns 143900.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 227875 ns 255875.5 ns 0.89
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215417 ns 254271 ns 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 258584 ns 216250 ns 1.20
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 256229.5 ns 258021 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 646947.5 ns 637843.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 220416 ns 220792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220542 ns 220667 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 224125 ns 221208 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 222292 ns 222208.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 273029 ns 270997 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 510395.5 ns 504458 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 506708 ns 507416.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 562146 ns 499833.5 ns 1.12
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 510020.5 ns 498875.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1446681 ns 1304306.5 ns 1.11
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 4292 ns 3459 ns 1.24
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 3750 ns 3854.5 ns 0.97
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 4791 ns 5375 ns 0.89
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3645.5 ns 4042 ns 0.90
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16893 ns 16660 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7292 ns 7166 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7292 ns 6458 ns 1.13
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7250 ns 7209 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7500 ns 7541.5 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 199189.5 ns 194930.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17645.5 ns 17666 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19437.5 ns 17125 ns 1.14
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21333 ns 19729 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19229.5 ns 18000 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146525 ns 146357.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 250125 ns 244562 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 211541.5 ns 237417 ns 0.89
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 230750 ns 214500 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 253917 ns 225208 ns 1.13
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 967908.5 ns 894981 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4437.5 ns 4416 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4750 ns 3917 ns 1.21
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5083.5 ns 5334 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4562.5 ns 4833 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 248063 ns 187684 ns 1.32
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10333 ns 10500 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10666 ns 9708 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10834 ns 11167 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10541.5 ns 11250 ns 0.94
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1105049.5 ns 1024651 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3625 ns 3209 ns 1.13
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3875 ns 3250 ns 1.19
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4583 ns 4687.5 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3333 ns 3791 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 251061.5 ns 218725.5 ns 1.15
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7792 ns 7833 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7667 ns 7291 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7709 ns 7625 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7916.5 ns 7917 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1114681 ns 1043721.5 ns 1.07
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23672334 ns 23437104.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35178458 ns 35045979.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37815563 ns 41490500 ns 0.91
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34900167 ns 34913479 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1833918 ns 2126334.5 ns 0.86
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 183930417 ns 184798459 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159549208 ns 159330000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146565333.5 ns 151477459 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 410971708 ns 411547250 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16513855 ns 16524151 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 427252041 ns 427197208 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 253199000 ns 252723645.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 297022499.5 ns 305721250 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 480585042 ns 481095166 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 184458 ns 182854.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183874.5 ns 182791.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 186250 ns 185292 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 185042 ns 185750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 230772 ns 173677.5 ns 1.33
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 617333 ns 629833 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 590750 ns 631375 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 637562.5 ns 590542 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 637792 ns 630770.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1100938 ns 1010062 ns 1.09
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3873083 ns 3848041.5 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3922104.5 ns 4009000 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3586041.5 ns 3525583 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4556916.5 ns 4614917 ns 0.99
batchedmm(128, Bsize=512)/forward/GPU/CUDA 533150 ns 536882 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17436562.5 ns 17371917 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17829084 ns 17740624.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16603000 ns 16856312.5 ns 0.98
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 20181833 ns 20403334 ns 0.99
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2634997 ns 2613028 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 708 ns 625 ns 1.13
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 708 ns 667 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32617 ns 31917 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8958 ns 9334 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9583 ns 8708 ns 1.10
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10208 ns 9875 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9792 ns 9417 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 265095 ns 260614 ns 1.02
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 497943000 ns 503086958 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 392660917 ns 424620083.5 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 423842791 ns 462339520.5 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 676715125 ns 673052062 ns 1.01
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12483998.5 ns 12478664.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1879019416.5 ns 1872018104.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1625360250 ns 1625413500 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1495225542 ns 1546440125 ns 0.97
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2203151229.5 ns 2200566458.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49422036 ns 49139909 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1647417 ns 1647791.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1190770.5 ns 1202542 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1385833 ns 1365999.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2341166 ns 2393042 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215752.5 ns 215162 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12693416 ns 12703083.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9888874.5 ns 9880000 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9681500 ns 9761146 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18308500 ns 18559417 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2050707 ns 2005712 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17696271 ns 17693854 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14736375 ns 14669187.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14587709 ns 14767500 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21440292 ns 21469542 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26334 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26167 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24734 ns 23799 ns 1.04
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67375 ns 66666 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 68125 ns 66750 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67125 ns 67209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66833 ns 67500 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 411788 ns 380551.5 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204208 ns 203917 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209666 ns 209750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209917 ns 210000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199375 ns 199958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27528 ns 25800 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 600791.5 ns 648229.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 630250 ns 661271 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 670458 ns 622750 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 631312 ns 586375 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 357112 ns 308724.5 ns 1.16
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 683417 ns 600291 ns 1.14
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 639083 ns 594125 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 643125 ns 544666 ns 1.18
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 655937.5 ns 652208 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132998 ns 131751 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2277042 ns 2235000 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2212708 ns 2235625 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2231208 ns 2300854 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2235625 ns 2253125 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1250744 ns 1127758 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17541 ns 17541 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19333 ns 16958 ns 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21416.5 ns 19917 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19604.5 ns 17958 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 144909 ns 145385 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229625 ns 261583 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227666 ns 260812.5 ns 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 260500 ns 220937.5 ns 1.18
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 231084 ns 230896 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1065824.5 ns 982925 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23542 ns 23015 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9625 ns 9479.5 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10000 ns 9042 ns 1.11
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10208.5 ns 10292 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9875 ns 9625 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 258222.5 ns 257388 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6083.5 ns 5458 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5917 ns 5417 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6500 ns 6625 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5708 ns 6083 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 236089 ns 233603.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7083 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7542 ns 7041 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7666 ns 7833 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7583.5 ns 7375 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 800761.5 ns 800650 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2084 ns 2000 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2375 ns 2125 ns 1.12
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2458 ns 2458 ns 1
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2209 ns 2459 ns 0.90
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18200 ns 17988 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6958 ns 6500 ns 1.07
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6750 ns 6291 ns 1.07
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6834 ns 6708 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6500 ns 6542 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 329767 ns 330671 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 749000 ns 749709 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746792 ns 747104 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 752000 ns 749208 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 746875 ns 751791.5 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21064 ns 21045 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 791375 ns 791000 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 774979.5 ns 791062.5 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 809646 ns 775875 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 791333.5 ns 775250 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 297256 ns 294695 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7417 ns 7208 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 5958 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5959 ns 5291 ns 1.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10208 ns 10208 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33581 ns 32534 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 233792 ns 233291 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228229 ns 267375 ns 0.85
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 268812.5 ns 227812.5 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221000 ns 213583 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 365038 ns 361573 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10208 ns 10020.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10604.5 ns 10042 ns 1.06
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10916 ns 11625 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10542 ns 10208 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 253826 ns 248981.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25375 ns 26791 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24791 ns 24292 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25167 ns 24750 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24792 ns 25000 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1115526 ns 1132389 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106176167 ns 107227250 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 118171749.5 ns 117058791.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120350854.5 ns 124034229 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117488979 ns 117545541.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2653192 ns 2659866 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 393750042 ns 393155000 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 366379167 ns 366597250 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 355573791 ns 357674666 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 484763208 ns 490403667 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15215468 ns 15157994 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 937318521 ns 758865499.5 ns 1.24
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 757115375 ns 580033084 ns 1.31
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 744755937 ns 748265062.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 943378104 ns 948608916.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7375 ns 6916.5 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7500 ns 7000 ns 1.07
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7416 ns 8042 ns 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7375 ns 7625 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 241924 ns 242461.5 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14250 ns 14084 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14666 ns 13500 ns 1.09
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 13750 ns 14208 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14333 ns 14333 ns 1
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1074204 ns 1085062 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5791 ns 5541 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6375 ns 6563 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6875 ns 7666 ns 0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 6291 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 235095 ns 235371.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12292 ns 12542 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13084 ns 12104.5 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12584 ns 13042 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12583 ns 12750 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 783718 ns 793450.5 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5750 ns 5125 ns 1.12
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5625 ns 5750 ns 0.98
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6459 ns 6333 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5541 ns 5625 ns 0.99
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16882 ns 16571 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15458 ns 15792 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15542 ns 15417 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15625 ns 15625 ns 1
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15709 ns 15750 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 199445.5 ns 200110.5 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 417 ns 417 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23913 ns 23594.5 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6459 ns 5959 ns 1.08
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6416 ns 6083 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6666 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6854.5 ns 6834 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 240304 ns 242427.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6000 ns 5833 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5959 ns 5834 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5959 ns 6000 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6000 ns 6041 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24668 ns 24342.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21187.5 ns 20875 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21042 ns 21042 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21209 ns 21666 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21459 ns 21875 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 261526.5 ns 262727.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 147583.5 ns 185833 ns 0.79
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145666 ns 144916.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148333 ns 146875 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144500 ns 144416.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167437.5 ns 167734 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1329042 ns 1323750 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1319625 ns 1312209 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1323417 ns 1332875 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1327500 ns 1333770.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1350011 ns 1339118 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22166 ns 24041.5 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24833 ns 22312.5 ns 1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25208 ns 24833 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24542 ns 24667 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 353253 ns 351890.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 117709 ns 170708 ns 0.69
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 130208 ns 177875 ns 0.73
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 178999.5 ns 118625 ns 1.51
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 177750 ns 120020.5 ns 1.48
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1465805 ns 1461877 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23144 ns 22590 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6458 ns 6250 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6417 ns 6250 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6750 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6625 ns 6583 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 255534 ns 255552.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5166 ns 4291 ns 1.20
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5000 ns 4417 ns 1.13
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5104.5 ns 5708 ns 0.89
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4396 ns 5292 ns 0.83
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 256740.5 ns 256272 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10500 ns 10042 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10250 ns 9833 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns 10417 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10500 ns 10333 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1354747 ns 1354208 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1666 ns 1583 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1666 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23049 ns 22798 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6000 ns 5833 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5750 ns 5709 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6083 ns 6000 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5708 ns 5916 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 272998.5 ns 274328 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6816125 ns 6866624.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6402167 ns 6433708 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6496250 ns 6554499.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7515896 ns 7548875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214842.5 ns 213149 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24119875 ns 24100417 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21297750 ns 21294521 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21015750 ns 21070125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29643833 ns 29826667 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2104284 ns 2116806 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48636417 ns 37336834 ns 1.30
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45606625 ns 34197292 ns 1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45739292 ns 45794042 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49289166.5 ns 49624208 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6583 ns 5750 ns 1.14
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6375 ns 5625 ns 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6729.5 ns 6791 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6125 ns 6667 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 235503 ns 236202.5 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8209 ns 8084 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8959 ns 7875 ns 1.14
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8583 ns 8667 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8667 ns 9167 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1060587 ns 1060405 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1504292 ns 1553542 ns 0.97
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1256167 ns 1263041.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1625333 ns 1622041 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2157209 ns 2175916 ns 0.99
lenet(28, 28, 1, 128)/forward/GPU/CUDA 276714.5 ns 272178 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7846542 ns 7902375 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6594750 ns 6258292 ns 1.05
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7163917 ns 7165958 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10480542 ns 10478104.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1863181 ns 1852121.5 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 364583.5 ns 361584 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 371083 ns 370750 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 459000 ns 456417 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 25166 ns 24999.5 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46384 ns 46439.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 742584 ns 738895.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 806833 ns 809958 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1063250 ns 1082542 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 95708 ns 76708 ns 1.25
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 310356 ns 301861.5 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397625 ns 397459 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288083 ns 288084 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288333 ns 212208 ns 1.36
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 751750 ns 755209 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43878 ns 43701 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 667417 ns 665625 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 532042 ns 530417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 531875 ns 473750 ns 1.12
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 972625 ns 974458 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 188552 ns 189749 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 665500 ns 649583 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 644145.5 ns 641833 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 640042 ns 545458.5 ns 1.17
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 609542 ns 653167 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132007 ns 131877 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2499500 ns 2454834 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2439604.5 ns 2460271 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2452479 ns 2500666 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2453583 ns 2518479 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1281567 ns 1202049 ns 1.07
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 3000 ns 3000 ns 1
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 2917 ns 3500 ns 0.83
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 4250 ns 3500 ns 1.21
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 3166.5 ns 2708 ns 1.17
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16209 ns 15904 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5625 ns 5375 ns 1.05
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5542 ns 5292 ns 1.05
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5583 ns 5666 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5583 ns 5750 ns 0.97
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 197427.5 ns 196388 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1460625 ns 1465625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1499500 ns 1502708 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1500542 ns 1496875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1439750 ns 1444792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40557 ns 40558 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5144333.5 ns 5125396 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5277208 ns 5286583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5290292 ns 5312375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4992083 ns 4974792 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196743 ns 195790.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3709 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 32882 ns 32748 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15375 ns 15083 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15375 ns 15083 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15375 ns 15167 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15416 ns 15375 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 376061.5 ns 375651.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71333 ns 71125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 70833 ns 71167 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71000 ns 71208 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71125 ns 71083 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112803 ns 112958 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 319792 ns 323791 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 318500 ns 320458 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318500 ns 326875 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318125 ns 323000 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 192539.5 ns 193747 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1042 ns 1000 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 958 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1042 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1084 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23898 ns 23358 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8166 ns 7875 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8167 ns 7834 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8708 ns 8458 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8208 ns 8833 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 259317.5 ns 259209 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 509937.5 ns 505375 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 487208.5 ns 484292 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 562375 ns 564542 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 212250 ns 215062.5 ns 0.99
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130057 ns 128754 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1385062.5 ns 1371334 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1461166.5 ns 1393812.5 ns 1.05
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1724562 ns 1732333 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 867750 ns 870083.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 273418 ns 276302 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31926 ns 31400 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6334 ns 6167 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6292 ns 6000 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7042 ns 6500 ns 1.08
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 6958 ns 0.93
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 262567.5 ns 263074.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1725167 ns 1767042 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1722500 ns 1725208 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1727542 ns 1727292 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1723666 ns 1726271 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168619.5 ns 168554 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4367833 ns 4357521 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4376291 ns 4359541 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4353604 ns 4379875 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4366625 ns 4377583 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1235288 ns 1157059 ns 1.07
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6833 ns 6666 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6667 ns 6666 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 9875 ns 6916 ns 1.43
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6917 ns 7041.5 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19683 ns 20567 ns 0.96
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51833 ns 32834 ns 1.58
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 52375 ns 51229.5 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 72645.5 ns 33541.5 ns 2.17
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 34708 ns 51062.5 ns 0.68
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 209572 ns 209739.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17708 ns 17250 ns 1.03
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 17875 ns 17812.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18667 ns 18292 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17812.5 ns 17708 ns 1.01
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18259 ns 17907 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53333 ns 53208 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53334 ns 52959 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53417 ns 53541 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53458 ns 53291 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 343000.5 ns 344400 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75333 ns 75333 ns 1
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75250 ns 74959 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75167 ns 75292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75084 ns 75000 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46885 ns 47022 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 327500 ns 325292 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 328708 ns 324417 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 325125 ns 343042 ns 0.95
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 325875 ns 327084 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 208870.5 ns 210359 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1487167 ns 1488333 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1528250 ns 1527917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1527208 ns 1521042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1464208 ns 1466167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52106 ns 51138 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5135229 ns 5120375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5243250 ns 5285750 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5280000 ns 5309459 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4984000 ns 4973917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 203243.5 ns 202631 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28208 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28292 ns 28125 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28167 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28167 ns 28209 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24493 ns 24478 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66791 ns 66208 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67583 ns 66167 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66667 ns 66250 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66458 ns 66959 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 533204 ns 533201 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1505333 ns 1463833 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1127417 ns 1144583 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1115333.5 ns 832188 ns 1.34
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2167937.5 ns 2217792 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 580240.5 ns 576305 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3100541.5 ns 3077958.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2730979 ns 2733167 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2737604 ns 2620334 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3813083 ns 3782000 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2047885 ns 2001343 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7941458 ns 7887749.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7941125 ns 7887771 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7897312 ns 7989000 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4823416.5 ns 4832458 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 79250 ns 134958 ns 0.59
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81916 ns 78917 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83708 ns 82625 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82375 ns 81250 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194078 ns 193237.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2030188 ns 2017354.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2011333.5 ns 2006750 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2016458 ns 2041167 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2010375 ns 2018875 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 809832 ns 797402 ns 1.02

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal
Member Author

There is a world age issue in the LuxLib tests.

Development

Successfully merging this pull request may close these issues.

Use TestExtras.jl for inference testing