v0.1-attn-weights

Released by @rwightman on 04 Sep 00:25

A collection of weights I've trained to compare various SE-like channel attention blocks (SE, ECA, GC, etc), self-attention blocks (bottleneck, halo, lambda), and related non-attention baselines.
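
For reference, here's a minimal sketch of loading one of these released weights through timm's standard model/transform API (the model name comes from the tables below; the timm version and image path are assumptions):

```python
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Load one of the released weights by name, e.g. the Halo ResNet-26-T variant.
model = timm.create_model('halonet26t', pretrained=True)
model.eval()

# Build the eval transform from the model's default config
# (256x256 input, crop_pct 0.95, bicubic interpolation for this model).
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

img = Image.open('example.jpg').convert('RGB')  # hypothetical image path
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))
print(logits.softmax(-1).topk(5))
```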

ResNet-26-T series

  • [2, 2, 2, 2] repeat Bottleneck block ResNet architecture
  • ReLU activations
  • 3 layer stem with 24, 32, 64 chs, max-pool
  • avg pool in shortcut downsample
  • self-attn blocks replace the 3x3 conv in both blocks of the last stage, and in the second block of the penultimate stage
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| botnet26t_256 | 79.246 | 20.754 | 94.53 | 5.47 | 12.49 | 256 | 0.95 | bicubic |
| halonet26t | 79.13 | 20.87 | 94.314 | 5.686 | 12.48 | 256 | 0.95 | bicubic |
| lambda_resnet26t | 79.112 | 20.888 | 94.59 | 5.41 | 10.96 | 256 | 0.94 | bicubic |
| lambda_resnet26rpt_256 | 78.964 | 21.036 | 94.428 | 5.572 | 10.99 | 256 | 0.94 | bicubic |
| resnet26t | 77.872 | 22.128 | 93.834 | 6.166 | 16.01 | 256 | 0.94 | bicubic |

Details:

  • HaloNet - 8 pixel block size, 2 pixel halo (overlap), relative position embedding
  • BotNet - relative position embedding
  • Lambda-ResNet-26-T - 3d lambda conv, kernel = 9
  • Lambda-ResNet-26-RPT - relative position embedding

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 2967.55 | 86.252 | 256 | 256 | 857.62 | 297.984 | 256 | 256 | 16.01 |
| botnet26t_256 | 2642.08 | 96.879 | 256 | 256 | 809.41 | 315.706 | 256 | 256 | 12.49 |
| halonet26t | 2601.91 | 98.375 | 256 | 256 | 783.92 | 325.976 | 256 | 256 | 12.48 |
| lambda_resnet26t | 2354.1 | 108.732 | 256 | 256 | 697.28 | 366.521 | 256 | 256 | 10.96 |
| lambda_resnet26rpt_256 | 1847.34 | 138.563 | 256 | 256 | 644.84 | 197.892 | 128 | 256 | 10.99 |

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet26t | 3691.94 | 69.327 | 256 | 256 | 1188.17 | 214.96 | 256 | 256 | 16.01 |
| botnet26t_256 | 3291.63 | 77.76 | 256 | 256 | 1126.68 | 226.653 | 256 | 256 | 12.49 |
| halonet26t | 3230.5 | 79.232 | 256 | 256 | 1077.82 | 236.934 | 256 | 256 | 12.48 |
| lambda_resnet26rpt_256 | 2324.15 | 110.133 | 256 | 256 | 864.42 | 147.485 | 128 | 256 | 10.99 |
| lambda_resnet26t | Not Supported | | | | | | | | |
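
The tables above were presumably produced with timm's benchmark.py script on the hardware/container noted in the headings. As a rough, hedged stand-in, the inference throughput measurement amounts to something like the sketch below (batch and image size taken from the tables; absolute numbers will differ from the NGC 21.09 results):

```python
import time
import torch
import timm

# Simplified AMP inference-throughput measurement for one of the models above.
model = timm.create_model('halonet26t').cuda().eval()
x = torch.randn(256, 3, 256, 256, device='cuda')  # batch 256 @ 256x256, as in the tables

with torch.no_grad(), torch.cuda.amp.autocast():
    for _ in range(10):                     # warmup iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(50):                     # timed iterations
        model(x)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f'{50 * x.shape[0] / elapsed:.1f} samples/sec')
```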

ResNeXt-26-T series

  • [2, 2, 2, 2] repeat Bottleneck block ResNeXt architecture
  • SiLU activations
  • grouped 3x3 convolutions in bottleneck, 32 channels per group
  • 3 layer stem with 24, 32, 64 chs, max-pool
  • avg pool in shortcut downsample
  • channel attn (active in non-self-attn blocks) between the 3x3 and last 1x1 conv
  • when present, self-attn blocks replace the 3x3 conv in both blocks of the last stage, and in the second block of the penultimate stage
| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| eca_halonext26ts | 79.484 | 20.516 | 94.600 | 5.400 | 10.76 | 256 | 0.94 | bicubic |
| eca_botnext26ts_256 | 79.270 | 20.730 | 94.594 | 5.406 | 10.59 | 256 | 0.95 | bicubic |
| bat_resnext26ts | 78.268 | 21.732 | 94.1 | 5.9 | 10.73 | 256 | 0.9 | bicubic |
| seresnext26ts | 77.852 | 22.148 | 93.784 | 6.216 | 10.39 | 256 | 0.9 | bicubic |
| gcresnext26ts | 77.804 | 22.196 | 93.824 | 6.176 | 10.48 | 256 | 0.9 | bicubic |
| eca_resnext26ts | 77.446 | 22.554 | 93.57 | 6.43 | 10.3 | 256 | 0.9 | bicubic |
| resnext26ts | 76.764 | 23.236 | 93.136 | 6.864 | 10.3 | 256 | 0.9 | bicubic |
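
These variants are all registered in timm, so the family can be enumerated by wildcard; a small sketch (pretrained=True limits the list to names with downloadable weights):

```python
import timm

# All registered *resnext26ts variants: resnext26ts, seresnext26ts,
# eca_resnext26ts, gcresnext26ts, bat_resnext26ts, ...
print(timm.list_models('*resnext26ts'))

# Restrict to names that ship pretrained weights.
print(timm.list_models('*resnext26ts', pretrained=True))
```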

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3006.57 | 85.134 | 256 | 256 | 864.4 | 295.646 | 256 | 256 | 10.3 |
| seresnext26ts | 2931.27 | 87.321 | 256 | 256 | 836.92 | 305.193 | 256 | 256 | 10.39 |
| eca_resnext26ts | 2925.47 | 87.495 | 256 | 256 | 837.78 | 305.003 | 256 | 256 | 10.3 |
| gcresnext26ts | 2870.01 | 89.186 | 256 | 256 | 818.35 | 311.97 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 2652.03 | 96.513 | 256 | 256 | 790.43 | 323.257 | 256 | 256 | 10.59 |
| eca_halonext26ts | 2593.03 | 98.705 | 256 | 256 | 766.07 | 333.541 | 256 | 256 | 10.76 |
| bat_resnext26ts | 2469.78 | 103.64 | 256 | 256 | 697.21 | 365.964 | 256 | 256 | 10.73 |

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

NOTE: there are performance issues with certain grouped conv configs in the channels-last layout; the backward pass in particular is really slow. This also causes issues for RegNet and NFNet networks.

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnext26ts | 3952.37 | 64.755 | 256 | 256 | 608.67 | 420.049 | 256 | 256 | 10.3 |
| eca_resnext26ts | 3815.77 | 67.074 | 256 | 256 | 594.35 | 430.146 | 256 | 256 | 10.3 |
| seresnext26ts | 3802.75 | 67.304 | 256 | 256 | 592.82 | 431.14 | 256 | 256 | 10.39 |
| gcresnext26ts | 3626.97 | 70.57 | 256 | 256 | 581.83 | 439.119 | 256 | 256 | 10.48 |
| eca_botnext26ts_256 | 3515.84 | 72.8 | 256 | 256 | 611.71 | 417.862 | 256 | 256 | 10.59 |
| eca_halonext26ts | 3410.12 | 75.057 | 256 | 256 | 597.52 | 427.789 | 256 | 256 | 10.76 |
| bat_resnext26ts | 3053.83 | 83.811 | 256 | 256 | 533.23 | 478.839 | 256 | 256 | 10.73 |
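
For the NHWC rows, channels-last simply means converting the model and inputs to PyTorch's channels_last memory format; a minimal sketch using standard PyTorch API (not necessarily the exact benchmark harness used above):

```python
import torch
import timm

model = timm.create_model('seresnext26ts').cuda()
model = model.to(memory_format=torch.channels_last)   # NHWC weight/activation layout

x = torch.randn(256, 3, 256, 256, device='cuda')
x = x.contiguous(memory_format=torch.channels_last)

with torch.cuda.amp.autocast():
    out = model(x)  # forward is fine; it's the backward for the grouped 3x3 convs that hits the slow path noted above
```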

ResNet-33-T series

  • [2, 3, 3, 2] repeat Bottleneck block ResNet architecture
  • SiLU activations
  • 3 layer stem with 24, 32, 64 chs, no max-pool, 1st and 3rd conv stride 2
  • avg pool in shortcut downsample
  • channel attn (active in non-self-attn blocks) between the 3x3 and last 1x1 conv
  • when present, self-attn blocks replace the 3x3 conv in the last block of stages 2 and 3, and in both blocks of the final stage
  • FC 1x1 conv between last block and classifier

The 33-layer models have an extra 1x1 FC layer between the last conv block and the classifier. There is both a non-attention 33-layer baseline and a 32-layer baseline without the extra FC.
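
A small sketch to see the difference in practice: build both baselines (no pretrained weights needed) and compare parameter counts, which should line up with the table below (~19.68M vs ~17.96M):

```python
import timm

for name in ('resnet33ts', 'resnet32ts'):
    model = timm.create_model(name)
    n_params = sum(p.numel() for p in model.parameters())
    # resnet33ts carries the extra 1x1 FC before the classifier; resnet32ts does not.
    print(f'{name}: {n_params / 1e6:.2f}M params, head: {model.get_classifier()}')
```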

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| sehalonet33ts | 80.986 | 19.014 | 95.272 | 4.728 | 13.69 | 256 | 0.94 | bicubic |
| seresnet33ts | 80.388 | 19.612 | 95.108 | 4.892 | 19.78 | 256 | 0.94 | bicubic |
| eca_resnet33ts | 80.132 | 19.868 | 95.054 | 4.946 | 19.68 | 256 | 0.94 | bicubic |
| gcresnet33ts | 79.99 | 20.01 | 94.988 | 5.012 | 19.88 | 256 | 0.94 | bicubic |
| resnet33ts | 79.352 | 20.648 | 94.596 | 5.404 | 19.68 | 256 | 0.94 | bicubic |
| resnet32ts | 79.028 | 20.972 | 94.444 | 5.556 | 17.96 | 256 | 0.94 | bicubic |

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet32ts | 2502.96 | 102.266 | 256 | 256 | 733.27 | 348.507 | 256 | 256 | 17.96 |
| resnet33ts | 2473.92 | 103.466 | 256 | 256 | 725.34 | 352.309 | 256 | 256 | 19.68 |
| seresnet33ts | 2400.18 | 106.646 | 256 | 256 | 695.19 | 367.413 | 256 | 256 | 19.78 |
| eca_resnet33ts | 2394.77 | 106.886 | 256 | 256 | 696.93 | 366.637 | 256 | 256 | 19.68 |
| gcresnet33ts | 2342.81 | 109.257 | 256 | 256 | 678.22 | 376.404 | 256 | 256 | 19.88 |
| sehalonet33ts | 1857.65 | 137.794 | 256 | 256 | 577.34 | 442.545 | 256 | 256 | 13.69 |

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|
| resnet32ts | 3306.22 | 77.416 | 256 | 256 | 1012.82 | 252.158 | 256 | 256 | 17.96 |
| resnet33ts | 3257.59 | 78.573 | 256 | 256 | 1002.38 | 254.778 | 256 | 256 | 19.68 |
| seresnet33ts | 3128.08 | 81.826 | 256 | 256 | 950.27 | 268.581 | 256 | 256 | 19.78 |
| eca_resnet33ts | 3127.11 | 81.852 | 256 | 256 | 948.84 | 269.123 | 256 | 256 | 19.68 |
| gcresnet33ts | 2984.87 | 85.753 | 256 | 256 | 916.98 | 278.169 | 256 | 256 | 19.88 |
| sehalonet33ts | 2188.23 | 116.975 | 256 | 256 | 711.63 | 179.03 | 128 | 256 | 13.69 |

ResNet-50(ish) models

In Progress

RegNet"Z" series

  • RegNetZ inspired architecture, inverted bottleneck, SE attention, pre-classifier FC, essentially an EfficientNet w/ grouped conv instead of depthwise (see the inspection sketch after this list)
  • b, c, and d are three different sizes I put together to cover differing flop ranges, not based on the paper (https://arxiv.org/abs/2103.06877) or a search process
  • for comparison with RegNetY and the paper's RegNetZ models: at 224x224 the b, c, and d models are 1.45, 1.92, and 4.58 GMACs respectively; c and d are trained at 256 here, so their GMACs are higher than that (see tables)
  • haloregnetz_b uses halo attention for all of the last stage, and interleaved every 3 blocks (for 4 total) in the penultimate stage
  • b and c variants use a stem / 1st stage like the paper, d uses a 3-deep tiered stem with 2-1-2 striding
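
To check the grouped (rather than depthwise) 3x3 convs for yourself, a quick inspection sketch (model names as in the tables below; attribute access is standard nn.Conv2d):

```python
import timm
import torch.nn as nn

# Print the first grouped 3x3 conv in regnetz_b: groups > 1 but groups < in_channels,
# i.e. grouped rather than EfficientNet-style depthwise.
model = timm.create_model('regnetz_b')
for name, m in model.named_modules():
    if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3) and m.groups > 1:
        print(f'{name}: in={m.in_channels}, out={m.out_channels}, groups={m.groups}')
        break
```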

ImageNet-1k validation at train resolution

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| regnetz_d | 83.422 | 16.578 | 96.636 | 3.364 | 27.58 | 256 | 0.95 | bicubic |
| regnetz_c | 82.164 | 17.836 | 96.058 | 3.942 | 13.46 | 256 | 0.94 | bicubic |
| haloregnetz_b | 81.058 | 18.942 | 95.2 | 4.8 | 11.68 | 224 | 0.94 | bicubic |
| regnetz_b | 79.868 | 20.132 | 94.988 | 5.012 | 9.72 | 224 | 0.94 | bicubic |

ImageNet-1k validation at optimal test res

| model | top1 | top1_err | top5 | top5_err | param_count | img_size | crop_pct | interpolation |
|---|---|---|---|---|---|---|---|---|
| regnetz_d | 84.04 | 15.96 | 96.87 | 3.13 | 27.58 | 320 | 0.95 | bicubic |
| regnetz_c | 82.516 | 17.484 | 96.356 | 3.644 | 13.46 | 320 | 0.94 | bicubic |
| haloregnetz_b | 81.058 | 18.942 | 95.2 | 4.8 | 11.68 | 224 | 0.94 | bicubic |
| regnetz_b | 80.728 | 19.272 | 95.47 | 4.53 | 9.72 | 288 | 0.94 | bicubic |
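
To reproduce the test-resolution numbers, the eval transform just needs the larger input size and matching crop_pct; a sketch using timm's transform factory (the equivalent validate.py flags would be --img-size / --crop-pct, assuming the usual script options):

```python
import timm
from timm.data import create_transform

# Evaluate regnetz_d at its better test resolution (320, crop_pct 0.95)
# rather than its 256 train resolution.
model = timm.create_model('regnetz_d', pretrained=True).eval()
eval_transform = create_transform(
    input_size=(3, 320, 320),
    is_training=False,
    interpolation='bicubic',
    crop_pct=0.95,
)
# plug eval_transform into a standard ImageNet-1k validation DataLoader/loop
```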

Benchmark - RTX 3090 - AMP - NCHW - NGC 21.09

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | infer_GMACs | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|---|
| regnetz_b | 2703.42 | 94.68 | 256 | 224 | 1.45 | 764.85 | 333.348 | 256 | 224 | 9.72 |
| haloregnetz_b | 2086.22 | 122.695 | 256 | 224 | 1.88 | 620.1 | 411.415 | 256 | 224 | 11.68 |
| regnetz_c | 1653.19 | 154.836 | 256 | 256 | 2.51 | 459.41 | 277.268 | 128 | 256 | 13.46 |
| regnetz_d | 1060.91 | 241.284 | 256 | 256 | 5.98 | 296.51 | 430.143 | 128 | 256 | 27.58 |

Benchmark - RTX 3090 - AMP - NHWC - NGC 21.09

NOTE: the channels-last layout is painfully slow for the backward pass here due to some sort of cuDNN issue.

| model | infer_samples_per_sec | infer_step_time | infer_batch_size | infer_img_size | infer_GMACs | train_samples_per_sec | train_step_time | train_batch_size | train_img_size | param_count |
|---|---|---|---|---|---|---|---|---|---|---|
| regnetz_b | 4152.59 | 61.634 | 256 | 224 | 1.45 | 399.37 | 639.572 | 256 | 224 | 9.72 |
| haloregnetz_b | 2770.78 | 92.378 | 256 | 224 | 1.88 | 364.22 | 701.386 | 256 | 224 | 11.68 |
| regnetz_c | 2512.4 | 101.878 | 256 | 256 | 2.51 | 376.72 | 338.372 | 128 | 256 | 13.46 |
| regnetz_d | 1456.05 | 175.8 | 256 | 256 | 5.98 | 111.32 | 1148.279 | 128 | 256 | 27.58 |