@austinderek
Implement an overflow-aware optimization in ctrBlocks8Asm: add a fast
path for the case where there is no overflow. One branch per 8 blocks is
faster than 7 increments in general-purpose registers followed by
transfers from them to XMM registers.

Added AES-192 and AES-256 modes to the AES-CTR benchmark.

Added a correctness test in ctr_test.go for the overflow optimization.

This improves performance, especially in AES-128 mode.

goos: windows
goarch: amd64
pkg: crypto/cipher
cpu: AMD Ryzen 7 5800H with Radeon Graphics
                 │     B/s      │     B/s       vs base
AESCTR/128/50-16   1.377Gi ± 0%   1.384Gi ± 0%   +0.51% (p=0.028 n=20)
AESCTR/128/1K-16   6.164Gi ± 0%   6.892Gi ± 1%  +11.81% (p=0.000 n=20)
AESCTR/128/8K-16   7.372Gi ± 0%   8.768Gi ± 1%  +18.95% (p=0.000 n=20)
AESCTR/192/50-16   1.289Gi ± 0%   1.279Gi ± 0%   -0.75% (p=0.001 n=20)
AESCTR/192/1K-16   5.734Gi ± 0%   6.011Gi ± 0%   +4.83% (p=0.000 n=20)
AESCTR/192/8K-16   6.889Gi ± 1%   7.437Gi ± 0%   +7.96% (p=0.000 n=20)
AESCTR/256/50-16   1.170Gi ± 0%   1.163Gi ± 0%   -0.54% (p=0.005 n=20)
AESCTR/256/1K-16   5.235Gi ± 0%   5.391Gi ± 0%   +2.98% (p=0.000 n=20)
AESCTR/256/8K-16   6.361Gi ± 0%   6.676Gi ± 0%   +4.94% (p=0.000 n=20)
geomean            3.681Gi        3.882Gi        +5.46%

The slight slowdown on 50-byte workloads is unrelated to this change,
because such workloads never use ctrBlocks8Asm.

Updates golang#76061


🔄 This is a mirror of upstream PR golang#76059

@austinderek austinderek closed this Nov 5, 2025