@newpavlov
Member

@newpavlov newpavlov commented Nov 6, 2025

Replaces the bitwise OR in the G function with addition. This seemingly results in better ALU utilization and improves performance by several percent: from 699 MB/s to 753 MB/s on my x86 PC and from 910 MB/s to 960 MB/s on a Mac M4.
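For context, the substitution is valid because the two operands of the OR can never have the same bit set: `x & z` only has bits where `z` is 1, and `y & !z` only where `z` is 0, so adding them never produces a carry. Below is a minimal sketch of that equivalence, assuming the RFC 1321 form of G. The names `g_or`, `g_add`, and `agree_on` are illustrative and are not taken from the crate.

```rust
/// MD5 round-2 auxiliary function in its RFC 1321 form:
/// G(x, y, z) = (x & z) | (y & !z).
pub fn g_or(x: u32, y: u32, z: u32) -> u32 {
    (x & z) | (y & !z)
}

/// Same function with the OR replaced by wrapping addition.
/// The two terms are masked by `z` and `!z` respectively, so no bit
/// position is set in both and the addition never carries.
pub fn g_add(x: u32, y: u32, z: u32) -> u32 {
    (x & z).wrapping_add(y & !z)
}

/// Hypothetical helper: checks that both variants agree on every
/// (x, y, z) triple drawn from `samples`.
pub fn agree_on(samples: &[u32]) -> bool {
    samples.iter().all(|&x| {
        samples.iter().all(|&y| {
            samples.iter().all(|&z| g_or(x, y, z) == g_add(x, y, z))
        })
    })
}
```

Calling `agree_on(&[0, 1, 0x8000_0000, 0xFFFF_FFFF, 0xDEAD_BEEF])` is a quick sanity check. It is not a proof, but the disjoint-mask argument above covers all inputs.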

Based on #749

@newpavlov
Member Author

newpavlov commented Nov 6, 2025

Huh... Compiling op_g in isolation results in the same assembly whether we use wrapping_add or not (see here). In other words, the compiler is able to apply this optimization on its own. But it seems this change (accidentally?) nudges the compiler towards better codegen.
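To reproduce this kind of comparison, one could compile two standalone variants of the step and diff their assembly. The sketch below is an assumption modeled on the round-2 step from RFC 1321 (add G, the state word, the message word and the constant, rotate, then add `x`). The actual `op_g` in the crate may differ in signature and argument order, and `step_g_or` / `step_g_add` are hypothetical names.

```rust
/// Hypothetical round-2 step using bitwise OR inside G.
/// `#[inline(never)]` keeps the function as a separate symbol so its
/// assembly can be inspected in isolation.
#[inline(never)]
pub fn step_g_or(w: u32, x: u32, y: u32, z: u32, m: u32, c: u32, s: u32) -> u32 {
    ((x & z) | (y & !z))
        .wrapping_add(w)
        .wrapping_add(m)
        .wrapping_add(c)
        .rotate_left(s)
        .wrapping_add(x)
}

/// Same step with the OR inside G replaced by wrapping addition.
#[inline(never)]
pub fn step_g_add(w: u32, x: u32, y: u32, z: u32, m: u32, c: u32, s: u32) -> u32 {
    (x & z)
        .wrapping_add(y & !z)
        .wrapping_add(w)
        .wrapping_add(m)
        .wrapping_add(c)
        .rotate_left(s)
        .wrapping_add(x)
}
```

Both functions return identical values for all inputs, so any difference in the emitted instructions comes from how the compiler treats the two spellings, not from a change in behavior.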

@newpavlov
Member Author

newpavlov commented Nov 6, 2025

Interestingly, compiling the full compress function with the wrapping_add change results in more instructions on x86 (672 vs 648), but I guess the resulting code is a bit friendlier to the pipeline. On AArch64 the number of instructions is the same (632), but with wrapping_add some orr instructions get replaced with add (i.e. the compiler fails to apply the optimization when the full function is compiled).

@tarcieri Could you check whether this change results in better performance on Mac?

@newpavlov
Member Author

On M4 this change improves performance from 910 MB/s to 960 MB/s.

@newpavlov newpavlov merged commit 6aa90e8 into master Nov 6, 2025
13 checks passed
@newpavlov newpavlov deleted the md5/add_opt branch November 6, 2025 04:10
@tarcieri
Member

tarcieri commented Nov 6, 2025

Seems about 5-6% faster on my M1 Max
