-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Open
Labels
craneliftIssues related to the Cranelift code generatorIssues related to the Cranelift code generatorcranelift:area:x64Issues related to x64 codegenIssues related to x64 codegencranelift:goal:optimize-speedFocus area: the speed of the code produced by Cranelift.Focus area: the speed of the code produced by Cranelift.
Description
Feature
Latency and RCP for division depends on register sizes for most of x64 architectures. So Clang do one interesting trick which may speedup 64-bit division if high parts of operands equal to zero. Pseudocode:
idiv(a: i64, b: i64) -> i64 {
if ((a | b) >> 32) return a / b; // full 64-bit division
return i32(a) / i32(b) as i64; // 32-bit division
}
godbolt: https://godbolt.org/z/Tqqzs1
Is it make sense apply same optimization for cranelift only for x64 architecture?
Benefit
it may speedup div / rem over 2x for arguments without high parts with small constant overhead according to this table:
But it is worth excluding Zen1,2,3 architecture due to it uses a more modern scheme for division which doesn't dependent on register's width. Also it doesn't need for ARM.
Metadata
Metadata
Assignees
Labels
craneliftIssues related to the Cranelift code generatorIssues related to the Cranelift code generatorcranelift:area:x64Issues related to x64 codegenIssues related to x64 codegencranelift:goal:optimize-speedFocus area: the speed of the code produced by Cranelift.Focus area: the speed of the code produced by Cranelift.