cmd/compile: schedule carry chain arithmetic disjointly
This results in a 1.7-2.4x improvement in native Go crypto/elliptic multiplication operations on PPC64, and similar improvements might be possible on other architectures which use flags or similar to represent the carry bit in SSA form.

If it is possible, schedule carry chains independently of each other to avoid clobbering the carry flag. Clobbering it is very expensive.

This is done by:

1. Identifying ops which use, but do not create, a carry bit, and lowering their priority below all other ops which do not need to be placed at the top of a block. This effectively ensures only one carry chain will be placed at a time in the most important cases (crypto/elliptic/internal/fiat contains most of them).

2. Raising the priority of carry bit generating ops to schedule later in a block to ensure they are placed as soon as they are ready.

Likewise, tuple ops which separate carrying ops are scored similarly to 2 above. This prevents unrelated ops from being scheduled between carry-dependent operations, which occurs when unrelated ops are ready to schedule alongside such tuple ops. This reduces the chances that a flag-clobbering op is placed between two carry-dependent operations.
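For readers unfamiliar with the term, the following is a minimal sketch of what a carry chain looks like at the Go source level. It is not code from this change: add256 is a hypothetical helper written for illustration, loosely in the style of multi-limb arithmetic built on math/bits.Add64. Each call consumes the carry produced by the previous one, so the four additions form one chain. The two calls in main are independent chains, which is the situation where interleaving their ops could clobber a carry flag on architectures that keep the carry in one.

```go
package main

import (
	"fmt"
	"math/bits"
)

// add256 is a hypothetical helper (not part of this change). It adds two
// 256-bit values stored as four little-endian 64-bit limbs. Every
// bits.Add64 call takes the carry-out of the previous call as its
// carry-in, so the four additions form a single carry chain.
func add256(x, y [4]uint64) (sum [4]uint64, carryOut uint64) {
	var c uint64
	sum[0], c = bits.Add64(x[0], y[0], 0)
	sum[1], c = bits.Add64(x[1], y[1], c)
	sum[2], c = bits.Add64(x[2], y[2], c)
	sum[3], c = bits.Add64(x[3], y[3], c)
	return sum, c
}

func main() {
	a := [4]uint64{^uint64(0), ^uint64(0), 0, 0}
	b := [4]uint64{1, 0, 0, 0}

	// Two carry chains with no data dependency between them.
	s1, c1 := add256(a, b)
	s2, c2 := add256(b, b)

	fmt.Println(s1, c1)
	fmt.Println(s2, c2)
}
```

Whether and how such a chain maps onto a hardware carry flag depends on the target architecture and the compiler's lowering rules; the sketch only shows the dependency structure the scheduler has to respect.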
With PPC64 Add64/Sub64 lowering into SSA and this patch, the net performance difference in crypto/elliptic benchmarks on P9/ppc64le is:

name                                old time/op  new time/op  delta
ScalarBaseMult/P256                 46.3µs ± 0%  46.9µs ± 0%   +1.34%
ScalarBaseMult/P224                  356µs ± 0%   209µs ± 0%  -41.14%
ScalarBaseMult/P384                 1.20ms ± 0%  0.57ms ± 0%  -52.14%
ScalarBaseMult/P521                 3.38ms ± 0%  1.44ms ± 0%  -57.27%
ScalarMult/P256                      199µs ± 0%   199µs ± 0%   -0.17%
ScalarMult/P224                      357µs ± 0%   212µs ± 0%  -40.56%
ScalarMult/P384                     1.20ms ± 0%  0.58ms ± 0%  -51.86%
ScalarMult/P521                     3.37ms ± 0%  1.44ms ± 0%  -57.32%
MarshalUnmarshal/P256/Uncompressed  2.59µs ± 0%  2.52µs ± 0%   -2.63%
MarshalUnmarshal/P256/Compressed    2.58µs ± 0%  2.52µs ± 0%   -2.06%
MarshalUnmarshal/P224/Uncompressed  1.54µs ± 0%  1.40µs ± 0%   -9.42%
MarshalUnmarshal/P224/Compressed    1.54µs ± 0%  1.39µs ± 0%   -9.87%
MarshalUnmarshal/P384/Uncompressed  2.40µs ± 0%  1.80µs ± 0%  -24.93%
MarshalUnmarshal/P384/Compressed    2.35µs ± 0%  1.81µs ± 0%  -23.03%
MarshalUnmarshal/P521/Uncompressed  3.79µs ± 0%  2.58µs ± 0%  -31.81%
MarshalUnmarshal/P521/Compressed    3.80µs ± 0%  2.60µs ± 0%  -31.67%

Note: P256 uses an asm implementation, so little variation is expected.

Updates #40171

Change-Id: I810850e8ff429505424c92d6fe37f99aaa0c6e84
Reviewed-on: https://go-review.googlesource.com/c/go/+/393656
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Filippo Valsorda <valsorda@google.com>