Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: add mul intrinsic for overflow checks #21588

Closed
martisch opened this issue Aug 24, 2017 · 15 comments
Closed

runtime: add mul intrinsic for overflow checks #21588

martisch opened this issue Aug 24, 2017 · 15 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@martisch
Copy link
Contributor

martisch commented Aug 24, 2017

This is neither a language change nor introducing or changing a public api but discussing a runtime implementation detail.

Motivation

Profiling over production programs reveals that growslice and makemap are hot functions (even outside the malloc part). I am working on to improve their performance and correctness for go1.10.

One part of these functions is at least one check for len > _MaxMem/elemSize which contains a costly division and needs a previous check for elemSize != 0.

(Update: the check can also have variants like len*elemSize > _MaxMem-smallAmount or roundup(len*elemSize) > _MaxMem with an overflow check like uintptr(size) > maxSliceCap(elemSize) before that in the general case involves a division )

We have resorted to switches with constant divisions in growslice to improve performance for common cases. This introduces branch mis predicts and a larger function size which is bad for icaches.

Elsewhere we use maxSliceCap with precomputed values (_MaxMem/elemSize) which introduces cache misses, uses additional memory and only works for small element sizes.

An alternative to the above optimizations that avoids data cache lookups and a division is to check len*elemSize > _MaxMem. Care needs to be taken that len*elemSize does not overflow.

To be able to check overflow quickly and safely i propose the following:

Proposal

Add a new unexported runtime function runtime.mul:

func mul(a, b uint) (uint, bool) {
	return a * b, b != 0 && a > ^uint(0)/b
}

The function can be used like:

if n, overflow := mul(len, elemSize); overflow || n > _MaxMem {
   panic("Does not fit into Memory")
}

Let the compiler recognize runtime.mul as an intrinsic and replace it with optimized inline instructions where possible.

On architectures that provide overflow/carryflags this can be a mul and setting overflow according to the flag. Otherwise use the provided generic implementation. (Note that a*b is always returned also in the generic implementation to match optimizations in machine code.)

Thereby we can eliminate maxSliceCap and shrink the switch cases without loss of (much) performance but better branch prediction and cache use and shrinking the functions away from special cases again.

Possible future directions and ideas for later extra proposals

  • add a similar function for add (the name is already taken in the runtime namespace) or just optimize overflow checks for adds better if this is not already the case
  • if runtime.mul provides nice code and performance improvements after evaluation of the implementation and use cases we could add such a function to math(.Bits) (and just let the compiler do the existing optimization to that name). Note that math.Bits are already intrinsics.
  • add detection of generic overflow checking in the ssa backend such that any overflow checking code is optimized

If the introduction of the above function and optimization looks ok i would like to work on implementing it and then replace existing runtime codepaths that would benefit from the new function.

/cc @khr @josharian @ianlancetaylor @stemar94

@martisch martisch changed the title runtime add mul intrinsic for overflow checks runtime: add mul intrinsic for overflow checks Aug 24, 2017
@rasky
Copy link
Member

rasky commented Aug 24, 2017

Related: #6815

@griesemer
Copy link
Contributor

Independent of the merits of this proposal, why does this need to be in the runtime? Why not math/bits (or analogous)?

@rasky
Copy link
Member

rasky commented Aug 24, 2017

Also related: #18616 (search for carry)

@martisch
Copy link
Contributor Author

martisch commented Aug 24, 2017

@griesemer: As far as i know currently runtime can only import runtime, unsafe and runtime/internal/... not math. Since everything depends on runtime we would create an import cycle if we use math.mul in runtime. (https://github.com/golang/go/blob/master/src/cmd/go/internal/load/pkg.go)
Not using math in runtime e.g. avoids the issue of introducing a math import which could have inits now or in the future that do not play well with runtime not having fully initialized.

Being runtime only also avoids introducing a new public api for now and runtime.mul can be easily deleted again or changed since it is private to the runtime.

math.Bits might be simple enough to include in runtime but this would need checks that this does not introduce any circular dependencies.

If we find it works well with runtime we can use the then existing compiler support to discuss a public api as a future independent proposal.

@robpike
Copy link
Contributor

robpike commented Aug 24, 2017

Why not put the _Maxmem check in there as well?

@martisch
Copy link
Contributor Author

martisch commented Aug 24, 2017

@robpike i think keeping runtime.mul simple leaves it more generally applicable to other cases of overflow checking at the same time as being used for _MaxMem checks.

Not all checks in runtime are against _MaxMem but can also be _MaxMem-someAmount. Where someAmount is known to be smaller than _MaxMem. Some uses round up len*elemSize before checking against _MaxMem. These cases are faster to check with help of mul when not directly comparing against _MaxMem.

if n, overflow := mul(len, elemSize); overflow || n > _MaxMem-someAmount {
   panic("Does not fit into Memory")
}
if n, overflow := mul(len, elemSize); overflow || roundup(n) > _MaxMem {
   panic("Does not fit into Memory")
}

In the later case depending on roundup a round up overflow to 0 needs to be prevented in roundup.

Updated the description to include that there can be variants of the _MaxMem check.

@josharian
Copy link
Contributor

Seems unobjectionable to add such a runtime function, although it should probably go in runtime/internal/sys, with the other runtime intrinsics.

If we decide to add something like it to math/bits as well, we might decide to give the runtime flavor looser semantics for performance (like we have done with other runtime intrinsics that also appear in math/bits). FWIW, I recently wanted math/bits to have something like Mul128(x, y int64) (lo, hi int64) for use with math/rand.

Longer term, though, it seems perhaps better to fix this either via:

Note that with either of these, we should be able to get equivalent performance, with enough compiler help. (1) In the arbitrary precision case, the compiler could in theory recognize that the integer is in use only in one expression, which bounds its range, and replace it under the hood with a 128 bit version. (2) Working with a 128 bit multiply, the compiler can recognize that the top half of a multiply is being checked against 0, so it can substitute a pure overflow check there.

@randall77
Copy link
Contributor

Sounds ok to me. I think our experience with using CTZ and friends in the runtime first before providing it generally in math/bits worked well. It would be nice to replicate that experience.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/91755 mentions this issue: runtime/internal/math: add multiplication with overflow check

gopherbot pushed a commit that referenced this issue Oct 15, 2018
This CL adds a new internal math package for use by the runtime.
The new package exports a MulUintptr function with uintptr arguments
a and b and returns uintptr(a*b) and whether the full-width product
x*y does overflow the uintptr value range (uintptr(x*y) != x*y).

Uses of MulUinptr in the runtime and intrinsics for performance
will be added in followup CLs.

Updates #21588

Change-Id: Ia5a02eeabc955249118e4edf68c67d9fc0858058
Reviewed-on: https://go-review.googlesource.com/c/91755
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/142377 mentions this issue: runtime: use multiplication with overflow check for makeslice

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/143797 mentions this issue: runtime: use multiplication with overflow check for makemap

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/143798 mentions this issue: runtime: use multiplication with overflow check for growslice

gopherbot pushed a commit that referenced this issue Oct 23, 2018
This improves performance for maps with a bucket size
(key+value*8 bytes) larger than 32 bytes and removes loading
a value from the maxElems array for smaller bucket sizes.

name                old time/op  new time/op  delta
MakeMap/[Byte]Byte  93.5ns ± 1%  91.8ns ± 1%  -1.83%  (p=0.000 n=10+10)
MakeMap/[Int]Int     134ns ± 1%   127ns ± 2%  -5.61%  (p=0.000 n=9+10)

Updates #21588

Change-Id: I53f77186769c4bd0f2b90f3c6c17df643b060e39
Reviewed-on: https://go-review.googlesource.com/c/143797
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Oct 23, 2018
This improves performance for slices with an element size larger
than 32 bytes and removes loading a value from the maxElems
array for smaller element sizes.

name                 old time/op  new time/op  delta
GrowSlice/Byte       41.4ns ± 2%  41.5ns ± 1%    ~     (p=0.366 n=10+9)
GrowSlice/Int16      51.1ns ± 2%  51.0ns ± 2%    ~     (p=0.985 n=10+10)
GrowSlice/Int        64.0ns ± 1%  64.2ns ± 1%    ~     (p=0.180 n=10+10)
GrowSlice/Ptr        90.8ns ± 1%  90.7ns ± 1%    ~     (p=0.858 n=9+10)
GrowSlice/Struct/24   108ns ± 0%   108ns ± 2%    ~     (p=0.488 n=8+9)
GrowSlice/Struct/32   118ns ± 2%   117ns ± 2%    ~     (p=0.327 n=10+10)
GrowSlice/Struct/40   159ns ± 1%   148ns ± 1%  -6.87%  (p=0.000 n=10+9)

Updates #21588

Change-Id: I443b82972d379b1befa791f9ee468b3adc6bb760
Reviewed-on: https://go-review.googlesource.com/c/143798
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Oct 23, 2018
This improves performance for slices with an element size larger
than 32 bytes and removes loading a value from the maxElems
array for smaller element sizes.

name                 old time/op  new time/op  delta
MakeSlice/Byte       18.0ns ± 4%  18.0ns ± 2%     ~     (p=0.575 n=20+17)
MakeSlice/Int16      21.8ns ± 2%  21.6ns ± 1%   -0.63%  (p=0.035 n=20+19)
MakeSlice/Int        42.0ns ± 2%  41.6ns ± 1%     ~     (p=0.121 n=20+18)
MakeSlice/Ptr        62.6ns ± 2%  62.4ns ± 2%     ~     (p=0.491 n=20+18)
MakeSlice/Struct/24  57.4ns ± 3%  56.0ns ± 2%   -2.40%  (p=0.000 n=19+19)
MakeSlice/Struct/32  62.1ns ± 2%  60.6ns ± 3%   -2.43%  (p=0.000 n=20+20)
MakeSlice/Struct/40  77.3ns ± 3%  68.9ns ± 3%  -10.91%  (p=0.000 n=20+20)

Updates #21588

Change-Id: Ie12807bf8f77c0e15453413f47e3d7de771b798f
Reviewed-on: https://go-review.googlesource.com/c/142377
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/144017 mentions this issue: runtime: use multiplication with overflow check for makechan

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/143997 mentions this issue: runtime: use multiplication with overflow check for newarray

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/144037 mentions this issue: runtime: remove unused maxSliceCap function and maxElems array

gopherbot pushed a commit that referenced this issue Oct 23, 2018
This improves performance for channels with an element size
larger than 32 bytes and removes loading a value from the
maxElems array for smaller element sizes.

MakeChan/Byte       88.8ns ± 6%  85.2ns ± 1%  -4.03%  (p=0.000 n=10+10)
MakeChan/Int         100ns ± 4%    96ns ± 2%  -3.72%  (p=0.000 n=9+10)
MakeChan/Ptr         124ns ± 3%   126ns ± 2%    ~     (p=0.068 n=10+10)
MakeChan/Struct/0   80.5ns ± 2%  80.7ns ± 2%    ~     (p=0.697 n=10+10)
MakeChan/Struct/32   143ns ± 4%   141ns ± 2%    ~     (p=0.221 n=10+10)
MakeChan/Struct/40   169ns ± 2%   159ns ± 4%  -6.26%  (p=0.000 n=10+10)

Updates #21588

Change-Id: Ifbf12a5af2f0ec7e1d2241ecfffab020e9abec48
Reviewed-on: https://go-review.googlesource.com/c/144017
Reviewed-by: Keith Randall <khr@golang.org>
gopherbot pushed a commit that referenced this issue Oct 23, 2018
This improves performance for e.g. maps with a bucket size
(key+value*8 bytes) larger than 32 bytes and removes loading
a value from the maxElems array for smaller bucket sizes.

name                old time/op  new time/op  delta
MakeMap/[Byte]Byte  95.5ns ± 1%  94.7ns ± 1%  -0.78%  (p=0.013 n=9+9)
MakeMap/[Int]Int     128ns ± 0%   121ns ± 2%  -5.63%  (p=0.000 n=6+10)

Updates #21588

Change-Id: I7d9eb7d49150c399c15dcab675e24bc97ff97852
Reviewed-on: https://go-review.googlesource.com/c/143997
Reviewed-by: Keith Randall <khr@golang.org>
@golang golang locked and limited conversation to collaborators Oct 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

9 participants