
runtime, cmd/compile: consider runtime.free, runtime.freetracked and GOEXPERIMENT=runtimefree #74299

@thepudds

Description


I have a WIP CL 673695 with a better and more complete description, but I am opening this issue here to help with discussion.

Background

CL 673695 starts with:

For reducing allocations, there is sometimes a chicken and egg problem.

For example, #72036 is a WIP approach to update escape analysis so that interface arguments do not always escape, with a primary target of w.Write(buf) and friends. However, many interesting byte slices have a size known only at runtime, and a make with an unknown capacity (or one beyond 32 bytes) currently causes the slice to be heap allocated. Solely improving escape analysis in this case therefore has limited impact on actual applications -- if a slice is heap allocated for two reasons, eliminating one reason just means it is still heap allocated for the other, which lessens the value of doing the work to eliminate the first reason.

In an attempt to help triangulate what is possible for the runtime, compiler, and standard library to do together to reduce how much work the GC must do, this CL has a proof-of-concept set of runtime.free implementations.

Back in February, as part of my write-up for #72036, I mentioned that a runtime.free could help with this chicken and egg problem:

I have a mostly working runtime.free that can be inserted by the compiler (and not for use by normal Go code, of course), and I'm looking at hooking it up a few different ways, including when a slice is both variable size and also used as an interface method arg. (I'll probably open a separate issue on this, but I'm also attempting to free memory in some runtime map growth cases, some append cases, etc.)

Prototype

CL 673695 contains that mostly working runtime.free. (It's "mostly working" in the sense that it passes its own tests, including that I have let it accumulate a few hundred CPU hours running those tests successfully under stress. That of course does not mean it is bug free, and it still takes some shortcuts, including, for example, in how it updates user-facing stats.)

I also have some additional WIP CLs that update the compiler to automatically insert free calls for slices allocated via make when the compiler can prove it is safe to do so. Separately, they update maps to free map storage during growth and split, as well as a small number of lower-level stdlib types like strings.Builder and bytes.Buffer (which are "low level" in the sense that they already encapsulate some low-level performance tricks, including use of unsafe, but also let other higher-level pieces of the stdlib safely benefit from their performance). It is not intended that higher-level portions of the stdlib (like net/http) would have manual calls to runtime.free.

Patterns of append usage like slices.Collect and slices.AppendSeq are also targets, with some good initial results. (slices.Collect, for example, could be updated manually to start, which I've tried as an experiment, or it could wait for the compiler to automatically recognize its pattern.)
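The append-driven growth inside slices.Collect can be sketched with a toy collect that counts how many intermediate backing arrays die during growth. The names here are illustrative only; real user code cannot call runtime.free, so the sketch merely counts where a compiler-recognized free could reclaim memory:

```go
package main

import "fmt"

// collect mimics slices.Collect's append-driven growth for a sequence of
// unknown length. Each time append grows the slice, the previous backing
// array dies immediately -- exactly the pattern where a compiler-inserted
// free could reclaim it instead of leaving it for the GC.
func collect(seq func(yield func(int) bool)) (result []int, deadArrays int) {
	seq(func(v int) bool {
		oldCap := cap(result)
		result = append(result, v)
		if cap(result) != oldCap && oldCap != 0 {
			deadArrays++ // the old backing array is now garbage
		}
		return true
	})
	return result, deadArrays
}

func main() {
	// A sequence of 1000 ints whose length the compiler cannot know.
	seq := func(yield func(int) bool) {
		for i := 0; i < 1000; i++ {
			if !yield(i) {
				return
			}
		}
	}
	s, dead := collect(seq)
	fmt.Printf("len=%d, intermediate backing arrays=%d\n", len(s), dead)
}
```

The exact number of dead intermediate arrays depends on append's growth policy, but it is roughly logarithmic in the final length.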

Note that runtime.free and the related runtime entry points are not public APIs, and cannot be called directly by end users.

Basic approach

See the more detailed write-up in the CL 673695 commit message, which covers these four runtime entry points to start:

 func free(ptr unsafe.Pointer)
 func freesized(ptr unsafe.Pointer, size uintptr, noscan bool)
 func freetracked(trackedObjs *[]trackedObj)
 func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer
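
As a rough illustration of how the tracked pair might fit together, here is a simplified stand-in (the names and types below are not the real runtime signatures, which use unsafe.Pointer and *_type): the allocating entry point appends each allocation to a caller-owned list, and a single freetracked call later releases the whole batch.

```go
package main

import "fmt"

// trackedObj is a simplified stand-in for the runtime's tracked-allocation
// record: an allocation the compiler has proven safe to free at a known point.
type trackedObj struct {
	buf  []byte // stands in for the runtime's unsafe.Pointer
	size int
}

// makeTracked stands in for makeslicetracked64: allocate a slice and append
// it to the caller's tracked list so it can be freed later in one batch.
func makeTracked(n int, tracked *[]trackedObj) []byte {
	buf := make([]byte, n)
	*tracked = append(*tracked, trackedObj{buf: buf, size: n})
	return buf
}

// freeTracked stands in for freetracked: release every tracked allocation.
// Here we only count the frees; the real runtime would return the memory
// to the allocator.
func freeTracked(tracked *[]trackedObj) (freed int) {
	freed = len(*tracked)
	*tracked = nil
	return freed
}

func main() {
	var tracked []trackedObj
	for i := 1; i <= 3; i++ {
		buf := makeTracked(i*16, &tracked)
		_ = buf // ... use buf ...
	}
	fmt.Println("freed:", freeTracked(&tracked)) // all three freed in one batch
}
```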

At least for the approach I have so far, the free work tries to take advantage of things the compiler already knows, or that it can be taught with modest effort (for example, in escape analysis).

Performance

Some preliminary benchmarks. See CL 673695 for commentary on these benchmarks.

strings.Builder.Write with runtime.freesized

                                               │     old     │                 new                 │
                                               │   sec/op    │   sec/op     vs base                │
BuildString_Builder/1Write_36Bytes_NoGrow-4           55.82n ± 2%   55.86n ± 2%        ~ (p=0.794 n=20)
BuildString_Builder/2Write_36Bytes_NoGrow-4           125.2n ± 2%   115.4n ± 1%   -7.86% (p=0.000 n=20)
BuildString_Builder/3Write_36Bytes_NoGrow-4           224.0n ± 1%   188.2n ± 2%  -16.00% (p=0.000 n=20)
BuildString_Builder/5Write_36Bytes_NoGrow-4           239.1n ± 9%   205.1n ± 1%  -14.20% (p=0.000 n=20)
BuildString_Builder/8Write_36Bytes_NoGrow-4           422.8n ± 3%   325.4n ± 1%  -23.04% (p=0.000 n=20)
BuildString_Builder/10Write_36Bytes_NoGrow-4          436.9n ± 2%   342.3n ± 1%  -21.64% (p=0.000 n=20)
BuildString_Builder/100Write_36Bytes_NoGrow-4         4.403µ ± 1%   2.381µ ± 2%  -45.91% (p=0.000 n=20)
BuildString_Builder/1000Write_36Bytes_NoGrow-4        48.28µ ± 2%   21.38µ ± 2%  -55.71% (p=0.000 n=20)

(In short, strings.Builder currently builds up an internal buffer incrementally, growing as needed. The modification introduces N-1 frees for N allocations, with the final allocation going to the user. With only one Builder.Write call there is only one allocation, so no benefit is seen on the first line, but the benefit grows to ~2x as the number of allocations per benchmark loop increases.)
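The N-1 frees pattern can be sketched with a toy builder. This is not the real strings.Builder or runtime.freesized (which user code cannot call); the sketch just counts where a compiler-inserted free of the dead old buffer could go:

```go
package main

import "fmt"

// freeCount stands in for runtime.freesized in this sketch: it counts how
// many dead intermediate buffers an explicit free could reclaim.
var freeCount int

// miniBuilder is a toy strings.Builder: on growth, the old backing buffer
// becomes garbage immediately, so (hypothetically) the runtime could free it
// instead of leaving it for the GC.
type miniBuilder struct {
	buf []byte
}

func (b *miniBuilder) Write(p []byte) {
	if len(b.buf)+len(p) > cap(b.buf) {
		newBuf := make([]byte, len(b.buf), 2*cap(b.buf)+len(p))
		copy(newBuf, b.buf)
		if b.buf != nil {
			freeCount++ // where a compiler-inserted freesized(old buffer) could go
		}
		b.buf = newBuf
	}
	b.buf = append(b.buf, p...)
}

func main() {
	var b miniBuilder
	chunk := make([]byte, 36)
	growths := 0 // growth allocations observed
	for i := 0; i < 10; i++ {
		oldCap := cap(b.buf)
		b.Write(chunk)
		if cap(b.buf) != oldCap {
			growths++
		}
	}
	// N growth allocations yield N-1 frees: the final buffer goes to the user.
	fmt.Printf("allocations: %d, frees: %d\n", growths, freeCount)
}
```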

Normal allocation (with these changes, but no free/reuse happening)

The new reuse code path for reusing a freed object is currently somewhat interwoven with the normal allocation path in runtime/malloc.go, but some initial micro-benchmarks suggest the impact of these changes on the normal allocation path when no reuse is happening could be small to negligible:

                        │     old     │                 new                 │
                        │   sec/op    │   sec/op     vs base                │
Malloc8-16                      11.01n ± 1%   10.94n ± 1%  -0.68% (p=0.038 n=20)
Malloc16-16                     17.15n ± 1%   17.05n ± 0%  -0.55% (p=0.007 n=20)
Malloc32-16                     18.65n ± 1%   18.42n ± 0%  -1.26% (p=0.000 n=20)
MallocTypeInfo8-16              18.63n ± 0%   18.36n ± 0%  -1.45% (p=0.000 n=20)
MallocTypeInfo16-16             22.32n ± 0%   22.65n ± 0%  +1.50% (p=0.000 n=20)
MallocTypeInfo32-16             23.37n ± 0%   23.89n ± 0%  +2.23% (p=0.000 n=20)
geomean                         18.02n        18.01n       -0.05%

The initial performance results seem promising, but they are still preliminary.

In general, some possible performance benefits might include:

  1. less CPU used by GC, which might matter most when CPU is constrained or exhausted (either on a sustained basis, or over some shorter but meaningful time interval).
  2. longer time between GCs, with less time overall with the write barrier enabled (helping application code execute faster).
  3. more cache-friendly allocs for user code (esp. if it ends up with LIFO or near-LIFO free/alloc).
  4. with less GC work overall, maybe fewer assists or other ways the GC can intrude on the app.
  5. maybe slightly more friendly to the new Green Tea GC if it can result in a higher density per span (speculation 😅).

The first is maybe the most obvious benefit, but it might not justify the change for apps that have spare CPU and whose GC CPU costs are almost entirely in the "background" compared to the rest of the app's work.

The third is particularly interesting because it might make this a win in CPU costs even before considering the cost of running the GC. (I suspect I've observed this effect in some of the benchmarks, but I need to look more carefully.)

One approach to mitigating the cost for applications that don't care much about GC costs today is the size-based threshold the free-related work currently uses (currently 16 bytes). That threshold could be raised to the point where the other costs of allocating (including, usually, zeroing) or operating on the result (like writing data and doing useful work) are large enough compared to the overhead of the free-related work that the free-related overhead becomes small enough. (For compiler-inserted frees of slices, if the threshold does need to increase from 16, it would be nice if it could stay low enough to match the 32-byte threshold for the new Go 1.25 stack allocation optimization introduced by Keith, or maybe that 32-byte threshold ends up being higher.) Also, the current cost in the prototype can likely be made cheaper.
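The threshold idea amounts to a simple gate in the free path. A minimal sketch, using the 16-byte value mentioned above (the function name is hypothetical, not a real runtime identifier):

```go
package main

import "fmt"

// freeSizeThreshold mirrors the prototype's current 16-byte cutoff described
// above. Raising it trades fewer reclaimed objects for lower free-path
// overhead on small allocations, where zeroing and useful work dominate.
const freeSizeThreshold = 16

// shouldFree is a hypothetical gate: only objects at or above the threshold
// are worth the bookkeeping of an explicit free.
func shouldFree(size uintptr) bool {
	return size >= freeSizeThreshold
}

func main() {
	for _, size := range []uintptr{8, 16, 64} {
		fmt.Printf("size %3d: free eligible = %v\n", size, shouldFree(size))
	}
}
```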

Another approach to manage the costs could be via PGO, though I'm hopeful a first version could be viable without requiring PGO.

That said, it might be easier to sort through some of the performance trade-offs with more concrete performance examples, which is one of the main goals of the current prototype.


There are multiple ways that runtime.free and related entry points could be used, but I'm hoping this issue could be focused on some possible first steps.

In any event, this is still exploratory. @mknyszek took a look at the code about a month ago and gave some helpful initial feedback. Absent someone pointing out a fatal flaw in the overall approach, I plan to polish things up some more and get more feedback. It might be that this exact approach does not pan out, but maybe we'll learn something via the exploration.

Labels: Implementation, compiler/runtime
