
runtime, cmd/compile: consider runtime.free, runtime.freetracked and GOEXPERIMENT=runtimefree #74299

@thepudds

Description


I have a WIP CL 673695 with a better and more complete description, but I am opening this issue here to help with discussion.

Background

CL 673695 starts with:

For reducing allocations, there is sometimes a chicken and egg problem.

For example, #72036 is a WIP approach to update escape analysis so that interface arguments do not always escape, with a primary target of w.Write(buf) and friends. However, many interesting byte slices have a size known only at runtime, and a make with an unknown capacity (or one beyond 32 bytes) currently causes the slice to be heap allocated. Solely improving escape analysis in this case therefore has limited impact on actual applications -- if a slice is heap allocated for two reasons, eliminating one reason just means it is still heap allocated for the other, which lessens the value of doing the work to eliminate the first reason.

In an attempt to help triangulate what is possible for the runtime, compiler, and standard library to do together to reduce how much work the GC must do, this CL has a proof-of-concept set of runtime.free implementations.

Back in February, as part of my write-up for #72036, I mentioned that a runtime.free could help with this chicken and egg problem:

I have a mostly working runtime.free that can be inserted by the compiler (and not for use by normal Go code, of course), and I'm looking at hooking it up a few different ways, including when a slice is both variable size and also used as an interface method arg. (I'll probably open a separate issue on this, but I'm also attempting to free memory in some runtime map growth cases, some append cases, etc.)

Prototype

CL 673695 contains that mostly working runtime.free. (It's "mostly working" in the sense that it passes its own tests, including that I have let it accumulate a few hundred CPU hours running those tests successfully under stress. That of course does not mean it is bug free, and it still takes some shortcuts, including, for example, in how it updates user-facing stats.)

I also have some additional WIP CLs that update the compiler to automatically insert free calls for slices allocated via make when the compiler can prove it is safe to do so. Separately, they update maps to free map storage during growth and split, as well as a small number of lower-level stdlib types like strings.Builder and bytes.Buffer (which are "low level" in the sense that they already encapsulate some low-level performance tricks, including use of unsafe, but also let other higher-level pieces of the stdlib safely benefit from their performance). It is not intended that higher-level portions of the stdlib (like net/http) would have manual calls to runtime.free.

Patterns of append usage like slices.Collect and slices.AppendSeq are also targets, with some good initial results. (slices.Collect, for example, could be updated manually to start, which I've tried as an experiment, or it could wait for the compiler to automatically recognize its pattern.)
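The append-driven growth inside slices.Collect can be sketched with a toy collect that counts how many intermediate backing arrays die during growth. The names here are illustrative only; real user code cannot call runtime.free, so the sketch merely counts where a compiler-recognized free could reclaim memory:

```go
package main

import "fmt"

// collect mimics slices.Collect's append-driven growth for a sequence of
// unknown length. Each time append grows the slice, the previous backing
// array dies immediately -- exactly the pattern where a compiler-inserted
// free could reclaim it instead of leaving it for the GC.
func collect(seq func(yield func(int) bool)) (result []int, deadArrays int) {
	seq(func(v int) bool {
		oldCap := cap(result)
		result = append(result, v)
		if cap(result) != oldCap && oldCap != 0 {
			deadArrays++ // the old backing array is now garbage
		}
		return true
	})
	return result, deadArrays
}

func main() {
	// A sequence of 1000 ints whose length the compiler cannot know.
	seq := func(yield func(int) bool) {
		for i := 0; i < 1000; i++ {
			if !yield(i) {
				return
			}
		}
	}
	s, dead := collect(seq)
	fmt.Printf("len=%d, intermediate backing arrays=%d\n", len(s), dead)
}
```

The exact number of dead intermediate arrays depends on append's growth policy, but it is roughly logarithmic in the final length.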

Note that runtime.free and the related runtime entry points are not public APIs, and cannot be called directly by end users.

Basic approach

See the more detailed write-up in the CL 673695 commit message, which covers these four runtime entry points to start:

 func free(ptr unsafe.Pointer)
 func freesized(ptr unsafe.Pointer, size uintptr, noscan bool)
 func freetracked(trackedObjs *[]trackedObj)
 func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer
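
As a rough illustration of how the tracked pair might fit together, here is a simplified stand-in (the names and types below are not the real runtime signatures, which use unsafe.Pointer and *_type): the allocating entry point appends each allocation to a caller-owned list, and a single freetracked call later releases the whole batch.

```go
package main

import "fmt"

// trackedObj is a simplified stand-in for the runtime's tracked-allocation
// record: an allocation the compiler has proven safe to free at a known point.
type trackedObj struct {
	buf  []byte // stands in for the runtime's unsafe.Pointer
	size int
}

// makeTracked stands in for makeslicetracked64: allocate a slice and append
// it to the caller's tracked list so it can be freed later in one batch.
func makeTracked(n int, tracked *[]trackedObj) []byte {
	buf := make([]byte, n)
	*tracked = append(*tracked, trackedObj{buf: buf, size: n})
	return buf
}

// freeTracked stands in for freetracked: release every tracked allocation.
// Here we only count the frees; the real runtime would return the memory
// to the allocator.
func freeTracked(tracked *[]trackedObj) (freed int) {
	freed = len(*tracked)
	*tracked = nil
	return freed
}

func main() {
	var tracked []trackedObj
	for i := 1; i <= 3; i++ {
		buf := makeTracked(i*16, &tracked)
		_ = buf // ... use buf ...
	}
	fmt.Println("freed:", freeTracked(&tracked)) // all three freed in one batch
}
```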

At least for the approach I have so far, the free work tries to take advantage of things the compiler already knows, or that it can be taught with modest effort (for example, in escape analysis).

Performance

Some preliminary benchmarks. See CL 673695 for commentary on these benchmarks.

strings.Builder.Write with runtime.freesized

                                               │     old     │                 new                 │
                                               │   sec/op    │   sec/op     vs base                │
BuildString_Builder/1Write_36Bytes_NoGrow-4           55.82n ± 2%   55.86n ± 2%        ~ (p=0.794 n=20)
BuildString_Builder/2Write_36Bytes_NoGrow-4           125.2n ± 2%   115.4n ± 1%   -7.86% (p=0.000 n=20)
BuildString_Builder/3Write_36Bytes_NoGrow-4           224.0n ± 1%   188.2n ± 2%  -16.00% (p=0.000 n=20)
BuildString_Builder/5Write_36Bytes_NoGrow-4           239.1n ± 9%   205.1n ± 1%  -14.20% (p=0.000 n=20)
BuildString_Builder/8Write_36Bytes_NoGrow-4           422.8n ± 3%   325.4n ± 1%  -23.04% (p=0.000 n=20)
BuildString_Builder/10Write_36Bytes_NoGrow-4          436.9n ± 2%   342.3n ± 1%  -21.64% (p=0.000 n=20)
BuildString_Builder/100Write_36Bytes_NoGrow-4         4.403µ ± 1%   2.381µ ± 2%  -45.91% (p=0.000 n=20)
BuildString_Builder/1000Write_36Bytes_NoGrow-4        48.28µ ± 2%   21.38µ ± 2%  -55.71% (p=0.000 n=20)

(In short, strings.Builder currently builds up an internal buffer incrementally, growing as needed. The modification introduces N-1 frees for N allocations, with the final allocation going to the user. With only one Builder.Write call there is only one allocation, so no benefit is seen on the first line, but the benefit grows to ~2x as the number of allocations per benchmark loop increases.)
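The N-1 frees pattern can be sketched with a toy builder. This is not the real strings.Builder or runtime.freesized (which user code cannot call); the sketch just counts where a compiler-inserted free of the dead old buffer could go:

```go
package main

import "fmt"

// freeCount stands in for runtime.freesized in this sketch: it counts how
// many dead intermediate buffers an explicit free could reclaim.
var freeCount int

// miniBuilder is a toy strings.Builder: on growth, the old backing buffer
// becomes garbage immediately, so (hypothetically) the runtime could free it
// instead of leaving it for the GC.
type miniBuilder struct {
	buf []byte
}

func (b *miniBuilder) Write(p []byte) {
	if len(b.buf)+len(p) > cap(b.buf) {
		newBuf := make([]byte, len(b.buf), 2*cap(b.buf)+len(p))
		copy(newBuf, b.buf)
		if b.buf != nil {
			freeCount++ // where a compiler-inserted freesized(old buffer) could go
		}
		b.buf = newBuf
	}
	b.buf = append(b.buf, p...)
}

func main() {
	var b miniBuilder
	chunk := make([]byte, 36)
	growths := 0 // growth allocations observed
	for i := 0; i < 10; i++ {
		oldCap := cap(b.buf)
		b.Write(chunk)
		if cap(b.buf) != oldCap {
			growths++
		}
	}
	// N growth allocations yield N-1 frees: the final buffer goes to the user.
	fmt.Printf("allocations: %d, frees: %d\n", growths, freeCount)
}
```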

Normal allocation (with these changes, but no free/reuse happening)

The new reuse code path for reusing a freed object is currently somewhat interwoven with the normal allocation path in runtime/malloc.go, but some initial micro-benchmarks suggest the impact of these changes on the normal allocation path when no reuse is happening could be small to negligible:

                        │     old     │                 new                 │
                        │   sec/op    │   sec/op     vs base                │
Malloc8-16                      11.01n ± 1%   10.94n ± 1%  -0.68% (p=0.038 n=20)
Malloc16-16                     17.15n ± 1%   17.05n ± 0%  -0.55% (p=0.007 n=20)
Malloc32-16                     18.65n ± 1%   18.42n ± 0%  -1.26% (p=0.000 n=20)
MallocTypeInfo8-16              18.63n ± 0%   18.36n ± 0%  -1.45% (p=0.000 n=20)
MallocTypeInfo16-16             22.32n ± 0%   22.65n ± 0%  +1.50% (p=0.000 n=20)
MallocTypeInfo32-16             23.37n ± 0%   23.89n ± 0%  +2.23% (p=0.000 n=20)
geomean                         18.02n        18.01n       -0.05%

The initial performance results seem promising, but they are still preliminary.

In general, some possible performance benefits might include:

  1. less CPU used by GC, which might matter most when CPU is constrained or exhausted (either on a sustained basis, or over some shorter but meaningful time interval).
  2. longer time between GCs, with less time overall with the write barrier enabled (helping application code execute faster).
  3. more cache-friendly allocs for user code (esp. if it ends up with LIFO or near-LIFO free/alloc).
  4. with less GC work overall, maybe fewer assists or other ways the GC can intrude on the app.
  5. maybe slightly more friendly to the new Green Tea GC if it can result in a higher density per span (speculation 😅).

The first is maybe the most obvious benefit, but it might not justify the change for apps that have spare CPU and whose GC CPU costs are almost entirely in the "background" compared to the rest of the app's work.

The third is particularly interesting because it might make this a win in CPU costs even before considering the cost of running the GC. (I suspect I've observed this effect in some of the benchmarks, but I need to look more carefully.)

One approach to mitigating the cost for applications that don't care much about GC costs today is the size-based threshold the free-related work currently uses (currently 16 bytes). That threshold could be raised to the point where the other costs of allocating (including, usually, zeroing) or operating on the result (like writing data and doing useful work) are large enough compared to the overhead of the free-related work that the free-related overhead becomes small enough. (For compiler-inserted frees of slices, if the threshold does need to increase from 16, it would be nice if it could stay low enough to match the 32-byte threshold for the new Go 1.25 stack allocation optimization introduced by Keith, or maybe that 32-byte threshold ends up being higher.) Also, the current cost in the prototype can likely be made cheaper.
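The threshold idea amounts to a simple gate in the free path. A minimal sketch, using the 16-byte value mentioned above (the function name is hypothetical, not a real runtime identifier):

```go
package main

import "fmt"

// freeSizeThreshold mirrors the prototype's current 16-byte cutoff described
// above. Raising it trades fewer reclaimed objects for lower free-path
// overhead on small allocations, where zeroing and useful work dominate.
const freeSizeThreshold = 16

// shouldFree is a hypothetical gate: only objects at or above the threshold
// are worth the bookkeeping of an explicit free.
func shouldFree(size uintptr) bool {
	return size >= freeSizeThreshold
}

func main() {
	for _, size := range []uintptr{8, 16, 64} {
		fmt.Printf("size %3d: free eligible = %v\n", size, shouldFree(size))
	}
}
```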

Another approach to manage the costs could be via PGO, though I'm hopeful a first version could be viable without requiring PGO.

That said, it might be easier to sort through some of the performance trade-offs with more concrete performance examples, which is one of the main goals of the current prototype.


There are multiple ways that runtime.free and related entry points could be used, but I'm hoping this issue could be focused on some possible first steps.

In any event, this is still exploratory. @mknyszek took a look at the code about a month ago and gave some helpful initial feedback. Absent someone pointing out a fatal flaw in the overall approach, I plan to polish things up some more and get more feedback. It might be that this exact approach does not pan out, but maybe we'll learn something via the exploration.

Labels: Implementation, compiler/runtime
