
Conversation

Contributor

@arjan-bal arjan-bal commented Dec 22, 2025

Problem

Splitting a buffer fetches a new buffer object from a sync.Pool, but a buffer object is returned to the pool only once the shared ref count falls to 0. As a result, only one of the two buffer objects is returned for re-use. The "leaked" buffer objects can cause noticeable allocations when buffers are split frequently. I noticed this while attempting to remove a buffer copy by replacing the bufio.Reader.

Solution

This PR introduces a root-owner model for the underlying *[]byte within buffer objects. The root object manages the slice's lifecycle, returning it to the pool only when its reference count reaches zero.

When a buffer is split, the new buffer is treated as a child: it gets its own ref count and increments the root's count. Once a child's ref count hits zero, it returns itself to the pool and decrements the root's count.

Additionally, this PR removes the sync.Pool used for *atomic.Int32 by embedding atomic.Int32 as a value field within the buffer struct. By eliminating the second pool and the associated pointer indirection, we reduce allocation overhead and improve cache locality during buffer lifecycle events.
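
To make the model concrete, here is a minimal sketch under these assumptions (the type and field names like buffer and origin, and the pooling helpers, are illustrative rather than the actual mem package internals):

package main

import (
	"sync"
	"sync/atomic"
)

// bufferObjectPool recycles the buffer structs themselves.
var bufferObjectPool = sync.Pool{New: func() any { return new(buffer) }}

type buffer struct {
	data   *[]byte
	pool   *sync.Pool   // pool that owns the backing slice
	refs   atomic.Int32 // stored by value: no second sync.Pool, no pointer chase
	origin *buffer      // root that owns *data; nil when this buffer is the root
}

// free drops one reference. A child recycles its own struct as soon as its
// count hits zero and then releases the reference it holds on the root; the
// root returns the slice to its pool only when every reference is gone.
func (b *buffer) free() {
	if b.refs.Add(-1) > 0 {
		return
	}
	root, data, pool := b.origin, b.data, b.pool
	b.data, b.pool, b.origin = nil, nil, nil
	bufferObjectPool.Put(b)
	if root != nil {
		root.free() // child: drop the reference held on the root
	} else {
		pool.Put(data) // root: the slice's lifecycle ends here
	}
}

func main() {
	slicePool := &sync.Pool{New: func() any { s := make([]byte, 32<<10); return &s }}

	root := bufferObjectPool.Get().(*buffer)
	root.data, root.pool = slicePool.Get().(*[]byte), slicePool
	root.refs.Store(1)

	// Split: the child gets its own count and takes one reference on the root.
	child := bufferObjectPool.Get().(*buffer)
	child.data, child.origin = root.data, root
	child.refs.Store(1)
	root.refs.Add(1)

	child.free() // child struct is pooled immediately; root count drops to 1
	root.free()  // count reaches zero: struct and slice both return to pools
}

Because the child struct is recycled as soon as its own count drops to zero, a split no longer strands a buffer object outside the pool.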

Benchmarks

A micro-benchmark showing the buffer object leak:

package mem_test

import (
	"testing"

	"google.golang.org/grpc/mem"
)

func BenchmarkSplit(b *testing.B) {
	pool := mem.DefaultBufferPool()
	b.Run("split", func(b *testing.B) {
		for b.Loop() {
			size := 1 << 15 // 32 KB
			slice := pool.Get(size)
			buf := mem.NewBuffer(slice, pool)
			left, right := mem.SplitUnsafe(buf, size/2)
			left.Free()
			right.Free()
		}
	})
	b.Run("no-split", func(b *testing.B) {
		for b.Loop() {
			size := 1 << 15 // 32 KB
			slice := pool.Get(size)
			buf := mem.NewBuffer(slice, pool)
			buf.Free()
		}
	})
}
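
For reference, the tables below can be reproduced with an invocation along these lines (hypothetical flags; the exact command used may differ):

$ go test ./mem -run='^$' -bench=BenchmarkSplit -benchmem -count=10 > old.txt   # on master
$ go test ./mem -run='^$' -bench=BenchmarkSplit -benchmem -count=10 > new.txt   # on this PR
$ benchstat old.txt new.txt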

Results on master vs. this PR:

goos: linux
goarch: amd64
pkg: google.golang.org/grpc/mem
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
                  │   old.txt   │               new.txt               │
                  │   sec/op    │   sec/op     vs base                │
Split/split-48      418.2n ± 0%   263.9n ± 1%  -36.89% (p=0.000 n=10)
Split/no-split-48   221.1n ± 1%   208.5n ± 0%   -5.70% (p=0.000 n=10)
geomean             304.1n        234.6n       -22.86%

                  │   old.txt    │                 new.txt                 │
                  │     B/op     │    B/op     vs base                     │
Split/split-48      64.00 ± 0%      0.00 ± 0%  -100.00% (p=0.000 n=10)
Split/no-split-48   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
geomean                        ²               ?                       ² ³
¹ all samples are equal
² summaries must be >0 to compute geomean
³ ratios must be >0 to compute geomean

                  │   old.txt    │                 new.txt                 │
                  │  allocs/op   │ allocs/op   vs base                     │
Split/split-48      1.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)
Split/no-split-48   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
geomean                        ²               ?                       ² ³
¹ all samples are equal
² summaries must be >0 to compute geomean
³ ratios must be >0 to compute geomean

The effect on local gRPC benchmarks is negligible since the SplitUnsafe function isn't called very frequently.

$ go run benchmark/benchresult/main.go unary-before unary-after       
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_120-reqSize_16000B-respSize_16000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
               Title       Before        After Percentage
            TotalOps      2985694      3024364     1.30%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     74784.94     74784.99     0.00%
           Allocs/op       133.67       133.89     0.00%
             ReqT/op 6369480533.33 6451976533.33     1.30%
            RespT/op 6369480533.33 6451976533.33     1.30%
            50th-Lat   2.410033ms    2.40116ms    -0.37%
            90th-Lat   3.145118ms   3.081771ms    -2.01%
            99th-Lat   3.563055ms   3.629663ms     1.87%
             Avg-Lat   2.410529ms   2.379513ms    -1.29%
           GoVersion     go1.24.8     go1.24.8
         GrpcVersion   1.78.0-dev   1.78.0-dev

RELEASE NOTES:

  • mem: Improve pooling of buffer objects when using SplitUnsafe.

@arjan-bal arjan-bal added this to the 1.79 Release milestone Dec 22, 2025
@arjan-bal arjan-bal added Type: Performance Performance improvements (CPU, network, memory, etc) Area: Transport Includes HTTP/2 client/server and HTTP server handler transports and advanced transport features. labels Dec 22, 2025
codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 86.36364% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.27%. Comparing base (81a00ce) to head (fbbf808).
⚠️ Report is 22 commits behind head on master.

Files with missing lines   Patch %   Lines
mem/buffers.go             86.36%    1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8784      +/-   ##
==========================================
+ Coverage   83.22%   83.27%   +0.05%     
==========================================
  Files         418      418              
  Lines       32385    33005     +620     
==========================================
+ Hits        26952    27485     +533     
- Misses       4050     4106      +56     
- Partials     1383     1414      +31     
Files with missing lines   Coverage Δ
mem/buffers.go             87.00% <86.36%> (+0.13%) ⬆️

... and 46 files with indirect coverage changes


@arjan-bal arjan-bal force-pushed the better-recycle-buffer-objects branch from 45c3231 to e1a28ac on December 22, 2025 10:11
@arjan-bal arjan-bal requested review from dfawley and easwars December 22, 2025 10:38
@arjan-bal arjan-bal assigned easwars, dfawley and arjan-bal and unassigned easwars and dfawley Dec 22, 2025
@arjan-bal arjan-bal force-pushed the better-recycle-buffer-objects branch 2 times, most recently from 7619ed9 to c33b2ae on December 23, 2025 09:42
@arjan-bal arjan-bal force-pushed the better-recycle-buffer-objects branch from c33b2ae to 3331987 on December 23, 2025 09:56
@arjan-bal arjan-bal changed the title mem: Correctly recycle buffer objects after SplitUnsafe mem: Optimize buffer re-use Dec 23, 2025
@arjan-bal arjan-bal changed the title mem: Optimize buffer re-use mem: Optimize buffer object re-use Dec 23, 2025
@arjan-bal arjan-bal assigned easwars and dfawley and unassigned arjan-bal Dec 26, 2025
Comment on lines +76 to +79
// initialized enables sanity checks without the overhead of atomic
// operations. This field is not safe for concurrent access and is used in a
// best-effort manner for assertion purposes only. It does not play a role
// in the concurrent logic of reference counting.
Contributor

Couple of things here:

  • The Buffer interface documentation states that a buffer is not safe for concurrent access. Given that, do we need to explicitly mention this here?
  • Do you have an idea of how much overhead the atomic operation of checking if the ref count is zero causes? The reason I'm asking is because this new field (and the checks associated with it) are sprinkled across multiple methods and I'm wondering if the code complexity (and the maintenance costs) are worth it?

Contributor

I'm also a little confused about this line from the docstring:

// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().

A call to Ref simply increments the reference count. It does not return a new reference to the existing buffer that can be used concurrently. Do we ever use buffers concurrently?

Also, why did we earlier have a pointer to an atomic and not store the atomic by value?

Contributor Author

The Buffer interface documentation states that a buffer is not safe for concurrent access. Given that, do we need to explicitly mention this here?

A call to Ref simply increments the reference count. It does not return a new reference to the existing buffer that can be used concurrently. Do we ever use buffers concurrently?

In the initial design, buf.Ref() likely returned a new object intended to be transferred to a separate goroutine:

ref := buf.Ref()
go func() {
  // use ref here
}()
buf.Free()

However, in the merged implementation, Ref does not return a new object. So, the usage pattern becomes:

buf.Ref()
go func() {
  // use buf here
}()
buf.Free()

Technically, this implies buf is being accessed concurrently. However, the specific pattern that is unsafe is attempting to reference buf in a new goroutine without incrementing the count first:

go func() {
  // Unsafe: races with buf.Free() below.
  buf.Ref()
}()
buf.Free()

Source: #8209 (comment)

Yes, we do follow the safe pattern above by pushing data frame buffers into an unbounded channel to be consumed by another goroutine.
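
For illustration, that handoff looks roughly like this (the mem.Buffer methods are real; the channel and consumer are hypothetical stand-ins for the transport code):

bufCh := make(chan mem.Buffer, 1024) // stand-in for the unbounded channel

buf.Ref() // take the consumer's reference before handing off
bufCh <- buf
go func() {
  b := <-bufCh
  _ = b.ReadOnlyData() // safe: this goroutine holds its own reference
  b.Free()             // releases the consumer's reference
}()
buf.Free() // releases only the sender's reference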

Contributor Author

Do you have an idea of how much overhead the atomic operation of checking if the ref count is zero causes? The reason I'm asking is because this new field (and the checks associated with it) are sprinkled across multiple methods and I'm wondering if the code complexity (and the maintenance costs) are worth it?

Earlier there was an if b.refs == nil check, which is not possible with a non-pointer field. Using initialized preserves that sanity check and its test coverage.

Some methods such as Ref and Free perform atomic operations anyway, so we can validate using their return values. However, for methods like ReadData that don't perform atomic operations, adding one would be significant. According to Gemini, an atomic operation is roughly 10x-15x slower than an equivalent non-atomic operation under low contention, and the difference becomes orders of magnitude larger under high contention.
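
For concreteness, the kind of cheap assertion this enables looks roughly like the following (hypothetical shape; the actual checks in mem/buffers.go may differ):

type buffer struct {
  data *[]byte
  // initialized is a plain bool: reading it is an ordinary load rather
  // than an atomic one, and it is used only for best-effort assertions.
  initialized bool
}

func (b *buffer) ReadOnlyData() []byte {
  if !b.initialized {
    panic("Cannot read freed buffer")
  }
  return *b.data
}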

Contributor Author

Also, why did we earlier have a pointer to an atomic and not store the atomic by value?

Previously, the new buffer created by SplitUnsafe pointed to the same atomic.Int32 as the original buffer, which required the field to be a pointer. Now the new object maintains its own ref count and stores a pointer to the original buffer instead, so the reference count (atomic.Int32) no longer needs to be a pointer.
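
Schematically, with illustrative field names:

// Before: both halves of a split shared one heap-allocated counter,
// drawn from a second sync.Pool.
type bufferBefore struct {
  data *[]byte
  refs *atomic.Int32
}

// After: each buffer carries its own count by value; a child points at
// the root that owns the slice instead of sharing the root's counter.
type bufferAfter struct {
  data   *[]byte
  refs   atomic.Int32
  origin *bufferAfter // nil for the root
}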

Contributor

Thank you for the information. That helps.

I would still like to see whether there is actually any significant performance improvement from having the initialized field. The if b.refs == nil check could also be replaced with an if b.refs.Load() == 0 check if the performance impact is insignificant.

@easwars easwars assigned arjan-bal and unassigned easwars and dfawley Jan 6, 2026
@arjan-bal arjan-bal removed their assignment Jan 6, 2026
Contributor

@easwars easwars left a comment

The only thing I still need convincing about is the use of the initialized field, as opposed to directly checking the reference count with an atomic read. Otherwise LGTM.

@easwars easwars assigned arjan-bal and unassigned easwars Jan 7, 2026
