mem: Optimize buffer object re-use #8784
Conversation
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #8784      +/-   ##
==========================================
+ Coverage   83.22%   83.27%   +0.05%
==========================================
  Files         418      418
  Lines       32385    33005     +620
==========================================
+ Hits        26952    27485     +533
- Misses       4050     4106      +56
- Partials     1383     1414      +31
```
```go
// initialized enables sanity checks without the overhead of atomic
// operations. This field is not safe for concurrent access and is used in a
// best-effort manner for assertion purposes only. It does not play a role
// in the concurrent logic of reference counting.
```
Couple of things here:

- The `Buffer` interface documentation states that a buffer is not safe for concurrent access. Given that, do we need this to be mentioned here?
- Do you have an idea of how much overhead the atomic operation of checking if the ref count is zero causes? The reason I'm asking is because this new field (and the checks associated with it) are sprinkled across multiple methods, and I'm wondering if the code complexity (and the maintenance costs) are worth it?
I'm also a little confused about this line from the docstring:

```go
// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().
```

A call to `Ref` simply increments the reference count. It does not return a new reference to the existing buffer that can be used concurrently. Do we ever use buffers concurrently?

Also, why did we earlier have a pointer to an atomic and not store the atomic by value?
> The `Buffer` interface documentation states that a buffer is not safe for concurrent access. Given that, do we need to explicitly mention this here?

> A call to `Ref` simply increments the reference count. It does not return a new reference to the existing buffer that can be used concurrently. Do we ever use buffers concurrently?
In the initial design, `buf.Ref()` likely returned a new object intended to be transferred to a separate goroutine:

```go
ref := buf.Ref()
go func() {
	// use ref here
}()
buf.Free()
```

However, in the merged implementation, `Ref` does not return a new object. So, the usage pattern becomes:

```go
buf.Ref()
go func() {
	// use buf here
}()
buf.Free()
```

Technically, this implies `buf` is being accessed concurrently. However, the specific pattern that is unsafe is attempting to reference `buf` in a new goroutine without incrementing the count first:

```go
go func() {
	// Unsafe: race condition with buf.Free() below
	ref := buf.Ref()
}()
buf.Free()
```

Source: #8209 (comment)
Yes, we do follow the safe pattern above by pushing data frame buffers into an unbounded channel to be consumed by another goroutine.
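The safe handoff pattern described above can be sketched with a minimal refcounted type. The names here are illustrative only; this is not grpc-go's actual `mem.Buffer` implementation:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// buffer is an illustrative stand-in for a refcounted buffer: Ref
// increments the count, Free decrements it and releases the data when
// the count reaches zero.
type buffer struct {
	data []byte
	refs atomic.Int32
}

func newBuffer(data []byte) *buffer {
	b := &buffer{data: data}
	b.refs.Store(1)
	return b
}

func (b *buffer) Ref() { b.refs.Add(1) }

func (b *buffer) Free() {
	if b.refs.Add(-1) == 0 {
		b.data = nil // last reference: release the backing slice
	}
}

func main() {
	buf := newBuffer([]byte("hello"))

	var wg sync.WaitGroup
	wg.Add(1)
	buf.Ref() // take the extra reference BEFORE starting the goroutine
	go func() {
		defer wg.Done()
		defer buf.Free()
		fmt.Println("consumer sees", len(buf.data), "bytes")
	}()

	buf.Free() // producer drops its reference; data stays alive for the consumer
	wg.Wait() // prints: consumer sees 5 bytes
}
```

Because the producer's `Free` only drops the count to 1, the data cannot be released while the consumer goroutine still holds its reference, which is exactly why the `Ref` must happen before the `go` statement.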
> Do you have an idea of how much overhead the atomic operation of checking if the ref count is zero causes? The reason I'm asking is because this new field (and the checks associated with it) are sprinkled across multiple methods and I'm wondering if the code complexity (and the maintenance costs) are worth it?
Earlier there was a check `if b.refs == nil`, which is not possible with a non-pointer field. Using `initialized` preserves that test coverage.

There are some methods such as `Ref` and `Free` which perform atomic operations anyway, so we can check the return value for validation. However, for methods like `ReadData` that don't perform atomic operations, the overhead is significant. According to Gemini, an atomic operation is roughly 10x-15x slower than a comparable non-atomic operation under low contention, and the difference becomes orders of magnitude larger under high contention.
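To quantify that trade-off, a rough harness along the following lines (a sketch, not the project's benchmark suite) compares a plain field check against the equivalent atomic load. Actual numbers vary heavily by architecture and contention, so measuring locally is the only reliable answer:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

var (
	initialized bool         // plain field, like the PR's sanity-check flag
	refs        atomic.Int32 // atomic ref count
)

// benchPlain measures a plain (non-atomic) field check in a tight loop.
func benchPlain() testing.BenchmarkResult {
	initialized = true
	return testing.Benchmark(func(b *testing.B) {
		hits := 0
		for i := 0; i < b.N; i++ {
			if initialized {
				hits++
			}
		}
		_ = hits
	})
}

// benchAtomic measures the equivalent check via an atomic load.
func benchAtomic() testing.BenchmarkResult {
	refs.Store(1)
	return testing.Benchmark(func(b *testing.B) {
		hits := 0
		for i := 0; i < b.N; i++ {
			if refs.Load() > 0 {
				hits++
			}
		}
		_ = hits
	})
}

func main() {
	fmt.Println("plain check: ", benchPlain())
	fmt.Println("atomic check:", benchAtomic())
}
```

Note that on x86 an uncontended atomic load compiles to a plain load, so the gap shows up mostly under contention or on weaker memory models; the compiler may also hoist the plain check out of the loop entirely.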
> Also, why did we earlier have a pointer to an atomic and not store the atomic by value?
Previously, the new buffer created by `SplitUnsafe` pointed to the same `atomic.Int32` as the original buffer, which required the field to be a pointer. Now, the new object maintains its own ref count and stores a pointer to the original buffer instead. Therefore, the reference count (`atomic.Uint32`) no longer needs to be a pointer.
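As an illustration of that design, here is a minimal sketch of the child-pointing-to-root scheme with value-typed ref counts. The names and layout are hypothetical, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// buffer holds its own value-typed ref count; a child produced by a
// split additionally holds a pointer to its root.
type buffer struct {
	data []byte
	refs atomic.Int32
	root *buffer // nil for the root itself
}

func newBuffer(data []byte) *buffer {
	b := &buffer{data: data}
	b.refs.Store(1)
	return b
}

// split hands the tail to a new child buffer and pins the root with an
// extra reference so the backing slice outlives every child.
func (b *buffer) split(n int) *buffer {
	root := b
	if b.root != nil {
		root = b.root
	}
	root.refs.Add(1) // the child keeps the root (and its slice) alive
	c := &buffer{data: b.data[n:], root: root}
	c.refs.Store(1)
	b.data = b.data[:n]
	return c
}

func (b *buffer) free() {
	if b.refs.Add(-1) != 0 {
		return
	}
	if b.root != nil {
		// A child returns itself (to a pool, in the real code) and
		// then drops its reference on the root.
		b.root.free()
		return
	}
	b.data = nil // root: release the backing slice for re-use
}

func main() {
	root := newBuffer(make([]byte, 8))
	child := root.split(4)
	root.free() // root's own ref is gone, but the child still pins it
	fmt.Println("root alive:", root.data != nil) // prints: root alive: true
	child.free() // drops the child, then the root's last reference
	fmt.Println("root alive:", root.data != nil) // prints: root alive: false
}
```

Since each `buffer` owns its count by value, no separate `*atomic.Int32` allocation (or pool for it) is needed.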
Thank you for the information. That helps.

I would still like to see if there is actually any significant performance improvement from having the `initialized` field. The `if b.refs == nil` check could also be replaced with `if b.refs.Load() == 0` if there is no significant performance impact.
easwars left a comment:
The only thing I need convincing about is the use of the `initialized` field, as opposed to directly checking the reference count with an atomic read of the value. Otherwise LGTM.
Splitting a `buffer` results in fetching a new `buffer` object from a `sync.Pool`. The `buffer` object is returned back to the pool only once the shared ref count falls to 0. As a result, only one of the `buffer` objects is returned back to the pool for re-use. The "leaked" buffer objects may cause noticeable allocations when buffers are split more frequently. I noticed this when attempting to remove a buffer copy by replacing the `bufio.Reader`.

Solution

This PR introduces a root-owner model for the underlying `*[]byte` within `buffer` objects. The root object manages the slice's lifecycle, returning it to the pool only when its reference count reaches zero.

When a `buffer` is split, the new `buffer` is treated as a child, incrementing the ref counts for both itself and the root. Once a child's ref count hits zero, it returns itself to the pool and decrements the root's count.

Additionally, this PR removes the `sync.Pool` used for `*atomic.Int32` by embedding `atomic.Int32` as a value field within the `buffer` struct. By eliminating the second pool and the associated pointer indirection, we reduce allocation overhead and improve cache locality during buffer lifecycle events.

Benchmarks
A micro-benchmark showing the buffer object leak:
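The benchmark source itself is collapsed in the original page. As a stand-in, the following hypothetical sketch shows how the "leak" surfaces as repeated `sync.Pool` allocations when only one of two fetched objects is ever returned (plain `*int`s stand in for pooled `*buffer` objects):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// newCalls counts how often the pool has to allocate a fresh object.
var newCalls atomic.Int64

var bufPool = sync.Pool{New: func() any {
	newCalls.Add(1)
	return new(int) // stand-in for a pooled *buffer object
}}

// splitOld models the pre-PR behavior: a split fetches a second pooled
// object, but only one of the two is ever returned to the pool.
func splitOld() {
	a := bufPool.Get().(*int)
	b := bufPool.Get().(*int)
	_ = a // the original object is never Put back
	bufPool.Put(b)
}

// splitNew models the root-owner fix: both objects are returned to the
// pool once their ref counts drop to zero.
func splitNew() {
	a := bufPool.Get().(*int)
	b := bufPool.Get().(*int)
	bufPool.Put(a)
	bufPool.Put(b)
}

// allocsOver runs fn iters times and reports how many fresh
// allocations the pool performed.
func allocsOver(fn func(), iters int) int64 {
	newCalls.Store(0)
	for i := 0; i < iters; i++ {
		fn()
	}
	return newCalls.Load()
}

func main() {
	fmt.Println("leaky pattern, pool allocations:", allocsOver(splitOld, 1000))
	fmt.Println("fixed pattern, pool allocations:", allocsOver(splitNew, 1000))
}
```

With the leaky pattern, each iteration loses one object, so the pool keeps allocating; with the fixed pattern the pool reaches a steady state after the first iteration or two.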
Result on master vs this PR.
```
goos: linux
goarch: amd64
pkg: google.golang.org/grpc/mem
cpu: Intel(R) Xeon(R) CPU @ 2.60GHz
                    │   old.txt   │               new.txt               │
                    │   sec/op    │    sec/op     vs base               │
Split/split-48        418.2n ± 0%   263.9n ± 1%  -36.89% (p=0.000 n=10)
Split/no-split-48     221.1n ± 1%   208.5n ± 0%   -5.70% (p=0.000 n=10)
geomean               304.1n        234.6n       -22.86%

                    │   old.txt   │               new.txt                │
                    │    B/op     │    B/op      vs base                 │
Split/split-48         64.00 ± 0%    0.00 ± 0%  -100.00% (p=0.000 n=10)
Split/no-split-48      0.000 ± 0%   0.000 ± 0%         ~ (p=1.000 n=10) ¹
geomean                           ²                    ?              ² ³

                    │   old.txt   │               new.txt                │
                    │  allocs/op  │  allocs/op   vs base                 │
Split/split-48         1.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
Split/no-split-48      0.000 ± 0%   0.000 ± 0%         ~ (p=1.000 n=10) ¹
geomean                           ²                    ?              ² ³

¹ all samples are equal
² summaries must be >0 to compute geomean
³ ratios must be >0 to compute geomean
```

The effect on local gRPC benchmarks is negligible since the `SplitUnsafe` function isn't called very frequently.

```
$ go run benchmark/benchresult/main.go unary-before unary-after
unary-networkMode_Local-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_120-reqSize_16000B-respSize_16000B-compressor_off-channelz_false-preloader_false-clientReadBufferSize_-1-clientWriteBufferSize_-1-serverReadBufferSize_-1-serverWriteBufferSize_-1-sleepBetweenRPCs_0s-connections_1-recvBufferPool_simple-sharedWriteBuffer_false
           Title       Before        After Percentage
        TotalOps      2985694      3024364      1.30%
         SendOps            0            0       NaN%
         RecvOps            0            0       NaN%
        Bytes/op     74784.94     74784.99      0.00%
       Allocs/op       133.67       133.89      0.00%
         ReqT/op 6369480533.33 6451976533.33      1.30%
        RespT/op 6369480533.33 6451976533.33      1.30%
        50th-Lat   2.410033ms    2.40116ms     -0.37%
        90th-Lat   3.145118ms   3.081771ms     -2.01%
        99th-Lat   3.563055ms   3.629663ms      1.87%
         Avg-Lat   2.410529ms   2.379513ms     -1.29%
       GoVersion     go1.24.8     go1.24.8
     GrpcVersion   1.78.0-dev   1.78.0-dev
```

RELEASE NOTES:

- mem: optimize re-use of `buffer` objects on using `SplitUnsafe`