Refactor #560
This commit switches the bufferpool to use the zeropool implementation for sync pools. The stdlib sync.Pool implementation has an issue where it causes an additional heap allocation per Put() call when used with byte slices. github.com/colega/zeropool package has been specifically designed to work around this issue, which reduces GC pressure and improves performance. This also fixes the bufferpool's pkg benchmark to use a new pool per test, to avoid other tests influencing the behavior of the benchmark and sets it to report the allocations.
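To illustrate the allocation the commit message refers to, here is a minimal sketch using only the standard library. It is not zeropool's code; it compares a sync.Pool that stores []byte values against the common workaround of storing *[]byte, and the names slicePool, ptrPool, sliceRoundTrip and ptrRoundTrip are made up for this example.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// slicePool stores []byte values directly. Each Put converts the slice
// header into an interface value, which needs a heap allocation.
var slicePool = sync.Pool{
	New: func() any { return make([]byte, 0, 1024) },
}

// ptrPool stores *[]byte values. A pointer fits directly in the interface
// value, so Put does not have to copy the slice header to the heap.
var ptrPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, 1024)
		return &b
	},
}

// sliceRoundTrip reports the average allocations of one Get/Put cycle
// when the pool holds []byte values.
func sliceRoundTrip() float64 {
	return testing.AllocsPerRun(1000, func() {
		b := slicePool.Get().([]byte)
		slicePool.Put(b[:0])
	})
}

// ptrRoundTrip reports the same measurement for a pool of *[]byte values.
func ptrRoundTrip() float64 {
	return testing.AllocsPerRun(1000, func() {
		p := ptrPool.Get().(*[]byte)
		*p = (*p)[:0]
		ptrPool.Put(p)
	})
}

func main() {
	fmt.Println("allocs per round trip with []byte: ", sliceRoundTrip())
	fmt.Println("allocs per round trip with *[]byte:", ptrRoundTrip())
}
```

The pointer variant pushes the bookkeeping onto every caller, which is the inconvenience a dedicated pool type for slices is meant to hide.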
This fixes the Message's Reset() call to allow reuse of the first segment. Prior to this fix, the first segment was discarded after the first Reset call, effectively causing a new segment to be initialized on every Reset call. By reusing the first segment, the number of heap allocations is reduced and therefore performance is increased in use cases where the message object is reused. The fix involved associating the segment with the message and fixing checks to ensure the data of the segment is re-allocated after the reset. A benchmark is included to show the current performance of this.
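The idea of keeping the first segment across resets can be sketched roughly as follows. The segment and message types here are simplified, hypothetical stand-ins rather than the library's actual types, and the growth policy is an arbitrary choice for the example.

```go
package main

import "fmt"

// segment and message are hypothetical stand-ins for this sketch only.
type segment struct {
	data []byte
}

type message struct {
	first segment // owned by the message so it survives reset
}

// alloc hands out n bytes from the first segment, growing its backing
// buffer only when the remaining capacity is too small.
func (m *message) alloc(n int) []byte {
	start := len(m.first.data)
	if cap(m.first.data)-start < n {
		grown := make([]byte, start, 2*cap(m.first.data)+n)
		copy(grown, m.first.data)
		m.first.data = grown
	}
	m.first.data = m.first.data[:start+n]
	return m.first.data[start : start+n]
}

// reset zeroes the used bytes and truncates the segment, keeping the
// backing buffer so the next alloc does not hit the heap.
func (m *message) reset() {
	for i := range m.first.data {
		m.first.data[i] = 0
	}
	m.first.data = m.first.data[:0]
}

func main() {
	m := &message{}
	a := m.alloc(8)
	a[0] = 0xff
	m.reset()
	b := m.alloc(8)
	fmt.Println("same backing buffer:", &a[0] == &b[0])
}
```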
Momentarily, while refactoring is going on.
This shows that the BenchmarkUnmarshal_Reuse is broken.
The prior version isn't commented and it's hard to reason about.
This makes the captable release more efficient by avoiding unnecessary allocations.
This avoids some duffcopy calls and improves perf.
Here is another find.
Additionally, this makes it much easier to reason about what "setting the root of the message" entails and the code is much flatter. Also, this gets rid of the need to have the root pointer allocated on The new The greatest benefit of this would be for code that continually instantiates new messages for writing (which necessarily implies setting the root of the new message). |
For the next update I introduce an unrolled version of The Bringing everything so far, we get the following benchmark results, which we can compare to the ones in #554:
And for the next update, I introduce an unrolled This significantly reduces the latency for an operation that consists only of updating a field, down to only ~16ns:
@matheusd Thank you so much for all this incredible work! I'll review this asap and make sure it gets the attention it deserves. Probably won't be today, but know that you're on my radar! 🙂 🙏
Sure thing. I still have at least one additional idea to test out to make this particular workflow faster. Also note that these are all wip, and especially the later pushes are all experimental stuff that I wouldn't expect to merge as is, but rather to use as a baseline to refactor the code to reach these benchmark results.
The TextField is a reference to a specific text field inside a struct. It records both the pointer and value locations inside a struct, which may be used to fetch or update the underlying value.
Pushed a new experiment that adds a By keeping these around, we can forgo the most costly op from This gets to within 2.2x of a baseline benchmark that just copies the data into a slice:
@matheusd I'm getting ready to dive into this, and firstly wanted to thank you again for the detailed overview. Now that you seem to be converging on an implementation, I'm wondering if there's any part of this that we can break off into a smaller PR and merge separately?
Requires #555, #556
Elided tests and creating as draft to get a first pass review.
The diff is large(ish) so it may be easier to read the full code vs the diff at the moment. But the basic idea of the refactor is the following:
Split allocator strategy and segment management
Using the bufferpool should not be forced upon the caller. In fact, it is dangerous to use the current implementation of single/multi segment arenas if you pass in a buffer that was not allocated from the bufferpool, for example an mmapped file buffer.
Spinning the allocator into its own thing means we can define different allocation strategies (bufferpool, regular runtime functions, simpler caching strategy, read-only, etc).
Unfortunately, due to some tests failing otherwise, I couldn't unify literally everything inside the allocator (see further below for discussion).
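As a rough illustration of what pluggable allocation strategies could look like, here is a minimal sketch. The Allocator interface and both implementations are hypothetical names invented for this example and are not the API introduced by this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// Allocator is a hypothetical interface for this sketch: it only decides
// where segment buffers come from and where they go when released.
type Allocator interface {
	Allocate(minSize int) []byte
	Release(buf []byte)
}

// runtimeAllocator uses make and leaves reclamation to the GC.
type runtimeAllocator struct{}

func (runtimeAllocator) Allocate(minSize int) []byte { return make([]byte, 0, minSize) }
func (runtimeAllocator) Release([]byte)              {}

// pooledAllocator recycles buffers through a sync.Pool. Only buffers that
// this allocator handed out should ever be passed to Release.
type pooledAllocator struct {
	pool sync.Pool
}

func (a *pooledAllocator) Allocate(minSize int) []byte {
	// Get returns nil on an empty pool, so the type assertion fails safely.
	if p, ok := a.pool.Get().(*[]byte); ok && cap(*p) >= minSize {
		return (*p)[:0]
	}
	return make([]byte, 0, minSize)
}

func (a *pooledAllocator) Release(buf []byte) {
	buf = buf[:0]
	a.pool.Put(&buf)
}

func main() {
	for _, a := range []Allocator{runtimeAllocator{}, &pooledAllocator{}} {
		buf := a.Allocate(64)
		buf = append(buf, "hello"...)
		fmt.Printf("%T: len=%d\n", a, len(buf))
		a.Release(buf)
	}
}
```

With the strategy behind an interface, a buffer that came from somewhere else (such as an mmapped file) can be paired with an allocator whose Release is a no-op, so it never ends up in a pool.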
Base arena implementation
SingleSegment and MultiSegment arenas have been unified into a single arena impl that offloads the logic to the allocator and segment list.
This way, the full matrix of Single/Multi and BufferPool/runtime-backed options can be exercised.
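One way to picture a single arena type covering both the single and multi segment cases is sketched below. The arena struct, its fields and the constructors are hypothetical and only meant to show how the segment limit and the allocation strategy can vary independently.

```go
package main

import (
	"errors"
	"fmt"
)

var errSegmentLimit = errors.New("arena: segment limit reached")

// arena is a hypothetical unified arena for this sketch. The allocation
// strategy and the segment limit are independent knobs.
type arena struct {
	alloc       func(minSize int) []byte // allocation strategy
	maxSegments int                      // 1 = single segment, 0 = unlimited
	segments    [][]byte
}

// addSegment appends a segment with at least minSize bytes of capacity
// and returns its index, or fails once the segment limit is reached.
func (a *arena) addSegment(minSize int) (int, error) {
	if a.maxSegments > 0 && len(a.segments) >= a.maxSegments {
		return 0, errSegmentLimit
	}
	a.segments = append(a.segments, a.alloc(minSize))
	return len(a.segments) - 1, nil
}

// runtimeAlloc is the simplest strategy: plain make.
func runtimeAlloc(minSize int) []byte { return make([]byte, 0, minSize) }

func newSingleSegment() *arena { return &arena{alloc: runtimeAlloc, maxSegments: 1} }
func newMultiSegment() *arena  { return &arena{alloc: runtimeAlloc} }

func main() {
	single, multi := newSingleSegment(), newMultiSegment()
	for i := 0; i < 2; i++ {
		_, errS := single.addSegment(64)
		_, errM := multi.addSegment(64)
		fmt.Println("single:", errS, "multi:", errM)
	}
}
```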
Reduce message complexity
Most decisions have been offloaded from message into the arena, segment list or allocator. This makes Message more generic and easier to reason about: in particular, it no longer cares how many segments there are during init (see further below for discussion on Release), nor about the roundabout way it used to initialize the first segment.
Test compatibility and issues
All the existing tests pass. A few that are no longer applicable (dealing with the concrete arena implementations) have been commented out. One test (TestPromiseOrdering) is skipped because it's flaky even in the current main branch.
Other than those, the code has been specifically designed to not require changes in the existing tests, and therefore should be ensuring full compatibility with the existing code.
Tests for the new features haven't been done yet (but will if this is deemed to be in the right direction).
Message.Release is full of special cases
One main source of frustration during this rewrite is that Message.Release is full of special cases, mostly to deal with initializing a message for writing. It has all these cases I had to add to avoid having to touch the existing tests. These special cases are documented now in the code, after a FIXME(matheusd) line in that function.
Personally, I think my ReleaseForRead() should be the actual implementation of Release, but in the interest of not breaking client code, I opted for adding a new function instead.
This is also somewhat the reason for having to add a ReadOnlySingleSegmentArena instead of using a read only allocator: Release() is (currently) expected to check the arena is clear and re-allocate the first segment (i.e. "Prove Reset() cannot be used to reset a read-only message" commit), so I had to go out of my way to create an arena that would make it easier to just read only, while reusing the message struct.
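For readers unfamiliar with the idea, a read-only single segment arena can be pictured along these lines. This is a hypothetical sketch and not the PR's ReadOnlySingleSegmentArena; the readOnlyArena type and its method set are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errReadOnly   = errors.New("arena: read-only")
	errOutOfRange = errors.New("arena: segment out of range")
)

// readOnlyArena is a hypothetical sketch: it wraps one caller-owned
// buffer (for example an mmapped file) and never allocates or writes.
type readOnlyArena struct {
	data []byte
}

// Segment returns the wrapped buffer for id 0 and fails otherwise.
func (a *readOnlyArena) Segment(id int) ([]byte, error) {
	if id != 0 {
		return nil, errOutOfRange
	}
	return a.data, nil
}

// Allocate always fails, so a message backed by this arena cannot grow.
func (a *readOnlyArena) Allocate(minSize int) ([]byte, error) {
	return nil, errReadOnly
}

// Release only drops the reference. The buffer is neither zeroed nor
// handed to a pool, because the arena does not own it.
func (a *readOnlyArena) Release() {
	a.data = nil
}

func main() {
	src := []byte("pretend this came from an mmapped file")
	a := &readOnlyArena{data: src}
	seg, _ := a.Segment(0)
	_, err := a.Allocate(8)
	fmt.Println(len(seg), err)
	a.Release()
}
```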