Performance drop on OTP 27.0-rc1 (caused by bitstring refactor) #8229
Comments
Thanks for your report! I can't think of anything that would, suddenly, cause How large are the binaries involved (ballpark)? Are there tons and tons of tiny ones (=< 64b)? Large (>64b) ones?
I was surprised by that More time spent in GC seems pretty consistent - I also used In the particular test as reported above I was publishing 100B messages, so there was a significant number of 100B binaries but also a bunch of small ones for sure (headers). I've also tested with 20B messages and there was a similar difference. With 1 kB and 5 kB messages there seems to be less of a difference (10-15%), but it is still consistently noticeable. It's not completely clear to me what happens with these binaries within the BEAM. They enter as TCP packets, so even though the messages can be small (like in this 20B test), they can effectively be parts of a larger binary created at the TCP buffer layer.
This could be a "cleaner" example. This time consumers are present ( In this case I had ~90k msgs/s with OTP built before the bitstring refactor was merged and ~55k msgs/s with the refactor merged (as before, I'm comparing OTP builds with just 1 commit of difference -
Can you share the VM arguments you're using? Are you setting
... after thinking a bit more about it, now I can, along with higher memory usage (far beyond the extra word @okeuday mentioned). How does the data you're passing to
The data that ends up written looks like this: First, for each message we build this data ( Then we do some consolidation: if two messages end up contiguous, we consolidate them with an Acc here: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_store_v2.erl#L237-L270 Finally we do an open/pwrite/close per file with the consolidated data: https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_classic_queue_store_v2.erl#L227 Note that until this data is built, when the time comes to write the messages to disk, the messages are kept in a map as records that contain a bunch of binaries and other values.
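As a rough illustration of the consolidation step described above, here is a minimal Python sketch. It is not the Erlang code from rabbit_classic_queue_store_v2.erl: the function names `consolidate` and `flush`, and the `(offset, data)` representation of a pending write, are assumptions made only for this example.

```python
import os


def consolidate(writes):
    """Merge pending writes whose byte ranges are contiguous.

    `writes` is an iterable of (offset, data) pairs (a hypothetical
    representation). Returns a list of (offset, data) pairs sorted by offset,
    where adjacent ranges have been joined into a single buffer.
    """
    runs = []  # each run is [start_offset, end_offset, [chunks]]
    for offset, data in sorted(writes, key=lambda w: w[0]):
        if runs and runs[-1][1] == offset:
            # This write starts exactly where the previous run ends.
            runs[-1][1] += len(data)
            runs[-1][2].append(data)
        else:
            runs.append([offset, offset + len(data), [data]])
    return [(start, b"".join(chunks)) for start, _end, chunks in runs]


def flush(path, writes):
    """One open/pwrite/close cycle per file, using the consolidated data."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for offset, data in consolidate(writes):
            # A sketch only: os.pwrite may write fewer bytes than requested,
            # which production code would need to handle.
            os.pwrite(fd, data, offset)
    finally:
        os.close(fd)
```

With this shape, two messages at offsets 0 and 3 (the first being 3 bytes long) become a single positioned write, while a message at a non-adjacent offset stays separate.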
Thanks, that tracks with what I think it could be. Unfortunately it took forever for
It might be easier to replace Bazel usage with
This branch is not yet compatible with
Fixing the awkward interaction between I haven't restored all performance yet, but I've ruled out the slightly increased size of some binary structures as the cause. I'll try to get this fixed before RC2.
I've found the problem: the GC pressure of off-heap binaries ("vheap") was vastly under-counted prior to 24ef4cb. In these tests this caused it to GC less often and, crucially, when there was less live data to keep. The larger packet sizes were less affected because they weren't under-counted by as much. Tuning I'll make a PR with the changes to the
Great, thank you! If you are using RabbitMQ for testing, I just pushed some changes to the
Describe the bug
We've started testing RabbitMQ with OTP 27.0-rc1 and we immediately saw a pretty consistent ~40% performance drop throughout our performance tests. We've identified that it's caused by the bitstring changes introduced in #7828.
Specifically, OTP built from SHA 49024e8 performs similarly to OTP 26, while 24ef4cb shows a significant performance drop (also present in OTP 27.0-rc1 and master as of today).
There was a follow-up PR #7935 which mentions
but I think that doesn't really explain the difference we see.
The most obvious observable difference is the amount of garbage collection and time spent in pwrite_nif.
Before #7828 (faster):
After #7828 (slower):
To Reproduce
Currently my reproduction steps involve running a RabbitMQ node. If a small test case is needed, I'd appreciate some suggestions for what kind of operations specifically we could try in that test case.
To run a RabbitMQ node, you can either use docker or you can build and run locally. Details below.
To run a node with OTP 27.0-rc1:
note: these images don't work on Apple Silicon.
To build and run locally:
note: make run-broker doesn't yet work on this branch so bazel is needed.
With a RabbitMQ node running, run our load testing tool:
This publishes, without consuming, 100B messages to a single queue for 10 seconds. With OTP26 and versions prior to #7828 that gives me about 115 000 messages per second. After that PR I get about 65 000 messages per second.
24ef4cb and newer (27.0-rc1, master).