Adjust ringbuffer sizes #5866
Merged
Conversation
Benchmark comparison: rb_adjustments@79445 (aka 20231213.11) vs main, EWMA over 20 builds from 79105 to 79434. (Comparison table: main vs rb_adjustments.)
maxtropets added a commit to maxtropets/CCF that referenced this pull request on Jul 17, 2024:
The default was increased in microsoft#5866, which made the destruction test fail more often because of an unexpected destruction flow. The test was intended to fail on SGX due to out-of-memory, and on virtual due to exceeding the maximum ringbuffer message size. After increasing the message limit, the primary becomes too slow (in debug builds, at least), so an election happens and disrupts the network, and the test can then fail on a missing primary or other states it does not expect. We mitigate the issue here by rolling back the message limit, but the proper fix might be to rewrite the test to make it more robust.
maxtropets added further commits to maxtropets/CCF that referenced this pull request on Jul 17 and Jul 18, 2024, carrying the same commit message.
Resolves #5672. A 4x increase is in line with the change previously made in 5ecf0a1.
As expected, there is no substantial impact on performance, but this will allow bigger messages.
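For context on what these sizes control, here is a minimal sketch of the ringbuffer-related settings as they might appear in a cchost node configuration, assuming the memory section keys (circuit_size, max_msg_size, max_fragment_size) documented for cchost; the values shown are placeholders for illustration, not the defaults changed by this PR, and the exact key names may differ between CCF versions.

```json
{
  "memory": {
    "circuit_size": "16MB",
    "max_msg_size": "64MB",
    "max_fragment_size": "256KB"
  }
}
```

Roughly, max_msg_size bounds the largest single message that can be written to the ringbuffer, max_fragment_size bounds how a message is split into fragments, and circuit_size is the overall capacity of the ring, which is why an increase to one is typically accompanied by proportional increases to the others.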
The js batched stress test does fail most of the time on debug builds. Superficially this is because the assertion that the node crashes is not triggered, but in fact an earlier election causes the test to wait on a node that is by then a backup and is not expected to crash. The election itself is triggered by very long I/O pauses when it happens.
I suspect the reason this is limited to debug builds is the very large amount of logging, which probably causes IOPS throttling. But I am not sure why the larger RB/fragment/message sizes, and the reduction in fragmentation, have apparently worsened this.
Edit: what happens is that, instead of crashing very quickly, the primary gets further in writing out large transactions, producing a bigger ledger. The writes eventually become so slow that they time out, causing an election. At that point we try to submit an even bigger batch on a backup, crashing it as expected.
But our logic then expects the first primary (nodes[0]) to have crashed, which has not happened and will not happen.