Brief network partitions can cause all ingress requests to time out even after network is healed #2565
The Bifrost config looks completely normal. I didn't capture the output from the initial run, but here's another run where the cluster got into a bad state from the second partitioning event:
On this occasion, the cluster recovered without external intervention a short while (about 4 minutes) after the end of the test run. Logs from sequencer (n2):
Bifrost config at end state:
Complete node logs from the last run are in my earlier comment.
Tested against #2570 with similar results, using 5 and 15 second partitions respectively. Full Jepsen output from the second example: restate-set-vo-20250128T222741.019+0200.tar.gz. This is the log configuration at the end of the test:
Cluster state:
All the ingresses are apparently stuck:
All three nodes are just emitting this to the log every second:
Also possibly relevant:
The log from N2 (…):
So it assumed leadership for the partition but decided to step down about 1s later? And nobody else has attempted to grab the leadership since, which explains why ingress requests just hang.
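To make the suspected failure mode concrete, here's a minimal sketch (hypothetical names and states, not Restate's actual leadership code): n2 wins a campaign, steps down about a second later, and nothing ever re-enters a candidate state, so the partition is left leaderless and every ingress request parks until its own timeout fires.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    Candidate,
    Leader,
    Follower, // passive: waits for *someone else* to campaign
}

fn main() {
    // n1, n2, n3 at the moment the partition heals; n2 runs a campaign.
    let mut roles = [Role::Follower, Role::Candidate, Role::Follower];
    roles[1] = Role::Leader;
    println!("n2: became leader for the partition");

    // ~1s later something (e.g. an observed newer epoch) forces a step-down.
    std::thread::sleep(Duration::from_secs(1));
    roles[1] = Role::Follower;
    println!("n2: stepping down");

    // Suspected bug pattern: nothing ever re-enters Candidate, so the
    // cluster has no leader and no election in flight. Every ingress
    // request now parks until its own timeout fires.
    assert!(roles.iter().all(|r| *r == Role::Follower));
    println!("wedged: all followers, no campaign pending");
}
```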
We might have fixed this issue with the latest batch of fixes. @pcholakov what do you think, should we close this issue?
I've run into a stuck state that I can consistently replicate when running the Jepsen virtual object workloads against a 3-node cluster that experiences intermittent network partitions.
Build:
Restate Server (ad4e5ee x86_64-unknown-linux-gnu 2025-01-28)
Here is a visual representation of what's going on: the grey shaded areas are periods when a random single node is isolated from the others. The yellow datapoints are ingress request timeouts; the blue ones are successfully completed requests.
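For reference, the workload that produces a timeline like this is essentially a fixed-rate request loop with a client-side timeout, where each iteration yields one datapoint. A minimal sketch in Rust, assuming a hypothetical ingress URL and handler path (the actual Jepsen workload is different code entirely):

```rust
// deps: tokio = { features = ["full"] }, reqwest
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(5)) // anything slower counts as a timeout
        .build()?;

    loop {
        let started = std::time::Instant::now();
        match client
            .post("http://n1:8080/Counter/key-0/add") // hypothetical endpoint
            .body("1")
            .send()
            .await
        {
            // Completed request: a blue datapoint on the plot.
            Ok(resp) => println!("ok {} in {:?}", resp.status(), started.elapsed()),
            // Client-side timeout: a yellow datapoint on the plot.
            Err(e) if e.is_timeout() => println!("TIMEOUT after {:?}", started.elapsed()),
            Err(e) => println!("error: {e}"),
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}
```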
Cluster state after the end of the test:
It appears to be Bifrost-related, as the latest entries in the logs on all three servers indicate a pending seal that never finishes (see the sketch after the per-node logs below).
Details
n1
n2
n3
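To spell out why an unfinished seal is so disruptive: in a segmented log, once the tail segment starts sealing, appends must wait for the seal to complete and a new tail segment to be appointed. If the seal never resolves, every append, and every ingress request waiting behind one, blocks indefinitely. A conceptual sketch with hypothetical names, not Bifrost's real API:

```rust
#[derive(Debug, Clone, Copy)]
enum SegmentState {
    Open,
    // In a healthy reconfiguration this would advance to a sealed segment
    // plus a freshly appointed open tail; in the stuck state it never does.
    Sealing,
}

struct Chain {
    tail: SegmentState,
}

impl Chain {
    fn append(&self, record: &str) -> Result<(), &'static str> {
        match self.tail {
            SegmentState::Open => {
                println!("appended {record:?}");
                Ok(())
            }
            // Callers are expected to retry once reconfiguration finishes,
            // but if the seal never completes, this branch is hit forever.
            SegmentState::Sealing => Err("append blocked: seal in progress"),
        }
    }
}

fn main() {
    let mut chain = Chain { tail: SegmentState::Open };
    chain.append("record-1").unwrap();

    // A partition triggers reconfiguration: the tail starts sealing, but
    // the seal never completes and the state never advances.
    chain.tail = SegmentState::Sealing;

    // Every subsequent append now parks behind the pending seal.
    for _ in 0..3 {
        println!("{:?}", chain.append("record-2"));
    }
}
```

On this reading, a process restart discards the wedged in-memory sealing state, which would also explain why the restart workaround below unsticks the cluster.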
Restarting the restate-server processes gets the cluster unstuck.