Investigate wedge during passive testing (network v7) #565
Good news is... everything was quiet when wedged, not spinning out of control.
It wedged again at instance 154 (got to round 5 at this point). I'm going to restart with more nodes.
And in network 8, it stopped at instance 2. It's not halted; it reached round 8 before I got bored, but it's still not making progress. I'm pretty sure we have enough power, because we got through two instances just fine and are receiving messages from many peers.
I think I found the bug:
The important log line is here: line 758 at commit a5e96fc.
And see the comment at lines 189 to 191 in a5e96fc.
But it's clearly bottom here. So our proposal is bottom, and we're never making progress because everyone is voting for bottom.
Oh, never mind.
Interesting... it made progress again. I wonder if it is just nodes de-syncing. Of course, now it's stuck on instance 8, round 6. I am getting some late messages... so I wonder if some nodes are just running slow?
No, I think it's pubsub:
I don't have full confirmation, but it's not looking great. I also wonder if this might partly be due to the fact that we ignore messages from prior instances and might therefore appear to censor. Maybe we should allow messages from the last instance? The cache should make this not terrible...
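To make the "allow messages from the last instance" idea concrete, here is a minimal sketch of that kind of filter, using hypothetical stand-ins (GMessage, acceptMessage) rather than go-f3's actual validation code:

```go
package main

import "fmt"

// GMessage is a hypothetical stand-in for a GPBFT message; only the instance matters here.
type GMessage struct {
	Instance uint64
}

// acceptMessage reports whether a message should be processed rather than dropped,
// where current is the instance the local participant is running.
func acceptMessage(current uint64, msg GMessage) bool {
	if msg.Instance == current {
		return true
	}
	// Proposed relaxation: also tolerate messages from the immediately previous
	// instance, so slow peers don't make us look like we're censoring them.
	// The message cache should keep the extra cost small.
	return current > 0 && msg.Instance == current-1
}

func main() {
	fmt.Println(acceptMessage(8, GMessage{Instance: 7})) // accepted under the proposal
	fmt.Println(acceptMessage(8, GMessage{Instance: 5})) // still dropped
}
```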
Also, we seem to:
Interestingly, we're setting alarms in the future and then blowing past them, likely because we're receiving something from the network. That's definitely suspicious.
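For illustration only, here is a toy sketch of that pattern, not the real driver loop: an alarm is scheduled for the future, but a network message wins the race and the node moves on before the timeout ever matters.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Alarm scheduled for a point in the future.
	alarm := time.NewTimer(2 * time.Second)

	// A network message that arrives before the alarm fires.
	network := make(chan string, 1)
	network <- "CONVERGE from a peer"

	select {
	case <-alarm.C:
		fmt.Println("alarm fired: advance the round on timeout")
	case msg := <-network:
		// Handling this message advances the state machine before the alarm
		// fires; if the new state then schedules yet another future alarm,
		// the node keeps blowing past its own timeouts.
		fmt.Println("received:", msg)
	}
}
```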
Here's a longer sequence to give you an idea:
Idea: deploy a network with a very long rebroadcast interval to eliminate re-broadcast as a variable.
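A minimal sketch of what that experiment's configuration could look like, assuming a hypothetical RebroadcastInterval knob (the real option name in go-f3 may differ):

```go
package main

import (
	"fmt"
	"time"
)

// testConfig is a hypothetical stand-in for the relevant tuning knobs.
type testConfig struct {
	RebroadcastInterval time.Duration
}

func main() {
	// Make the interval far longer than any test run, so rebroadcasts
	// effectively never happen and can't confound the results.
	cfg := testConfig{RebroadcastInterval: 24 * time.Hour}
	fmt.Printf("rebroadcast every %s\n", cfg.RebroadcastInterval)
}
```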
The same thing happened in network version 30. Steb and I narrowed it down to participants not receiving quality messages, but that alone should not stop the network from making progress. Steb noticed that the ticket quality function lacks fairness. While looking at this, I stumbled across the fact that we keep only one ticket per converge value: https://github.com/filecoin-project/go-f3/blob/main/gpbft/gpbft.go#L1310. #578 contains a test exposing the issue, a fix to the converge value ticket bookkeeping, and (for now) a temporary fix for the ticket quality.
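For intuition only, here is a minimal sketch of the converge-ticket bookkeeping idea, with hypothetical names and an arbitrary "smaller quality wins" convention; #578 has the actual test and fix.

```go
package main

import "fmt"

// convergeTicket is a hypothetical stand-in for a CONVERGE entry: a proposed
// value plus the quality of the ticket carried with it (smaller wins here,
// purely for illustration).
type convergeTicket struct {
	value   string
	quality float64
}

// recordConverge keeps, for each converge value, the best ticket seen so far
// instead of whatever single ticket happened to arrive first.
func recordConverge(best map[string]convergeTicket, t convergeTicket) {
	if cur, ok := best[t.value]; !ok || t.quality < cur.quality {
		best[t.value] = t
	}
}

func main() {
	best := map[string]convergeTicket{}
	recordConverge(best, convergeTicket{value: "chainA", quality: 0.9})
	recordConverge(best, convergeTicket{value: "chainA", quality: 0.2}) // replaces the weaker ticket
	fmt.Printf("%+v\n", best["chainA"])
}
```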
This has been thoroughly investigated by this point.
Network v7 wedged at instance 108, getting to round 7 before I paused and unpaused the network to unwedge it.
So, we're going to have to look into the logged messages to figure out exactly what happened.