Investigate wedge during passive testing (network v7) #565

Closed
Stebalien opened this issue Aug 12, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@Stebalien (Member)

Network v7 wedged at instance 108, reaching round 7 before I paused/unpaused the network to unwedge it.

  1. I restarted our follower to make sure it wasn't just us; it wasn't.
  2. We clearly had enough power, and pausing/unpausing fixed everything.

So, we're going to have to look into the logged messages to figure out exactly what happened.

@Stebalien (Member Author)

The good news is that everything was quiet while wedged, not spinning out of control.

@Stebalien (Member Author)

It wedged again at instance 154 (it got to round 5 this time). I'm going to restart with more nodes.

@Stebalien (Member Author)

And in network v8, it stalled at instance 2. It's not halted; it reached round 8 before I got bored, but it's still not making progress. I'm pretty sure we have enough power, because we got through two instances just fine and are receiving messages from many peers.

@Stebalien (Member Author)

I think I found the bug:

{"level":"debug","ts":"2024-08-13T01:42:55.193+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{5}: try step PREPARE (round 4, step PREPARE, proposal [AFY2BZACEANW2522@4172770], value [AFY2BZACEANW2522@4172770])"}
{"level":"debug","ts":"2024-08-13T01:42:55.193+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 01:44:31.193510379 +0200 CEST m=+9234.730969989"}
{"level":"debug","ts":"2024-08-13T01:42:55.375+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{5}: moving to round 5 with [AFY2BZACEANW2522@4172770] (round 5, step COMMIT, proposal [AFY2BZACEANW2522@4172770], value 丄)"}
{"level":"debug","ts":"2024-08-13T01:42:55.421+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 01:46:07.421564358 +0200 CEST m=+9330.959023958"}
{"level":"debug","ts":"2024-08-13T01:42:55.497+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{5}: try step CONVERGE (round 5, step CONVERGE, proposal [AFY2BZACEANW2522@4172770], value 丄)"}

Important log line here:

i.log("moving to round %d with %s", i.round, i.proposal.String())

And see the comment on the i.proposal variable here:

go-f3/gpbft/gpbft.go, lines 189 to 191 at a5e96fc:

// This instance's proposal for the current round. Never bottom.
// This is set after the QUALITY phase, and changes only at the end of a full round.
proposal ECChain

But it's clearly bottom here.

So, our proposal is bottom and we're never making progress because everyone is voting for bottom.

@Stebalien (Member Author)

Oh, never mind: i.log implicitly prints the value in addition to the proposal, so the bottom (丄) in that line is the value, not the proposal.
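
For reference, here is a minimal sketch (assumed types, not the actual go-f3 helper) of the kind of instance-scoped logger that would produce the trailing "(round ..., step ..., proposal ..., value ...)" suffix in the lines above:

package main

import "fmt"

// Minimal sketch of an instance-scoped log helper that appends the current
// state to every message, which is why "value 丄" shows up even though the
// format string only mentions the proposal. Types and fields are assumptions.
type instance struct {
    id       int
    round    uint64
    phase    string
    proposal string
    value    string // "丄" when bottom
}

func (i *instance) log(format string, args ...any) {
    msg := fmt.Sprintf(format, args...)
    fmt.Printf("{%d}: %s (round %d, step %s, proposal %s, value %s)\n",
        i.id, msg, i.round, i.phase, i.proposal, i.value)
}

func main() {
    i := &instance{id: 5, round: 5, phase: "COMMIT", proposal: "[AFY2BZACEANW2522@4172770]", value: "丄"}
    i.log("moving to round %d with %s", i.round, i.proposal)
}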

@Stebalien (Member Author) commented Aug 12, 2024

Interesting... it made progress again. I wonder if it is just nodes de-syncing. Of course, now it's stuck on instance 8, round 6. I am getting some late messages, so I wonder if some nodes are just running slow?

@Stebalien (Member Author)

No, I think it's pubsub:

  1. I'm seeing lots of "Ignoring peer" messages for peers with low scores.
  2. I'm seeing us drop quite a few messages to peers with full queues.

I have no firm confirmation, but it's not looking great.

I also wonder if this might partially be because we ignore messages from prior instances and might therefore appear to be censoring. Maybe we should allow messages from the last instance? The cache should make this not terrible...
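
A rough sketch of what relaxing that filter could look like, under the assumption that message validation is keyed on the instance number (the function and names here are hypothetical, not the real go-f3 validator):

package main

import "fmt"

// Hypothetical check: instead of dropping anything that isn't for the current
// instance, also accept messages from the immediately previous instance, so
// slow peers don't see us as censoring their traffic. A message cache would
// keep the extra re-validation cheap.
func acceptInstance(current, msgInstance uint64) bool {
    if msgInstance == current {
        return true
    }
    return current > 0 && msgInstance == current-1
}

func main() {
    fmt.Println(acceptInstance(8, 8), acceptInstance(8, 7), acceptInstance(8, 5)) // true true false
}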

@Stebalien (Member Author)

Also, we seem to:

  1. Spend a lot of time trying to converge with value bottom.
  2. Finally converge.
  3. Prepare for a base proposal.
  4. Then immediately move to the next round.

Interestingly, we're setting alarms in the future and then blowing past them, likely because we're receiving something from the network. That's definitely suspicious.

{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: adopting proposal [AFY2BZACEDEA6TBZ@4173269] after converge (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:47:01.027584663 +0200 CEST m=+12984.565044293"}
{"level":"debug","ts":"2024-08-13T02:34:13.072+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step PREPARE (round 7, step PREPARE, proposal [AFY2BZACEDEA6TBZ@4173269], value [AFY2BZACEDEA6TBZ@4173269])"}
{"level":"debug","ts":"2024-08-13T02:34:13.072+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:47:01.072441635 +0200 CEST m=+12984.609901265"}
{"level":"debug","ts":"2024-08-13T02:34:13.074+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: moving to round 8 with [AFY2BZACEDEA6TBZ@4173269] (round 8, step COMMIT, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.105+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:59:49.105237957 +0200 CEST m=+13752.642697587"}
{"level":"debug","ts":"2024-08-13T02:34:13.112+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}

@Stebalien (Member Author)

Here's a longer sequence to give you an idea:

{"level":"debug","ts":"2024-08-13T02:33:50.929+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:33:53.379+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:33:57.306+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:33:57.380+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:33:59.220+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:04.118+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:04.554+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:04.591+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:10.688+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:10.737+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:12.098+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:12.141+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: adopting proposal [AFY2BZACEDEA6TBZ@4173269] after converge (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:47:01.027584663 +0200 CEST m=+12984.565044293"}
{"level":"debug","ts":"2024-08-13T02:34:13.072+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step PREPARE (round 7, step PREPARE, proposal [AFY2BZACEDEA6TBZ@4173269], value [AFY2BZACEDEA6TBZ@4173269])"}
{"level":"debug","ts":"2024-08-13T02:34:13.072+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:47:01.072441635 +0200 CEST m=+12984.609901265"}
{"level":"debug","ts":"2024-08-13T02:34:13.074+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: moving to round 8 with [AFY2BZACEDEA6TBZ@4173269] (round 8, step COMMIT, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.105+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:59:49.105237957 +0200 CEST m=+13752.642697587"}
{"level":"debug","ts":"2024-08-13T02:34:13.112+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.129+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.167+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.168+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.188+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.190+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.200+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.206+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}

@Kubuxu (Contributor) commented Aug 13, 2024

Idea: deploy a network with a very long rebroadcast interval to eliminate re-broadcast as a variable.
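
For instance, something along these lines; the field names here are assumptions for illustration, not the real go-f3 manifest:

package main

import (
    "fmt"
    "time"
)

// Hypothetical sketch of "a very long rebroadcast interval": push the backoff
// far enough out that rebroadcast cannot be what wedges or unwedges the
// network during the test. Field names are illustrative only.
type gpbftConfig struct {
    RebroadcastBackoffBase time.Duration
    RebroadcastBackoffMax  time.Duration
}

func main() {
    cfg := gpbftConfig{
        RebroadcastBackoffBase: 30 * time.Minute,
        RebroadcastBackoffMax:  2 * time.Hour,
    }
    fmt.Printf("test config: %+v\n", cfg)
}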

@Kubuxu (Contributor) commented Aug 22, 2024

The same thing happened in network version 30. Steb and I narrowed it down to participants not receiving QUALITY messages, but that on its own should not stop the network from making progress.
See filecoin-project/FIPs#809 (comment)

Steb noticed that the ticket quality function lacks fairness. While looking at this, I stumbled across the fact that we keep only one ticket per converge value: https://github.com/filecoin-project/go-f3/blob/main/gpbft/gpbft.go#L1310

#578 contains a test exposing the issue, a fix to the converge-value ticket bookkeeping, and, for now, a temporary fix for the ticket quality function.
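
As a rough illustration of the bookkeeping problem (assumed structure and ticket ordering, not the actual go-f3 code): keeping a single ticket per CONVERGE value means a later, worse ticket can silently replace a better one, whereas keeping the best ticket seen per value avoids that.

package main

import (
    "bytes"
    "fmt"
)

// Sketch of per-value CONVERGE ticket bookkeeping. GPBFT adopts the proposal
// carrying the winning ticket, so overwriting the stored ticket on every
// message (instead of keeping the best one) can change which value a node
// adopts after CONVERGE.
type convergeState struct {
    best map[string][]byte // best (here: lowest) ticket seen per value key
}

func (c *convergeState) receive(valueKey string, ticket []byte) {
    prev, ok := c.best[valueKey]
    if !ok || bytes.Compare(ticket, prev) < 0 {
        c.best[valueKey] = ticket // keep the better ticket rather than blindly overwriting
    }
}

func main() {
    c := &convergeState{best: map[string][]byte{}}
    c.receive("chainA", []byte{0x02})
    c.receive("chainA", []byte{0x09}) // a worse ticket must not displace the better one
    fmt.Printf("%x\n", c.best["chainA"]) // prints 02
}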

@rjan90 added the bug (Something isn't working) label on Aug 23, 2024
@masih moved this from Todo to In progress in F3 on Aug 28, 2024
@Stebalien (Member Author)

This has been thoroughly investigated by this point.

@github-project-automation bot moved this from In progress to Done in F3 on Aug 29, 2024