Investigate wedge during passive testing (network v7) #565

Closed
Stebalien opened this issue Aug 12, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@Stebalien (Member)

Network v7 wedged at instance 108, reaching round 7 before I paused/unpaused the network to unwedge it.

  1. I restarted our follower to make sure it wasn't just us; it wasn't.
  2. We clearly had enough power, and pausing/unpausing fixed everything.

So, we're going to have to look into the logged messages to figure out exactly what happened.

@Stebalien (Member Author)

The good news is that everything was quiet while wedged, not spinning out of control.

@Stebalien (Member Author)

It wedged again at instance 154 (it got to round 5 this time). I'm going to restart with more nodes.

@Stebalien (Member Author)

And in network v8, it stalled at instance 2. It's not halted; it reached round 8 before I got bored, but it's still not making progress. I'm pretty sure we have enough power, because we got through two instances just fine and are receiving messages from many peers.

@Stebalien (Member Author)

I think I found the bug:

{"level":"debug","ts":"2024-08-13T01:42:55.193+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{5}: try step PREPARE (round 4, step PREPARE, proposal [AFY2BZACEANW2522@4172770], value [AFY2BZACEANW2522@4172770])"}
{"level":"debug","ts":"2024-08-13T01:42:55.193+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 01:44:31.193510379 +0200 CEST m=+9234.730969989"}
{"level":"debug","ts":"2024-08-13T01:42:55.375+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{5}: moving to round 5 with [AFY2BZACEANW2522@4172770] (round 5, step COMMIT, proposal [AFY2BZACEANW2522@4172770], value 丄)"}
{"level":"debug","ts":"2024-08-13T01:42:55.421+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 01:46:07.421564358 +0200 CEST m=+9330.959023958"}
{"level":"debug","ts":"2024-08-13T01:42:55.497+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{5}: try step CONVERGE (round 5, step CONVERGE, proposal [AFY2BZACEANW2522@4172770], value 丄)"}

Important log line here:

i.log("moving to round %d with %s", i.round, i.proposal.String())

And see the comment on the i.proposal variable here:

go-f3/gpbft/gpbft.go, lines 189 to 191 at a5e96fc:

// This instance's proposal for the current round. Never bottom.
// This is set after the QUALITY phase, and changes only at the end of a full round.
proposal ECChain

But it's clearly bottom here.

So, our proposal is bottom and we're never making progress because everyone is voting for bottom.

@Stebalien (Member Author)

Oh, never mind: i.log implicitly prints the value in addition to the proposal, so the bottom (丄) in that line is the value, not the proposal.
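
For reference, here is a minimal sketch (assumed types, not the actual go-f3 helper) of the kind of instance-scoped logger that would produce the trailing "(round ..., step ..., proposal ..., value ...)" suffix in the lines above:

package main

import "fmt"

// Minimal sketch of an instance-scoped log helper that appends the current
// state to every message, which is why "value 丄" shows up even though the
// format string only mentions the proposal. Types and fields are assumptions.
type instance struct {
    id       int
    round    uint64
    phase    string
    proposal string
    value    string // "丄" when bottom
}

func (i *instance) log(format string, args ...any) {
    msg := fmt.Sprintf(format, args...)
    fmt.Printf("{%d}: %s (round %d, step %s, proposal %s, value %s)\n",
        i.id, msg, i.round, i.phase, i.proposal, i.value)
}

func main() {
    i := &instance{id: 5, round: 5, phase: "COMMIT", proposal: "[AFY2BZACEANW2522@4172770]", value: "丄"}
    i.log("moving to round %d with %s", i.round, i.proposal)
}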

@Stebalien (Member Author) commented Aug 12, 2024

Interesting... it made progress again. I wonder if it is just nodes de-syncing. Of course, now it's stuck on instance 8, round 6. I am getting some late messages, so I wonder if some nodes are just running slow?

@Stebalien (Member Author)

No, I think it's pubsub:

  1. I'm seeing lots of "Ignoring peer" messages for peers with low scores.
  2. I'm seeing us drop quite a few messages to peers with full queues.

I have no firm confirmation, but it's not looking great.

I also wonder if this might partially be because we ignore messages from prior instances and might therefore appear to be censoring. Maybe we should allow messages from the last instance? The cache should make this not terrible...
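
A rough sketch of what relaxing that filter could look like, under the assumption that message validation is keyed on the instance number (the function and names here are hypothetical, not the real go-f3 validator):

package main

import "fmt"

// Hypothetical check: instead of dropping anything that isn't for the current
// instance, also accept messages from the immediately previous instance, so
// slow peers don't see us as censoring their traffic. A message cache would
// keep the extra re-validation cheap.
func acceptInstance(current, msgInstance uint64) bool {
    if msgInstance == current {
        return true
    }
    return current > 0 && msgInstance == current-1
}

func main() {
    fmt.Println(acceptInstance(8, 8), acceptInstance(8, 7), acceptInstance(8, 5)) // true true false
}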

@Stebalien (Member Author)

Also, we seem to:

  1. Spend a lot of time trying to converge with value bottom.
  2. Finally converge.
  3. Prepare for a base proposal.
  4. Then immediately move to the next round.

Interestingly, we're setting alarms in the future and then blowing past them, likely because we're receiving something from the network. That's definitely suspicious.

{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: adopting proposal [AFY2BZACEDEA6TBZ@4173269] after converge (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:47:01.027584663 +0200 CEST m=+12984.565044293"}
{"level":"debug","ts":"2024-08-13T02:34:13.072+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step PREPARE (round 7, step PREPARE, proposal [AFY2BZACEDEA6TBZ@4173269], value [AFY2BZACEDEA6TBZ@4173269])"}
{"level":"debug","ts":"2024-08-13T02:34:13.072+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:47:01.072441635 +0200 CEST m=+12984.609901265"}
{"level":"debug","ts":"2024-08-13T02:34:13.074+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: moving to round 8 with [AFY2BZACEDEA6TBZ@4173269] (round 8, step COMMIT, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.105+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:59:49.105237957 +0200 CEST m=+13752.642697587"}
{"level":"debug","ts":"2024-08-13T02:34:13.112+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}

@Stebalien (Member Author)

Here's a longer sequence to give you an idea:

{"level":"debug","ts":"2024-08-13T02:33:50.929+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:33:53.379+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:33:57.306+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:33:57.380+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:33:59.220+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:04.118+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:04.554+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:04.591+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:10.688+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:10.737+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:12.098+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:12.141+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: adopting proposal [AFY2BZACEDEA6TBZ@4173269] after converge (round 7, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.027+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:47:01.027584663 +0200 CEST m=+12984.565044293"}
{"level":"debug","ts":"2024-08-13T02:34:13.072+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step PREPARE (round 7, step PREPARE, proposal [AFY2BZACEDEA6TBZ@4173269], value [AFY2BZACEDEA6TBZ@4173269])"}
{"level":"debug","ts":"2024-08-13T02:34:13.072+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:47:01.072441635 +0200 CEST m=+12984.609901265"}
{"level":"debug","ts":"2024-08-13T02:34:13.074+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: moving to round 8 with [AFY2BZACEDEA6TBZ@4173269] (round 8, step COMMIT, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.105+0200","logger":"f3","caller":"go-f3@v0.0.8-0.20240812190957-a5e96fcff98c/host.go:596","msg":"set alarm for 2024-08-13 02:59:49.105237957 +0200 CEST m=+13752.642697587"}
{"level":"debug","ts":"2024-08-13T02:34:13.112+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.129+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.167+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.168+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.188+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.190+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.200+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}
{"level":"debug","ts":"2024-08-13T02:34:13.206+0200","logger":"f3/gpbft","caller":"gpbft/gpbft.go:968","msg":"{12}: try step CONVERGE (round 8, step CONVERGE, proposal [AFY2BZACEDEA6TBZ@4173269], value 丄)"}

@Kubuxu (Contributor) commented Aug 13, 2024

Idea: deploy a network with a very long rebroadcast interval to eliminate re-broadcast as a variable.
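
For instance, something along these lines; the field names here are assumptions for illustration, not the real go-f3 manifest:

package main

import (
    "fmt"
    "time"
)

// Hypothetical sketch of "a very long rebroadcast interval": push the backoff
// far enough out that rebroadcast cannot be what wedges or unwedges the
// network during the test. Field names are illustrative only.
type gpbftConfig struct {
    RebroadcastBackoffBase time.Duration
    RebroadcastBackoffMax  time.Duration
}

func main() {
    cfg := gpbftConfig{
        RebroadcastBackoffBase: 30 * time.Minute,
        RebroadcastBackoffMax:  2 * time.Hour,
    }
    fmt.Printf("test config: %+v\n", cfg)
}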

@Kubuxu (Contributor) commented Aug 22, 2024

The same thing happened in network version 30. Steb and I narrowed it down to participants not receiving QUALITY messages, but that on its own should not stop the network from making progress.
See filecoin-project/FIPs#809 (comment)

Steb noticed that the ticket quality function lacks fairness. While looking at this, I stumbled across the fact that we keep only one ticket per converge value: https://github.com/filecoin-project/go-f3/blob/main/gpbft/gpbft.go#L1310

#578 contains a test exposing the issue, a fix to the converge-value ticket bookkeeping, and, for now, a temporary fix for the ticket quality function.
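
As a rough illustration of the bookkeeping problem (assumed structure and ticket ordering, not the actual go-f3 code): keeping a single ticket per CONVERGE value means a later, worse ticket can silently replace a better one, whereas keeping the best ticket seen per value avoids that.

package main

import (
    "bytes"
    "fmt"
)

// Sketch of per-value CONVERGE ticket bookkeeping. GPBFT adopts the proposal
// carrying the winning ticket, so overwriting the stored ticket on every
// message (instead of keeping the best one) can change which value a node
// adopts after CONVERGE.
type convergeState struct {
    best map[string][]byte // best (here: lowest) ticket seen per value key
}

func (c *convergeState) receive(valueKey string, ticket []byte) {
    prev, ok := c.best[valueKey]
    if !ok || bytes.Compare(ticket, prev) < 0 {
        c.best[valueKey] = ticket // keep the better ticket rather than blindly overwriting
    }
}

func main() {
    c := &convergeState{best: map[string][]byte{}}
    c.receive("chainA", []byte{0x02})
    c.receive("chainA", []byte{0x09}) // a worse ticket must not displace the better one
    fmt.Printf("%x\n", c.best["chainA"]) // prints 02
}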

@rjan90 added the bug (Something isn't working) label on Aug 23, 2024
@masih moved this from Todo to In progress in F3 on Aug 28, 2024
@Stebalien (Member Author)

This has been thoroughly investigated by this point.

@github-project-automation bot moved this from In progress to Done in F3 on Aug 29, 2024