Random Sync Failure Reported by Multiple Users - Continues the 10788 Discussion. #10906
Comments
This continues to be investigated by multiple parties. Here's an update on one particular hypothesis we've been testing:

Theory: A bug in Lotus causes us to incorrectly drop pubsub scores for peers when they propagate "local" messages to us. This causes the local message propagator to stop receiving blocks from peers, and thus to fall out of sync.

In order to determine whether this is the case, @magik6k and @TippyFlitsUK have been running nodes with extra logs that should provide some information on when messages coming from peers are either Ignored or Rejected. They have also been trying to stay connected to each other while reproducing the issue that causes them to fall out of sync.

Next steps: We need to look at these logs and try to confirm the theory. We are interested in all logs introduced by the commit linked above, but especially those pertaining to @magik6k and @TippyFlitsUK's peer IDs, and ESPECIALLY those pertaining to those peer IDs when the node in question is out of sync. We should be able to piece all of this information together based on the info they have shared.

If we do see penalties being applied to their peers, we need to assess whether the penalties are valid (the messages being sent are, in fact, "wrong" in some way worthy of a score cut), or invalid (the messages are "good" and shouldn't be penalized). Once we know that, we can discuss a fix.

Note that this is just one of MANY theories we're testing; it is NOT the definitive next path for the issue at hand.
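For context on the Ignored/Rejected distinction mentioned above, here is a minimal, hypothetical sketch built around the go-libp2p-pubsub topic validator API: returning ValidationReject feeds a penalty into the sender's pubsub score, while ValidationIgnore drops the message without a penalty. The helper names (`messageAlreadyKnown`, `registerLoggingValidator`) and the logging are illustrative, not Lotus's actual code.

```go
// Illustrative only -- not Lotus source. Shows how a gossipsub topic validator
// chooses between Accept / Ignore / Reject, where only Reject costs the sender
// peer score, and logs the verdict per sender peer ID.
package scoresketch

import (
	"context"
	"log"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// messageAlreadyKnown is a hypothetical stand-in for whatever check decides
// that we have already seen this message (e.g. an mpool lookup).
func messageAlreadyKnown(msg *pubsub.Message) bool {
	return false
}

// registerLoggingValidator attaches a validator that logs the verdict per
// sender, mirroring the kind of Ignore/Reject logging described above.
func registerLoggingValidator(ps *pubsub.PubSub, topic string) error {
	return ps.RegisterTopicValidator(topic,
		func(ctx context.Context, from peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
			if messageAlreadyKnown(msg) {
				// Ignore: drop the duplicate without cutting the sender's score.
				log.Printf("ignoring already-known message from %s", from)
				return pubsub.ValidationIgnore
			}
			// Real validation would go here; Reject is reserved for messages
			// that are genuinely invalid and worth a score penalty.
			log.Printf("accepting message from %s", from)
			return pubsub.ValidationAccept
		})
}
```

The relevant point for this issue is which bucket duplicate messages fall into: if they land in Reject rather than Ignore, the sender's score erodes over time.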
Having investigated a bit more, there is one funky thing we appear to do: we penalize peers who send us a message we've already added to our mpool. I'm not sure that this is the cause, but it does just seem wrong -- I've opened #10973 to discuss / fix this.

It will unfortunately be a little tricky to test whether this helps with the issue. In order to confirm that, we'll have to reproduce the issue on one of our nodes ("easy" to do) while connected to at least 2 other nodes -- one running the fix in #10973, the other not. Ideally, we'll see that the node with the patch doesn't penalize us and continues to send us blocks. We'll only really know for sure when a large number of users are running the patch on mainnet.
Based on logs shared by @TippyFlitsUK here, the most common reason for rejecting pubsub messages (and thus lowering peer scores) is the ErrExistingNonce error that was addressed in #10973. This also matches what I am seeing on my node. This does not necessarily mean #10973 will solve the issue in the OP, but it gives us some confidence that it might. Next steps towards this theory could be:

We should still be open to other theories, though. Trying to identify the exact triggers that would cause peers to stop sharing blocks (there are likely only 1-2 such triggers), and then identifying the things in Lotus that might lead to those triggers, could point us to the bug.
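To make the proposed direction concrete, below is a small hypothetical sketch of the kind of change #10973 is after: mapping an "existing nonce / already in mpool" outcome to Ignore rather than Reject, so that republished messages stop costing the sender score. The `ErrExistingNonce` sentinel and the function name here are placeholders, not the actual Lotus message pool code.

```go
// Sketch only: illustrates treating "we already have a message with this
// nonce" as Ignore (no score penalty) rather than Reject. Error names and
// structure are assumptions, not Lotus's actual implementation.
package mpoolsketch

import (
	"errors"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// ErrExistingNonce stands in for the mpool error returned when an incoming
// message duplicates a nonce we already hold.
var ErrExistingNonce = errors.New("message with existing nonce already in mpool")

// validationResultForAddError maps the outcome of adding a gossiped message
// to the mpool onto a gossipsub validation result.
func validationResultForAddError(err error) pubsub.ValidationResult {
	switch {
	case err == nil:
		return pubsub.ValidationAccept
	case errors.Is(err, ErrExistingNonce):
		// Duplicate of something we already have: drop it, but don't penalize
		// the sender -- republishing pending messages is normal behaviour.
		return pubsub.ValidationIgnore
	default:
		// Genuinely invalid message: reject and apply the score penalty.
		return pubsub.ValidationReject
	}
}
```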
I don't agree that #10973 fixes this sync failure case; I merged this commit and hit the case again.
@marco-storswift Thanks for testing this out! I'm also not 100% confident #10973 will fix the issue. However, in order for it to help, we actually need your peers to be running the patch; unfortunately, merging the commit on your own node won't help that node 😞 Is it possible for you to merge the patch onto a few more nodes and keep them connected to each other? That might give us some more insight (though we really won't know until the majority of nodes have upgraded).
I am also running the patch, Marco. Please feel free to connect to me at
@arajasek @TippyFlitsUK Good news: when I updated github.com/libp2p/go-libp2p to v0.27.6, everything was OK.
@marco-storswift That's GREAT news! The Lotus team will try to confirm this as well, but it would be awesome if more users could try this too. @shrenujbansal Can you throw up a tag (not an RC) that bumps go-libp2p to 0.27.6 on top of the latest v1.23.1 RC, and point some folks who were experiencing the issue at it? Fingers crossed we confirm the good news ❤️
Here's the tag with libp2p 0.27.6 on top of v1.23.1-rc4: https://github.com/filecoin-project/lotus/releases/tag/v1.23.1-libp2p-0.27.6

@marco-storswift Did you also have #10973 in your source code where you saw the issue fixed?
Yes, I had #10973 and the go-libp2p bump to 0.27.6. The sync is OK; I've been running the node for over 48 hours.
@shrenujbansal Can we please get an update on this (ideally daily updates, even if they're "no progress")?
Below is the summary from the debugging done between myself and @MarcoPolo. High-level points:

- As @arajasek pointed out above in #10906 (comment), the ErrExistingNonce error was being treated as a Reject and was penalizing peers incorrectly.
- This bug means that peers who republish their pending messages (automatically, roughly every 5 minutes) will be penalized by their gossipsub peers. Eventually their peers will remove them from their topics and they won't learn about new blocks over gossipsub; instead, their only sync mechanism will be the Lotus hello protocol when they discover new peers via Kademlia.
- This specific bug has been fixed with #10973, which is available on master and will be available with the next release. However, you will need several peers to have adopted this fix in order to see improvement.
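As a rough illustration of the republish behaviour described in the summary, the hypothetical sketch below re-broadcasts pending messages on a fixed interval; each round reaches peers that already hold those messages, which is exactly the traffic that was being mis-scored before #10973. The names and the 5-minute interval are taken loosely from the description above, not from Lotus itself.

```go
// Sketch of the republish behaviour described above: a node periodically
// re-broadcasts its still-pending messages, so its peers repeatedly receive
// messages they already hold. With the pre-#10973 behaviour those duplicates
// counted as Rejects and eroded the sender's score. Names and the exact
// interval are illustrative, not Lotus's actual code.
package republishsketch

import (
	"context"
	"time"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

const republishInterval = 5 * time.Minute // approximate, per the summary above

// republishPending re-broadcasts the raw bytes of pending messages on the
// given topic until the context is cancelled. pending is a placeholder for
// however the node enumerates its not-yet-included messages.
func republishPending(ctx context.Context, topic *pubsub.Topic, pending func() [][]byte) {
	ticker := time.NewTicker(republishInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, raw := range pending() {
				// Each publish reaches peers that already have this message;
				// they should Ignore it rather than Reject it.
				_ = topic.Publish(ctx, raw)
			}
		}
	}
}
```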
I think we're comfortable saying this was fixed in #10973.
Checklist
Latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.

Lotus component
Lotus Version
Repro Steps
Describe the Bug
This new issue aims to collate and organise multiple instances of SP feedback from the existing #10788 issue and across other messaging platforms such as Slack.
A new discussion thread has been created for each contributor and existing feedback has been pre-filled with the details that have already been provided.
The Lotus Team would be very grateful if returning contributors could add additional feedback and logs to their own dedicated discussion thread as and when it becomes available.
If you are experiencing this issue for the first time, please feel free to add your feedback and logs below this thread and the team will distribute it accordingly.
Many thanks all!! 🙏
@RobQuistNL - Logs & Feedback - User Logs & Feedback for issue 10788 #10907 (comment)
@marshyonline - Logs & Feedback - User Logs & Feedback for issue 10788 #10907 (comment)
@scaseye - Logs & Feedback - User Logs & Feedback for issue 10788 #10907 (comment)
@marco-storswift - Logs & Feedback - User Logs & Feedback for issue 10788 #10907 (comment)
@Trevor-K-Smith - Logs & Feedback - User Logs & Feedback for issue 10788 #10907 (comment)
@stuberman - Logs & Feedback - User Logs & Feedback for issue 10788 #10907 (comment)
@piknikSteven2021 - Logs & Feedback - User Logs & Feedback for issue 10788 #10907 (comment)
@donkabat - Logs & Feedback - User Logs & Feedback for issue 10788 #10907 (comment)
Logging Information