Initial passive testing is happy but not next ones started with 10m succession #765

Closed
masih opened this issue Nov 29, 2024 · 4 comments

@masih
Member

masih commented Nov 29, 2024

Critical question: Why is it that the first test in the morning always seems to work well, while successive tests do not run as well?

Looking at the pubsub settings we forked over from Lotus, there are... a lot of questionable decisions that seem to be rooted in pre-F3 Filecoin network behaviour (e.g. this).

I wonder whether a change of passive testing network causes some loss of mesh or unfair peer scoring, such that the gossipsub mesh becomes ineffective to the point where messages simply do not propagate fast enough. Take invalid message scoring, for example: when networks change, it is inevitable that some messages arrive from the previous network and are considered invalid. We also observe a spike in invalid message errors in the validation flow, documented here, at the initial instance.

So...

  • Could the ineffective gossipsub be, at least to some extent, the result of changing networks during passive testing?
  • Are there pubsub parameters that unfairly reduce ranking or negatively impact the mesh by penalising what passive testing does (change of topic, resubscription, dropping messages between networks)?
  • Could the current pubsub settings impact peer ranking even within a single passive testing network as instances progress, e.g. due to a high rate of validation ignores?
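To make the invalid-message concern concrete, here is a minimal sketch of how the gossipsub v1.1 scoring function treats invalid deliveries: the per-topic counter is squared and multiplied by a negative weight, and the counter decays geometrically at each decay interval until it falls below a decay-to-zero cutoff. All numeric values below (weight, decay, cutoff, counter) are hypothetical and are not the settings used by Lotus or F3.

```python
def invalid_delivery_penalty(counter: float, weight: float) -> float:
    """Score contribution of invalid deliveries: weight * counter^2 (weight is negative)."""
    return weight * counter ** 2


def intervals_until_forgiven(counter: float, decay: float, decay_to_zero: float) -> int:
    """Count decay intervals until the counter is reset to zero.

    Each interval multiplies the counter by `decay`; once the result drops
    below `decay_to_zero`, the counter is treated as zero.
    """
    intervals = 0
    while counter > 0:
        counter *= decay
        if counter < decay_to_zero:
            counter = 0.0
        intervals += 1
    return intervals


if __name__ == "__main__":
    # Hypothetical: 3 stale messages from the previous network, weight -1000.
    print(invalid_delivery_penalty(3, -1000.0))
    # Hypothetical: counter of 1, halved every interval, zeroed below 0.1.
    print(intervals_until_forgiven(1.0, 0.5, 0.1))
```

Because the penalty is quadratic, a short burst of stale messages right after a network switch can weigh much more than the same number spread over time, which is why the decay settings matter for the questions above.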
@masih
Member Author

masih commented Nov 29, 2024

It also looks like Lotus (and by extension Observer, F3, etc.) retains negative scoring for 6 hours. This is set at the top-level pubsub. I assume it affects the whole pubsub instance, i.e. all topics for its lifetime.
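As a rough illustration of why the retention window matters here, the sketch below checks whether a penalty earned in one test run could still be remembered when the next run starts. It assumes only the 6-hour figure quoted in the comment above; the function name and the sample gaps (other than the 10-minute spacing from the issue title) are hypothetical.

```python
from datetime import timedelta

# Retention window as quoted in the comment above.
RETAIN_SCORE = timedelta(hours=6)


def penalty_may_carry_over(gap: timedelta, retain: timedelta = RETAIN_SCORE) -> bool:
    """Return True if a retained negative score could still exist after `gap`."""
    return gap < retain


if __name__ == "__main__":
    for gap in (timedelta(minutes=10), timedelta(minutes=30), timedelta(hours=12)):
        print(gap, penalty_may_carry_over(gap))
```

Under this assumption, any spacing shorter than 6 hours leaves room for carried-over penalties, so only a pause longer than the retention window isolates a run from its predecessors.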

@rjan90
Contributor

rjan90 commented Nov 29, 2024

Anecdotally I see a lot of PeerIDs with the exact same really high negative score:

lotus net scores
12D3KooWBPyrDyrTRchikR56W21cW3dQ5YRDeAgCZvPjw7jopfuU, -1795600.000000
12D3KooWBNh4V7JeEvYLKvSbGeMMMFJyB3vavEyEipqNYaZh9cNS, -1795600.000000
12D3KooWBNMVxsBq4T5T8qX8E1FWhfyVULDJ56a3mE1m6r3bEJ8f, -1795600.000000
12D3KooWAy4R5DgHcAuP7Z6CJyesQXkNPfoBFShMtdMtg1z3dhWS, -1795600.000000
12D3KooWAmPdJJcrNQ9qL4Dtj229kJ2VngPtrEmz6fd7duc6N8Q4, -1795600.000000
12D3KooWAewsJcXcVoEhCwfvD7zWwCPae8WtVvcL8nvy84HdNivL, -1795600.000000
12D3KooWAY9Vq9wzqRjzaoKheXPDVf9YCf1GpQ32V4mtjtxAaHPW, -1795600.000000
12D3KooWAPsAXsxBpuRJbjiX7cFzNsA8A1UZe8ikWsbgxZ7DDu5Y, -1795600.000000
12D3KooWAEZaEAwxco3Coho2c4KESS5Q868NYhXzSHAXdvwomYAt, -1795600.000000

A total of 139 on my node with the exact same negative score, out of a total of:

lotus net scores | wc -l
2045

Total number of PeerIDs that have negative scores is:

622
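The counts above can be reproduced with a small script. This is a sketch that assumes the `lotus net scores` output has the `peerID, score` shape shown in the sample; the helper names are hypothetical, and the zero-score peer ID in the usage example is made up for illustration.

```python
from collections import Counter


def parse_scores(text: str) -> dict[str, float]:
    """Parse 'peerID, score' lines, skipping anything that does not fit."""
    scores = {}
    for line in text.splitlines():
        peer_id, sep, raw = line.strip().rpartition(",")
        if not sep:
            continue  # no comma: not a score line (e.g. the command itself)
        try:
            scores[peer_id.strip()] = float(raw)
        except ValueError:
            continue
    return scores


def summarize(scores: dict[str, float]) -> dict:
    """Total peers, peers with a negative score, and the most frequent negative value."""
    negative = [s for s in scores.values() if s < 0]
    most_common = Counter(negative).most_common(1)
    return {
        "total": len(scores),
        "negative": len(negative),
        "most_common_negative": most_common[0] if most_common else None,
    }


if __name__ == "__main__":
    sample = """lotus net scores
12D3KooWBPyrDyrTRchikR56W21cW3dQ5YRDeAgCZvPjw7jopfuU, -1795600.000000
12D3KooWBNh4V7JeEvYLKvSbGeMMMFJyB3vavEyEipqNYaZh9cNS, -1795600.000000
12D3KooWHypotheticalPeerWithZeroScore, 0.000000
"""
    print(summarize(parse_scores(sample)))
```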

@rjan90
Contributor

rjan90 commented Nov 29, 2024

For clarity, I also grepped for the peers that subscribe to F3. Most have a score of 0, with some occasional negative ones, but none with the very high negative score shown above.

{"ID":"12D3KooW9sCwBYPVGr9T7A5DMzk8qF4wdGtTGSREK7kMLdJDBLR6","Score":{"Score":0,"Topics":{"/f3/granite/0.0.2/filecoin/21":{"TimeInMesh":0,"FirstMessageDeliveries":0,"MeshMessageDeliveries":0,"InvalidMessageDeliveries":0}},"AppSpecificScore":0,"IPColocationFactor":0,"BehaviourPenalty":0}}
{"ID":"12D3KooW9rUCW2eEmbZsGarEBzdh7RwqZXzVhm5yW4GHpM4PxGLV","Score":{"Score":0,"Topics":{"/f3/granite/0.0.2/filecoin/21":{"TimeInMesh":0,"FirstMessageDeliveries":0,"MeshMessageDeliveries":0,"InvalidMessageDeliveries":0}},"AppSpecificScore":0,"IPColocationFactor":0,"BehaviourPenalty":0}}
{"ID":"12D3KooW9qsRsJmXkgYuyJnZNDwpB75Lhs1dw6myiNFDTLgwgbQA","Score":{"Score":0,"Topics":{"/f3/granite/0.0.2/filecoin/21":{"TimeInMesh":0,"FirstMessageDeliveries":0,"MeshMessageDeliveries":0,"InvalidMessageDeli

And the ones with extremely high negative scores are all down to IPColocationFactor:

{"ID":"12D3KooWBNMVxsBq4T5T8qX8E1FWhfyVULDJ56a3mE1m6r3bEJ8f","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAy4R5DgHcAuP7Z6CJyesQXkNPfoBFShMtdMtg1z3dhWS","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAmPdJJcrNQ9qL4Dtj229kJ2VngPtrEmz6fd7duc6N8Q4","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAewsJcXcVoEhCwfvD7zWwCPae8WtVvcL8nvy84HdNivL","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAY9Vq9wzqRjzaoKheXPDVf9YCf1GpQ32V4mtjtxAaHPW","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAPsAXsxBpuRJbjiX7cFzNsA8A1UZe8ikWsbgxZ7DDu5Y","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAEZaEAwxco3Coho2c4KESS5Q868NYhXzSHAXdvwomYAt","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWA4kybwSTq57KfMxJ4unVPbFTXTxRpe1S5HcKALRvu2FY","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}

Another test after a prolonged pause should be run to rule out peer scores, but it does not seem that F3 peer IDs get negatively scored.

@masih masih self-assigned this Dec 3, 2024
@masih
Member Author

masih commented Dec 3, 2024

I have not found sufficient evidence to believe this is a genuine issue:

  • Increased time between successive passive testing to 30m.
  • Retried the day after with no over-night tests.
  • Evidence is inconclusive: I cannot reproducibly get the network to misbehave as a function of the time between successive passive tests.

Closing.

@masih masih closed this as completed Dec 3, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in F3 Dec 3, 2024