Same pubsub messages propagated multiple times #1043
@wemeetagain hi, I am not familiar with the js-libp2p-pubsub side of things and would appreciate some help in pointing this towards maintainers of that part of the stack: do you think this should be moved and tracked under https://github.com/ChainSafe/js-libp2p-pubsub or https://github.com/ChainSafe/js-libp2p-gossipsub? |
I'm thinking it should be under gossipsub. Looking at gossipsub, I don't see any 6 hr timeouts by default. A few mitigations you might try:
import { ERR_TOPIC_VALIDATOR_REJECT, ERR_TOPIC_VALIDATOR_IGNORE } from "libp2p-gossipsub/src/constants"

// `decode` and `OLD_MSG_THRESHOLD` are application-specific: decode() parses your
// message payload and OLD_MSG_THRESHOLD is the maximum message age (in ms) you accept.
gossipsub.topicValidators.set(topic, (receivedTopic, msg) => {
  const decoded = decode(msg.data)
  if (Date.now() - decoded.timestamp > OLD_MSG_THRESHOLD) {
    const err = new Error("Message timestamp too old")
    err.code = ERR_TOPIC_VALIDATOR_REJECT // downscore offending peer
    // or
    // err.code = ERR_TOPIC_VALIDATOR_IGNORE // don't downscore offending peer
    throw err
  }
})
My best guess, without knowing any more, is: a new person joined the network with an overloaded computer / very laggy internet such that they received and re-sent duplicate messages. Happy to help dive deeper on this if you have any more analytics. |
This seems like it could likely limit the impact, but not prevent the issue.
This is interesting, but we don't have timestamps in all of our messages so it wouldn't work universally.
Could this really explain messages being observed over 100 times? |
If messages took longer than 30 seconds to travel to and from the laggy peer, then it seems like it could. Any message published to that peer would then be rebroadcast to the network. And if that message is republished to another peer which then republishes it back to the laggy peer, the message could be rebroadcast again. But then again, 30 seconds is ... a long time. |
Just wanted to let you all know we've experienced this two more times in the last week. We shut down all the nodes that we operate on our network at the same time, then brought them back online, and that seemed to resolve things. We're going to look into increasing the EDITED: to clarify that we shut down all the nodes that we control, not every single node on the network, as there are nodes outside of our control |
Some additional information - we're seeing tens of thousands of messages like this one when the issue occurs:
This is not one of our nodes. |
A possible scenario where this can occur is when a message is resent outside the timecache/seen window. The way to avoid this problem in general is to either increase the timecache window or to introduce a validator that can persist seen message ids to disk. When there is a blockchain involved, there is natural protection from this happening by virtue of state maintenance. |
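As a rough illustration of the first option, here is a minimal sketch of widening the duplicate-suppression window when constructing gossipsub directly. It assumes a js-libp2p-gossipsub release that exposes a seenTTL option (present in recent releases, possibly not in older ones), and the exact constructor signature depends on the libp2p version in use:
// Sketch: keep message ids in the seen cache for longer than the default window.
// `seenTTL` is assumed to be supported by the installed gossipsub release.
import Gossipsub from "libp2p-gossipsub"

const gossipsub = new Gossipsub(libp2p, {
  emitSelf: false,
  seenTTL: 10 * 60 * 1000 // remember seen message ids for 10 minutes
})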
note that the anomalous behaviour may well be a replay attack. |
also note that the disk-persistent data structure should only be used after validation, to avoid DoS vectors that clog it with invalid messages; this is why it is usually done in the validator. |
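A hedged sketch of that validator pattern follows. The seenStore (has/put) and validateApplicationRules helpers are hypothetical, application-provided pieces rather than part of gossipsub, and ERR_TOPIC_VALIDATOR_IGNORE is imported as in the earlier snippet; the key point is that the persistent store is only written after the message passes validation:
// Hypothetical persistent de-duplication inside a topic validator.
gossipsub.topicValidators.set(topic, async (receivedTopic, msg) => {
  const msgId = Buffer.from(msg.seqno).toString("hex")

  if (await seenStore.has(msgId)) {
    const err = new Error("Message already seen")
    err.code = ERR_TOPIC_VALIDATOR_IGNORE // drop without penalizing the sender
    throw err
  }

  // Validate first, persist second, so invalid spam cannot clog the store (DoS).
  await validateApplicationRules(msg)
  await seenStore.put(msgId, true)
})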
I have a half-baked theory here. We know that js-ipfs has a memory leak - we consistently see memory usage grow over time on all our nodes, and we've had reports of similar from the community for nodes they are running. My theory is that a node on the network is close to running out of memory and is spending longer and longer on each garbage collection cycle. This causes CPU usage to spike and the node to slow down processing messages, until it starts taking longer than 30 seconds (the default message TTL) to process messages, kicking off this flood. This is still only a theory: none of our nodes that we have visibility into were at high memory when the last flood happened. It's possible that this is what happened on one of the nodes we don't run or have visibility into, but we can't know for sure. One flaw in this theory is that if this were the cause, I would expect that node to crash from running out of memory pretty soon after hitting that severe a level of performance degradation, but we saw one of these floods last 6 hours, which would be a long time to be in such a degraded state without crashing. So it's not a perfect theory, but it still seems at least plausible. |
Unless anyone comes up with repro/fix, need to close this – we are lacking an actionable next step here. |
Seems like a reasonable fix in js-ipfs would be to set the TTL to 2 min as that's what it is in go-ipfs. |
TODO: we'll create an issue in js-ipfs to increase the default to 2 min |
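For anyone who wants this behaviour before a default change lands, a sketch of overriding the pubsub options through the js-ipfs libp2p config is below. It assumes that options under libp2p's config.pubsub are forwarded to the gossipsub constructor and that the installed gossipsub supports seenTTL; both depend on the js-ipfs/libp2p/gossipsub versions in use:
// Sketch only: option forwarding and `seenTTL` support depend on installed versions.
import { create } from "ipfs-core"

const ipfs = await create({
  config: {
    Pubsub: { Router: "gossipsub" }
  },
  libp2p: {
    config: {
      pubsub: {
        enabled: true,
        seenTTL: 2 * 60 * 1000 // match go-ipfs: remember seen messages for 2 minutes
      }
    }
  }
})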
@oed I am afraid I don't have the full context here, mind closing this issue and filing one in https://github.com/ipfs/js-ipfs about changing the default? |
I opened up a PR on the js gossipsub implementation: ChainSafe/js-libp2p-gossipsub#200 |
Incident report from 3Box Labs team
The following issue was observed in our js-ipfs nodes, which serve the Ceramic network. We have one shared pubsub topic across these nodes (and nodes run by other parties in the Ceramic network).
Platform:
We run js-ipfs in a Docker container hosted on AWS Fargate on Ubuntu Linux. The container instance has 4096 CPU units (4 vCPU) and 8192 MiB (8 GiB) of RAM. An application load balancer sits in front of the instance to send/receive internet traffic.
Our ipfs daemon is configured with the DHT disabled and pubsub enabled. We swarm connect via wss multiaddresses.
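For context, the relevant js-ipfs calls look roughly like the sketch below; the multiaddr and topic name are placeholders, not our actual values:
// Illustrative only: placeholder wss multiaddr and topic name.
import { create } from "ipfs-core"

const ipfs = await create()

// Dial a remote peer over secure websockets (wss).
await ipfs.swarm.connect(
  "/dns4/node.example.com/tcp/443/wss/p2p/QmPeerIdPlaceholder"
)

// All nodes share a single pubsub topic for network updates.
await ipfs.pubsub.subscribe("/example/shared-topic", (msg) => {
  console.log("received", msg.from, msg.seqno)
})
await ipfs.pubsub.publish("/example/shared-topic", new TextEncoder().encode("hello"))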
Subsystem:
pubsub
Severity:
High
Description:
Nodes in the Ceramic network use js-libp2p-pubsub as exposed by js-ipfs to propagate updates etc. about data in the network. On Nov 30 we observed some very strange behavior of the pubsub system which caused our nodes to crash repeatedly, even as they were restarted.
What we observed
In a time period of roughly 6h we observed multiple pubsub messages being sent over and over again in the network. We know these are duplicates because they share the same seqno (a field generated by pubsub). The peers that create the messages also include a timestamp in the message payload when they are first sent, and we observed messages created at the beginning of the 6h time period still being propagated at the end of it.
The plot above shows the messages that were duplicated. As can be observed, the number of duplicate messages starts after 12 PM and grows steadily until around 5 PM. We believe this is because older messages are still being sent around as new ones are introduced. As can be seen in the table below the plot, many messages are seen over 100 times by a single node.
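For anyone reproducing this kind of analysis, one way to tally duplicates on a single node is to key received messages on their seqno, roughly as in the sketch below (it assumes an `ipfs` instance as above; the topic name is a placeholder):
// Sketch: count how many times each seqno is delivered to this node.
const seenCounts = new Map()

await ipfs.pubsub.subscribe("/example/shared-topic", (msg) => {
  const key = Buffer.from(msg.seqno).toString("hex")
  const count = (seenCounts.get(key) || 0) + 1
  seenCounts.set(key, count)
  if (count > 1) {
    console.warn(`duplicate message ${key} seen ${count} times`)
  }
})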
Additional information
Open questions
We still don't really know what the inciting incident was that kicked off this whole event. We've been running for months now without ever seeing anything like this. There haven't been any changes to our ipfs deployment or setup in several weeks/months, nor have there been any changes to the ceramic code that affects our use of pubsub or IPFS in several weeks/months either.
We also don't really know how/why the system recovered 6 hours after the problem started. Question for Libp2p maintainers: are there any 6 hour timeouts in the system anywhere that might be relevant?
Happy to provide any additional information that would be useful to decipher what happened here, let us know!
cc @stbrody @v-stickykeys @smrz2001 @zachferland
Paging @vasco-santos @wemeetagain, maybe you have some insights as to what could have been going on here?