
Same pubsub messages propagated multiple times #1043

Closed
oed opened this issue Dec 2, 2021 · 16 comments
Labels
need/analysis Needs further analysis before proceeding
need/triage Needs initial labeling and prioritization

Comments

@oed

oed commented Dec 2, 2021

Incident report from 3Box Labs team

The following issue was observed in our js-ipfs nodes, which serve the Ceramic network. We have one shared pubsub topic across these nodes (and the nodes run by other parties in the Ceramic network).

  • Version:
"libp2p": "^0.32.0",
"libp2p-gossipsub": "^0.11.1",
"ipfs-core": "~0.7.1",
"ipfs-core-types": "~0.5.1",
"ipfs-http-gateway": "~0.4.2",
"ipfs-http-server": "~0.5.1"
  • Platform:
    We run js-ipfs in a Docker container hosted on AWS Fargate on Ubuntu Linux. The container instance has 4096 CPU units (4 vCPUs) and 8192 MiB of RAM. An application load balancer sits in front of the instance to send and receive internet traffic.
    Our ipfs daemon is configured with the DHT disabled and pubsub enabled (a sketch of this setup follows this list). We swarm connect via wss multiaddresses.

  • Subsystem:
    pubsub
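
A rough sketch of this kind of setup with ipfs-core is below. It is illustrative only: the key names follow the js-ipfs config schema, the listen address is a placeholder, and it assumes an ipfs-core version that exports create. The DHT is simply left at its js-ipfs default of disabled.

import { create } from 'ipfs-core'

// Illustrative js-ipfs node with pubsub enabled over gossipsub; js-ipfs leaves
// the DHT off unless it is explicitly enabled, so nothing is set for it here.
const ipfs = await create({
  config: {
    Addresses: {
      Swarm: ['/ip4/0.0.0.0/tcp/4011/ws'] // placeholder websocket listen address
    },
    Pubsub: {
      Router: 'gossipsub',
      Enabled: true
    }
  }
})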

Severity:

High

Description:

Nodes in the Ceramic network use js-libp2p-pubsub, as exposed by js-ipfs, to propagate updates about data in the network. On Nov 30 we observed some very strange behavior in the pubsub system that caused our nodes to crash continuously, even as they were restarted.

What we observed

In a time period of roughly 6 hours we observed multiple pubsub messages being sent over and over again in the network. We know these are duplicates because they share the same seqno (a field generated by pubsub). The peers that create the messages also include a timestamp in the message payload when they are first sent, and we observed messages created at the beginning of the 6-hour window still being propagated at the end of it.
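
As a rough illustration of how such duplicates can be spotted (a sketch, not our actual monitoring code; the topic name is a placeholder and ipfs is an already-initialized ipfs-core instance):

// Count deliveries per seqno on a subscribed topic; anything above 1 is a duplicate.
const seenCounts = new Map()

await ipfs.pubsub.subscribe('/ceramic/example-topic', (msg) => {
  // msg.seqno is a byte array; hex-encode it so it can be used as a Map key
  const seqno = Buffer.from(msg.seqno).toString('hex')
  const count = (seenCounts.get(seqno) || 0) + 1
  seenCounts.set(seqno, count)
  if (count > 1) {
    console.warn(`seqno ${seqno} from ${msg.from} delivered ${count} times`)
  }
})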

[Plot omitted: count of duplicated pubsub messages over time, with a table of per-message duplicate counts below it.]
The plot shows the duplicated messages. The number of duplicates starts after 12 PM and grows steadily until around 5 PM; we believe this is because older messages keep being re-sent even as new ones are introduced. As can be seen in the table below the plot, many messages were seen over 100 times by a single node.

Additional information

  • By the end of this 6-hour window we saw over 25k messages per minute, far above our normal steady state of around 100 per minute.
  • We believe there are only ~20 nodes on our network subscribed to the pubsub topic in question
  • The js-ipfs nodes we run for Ceramic run with the DHT turned off. We instead have a "peerlist" containing the swarm multiaddresses of every node on the network, and on startup we manually swarm connect to all of them (see the sketch after this list).
  • We observed that two multiaddresses in the above peerlist shared the same PeerId. We are 90% certain that these nodes do not share the same private key, and that one of them simply copied the other's multiaddress incorrectly.
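
The startup connection step looks roughly like this (a sketch; the addresses are placeholders, not our real peerlist):

// Dial every multiaddress in a static peerlist at startup.
const PEERLIST = [
  '/dns4/node-a.example.com/tcp/4012/wss/p2p/QmPeerA', // placeholder
  '/dns4/node-b.example.com/tcp/4012/wss/p2p/QmPeerB'  // placeholder
]

for (const addr of PEERLIST) {
  try {
    await ipfs.swarm.connect(addr)
  } catch (err) {
    console.error(`failed to swarm connect to ${addr}`, err)
  }
}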

Open questions

We still don't really know what the inciting incident was that kicked off this whole event. We've been running for months now without ever seeing anything like this. There haven't been any changes to our ipfs deployment or setup in several weeks/months, nor any changes to the Ceramic code that affect our use of pubsub or IPFS in that time.

We also don't really know how/why the system recovered 6 hours after the problem started. Question for Libp2p maintainers: are there any 6 hour timeouts in the system anywhere that might be relevant?

Happy to provide any additional information that would be useful to decipher what happened here, let us know!

cc @stbrody @v-stickykeys @smrz2001 @zachferland

Paging @vasco-santos @wemeetagain, maybe you have some insights as to what could have been going on here?

@oed oed added the need/triage Needs initial labeling and prioritization label Dec 2, 2021
@lidel lidel added the need/analysis Needs further analysis before proceeding label Dec 3, 2021
@lidel
Member

lidel commented Dec 6, 2021

@wemeetagain hi, I am not familiar with the js-libp2p-pubsub side of things and would appreciate some help in pointing this towards maintainers of that part of the stack: do you think this should be moved and tracked under https://github.com/ChainSafe/js-libp2p-pubsub or https://github.com/ChainSafe/js-libp2p-gossipsub?

@wemeetagain
Member

do you think this should be moved

I'm thinking it should be under gossipsub.

Looking at gossipsub, I don't see any 6 hr timeouts by default.
The seenCache is what filters out duplicate messages, and by default, the TTL for cache entries is 30 seconds.

A few mitigations you might try:

  • set the TTL higher (see the sketch after the validator example below)
  • add topic validation to filter out old messages, optionally penalizing the peer that's causing trouble
import { ERR_TOPIC_VALIDATOR_REJECT, ERR_TOPIC_VALIDATOR_IGNORE } from "libp2p-gossipsub/src/constants"

// `decode` and OLD_MSG_THRESHOLD are application-specific: however you parse
// your message payload and whatever message age you consider too old.
gossipsub.topicValidators.set(topic, (topic, msg) => {
  const decoded = decode(msg.data);
  if (Date.now() - decoded.timestamp > OLD_MSG_THRESHOLD) {
    const err = new Error("Message timestamp too old")
    err.code = ERR_TOPIC_VALIDATOR_REJECT // downscore offending peer
    // or
    // err.code = ERR_TOPIC_VALIDATOR_IGNORE // don't downscore offending peer
    throw err
  }
})
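
For the first suggestion, a sketch of raising the seen-cache TTL. This assumes a libp2p-gossipsub release whose constructor accepts a seenTTL option in milliseconds; check the version you run, since older releases hard-code the 30 second value. `libp2p` here is your existing libp2p instance.

import Gossipsub from 'libp2p-gossipsub'

// Keep message IDs in the seen cache for 2 minutes instead of the 30 s default.
// The seenTTL option is assumed to exist in the installed gossipsub version.
const pubsub = new Gossipsub(libp2p, {
  seenTTL: 120 * 1000
})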

We've been running for months now without ever seeing anything like this.

My best guess, without knowing any more, is: a new person joined the network with an overloaded computer / very laggy internet connection, such that they received and sent duplicate messages. Happy to help dive deeper on this if you have any more analytics.

@oed
Author

oed commented Dec 7, 2021

set the TTL higher

This seems like it would likely limit the impact, but not prevent the issue.

add topic validation to filter out old messages, optionally penalizing the peer that's causing trouble

This is interesting, but we don't have timestamps in all of our messages so it wouldn't work universally.

My best guess, without knowing any more is: a new person joined the network with an overloaded computer / very laggy internet such that they received and send duplicate messages.

Could this really explain messages being observed over 100 times?

@wemeetagain
Member

Could this really explain messages being observed over 100 times?

If messages took longer than 30 seconds to travel to and from the laggy peer, then it seems like it could.

Then any message published to that peer would be rebroadcast to the network. And if that message is republished to another peer which then republishes it back to the laggy peer, the message could be rebroadcast yet again.

But then again, 30 seconds is ... a long time.

@stbrody
Contributor

stbrody commented Jan 20, 2022

Just wanted to let you all know we've experienced this two more times in the last week. We shut down all the nodes that we operate on our network at the same time, then brought them back online, and that seemed to resolve things. We're going to look into increasing the seenCache TTL next and see if that helps.

EDITED: to clarify that we shut down all the nodes that we control, not every single node on the network, as there are nodes outside of our control

@smrz2001

Some additional information - we're seeing tens of thousands of messages like this one when the issue occurs:

libp2p:gossipsub:score unexpected delivery: message from QmWT2AgJwdxxxxxxxxxxxxxxxxxxxxxxxxxfUox52T5C1K was first seen 58051 ago and has delivery status NaN

This is not one of our nodes.

@lidel
Member

lidel commented Jan 28, 2022

@raulk @vyzo any ideas on what data we should gather / look at to get closer to the cause here?

@vyzo

vyzo commented Jan 28, 2022

A possible scenario where this can occur is when a message is resent outside the timecache/seen window.
Were there some very slow receivers involved?

The way to avoid this problem in general is to either increase the timecache window or to introduce a validator that can persist seen message ids to disk.
It might be worthwhile plugging the backend for this into pubsub directly.

When there is a blockchain involved, there is natural protection from this happening by virtue of state maintenance.

@vyzo

vyzo commented Jan 28, 2022

note that the anomalous behaviour may well be a replay attack.

@vyzo

vyzo commented Jan 28, 2022

Also note that the disk-persistent data structure should only be used after validation, to avoid DoS vectors that clog it with invalid messages; this is why it is usually done in the validator.
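
A sketch of that validator-level persistence (seenStore is a placeholder for any durable key-value store with has/put, the validator is assumed to be allowed to be async, and ERR_TOPIC_VALIDATOR_IGNORE is the constant imported in the earlier example):

// Check and persist message IDs from inside the topic validator. If you also
// validate the payload (as in the earlier example), do that first so invalid
// messages never reach the store and can't clog it.
gossipsub.topicValidators.set(topic, async (topic, msg) => {
  const id = Buffer.from(msg.seqno).toString('hex') // or your own message-id function
  if (await seenStore.has(id)) {
    const err = new Error('Message already seen')
    err.code = ERR_TOPIC_VALIDATOR_IGNORE // drop it without penalizing the sender
    throw err
  }
  await seenStore.put(id, Date.now())
})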

@stbrody
Contributor

stbrody commented Jan 31, 2022

Were there some very slow receivers involved?

I have a half-baked theory here. We know that js-ipfs has a memory leak - we consistently see memory usage grow over time on all our nodes, and we've had reports of similar behavior from community members for the nodes they run. My theory is that a node on the network is close to running out of memory and is spending longer and longer on each garbage collection cycle. This causes CPU usage to spike and the node to slow down processing messages, until it starts taking longer than 30 seconds (the default seen-cache TTL) to process messages, kicking off this flood.

This is still only a theory; none of our nodes that we have visibility into were at high memory when the last flood happened. It's possible that this is what happened on one of the nodes we don't run or have visibility into, but we can't know for sure.

One flaw in this theory is that if this were the cause, I would expect that node to crash from running out of memory pretty soon after hitting that severe a level of performance degradation, but we saw one of these floods last 6 hours, which would be a long time to be in such a degraded state without crashing. So it's not a perfect theory, but it still seems at least plausible.

@lidel
Member

lidel commented Feb 4, 2022

Unless anyone comes up with a repro or a fix, we need to close this – we are lacking an actionable next step here.

@oed
Author

oed commented Feb 4, 2022 via email

@BigLep
Contributor

BigLep commented Feb 11, 2022

TODO: we'll create an issue in js-ipfs to increase the default to 2 min

@lidel
Member

lidel commented Mar 4, 2022

@oed I am afraid I don't have the full context here, mind closing this issue and filing one in https://github.com/ipfs/js-ipfs about changing the default?

@stbrody
Contributor

stbrody commented Mar 11, 2022

I opened up a PR on the js gossipsub implementation: ChainSafe/js-libp2p-gossipsub#200
