
Same pubsub messages propagated multiple times #1043

Closed
oed opened this issue Dec 2, 2021 · 16 comments
Labels
need/analysis Needs further analysis before proceeding
need/triage Needs initial labeling and prioritization

Comments

@oed

oed commented Dec 2, 2021

Incident report from 3Box Labs team

The following issue was observed in our js-ipfs nodes, which serve the Ceramic network. We have one shared pubsub topic across these nodes (and the nodes run by other parties in the Ceramic network).

  • Version:
"libp2p": "^0.32.0",
"libp2p-gossipsub": "^0.11.1",
"ipfs-core": "~0.7.1",
"ipfs-core-types": "~0.5.1",
"ipfs-http-gateway": "~0.4.2",
"ipfs-http-server": "~0.5.1"
  • Platform:
    We run js-ipfs in a Docker container hosted on AWS Fargate on Ubuntu Linux. The container instance has 4096 CPU units (4 vCPUs) and 8192 MiB of RAM. An application load balancer sits in front of the instance to send and receive internet traffic.
    Our ipfs daemon is configured with the DHT disabled and pubsub enabled (a sketch of this setup follows this list). We swarm connect via wss multiaddresses.

  • Subsystem:
    pubsub
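
A rough sketch of this kind of setup with ipfs-core is below. It is illustrative only: the key names follow the js-ipfs config schema, the listen address is a placeholder, and it assumes an ipfs-core version that exports create. The DHT is simply left at its js-ipfs default of disabled.

import { create } from 'ipfs-core'

// Illustrative js-ipfs node with pubsub enabled over gossipsub; js-ipfs leaves
// the DHT off unless it is explicitly enabled, so nothing is set for it here.
const ipfs = await create({
  config: {
    Addresses: {
      Swarm: ['/ip4/0.0.0.0/tcp/4011/ws'] // placeholder websocket listen address
    },
    Pubsub: {
      Router: 'gossipsub',
      Enabled: true
    }
  }
})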

Severity:

High

Description:

Nodes in the Ceramic network use js-libp2p-pubsub, as exposed by js-ipfs, to propagate updates about data in the network. On Nov 30 we observed some very strange behavior in the pubsub system that caused our nodes to crash continuously, even as they were restarted.

What we observed

In a time period of roughly 6 hours we observed multiple pubsub messages being sent over and over again in the network. We know these are duplicates because they share the same seqno (a field generated by pubsub). The peers that create the messages also include a timestamp in the message payload when they are first sent, and we observed messages created at the beginning of the 6-hour window still being propagated at the end of it.
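
As a rough illustration of how such duplicates can be spotted (a sketch, not our actual monitoring code; the topic name is a placeholder and ipfs is an already-initialized ipfs-core instance):

// Count deliveries per seqno on a subscribed topic; anything above 1 is a duplicate.
const seenCounts = new Map()

await ipfs.pubsub.subscribe('/ceramic/example-topic', (msg) => {
  // msg.seqno is a byte array; hex-encode it so it can be used as a Map key
  const seqno = Buffer.from(msg.seqno).toString('hex')
  const count = (seenCounts.get(seqno) || 0) + 1
  seenCounts.set(seqno, count)
  if (count > 1) {
    console.warn(`seqno ${seqno} from ${msg.from} delivered ${count} times`)
  }
})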

[Plot omitted: count of duplicated pubsub messages over time, with a table of per-message duplicate counts below it.]
The plot shows the duplicated messages. The number of duplicates starts after 12 PM and grows steadily until around 5 PM; we believe this is because older messages keep being re-sent even as new ones are introduced. As can be seen in the table below the plot, many messages were seen over 100 times by a single node.

Additional information

  • By the end of this 6-hour window we saw over 25k messages per minute, far above our normal steady state of around 100 per minute.
  • We believe there are only ~20 nodes on our network subscribed to the pubsub topic in question
  • The js-ipfs nodes we run for Ceramic run with the DHT turned off. We instead have a "peerlist" containing the swarm multiaddresses of every node on the network, and on startup we manually swarm connect to all of them (see the sketch after this list).
  • We observed that two multiaddresses in the above peerlist shared the same PeerId. We are 90% certain that these nodes do not share the same private key, and that one of them simply copied the other's multiaddress incorrectly.
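
The startup connection step looks roughly like this (a sketch; the addresses are placeholders, not our real peerlist):

// Dial every multiaddress in a static peerlist at startup.
const PEERLIST = [
  '/dns4/node-a.example.com/tcp/4012/wss/p2p/QmPeerA', // placeholder
  '/dns4/node-b.example.com/tcp/4012/wss/p2p/QmPeerB'  // placeholder
]

for (const addr of PEERLIST) {
  try {
    await ipfs.swarm.connect(addr)
  } catch (err) {
    console.error(`failed to swarm connect to ${addr}`, err)
  }
}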

Open questions

We still don't really know what the inciting incident was that kicked off this whole event. We've been running for months now without ever seeing anything like this. There haven't been any changes to our ipfs deployment or setup in several weeks/months, nor any changes to the Ceramic code that affect our use of pubsub or IPFS in that time.

We also don't really know how/why the system recovered 6 hours after the problem started. Question for Libp2p maintainers: are there any 6 hour timeouts in the system anywhere that might be relevant?

Happy to provide any additional information that would be useful to decipher what happened here, let us know!

cc @stbrody @v-stickykeys @smrz2001 @zachferland

Paging @vasco-santos @wemeetagain, maybe you have some insights as to what could have been going on here?

@oed oed added the need/triage Needs initial labeling and prioritization label Dec 2, 2021
@lidel lidel added the need/analysis Needs further analysis before proceeding label Dec 3, 2021
@lidel
Member

lidel commented Dec 6, 2021

@wemeetagain hi, I am not familiar with the js-libp2p-pubsub side of things and would appreciate some help in pointing this towards maintainers of that part of the stack: do you think this should be moved and tracked under https://github.com/ChainSafe/js-libp2p-pubsub or https://github.com/ChainSafe/js-libp2p-gossipsub?

@wemeetagain
Member

do you think this should be moved

I'm thinking it should be under gossipsub.

Looking at gossipsub, I don't see any 6 hr timeouts by default.
The seenCache is what filters out duplicate messages, and by default, the TTL for cache entries is 30 seconds.

A few mitigations you might try:

  • set the TTL higher (see the sketch after the validator example below)
  • add topic validation to filter out old messages, optionally penalizing the peer that's causing trouble
import { ERR_TOPIC_VALIDATOR_REJECT, ERR_TOPIC_VALIDATOR_IGNORE } from "libp2p-gossipsub/src/constants"

// `decode` and OLD_MSG_THRESHOLD are application-specific: however you parse
// your message payload and whatever message age you consider too old.
gossipsub.topicValidators.set(topic, (topic, msg) => {
  const decoded = decode(msg.data);
  if (Date.now() - decoded.timestamp > OLD_MSG_THRESHOLD) {
    const err = new Error("Message timestamp too old")
    err.code = ERR_TOPIC_VALIDATOR_REJECT // downscore offending peer
    // or
    // err.code = ERR_TOPIC_VALIDATOR_IGNORE // don't downscore offending peer
    throw err
  }
})
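
For the first suggestion, a sketch of raising the seen-cache TTL. This assumes a libp2p-gossipsub release whose constructor accepts a seenTTL option in milliseconds; check the version you run, since older releases hard-code the 30 second value. `libp2p` here is your existing libp2p instance.

import Gossipsub from 'libp2p-gossipsub'

// Keep message IDs in the seen cache for 2 minutes instead of the 30 s default.
// The seenTTL option is assumed to exist in the installed gossipsub version.
const pubsub = new Gossipsub(libp2p, {
  seenTTL: 120 * 1000
})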

We've been running for months now without ever seeing anything like this.

My best guess, without knowing any more, is: a new person joined the network with an overloaded computer / very laggy internet connection, such that they received and sent duplicate messages. Happy to help dive deeper on this if you have any more analytics.

@oed
Author

oed commented Dec 7, 2021

set the TTL higher

This seems like it would likely limit the impact, but not prevent the issue.

add topic validation to filter out old messages, optionally penalizing the peer that's causing trouble

This is interesting, but we don't have timestamps in all of our messages so it wouldn't work universally.

My best guess, without knowing any more is: a new person joined the network with an overloaded computer / very laggy internet such that they received and send duplicate messages.

Could this really explain messages being observed over 100 times?

@wemeetagain
Member

Could this really explain messages being observed over 100 times?

If messages took longer than 30 seconds to travel to and from the laggy peer, then it seems like it could.

Then any message published to that peer would be rebroadcast to the network. And if that message is republished to another peer which then republishes it back to the laggy peer, the message could be rebroadcast yet again.

But then again, 30 seconds is ... a long time.

@stbrody
Contributor

stbrody commented Jan 20, 2022

Just wanted to let you all know we've experienced this two more times in the last week. We shut down all the nodes that we operate on our network at the same time, then brought them back online, and that seemed to resolve things. We're going to look into increasing the seenCache TTL next and see if that helps.

EDITED: to clarify that we shut down all the nodes that we control, not every single node on the network, as there are nodes outside of our control

@smrz2001

Some additional information - we're seeing tens of thousands of messages like this one when the issue occurs:

libp2p:gossipsub:score unexpected delivery: message from QmWT2AgJwdxxxxxxxxxxxxxxxxxxxxxxxxxfUox52T5C1K was first seen 58051 ago and has delivery status NaN

This is not one of our nodes.

@lidel
Member

lidel commented Jan 28, 2022

@raulk @vyzo any ideas on what data we should gather / look at to get closer to the cause here?

@vyzo

vyzo commented Jan 28, 2022

A possible scenario where this can occur is when a message is resent outside the timecache/seen window.
Were there some very slow receivers involved?

The way to avoid this problem in general is to either increase the timecache window or to introduce a validator that can persist seen message ids to disk.
It might be worthwhile plugging the backend for this into pubsub directly.

When there is a blockchain involved, there is natural protection from this happening by virtue of state maintenance.

@vyzo

vyzo commented Jan 28, 2022

note that the anomalous behaviour may well be a replay attack.

@vyzo

vyzo commented Jan 28, 2022

Also note that the disk-persistent data structure should only be used after validation, to avoid DoS vectors that clog it with invalid messages; this is why it is usually done in the validator.
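
A sketch of that validator-level persistence (seenStore is a placeholder for any durable key-value store with has/put, the validator is assumed to be allowed to be async, and ERR_TOPIC_VALIDATOR_IGNORE is the constant imported in the earlier example):

// Check and persist message IDs from inside the topic validator. If you also
// validate the payload (as in the earlier example), do that first so invalid
// messages never reach the store and can't clog it.
gossipsub.topicValidators.set(topic, async (topic, msg) => {
  const id = Buffer.from(msg.seqno).toString('hex') // or your own message-id function
  if (await seenStore.has(id)) {
    const err = new Error('Message already seen')
    err.code = ERR_TOPIC_VALIDATOR_IGNORE // drop it without penalizing the sender
    throw err
  }
  await seenStore.put(id, Date.now())
})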

@stbrody
Contributor

stbrody commented Jan 31, 2022

Were there some very slow receivers involved?

I have a half-baked theory here. We know that js-ipfs has a memory leak - we consistently see memory usage grow over time on all our nodes, and we've had reports of similar behavior from community members for the nodes they run. My theory is that a node on the network is close to running out of memory and is spending longer and longer on each garbage collection cycle. This causes CPU usage to spike and the node to slow down processing messages, until it starts taking longer than 30 seconds (the default seen-cache TTL) to process messages, kicking off this flood.

This is still only a theory; none of our nodes that we have visibility into were at high memory when the last flood happened. It's possible that this is what happened on one of the nodes we don't run or have visibility into, but we can't know for sure.

One flaw in this theory is that if this were the cause, I would expect that node to crash from running out of memory pretty soon after hitting that severe a level of performance degradation, but we saw one of these floods last 6 hours, which would be a long time to be in such a degraded state without crashing. So it's not a perfect theory, but it still seems at least plausible.

@lidel
Member

lidel commented Feb 4, 2022

Unless anyone comes up with a repro or a fix, we need to close this – we are lacking an actionable next step here.

@oed
Author

oed commented Feb 4, 2022 via email

@BigLep
Contributor

BigLep commented Feb 11, 2022

TODO: we'll create an issue in js-ipfs to increase the default to 2 min

@lidel
Member

lidel commented Mar 4, 2022

@oed I am afraid I don't have the full context here, mind closing this issue and filing one in https://github.com/ipfs/js-ipfs about changing the default?

@stbrody
Contributor

stbrody commented Mar 11, 2022

I opened up a PR on the js gossipsub implementation: ChainSafe/js-libp2p-gossipsub#200
