Measuring the Effectiveness of Gossipsub in Filecoin - Call for Participation #11118

yiannisbot · 2023-07-31T09:27:22Z

yiannisbot
Jul 31, 2023

TL;DR

The ProbeLab team at Protocol Labs is running a measurement project to understand the behaviour, health and dynamics of the Filecoin PubSub layer (i.e., the Gossipsub protocol). The metrics that are critical to our study include: how long does it take for blocks and messages to propagate throughout the network of Storage Providers (SPs)? Are misbehaving nodes (if any) excluded from the Gossipsub mesh as per the protocol’s score function?

In order to achieve our goal, we are calling Filecoin Storage Providers (SPs) and any Lotus Node Operator (LNO) to opt-in and provide traces from their Lotus client nodes, which we will then analyse.

Our tooling collects data automatically as long as the SP/LNO opts-in through the right config option (see further down for details).
No further action or special set up is needed by the SP/LNO.
There is no extra memory or disk usage requirement.
There is an extra 2.5% CPU requirement at times and a 2x increase in incoming and outgoing bandwidth requirement.
We need at least 10% of nodes that participate in the Filecoin network to opt-in to the experiment, in order to be able to extract accurate results.

Context & Purpose

Gossipsub is one of the most central protocols in the Filecoin blockchain. It is the main protocol that transfers blocks and transaction messages in the Filecoin blockchain. It uses a number of different techniques to propagate messages, while making sure that the network stays in a healthy state (e.g., nodes discard invalid messages and malicious nodes are progressively excluded from the network). Verifying the correct operation of the protocol in the pubsub network is of utmost importance and a task that is way overdue.

The goal of this project is to measure the performance of the Gossipsub protocol in the Filecoin network. We want to answer questions such as: “How long does it take for messages to propagate throughout the network?”, “Do nodes in different geographic locations observe highly variable latencies?”, “Are there clusters of SP, or Lotus nodes being formed as a result of geographic distance between the nodes and how does this affect block propagation (and leader election?)?”. Having better understanding of these aspects can help SPs and LNOs to apply better configurations for their node settings, e.g., PropagationDelaySecs.

What data is the pubsub tracer collecting?

Lotus dev teams and the community has been developing tracer software for Gossipsub for some time now. The first instantiation of a Gossipsub tracer came with #7398. The tracer in this PR collects:

Join/leave GossipSub topics messages
Graft/Prune messages of peers per topic
GossipSub PeerScores

In order to find the broadcast latency of messages we need traces of RPC_recv and hence we have extended the pubsub tracer at: #7398 with extra functionality and support for pushing traces to Elastic Search. The details are given in this PR: #10405.

It is also important to highlight that the tracer sees and collects PeerIDs of nodes. However, PeerIDs are not linked to individual SPs or LNOs, unless SPs/LNOs want to claim their PeerID(s). The tracer does not collect IP addresses of Storage Providers or Lotus nodes. IP Addresses are seen by the tracecatcher through the HTTP request made to submit each batch of traces, but we do not record or keep that IP Address. If SPs/LNOs want to hide the IP from which they submit traces, they can trivially proxy through a different machine.

The data collected is all necessary in order to identify correct operation of the protocol. No data is gathered to identify specific node operators, unless we identify unexpected behaviour or potentially (intentionally or unintentionally) misbehaving nodes.

Why should Storage Providers and Lotus Node Operators opt-in and participate?

Lotus devs, the Filecoin community and Storage Providers are all contributing to the Filecoin network for the wider benefit of decentralized storage. Similarly to thorough monitoring of the chain itself, the network layer of Filecoin is equally important.

By opting-in and participating in the measurement study, Storage Providers will help verify that the network is healthy, no misbehaving nodes are causing problems to the network, blocks are propagating as they are expected to and misbehaving nodes are excluded from the Gossipsub mesh (and the Filecoin network as an extension). By digging into the details, we will also be able to identify potential improvements to the Gossipsub protocol for faster and more resource efficient operation.

🎯 NOTE: It is important to highlight that in order to get accurate results out of the collected traces, we need at least 10% of the Lotus nodes that participate in the Filecoin network to opt-in and participate in the experiment.

How can a Storage Provider/Lotus Node opt-in?

To participate in the study, nodes must use the latest version of Lotus in the master branch. At least until commit [aef2ab63] is included in the next release (1.23.3, as per LOTUS ROADMAP ). The commit includes the latest upgrades done in the pre-existing [Lotus Tracer](https://github.com/filecoin-project/lotus/pull/7398) .

However, since the default configuration comes without the remote tracer being enabled, SPs will have to manually configure the config.toml to submit the traces to the Elastic Search node which has been set up by the ProbeLab team at PL (see: https://probelab.io for more details).

To be more precise, two fields under the [PubSub] section must be updated in the config.toml (which is generally under ~/.lotus/config.toml):

  # Connection string for elasticsearch instance.
  # If present tracer will save data to elasticsearch.
  # Format: https://<username>:<password>@<elasticsearch_url>:<port>/
  #
  # type: string
  # env var: LOTUS_PUBSUB_ELASTICSEARCHTRACER
  ElasticSearchTracer = "http://<ip-url>:5151/"

  # Name of elasticsearch index that will be used to save tracer data.
  # This property is used only if ElasticSearchTracer propery is set.
  #
  # type: string
  # env var: LOTUS_PUBSUB_ELASTICSEARCHINDEX
  ElasticSearchIndex = "traces"

No user, password, or dedicated ElasticSearchIndex will be provided to authenticate SPs on the trace submission, ensuring no possible link between the SP and the PeerID (unless the SP wants to share it).

What is the overhead of participating?

Collecting and sharing extra data inevitably comes with a cost. We have made all the development optimizations in order to keep overhead at a minimum. Here is a breakdown of the data we have seen from our Lotus node:

CPU

We haven’t seen an increase in the CPU requirement for running the tracer. At times the CPU utilization increased by 2.5% (2023-03-12 in the plot), but this has not been constant.

Fig: CPU Utilization - Red vertical line denotes the deployment of the tracer on our Lotus node.

Disk Usage

We haven’t seen extra disk usage requirement, other than that required by Lotus (i.e., ~4GBs/hr), as shown below, where the red vertical line denotes the deployment of the tracer on our Lotus node. For more details, see the Github PR.

Fig: Disk Usage - Red vertical line denotes the deployment of the tracer on our Lotus node.

Fig: Disk Usage Increment - Red vertical line denotes the deployment of the tracer on our Lotus node.

Memory

Similarly to disk usage, we haven’t seen any extra memory requirement, after deploying the tracer on our node. The red vertical line denotes the deployment of the tracer on our Lotus node and the increment coefficient doesn’t get any steeper than before the vertical red line.

Fig: Memory Usage - Red vertical line denotes the deployment of the tracer on our Lotus node.

Bandwidth

Lotus clients should expect a 2x increment of both incoming and outgoing bandwidth. In more detail:

Incoming Bandwidth: 1.5x-2x traffic increment (~150MB/h) when enabling the remote trace submission to the ES instance. We have seen the same network increment as “Outgoing Bandwidth” at the ES machine. Note the initial spike (before 2023-03-08) originated from the lotus node syncing the chain.

Fig: Lotus Incoming Bandwidth Requirement - Red vertical line denotes the deployment of the tracer on our Lotus node.

Outgoing Bandwidth: 2x increase in sent bytes (350MB/h) after enabling the remote gossip traces submission to ES. This load is generated from formatting GossipSub traces in json packages and submitting them to a remote Elastic-Search instance.

Fig: Lotus Outgoing Bandwidth Requirement - Red vertical line denotes the deployment of the tracer on our Lotus node.

More details on all of the above results are given here: #10405 (comment)

What’s next?

In order to be able to get an accurate view of the network dynamics, we expect to collect data for at least 1 month in the first place. Should the need arise, we will follow up with any modifications needed to the tracer software and include in a future Lotus release, together with comms to the community.

We will analyse the collected data and produce a report for the following metrics: message propagation latency, gossipsub mesh stability, gossip effectiveness, score function behaviour and network topology. We expect the report to be ready 1 month after the collection month.

According to the current plan, the rough timeline is as follows:

September 2023: Tracing software is enabled in ~10-20% of Lotus nodes.
Sept-Oct 2023: Trace collection month
Oct-Nov 2023: Results analysis and report writing

Note that the plan will need to shift if it takes more time to reach the 10-20% of nodes needed to collect accurate results. The collection month and results analysis will shift accordingly.

You can check the progress and follow developments related to this project here: 🖥️ Gossipsub Measurement in the Filecoin Network

Who is running the experiment?

The development of the majority of the tracer software, as well as the experiment itself, is being ran by the PL ProbeLab team. The ProbeLab team primarily focuses on measurements and optimization of the IPFS network and its underlying libp2p stack. You can find more information about ProbeLab’s projects in this Notion page, similar studies carried out by the team in this Github repository: https://github.com/plprobelab/network-measurements/tree/master/results and results reported by the team on the IPFS network’s performance at: https://probelab.io.

Contact points of the ProbeLab team:

FIL Slack (or IPFS Discord bridged channel): #probe-lab, or #gossipsub-measurements for a channel dedicated to this project
email: probelab@protocol.ai

stuberman · 2023-07-31T12:09:10Z

stuberman
Jul 31, 2023

Happy to support this. I am not finding any information about proper configuration of ElasticSearchTracer = and the probelab.io link does not offer any examples or details that I can find.

0 replies

iand · 2023-07-31T13:00:06Z

iand
Jul 31, 2023

@stuberman the tracer changes are in Lotus master branch but have not yet been included in a tagged release. We are expecting this to happen in the upcoming release of Lotus. For reference the config settings are in the master branch here.

0 replies

jennijuju · 2023-08-02T02:33:42Z

jennijuju
Aug 2, 2023
Maintainer

As we are doing some final testing before the release is out, we discovered that enabling tracing is impacting lotus miner node's syncing performance. The team is debugging here - we will update this thread when the 🔴 is clear!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Measuring the Effectiveness of Gossipsub in Filecoin - Call for Participation #11118

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Measuring the Effectiveness of Gossipsub in Filecoin - Call for Participation #11118

yiannisbot Jul 31, 2023

TL;DR

Context & Purpose

What data is the pubsub tracer collecting?

Why should Storage Providers and Lotus Node Operators opt-in and participate?

How can a Storage Provider/Lotus Node opt-in?

What is the overhead of participating?

CPU

Disk Usage

Memory

Bandwidth

What’s next?

Who is running the experiment?

Replies: 3 comments

stuberman Jul 31, 2023

iand Jul 31, 2023

jennijuju Aug 2, 2023 Maintainer

yiannisbot
Jul 31, 2023

stuberman
Jul 31, 2023

iand
Jul 31, 2023

jennijuju
Aug 2, 2023
Maintainer