Measuring the Effectiveness of Gossipsub in Filecoin - Call for Participation #11118
Replies: 3 comments
-
Happy to support this. I am not finding any information about proper configuration of
-
@stuberman the tracer changes are in the Lotus master branch but have not yet been included in a tagged release. We are expecting this to happen in the upcoming release of Lotus. For reference, the config settings are in the master branch here.
-
As we were doing some final testing before the release, we discovered that enabling tracing impacts the Lotus miner node's syncing performance. The team is debugging here - we will update this thread when the 🔴 is clear!
-
TL;DR
The ProbeLab team at Protocol Labs is running a measurement project to understand the behaviour, health and dynamics of the Filecoin PubSub layer (i.e., the Gossipsub protocol). The questions that are critical to our study include: how long does it take for blocks and messages to propagate throughout the network of Storage Providers (SPs)? Are misbehaving nodes (if any) excluded from the Gossipsub mesh as per the protocol’s score function?
In order to achieve our goal, we are calling on Filecoin Storage Providers (SPs) and Lotus Node Operators (LNOs) to opt in and provide traces from their Lotus client nodes, which we will then analyse.
Context & Purpose
Gossipsub is one of the most central protocols in the Filecoin blockchain: it is the main protocol that propagates blocks and transaction messages. It uses a number of different techniques to propagate messages, while making sure that the network stays in a healthy state (e.g., nodes discard invalid messages and malicious nodes are progressively excluded from the network). Verifying the correct operation of the protocol in the pubsub network is of utmost importance and a task that is long overdue.
The goal of this project is to measure the performance of the Gossipsub protocol in the Filecoin network. We want to answer questions such as: “How long does it take for messages to propagate throughout the network?”, “Do nodes in different geographic locations observe highly variable latencies?”, “Are there clusters of SPs or Lotus nodes forming as a result of geographic distance between the nodes, and how does this affect block propagation (and leader election)?”. A better understanding of these aspects can help SPs and LNOs apply better configurations for their node settings, e.g., `PropagationDelaySecs`.
What data is the pubsub tracer collecting?
Lotus dev teams and the community have been developing tracer software for Gossipsub for some time now. The first instantiation of a Gossipsub tracer came with #7398. The tracer in this PR collects:
In order to find the broadcast latency of messages we need traces of `RPC_recv` events, and hence we have extended the pubsub tracer of #7398 with extra functionality and support for pushing traces to Elastic Search. The details are given in this PR: #10405.
It is also important to highlight that the tracer sees and collects the `PeerIDs` of nodes. However, `PeerIDs` are not linked to individual SPs or LNOs, unless SPs/LNOs want to claim their `PeerID(s)`. The tracer does not collect IP addresses of Storage Providers or Lotus nodes. IP addresses are seen by the tracecatcher through the HTTP request made to submit each batch of traces, but we do not record or keep that IP address. If SPs/LNOs want to hide the IP from which they submit traces, they can trivially proxy through a different machine. The data collected is all necessary in order to verify the correct operation of the protocol. No data is gathered to identify specific node operators, unless we identify unexpected behaviour or potentially (intentionally or unintentionally) misbehaving nodes.
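For readers who want to see how such traces surface at the code level, below is a minimal, self-contained sketch of a custom tracer built on go-libp2p-pubsub’s `EventTracer` interface (the `RPC_recv` traces mentioned above correspond to `RECV_RPC` trace events there). This is not the Lotus tracer from #7398/#10405; the `esTracer` type and its channel-based hand-off are illustrative assumptions, shown only to clarify what kind of events the real tracer records and forwards to Elastic Search.

```go
package main

import (
	"context"
	"log"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
	pb "github.com/libp2p/go-libp2p-pubsub/pb"
)

// esTracer is a toy EventTracer: it receives every pubsub trace event and
// could batch/forward them to a remote collector (e.g. an Elastic Search node).
// It is NOT the Lotus implementation; it only illustrates the mechanism.
type esTracer struct {
	events chan *pb.TraceEvent
}

// Trace implements pubsub.EventTracer; go-libp2p-pubsub calls it for every
// protocol event (RECV_RPC, SEND_RPC, DELIVER_MESSAGE, GRAFT, PRUNE, ...).
func (t *esTracer) Trace(evt *pb.TraceEvent) {
	select {
	case t.events <- evt: // hand off to a batching/submission goroutine
	default: // never block the pubsub hot path
	}
}

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}

	tr := &esTracer{events: make(chan *pb.TraceEvent, 1024)}

	// Attach the tracer to the gossipsub router.
	if _, err := pubsub.NewGossipSub(ctx, h, pubsub.WithEventTracer(tr)); err != nil {
		log.Fatal(err)
	}

	// Consume events; RECV_RPC entries carry the peer and timestamp needed
	// to compute broadcast latency across peers.
	for evt := range tr.events {
		if evt.GetType() == pb.TraceEvent_RECV_RPC {
			log.Printf("recv_rpc from %x at %d", evt.GetPeerID(), evt.GetTimestamp())
		}
	}
}
```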
Why should Storage Providers and Lotus Node Operators opt-in and participate?
Lotus devs, the Filecoin community and Storage Providers are all contributing to the Filecoin network for the wider benefit of decentralized storage. Just as thorough monitoring of the chain itself matters, monitoring the network layer of Filecoin is equally important.
By opting in and participating in the measurement study, Storage Providers will help verify that the network is healthy, that no misbehaving nodes are causing problems to the network, that blocks are propagating as expected, and that misbehaving nodes are excluded from the Gossipsub mesh (and, by extension, the Filecoin network). By digging into the details, we will also be able to identify potential improvements to the Gossipsub protocol for faster and more resource-efficient operation.
How can a Storage Provider/Lotus Node opt-in?
To participate in the study, nodes must use the latest version of Lotus in the `master` branch, at least until commit [aef2ab63] is included in the next release (1.23.3, as per the LOTUS ROADMAP). The commit includes the latest upgrades made to the pre-existing [Lotus Tracer](https://github.com/filecoin-project/lotus/pull/7398).
However, since the default configuration comes with the remote tracer disabled, SPs will have to manually configure the config.toml to submit the traces to the Elastic Search node which has been set up by the ProbeLab team at PL (see https://probelab.io for more details).
To be more precise, two fields under the `[PubSub]` section must be updated in the `config.toml` (which is generally found at `~/.lotus/config.toml`):
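The snippet below is a minimal sketch, not the authoritative configuration: `ElasticSearchIndex` is the field named in this post, while `ElasticSearchTracer` and the placeholder values are assumptions based on PR #10405; the actual endpoint and index to use will be shared as part of the study.

```toml
[PubSub]
  # Placeholder values for illustration only; the real Elastic Search
  # endpoint and index are shared by the ProbeLab team.
  # The field name ElasticSearchTracer is assumed from PR #10405.
  ElasticSearchTracer = "https://<ES-endpoint-shared-by-ProbeLab>"
  ElasticSearchIndex = "<index-shared-by-ProbeLab>"
```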
No user, password, or dedicated `ElasticSearchIndex` will be provided to authenticate SPs on the trace submission, ensuring there is no possible link between the SP and the `PeerID` (unless the SP wants to share it).
What is the overhead of participating?
Collecting and sharing extra data inevitably comes with a cost. We have made all the development optimizations needed to keep the overhead at a minimum. Here is a breakdown of the data we have seen from our own Lotus node:
CPU
We haven’t seen an increase in the CPU requirement for running the tracer. At times the CPU utilization increased by 2.5% (2023-03-12 in the plot), but this has not been constant.
Fig: CPU Utilization - Red vertical line denotes the deployment of the tracer on our Lotus node.
Disk Usage
We haven’t seen any extra disk usage beyond that required by Lotus itself (i.e., ~4GB/hr), as shown below, where the red vertical line denotes the deployment of the tracer on our Lotus node. For more details, see the Github PR.
Fig: Disk Usage - Red vertical line denotes the deployment of the tracer on our Lotus node.
Fig: Disk Usage Increment - Red vertical line denotes the deployment of the tracer on our Lotus node.
Memory
Similarly to disk usage, we haven’t seen any extra memory requirement after deploying the tracer on our node. The red vertical line denotes the deployment of the tracer on our Lotus node, and the rate of memory growth does not get any steeper after it.
Fig: Memory Usage - Red vertical line denotes the deployment of the tracer on our Lotus node.
Bandwidth
Lotus clients should expect up to a 2x increase in both incoming and outgoing bandwidth. In more detail:
Incoming Bandwidth: 1.5x-2x traffic increment (~150MB/h) when enabling the remote trace submission to the ES instance. We have seen the same network increment as “Outgoing Bandwidth” at the ES machine. Note the initial spike (before 2023-03-08) originated from the lotus node syncing the chain.
Fig: Lotus Incoming Bandwidth Requirement - Red vertical line denotes the deployment of the tracer on our Lotus node.
Outgoing Bandwidth: 2x increase in sent bytes (350MB/h) after enabling the remote gossip trace submission to ES. This load is generated by formatting Gossipsub traces as JSON payloads and submitting them to a remote Elastic Search instance.
Fig: Lotus Outgoing Bandwidth Requirement - Red vertical line denotes the deployment of the tracer on our Lotus node.
More details on all of the above results are given here: #10405 (comment)
What’s next?
In order to get an accurate view of the network dynamics, we expect to collect data for at least one month initially. Should the need arise, we will follow up with any modifications needed to the tracer software and include them in a future Lotus release, together with comms to the community.
We will analyse the collected data and produce a report for the following metrics: message propagation latency, gossipsub mesh stability, gossip effectiveness, score function behaviour and network topology. We expect the report to be ready 1 month after the collection month.
According to the current plan, the rough timeline is as follows:
Note that the plan will need to shift if it takes more time to reach the 10-20% of nodes needed to collect accurate results. The collection month and results analysis will shift accordingly.
You can check the progress and follow developments related to this project here: 🖥️ Gossipsub Measurement in the Filecoin Network
Who is running the experiment?
The development of the majority of the tracer software, as well as the experiment itself, is being run by the PL ProbeLab team. The ProbeLab team primarily focuses on measurements and optimization of the IPFS network and its underlying libp2p stack. You can find more information about ProbeLab’s projects on this Notion page, similar studies carried out by the team in this Github repository: https://github.com/plprobelab/network-measurements/tree/master/results, and results reported by the team on the IPFS network’s performance at https://probelab.io.
Contact points of the ProbeLab team: