
Some gossip never seems to sync in laboratory setup #6531

Closed
joostjager opened this issue May 12, 2022 · 8 comments · Fixed by #7239
Labels
bug Unintended code behaviour gossip graph p2p Code related to the peer-to-peer behaviour

Comments

@joostjager
Contributor

joostjager commented May 12, 2022

Background

I am experimenting with a pathfinding benchmark setup that spins up a bunch of nodes inside docker: https://github.com/bottlepay/pathfinding-benchmark/tree/lnd-6531

One essential step before starting the test is to make sure that all channels are opened and the latest channel policies are gossiped to the test node that will make the test payments.

To detect whether the test node is fully synced, I set the default lnd base fee to 999 everywhere. What I'd want to see in the DescribeGraph output of the test node is every channel represented with policies that don't have a 999 base fee. (Unfortunately it seems impossible to override the policy directly when a channel is created with lnd, meaning that an extra channel_update is always needed.)
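For reference, this is roughly how that readiness check could look against the standard lnrpc API (a minimal sketch; the function name, the client variable, and the error handling are illustrative assumptions, not the actual test-runner code):

    package gossipcheck

    import (
        "context"

        "github.com/lightningnetwork/lnd/lnrpc"
    )

    // countUpdatedPolicies counts the channel policies in the node's graph
    // view whose base fee differs from the 999 sentinel default. The test
    // node is considered fully synced once this reaches the expected total
    // (94 for this test graph).
    func countUpdatedPolicies(ctx context.Context, client lnrpc.LightningClient) (int, error) {
        graph, err := client.DescribeGraph(ctx, &lnrpc.ChannelGraphRequest{})
        if err != nil {
            return 0, err
        }

        count := 0
        for _, edge := range graph.Edges {
            policies := []*lnrpc.RoutingPolicy{edge.Node1Policy, edge.Node2Policy}
            for _, policy := range policies {
                // A nil policy means no channel_update has been received
                // for that direction yet.
                if policy != nil && policy.FeeBaseMsat != 999 {
                    count++
                }
            }
        }
        return count, nil
    }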

Your environment

  • lnd v0.14.3-beta
  • bitcoind

Steps to reproduce

Run run.sh

Note: make sure that there's sufficient memory, otherwise you may run into the death loop that is described here: #6210 (comment)

Expected behaviour

At some point during the setup stage, the test runner waits for all channel policies to be received. In the test graph used, there should be 94 non-999-base-fee policies. Only then does it proceed with making the pathfinding test payments.

Actual behaviour

The test runner gets stuck on

testrunner_1               | 2022-05-12T14:24:12.807Z	DEBUG	Gossiped edges	{"count": 90, "expected": 94}

It keeps waiting for the last 4 (the number varies) up-to-date policies, but never receives them.

To try to get propagation working, I'm using the following lnd.conf settings:

trickledelay=100
historicalsyncinterval=10s

The test code also waits two minutes before changing the channel policies from the default, to avoid hitting the gossip rate limit.
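For context, the delayed policy change itself can be issued through the standard lnrpc API along these lines (a sketch under assumptions: the two-minute constant, the concrete fee values, and the client variable are illustrative, not the benchmark's actual code):

    package gossipcheck

    import (
        "context"
        "time"

        "github.com/lightningnetwork/lnd/lnrpc"
    )

    // updatePolicies waits out the gossip rate limit and then pushes a new
    // policy for all of the node's channels, which produces the
    // channel_update messages the test runner is waiting for.
    func updatePolicies(ctx context.Context, client lnrpc.LightningClient) error {
        // Waiting here avoids hitting the gossiper's rate limit for
        // updates sent shortly after channel creation.
        time.Sleep(2 * time.Minute)

        _, err := client.UpdateChannelPolicy(ctx, &lnrpc.PolicyUpdateRequest{
            // Apply the new policy to every channel of this node.
            Scope:         &lnrpc.PolicyUpdateRequest_Global{Global: true},
            BaseFeeMsat:   1000, // anything other than the 999 default
            FeeRate:       0.000001,
            TimeLockDelta: 40,
        })
        return err
    }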

The repro isn't 100% reliable; sometimes the graph does get fully synced.

I've occasionally encountered propagation issues on mainnet too. Of course, there could be many lnd and non-lnd related reasons for that, but perhaps this setup could help uncover a latent bug somewhere in the gossip chain.

@Roasbeef Roasbeef added bug Unintended code behaviour p2p Code related to the peer-to-peer behaviour gossip labels May 12, 2022
@joostjager
Contributor Author

joostjager commented May 13, 2022

Some things that I've found out so far:

  • The missing channel updates never even left the node that set the policy. The problem isn't propagation across the network, but the update never being broadcast in the first place.
  • The reason it isn't broadcast in AuthenticatedGossiper.sendBatch is that the active syncers don't have an update horizon set (the if g.remoteUpdateHorizon == nil check).
  • Looking at the logs, GossipTimestampRange, which should set the horizon, is sent and received, but only later. In other cases it doesn't seem to be sent at all.

Is there a race condition with a peer being a syncer but without a horizon yet?

I am not familiar enough with this sub-system to determine what this means exactly, but perhaps it gives a clue?
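To make the suspected race easier to see, here is a self-contained toy model of the behaviour described above (simplified stand-in types, not lnd's actual gossiper/syncer code): a peer can already be tracked as an active syncer before its GossipTimestampRange arrives, so the filter silently drops the message for that peer, while the broadcast logic still records the peer as handled and never retries.

    package main

    import "fmt"

    // horizon stands in for the update horizon a peer installs by sending
    // GossipTimestampRange; nil means no filter has been received yet.
    type horizon struct {
        firstTimestamp uint32
        timestampRange uint32
    }

    type syncer struct {
        peer                string
        remoteUpdateHorizon *horizon
    }

    // filterGossipMsg models the early return referenced above: without a
    // remote update horizon, nothing is sent to this peer.
    func (s *syncer) filterGossipMsg(msg string) bool {
        if s.remoteUpdateHorizon == nil {
            return false // message dropped for this peer
        }
        return true // message would be sent
    }

    func main() {
        // Peer A has already sent GossipTimestampRange, peer B has not
        // (yet), but both are already tracked as active syncers.
        syncers := []*syncer{
            {peer: "A", remoteUpdateHorizon: &horizon{0, ^uint32(0)}},
            {peer: "B"}, // the race: horizon still nil at broadcast time
        }

        for _, s := range syncers {
            sent := s.filterGossipMsg("channel_update")
            // Either way the peer ends up in the "already handled" set, so
            // the dropped update is never re-broadcast to peer B.
            fmt.Printf("peer %s: sent=%v, marked as handled\n", s.peer, sent)
        }
    }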

@joostjager
Contributor Author

@saubyk is this issue worth prioritizing? There are quite a few open issues/bugs around gossip that may be hard to reproduce. But this one is easy to repro reliably.

I know from experience that this bug has a big impact. Without a patch applied, I couldn't get my node announcement to broadcast at all.

@saubyk
Collaborator

saubyk commented Dec 6, 2022

@joostjager yes, we are prioritizing the fix for the gossip propagation issue. I believe #7186 is the PR for this.

@joostjager
Contributor Author

It seems that one is not the fix: #7186 (comment)

@yyforyongyu
Member

Nope it's not. Meanwhile what's the topology looking like? I'll try to run the experiment to see what went wrong.

@joostjager
Contributor Author

joostjager commented Dec 6, 2022

The topology is defined here. There are a lot of nodes, and part of the test setup is to verify that the designated sender node has received all channel updates.

For this test, I use a modified lnd version: https://github.com/bottlepay/lnd/commits/pathfinding-benchmark-mod-mpp. Docker Compose automatically pulls in this branch and builds it.

It contains three changes:

  • Reduced scrypt parameters to lower the memory footprint
  • Reduced cache pre-allocation to lower the memory footprint
  • The gossip patch.

Without the patch, the sender node isn't going to get a full picture of the graph.

@Roasbeef
Member

Roasbeef commented Dec 7, 2022

Can you try out #7239 in your test bed?

Roasbeef added a commit to Roasbeef/lnd that referenced this issue Dec 7, 2022
In this commit, we modify our gossip broadcast logic to ensure that we will always send out our own gossip messages regardless of the filtering/feature policies of the peer.

Before this commit, it was possible that when we went to broadcast an announcement, none of our peers actually had us as a syncer peer (lnd terminology). In this case, the FilterGossipMsg function wouldn't do anything, as they don't have an active timestamp filter set. When we then went to merge the syncer map, we'd add all these peers we didn't send to, meaning we would skip them when it came time to broadcast.

In this commit, we now split things into two phases: we'll broadcast
_our_ own announcements to all our peers, but then do the normal
filtering and chunking for the announcements we got from a remote peer.

Fixes lightningnetwork#6531
Fixes lightningnetwork#7223
Fixes lightningnetwork#7073
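For readers following along, a rough sketch of the two-phase split the commit message describes (simplified stand-in types, not the actual lnd implementation): locally originated announcements bypass the per-peer timestamp filter entirely, while remotely learned announcements keep the existing filter-and-chunk path.

    package gossipsketch

    // msg, gossipSyncer, and peer are simplified stand-ins used only to
    // illustrate the control flow of the fix.
    type msg struct{ name string }

    type gossipSyncer struct{ hasHorizon bool }

    // filterMsgs models the pre-existing path: nothing is relayed until the
    // peer has installed an update horizon via GossipTimestampRange.
    func (s *gossipSyncer) filterMsgs(msgs []msg) []msg {
        if !s.hasHorizon {
            return nil
        }
        return msgs
    }

    type peer struct {
        syncer gossipSyncer
        sent   []msg
    }

    func (p *peer) send(m msg) { p.sent = append(p.sent, m) }

    // broadcast splits announcements by origin. Our own announcements go to
    // every peer unconditionally; announcements learned from remote peers
    // still go through the per-peer filter (and, in lnd, chunked sending).
    func broadcast(local, remote []msg, peers []*peer) {
        // Phase 1: our own announcements bypass the timestamp filter, so a
        // peer without a horizon can no longer swallow them.
        for _, p := range peers {
            for _, m := range local {
                p.send(m)
            }
        }

        // Phase 2: remote announcements keep the old filtering behaviour.
        for _, p := range peers {
            for _, m := range p.syncer.filterMsgs(remote) {
                p.send(m)
            }
        }
    }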
@joostjager
Contributor Author

@Roasbeef I ran it three times to be sure, and the problem seems to be solved with #7239

Roasbeef added a commit to Roasbeef/lnd that referenced this issue Dec 15, 2022