Libp2p doesn't discover and dial peers but the ones provided in the invitation code #1982
Comments
@kingalg is this in a new community or when joining our own Quiet community? (I think it's in a new community, but just confirming.) @siepra can you work with Kinga to make sure you can reproduce this? I think the most important thing is that you're able to reproduce whatever problem(s) Kinga is seeing. And can you switch to working on this at the next convenient stopping point? It's not urgent until the rest of the sprint is close to completion, but it may be good to start investigating it now. |
To me this belongs in the sprint because it's a blocker to the release. Let me know if there's some reason why it is blocked; otherwise let's leave it in the sprint. |
@holmesworcester yes, it belongs in the sprint; it was a misclick on my part. It was tested on new communities. One of them was one day old, had a few hundred messages and about 20 users (most of them disconnected), and the rest were completely new, created only to test this issue. |
One theory to test: check whether the Tor process is not shutting down. |
We decided that we will attempt to reproduce this in the office with kinga on a developer's device where we can see logs. |
This has always been an issue, but it was not visible when we had the registrar because joining users received a list of all peers registered in the community. What can we do?
I'd go for the first option and, if this doesn't work out, fall back to workarounds. |
I think what we want is to use the addresses from CSRs and orbitdb. Libp2p probably makes different assumptions about the context so if we rely only on libp2p's default peer discovery mechanism this will be suboptimal or insufficient for our case. We want to make sure that:
We should also make sure there are tests for these cases so that we don't inadvertently break these properties with future changes. The last case will be hard to make a test for, but perhaps it's something like, if A, B, and C are online in a large community with hundreds of peers offline, and A and B are connected and B and C are connected, A and C should soon be connected. |
Also, solving this problem was part of the initial epic as described in #1340, but it got missed when dividing the epic up into tasks:
We should run through the initial epic as described and make sure we're not missing anything else, perhaps? |
Questions about how this will work:
(For example: we needed to connect to last-seen / often-online peers because otherwise having many peers who joined once and then left would mess up the community, since trying all of them would take too much time.) Were there any other fixes related to discovery? Libp2p didn't consider new peers that joined until we gave libp2p a new list. Peer discovery never worked. Let's do some quick documentation of how this works. Throw some notes in a markdown file in the wiki or a docs directory. |
Things to improve:
|
Some follow-up questions:
Re: question 4, what happens if peer information is in an invitation link but not in the CSRs?
Re: question 6, when someone new joins, does dialing new peers begin as soon as we receive the CSRs? Or do they only dial the peers in the invitation link, even once they replicate CSRs?
Re: question 10, the fact that we do not dial new peers until restart seems like a bug. If we have a node that is always on, it will never restart, so it will never dial new peers. Another problem: if a new user joins by connecting to 1 online peer and then loses that connection, it sounds like they will not dial any new peers.
For regression tests, let's add some! What cases would we like to cover?
|
We dial the peer from the invitation link. We have to connect to someone to even replicate CSRs. Does a temporary lack of a CSR mean that the given peer is suspicious? Or can it mean that we just don't have the CSR yet?
In the current implementation we only directly dial peers from the invitation link. We will dial as soon as we replicate a CSR, but only if we are not already connected to the given peer.
Yes, however if a peer is always online, someone else may dial that peer eventually. I'm thinking maybe we could start dialing all known peers again if we reach 0 connections, and stop as soon as we connect to someone.
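A minimal sketch of the "redial at 0 connections, stop once connected" idea, assuming the caller reports connection count changes and performs the actual dial. `IsolationRedialer`, `onConnectionCount`, and `next` are hypothetical names for illustration, not existing Quiet or libp2p APIs.

```typescript
// Hypothetical helper: queues known peers for redialing only while we
// have zero connections, and drops the queue as soon as we connect.
class IsolationRedialer {
  private queue: string[] = [];
  private readonly getKnownPeers: () => string[];

  constructor(getKnownPeers: () => string[]) {
    this.getKnownPeers = getKnownPeers;
  }

  // Call whenever the number of open connections changes.
  onConnectionCount(count: number): void {
    if (count > 0) {
      this.queue = []; // connected to someone: stop redialing
    } else if (this.queue.length === 0) {
      this.queue = this.getKnownPeers().slice(); // isolated: try everyone again
    }
  }

  // Next address the caller should dial, or undefined if there is nothing to do.
  next(): string | undefined {
    return this.queue.shift();
  }
}
```

The caller would pull addresses with `next()` one at a time, so a successful dial naturally cancels the remaining attempts.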
|
Will the address from the link stick around after, even once someone has replicated many CSRs? Ah, I suppose we don't know when we have replicated all CSRs.
So this is in progress as part of this work, correct? Or will we create a new issue?
We should have a target number of connections, like 6 or 8, and dial peers if we fall under that number.
In at least one case it would be helpful: say there are 100 online peers, but they are all new peers, so a returning peer doesn't know their addresses. If they all dial not-connected peers randomly at some cadence, they would eventually dial the returning peer. If they don't, the returning peer would never connect. I suspect there are other cases when it's helpful too. That said, I think we did learn that there is a performance cost to attempting lots of connections over Tor, so maybe we don't want to overdo it. Another question is: do we know why libp2p's logic for this stuff is not working? It might be that there are subtleties here where we don't want to have to roll our own approach to cover all the cases. |
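The target-count and random-dial ideas above could be combined roughly as follows. This is a sketch under stated assumptions: `pickPeersToDial` is a hypothetical helper, peers are identified by plain address strings, and the function would be called at some cadence chosen to limit the cost of dialing over Tor.

```typescript
// Hypothetical helper: choose which not-connected peers to dial so that
// we move toward a target connection count (for example 6 or 8).
// `random` is injectable so the choice can be made deterministic in tests.
function pickPeersToDial(
  known: string[],
  connected: Set<string>,
  target: number,
  random: () => number = Math.random,
): string[] {
  const missing = target - connected.size;
  if (missing <= 0) return []; // already at or above the target

  const candidates = known.filter((peer) => !connected.has(peer));

  // Fisher-Yates shuffle, so no known peer is systematically skipped.
  for (let i = candidates.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [candidates[i], candidates[j]] = [candidates[j], candidates[i]];
  }

  // Dial only as many peers as we are short of the target.
  return candidates.slice(0, missing);
}
```

Capping each round at the number of missing connections keeps the number of simultaneous Tor dials bounded, at the price of slower recovery when many candidates are offline.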
Yes, this is part of this work
Maybe it's because of our websocketovertor; Bartek mentioned that he noticed we may be lacking some implementation there. |
We have to decide what the DOD is here. If we want to do it properly, we probably have to write our own dialer, especially if we want full control over when we connect to whom, when we drop a connection, and perhaps how often we dial a given peer. It would also give us more control over connections in the future. However, writing our own dialer can be a bigger task. If we want to temporarily fix the known issues with glue, we can do that too. It will be done quicker, but it is not a long-term solution and we will probably have to get rid of it soon anyway. |
Let's do the easiest fixes now. We are on an old version of libp2p. We should return to this after we upgrade, since the problems might be different. |
By "easiest" you mean "only fix basic known issues and do not write a custom dialer"? |
I mean make some decision about what the most valuable/easy fixes are and do those. What do you think they are? |
Let's keep this focused on the problem this issue was created for (libp2p only dials peers in the invitation code) and create separate issues for the other problems, since this problem is new and related to current work, while the others are not. Other issues:
So I think we just need to make sure newly replicated peer info becomes part of our list of peers without restart. Is that right? Do we think this requires writing a custom dialer? |
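One way to picture "newly replicated peer info becomes part of our list without restart" is a small in-memory set that is updated on every replication event and reports which addresses are new. This is only a sketch: `KnownPeers` and its methods are hypothetical names, and how the replication event is delivered is left out.

```typescript
// Hypothetical in-memory peer list that can grow while the app is running.
class KnownPeers {
  private readonly addresses = new Set<string>();

  // Merge addresses seen in newly replicated CSRs.
  // Returns only the addresses we did not know before, so the caller
  // can hand exactly those to the dialer.
  add(replicated: string[]): string[] {
    const fresh: string[] = [];
    for (const address of replicated) {
      if (!this.addresses.has(address)) {
        this.addresses.add(address);
        fresh.push(address);
      }
    }
    return fresh;
  }

  // Current list of all known peer addresses.
  list(): string[] {
    return Array.from(this.addresses);
  }
}
```

Returning only the fresh addresses avoids redialing peers we already know each time the same CSRs are replicated again.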
Here are some reasons weighing on the side of "don't make changes internal to libp2p yet":
@vinkabuki @EmiM thoughts? |
Writing a custom dialer is probably in our future anyway, but we can wait until we upgrade libp2p and in the meantime try another approach. I looked at the libp2p dialer code and we may be able to get more out of the libp2p connection manager configuration by setting proper minConnections/maxConnections (we already use these, but we may adjust them a bit) and tagging peers. |
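To reason about what tuning minConnections/maxConnections and tagging peers might buy us, here is a simplified standalone model. It is not libp2p code and does not call libp2p: it assumes that a connection manager dials when below the minimum and, when above the maximum, closes the connections with the lowest tag values first. The function names and tag values are hypothetical.

```typescript
// Option names mirror the ones mentioned above; everything else is a model.
interface ConnectionLimits {
  minConnections: number;
  maxConnections: number;
}

// Assumption: dialing is triggered while we are below minConnections.
function needsDialing(connectedCount: number, limits: ConnectionLimits): boolean {
  return connectedCount < limits.minConnections;
}

// Assumption: when over maxConnections, peers with the lowest tag value
// are pruned first; untagged peers count as 0.
function peersToPrune(
  connected: string[],
  tagValue: Map<string, number>,
  limits: ConnectionLimits,
): string[] {
  const excess = connected.length - limits.maxConnections;
  if (excess <= 0) return [];
  return connected
    .slice() // do not mutate the caller's array
    .sort((a, b) => (tagValue.get(a) ?? 0) - (tagValue.get(b) ?? 0))
    .slice(0, excess);
}
```

Under this model, giving a high tag value to peers we care about (for example the inviter) keeps them connected when the limit is reached, which is the effect we would want to verify against the libp2p version we actually run.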
@EmiM so what are your next steps? |
|
Note:
|
It would probably be good to test this a lot on desktop and Android for any performance or battery regressions, in a community with many peers. Do we remember why we disabled the autodialer in the past? Was it just dialing way too much? |
We've never disabled the autodialer. We added initial dialing on our part because libp2p bootstrap was slow and was dialing only one peer at a time. |
Great! |
Version: 2.0.3-alpha.10
As far as I could check, this issue is fixed. |
Mobile: mobile@2.0.1-alpha.7 (iOS 325)
Desktop: quiet@2.0.1-alpha.7 (to a lesser extent; it's not that significant on desktop and may not be an issue at all, but I wanted to note the version I was using during those tests)
Quiet started to be very slow and it's disconnecting a lot. Also, sometimes it's not syncing after coming back from the background. In previous versions it worked rather consistently, so this is a new problem in this version.