Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Libp2p doesn't discover and dial peers but the ones provided in the invitation code #1982

Closed
kingalg opened this issue Oct 18, 2023 · 29 comments
Assignees
Labels
bug Something isn't working

Comments

@kingalg
Copy link
Collaborator

kingalg commented Oct 18, 2023

Mobile: mobile@2.0.1-alpha.7 (iOS 325)
Desktop: quiet@2.0.1-alpha.7 ( to the lesser extend, it's not that significant on desktop, may be not an issue at all, but I wanted to make a note about version that I was using during those tests)

Quiet started to be very slow and it's disconnecting a lot. Also, sometimes it's not syncing after coming back from background. In previous versions it was working rather consistently so this is new problem in this version.

@kingalg kingalg added the bug Something isn't working label Oct 18, 2023
@kingalg kingalg added this to Quiet Oct 18, 2023
@kingalg kingalg moved this to Backlog - Mobile in Quiet Oct 18, 2023
@kingalg kingalg moved this from Backlog - Mobile to Blocked in Quiet Oct 18, 2023
@holmesworcester
Copy link
Contributor

holmesworcester commented Oct 18, 2023

@kingalg is this in a new community or joining our own quiet community? (I think it's in a new community but just confirming)

@siepra can you work with kinga to make sure you can reproduce this? I think the most important thing is that you're able to reproduce whatever problem(s) kinga is seeing. And can you switch to working on this at the next convenient stopping moment? It's not urgent until the rest of the Sprint is close to completion but it may be good to start investigating it now.

@holmesworcester holmesworcester moved this from Blocked to Sprint in Quiet Oct 18, 2023
@holmesworcester
Copy link
Contributor

To me this belongs in sprint because it's a blocker to the release. Let me know if there's some reason why it is blocked otherwise let's leave it in sprint.

@kingalg
Copy link
Collaborator Author

kingalg commented Oct 19, 2023

@holmesworcester yes, it belongs to sprint, it was missclick on my part. It was tested on new communities. One of them was one day old, had few hundred messages and about 20 users (most of them disconnected) and the rest were completely new, created only to test this issue.

@vinkabuki
Copy link
Contributor

one theory to test is to check if tor process is not shutting down.

@siepra siepra moved this from Sprint to In progress in Quiet Oct 26, 2023
@siepra siepra self-assigned this Oct 26, 2023
@holmesworcester
Copy link
Contributor

We decided that we will attempt to reproduce this in the office with kinga on a developer's device where we can see logs.

@siepra siepra changed the title Quiet is slow and disconnecting a lot, especially on mobile Libp2p doesn't discover and dial peers but the ones provided in the invitation code Oct 27, 2023
@siepra siepra assigned EmiM and unassigned siepra Oct 27, 2023
@EmiM
Copy link
Contributor

EmiM commented Oct 27, 2023

This has always been an issue but it was just not visible when we had registrar because joining users were receiving list of all peers registered in community.

What we can do?

  1. Investigate if we can make libp2p connect to other peers in the network. So if B connects to A and C connects to A then can B connect to C?
  2. implement a workaround by passing to libp2p a list of addresses gathered from csrs
  3. Bump libp2p. However this is not easy and I don't know if even possible right now. Bumping libp2p requires bumping ipfs and bumping ipfs is blocked by orbitdb.

I'd go for first option and if this doesn't work out I'd fallback to workarounds.

@holmesworcester
Copy link
Contributor

holmesworcester commented Oct 27, 2023

I think what we want is to use the addresses from CSRs and orbitdb.

Libp2p probably makes different assumptions about the context so if we rely only on libp2p's default peer discovery mechanism this will be suboptimal or insufficient for our case.

We want to make sure that:

  1. All synced clients know about all peers, so that even if only one of those peers is online, the user will eventually connect to them.
  2. Clients bias to last-seen and most-online peers when connecting, so that having a large community with many offline peers does not make connecting to the guest peer slower and slower (because connecting over Tor takes some time and we can't make hundreds of simultaneous attempts.)
  3. Clients still try peers according to libp2p's own logic, so that the network doesn't just center around a few often-online peers, which it would if we only connected to the most online peers. (This piece is tricky to reason about.)

We should also make sure there are tests for these cases so that we don't inadvertently break these properties with future changes. The last case will be hard to make a test for, but perhaps it's something like, if A, B, and C are online in a large community with hundreds of peers offline, and A and B are connected and B and C are connected, A and C should soon be connected.

@holmesworcester
Copy link
Contributor

Also, solving this problem was part of the initial epic as described in #1340 , but it got missed when dividing the epic up into tasks:

Note on peer discovery: this might already be taken care of but we should make sure that users still discover all peers in the community when they connect and start syncing, by syncing this data from OrbitDB. It's okay if their addresses are not added to this list until they are registered with owner.

We should run through the initial epic as described and make sure we're not missing anything else, perhaps?

@holmesworcester
Copy link
Contributor

holmesworcester commented Nov 7, 2023

Questions about how this will work:

  1. At what point in joining does a new user know the entire list of all peer onion address and peer ID in the network?
  2. How do they learn this?
  3. How do all existing members of a community learn of a new member's onion address and peer ID, and when?
  4. Are there other mechanisms through which peers discover onion addresses or peer IDs other than the above?
  5. What is a peer's strategy now for connecting to other peers?
  6. Who do peers try first?
  7. How many peers do they try?
  8. Do we think this makes sense and will result in all peers being connected in a graph as desired?
  9. Have we tried other approaches before and what did we think about those?
  10. Will a peer who never restarts update their list of peers? Or is it only updated on restart?
  11. What's the history of problems solved here?
  12. What regression tests do we have to cover solved problems?
  13. Any other thoughts?

(For example: we needed to connect to last seen / often online because otherwise having many peers who joined once and then left would mess up the community because trying all of them would take too much time. Were there any other fixes related to discovery? Libp2p didn't consider new peers that joined until we gave libp2p a new list. Peer discovery never worked.

Let's do some quick documentation of how this works. Throw some notes in a markdown file in the wiki or a docs directory.

@vinkabuki
Copy link
Contributor

vinkabuki commented Nov 7, 2023

@holmesworcester @leblowl

  1. Once replicating all csrs
  2. Through orbitdb db csrs db replication
  3. Once they replicate their csr
  4. Yes, through invitation link
  5. We provide dial list to libp2p and we manually dial peers until connecting to the first peer.
  6. When joining they dial peers as given in the invitation link, for returning users we collect data of last seen peers and most shared connection time with, it's also what we share in invite link
  7. max 4 for the new uses, have to check for returning users.
  8. It makes sense but needs to be validated, and possibly improved.
  9. There is this native libp2p DHT module. Also in past iterations we were providing bootstrap list, at some point we discovered that libp2p autodialer doesn't work well so we implemented our own mechanism for dialing peers.
  10. atm we don't add new peers to libp2p on the runtime.
  11. bootstrap address list not working and replaced with our dialing mechanism. autodialer was dialing only one peer at the time which was super slow.
  12. We should definitely add some.
  13. @vinkabuki and @EmiM opinion is that our own dialer for many reasons (problems with native solutions, using non standard websocketOverTor transport, specific requirements we very often face, and ease of use and control) is the best option.

Things to improve:

  1. Ensure libp2p is autodialing provided peers.
  2. Add every valid replicated peer info to libp2p on runtime.

@holmesworcester
Copy link
Contributor

Some follow up questions:

Re: question 4, what happens if peer information is in an invitation link but not in the CSRs?

Re: question 6, when someone new joins, does dialing new peers begin as soon as we receive the CSRs? Or do they only dial the peers in the invitation link, even once they replicate CSRs?

Re: question 10, the fact that we do not dial new peers until restart at seems like a bug. If we have a node that is always on, it will never restart, so it will never dial new peers. Another problem is if a new user joins by connecting to 1 online peer and then loses that connection, it sounds like they will not dial any new peers.

For regression tests, let's add some! What cases would we like to cover?

  • User discovers all peers
  • New user dials a peer not in the invite link
  • Already-online user will dial a peer from a just-received CSR if needed
  • Anything else?

@EmiM
Copy link
Contributor

EmiM commented Nov 8, 2023

Re: question 4, what happens if peer information is in an invitation link but not in the CSRs?

We dial peer from invitation link. We have to connect to someone to even replicate csrs. Does temporary lack of CSR mean that given peer is suspicious? Or can it mean that that we just don't have csr yet?

Re: question 6, when someone new joins, does dialing new peers begin as soon as we receive the CSRs? Or do they only dial the peers in the invitation link, even once they replicate CSRs?

In current implementation we only directly dial peers from invitation link. We will be dialing as soon as we replicate CSR but only if we are not already connected to given peer.

Re: question 10, the fact that we do not dial new peers until restart at seems like a bug. If we have a node that is always on, it will never restart, so it will never dial new peers. Another problem is if a new user joins by connecting to 1 online peer and then loses that connection, it sounds like they will not dial any new peers.

Yes, however if peer is always online someone else may dial the peer eventually. I'm thinking, maybe we could start dialing all known peers again if we reach 0 connections and stop as soon as we connect to someone.

  • Should all not connected peers be dialed indefinitely?

@holmesworcester
Copy link
Contributor

holmesworcester commented Nov 8, 2023

We dial peer from invitation link. We have to connect to someone to even replicate csrs. Does temporary lack of CSR mean that given peer is suspicious? Or can it mean that that we just don't have csr yet?

Will the address from the link stick around after, even once someone has replicated many CSRs? Ah, I suppose we don't know when we have replicated all CSRs.

In current implementation we only directly dial peers from invitation link. We will be dialing as soon as we replicate CSR but only if we are not already connected to given peer.

So this is in progress as part of this work, correct? Or will we create a new issue?

I'm thinking, maybe we could start dialing all known peers again if we reach 0 connections and stop as soon as we connect to someone.

We should have a target number of connections, like 6 or 8, and dial peers if we fall under that number.

Should all not connected peers be dialed indefinitely?

In at least one case it would be helpful: say there are 100 online peers, but they are all new peers, so a returning peer doesn't know their addresses. If they all dial not-connected peers randomly at some cadence, they would eventually dial the returning peer. If they don't, the returning peer would never connect.

I suspect there are other cases when it's helpful too. That said, I think we did learn that there is a performance cost to attempting lots of connections over Tor, so maybe we don't want to overdo it.

Another question is: do we know why libp2p's logic for this stuff not working? It might be that there are subtleties here where we don't want to have to roll our own approach to cover all the cases.

@EmiM
Copy link
Contributor

EmiM commented Nov 9, 2023

So this is in progress as part of this work, correct? Or will we create a new issue?

Yes, this is part of this work

Another question is: do we know why libp2p's logic for this stuff not working?

Maybe it's because of our websocketovertor, Bartek mentioned that he noticed that we may be lacking some implementation there.

@EmiM
Copy link
Contributor

EmiM commented Nov 10, 2023

We have to decide what's the DOD here.

If we want to do it properly we probably have to write our own dialer. Especially if we're talking about full control on when we are connecting to who and when we are dropping the connection and also maybe how often do we want to dial given peer. Plus we will maybe have more control over connections in the future. However writing own dialer can be a bigger task.

If we want to temporarily just fix the known issues with glue we also can do that, it'll be done quicker but it is not a long term solution and we will probably have to get rid of it soon anyway.

@holmesworcester
Copy link
Contributor

Let's do the easiest fixes now. We are on an old version of libp2p. We should return to this after we upgrade, since the problems might be different.

@EmiM
Copy link
Contributor

EmiM commented Nov 13, 2023

By "easiest" you mean "only fix basic known issues and do not write custom dialer"?

@holmesworcester
Copy link
Contributor

I mean make some decision about what the most valuable/easy fixes are and do those.

What do you think they are?

@holmesworcester
Copy link
Contributor

holmesworcester commented Nov 13, 2023

Let's keep this focused on the problem this issue was created for (libp2p only dials peers in the invitation code) and create separate issues for the other problems, since this problem is new and related to current work, while the others are not.

Other issues:

  1. Peers never try to dial peers randomly, so partitions can happen. We have a separate issue for this: Possible subgraphs (network partitions) in libp2p #2056
  2. Peers don't dial more peers when they lose connections, when they should try to stay connected to at least 6 peers. https://github.com/orgs/TryQuiet/projects/3/views/1?pane=issue&itemId=44468676
  3. Any others?

So I think we just need to make sure newly replicated peer info becomes part of our list of peers without restart. Is that right? Do we think this requires writing a custom dialer?

@holmesworcester
Copy link
Contributor

Here are some reasons weighing on the side of "don't make changes internal to libp2p yet":

  1. We are on an old version of IPFS
  2. We are probably on an old version of libp2p
  3. We want it to be easy to upgrade libp2p
  4. We don't yet know if js-ipfs is sufficient for our work on iOS, so there's still some risk of needing to change.
  5. In our experiments with iOS we will have an always-on node that is reachable without Tor, and we will connect to this first in many cases. That will make some of these problems go away, and we can return to them when we need to.

@vinkabuki @EmiM thoughts?

@EmiM
Copy link
Contributor

EmiM commented Nov 14, 2023

Writing custom dialer is probably in our future anyway but we can wait for upgrading libp2p and in the meantime try other approach.

I looked at libp2p dialer code and we may be able to get more out of libp2p connection manager configuration by setting proper minConnections/maxConnections (we already do use that but we may adjust it a bit) and tagging peers.
Libp2p uses peer tags for deciding which connection needs to be pruned first if number of connections reaches maxConnections.

@holmesworcester
Copy link
Contributor

@EmiM so what are your next steps?

@EmiM
Copy link
Contributor

EmiM commented Nov 14, 2023

  • Add fix for libp2p bootstrap list (done)
  • Add fix for dialing new peers as soon as we replicate new csrs (done)
  • Tag peers and adjust libp2p config
  • Add tests

@holmesworcester
Copy link
Contributor

holmesworcester commented Nov 14, 2023

Note:

  • we should check to make sure our approach of "dial 10 peers, wait for attempts to resolve, dial next 10" is still working now that we're replicating peers from the CSRs.
  • "tag peers and adjust libp2p config" does not seem necessary at this stage.

@EmiM
Copy link
Contributor

EmiM commented Nov 16, 2023

#2051

@EmiM EmiM moved this from In progress to Waiting for review in Quiet Nov 20, 2023
@holmesworcester
Copy link
Contributor

It would probably be good to test this a lot on desktop and Android for any performance or battery regressions, in a community with many peers.

Do we remember why we disabled the autodialer in the past? Was it just dialing way too much?

@EmiM
Copy link
Contributor

EmiM commented Nov 21, 2023

We've never diabled autodialer.

We added initial dialing on our part because libp2p bootstrap was slow and was dialing only one peer at a time.
Libp2p autodialer still works and have been working this whole time, it just uses peers added to libp2p's internal peer book. Peer book is updated with peer data when peer is dialed.
So to sum up - we dial initial list of peers (and after this fix - all new peers) and libp2p's autodialer takes care of keeping/establishing at least minConnections number of connections.

@EmiM EmiM moved this from Waiting for review to Merged (develop) in Quiet Nov 21, 2023
@Kacper-RF Kacper-RF moved this from Merged (develop) to Ready for QA in Quiet Nov 21, 2023
@holmesworcester
Copy link
Contributor

Great!

@kingalg
Copy link
Collaborator Author

kingalg commented Nov 28, 2023

Version: 2.0.3-alpha.10
mobile@2.0.3-alpha.11 (iOS 338)

As far as I could check this issue is fixed.

@kingalg kingalg closed this as completed Nov 28, 2023
@kingalg kingalg moved this from Ready for QA to Done in Quiet Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

No branches or pull requests

5 participants