
Timeout Management #244

Closed
dirkmc opened this issue Jan 31, 2020 · 9 comments

@dirkmc
Contributor

dirkmc commented Jan 31, 2020

When a Bitswap Session fetches a block it sends want-haves to all the peers in the session, and a single "optimistic" want-block to one of the peers. If the peer that receives the want-block responds with DONT_HAVE, the Session checks the other responses and sends want-block to any peer that responded with HAVE.

If a peer that receives a want-block doesn't respond, the Session will hang. A peer may not respond because:

  • It is overloaded, or there are network issues
  • It is running an older version of Bitswap, which does not send a response if it doesn't have the block

Proposed solution:

  • Detect timeouts in the MessageQueue (which knows which version of Bitswap the peer supports)
  • The MessageQueue informs the Session of timeouts, including whether the timed-out peer is running an older version of Bitswap
  • When the Session gets a timeout for a CID
    • if it's from a peer running an older version of Bitswap, treat it as a DONT_HAVE
    • if it's from a peer running a newer version of Bitswap, remove the peer from the Session
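
A minimal sketch of this proposed flow, with hypothetical names (the `Session` interface and `onWantBlockSent` below are illustrative, not the actual go-bitswap API):

```go
package sketch

import (
	"time"

	cid "github.com/ipfs/go-cid"
)

// Session is the subset of session behaviour this sketch needs.
type Session interface {
	ReceiveDontHave(c cid.Cid) // treat the timed-out want as a DONT_HAVE
	RemovePeer()               // drop the unresponsive peer from the session
}

// onWantBlockSent is called when the MessageQueue sends a want-block.
// supportsHaves is true when the peer speaks the newer Bitswap protocol;
// responded is closed when any response for this CID arrives.
func onWantBlockSent(s Session, c cid.Cid, supportsHaves bool, timeout time.Duration, responded <-chan struct{}) {
	go func() {
		select {
		case <-responded:
			// The peer answered (block, HAVE or DONT_HAVE) in time.
		case <-time.After(timeout):
			if !supportsHaves {
				// Old Bitswap never sends DONT_HAVE, so treat silence as one.
				s.ReceiveDontHave(c)
			} else {
				// A newer peer that stays silent is considered unresponsive.
				s.RemovePeer()
			}
		}
	}()
}
```
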
@Stebalien
Member

Stebalien commented Feb 5, 2020

if it's from a peer running a newer version of Bitswap, remove the peer from the Session

(discussed a bit in person) To account for busy peers, we should consider only timing out peers if they haven't sent us anything (a HAVE, a block, or a DONT_HAVE) within some period.

(not discussed in person) Actually, we could even do this in all cases for all peers with outstanding wants. This might be significantly simpler than tracking per-want timeouts. Thoughts? I don't want to redirect this when you've already started, only if you think this would be simpler.

I'm even wondering if it makes sense to remove old peers from the session the same way, when idle (we may have discussed this?). My thinking is that, if an old peer isn't responding to requests, it might not have anything we're looking for. If we run out of peers to ask, we'll just broadcast (and we'll broadcast occasionally anyway).
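
A rough sketch of the idle-based idea (field and function names are illustrative, not the real go-bitswap types): a peer is only dropped if it has outstanding wants and has sent us nothing for some idle period.

```go
package sketch

import "time"

type peerState struct {
	outstandingWants int       // wants sent to this peer that have not been answered
	lastMessageAt    time.Time // last time the peer sent us anything (HAVE, block, DONT_HAVE)
}

// shouldDropPeer reports whether a peer with outstanding wants has been
// silent for longer than idleTimeout and should leave the session.
func shouldDropPeer(p peerState, idleTimeout time.Duration, now time.Time) bool {
	return p.outstandingWants > 0 && now.Sub(p.lastMessageAt) > idleTimeout
}
```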

@dirkmc
Contributor Author

dirkmc commented Feb 5, 2020

Actually, we could even do this in all cases for all peers with outstanding wants. This might be significantly simpler than tracking per-want timeouts.

I think we still need to track per-want timeouts for the case where we send a want-block to a peer running an old version of Bitswap, because old Bitswap simply doesn't respond to want-block if it doesn't have the block (this is the case covered by #248)

I'm even wondering if it makes sense to remove old peers from the session the same way (when idle)

If the peers have part of the data that we haven't yet requested, and they are dropped from the session, then we have to wait for a timeout + broadcast in order to add them back into the session.
Is there a disadvantage to keeping them in the session?

@Stebalien
Member

This would unfairly disadvantage old peers who have some of the blocks. On the other hand, old peers that have some of the blocks are more likely to slow us down if we have other peers with most/all of the blocks, so there is a trade-off here.

then we have to wait for a timeout

Note: we'll have to wait for a timeout either way. If the peer turns out to be bad and we keep randomly selecting them, we'll have to wait for a timeout in that case as well before we ask other peers in the session.

@dirkmc
Contributor Author

dirkmc commented Feb 5, 2020

Note: we'll have to wait for a timeout either way. If the peer turns out to be bad and we keep randomly selecting them, we'll have to wait for a timeout in that case as well before we ask other peers in the session.

Ah yeah that's true.

I suggest we consider each kind of timeout separately:

  1. DONT_HAVE timeout (only for peers running old Bitswap)
    Timeout due to old Bitswap not responding if it doesn't have a block.
    Action: The session assumes DONT_HAVE.
  2. Peer timeout (greater than the DONT_HAVE timeout)
    Timeout for when a peer doesn't respond to any want for some period of time.
    Action: Session removes peer from the session.
  3. Session timeout (greater than the Peer timeout)
    Session wants a block and no peer has responded with the block. This can occur if there are several peers running old Bitswap in the session and we ask them for the block one by one.
    Action: The session broadcasts the want to all connected peers and queries the DHT for the want.
    Note that if all peers respond with DONT_HAVE we will also broadcast, but this may not happen in time if several of the peers are running old Bitswap.

One mitigating factor is that Bitswap will favour peers with lower latency, which are likely to be new peers (as the simulated DONT_HAVE for old peers occurs after latency + padding).
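
A sketch of how the three timeouts could be layered (the constant names and durations below are illustrative assumptions, chosen only to show the ordering):

```go
package sketch

import "time"

const (
	dontHaveTimeout = 5 * time.Second  // 1. assume DONT_HAVE from an old-Bitswap peer
	peerTimeout     = 30 * time.Second // 2. must be > dontHaveTimeout
	sessionTimeout  = 60 * time.Second // 3. must be > peerTimeout
)

// nextAction describes what the session should do for a want that a peer
// has left unanswered for the given elapsed time.
func nextAction(elapsed time.Duration, peerRunsOldBitswap bool) string {
	switch {
	case elapsed > sessionTimeout:
		return "broadcast the want to all connected peers and query the DHT"
	case elapsed > peerTimeout:
		return "remove the unresponsive peer from the session"
	case elapsed > dontHaveTimeout && peerRunsOldBitswap:
		return "treat as DONT_HAVE and send the want-block to another peer"
	default:
		return "keep waiting"
	}
}
```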

@Stebalien
Member

I have two concerns with that:

  1. With 1 & 2, the peer may be busy sending us (or other peers) other blocks. We'd need to carefully tune the timeout not just based on latency but on how long it usually takes the other peer to respond to a want. Even that is a bit difficult given that response time will vary based on how many higher-priority wants are outstanding.
  2. With 3, old peers will never be removed from sessions, even if they only had the root block.

@dirkmc
Contributor Author

dirkmc commented Feb 6, 2020

I'm not sure if I was clear above: I'm suggesting that we implement all three of these timeouts. Note that the Session timeout (3) is already there; it was in the old Bitswap and was ported over to the new Bitswap.

1. DONT_HAVE timeout (peers running old Bitswap)

You're right that it's tricky to measure accurately because the peer may be busy sending blocks to other peers. In practice, though, it may not be such a problem:

  • latency gives us a rough idea of how long we should expect to wait for a response
  • if we time out too early, we will send the request to a different peer
  • if the original peer then responds, we will still end up getting the block
  • in the worst case other peers also respond, and we get some duplicate blocks
  • we send out a cancel once we get the block, so we may avoid some duplicates
  • we only need this timeout for peers running old Bitswap (which should be less and less over time)

So the DONT_HAVE timeout becomes more like a hint: wait for a peer for x amount of time, and if it doesn't respond, ask someone else at the risk of getting some duplicates.
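
For example, the timeout could be derived from the measured message latency plus some padding. The multiplier and padding values below are illustrative assumptions, not tuned numbers:

```go
package sketch

import "time"

const (
	latencyMultiplier = 3                     // allow a few round trips of slack
	maxQueuedWants    = 32                    // rough bound on wants queued ahead of ours
	perWantPadding    = 10 * time.Millisecond // extra time per queued want
)

// dontHaveTimeout estimates how long to wait before assuming DONT_HAVE from
// a peer running old Bitswap, given its measured message latency.
func dontHaveTimeout(measuredLatency time.Duration) time.Duration {
	return measuredLatency*latencyMultiplier + maxQueuedWants*perWantPadding
}
```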

2. Peer timeout

  • For peers running new Bitswap
    The Peer timeout should be close to one RTT:
    • On the server side Bitswap sends responses first to peers with the least active data in their queue
      This means that from the client's perspective, either
      • there is a lot of active data in my queue (because I made lots of requests), so I should expect some responses coming down the pipeline soon
      • there is not much active data in my queue, so any request I make will be prioritized and I should expect a quick response
    • In the worst case
      • The Session removes a peer from the session because it doesn't respond to a want in time
      • The response arrives, and the Session adds the peer back again
  • For peers running old Bitswap
    We should use a longer timeout, but if it's too short it's not so bad because the same worst case applies: if the peer is removed from the session because of a timeout, and then its response arrives, it will be added back into the session.
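
A sketch of that behaviour with hypothetical names (the peer ID type is a stand-in): removal is cheap to undo, because a late response simply puts the peer back in the session.

```go
package sketch

import "time"

type peerID string // stand-in for a libp2p peer ID

type session struct {
	peers map[peerID]time.Time // peer -> time of its last response
}

// checkPeerTimeouts drops peers that have been silent for longer than timeout.
func (s *session) checkPeerTimeouts(timeout time.Duration, now time.Time) {
	for p, last := range s.peers {
		if now.Sub(last) > timeout {
			delete(s.peers, p)
		}
	}
}

// onResponse is called whenever a peer sends us anything; a peer that was
// removed after a timeout is simply added back when its late response arrives.
func (s *session) onResponse(p peerID, now time.Time) {
	s.peers[p] = now
}
```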

@Stebalien
Member

I'm not sure if I was clear above, I'm suggesting that we implement all 3 of these timeouts.

👍

  1. DONT_HAVE timeout (peers running old Bitswap)

My concern is that:

  1. This will be a common situation. As long as our peers' wantlists are deep enough, the effective want latency will exceed the raw network latency.
  2. When this happens, the effect will cascade. That is, assuming a peer is sending blocks at a steady rate, if we time out one want before the peer gets the chance to send us the block, we'll end up timing out all subsequent wants to that peer for the same reason.

You're right that it might not lead to duplicate blocks, but it might lead to a lot of extra work (i.e., sending lots of wants to peers that we immediately cancel, and forcing peers to deal with these). Also remember "freezing": IIRC we turned that off in the new system, but canceling wants to old peers will slow us to a crawl (maybe not something we really care about?).

However, this is something that we should be able to test.

  • We can measure canceled wants by looking at how many times we make requests to the datastore/blockstore. We should be able to easily instrument that (a rough sketch follows at the end of this comment).
  • We should also try tracking bandwidth usage. That could be a bit trickier.
  2. Peer timeout

Ah, sorry, I misread this. I thought this only applied to new peers. Applying this to all peers resolves the rest of my concerns.
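
Something like the counter below could approximate the first measurement: wrap whatever fetches blocks locally and count the lookups. The `blockGetter` interface here is a hypothetical stand-in, not the real blockstore API.

```go
package sketch

import (
	"sync/atomic"

	cid "github.com/ipfs/go-cid"
)

// blockGetter is a stand-in for whatever fetches blocks from local storage.
type blockGetter interface {
	GetBlock(c cid.Cid) ([]byte, error)
}

// countingGetter counts how many local block lookups the engine performs,
// which approximates how much work incoming (and soon-canceled) wants cause.
type countingGetter struct {
	inner blockGetter
	gets  uint64
}

func (g *countingGetter) GetBlock(c cid.Cid) ([]byte, error) {
	atomic.AddUint64(&g.gets, 1)
	return g.inner.GetBlock(c)
}

// Gets returns the number of block lookups observed so far.
func (g *countingGetter) Gets() uint64 {
	return atomic.LoadUint64(&g.gets)
}
```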

@Stebalien
Member

Sync agreement: go with the simplest solution for now. We won't know the block latency anyway when we first start, so this isn't wasted work. Later, we can experiment with tracking per-block latencies.

@Stebalien mentioned this issue Feb 13, 2020
@dirkmc
Contributor Author

dirkmc commented Feb 14, 2020

Closing in favour of ipfs/boxo#125

@dirkmc closed this as completed Feb 14, 2020