-
Notifications
You must be signed in to change notification settings - Fork 112
fix: memory leak in latency tracker on timeout after cancel #164
Conversation
if !ok || !data.lt.WasCancelled(ptm.k) { | ||
// If the request was cancelled, make sure we clean up the request tracker | ||
if ok && data.lt.WasCancelled(ptm.k) { | ||
data.lt.RemoveRequest(ptm.k) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hannahhoward can you remember why we don't just remove the request on cancel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, nevermind. @dirkmc pointed me to
// if we make a targeted request that is then cancelled, but we still |
So, I think the correct fix is to detect that all peers have sent back their responses. This fix will only record one latency.
However, this shouldn't be an issue as we should be hitting the timeout and removing all these trackers anyways. Something else is fishy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the latency tracker records latency per-peer.
Each peerData has an associated latency tracker:
type peerData struct {
hasLatency bool
latency time.Duration
lt *latencyTracker
}
The latency tracker keeps track of the time that a request for a CID was sent to a peer, whether a cancel was then issued, and how long it took for a response to arrive:
type requestData struct {
startedAt time.Time
wasCancelled bool
timeoutFunc *time.Timer
}
type latencyTracker struct {
requests map[cid.Cid]*requestData
}
There are broadcast requests, and requests to a specific set of peers
- if the request was a broadcast, then if the request times out we clean up the latency tracker data for the request for that CID
- if the request was to a specific peer, then if the request times out we call
recordResponse()
.recordResponse()
callsdata.AdjustLatency()
on thepeerData
, which in turn cleans up the latency tracker data for the request for that CID.
The problem is that if the request was cancelled, we don't callrecordResponse()
so the latency tracker data for that request never gets cleaned up.
I think your initial suggestion makes the most sense - I don't know if it's worth the complexity and extra memory required to wait for a timeout after a cancel. I think we should just clean up the latency tracker information as soon as there is a cancel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed the per-peer part. This code looks correct.
I think we should just clean up the latency tracker information as soon as there is a cancel.
Without tracking latencies for canceled wants, we may not correctly update latencies. For example:
Assume peers 1 and 2 have recorded latencies of 1s.
- Ask peers 1 and 2.
- 2 seconds pass.
- Get a response for peer 1.
- Record a latency of 2s for peer 1.
- Cancel the request to peer 2.
- Remove the latency tracker.
We've just penalized peer 1 for responding before peer 2.
On the other hand, the current code will penalize peer 1 and not peer 2 if peer 2 processes the cancel before sending the block.
The correct solution may be randomization. That is:
- Record the first latency and remove all other trackers.
- Make sure to randomly sample peers that aren't the fastest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes you're right, that's a tricky problem to solve 🤔
We may also be able to reduce the per-peer logic/state by centralizing this logic a bit. |
…r-memory-leak fix: memory leak in latency tracker on timeout after cancel This commit was moved from ipfs/go-bitswap@f62bf54
May be the cause of #162