Skip to content
This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

retrieval: fix memory leak #2103

Merged
merged 4 commits into from
Feb 19, 2020
Merged

retrieval: fix memory leak #2103

merged 4 commits into from
Feb 19, 2020

Conversation

acud
Copy link
Member

@acud acud commented Feb 15, 2020

We have a memory leak on production and it seems that this is the problem:

  1. We issue a retrieve request to a peer, adding an entry to retrievals map on the retrieval peer in addRetrieval
  2. Peer cannot find the chunk, never returning an error
  3. Netstore RemoteFetch times out on context or defined search timeout
  4. Entry on Peer never gets deleted
  5. Due to the fact that the peers is stable and not disconnected - the map on the Peer struct leaks

It is not possible to create a regression test on netstore due to circular dependencies.

EDIT: I have added a test that checks that NoSuitablePeer error is emitted correctly from netstore. To verify that the leak happens, modify expireRetrieval to the following, then run TestNoSuitablePeer test. it will panic because the record is found in memory, exacerbating the problem.

func (p *Peer) expireRetrieval(ruid uint) {
	p.mtx.Lock()
	defer p.mtx.Unlock()

	if _, found := p.retrievals[ruid]; found {
		panic("found")
	}

	delete(p.retrievals, ruid)
}

@acud acud added the bug label Feb 15, 2020
@acud acud added this to the 0.6.0 milestone Feb 15, 2020
@acud acud self-assigned this Feb 15, 2020
@@ -384,7 +384,7 @@ func (r *Retrieval) handleChunkDelivery(ctx context.Context, p *Peer, msg *Chunk
}

// RequestFromPeers sends a chunk retrieve request to the next found peer
func (r *Retrieval) RequestFromPeers(ctx context.Context, req *storage.Request, localID enode.ID) (*enode.ID, error) {
func (r *Retrieval) RequestFromPeers(ctx context.Context, req *storage.Request, localID enode.ID) (*enode.ID, func(), error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have a short explanation what is the purpose of the returned function, as now, it requires to go through the code. Or to name it in function signature.

t.Fatal("not enough nodes up")
}
// allow the two nodes time to set up the protocols otherwise kademlias will be empty when retrieve requests happen
time.Sleep(50 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that if this sleep is too short, the test passes but from different reason then this test is designed for? It would not have any peers in kademlia. Should there be some sort of check to see if peers are connected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would fail in both cases. i added a check nevertheless

Copy link
Member

@janos janos Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Unfortunately, now it's timing out https://travis-ci.org/ethersphere/swarm/jobs/652344397#L1236.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed a fix

Copy link
Contributor

@pradovic pradovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@acud acud merged commit b7216e3 into master Feb 19, 2020
@acud acud deleted the fix-retrieval-mem-leak branch February 19, 2020 11:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants