Skip to content
This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

simplification of Fetchers #1344

Merged
merged 17 commits into from
Jun 17, 2019
Merged

simplification of Fetchers #1344

merged 17 commits into from
Jun 17, 2019

Conversation

nonsense
Copy link
Contributor

@nonsense nonsense commented Apr 12, 2019

This PR is a subset of the simplify-fetchers branch for easier review and mostly addressing the issues described at #1309

It includes:

  1. A refactor of NetStore, where we replace FetchFunc with two new functions introduced are:
	NetStore.RemoteGet = delivery.RequestFromPeers
	NetStore.RemoteFetch

You probably want to start reviewing this PR by checking them out and checking where they are used.

  1. swarm/network/fetcher.go has been removed and replaced with singleflight.Group. Every time the Netstore.Get looks for chunk that is not in our LocalStore, it uses the singleflight.Group in order to suppress duplicate retrievals for the same chunk.

Each in-flight request for a chunk adds a Fetcher value to a map, so that the NetStore.Get receives a signal that the chunk it requested has been delivered. The signal that a chunk is delivered and stored to the LocalStore is the closing of the Fetcher.Delivered channel.

NetStore.Put is responsible for removing fetchers and closing the delivered channel. If a fetcher item has been added to the fetchers map, then we must get that chunk and there is an interested party in the delivered signal.

We need to add a timeout to the fetchers, because even though a chunk has been requested and an actor waits for it, doesn't mean it will be delivered.

  1. handleOfferedHashesMsg uses the Netstore.Has to determine if it needs a chunk (while syncing). If a chunk is needed, then a Fetcher is also created, so that we keep track while the current batch is being delivered.

It is possible for a chunk to be both requested through a retrieve request and delivered through syncing independently, but there would always be only one Fetcher and only one Delivered channel for it so that interested parties are notified when the chunk is delivered and stored.

  1. // * first it tries to request explicitly from peers that are known to have offered the chunk - this part of the functionality of RequestFromPeers has been removed. We no longer control the flow through values in the context.Context.

  2. RequestTimeout has been split into RequestTimeout and FailedPeerSkipDelay as these have different meanings. All timeouts are now placed in the timeouts package, and have documentation.

  3. Added LNetStore (we probably need a better name) - a NetStore wrapper needed for the LazyChunkReader.

  4. Found a bug in OfferedHashes where we request the same chunk from multiple peers via the OfferedHashes/WantedHashes protocol. In a future PR we will address it the following way: if we have requested a chunk, we have a fetcher for it, so subsequent OfferedHashes won't have an effect.

  5. Solved bug where the interval passed to SubscriptionPull is exclusive, meaning we lose one chunk between batches.

  6. Solved a bug with chunk deliveries described at Fix tracing for chunk delivery #1292 - chunks are delivered, but fetcher continues to make requests for the same chunk.


TODO TESTS:

  • fix broken unit tests, and remove irrelevant unit tests.
  • review and fix streamer_test.go tests
    • review and fix TestStreamerDownstream*
  • TestDeliveryFromNodes 8_200 seems to be flaky on macOS. TestDeliveryFromNodes is removed altogether, because we no longer blindly forward requests to our peers - FindPeer was changed.
  • review and fix flaky TestRetrieval
    • make sure it works when retrieval is done after syncing completes
    • make sure it works when retrieval is done before syncing completes
  • review and fix flaky TestFileRetrieval

TODO:

  • fix log levels - currently chunk not found results in errors, which is not right.
  • address all new todos and fixme leftovers in the change set
  • add global timeout for fetchers - if a chunk is never delivered, we don't want to keep a fetcher around forever - for this we need proper lifecycle and ownership handling for fetchers, something that we don't have right now.

NICE TO HAVE:

  • RequestFromPeers quit channel - notify that a peer was disconnected and we don't have to wait for timeouts.SearchTimeout - we can immediately call next peer.
  • add test for FindPeer - it has at least one new condition - not going out of depth - Add test for FindPeer #1362

@nonsense
Copy link
Contributor Author

This is the result of resolving the conflicts between simplify-fetchers branch and the new localstore + localstore migration branches.

This PR is still in progress, because I have to fix the tests and run integration tests on it (smoke tests), but the general gist of the changes probably won't change in the process - namely the new Fetchers impl., the new NetStore, and the handlers for RetrieveRequest, OfferedHashes, WantedHashes, NeedData.

Therefore feel free to start reviewing and asking questions if something is not clear, but also expect to see new commits.

@nonsense nonsense force-pushed the simple-fetchers branch 13 times, most recently from 05f9168 to d8f9a43 Compare April 16, 2019 13:11
@nonsense nonsense requested a review from zelig as a code owner April 17, 2019 07:53
@nonsense nonsense force-pushed the simple-fetchers branch 6 times, most recently from 3fa521a to c37ef91 Compare April 17, 2019 12:53
@nonsense nonsense force-pushed the simple-fetchers branch 3 times, most recently from 935a0bd to 60145da Compare April 17, 2019 13:08
@@ -62,52 +59,42 @@ func NewDelivery(kad *network.Kademlia, netStore *storage.NetStore) *Delivery {

// RetrieveRequestMsg is the protocol msg for chunk retrieve requests
type RetrieveRequestMsg struct {
Addr storage.Address
SkipCheck bool
HopCount uint8
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protocol message change. after we merge this, we need to bump protocol versions.

@@ -300,293 +280,3 @@ func TestStreamerDownstreamChunkDeliveryMsgExchange(t *testing.T) {
}

}

func TestDeliveryFromNodes(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestDeliveryFromNodes is removed, because we no longer test that a request for a chunk is forwarded within a chain of nodes. FindPeer functionality is changed so that requests with the NN are handled, but not forwarded.

if wait := c.NeedData(ctx, hash); wait != nil {
log.Trace("checking offered hash", "ref", fmt.Sprintf("%x", hash))

if _, wait := c.NeedData(ctx, hash); wait != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR we introduce no change to the production syncer code. The bug where a chunk might be requested for N peers if N peers offer it to us, will be addressed in a follow PR.

NewNetFetcherFunc func(ctx context.Context, addr Address, peers *sync.Map) NetFetcher
const (
// capacity for the fetchers LRU cache
fetchersCapacity = 500000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fetchers LRU cache is responsible for removing stale fetchers (outgoing requests for chunks that were never delivered. 500k fetchers is equal to 28MB, and in practice we should never reach that many active fetchers).

var fetcherTimeout = 2 * time.Minute // timeout to cancel the fetcher even if requests are coming in
// NewNetStore creates a new NetStore using the provided chunk.Store and localID of the node.
func NewNetStore(store chunk.Store, localID enode.ID) *NetStore {
fetchers, _ := lru.New(fetchersCapacity)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ignore the error, because we know that we are creating a LRU cache with a positive size.

@acud acud removed request for janos, frncmx and gluk256 June 12, 2019 20:17
storage/feed/handler.go Show resolved Hide resolved
@nonsense
Copy link
Contributor Author

nonsense commented Jun 12, 2019

@jpeletier could you explain a bit further the behaviour you rely on? Based on the tests it appeared that context.DeadlineExceeded was used to determine if a chunk is not found (also a comment next to the check...) by the NetStore, which is not really correct - there can be a variety of errors right now, which are not exported as error values, but essentially mean that the chunk was not found, for example

  • no suitable peer
  • global timeout reached

From your comment I see that you care mostly if the client cancelled the context or if the chunk was not found by the NetStore, is that right? If this is the case, then we could probably refactor this bit.


To give you background, the reason for this change is because the Swarm Feed tests were failing, when a chunk was not found by the NetStore and it returns an error. You can reproduce this, by reverting this particular change, and trying to run the tests.

jpeletier
jpeletier previously approved these changes Jun 14, 2019
@nonsense nonsense added this to the 0.4.2 milestone Jun 14, 2019
Copy link
Member

@acud acud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work @nonsense.
a few small comments, mainly about tracing, whitespacing (feel free to ignore), and a small comment about the netstore shutdown. otherwise lgtm

network/stream/delivery.go Show resolved Hide resolved
network/stream/messages.go Show resolved Hide resolved
network/stream/messages.go Show resolved Hide resolved
storage/netstore.go Outdated Show resolved Hide resolved
storage/netstore.go Show resolved Hide resolved
storage/netstore.go Show resolved Hide resolved
storage/netstore.go Show resolved Hide resolved
storage/netstore.go Show resolved Hide resolved
storage/netstore.go Show resolved Hide resolved
storage/netstore.go Outdated Show resolved Hide resolved
Copy link
Member

@acud acud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ship it 🚢

// TODO: define "eligible"
func (d *Delivery) RequestFromPeers(ctx context.Context, req *network.Request) (*enode.ID, chan struct{}, error) {
requestFromPeersCount.Inc(1)
// getOriginPo returns the originPo if the incoming Request has an Origin
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the codebase po generally refers to proximity order which is the number of matching leading (most significant) bits between two byte slices.

@nonsense nonsense merged commit 0b724bd into master Jun 17, 2019
nonsense added a commit that referenced this pull request Jun 17, 2019
nonsense added a commit that referenced this pull request Jun 17, 2019
@nonsense nonsense mentioned this pull request Jun 17, 2019
13 tasks
@nonsense nonsense removed this from the 0.4.2 milestone Jun 17, 2019
@nonsense nonsense deleted the simple-fetchers branch July 8, 2019 08:17
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants