Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(pushsync): Don't account for failed retries due to accounting #1662

Merged
merged 7 commits into from
May 12, 2021

Conversation

aloknerurkar
Copy link
Contributor

@aloknerurkar aloknerurkar commented May 6, 2021

#1638

  • During pushsync retries if we fail to reserve amount with peer, we don't consider the attempt and try again with a different peer. Previously if we fail to reserve the amount, we did not even try to push anything
  • Added a FailedRequestCache in pushsync to remember failed traces. So if we failed to push {peer, chunk} multiple times, we will not try the same peer for the chunk and try a different peer
  • We added some metrics in the pushsync code to figure out how many attempts we made etc

This change is Reviewable

@aloknerurkar
Copy link
Contributor Author

Current changes include:

  1. Rebasing @acud 's changes on the latest master and fixing conflicts
  2. Adding logic to discount failed retry attempts (for both originator and forwarder)
  3. Adding test to ensure above behavior

@aloknerurkar aloknerurkar requested review from acud, zelig and metacertain May 6, 2021 05:27
@aloknerurkar aloknerurkar self-assigned this May 6, 2021
@aloknerurkar aloknerurkar added the ready for review The PR is ready to be reviewed label May 10, 2021
}
if err != nil {
logger.Debugf("could not push to peer %s: %v", peer.String(), err)
resultC <- &pushResult{err: err}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this error does not seem to be read anywhere.
Consider using the errgroup package.

Stamp: stamp,
}); err != nil {
_ = streamer.Reset()
return nil, true, fmt.Errorf("chunk %s deliver to peer %s: %w", ch.Address().String(), peer.String(), err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the String() call is not needed on %s arguments

pkg/pushsync/pushsync.go Show resolved Hide resolved
Copy link
Member

@acud acud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @aloknerurkar, @metacertain, and @zelig)


pkg/pushsync/pushsync.go, line 354 at r2 (raw file):

Previously, zelig (Viktor Trón) wrote…

this error does not seem to be read anywhere.
Consider using the errgroup package.

errgroup is not needed since there is no concurrency here


pkg/pushsync/pushsync.go, line 365 at r2 (raw file):

		select {
		case r := <-resultC:
			if r.receipt != nil {

check if r.err != nil, log the error if not and continue to the next retry


pkg/pushsync/pushsync.go, line 396 at r2 (raw file):

	streamer, err := ps.streamer.NewStream(ctx, peer, nil, protocolName, protocolVersion, streamName)
	if err != nil {
		return nil, true, fmt.Errorf("new stream for peer %s: %w", peer.String(), err)

this should be nil,false, error since there was no attempt yet (nothing has been sent)


pkg/pushsync/pushsync.go, line 440 at r2 (raw file):

Previously, zelig (Viktor Trón) wrote…

if you use errgroup or separate errs channel, you wont need this struct

errgroup won't help, since it will result in an error if one error or more has occurred. but since subsequent retries might resolve the problem, we would not like to error. i would personally refrain from adding an errgroup here. also, errgroup helps when havin concurrent requests, which is not the case here, since we wait for one request to expire before trying the next i.e. no parallel requests here

@aloknerurkar
Copy link
Contributor Author


pkg/pushsync/pushsync.go, line 396 at r2 (raw file):

Previously, acud (acud) wrote…

this should be nil,false, error since there was no attempt yet (nothing has been sent)

If NewStream returns an error, we are not able to connect to a peer right? I was assuming this might take some time to return in which case we can call it an attempt. No?

In the subsequent change, we will filter peers which return this error. Which should mitigate the issue (by not considering these peers) in my view. If we don't consider this an attempt, we will not consider this node to be blocked for future requests and retry this.

Copy link
Contributor Author

@aloknerurkar aloknerurkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 5 unresolved discussions (waiting on @acud, @metacertain, and @zelig)


pkg/pushsync/pushsync.go, line 354 at r2 (raw file):

Previously, acud (acud) wrote…

errgroup is not needed since there is no concurrency here

Agree with @acud We only do 1 request at a time and only if it fails we retry.


pkg/pushsync/pushsync.go, line 365 at r2 (raw file):

Previously, acud (acud) wrote…

check if r.err != nil, log the error if not and continue to the next retry

This error is returned from pushPeer. The error is already logged there once.


pkg/pushsync/pushsync.go, line 407 at r2 (raw file):

Previously, zelig (Viktor Trón) wrote…

the String() call is not needed on %s arguments

Done.

@aloknerurkar aloknerurkar changed the title refactor: pushsync to not account for failed retries due to accounting refactor(pushsync): Don't account for failed retries due to accounting May 10, 2021
Copy link
Member

@acud acud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 4 unresolved discussions (waiting on @acud, @aloknerurkar, @metacertain, and @zelig)


pkg/pushsync/pushsync.go, line 365 at r2 (raw file):

Previously, aloknerurkar (Alok Nerurkar) wrote…

This error is returned from pushPeer. The error is already logged there once.

OK 👍 note here that the happy path is nested, usually you would check for an error and if there's an error do something, otherwise the happy path is assumed to be at the root of the method/block


pkg/pushsync/pushsync.go, line 396 at r2 (raw file):
If NewStream returns an error it is most probably that we are no longer connected to this peer or that some other problem may exist. It should not take too long to return here.

If we don't consider this an attempt, we will not consider this node to be blocked for future requests and retry this.

As far as I remember it was agreed that the boolean returned from the method signifies whether there was an actual successful attempt to communicate the chunk. It should not be true unless the chunk actually got sent successfully, which is not the case here. In any case, a failed attempt to send to the peer (i.e. if we did not manage to get around to sending the chunk to the peer) must always result in a failed attempt logged for the trace in the LRU cache and the peer added to the skip list


pkg/pushsync/pushsync.go, line 407 at r2 (raw file):

Previously, aloknerurkar (Alok Nerurkar) wrote…

Done.

that actually doesn't work with chunk addresses in all cases. a manual test to confirm would be nice

Copy link
Contributor Author

@aloknerurkar aloknerurkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 4 unresolved discussions (waiting on @acud, @aloknerurkar, @metacertain, and @zelig)


pkg/pushsync/pushsync.go, line 365 at r2 (raw file):

Previously, acud (acud) wrote…

OK 👍 note here that the happy path is nested, usually you would check for an error and if there's an error do something, otherwise the happy path is assumed to be at the root of the method/block

There will be an error check once the next PR will be merged.

The comment on the bottom was meant to clarify that we will be continuing the loop. We will revisit this once I merge the next PR and if more details are required will add.


pkg/pushsync/pushsync.go, line 407 at r2 (raw file):

Previously, acud (acud) wrote…

that actually doesn't work with chunk addresses in all cases. a manual test to confirm would be nice

DEBU[2021-05-11T13:25:40+05:30] could not push to peer 6000000000000000000000000000000000000000000000000000000000000000: new stream for peer 6000000000000000000000000000000000000000000000000000000000000000: peer not reachable

This does work in the test. Not sure what you mean by 'all' cases?

Copy link
Member

@acud acud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @metacertain and @zelig)


pkg/pushsync/pushsync.go, line 407 at r2 (raw file):

Previously, aloknerurkar (Alok Nerurkar) wrote…

DEBU[2021-05-11T13:25:40+05:30] could not push to peer 6000000000000000000000000000000000000000000000000000000000000000: new stream for peer 6000000000000000000000000000000000000000000000000000000000000000: peer not reachable

This does work in the test. Not sure what you mean by 'all' cases?

we had some formatting problems with using %s and only swarm.Address in the argument, I think this happens with the logger for some reason

Copy link
Member

@acud acud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @metacertain and @zelig)

@acud acud removed the request for review from metacertain May 12, 2021 05:25
@aloknerurkar
Copy link
Contributor Author

Just for documenting:

We have decided to classify the NewStream error as an attempt for now. Idea is that it should not happen often and if it does, we can know using the new metrics. The metrics once run on production env, should tell us how the cache is being utilized. Once we have a better understanding, we can tweak these things in the future. Will check in as-is for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pull-request ready for review The PR is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants