
bigquery/storage/managedwriter: custom gax.Retryer #5094

Open
GlenDC opened this issue Nov 5, 2021 · 7 comments
Labels: api: bigquery (Issues related to the BigQuery API) · priority: p3 (Desirable enhancement or fix; may not be included in next release) · type: feature request ('Nice-to-have' improvement, new feature, or different behavior or design)

Comments


GlenDC commented Nov 5, 2021

Is your feature request related to a problem? Please describe.

The BQ ManagedWriter for the Storage API uses a gax.Retryer under the hood.
It does so for two purposes:

  • when (re)-opening an underlying stream connection
  • when appending data

In neither case is it possible for me to define which Retryer to use, which means I cannot
define my own retry conditions. One severe consequence is that there is currently no way to
put a maximum deadline on the back-off algorithm, so in theory it can retry forever.

Describe the solution you'd like

Be able to define my own retryer. For the AppendRows command this could be as simple as allowing us to define the gax.CallOptions ourselves. That alone wouldn't let us define what to do for the (re)connect use case though, so either the latter is handled separately with an initial config, or the initial config is used for both use cases.
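As an illustration (my sketch, not existing managedwriter API): gax-go v2 already ships the building blocks for such a user-defined policy. gax.WithRetry, gax.OnCodes, and gax.Backoff are real gax-go v2 APIs; the retryable codes and back-off values below are arbitrary examples.

```go
package retryexample

import (
	"time"

	"github.com/googleapis/gax-go/v2"
	"google.golang.org/grpc/codes"
)

// buildRetryOption constructs a gax.CallOption carrying a user-defined retry
// policy: retry only on Unavailable, with capped exponential back-off.
func buildRetryOption() gax.CallOption {
	return gax.WithRetry(func() gax.Retryer {
		return gax.OnCodes([]codes.Code{codes.Unavailable}, gax.Backoff{
			Initial:    250 * time.Millisecond,
			Max:        30 * time.Second,
			Multiplier: 1.5,
		})
	})
}
```

If the managed writer accepted such options, a deadline-aware retryer could be slotted in the same way.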

Describe alternatives you've considered

  • do not use the managed writer: this is possible by using the apiv1 package directly,
    but feels a lot like reinventing the wheel to me;
  • accept that I cannot define the Retryer myself: this is possible, but means that in the
    worst case it could retry forever;
  • fork the managed writer code: this is what I've done for now; it allows me to use a custom retryer.

Extra

Here is the code of the retryer I developed and use in my forked ManagedWriter.
Note that I am not an expert in either gRPC or the Google Cloud Go API; I wrote this Retryer
based on brief experience with both, as well as the documentation.

```go
// Imports added for completeness; gax is github.com/googleapis/gax-go/v2.
import (
	"context"
	"time"

	"github.com/googleapis/gax-go/v2"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Retryer is a retryer inspired by other community back-off implementations,
// written to avoid adding another dependency to this library, while still
// relying on the retry-related Google code of dependencies already required
// by this library for its core functionality.
type Retryer struct {
	backoff                gax.Backoff
	retries                int
	maxRetries             int
	startTime              time.Time
	maxRetryDeadlineOffset time.Duration
	deadlineCtx            context.Context
	cancelDeadlineCtx      func()
}

// compile-time interface compliance
var _ gax.Retryer = (*Retryer)(nil)

func NewRetryer(ctx context.Context, maxRetries int, initialRetryDelay time.Duration, maxRetryDeadlineOffset time.Duration, retryDelayMultiplier float64) *Retryer {
	startTime := time.Now()
	deadlineCtx, cancelDeadlineCtx := context.WithDeadline(ctx, startTime.Add(maxRetryDeadlineOffset))
	return &Retryer{
		backoff: gax.Backoff{
			Initial:    initialRetryDelay,
			Max:        maxRetryDeadlineOffset,
			Multiplier: retryDelayMultiplier,
		},
		maxRetries:             maxRetries,
		startTime:              startTime,
		maxRetryDeadlineOffset: maxRetryDeadlineOffset,
		deadlineCtx:            deadlineCtx,
		cancelDeadlineCtx:      cancelDeadlineCtx,
	}
}

// RetryOp runs op repeatedly until it succeeds or the retryer gives up
func (r *Retryer) RetryOp(op func(context.Context) error) error {
	defer r.cancelDeadlineCtx()
	for {
		err := op(r.deadlineCtx)
		if err == nil {
			return nil
		}
		pause, ok := r.Retry(err)
		if !ok {
			// retryer wishes not to retry, return early
			return err
		}
		// we'll retry, but first will sleep
		time.Sleep(pause)
	}
}

// Retry implements gax.Retryer.
func (r *Retryer) Retry(err error) (pause time.Duration, shouldRetry bool) {
	defer func() {
		if !shouldRetry {
			r.cancelDeadlineCtx()
		}
	}()
	if err == nil {
		// no error returned, no need to retry
		return 0, false
	}
	if r.deadlineCtx.Err() != nil {
		// the parent ctx is done or the deadline has been reached,
		// so no retry is possible any longer
		return 0, false
	}
	if r.retries >= r.maxRetries {
		// no longer need to retry,
		// already exhausted our retry attempts
		return 0, false
	}
	// do not retry in case we have a non-retryable gRPC error
	st, ok := status.FromError(err)
	if !ok {
		// not a gRPC status error, so we cannot classify it; do not retry
		return 0, false
	}
	switch st.Code() {
	case codes.Unavailable, codes.FailedPrecondition, codes.ResourceExhausted, codes.DataLoss:
		// these codes are considered retryable
	default:
		return 0, false
	}
	// shrink the Max pause so the total retry time stays as close as possible to our max retry deadline
	elapsedTime := time.Since(r.startTime)
	r.backoff.Max = r.maxRetryDeadlineOffset - elapsedTime
	// retry, pausing for the duration indicated by the gax back-off algorithm
	r.retries++
	return r.backoff.Pause(), true
}
```
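For clarity, here is a hypothetical usage sketch of the above (my addition, not part of the fork); appendBatch stands in for whatever operation should be retried, e.g. a single append in the forked writer:

```go
// appendBatch is a hypothetical stand-in for the operation being retried.
func appendBatch(ctx context.Context) error {
	// ... perform the append here, honoring ctx cancellation ...
	return nil
}

func appendWithRetries(ctx context.Context) error {
	r := NewRetryer(ctx,
		5,                    // maxRetries
		250*time.Millisecond, // initialRetryDelay
		time.Minute,          // maxRetryDeadlineOffset: hard cap on total retry time
		2.0,                  // retryDelayMultiplier
	)
	return r.RetryOp(appendBatch)
}
```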
GlenDC added the triage me label on Nov 5, 2021
product-auto-label bot added the api: bigquery label on Nov 5, 2021

GlenDC commented Nov 5, 2021

Besides this feature request, I was also wondering how realistic it is for this managed writer to become a production-ready part of the Storage API, as it is currently still marked as experimental.

shollyman (Contributor) commented:

Thanks for filing this issue. Retry behavior is something that's still not fully complete in the veneer, so I'll keep this in mind when extending it.

The primary reason the managedwriter package is still marked experimental is that I'm not yet satisfied that the call signatures are final; once that's done we can promote the package. For example, #5078, merged last week, added call option support and was a breaking change to the existing signatures.

shollyman added the priority: p1 (Important issue which blocks shipping the next release) and type: feature request labels, and removed the triage me label, on Nov 8, 2021

GlenDC commented Nov 8, 2021

Ok super, thanks! If you need help or contributions, please do ask :)


GlenDC commented Nov 9, 2021

@shollyman I believe this issue does deserve a high priority, but I guess that's what P1 means?

I was under the impression that you couldn't quickly run into a situation where you would retry forever, but apparently an error such as:

ready append resulted in error: rpc error: code = InvalidArgument desc = The proto field mismatched with BigQuery field at benchmark_TemporaryDataProto2.timestamp, the proto field type message, BigQuery field type TIMESTAMP Entity: projects/oi-bigquery/datasets/benchmarks_bqwriter/tables/tmp/_default

which you can very easily get if your proto model mismatches the BQ model of the table you are using,
can land you in an infinite retry loop as well.

Having the default gax.Retryer work with an absolute deadline would already help resolve that, but on top of that, being able to use a custom gax.Retryer would be even more awesome IMHO.
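To make that concrete, here is a minimal sketch (mine, not a proposed API) of how any gax.Retryer could be bounded by an absolute deadline with a small wrapper:

```go
package retryexample

import (
	"time"

	"github.com/googleapis/gax-go/v2"
)

// deadlineRetryer wraps another gax.Retryer and refuses further retries once
// an absolute wall-clock deadline has passed.
type deadlineRetryer struct {
	inner    gax.Retryer
	deadline time.Time
}

func (r *deadlineRetryer) Retry(err error) (time.Duration, bool) {
	if time.Now().After(r.deadline) {
		return 0, false // deadline reached: stop retrying
	}
	pause, ok := r.inner.Retry(err)
	if !ok {
		return 0, false
	}
	// never pause past the deadline itself
	if remaining := time.Until(r.deadline); pause > remaining {
		pause = remaining
	}
	return pause, true
}
```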


GlenDC commented Nov 12, 2021

@shollyman, shall I make a PR for this as my first code contribution to the BigQuery Go API? Or is this something you want to discuss further first?

GlenDC pushed a commit to GlenDC/google-cloud-go that referenced this issue Nov 12, 2021
and also use the ManagedStreamer's call options
for the append rows call

resolves googleapis#5094
GlenDC pushed a commit to GlenDC/google-cloud-go that referenced this issue Nov 12, 2021
and also use the ManagedStreamer's call options
for the append rows call

Refs: googleapis#5094
GlenDC pushed a commit to GlenDC/google-cloud-go that referenced this issue Nov 12, 2021
…gedwriter

and also use the ManagedStreamer's call options
for the append rows call

Refs: googleapis#5094
shollyman (Contributor) commented:

Starting to get back to this, thanks for putting together that initial PR. Some thoughts:

  • We don't currently have a user-accessible CallOptions mechanism that lets users specify their own Retryer for stream re-opening. client.NewManagedStream() already has a variadic argument, so it probably makes sense to augment via a new WriterOption to specify call options (in addition to the ones we inject for routing); see the sketch after this list. The underlying gRPC AppendRows() client call which opens the stream already has its own retryer mechanism, so the additional retryer in openWithRetry may be unnecessary.

  • Another aspect to consider is exposing the underlying CallOptions on the Client itself, so the client can be instantiated with customizations rather than having to supply them as part of the individual method invocations. These should probably be wrapped rather than just exposing the raw client's CallOptions directly, and then merging them into the raw client as needed.

  • Retries for individual data appends are likely to be somewhat stateful in more advanced scenarios, and thus the gax Retryer interface is insufficient, as it only has access to the error. We don't yet support retries here in a meaningful way, but as the receive processor evolves we'll get a bit more clarity on what's needed.

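To illustrate the first bullet above, here is a rough sketch of what such a WriterOption could look like. ManagedStream and WriterOption are simplified stand-ins for the managedwriter types, and the option name is hypothetical:

```go
package sketch

import "github.com/googleapis/gax-go/v2"

// ManagedStream is a simplified stand-in for managedwriter.ManagedStream.
type ManagedStream struct {
	// callOptions would be merged with the options the writer injects itself
	// (e.g. for routing) whenever AppendRows is invoked.
	callOptions []gax.CallOption
}

// WriterOption mirrors the existing variadic-option pattern of
// client.NewManagedStream().
type WriterOption func(*ManagedStream)

// WithAppendRowsCallOption (hypothetical name) lets callers attach their own
// gax.CallOptions, including a gax.WithRetry retryer, to the underlying
// AppendRows call.
func WithAppendRowsCallOption(opts ...gax.CallOption) WriterOption {
	return func(ms *ManagedStream) {
		ms.callOptions = append(ms.callOptions, opts...)
	}
}
```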

GlenDC commented Dec 7, 2021

Given point (3), @shollyman, I suspect that for now this PR can remain stale, as we might not want to extend or modify the API until it is clear what is needed here?

Can you share some examples of where a more stateful approach is needed for such advanced scenarios?
And for those examples, can you also show why the gax.Retryer would be insufficient?

shollyman added the priority: p3 label and removed the priority: p1 label on Mar 3, 2023