bulk/kv: uploading to cloud storage in ExportRequest evaluation considered harmful #66486

Closed
nvanbenschoten opened this issue Jun 15, 2021 · 6 comments
Assignees
Labels
A-disaster-recovery, A-kv-transactions, C-bug, N-followup, O-postmortem, T-disaster-recovery, T-kv

Comments

@nvanbenschoten
Member

In #66338, we found that rate-limiting ExportRequests during evaluation could lead to transitive stalls in workload traffic, if interleaved poorly with a Split. That PR moved rate-limiting of ExportRequest above latching to avoid holding latches when not strictly necessary. We've also discussed ideas around dropping latches earlier for reads, before evaluation, in #66485. That issue seems challenging and large in scope, but promising.

However, for now, we need to continue to be careful about the duration that read-only requests hold read latches.

During a code audit earlier today, we found that ExportRequest can be configured to upload files to cloud storage directly during evaluation, while holding latches:

if exportStore != nil {
    exported.Path = GenerateUniqueSSTName(base.SQLInstanceID(cArgs.EvalCtx.NodeID()))
    if err := retry.WithMaxAttempts(ctx, base.DefaultRetryOptions(), maxUploadRetries, func() error {
        // We blindly retry any error here because we expect the caller to have
        // verified the target is writable before sending ExportRequests for it.
        if err := cloud.WriteFile(ctx, exportStore, exported.Path, bytes.NewReader(data)); err != nil {
            log.VEventf(ctx, 1, "failed to put file: %+v", err)
            return err
        }
        return nil
    }); err != nil {
        // (error handling elided in this excerpt)
    }
}
This seems potentially disastrous, as it means that we will be performing network operations during evaluation. In fact, we'll even retry this upload up to 5 times (maxUploadRetries). So it's hard to place any limit on the duration that a given Export request may run for. As a result, it's hard to place any limit on the duration that a given Export request may transitively block foreground reads and writes.
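To make that concrete, here is a back-of-the-envelope bound. The per-attempt upload time below is purely hypothetical, and the point is that cloud.WriteFile has no per-attempt timeout, so even this bound is optimistic (base.DefaultRetryOptions uses exponential backoff, roughly 50ms initial and capped at 1s, so the backoff itself is negligible):

package main

import (
    "fmt"
    "time"
)

func main() {
    // Hypothetical numbers purely for illustration.
    const maxUploadRetries = 5
    perAttempt := 30 * time.Second // e.g. a slow or flaky cloud endpoint

    // Approximate backoff between 5 attempts with ~50ms initial, 1s cap, 2x multiplier.
    backoff := time.Duration(0)
    next := 50 * time.Millisecond
    for i := 0; i < maxUploadRetries-1; i++ {
        backoff += next
        if next *= 2; next > time.Second {
            next = time.Second
        }
    }

    worstCase := maxUploadRetries*perAttempt + backoff
    fmt.Println("read latches held for up to roughly:", worstCase) // ~2m30s in this example
}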

I'd like to learn whether we need this capability and push to get rid of it. Even once we address #66485, it still seems like an abuse to touch the network during request evaluation, which is meant to operate in a sandboxed scope of a replica. That is simply not what the framework is meant for.

Interestingly, we do have a separate code path that avoids this. We have a way to specify that an ExportRequest should return an SST (using ReturnSST) instead of immediately uploading it. We can then perform the upload from the DistSQL backupProcessor:

// writeFile writes the data specified in the export response file to the backup
// destination. The ExportRequest will do this if its ReturnSST argument is set
// to false. In that case, we want to write the file from the processor.

This seems like a much more appropriate way to evaluate a backup. It also seems like it doesn't trade much in terms of performance when the backupProcessor is scheduled on the same node as the range's leaseholder. Either way, we're still pulling chunks into memory and then uploading them. The only difference is that we'll pull the chunk up a few levels higher in the stack.
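Roughly, the processor-side shape is something like the sketch below. This is illustrative only, not the actual backupProcessor code; it assumes the cloud.ExternalStorage/cloud.WriteFile helpers from the snippet above and a hypothetical writeReturnedSST helper:

// Sketch only: with ReturnSST set, the ExportResponse carries the SST bytes
// back to the DistSQL backupProcessor, which performs the upload itself.
// Crucially, no latches are held here, so a slow or retried upload only
// delays the backup flow, not foreground traffic on the range.
func writeReturnedSST(
    ctx context.Context, dest cloud.ExternalStorage, path string, sst []byte,
) error {
    return cloud.WriteFile(ctx, dest, path, bytes.NewReader(sst))
}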

Am I understanding all of this correctly? If so, what can we do here?

/cc. @dt @aayushshah15 @andreimatei

@nvanbenschoten nvanbenschoten added the C-bug, A-disaster-recovery, A-kv-transactions labels Jun 15, 2021
@nvanbenschoten
Member Author

It's worth pointing out that this would compose really badly with evaluation-time rate-limiting of ExportRequests. It should be less severe after #66338, since a few slow Exports can no longer cascade into blocking many ranges.

@dt
Member

dt commented Jun 15, 2021

The ReturnSST flag is currently only set by the non-system tenant's backupDataProcessor, where we want the per-tenant SQL pod process, not the shared KV layer, to be the one that does or doesn't get to make outbound network requests. We did some handwringing when we made that change over the cost of the extra hop in network, memory, etc. The overhead of backups is already a sensitive issue for some customers, so making it higher was something we were pretty wary of doing. Tenants may not have existing expectations, but any regression for existing non-tenant users might be poorly received.

If anything, I think we were planning to start making ExportRequest run longer in the near future. Currently it iterates into a buffer until some size limit, then stops, opens a remote file, and writes the content of the buffer. This has two drawbacks: one is that it forces the buffer, and thus the resulting file, to be sized to fit in memory, which is not great for us in the context of #44480. Another less common but more severe issue is ranges whose write traffic/TTL/etc. leaves them with large quantities of revisions of a single key. This is a problem because files have key boundaries, so all the revisions of any one key must be exported in one file. Ever since the max range size was upped, ranges can accumulate far more revisions of a key than Export will write based on its max file size, meaning clusters can get themselves into a state where they can't back themselves up.

To fix both of these, we've been working on changing it so we write directly while iterating: we've changed the API for our external IO to return an io.Writer, changed storage.ExportToSST to take an io.Writer, and are planning to change ExportRequest to open the remote file first and then pass it into the iteration, eliminating the buffer and, along with it, the size limit. This, however, would mean each export potentially evaluates for even longer. A rough sketch of that streaming shape is below (the store.Writer method and the ExportToSST signature here are illustrative, not the real APIs):
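// Sketch only: open the destination first, then let the export iteration
// stream into it, removing both the in-memory buffer and its size cap.
func exportStreaming(
    ctx context.Context, dest cloud.ExternalStorage, path string, startKey, endKey roachpb.Key,
) error {
    w, err := dest.Writer(ctx, path) // hypothetical io.WriteCloser-returning API
    if err != nil {
        return err
    }
    defer w.Close()
    // Illustrative signature: the change routes the export's output through
    // an io.Writer instead of accumulating it in a byte buffer.
    return storage.ExportToSST(ctx, startKey, endKey, w)
}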

It seems like #66485 should help a lot here, shouldn't it?

@miretskiy
Contributor

We should be able to evaluate the impact of always using ReturnSST though? Perhaps a few runs of a 2-4TB backup with/without this flag?

@jlinder jlinder added the T-kv label Jun 16, 2021
@dt
Member

dt commented Jun 16, 2021

In the meantime we can/should add a setting to opt into proxying writes to the sql proc, to make it easier to measure the overhead and/or give users who want it -- and whatever that overhead is -- that choice. Opened #66540.
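For illustration only, such an opt-in could look roughly like this; the setting name, description, and default are hypothetical, and the actual change is tracked in #66540:

// Hypothetical setting shown only to illustrate the shape of the opt-in.
var proxyBackupWrites = settings.RegisterBoolSetting(
    "bulkio.backup.proxy_file_writes.enabled",
    "if enabled, backup file writes are performed by the backup processor "+
        "rather than during ExportRequest evaluation",
    false,
)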

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 30, 2021
Related to cockroachdb#66486.

Command evaluation is meant to operate in a sandbox. It certainly
shouldn't have access to a DB handle.
@lunevalex lunevalex added the O-postmortem, N-followup labels Jul 1, 2021
craig bot pushed a commit that referenced this issue Jul 1, 2021
67094: kv: remove EvalContext.DB r=nvanbenschoten a=nvanbenschoten

Related to #66486.

Command evaluation is meant to operate in a sandbox. It certainly shouldn't have access to a `DB` handle.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten
Member Author

Further discussion moved to https://cockroachlabs.slack.com/archives/C2C5FKPPB/p1623782623130900.

@aliher1911
Contributor

Writes have been moved out from under request evaluation and into the processor.
