Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: sql: fix internal executor when it encounters a retry error #101857

Merged
merged 5 commits into from
Apr 23, 2023

Conversation

blathers-crl[bot]
Copy link

@blathers-crl blathers-crl bot commented Apr 19, 2023

Backport 5/5 commits from #101477 on behalf of @yuzefovich.

/cc @cockroachdb/release


This PR contains several commits that fix a long-standing bug in the internal executor that could make it double count some things (rows or metadata) if an internal retry error occurs. In particular, since at least 21.1 (when we introduced the "streaming" internal executor in #59330) if the IE encounters a retry error after it has communicated some results to the client, it would proceed to retry the corresponding command as if the incomplete execution and the retry error never happened. In other words, it was possible for some rows to be double "counted" (either to be directly included multiple times into the result set or indirectly into the "rows affected" number). This PR fixes the problem by returning the retry error to the client in cases when some actual rows have already been communicated and by resetting the number of "rows affected" when "rewinding the stmt buffer" in order to retry the error transparently.

Fixes: #98558.


Release justification: bug fix.

This commit cleans up `streamingCommandResult` a bit by storing `CmdPos`
directly in it (which allows us to remove redundant `resWithPos`
wrapper) and removing a redundant argument in `closeCallback`. This is
effectively a purely mechanical change.

Also fix up some comments.

Release note: None
This commit fixes a bug with the internal executor when it encounters an
internal retry. Previously, the implementation of the "rewind
capability" was such that we always assumed that we can rewind the
StmtBuf for the IE at any point; however, that is not generally true. In
particular, if we communicated anything to the client of the IE's
connExecutor (`rowsIterator` in this context), then we cannot rewind the
current command we're evaluating. In theory, we could have pushed some
rows through the iterator only to encounter the internal retry later which
would then lead to re-executing the command from the start (in other
words, the rows that we've pushed before the retry would be
double-pushed); in practice, however, we haven't actually seen this (at
least yet). What we have seen is the number of rows affected being
double counted. This particular reproduction was already fixed in the
previous commit, but this commit fixes the problem more generally.

This commit makes it so that for every `streamingCommandResult` we are
tracking whether it has already communicated something to the client so
that the result can no longer be rewound, and then we use that tracking
mechanism to correctly implement the rewind capability. We have three
possible types of data that we communicate:
- rows
- number of rows affected
- column schema.

If the retry happens after some rows have been communicated, we're out
of luck - there is no way we can retry the stmt internally, so from now
on we will return a retry error to the client.

If the retry happens after "rows affected", then given the adjustment in
the previous commit we can proceed transparently.

In order to avoid propagating the retry error up when it occurs after
having received the column schema but before pushing out any rows, this
commit adjusts the behavior to always keep the latest column schema,
thus, we can still proceed transparently in this case.

This bug has been present since at least 21.1 when
`streamingCommandResult` was introduced. However, since we now might
return a retry error in some cases, this could lead to test failures or
flakes, or even to errors in some internal CRDB operations that execute
statements of ROWS type (if there is no appropriate retry logic), so
I intend to only backport this to 23.1. There is also no release note
since the only failure we've seen is about double counted "rows
affected" number, the likelihood of which has significantly increased
due to the jobs system refactor (i.e. mostly 23.1 is affected AFAIK).

Additionally, this commit makes it so that we correctly block the
`execInternal` call until the first actual, non-metadata result is seen
(this behavior is needed to correctly synchronize access to the txn
before the stmt is given to the execution engine).

Release note: None
This commit teaches the internal executor to avoid propagating the retry
errors to the client when executing `Exec{Ex}` methods. In those
methods, we only return the number of rows affected, thus, in order to
preserve the ability to transparently retry all statements (even if they
are of ROWS statement type), we simply need to be able to reset the rows
affected number on the `rowsIterator` if a result is being discarded
(which indicates that a retry error has been encountered and the rewind
mechanism is at play).

This structure relies on the assumption that we push at most one command
into the StmtBuf that results in "rows affected" which is true at the
moment of writing, and I don't foresee that changing, at least for now.
(This assumption would be incorrect only if we were to allow executing
multiple statements as part of a single method invocation on the
internal executor, but I don't think that's going to happen - we only
might do so in a few tests.)

Release note: None
This commit changes the `RestrictedCommandResult` interface a bit to
make `SetRowsAffected` just set the number of rows affected, rather than
increment it. This method is supposed to be called only once, so this
change seems reasonable on its own, but it also has an additional
benefit for the follow-up commit. In particular, in the follow-up commit
we will preserve the ability to automatically retry statements of
non-ROWS stmt type by the internal executor.

Note that this commit on its own fixes the behavior for the "rows
affected" statements (the only known occurrence of the bug that is
generally fixed in the following commit).

Release note: None
@blathers-crl blathers-crl bot requested a review from a team as a code owner April 19, 2023 18:04
@blathers-crl blathers-crl bot requested a review from a team April 19, 2023 18:04
@blathers-crl blathers-crl bot requested review from a team as code owners April 19, 2023 18:04
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1-101477 branch from 68c2546 to e5f1477 Compare April 19, 2023 18:04
@blathers-crl blathers-crl bot requested review from miretskiy and removed request for a team April 19, 2023 18:04
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Apr 19, 2023
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1-101477 branch from 180ec2d to 2878683 Compare April 19, 2023 18:04
@blathers-crl
Copy link
Author

blathers-crl bot commented Apr 19, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich
Copy link
Member

I'll let this bake on master till beginning of next week.

Interestingly, the CI failure on this PR is concerning (it's INFO: slow quiesce. on the newly-added test), looking into that.

@yuzefovich
Copy link
Member

The CI failure turned to be unrelated to this PR - discussed on slack https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1681935584627449.

@yuzefovich yuzefovich merged commit 33677fa into release-23.1 Apr 23, 2023
@yuzefovich yuzefovich deleted the blathers/backport-release-23.1-101477 branch April 23, 2023 23:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants