release-23.1: sql: fix internal executor when it encounters a retry error #101857

blathers-crl · 2023-04-19T18:04:46Z

Backport 5/5 commits from #101477 on behalf of @yuzefovich.

/cc @cockroachdb/release

This PR contains several commits that fix a long-standing bug in the internal executor that could make it double count some things (rows or metadata) if an internal retry error occurs. In particular, since at least 21.1 (when we introduced the "streaming" internal executor in #59330) if the IE encounters a retry error after it has communicated some results to the client, it would proceed to retry the corresponding command as if the incomplete execution and the retry error never happened. In other words, it was possible for some rows to be double "counted" (either to be directly included multiple times into the result set or indirectly into the "rows affected" number). This PR fixes the problem by returning the retry error to the client in cases when some actual rows have already been communicated and by resetting the number of "rows affected" when "rewinding the stmt buffer" in order to retry the error transparently.

Fixes: #98558.

Release justification: bug fix.

This commit cleans up `streamingCommandResult` a bit by storing `CmdPos` directly in it (which allows us to remove redundant `resWithPos` wrapper) and removing a redundant argument in `closeCallback`. This is effectively a purely mechanical change. Also fix up some comments. Release note: None

This commit fixes a bug with the internal executor when it encounters an internal retry. Previously, the implementation of the "rewind capability" was such that we always assumed that we can rewind the StmtBuf for the IE at any point; however, that is not generally true. In particular, if we communicated anything to the client of the IE's connExecutor (`rowsIterator` in this context), then we cannot rewind the current command we're evaluating. In theory, we could have pushed some rows through the iterator only to encounter the internal retry later which would then lead to re-executing the command from the start (in other words, the rows that we've pushed before the retry would be double-pushed); in practice, however, we haven't actually seen this (at least yet). What we have seen is the number of rows affected being double counted. This particular reproduction was already fixed in the previous commit, but this commit fixes the problem more generally. This commit makes it so that for every `streamingCommandResult` we are tracking whether it has already communicated something to the client so that the result can no longer be rewound, and then we use that tracking mechanism to correctly implement the rewind capability. We have three possible types of data that we communicate: - rows - number of rows affected - column schema. If the retry happens after some rows have been communicated, we're out of luck - there is no way we can retry the stmt internally, so from now on we will return a retry error to the client. If the retry happens after "rows affected", then given the adjustment in the previous commit we can proceed transparently. In order to avoid propagating the retry error up when it occurs after having received the column schema but before pushing out any rows, this commit adjusts the behavior to always keep the latest column schema, thus, we can still proceed transparently in this case. This bug has been present since at least 21.1 when `streamingCommandResult` was introduced. However, since we now might return a retry error in some cases, this could lead to test failures or flakes, or even to errors in some internal CRDB operations that execute statements of ROWS type (if there is no appropriate retry logic), so I intend to only backport this to 23.1. There is also no release note since the only failure we've seen is about double counted "rows affected" number, the likelihood of which has significantly increased due to the jobs system refactor (i.e. mostly 23.1 is affected AFAIK). Additionally, this commit makes it so that we correctly block the `execInternal` call until the first actual, non-metadata result is seen (this behavior is needed to correctly synchronize access to the txn before the stmt is given to the execution engine). Release note: None

This commit teaches the internal executor to avoid propagating the retry errors to the client when executing `Exec{Ex}` methods. In those methods, we only return the number of rows affected, thus, in order to preserve the ability to transparently retry all statements (even if they are of ROWS statement type), we simply need to be able to reset the rows affected number on the `rowsIterator` if a result is being discarded (which indicates that a retry error has been encountered and the rewind mechanism is at play). This structure relies on the assumption that we push at most one command into the StmtBuf that results in "rows affected" which is true at the moment of writing, and I don't foresee that changing, at least for now. (This assumption would be incorrect only if we were to allow executing multiple statements as part of a single method invocation on the internal executor, but I don't think that's going to happen - we only might do so in a few tests.) Release note: None

Release note: None

This commit changes the `RestrictedCommandResult` interface a bit to make `SetRowsAffected` just set the number of rows affected, rather than increment it. This method is supposed to be called only once, so this change seems reasonable on its own, but it also has an additional benefit for the follow-up commit. In particular, in the follow-up commit we will preserve the ability to automatically retry statements of non-ROWS stmt type by the internal executor. Note that this commit on its own fixes the behavior for the "rows affected" statements (the only known occurrence of the bug that is generally fixed in the following commit). Release note: None

blathers-crl · 2023-04-19T18:04:50Z

cockroach-teamcity · 2023-04-19T18:05:10Z

This change is

yuzefovich · 2023-04-19T20:04:35Z

I'll let this bake on master till beginning of next week.

Interestingly, the CI failure on this PR is concerning (it's INFO: slow quiesce. on the newly-added test), looking into that.

yuzefovich · 2023-04-19T20:30:45Z

The CI failure turned to be unrelated to this PR - discussed on slack https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1681935584627449.

yuzefovich added 5 commits April 13, 2023 15:28

sql: document how the machinery of the internal executor fits together

dbdfd6c

Release note: None

blathers-crl bot requested a review from a team as a code owner April 19, 2023 18:04

blathers-crl bot requested a review from a team April 19, 2023 18:04

blathers-crl bot requested review from a team as code owners April 19, 2023 18:04

blathers-crl bot force-pushed the blathers/backport-release-23.1-101477 branch from 68c2546 to e5f1477 Compare April 19, 2023 18:04

blathers-crl bot requested review from miretskiy and removed request for a team April 19, 2023 18:04

blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Apr 19, 2023

blathers-crl bot force-pushed the blathers/backport-release-23.1-101477 branch from 180ec2d to 2878683 Compare April 19, 2023 18:04

blathers-crl bot assigned yuzefovich Apr 19, 2023

blathers-crl bot requested review from rytaft, cucaroach, knz, rafiss, yuzefovich and ZhouXing19 April 19, 2023 18:04

yuzefovich removed request for miretskiy and rytaft April 19, 2023 18:10

knz approved these changes Apr 19, 2023

View reviewed changes

yuzefovich merged commit 33677fa into release-23.1 Apr 23, 2023

yuzefovich deleted the blathers/backport-release-23.1-101477 branch April 23, 2023 23:49

yuzefovich mentioned this pull request Apr 23, 2023

jobs: TestJobControlByType failed #99200

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-23.1: sql: fix internal executor when it encounters a retry error #101857

release-23.1: sql: fix internal executor when it encounters a retry error #101857

blathers-crl bot commented Apr 19, 2023 •

edited by yuzefovich

Loading

blathers-crl bot commented Apr 19, 2023 •

edited by yuzefovich

Loading

cockroach-teamcity commented Apr 19, 2023

yuzefovich commented Apr 19, 2023

yuzefovich commented Apr 19, 2023

release-23.1: sql: fix internal executor when it encounters a retry error #101857

release-23.1: sql: fix internal executor when it encounters a retry error #101857

Conversation

blathers-crl bot commented Apr 19, 2023 • edited by yuzefovich Loading

blathers-crl bot commented Apr 19, 2023 • edited by yuzefovich Loading

cockroach-teamcity commented Apr 19, 2023

yuzefovich commented Apr 19, 2023

yuzefovich commented Apr 19, 2023

blathers-crl bot commented Apr 19, 2023 •

edited by yuzefovich

Loading

blathers-crl bot commented Apr 19, 2023 •

edited by yuzefovich

Loading