-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-23.1: sql: fix internal executor when it encounters a retry error #101857
release-23.1: sql: fix internal executor when it encounters a retry error #101857
Conversation
This commit cleans up `streamingCommandResult` a bit by storing `CmdPos` directly in it (which allows us to remove redundant `resWithPos` wrapper) and removing a redundant argument in `closeCallback`. This is effectively a purely mechanical change. Also fix up some comments. Release note: None
This commit fixes a bug with the internal executor when it encounters an internal retry. Previously, the implementation of the "rewind capability" was such that we always assumed that we can rewind the StmtBuf for the IE at any point; however, that is not generally true. In particular, if we communicated anything to the client of the IE's connExecutor (`rowsIterator` in this context), then we cannot rewind the current command we're evaluating. In theory, we could have pushed some rows through the iterator only to encounter the internal retry later which would then lead to re-executing the command from the start (in other words, the rows that we've pushed before the retry would be double-pushed); in practice, however, we haven't actually seen this (at least yet). What we have seen is the number of rows affected being double counted. This particular reproduction was already fixed in the previous commit, but this commit fixes the problem more generally. This commit makes it so that for every `streamingCommandResult` we are tracking whether it has already communicated something to the client so that the result can no longer be rewound, and then we use that tracking mechanism to correctly implement the rewind capability. We have three possible types of data that we communicate: - rows - number of rows affected - column schema. If the retry happens after some rows have been communicated, we're out of luck - there is no way we can retry the stmt internally, so from now on we will return a retry error to the client. If the retry happens after "rows affected", then given the adjustment in the previous commit we can proceed transparently. In order to avoid propagating the retry error up when it occurs after having received the column schema but before pushing out any rows, this commit adjusts the behavior to always keep the latest column schema, thus, we can still proceed transparently in this case. This bug has been present since at least 21.1 when `streamingCommandResult` was introduced. However, since we now might return a retry error in some cases, this could lead to test failures or flakes, or even to errors in some internal CRDB operations that execute statements of ROWS type (if there is no appropriate retry logic), so I intend to only backport this to 23.1. There is also no release note since the only failure we've seen is about double counted "rows affected" number, the likelihood of which has significantly increased due to the jobs system refactor (i.e. mostly 23.1 is affected AFAIK). Additionally, this commit makes it so that we correctly block the `execInternal` call until the first actual, non-metadata result is seen (this behavior is needed to correctly synchronize access to the txn before the stmt is given to the execution engine). Release note: None
This commit teaches the internal executor to avoid propagating the retry errors to the client when executing `Exec{Ex}` methods. In those methods, we only return the number of rows affected, thus, in order to preserve the ability to transparently retry all statements (even if they are of ROWS statement type), we simply need to be able to reset the rows affected number on the `rowsIterator` if a result is being discarded (which indicates that a retry error has been encountered and the rewind mechanism is at play). This structure relies on the assumption that we push at most one command into the StmtBuf that results in "rows affected" which is true at the moment of writing, and I don't foresee that changing, at least for now. (This assumption would be incorrect only if we were to allow executing multiple statements as part of a single method invocation on the internal executor, but I don't think that's going to happen - we only might do so in a few tests.) Release note: None
Release note: None
This commit changes the `RestrictedCommandResult` interface a bit to make `SetRowsAffected` just set the number of rows affected, rather than increment it. This method is supposed to be called only once, so this change seems reasonable on its own, but it also has an additional benefit for the follow-up commit. In particular, in the follow-up commit we will preserve the ability to automatically retry statements of non-ROWS stmt type by the internal executor. Note that this commit on its own fixes the behavior for the "rows affected" statements (the only known occurrence of the bug that is generally fixed in the following commit). Release note: None
68c2546
to
e5f1477
Compare
180ec2d
to
2878683
Compare
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
|
I'll let this bake on master till beginning of next week. Interestingly, the CI failure on this PR is concerning (it's |
The CI failure turned to be unrelated to this PR - discussed on slack https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1681935584627449. |
Backport 5/5 commits from #101477 on behalf of @yuzefovich.
/cc @cockroachdb/release
This PR contains several commits that fix a long-standing bug in the internal executor that could make it double count some things (rows or metadata) if an internal retry error occurs. In particular, since at least 21.1 (when we introduced the "streaming" internal executor in #59330) if the IE encounters a retry error after it has communicated some results to the client, it would proceed to retry the corresponding command as if the incomplete execution and the retry error never happened. In other words, it was possible for some rows to be double "counted" (either to be directly included multiple times into the result set or indirectly into the "rows affected" number). This PR fixes the problem by returning the retry error to the client in cases when some actual rows have already been communicated and by resetting the number of "rows affected" when "rewinding the stmt buffer" in order to retry the error transparently.
Fixes: #98558.
Release justification: bug fix.