sql: fix internal executor when it encounters a retry error
This commit fixes a bug in how the internal executor handles an internal retry. Previously, the implementation of the "rewind capability" assumed that we can always rewind the StmtBuf for the IE at any point; however, that is not true in general. In particular, if we have communicated anything to the client of the IE's connExecutor (`rowsIterator` in this context), then we cannot rewind the command we're currently evaluating. In theory, we could have pushed some rows through the iterator and only then encountered the internal retry, which would lead to re-executing the command from the start (in other words, the rows pushed before the retry would be pushed twice); in practice, however, we haven't seen this (at least not yet). What we have seen is the number of rows affected being double-counted. That particular reproduction was already fixed in the previous commit, but this commit fixes the problem more generally.

This commit makes every `streamingCommandResult` track whether it has already communicated something to the client (after which the result can no longer be rewound), and then uses that tracking to implement the rewind capability correctly. There are three types of data that we communicate:
- rows
- number of rows affected
- column schema.

If the retry happens after some rows have been communicated, we're out of luck: there is no way to retry the stmt internally, so from now on we return a retry error to the client. If the retry happens after "rows affected" has been communicated, then, given the adjustment in the previous commit, we can proceed transparently. To avoid propagating the retry error when it occurs after the column schema has been received but before any rows have been pushed, this commit changes the behavior to always keep the latest column schema, so we can still proceed transparently in this case.
This bug has been present since at least 21.1, when `streamingCommandResult` was introduced. However, since we might now return a retry error in some cases, this could lead to test failures or flakes, or even to errors in some internal CRDB operations that execute statements of ROWS type (if they lack appropriate retry logic), so I intend to backport this only to 23.1. There is also no release note, since the only failure we've seen is the double-counted "rows affected" number, whose likelihood increased significantly with the jobs system refactor (i.e. mostly 23.1 is affected AFAIK).

Additionally, this commit makes `execInternal` correctly block until the first actual, non-metadata result is seen (this behavior is needed to correctly synchronize access to the txn before the stmt is given to the execution engine).

Release note: None
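The blocking behavior of `execInternal` can be sketched, very loosely, with an unbuffered channel: a producer goroutine stands in for the IE's connExecutor, and the call returns only after the first non-metadata result arrives. `ieResult` and `execInternalSketch` are hypothetical names; the sketch models only the blocking itself, not the txn synchronization that motivates it.

```go
package main

import "fmt"

// ieResult is a hypothetical message from the executor goroutine to the iterator.
type ieResult struct {
	isMetadata bool // hypothetical flag: carries metadata, not stmt output
	payload    string
}

// execInternalSketch starts run in its own goroutine and returns only once the
// first non-metadata result has been observed (ok == true), or once run has
// finished without producing one (ok == false). Metadata seen earlier is
// buffered in meta. The caller must drain rest so the producer can exit.
func execInternalSketch(run func(out chan<- ieResult)) (first ieResult, meta []ieResult, rest <-chan ieResult, ok bool) {
	ch := make(chan ieResult) // unbuffered: the producer waits for the consumer
	go func() {
		defer close(ch)
		run(ch)
	}()
	for r := range ch {
		if r.isMetadata {
			meta = append(meta, r)
			continue
		}
		return r, meta, ch, true
	}
	return ieResult{}, meta, ch, false
}

func main() {
	first, meta, rest, ok := execInternalSketch(func(out chan<- ieResult) {
		out <- ieResult{isMetadata: true, payload: "trace"}
		out <- ieResult{payload: "row 1"}
		out <- ieResult{payload: "row 2"}
	})
	fmt.Println(ok, len(meta), first.payload) // true 1 row 1
	for r := range rest {
		fmt.Println(r.payload) // row 2
	}
}
```

Because the channel is unbuffered, the producer cannot run ahead of the first real result, which is the property the blocking is meant to guarantee in this simplified model.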