release-24.3: colexec: harden eager cancellation in parallel unordered sync #134609

Merged 1 commit on Nov 8, 2024

Commits on Oct 30, 2024

  1. colexec: harden eager cancellation in parallel unordered sync

    This commit hardens the eager cancellation mechanism in the parallel
    unordered synchronizer (PUS). The mechanism was recently fixed in
    dda8b3a, but the newly added test exposed a bug where eager
    cancellation in a child PUS could poison the query execution of
    another input of the parent PUS, incorrectly failing the query
    altogether. A more detailed description can be found in a comment on
    #127942, but in short: because most operators in a flow share the same
    leaf txn, eager cancellation of one operator could poison the
    execution of another operator. This can only happen with a hierarchy
    of PUSes. This commit fixes the situation by swallowing all context
    cancellation errors in the draining state of a PUS _even if_ that
    particular PUS didn't eagerly cancel its inputs.
    
    The rationale for why this behavior is safe is the following:
    - if the query should result in an error, then some other error must
    have been propagated to the client, and this is what caused the sync to
    transition into the draining state in the first place. (We do replace
    errors for the client in one case - see `DistSQLReceiver.SetError`,
    where some errors from KV have higher priority than others - but it
    isn't applicable here.)
    - if the query should not result in an error and should succeed, yet we
    have some pending context cancellation errors, then it must be the case
    that query execution was short-circuited (e.g. because of the LIMIT), so
    we can pretend the part of the execution that hit the pending error
    didn't actually run since clearly it wasn't necessary to compute the
    query result.
    
    Note that we couldn't swallow all types of errors in the draining state
    (e.g. ReadWithinUncertaintyIntervalError that comes from the KV layer
    results in "poisoning" the txn, so we need to propagate it to the
    client), so we only have a single error type that we swallow.
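
The rule described above (while draining, swallow context cancellation and propagate every other error) can be sketched as follows. This is a minimal illustration, not the code from this commit: `syncState`, `filterErr`, and the two sample errors are hypothetical names introduced here, and the real synchronizer's state machine is assumed to be richer than two states.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// syncState is a hypothetical stand-in for the synchronizer's state.
type syncState int

const (
	stateRunning syncState = iota
	stateDraining
)

// Sample errors used only for this illustration.
var (
	errWrappedCancel = fmt.Errorf("reading input: %w", context.Canceled)
	errOther         = errors.New("hypothetical error that must reach the client")
)

// filterErr is a hypothetical helper. While draining, it drops context
// cancellation errors (including wrapped ones) and returns every other
// error unchanged. Outside of draining it never drops anything.
func filterErr(state syncState, err error) error {
	if state == stateDraining && errors.Is(err, context.Canceled) {
		return nil
	}
	return err
}

func main() {
	fmt.Println(filterErr(stateDraining, errWrappedCancel) == nil) // swallowed
	fmt.Println(filterErr(stateRunning, errWrappedCancel) == nil)  // propagated
	fmt.Println(filterErr(stateDraining, errOther) == nil)         // propagated
}
```

Using `errors.Is` rather than a direct comparison means a cancellation error is still recognized after intermediate layers have wrapped it.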
    
    Also note that having another PUS is needed for this problem to occur
    because we must have concurrency between the child PUS that performs the
    eager cancellation and another operator that gets poisoned, and while we
    have two sources of concurrency within a single flow, only the PUS is
    applicable (the other is outboxes, but we only have eager cancellation
    for local plans).
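
The interleaving described above can be illustrated with a toy model. Everything in it is an assumption made for illustration: `toyTxn` is a hypothetical stand-in for the shared txn, and the rule that a read under a canceled context makes all later reads fail is a simplification, not a description of the KV client. The two goroutines stand for an input of the child synchronizer and another input of the parent synchronizer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// toyTxn is a hypothetical stand-in for a txn shared by several operators.
// Assumption for illustration only: once one read observes a canceled
// context, every later read on the same txn returns that error too.
type toyTxn struct {
	mu       sync.Mutex
	poisoned error
}

func (t *toyTxn) read(ctx context.Context) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.poisoned != nil {
		return t.poisoned
	}
	if err := ctx.Err(); err != nil {
		t.poisoned = err
	}
	return t.poisoned
}

// simulate runs two inputs concurrently against one toyTxn and returns the
// error seen by the sibling input, whose own context is never canceled.
func simulate() error {
	txn := &toyTxn{}
	childCtx, cancelChild := context.WithCancel(context.Background())
	childDone := make(chan struct{})

	var wg sync.WaitGroup
	var siblingErr error
	wg.Add(2)
	go func() { // input of the child synchronizer
		defer wg.Done()
		defer close(childDone)
		cancelChild() // eager cancellation by the child
		_ = txn.read(childCtx)
	}()
	go func() { // another input of the parent synchronizer
		defer wg.Done()
		<-childDone // force the unlucky interleaving
		siblingErr = txn.read(context.Background())
	}()
	wg.Wait()
	return siblingErr
}

func main() {
	err := simulate()
	fmt.Println(errors.Is(err, context.Canceled)) // prints "true"
}
```

In this model the sibling input sees a cancellation error even though nothing canceled its own context, which is why the parent has to tolerate such errors while draining.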
    
    Additionally, while working on this change I realized another reason
    why we don't want to lift the restriction that limits eager
    cancellation to "leaf" PUSes, so I extended the comment. This commit
    also adds a few more logic tests.
    
    Release note: None
    yuzefovich committed Oct 30, 2024
    0818fe3