Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: release quota on failed Raft proposals #38632

Merged

Commits on Jul 3, 2019

  1. Configuration menu
    Copy the full SHA
    67b57a0 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    03a984c View commit details
    Browse the repository at this point in the history
  3. storage: pull out propBufCntRes.arrayLen method

    This avoids confusing arithmetic in a few spots.
    
    Release note: None
    nvanbenschoten committed Jul 3, 2019
    Configuration menu
    Copy the full SHA
    f991d76 View commit details
    Browse the repository at this point in the history
  4. storage: be extra paranoid about dropped proposals when flushing propBuf

    This commit makes propBuf.FlushLockedWithRaftGroup more careful about accidentally
    dropping proposals when flushing its array. In most cases, any error will result
    in a log.Fatal call further up the stack anyway, but it's good to be careful about
    this stuff or we risk leaking proposals. We now make sure to register all proposals
    with the proposer after an error, even if though we stop attempting to propose entries
    with the Raft group.
    
    Release note: None
    nvanbenschoten committed Jul 3, 2019
    Configuration menu
    Copy the full SHA
    ef02100 View commit details
    Browse the repository at this point in the history
  5. storage: release quota on failed Raft proposals

    Fixes cockroachdb#34180.
    Fixes cockroachdb#35493.
    Fixes cockroachdb#36983.
    Fixes cockroachdb#37108.
    Fixes cockroachdb#37371.
    Fixes cockroachdb#37384.
    Fixes cockroachdb#37551.
    Fixes cockroachdb#37879.
    Fixes cockroachdb#38095.
    Fixes cockroachdb#38131.
    Fixes cockroachdb#38136.
    Fixes cockroachdb#38549.
    Fixes cockroachdb#38552.
    Fixes cockroachdb#38555.
    Fixes cockroachdb#38560.
    Fixes cockroachdb#38562.
    Fixes cockroachdb#38563.
    Fixes cockroachdb#38569.
    Fixes cockroachdb#38578.
    Fixes cockroachdb#38600.
    
    _A lot of the early issues fixed by this had previous failures, but nothing
    very recent or actionable. I think it's worth closing them now that they
    should be fixed in the short term._
    
    This fixes a bug introduced in 1ff3556 where Raft proposal quota is
    no longer released when Replica.propose fails. This used to happen
    [here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
    but that code was accidentally lost in the rewrite.
    
    I tracked this down by running a series of `import/tpch/nodes=4` and
    `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
    import would stall after a few hours and the roachtest health reports
    would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
    I tracked the stalled latch acquisition to a stalled proposal quota acquisition
    by a conflicting command. The range debug page showed the following:
    
    <image>
    
    We see that the leaseholder of the Range has no pending commands
    but also no available proposal quota. This indicates a proposal
    quota leak, which led to me finding the lost release in this
    error case.
    
    The (now confirmed) theory for what went wrong in these roachtests is that
    they are performing imports, which generate a large number of AddSSTRequests.
    These requests are typically larger than the available proposal quota
    for a range, meaning that they request all of its available quota. The
    effect of this is that if even a single byte of quota is leaked, the entire
    range will seize up and stall when an AddSSTRequests is issued.
    Instrumentation revealed that a ChangeReplicas request with a quota size
    equal to the leaked amount was failing due to the error:
    ```
    received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
    ```
    Because of the missing error handling, this quota was not being released back
    into the pool, causing future requests to get stuck indefinitely waiting for
    leaked quota, stalling the entire import.
    
    Release note: None
    nvanbenschoten committed Jul 3, 2019
    Configuration menu
    Copy the full SHA
    ba3813c View commit details
    Browse the repository at this point in the history