storage: integrate Concurrency Manager into Replica request path #45482
Conversation
Force-pushed from 414a4c2 to 53393f1
Looks good -- only some minor comments.
Reviewed 12 of 12 files at r1, 12 of 12 files at r2, 3 of 3 files at r3, 40 of 40 files at r4, 30 of 30 files at r5, 1 of 1 files at r6, 3 of 3 files at r7, 1 of 1 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/replica.go, line 1153 at r5 (raw file):
// Release the latches acquired by the request back to the spanlatch
// manager. Must be done AFTER the timestamp cache is updated. ec.g is only
// when the Raft proposal has assumed responsibility for releasing latches.
word missing after "only"?
pkg/storage/replica_read.go, line 112 at r5 (raw file):
if lResult.WrittenIntents != nil { // These will all be unreplicated locks.
are reads that acquire unreplicated locks only in a read-only batch?
And how does this relate to the TODO earlier in this file about releasing latches early. If the read wants to acquire unreplicated locks it can't release latches, yes? So that TODO only applies to non-locking reads?
pkg/storage/replica_send.go, line 134 at r5 (raw file):
// 2. is continuing to execute asynchronously and needs to maintain isolation
// from conflicting requests throughout the lifetime of its asynchronous
// processing.
what is an example of this asynchronous processing?
pkg/storage/batcheval/result/result.go, line 35 at r2 (raw file):
// to asynchronous intent processing on the proposer, so that an attempt to
// resolve them is made.
EncounteredIntents []roachpb.Intent
I got mired in plumbing code and couldn't find where these are actually used -- could you point me to it?
And why are we trying to do anything with intents that it did not conflict with -- isn't it wasteful given that those transactions may be ongoing and there is no conflict?
pkg/storage/batcheval/result/result.go, line 39 at r2 (raw file):
WrittenIntents []roachpb.LockUpdate
// ResolvedIntents stores any resolved intent spans, either with finalized
// or pending statuses. Unlike WrittenIntents and EncounteredIntents, values
what exactly does "finalized or pending statuses" mean in this case -- is it referring to the TransactionStatus?
pkg/storage/concurrency/concurrency_control.go, line 318 at r4 (raw file):
// The maximal set of spans that the request will access. Latches
// will be acquired for these spans.
// TODO(nvanbenschoten): don't allocate these SpanSet objects.
how will this "don't allocate" be accomplished?
pkg/storage/concurrency/concurrency_manager.go, line 344 at r7 (raw file):
// state transition, it is safer to simply clear the lockTable and rebuild
// it from persistent intent state.
m.lt.Clear()
The concurrency manager exists only on the leaseholder for the range, yes?
Is this the case where the raft leader is at a different node?
pkg/storage/concurrency/lock_table_waiter.go, line 161 at r1 (raw file):
// and quickly resolve abandoned intents then we might be able to
// get rid of this state.
delay := LockTableLivenessPushDelay.Get(&w.st.SV)
Doesn't the comment about waiting "for each" contradict this change to add a delay?
pkg/storage/concurrency/lock_table_waiter.go, line 245 at r3 (raw file):
obsTS, ok := h.Txn.GetObservedTimestamp(w.nodeID)
if !ok {
	// This was set earlier, so it's completely unexpected to
why is `!ok` not an error anymore?
pkg/storage/concurrency/lock_table_waiter_test.go, line 67 at r1 (raw file):
ir := &mockIntentResolver{}
st := cluster.MakeTestingClusterSettings()
LockTableLivenessPushDelay.Override(&st.SV, 1*time.Millisecond)
is there a reason this is overridden to 1ms and not 0ms in tests?
pkg/storage/intentresolver/intent_resolver.go, line 277 at r5 (raw file):
// MaybePushTransactions tries to push the conflicting transaction(s):
// either moving their timestamp forward on a read/write conflict, abortint
s/abortint/aborting/
Force-pushed from 53393f1 to f6761da
Thanks for the review!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/storage/replica.go, line 1153 at r5 (raw file):
Previously, sumeerbhola wrote…
word missing after "only"?
Done.
pkg/storage/replica_read.go, line 112 at r5 (raw file):
are reads that acquire unreplicated locks only in a read-only batch?
Yes, although they will acquire write latches to synchronize properly with other SFU reads. This is a little strange, given that we're in `executeReadOnlyBatch`. I think we need to reframe this function and `executeWriteBatch` to make it clear that the most fundamental distinction between them is that the latter can propose to Raft and the former cannot.
And how does this relate to the TODO earlier in this file about releasing latches early. If the read wants to acquire unreplicated locks it can't release latches, yes? So that TODO only applies to non-locking reads?
Great point, I added to the comment.
pkg/storage/replica_send.go, line 134 at r5 (raw file):
Added to the comment:
The most prominent example of asynchronous processing is
with requests that have the "async consensus" flag set. A more subtle
case is with requests that are acknowledged by the Raft machinery after
their Raft entry has been committed but before it has been applied to
the replicated state machine. In all of these cases, responsibility
for releasing the concurrency guard is handed to Raft.
pkg/storage/batcheval/result/result.go, line 35 at r2 (raw file):
could you point me to it?
They are passed to the IntentResolver here.
And why are we trying to do anything with intents that it did not conflict with -- isn't it wasteful given that those transactions may be ongoing and there is no conflict?
`CleanupIntentsAsync` uses a `PUSH_TOUCH` push RPC, so it simply checks whether the intent's transaction is alive or not. If it is, it doesn't wait around for it to finish like a `PUSH_ABORT`-style push does.
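To make the distinction concrete, here is a small stand-alone sketch; the enum below is only a stand-in for `roachpb.PushTxnType` and the helper name is hypothetical:

```go
package pushdemo

// PushTxnType is a stand-in for roachpb.PushTxnType; only the two variants
// discussed above are shown.
type PushTxnType int

const (
	// PUSH_TOUCH only checks whether the pushee's transaction record is still
	// live; if the pushee is alive, the push returns without waiting.
	PUSH_TOUCH PushTxnType = iota
	// PUSH_ABORT tries to abort the pushee and so must wait out (or
	// deadlock-detect against) a live conflicting transaction.
	PUSH_ABORT
)

// pushTypeFor (hypothetical) picks the push type: best-effort cleanup of
// intents that were merely encountered uses PUSH_TOUCH, while resolving a
// direct conflict uses PUSH_ABORT.
func pushTypeFor(cleanupOnly bool) PushTxnType {
	if cleanupOnly {
		return PUSH_TOUCH
	}
	return PUSH_ABORT
}
```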
pkg/storage/batcheval/result/result.go, line 39 at r2 (raw file):
is it referring to the TransactionStatus?
Yes, and we call committed or aborted statuses "finalized". See `TransactionStatus.IsFinalized`.
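For readers who haven't seen it, a self-contained sketch of what "finalized" means; the local type below only mirrors the statuses discussed, the real enum and method live on `roachpb.TransactionStatus`:

```go
package main

import "fmt"

// TransactionStatus mirrors the roachpb enum for the purpose of this sketch.
type TransactionStatus int

const (
	PENDING TransactionStatus = iota
	STAGING
	COMMITTED
	ABORTED
)

// IsFinalized reports whether the status can no longer change: only COMMITTED
// and ABORTED are finalized; a PENDING (or STAGING) transaction may still move.
func (ts TransactionStatus) IsFinalized() bool {
	return ts == COMMITTED || ts == ABORTED
}

func main() {
	fmt.Println(PENDING.IsFinalized(), COMMITTED.IsFinalized()) // false true
}
```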
pkg/storage/concurrency/concurrency_control.go, line 318 at r4 (raw file):
Previously, sumeerbhola wrote…
how will this "don't allocate" be accomplished?
I just don't want to store them by reference on this object. I'd rather store them by value.
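A sketch of the difference being described, with stand-in field and type names (the real types are the concurrency package's request struct and `spanset.SpanSet`):

```go
package spansdemo

// Span and SpanSet are stand-ins for roachpb.Span and spanset.SpanSet.
type Span struct{ Key, EndKey string }
type SpanSet struct{ spans []Span }

// requestByRef stores the span sets by reference: each request forces two
// extra heap allocations just for the SpanSet headers.
type requestByRef struct {
	LatchSpans *SpanSet
	LockSpans  *SpanSet
}

// requestByValue stores them inline: the SpanSet headers share the request's
// allocation (or its pool slot), and only the underlying slices are allocated.
type requestByValue struct {
	LatchSpans SpanSet
	LockSpans  SpanSet
}
```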
pkg/storage/concurrency/concurrency_manager.go, line 344 at r7 (raw file):
The concurrency manager exists only on the leaseholder for the range, yes?
Well, it exists on all replicas, but it's drained on all replicas but the leaseholder.
Is this the case where the raft leader is at a different node?
Yes, exactly. It's rare but fundamentally possible for the leaseholder to receive a Raft snapshot.
pkg/storage/concurrency/lock_table_waiter.go, line 161 at r1 (raw file):
Previously, sumeerbhola wrote…
Doesn't the comment about waiting "for each" contradict this change to add a delay?
The "for each" part of this is referring to waiting out the longer deadlock detection delay. I made this more explicit.
pkg/storage/concurrency/lock_table_waiter.go, line 245 at r3 (raw file):
Previously, sumeerbhola wrote…
why is `!ok` not an error anymore?
It wasn't worth asserting this and breaking a number of tests.
pkg/storage/concurrency/lock_table_waiter_test.go, line 67 at r1 (raw file):
Previously, sumeerbhola wrote…
is there a reason this is overridden to 1ms and not 0ms in tests?
No reason. Done.
pkg/storage/intentresolver/intent_resolver.go, line 277 at r5 (raw file):
Previously, sumeerbhola wrote…
s/abortint/aborting/
Done.
Force-pushed from f6761da to 7b0c15a
However, if the EndTxn is in a batch with a Get request, that Get request does still want to be isolated.
To be fair, it couldn't (we don't support mixed batches) but it doesn't change your point (use a CPut).
split read and write timestamp when scanning lockTable
Could you spell out the example in detail (including how the txn's read and write ts drift)? That test runs inserts only, right? Am morally right that readtimestamp=writetimestamp+maxoffset in practice, even though the code suggests that transactions may do the snapshot thing of reading at T and writing at a later timestamp?
I'm very excited to see this land. Consider this basically an LGTM except I should look at it a little more after you've addressed feedback. I support merging this quickly so that you don't catch skew.
Reviewed 3 of 12 files at r1, 69 of 69 files at r9, 12 of 12 files at r10, 3 of 3 files at r11, 40 of 40 files at r12, 30 of 30 files at r13, 1 of 1 files at r14, 3 of 3 files at r15, 1 of 1 files at r16, 4 of 4 files at r17.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/storage/replica.go, line 224 at r13 (raw file):
leaseHistory *leaseHistory
// concMgr sequences incoming requests and provides isoaltion between
isolation
pkg/storage/replica_raft.go, line 150 at r13 (raw file):
quotaSize := uint64(proposal.command.Size())
if maxSize := uint64(MaxCommandSize.Get(&r.store.cfg.Settings.SV)); quotaSize > maxSize {
	return nil, nil, 0, g, roachpb.NewError(errors.Errorf(
Is it important that `g` is returned in all of these error cases? I sure hope not... definitely the method comment says it shouldn't. If it is necessary (it has to be, why else would you have done this) return a type `errWithGuard struct { *roachpb.Error \n *roachpb.Guard }` instead of individual values.
See also cockroach/pkg/storage/replica_write.go, line 56 in fff4003:
// Returns exactly one of a response, an error or re-evaluation reason.
Ok, I see now that this is a general pattern:
cockroach/pkg/storage/replica_send.go, lines 215 to 222 in fff4003:
br, g, pErr = fn(r, ctx, ba, g)
switch t := pErr.GetDetail().(type) {
case nil:
	// Success.
	return br, nil
case *roachpb.WriteIntentError:
	// Drop latches, but retain lock wait-queues.
	g.AssertLatches()
I really dislike us returning errors along with nonzero values. Not for this PR, but I strongly encourage you to consider the suggestion above (or any other one that avoids this problem).
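A sketch of the shape being suggested, with stand-in types (the real types would be `*roachpb.Error` and the concurrency guard):

```go
package errdemo

// Error and Guard are stand-ins for *roachpb.Error and the concurrency guard.
type Error struct{ msg string }
type Guard struct{}

// errWithGuard bundles the error with the guard whose ownership is being
// handed back, so the happy-path return values can stay zero on error paths
// instead of returning a live guard alongside a non-nil error.
type errWithGuard struct {
	pErr  *Error
	guard *Guard
}

// An error path would then return only the bundle:
//
//	return nil, nil, 0, &errWithGuard{pErr: err, guard: g}
//
// and the caller unwraps the guard from the error when it is present.
```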
pkg/storage/replica_read.go, line 112 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
are reads that acquire unreplicated locks only in a read-only batch?
Yes, although they will acquire write latches to synchronize properly with other SFU reads. This is a little strange, given that we're in `executeReadOnlyBatch`. I think we need to reframe this function and `executeWriteBatch` to make it clear that the most fundamental distinction between them is that the latter can propose to Raft and the former cannot.
And how does this relate to the TODO earlier in this file about releasing latches early. If the read wants to acquire unreplicated locks it can't release latches, yes? So that TODO only applies to non-locking reads?
Great point, I added to the comment.
I had always assumed that having split up the read/rw paths as we do today was mostly an artifact of the past, and that we could send reads into the write path. Is this fundamentally not true? You could imagine wanting to lift the restriction that mixed batches aren't allowed inside of KV, without implementing splitting at the Replica level - mixed batches would have to hit the rw path, so that better be the only path.
pkg/storage/replica_send.go, line 139 at r13 (raw file):
// their Raft entry has been committed but before it has been applied to
// the replicated state machine. In all of these cases, responsibility
// for releasing the concurrency guard is handed to Raft.
ditto about returning both a guard and an error
pkg/storage/batcheval/cmd_end_transaction.go, line 522 at r10 (raw file):
	log.Fatalf(ctx, "expected resolve allowance to be exactly 0 resolving %s; got %d", intent.Span, resolveAllowance)
}
intent.EndKey = resumeSpan.Key
Does this case have a test?
pkg/storage/batcheval/cmd_resolve_intent_range.go, line 59 at r10 (raw file):
reply.NumKeys = numKeys
if resumeSpan != nil {
	update.EndKey = resumeSpan.Key
ditto
pkg/storage/batcheval/result/intent.go, line 41 at r10 (raw file):
// FromWrittenIntents creates a Result communicating that the intents were
// written or re-written by the given request and should be handled.
func FromWrittenIntents(txn *roachpb.Transaction, keys []roachpb.Key) Result {
Why do you have two of these methods and not just a single variadic one? Perf?
pkg/storage/batcheval/result/result.go, line 35 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
could you point me to it?
They are passed to the IntentResolver here.
And why are we trying to do anything with intents that it did not conflict with -- isn't it wasteful given that those transactions may be ongoing and there is no conflict?
`CleanupIntentsAsync` uses a `PUSH_TOUCH` push RPC, so it simply checks whether the intent's transaction is alive or not. If it is, it doesn't wait around for it to finish like a `PUSH_ABORT`-style push does.
For additional color, encountering but not conflicting with intents basically only (?) happens when doing inconsistent reads. I don't remember where exactly we use those but I think in the past we used them on the meta ranges (do we still?) and this code makes sure intents are cleaned up quickly.
pkg/storage/batcheval/result/result.go, line 40 at r10 (raw file):
// ResolvedIntents stores any resolved intent spans, either with finalized
// or pending statuses. Unlike WrittenIntents and EncounteredIntents, values
// in this slice can represent spans of intents that were resolved.
will, not can, right? If I'm only looking at `WrittenIntents` and `ResolvedIntents` I might wonder why a PENDING intent could be in both.
pkg/storage/concurrency/concurrency_control.go, line 276 at r15 (raw file):
// OnRangeSnapshot informs the concurrency manager that its replica has
// received a snapshot from another replica in its range.
OnRangeSnapshot()
nit: why `Range` and not `Replica`? Something like `OnReplicaAppliedSnapshot()`
pkg/storage/concurrency/concurrency_manager.go, line 344 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The concurrency manager exists only on the leaseholder for the range, yes?
Well, it exists on all replicas, but it's drained on all replicas but the leaseholder.
Is this the case where the raft leader is at a different node?
Yes, exactly. It's rare but fundamentally possible for the leaseholder to receive a Raft snapshot.
Mind adding that to the comment here?
How does the "rebuilding" work? We start discovering the replicated locks again as they are being read, and we simply lost all the unreplicated locks. This could go in the comment as well.
pkg/storage/concurrency/lock_table.go, line 32 at r9 (raw file):
// Default upper bound on the number of locks in a lockTable.
const defaultLockTableSize = 10000
What happens above this? Also, shouldn't this be bounding the number of bytes?
pkg/storage/concurrency/lock_table_waiter.go, line 32 at r9 (raw file):
"kv.lock_table.coordinator_liveness_push_delay", "the delay before pushing in order to detect coordinator failures of conflicting transactions", 10*time.Millisecond,
Isn't this too low? I'd expect it would be ~txn timeout.
pkg/storage/concurrency/lock_table_waiter.go, line 40 at r9 (raw file):
"kv.lock_table.deadlock_detection_push_delay", "the delay before pushing in order to detect dependency cycles between transactions", 100*time.Millisecond,
Some justification for this number would be good, even if it's just a guess.
Reviewed 1 of 69 files at r9, 6 of 30 files at r13, 4 of 4 files at r17.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/concurrency/lock_table.go, line 32 at r9 (raw file):
Also, shouldn't this be bounding the number of bytes?
Yes, that is a TODO.
Do we have a general library that does this via reflection, or do we hand-roll the memory accounting for each data-structure? And are there existing examples in the codebase to look at?
pkg/storage/concurrency/lock_table.go, line 849 at r17 (raw file):
}
// Locked by some other txn.
if g.readTS.Less(waitForTs) {
can you add a TODO somewhere that we need to add tests to test when `readTS != writeTS`?
pkg/storage/concurrency/lock_table.go, line 1121 at r17 (raw file):
} switch sa {
I liked this change -- made it easier to understand.
pkg/storage/concurrency/lock_table_waiter.go, line 245 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It wasn't worth asserting this and breaking a number of tests.
I'm confused -- some tests take the `!ok` path? Is that risky (wrt tests not finding bugs)?
This commit replaces the strategy for injecting dependencies into the concurrency and txnwait packages from custom interfaces to custom configuration structs. This is more consistent with our standard approach for dependency injection. It also makes the two packages cleaner and easier to test. While here, the commit pulls out two cluster settings:
- kv.lock_table.coordinator_liveness_push_delay: the delay before pushing in order to detect coordinator failures of conflicting transactions
- kv.lock_table.deadlock_detection_push_delay: the delay before pushing in order to detect dependency cycles between transactions
The implication of the first setting is that, by default, even a distinguished waiter will wait 10ms before pushing a conflicting transaction and potentially discovering that it is dead. This will slow down recovery from failed transactions, but this seems well worth it given that we will now frequently avoid pushing at all in the happy path.
This allows the handling of added, updated, and resolved intents to become more reactive. It mirrors our handling of UpdatedTxns and their interaction with the TxnWaitQueue.
This hasn't been used since d41dc72, which was made on Aug 7, 2016. Our understanding of how to resolve intents without blocking conflicting transactions has dramatically evolved since then.
Up until now, we had been assuming that we could use the spans that a batch request declared to determine which conflicting latches and locks to wait on in the concurrency manager. We then tried to determine at the batch level which batches should skip a lockTable scan (see shouldWaitOnConflicts). This turned out to be a fraught effort. We need to make decisions about whether or not to scan the lockTable on a per-request basis, not a per-batch basis. This is because the desire or lack of desire for isolation during evaluation is a trait of each request type and not a trait of its batch. For an example of this, imagine an EndTxn request with a Split trigger, which causes the request to declare a latch span over the entire RHS of its proposed split. The EndTxn request does not want to wait on all locks on the RHS. However, if the EndTxn is in a batch with a CPut request, that CPut request does still want to be isolated. It was unclear how to reconcile these two desires before - should a batch that wants any isolation get full isolation (shouldWaitOnConflicts == true) or should a batch that requires less isolation get no isolation (shouldWaitOnConflicts == false)? Clearly, conflating the spans that a request wants to latch and the spans that it wants full transactional isolation over is a bad idea. This commit addresses this by allowing requests to declare which spans to scan the lockTable with individually.
Related to cockroachdb#41720. Related to cockroachdb#44976. This commit integrates the new concurrency package into the storage package. Each Replica is given a concurrency manager, which replaces its existing latch manager and txn wait queue. The change also uses the concurrency manager to simplify the role of the intent resolver. The intent resolver no longer directly handles WriteIntentErrors. As a result, we are able to delete the contention queue entirely. With this change, all requests are now sequenced through the concurrency manager. When sequencing, latches are acquired and conflicting locks are detected. If any locks are found, the requests wait in lock wait-queues for the locks to be resolved. This is a major deviation from how things currently work because today, even with the contention queue, requests end up waiting for conflicting transactions to commit/abort in the txnWaitQueue after at least one RPC. Now, requests wait directly next to the intents/locks that they are waiting on and react directly to the resolution of these intents/locks. Once requests are sequenced by the concurrency manager, they are theoretically fully isolated from all conflicting requests. However, this is not strictly true today because we have not yet pulled all replicated locks into the concurrency manager's lock table. We will do so in a future change. Until then, the concurrency manager maintains a notion of "intent discovery", which is integrated into the Replica-level concurrency retry loop. Performance numbers will be published shortly. This will be followed by performance numbers using the SELECT FOR UPDATE locking (cockroachdb#40205) improvements that this change enables.
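As a rough illustration of the Replica-level concurrency retry loop described above (all names below are stand-ins inferred from the snippets quoted in this review, not the PR's actual signatures):

```go
package concdemo

import "context"

// Stand-in types; the real ones live in the concurrency and roachpb packages.
type (
	BatchRequest  struct{}
	BatchResponse struct{}
	Guard         struct{}
	Error         struct{ WriteIntent bool }
	Manager       struct{}
)

// SequenceReq acquires latches and waits in lock wait-queues for any
// conflicting locks already known to the lockTable.
func (m *Manager) SequenceReq(ctx context.Context, g *Guard, ba *BatchRequest) *Guard { return g }

// HandleWriterIntentError informs the lockTable of a discovered intent so the
// next sequencing attempt waits on it ("intent discovery").
func (m *Manager) HandleWriterIntentError(ctx context.Context, g *Guard, err *Error) {}

// FinishReq releases latches and exits any lock wait-queues.
func (m *Manager) FinishReq(g *Guard) {}

// executeBatch sketches the retry loop: sequence, evaluate, and on a
// WriteIntentError tell the lockTable about the lock and go around again
// while retaining the wait-queue position.
func executeBatch(ctx context.Context, m *Manager, ba *BatchRequest,
	eval func(*BatchRequest) (*BatchResponse, *Error)) (*BatchResponse, *Error) {
	var g *Guard
	defer func() { m.FinishReq(g) }()
	for {
		g = m.SequenceReq(ctx, g, ba)
		br, err := eval(ba)
		if err == nil {
			return br, nil
		}
		if err.WriteIntent {
			m.HandleWriterIntentError(ctx, g, err)
			continue
		}
		return nil, err
	}
}
```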
From `btree.MakeIter`'s documentation: > It is not safe to continue using an iterator after modifications are > made to the tree. If modifications are made, create a new iterator. This was occasionally causing a nil pointer dereference because the tree's root was being released back into a memory pool while still referenced by the iterator.
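A minimal sketch of the fix pattern described above (toy tree and iterator; the only point is that the iterator is re-created after every modification, per the quoted `btree.MakeIter` contract):

```go
package btreedemo

// tree and iterator are toy stand-ins for the copy-on-write btree and its
// iterator; the real tree recycles nodes through a memory pool, which is why
// a stale iterator can dereference a released node.
type tree struct{ keys []int }

type iterator struct {
	t   *tree
	pos int
}

func (t *tree) MakeIter() iterator { return iterator{t: t} }
func (t *tree) insert(k int)       { t.keys = append(t.keys, k) }

// collectAfterInsert re-creates the iterator after the mutation instead of
// continuing to use the one made beforehand.
func collectAfterInsert(t *tree, k int) []int {
	it := t.MakeIter() // safe to use only until the next modification
	_ = it
	t.insert(k)
	it = t.MakeIter() // modification happened: make a fresh iterator
	var out []int
	for ; it.pos < len(it.t.keys); it.pos++ {
		out = append(out, it.t.keys[it.pos])
	}
	return out
}
```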
This commit hooks the concurrency manager against range snapshot events. The hazard this is protecting against is that a snapshot can cause discontinuities in raft entry application. The lockTable expects to observe all lock state transitions on the range through LockManager listener methods. If there's a chance it missed a state transition, it is safer to clear the lockTable and let it be rebuilt from persistent state as intents are discovered. I have not actually seen this cause problems and this seems like it would be extremely rare because it's quite hard to create situations where the leaseholder replica receives a snapshot, but it does seem possible so it's better to be safe here.
This prevents false detection of stalls in a few cases, like when receiving on a buffered channel that already has an element in it. Before this, the first few iterations of `testutils.SucceedsSoon` were so quick that `TestConcurrencyManagerBasic/basic` saw cases where the wrong channel select was detected as a stall point. This occurred ~1/1,000 times. With this change, the test has stood up to over 10,000 iterations of stress.
…kTable This resolves an infinite loop that was possible and somewhat common in `TestMonotonicInserts`. The loop occurred because read requests were discovering intents at timestamps above their read timestamp but within their uncertainty interval. This is essential to avoid stale reads. However, it was confusing the lockTable. This commit addresses this issue by considering a transaction's uncertainty interval when scanning the lockTable. A new assertion was added that would detect this infinite loop condition.
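A simplified sketch of the comparison the commit is describing (stand-in timestamp type; the real code compares the lock's timestamp against the guard's timestamps in the lockTable scan quoted earlier, e.g. `g.readTS.Less(waitForTs)`):

```go
package ltdemo

// Timestamp is a stand-in for hlc.Timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// conflictsWithLock sketches the idea: a reader conflicts with a lock at or
// below its read timestamp, and also with a lock above the read timestamp but
// still inside its uncertainty interval, since ignoring that lock could let
// the read return a stale value. Comparing against readTS alone misses the
// second case, which is what confused the lockTable.
func conflictsWithLock(readTS, uncertaintyLimitTS, lockTS Timestamp) bool {
	if !readTS.Less(lockTS) {
		return true // lock is at or below the read timestamp
	}
	return !uncertaintyLimitTS.Less(lockTS) // within the uncertainty interval
}
```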
Force-pushed from 7b0c15a to c3ff074
TFTRs! After we get to a state where we like this, I'm going to wait to merge this until #45567 is also ready to go in, then I'll merge them together.
Performance numbers will be published shortly.
I benchmarked this with a suite of microbenchmarks and full workload benchmarks. The full workload benchmarks look fine:
tpccbench/nodes=3/cpu=16 before | tpccbench/nodes=3/cpu=16 after
---|---
1895 | 1913
1913 | 1913
1885 | 1913
The `pkg/sql/tests` microbenchmarks took a little bit of a hit. Here's the benchdiff output from a few days ago: https://docs.google.com/spreadsheets/d/1W0PFHOw08NXBWgLHyhGc5Zky2oFpOHXCz5ZTV-Qm60o/edit#gid=4. The `Bank` workload's massive drop in performance is actually a deadlock that was being broken accidentally because we don't launch heartbeat loops for 1PC txns, so a txn in the deadlock was expiring. I have a PR to update the `kv/contention` roachtest to disable this "optimization" so that we never accidentally allow such deadlocks to be resolved - we'd rather they deadlock indefinitely so we can catch the underlying issue, at least in tests. The issue had to do with a CPut forgetting to include `WrittenIntents` on an error path. The error was being consumed (see #45562) and the CPut was succeeding, but the lockTable was not being told about this. This led to a case where a lock was being discovered and added to a lockState that contained the same transaction as a queuedWriter. We weren't handling this case correctly. I fixed the `WrittenIntents` bug in the second commit and have a PR coming for the `AddDiscoveredLock` issue, which may or may not still be possible to hit.
I also have a series of micro-optimizations around the lockTable and concurrency manager to pool memory. This closes the gap on a lot of the other benchmarks. I'll wait until this PR goes in to continue tuning.
To be fair, it couldn't (we don't support mixed batches) but it doesn't change your point (use a CPut).
Done.
Could you spell out the example in detail (including how the txn's read and write ts drift)? That test runs inserts only, right?
No, the test runs a series of reads and writes in a transaction, all with a fixed commit timestamp: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/tests/monotonic_insert_test.go#L138
Am morally right that readtimestamp=writetimestamp+maxoffset in practice, even though the code suggests that transactions may do the snapshot thing of reading at T and writing at a later timestamp?
In that test, that's exactly correct because the commit timestamp is fixed at the beginning of the transaction due to the use of `cluster_logical_timestamp()`.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/storage/replica.go, line 224 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
isolation
Done.
pkg/storage/replica_raft.go, line 150 at r13 (raw file):
Yeah, I don't love this either, but I had a hard time figuring out how to conditionally pass ownership of the concurrency guard (and the responsibility of freeing it at the right time) without this structure. We want Raft to assume ownership of the guard if the proposal makes it to Raft, but we want to continue holding on to the guard if the proposal doesn't make it to Raft. I played around with a few strategies for communicating back to the caller of `evalAndPropose` whether it still had ownership of the guard - such as passing a bool to signify this or trying to correlate the error value with whether the guard still needed to be freed or not. Unless I'm missing something, the `errWithGuard` idea would run into the same kinds of issues. In the end, I decided that trying to hide ownership of the guard just made things more complex.
I'm very open to improving things, but I think we'll need to more aggressively restructure this logic to make any meaningful improvements. For instance, if we split evaluation and Raft proposal into two separate stages from the perspective of `executeWriteBatch`, then some of the complexity around no-op proposals and evaluation errors and how that maps to guard ownership would go away.
for another rotted comment
Done.
pkg/storage/replica_read.go, line 112 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I had always assumed that having split up the read/rw paths as we do today was mostly an artifact of the past, and that we could send reads into the write path. Is this fundamentally not true? You could imagine wanting to lift the restriction that mixed batches aren't allowed inside of KV, without implementing splitting at the Replica level - mixed batches would have to hit the rw path, so that better be the only path.
It should be true and we're continuing to narrow the gap between these two code paths. I think the largest remaining difference between them is that we use a more efficient LSM iterator on the read path instead of a full write batch. There's also some complexity around which kinds of retries are appropriate for reads vs. writes because writes can arbitrarily increase their timestamps without violating their latches while reads have to be more careful about doing so. Finally, I think we'd want to pull the evaluation/proposal split (currently in `evalAndPropose`) up a level if we were to do so.
All of these would be good changes though, so I'm in support of such a refactor.
pkg/storage/replica_send.go, line 139 at r13 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto about returning both a guard and an error
Let's discuss in the previous comment.
pkg/storage/batcheval/cmd_end_transaction.go, line 522 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does this case have a test?
No, not yet. It's not actually necessary for correctness, just performance. I'd like to add a lot more testing around batch evaluation and the `result.Result` that we get out of it, but I'd rather not include that testing in this already very bloated PR.
pkg/storage/batcheval/cmd_resolve_intent_range.go, line 59 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto
Same response as above.
pkg/storage/batcheval/result/intent.go, line 41 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why do you have two of these methods and not just a single variadic one? Perf?
No reason, escape analysis can prove that the array passed to the variadic function won't escape when given a constant number of args, so using variadics wouldn't force any new heap allocations or anything. Done.
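The merged variadic constructor might look roughly like this (the signature follows the one quoted in the comment above; the surrounding types are stand-ins):

```go
package resultdemo

// Stand-ins for roachpb.Transaction, roachpb.Key, roachpb.LockUpdate, and the
// relevant slice of result.Result.
type (
	Transaction struct{ ID string }
	Key         []byte
	LockUpdate  struct {
		Txn *Transaction
		Key Key
	}
	Result struct{ WrittenIntents []LockUpdate }
)

// FromWrittenIntents creates a Result communicating that intents at the given
// keys were written or re-written by txn. With a constant number of arguments
// at the call site, escape analysis keeps the implicit keys slice off the
// heap, so one variadic constructor costs nothing over the two fixed-arity
// helpers it replaces.
func FromWrittenIntents(txn *Transaction, keys ...Key) Result {
	r := Result{WrittenIntents: make([]LockUpdate, len(keys))}
	for i, k := range keys {
		r.WrittenIntents[i] = LockUpdate{Txn: txn, Key: k}
	}
	return r
}
```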
pkg/storage/batcheval/result/result.go, line 35 at r2 (raw file):
do we still?
We scan using a READ_UNCOMMITTED isolation level, which has the same behavior.
pkg/storage/batcheval/result/result.go, line 40 at r10 (raw file):
Previously, tbg (Tobias Grieger) wrote…
will, not can, right? If I'm only looking at `WrittenIntents` and `ResolvedIntents` I might wonder why a PENDING intent could be in both.
Done. "Can" was referring to the possibility of them being spans, not of them being resolved. But the wording was confusing.
pkg/storage/concurrency/concurrency_control.go, line 276 at r15 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: why `Range` and not `Replica`? Something like `OnReplicaAppliedSnapshot()`
Done.
pkg/storage/concurrency/concurrency_manager.go, line 344 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mind adding that to the comment here?
How does the "rebuilding" work? We start discovering the replicated locks again as they are being read, and we simply lost all the unreplicated locks. This could go in the comment as well.
Done.
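For readers following along, the hook boils down to something like the sketch below (stand-in types; the rename to `OnReplicaSnapshotApplied` and the `m.lt.Clear()` call come from the review comments above, the rest is illustrative):

```go
package snapdemo

// lockTable is a stand-in exposing only the call quoted in the review.
type lockTable struct{}

// Clear drops all tracked locks and their wait-queues.
func (lt *lockTable) Clear() {}

type managerImpl struct{ lt lockTable }

// OnReplicaSnapshotApplied handles a replica applying a Raft snapshot: the
// snapshot may have skipped over log entries, so the lockTable could have
// missed lock state transitions. Rather than reconcile, the table is cleared;
// replicated locks are re-discovered as requests run into their intents, and
// unreplicated locks (best-effort to begin with) are simply lost.
func (m *managerImpl) OnReplicaSnapshotApplied() {
	m.lt.Clear()
}
```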
pkg/storage/concurrency/lock_table.go, line 32 at r9 (raw file):
Do we have a general library that does this via reflection, or do we hand-roll the memory accounting for each data-structure?
We typically hand-roll memory accounting for each data structure. `git grep 'unsafe.Sizeof' pkg/storage` will give you some examples of this approach in our codebase, such as the timestamp cache and the raftentry cache. This isn't too burdensome in practice, but I can't say we're particularly in love with the approach either, so if you've used nicer patterns in the past, this might be a good time to experiment with introducing them to the codebase.
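An example of the hand-rolled style, for anyone who hasn't seen it (illustrative struct; the `unsafe.Sizeof` pattern is what the suggested grep turns up):

```go
package memdemo

import "unsafe"

// lockState is an illustrative structure whose memory we want to account for.
type lockState struct {
	key     []byte
	waiters []uintptr
}

var lockStateOverhead = int64(unsafe.Sizeof(lockState{}))

// memUsage hand-rolls the accounting: the fixed struct overhead comes from
// unsafe.Sizeof, and the variable-length members are added on top, since
// Sizeof only counts the slice headers, not their backing arrays.
func (l *lockState) memUsage() int64 {
	return lockStateOverhead +
		int64(cap(l.key)) +
		int64(cap(l.waiters))*int64(unsafe.Sizeof(uintptr(0)))
}
```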
pkg/storage/concurrency/lock_table.go, line 849 at r17 (raw file):
Previously, sumeerbhola wrote…
can you add a TODO somewhere that we need to add tests to test when `readTS != writeTS`?
Done.
pkg/storage/concurrency/lock_table_waiter.go, line 245 at r3 (raw file):
Previously, sumeerbhola wrote…
I'm confused -- some tests take the `!ok` path? Is that risky (wrt tests not finding bugs)?
Some tests are below the level where transactions get assigned observed timestamps and so they hit the `!ok` path. I wouldn't say that's particularly risky, just that it's a bit unrealistic. But this is simply a heuristic to avoid extra transaction retries, so it's not very concerning either.
Still, I'll update this to handle the missing observed timestamp case more gracefully, which also helps drive home the point of the heuristic. Done in new commit.
pkg/storage/concurrency/lock_table_waiter.go, line 32 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this too low? I'd expect it would be ~txn timeout.
This is discussed in the comment on case `waitForDistinguished`. If we had a cache for aborted txn IDs then I'd feel better about bumping this up, but we want to avoid waiting out a long delay on each abandoned intent, so for now, we keep it fairly short.
I also have an upcoming change that explains this even further.
EDIT: actually, I added a commit to explain this here as well.
pkg/storage/concurrency/lock_table_waiter.go, line 40 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Some justification for this number would be good, even if it's just a guess.
This is what the number is today in the contentionQueue.
Added a commit explaining this and the previous value.
…odes=4 I noticed when debugging issues in cockroachdb#45482 that unhandled deadlocks occasionally resolved themselves because txns would eventually time out. This was because we don't start the txn heartbeat loop for 1PC txns. In this kind of test, we want any unhandled deadlocks to be as loud as possible, so just like we set a very long txn expiration, we also enable the txn heartbeat loop for all txns, even those that we expect will be 1PC. This commit also drops the kv.lock_table.deadlock_detection_push_delay down for the test, since it's already touching this code.
I have a PR to update the kv/contention roachtest to disable this "optimization" so that we never accidentally allow such deadlocks to be resolved - we'd rather them deadlock indefinitely so we can catch the underlying issue, at least in tests.
This is #45568.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
This commit adds commentary around the default values for:
- kv.lock_table.coordinator_liveness_push_delay
- kv.lock_table.deadlock_detection_push_delay
Canceled (will resume)
This commit updates the AddDiscoveredLock state transition in lockTableImpl to release all waiting writers that are part of the same transaction as a discovered lock. This caused issues in cockroachdb#45482, where we saw txn deadlocks in the `pkg/sql/tests.Bank` benchmark. This issue was triggered because we were forgetting to inform the lockTable of a lock acquisition in a subtle case. That is now fixed and it's not clear that we can actually hit the bug here anymore given the current policy on how wait-queues form in the lock-table. Regardless, this is worth handling correctly in case things ever change.
Build failed (retrying...)
Build succeeded
45551: ui: standardize the display of node names in node lists r=dhartunian a=petermattis Standardize the display of node names in node lists. The nodes overview list was displaying nodes as `N<id> <ip-address>` while graphs display `<ip-address> (n<id>)`. Standardize on the latter format. A similar problem existed on the statement details page which was displaying `N<id> <ip-address> (n<id>)`. The node id was being displayed twice! Release note: None 45567: storage/concurrency: push reservation holders to detect deadlocks r=nvanbenschoten a=nvanbenschoten This is a partial reversion of #45420. It turns out that there are cases where a reservation holder is a link in a dependency cycle. This can happen when a reservation holder txn is holding on to one reservation while blocking on a lock on another key. If any txns queued up behind the reservation did not push _someone_, they wouldn't detect the deadlock. ``` range A . range B . txnB . txnC txnA | . | ^________ | v . v \ v [lock X: (txnA)] . [lock Y: (txnB)] [res Z: (txnC)] ``` It turns out that this segment of the dependency cycle is always local to a single concurrency manager, so it could potentially forward the push through the reservation links to shorten the cycle and prevent non-lock holders from ever being the victim of a deadlock abort. This is tricky though, so for now, we just push. To address the issue that motivated #45420, we perform this form of push asynchronously while continuing to listen to state transitions in the lockTable. If the pusher is unblocked (see #45420 for an example of when that can happen), it simply cancels the push and proceeds with navigating the lockTable. This PR also adds a set of datadriven deadlock tests to the concurrency manager test suite: [`testdata/concurrency_manager/deadlocks`](https://github.com/cockroachdb/cockroach/pull/45567/files#diff-5934754d5a8f1086698cdbab628ee1b5). These tests construct deadlocks due to lock ordering and request ordering and demonstrate how the deadlocks are resolved. 45568: roachtest: enable txn heartbeat loops for 1PC txns in kv/contention/nodes=4 r=nvanbenschoten a=nvanbenschoten I noticed when debugging issues in #45482 that unhandled deadlocks occasionally resolved themselves because txns would eventually time out. This was because we don't start the txn heartbeat loop for 1PC txns. In this kind of test, we want any unhandled deadlocks to be as loud as possible, so just like we set a very long txn expiration, we also enable the txn heartbeat loop for all txns, even those that we expect will be 1PC. This commit also drops the kv.lock_table.deadlock_detection_push_delay down for the test, since it's already touching this code. Co-authored-by: Peter Mattis <petermattis@gmail.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Closes cockroachdb#40786. Closes cockroachdb#44336.
This commit resolves a bug in distributed deadlock detection that would allow a deadlock between transactions to go undetected, stalling the workload indefinitely. The issue materialized as follows:
1. two transactions would deadlock and each enter a txnwait queue
2. they would poll their pushee's record along with their own
3. deadlock detection would eventually pick this up and abort one of the txns using the pusher's copy of the txn's proto
4. however, the aborted txn has since restarted and bumped its epoch
5. the aborted txn continued to query its record, but failed to ingest any updates from it because the record was at a lower epoch than its own copy of its txn proto, so it never noticed that it was ABORTED
6. all other txns in the system, including the original contending txn, piled up behind the aborted txn in the contention queue, waiting for it to notice it was aborted and exit the queue
7. deadlock!
I'm optimistically closing the two `kv/contention/nodes=4` issues both because I hope this is the cause of their recent troubles and also because I've been spending a lot of time with the test recently in light of cockroachdb#45482 and plan to stabilize it fully. I plan to backport this to release-19.2. This doesn't need to go all the way back to release-19.1 because this was introduced in aed892a.
45595: sql: make cleanup jobs spawned by alter primary key not cancelable r=spaskob a=rohany Fixes #45500. This PR makes the job spawned by ALTER PRIMARY KEY that cleans up indexes be uncancelable. This PR additionally fixes a related bug where ALTER PRIMARY KEY would spawn a job when it didn't actually enqueue any mutations on the table descriptor, causing future schema changes on the table to hang. Release note (sql change): This PR makes the cleanup job spawned by ALTER PRIMARY KEY in some cases uncancelable. 45603: storage/txnwait: terminate push when pusher aborted at lower epoch r=nvanbenschoten a=nvanbenschoten Closes #40786. Closes #44336. This commit resolves a bug in distributed deadlock detection that would allow a deadlock between transactions to go undetected, stalling the workload indefinitely. The issue materialized as follows: 1. two transactions would deadlock and each enter a txnwait queue 2. they would poll their pushees record along with their own 3. deadlock detection would eventually pick this up and abort one of the txns using the pusher's copy of the txn's proto 4. however, the aborted txn has since restarted and bumped it epoch 5. the aborted txn continued to query its record, but failed to ingest any updates from it because the record was at a lower epoch than its own copy of its txn proto. So it never noticed that it was ABORTED 6. all other txns in the system including the original contending txn piled up behind the aborted txn in the contention queue, waiting for it to notice it was aborted and exit the queue 7. deadlock! I'm optimistically closing the two `kv/contention/nodes=4` issues both because I hope this is the cause of their recent troubles and also because I've been spending a lot of time with the test recently in light of #45482 and plan to stabilize it fully. I plan to backport this to release-19.2. This doesn't need to go all the way back to release-19.1 because this was introduces in aed892a. Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
45601: storage/concurrency: release reqs from same txn as discovered lock r=nvanbenschoten a=nvanbenschoten This commit updates the AddDiscoveredLock state transition in lockTableImpl to release all waiting writers that are part of the same transaction as a discovered lock. This caused issues in #45482, where we saw txn deadlocks in the `pkg/sql/tests.Bank` benchmark. This issue was triggered because we were forgetting to inform the lockTable of a lock acquisition in a subtle case. That is now fixed and it's not clear that we can actually hit the bug here anymore given the current policy on how wait-queues form in the lock-table. Regardless, this is worth handling correctly in case things ever change. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Release Note (sql change): CockroachDB now supports string and byte array literals using the dollar-quoted notation, as documented here: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING 45589: sql: update error message for primary key change on an interleave parent r=otan a=rohany Fixes cockroachdb#45537. This PR updates the error message when trying to perform a primary key change on an interleaved parent to include the name of the table as well as the names of the interleaved children. Release note: None 45604: opt: factor limit hints into scan and lookup join costs r=rytaft a=rytaft This PR is a continuation of cockroachdb#43415. The first commit is copied directly from that PR, and the second commit includes a minor fix as well as a number of test cases. Fixes cockroachdb#34811; the example query in this issue now chooses a lookup join as desired. The coster now takes limit hints into account when costing scans and lookup joins, and propagates limit hints through lookup joins. Release note (sql change): The optimizer now considers the likely number of rows an operator will need to provide, and might choose query plans based on this. In particular, the optimizer might prefer lookup joins over alternatives in some situations where all rows of the join will probably not be needed. Co-authored-by: damien.hollis <damien.hollis@unimarket.com> Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu> Co-authored-by: Céline O'Neil <celineloneil@gmail.com> Co-authored-by: Rebecca Taft <becca@cockroachlabs.com> commit 7e0ba7c Author: Rebecca Taft <becca@cockroachlabs.com> Date: Mon Mar 2 15:35:36 2020 -0600 opt: make lookup join limit hint consistent with coster, add tests This commit adds a new helper function called lookupJoinInputLimitHint, which is called by both the coster and the physical props code for limit hint calculation. This helper function ensures that both places take into account the batch size of 100, and require that the calculated limit hint is a multiple of this batch size. This commit also adds more tests related to costing of limit hints for scans and lookup joins. Release note: None commit 9dcab01 Author: Rohan Yadav <rohany@alumni.cmu.edu> Date: Thu Feb 20 15:37:29 2020 -0500 sql: make secondary indexes not write empty k/v's + bugfixes for primary key changes Fixes cockroachdb#45223. Depends on cockroachdb#45226 to land first. This PR fixes many bugs around secondary index encodings and CRUD operations k/v reads and writes. * Fixes a problem secondary indexes would write empty k/v's if it contained a family that had all null values. * Fix multiple bugs where updates to a table during an online primary key change could result an inconsistent new primary key. * Fix an assumption in the updater that assumed indexes always had the same number of k/v's. The logic has been updated to perform a sort of merge operation to decide what k/v's to insert/delete during the update operation. * Increased testing around secondary indexes k/vs and schema change operations. The release note is None because these are all bugs introduced in 20.1. Release note: None commit 80894c3 Author: damien.hollis <damien.hollis@unimarket.com> Date: Sun Jan 19 21:16:13 2020 +1300 sql: support postgresql dollar-quoted strings Added support for postgresql dollar-quoted strings in the scanner. 
A dollar-quoted string constant consists of a dollar sign ($), an optional "tag" of zero or more characters, another dollar sign, an arbitrary sequence of characters that makes up the string content, a dollar sign, the same tag that began this dollar quote, and a final dollar sign. The scanner uses the existing token type of SCONST for dollar-quoted strings. As a result, when the AST is formatted as a string, there is no knowledge that the original input was dollar-quoted and it is therefore formatted as either a plain string or an escaped string (depending on the content). Fixes cockroachdb#41777. Release Note (sql change): CockroachDB now supports string and byte array literals using the dollar-quoted notation, as documented here: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING commit 3d7aeb3 Author: Rohan Yadav <rohany@alumni.cmu.edu> Date: Mon Mar 2 13:46:35 2020 -0500 sql: make cleanup jobs spawned by alter primary key not cancelable Fixes cockroachdb#45500. This PR makes the job spawned by ALTER PRIMARY KEY that cleans up indexes be uncancelable. This PR additionally fixes a related bug where ALTER PRIMARY KEY would spawn a job when it didn't actually enqueue any mutations on the table descriptor, causing future schema changes on the table to hang. Release note (sql change): This PR makes the cleanup job spawned by ALTER PRIMARY KEY in some cases uncancelable. commit b0be21a Merge: 8ffade8 bbac108 Author: craig[bot] <bors@cockroachlabs.com> Date: Tue Mar 3 14:48:15 2020 +0000 Merge cockroachdb#45621 45621: sql: allow inverted indexes on mixed-case cols r=jordanlewis a=jordanlewis Closes cockroachdb#42944. Previously, a bug prevented creation of inverted indexes on columns with mixed-case identifiers. Now, the bug is fixed. Release note (bug fix): it is now possible to create inverted indexes on columns whose names are mixed-case. Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com> commit 2b7aa46 Author: Raphael 'kena' Poss <knz@thaumogen.net> Date: Tue Mar 3 00:38:55 2020 +0100 sql: support numeric tuple indexing Previously, it was impossible to use `(t).*` to expand a single tuple-typed column into multiple projections if the tuple was not labeled to start with. This limitation was caused by another limitation, that it was not possible to refer to a single column in a tuple if the tuple was not already labeled. This patch addresses both limitations. Release note (sql change): CockroachDB now supports expanding all columns of a tuple using the `.*` notation, for example `SELECT (t).* FROM (SELECT (1,'b',2.3) AS t)`. This is a CockroachDB-specific extension. Release note (sql change): CockroachDB now supports accessing the Nth column in a column with tuple type using the syntax `(...).@N`, for example `SELECT (t).@2 FROM (SELECT (1,'b',2.3) AS t)`. This is a CockroachDB-specific extension. commit 8ffade8 Merge: 120d53f 79ca338 Author: craig[bot] <bors@cockroachlabs.com> Date: Tue Mar 3 14:08:12 2020 +0000 Merge cockroachdb#45139 45139: intervalccl: optimize OverlapCoveringMerge r=dt a=C0rWin Change OverlapCoveringMerge instead of iterating over coverings for each input range to flatten coverings, sort results and produce ranges with one pass. Reducing total complexity from O(nm) to O(n*log(n)). 
Signed-off-by: Artem Barger <bartem@il.ibm.com> Release note: none Co-authored-by: Artem Barger <bartem@il.ibm.com> commit 120d53f Merge: 0414b69 e2da8bd Author: craig[bot] <bors@cockroachlabs.com> Date: Tue Mar 3 11:44:30 2020 +0000 Merge cockroachdb#45587 45587: log: secondary logger fixes r=ajwerner a=knz Needed by / will be exercised by cockroachdb#45193 Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> commit 0414b69 Merge: 47259c2 3433fbd Author: craig[bot] <bors@cockroachlabs.com> Date: Tue Mar 3 09:35:26 2020 +0000 Merge cockroachdb#45549 45549: changefeedccl: pick up TargetBytes for backfill ScanRequest r=ajwerner a=ajwerner Now that cockroachdb#44925 has merged, we can use a byte limit rather than a row limit. Release note: None Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com> commit 0a1f251 Author: Nathan VanBenschoten <nvanbenschoten@gmail.com> Date: Mon Mar 2 16:26:24 2020 -0500 storage/txnwait: terminate push when pusher aborted at lower epoch Closes cockroachdb#40786. Closes cockroachdb#44336. This commit resolves a bug in distributed deadlock detection that would allow a deadlock between transactions to go undetected, stalling the workload indefinitely. The issue materialized as follows: 1. two transactions would deadlock and each enter a txnwait queue 2. they would poll their pushees record along with their own 3. deadlock detection would eventually pick this up and abort one of the txns using the pusher's copy of the txn's proto 4. however, the aborted txn has since restarted and bumped it epoch 5. the aborted txn continued to query its record, but failed to ingest any updates from it because the record was at a lower epoch than its own copy of its txn proto. So it never noticed that it was ABORTED 6. all other txns in the system including the original contending txn piled up behind the aborted txn in the contention queue, waiting for it to notice it was aborted and exit the queue 7. deadlock! I'm optimistically closing the two `kv/contention/nodes=4` issues both because I hope this is the cause of their recent troubles and also because I've been spending a lot of time with the test recently in light of cockroachdb#45482 and plan to stabilize it fully. I plan to backport this to release-19.2. This doesn't need to go all the way back to release-19.1 because this was introduces in aed892a. commit bbac108 Author: Jordan Lewis <jordanthelewis@gmail.com> Date: Mon Mar 2 23:16:33 2020 -0500 sql: allow inverted indexes on mixed-case cols Previously, a bug prevented creation of inverted indexes on columns with mixed-case identifiers. Now, the bug is fixed. Release note (bug fix): it is now possible to create inverted indexes on columns whose names are mixed-case. commit cee9355 Author: Oliver Tan <otan@cockroachlabs.com> Date: Thu Feb 27 10:11:31 2020 -0800 sql: allow rename database for sequences without a db name reference See release note for description. This PR should be included ahead of the more "general" fix of changing the DefaultExpr with the new database as it unblocks people using `experimental_serial_normalization=sql_sequence` from using the database rename feature. Release note (sql change): Previously, when we renamed a database, any table referencing a sequence would be blocked from being able to rename the table. This is to block cases where if the table's reference to the sequence contains the database name, and the database name changes, we have no way of overwriting the table's reference to the sequence in the new database. 
However, if no database name is included in the sequence reference, we should continue to allow the database to rename, as is implemented with this change. commit 7383e9d Author: Radu Berinde <radu@cockroachlabs.com> Date: Mon Mar 2 11:15:12 2020 -0800 opt: extend WithUses and improve NeededMutationCols This change extends the WithUses rule prop to keep track of the columns actually used by `WithScan`s. This set is then used by NeededMutationCols to make sure we never prune mutation input columns that are used by FK checks. Currently this never happens, but because of fairly subtle reasons: FKs happen to require indexes on both sides, and those indexes force the mutation operator to require values for those columns. This won't be the case anymore with upsert FK checks, for which this fix is required. The new information will also be used (in a future change) to prune unused columns of `With` itself. Release note: None commit e2da8bd Author: Raphael 'kena' Poss <knz@thaumogen.net> Date: Mon Mar 2 17:04:52 2020 +0100 log: ensure that the test log scopes works with secondary loggers This patch ensures that the file redirection performed by log.Scope / log.ScopeWithoutShowLogs is effective with secondary loggers. Release note: None commit 3433fbd Author: Andrew Werner <ajwerner@cockroachlabs.com> Date: Sat Feb 29 13:39:17 2020 -0500 changefeedccl: pick up TargetBytes for backfill ScanRequest Now that cockroachdb#44925 has merged, we can use a byte limit rather than a row limit. Release note: None commit 589cfa1 Author: Raphael 'kena' Poss <knz@thaumogen.net> Date: Mon Mar 2 17:03:18 2020 +0100 log: ensure secondary loggers get cleaned up Previously, if two consecutive tests in a package were using secondary loggers, the second test would see the secondary loggers of the first test in its logger registry. This was problematic because it really made the log files/messages of the first test available to the logging file retrieval API. This patch cleans this up by de-registering secondary loggers when their context is cancelled, which maps to stoppers going down. Release note: None commit 4caee85 Author: Rohan Yadav <rohany@alumni.cmu.edu> Date: Mon Mar 2 11:46:04 2020 -0500 sql: update error message for primary key change on an interleave parent Fixes cockroachdb#45537. This PR updates the error message when trying to perform a primary key change on an interleaved parent to include the name of the table as well as the names of the interleaved children. Release note: None commit 9ac5dbc Author: Céline O'Neil <celineloneil@gmail.com> Date: Thu Dec 19 14:02:53 2019 -0500 opt: factor limit hints into scan and lookup join costs Fixes cockroachdb#34811; the example query in this issue now chooses a lookup join as desired. The coster now takes limit hints into account when costing scans and lookup joins, and propagates limit hints through lookup joins. Release note (sql change): The optimizer now considers the likely number of rows an operator will need to provide, and might choose query plans based on this. In particular, the optimizer might prefer lookup joins over alternatives in some situations where all rows of the join will probably not be needed. commit 62c8667 Author: Raphael 'kena' Poss <knz@thaumogen.net> Date: Mon Mar 2 14:35:30 2020 +0100 cli: reveal SQL error details/hints in CLI commands Previously, when a CLI command that uses SQL under the hood encountered an error, only the main error message was displayed. This patch changes it to reveal the other additional user-visible fields. 
Release note (cli change): Certain CLI command now provide more details and/or a hint when they encounter an error. commit f978076 Author: Raphael 'kena' Poss <knz@thaumogen.net> Date: Wed Dec 11 16:59:31 2019 +0100 *: allow periods (.) in usernames Requested by a user: > Currently, there is a restriction for the database username which > will limit the certificate-based authentication. It's very common to > include .local (e.g.: internal-service2.local) in the CN (Common Name) > of a certificate. The AWS Certificate Manager (ACM) won't even issue > a certificate if the "dot" (.) is not present. Release note (sql change): Usernames can now contain periods, for compatibility with certificate managers that require domain names to be used as usernames. Release note (security update): The validation rule for principal names was extended to support periods and thus allow domainname-like principals. For reference, the validation rule is currently the regular expression `^[\p{Ll}0-9_][---\p{Ll}0-9_.]+$` and a size limit of 63 UTF-8 bytes (larger usernames are rejected with an error); for comparison, PostgreSQL allows many more characters and truncates at 63 chacters silently. commit 79ca338 Author: Artem Barger <bartem@il.ibm.com> Date: Sun Feb 16 17:28:58 2020 +0200 intervalccl: optimize OverlapCoveringMerge Change OverlapCoveringMerge instead of iterating over coverings for each input range to flatten coverings, sort results and produce ranges with one pass. Reducing total complexity from O(nm) to O(n*log(n)). Signed-off-by: Artem Barger <bartem@il.ibm.com> Release note: none Release note (<category, see below>): <what> <show> <why>
Related to #41720.
Related to #44976.
This commit integrates the new concurrency package into the storage package. Each Replica is given a concurrency manager, which replaces its existing latch manager and txn wait queue. The change also uses the concurrency manager to simplify the role of the intent resolver. The intent resolver no longer directly handles WriteIntentErrors. As a result, we are able to delete the contention queue entirely.
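As a rough illustration of the shape of this change, the sketch below uses simplified stand-ins for the real `pkg/storage/concurrency` API. All names here (`Manager`, `Request`, `Guard`, `SequenceReq`, `FinishReq`, `HandleDiscoveredLocks`) are assumptions for the sake of the example and differ from the actual interfaces in the PR.

```go
// Sketch only: simplified stand-ins for the real concurrency package API.
package storage

import "context"

// Request describes what a batch needs from the concurrency manager:
// which spans it touches and on whose behalf it runs.
type Request struct {
	// spans, transaction, timestamp, ...
}

// Guard tracks the latches and lock wait-queue slots held on behalf of a
// sequenced request until they are released (or handed off to Raft).
type Guard struct{}

// Response is a placeholder for the evaluated batch response.
type Response struct{}

// Intent is a placeholder for a replicated lock discovered during evaluation.
type Intent struct{}

// Manager replaces the Replica's separate latch manager and txn wait queue
// with a single component that owns all request sequencing.
type Manager interface {
	// SequenceReq acquires latches for the request's spans and, if the
	// request conflicts with any known locks, waits in those locks'
	// wait-queues until they are resolved.
	SequenceReq(ctx context.Context, req Request) (*Guard, error)
	// FinishReq releases the latches and wait-queue slots held by g.
	FinishReq(g *Guard)
	// HandleDiscoveredLocks informs the lock table about replicated locks
	// (intents) found during evaluation but not yet tracked.
	HandleDiscoveredLocks(g *Guard, intents []Intent)
}

// Replica now delegates latching and queueing to a single Manager.
type Replica struct {
	concMgr Manager
}
```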
With this change, all requests are now sequenced through the concurrency manager. During sequencing, latches are acquired and conflicting locks are detected. If any conflicting locks are found, the request waits in those locks' wait-queues until they are resolved. This is a major departure from the previous behavior: even with the contention queue, requests ended up waiting for conflicting transactions to commit/abort in the txnWaitQueue only after at least one RPC. Now, requests wait directly next to the intents/locks they conflict with and react immediately when those intents/locks are resolved.
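To illustrate the flow, here is a rough sketch of the sequenced request path using the same simplified types as above; `sendSequenced` and `evaluate` are illustrative names, not the actual Replica methods.

```go
// Sketch of the sequenced request path, reusing the hypothetical types from
// the previous snippet; evaluate stands in for command evaluation.
func (r *Replica) sendSequenced(ctx context.Context, req Request) (Response, error) {
	// Sequencing acquires latches and waits directly in the wait-queues of
	// any conflicting locks, reacting as soon as those locks are resolved.
	g, err := r.concMgr.SequenceReq(ctx, req)
	if err != nil {
		return Response{}, err
	}
	// Latches are released after evaluation, unless responsibility for
	// releasing them has been handed off elsewhere (e.g. to a Raft proposal).
	defer r.concMgr.FinishReq(g)
	return r.evaluate(ctx, req)
}

// evaluate is a stub standing in for batch evaluation.
func (r *Replica) evaluate(ctx context.Context, req Request) (Response, error) {
	return Response{}, nil
}
```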
Once requests are sequenced by the concurrency manager, they are theoretically fully isolated from all conflicting requests. However, this is not strictly true today because we have not yet pulled all replicated locks into the concurrency manager's lock table. We will do so in a future change. Until then, the concurrency manager maintains a notion of "intent discovery", which is integrated into the Replica-level concurrency retry loop.
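The retry loop can be sketched the same way: when evaluation runs into a replicated lock (intent) that the lock table does not yet know about, the request informs the manager and re-sequences, so the next pass waits on that lock like any other. `executeWithDiscovery` and `evaluateDiscovering` are again illustrative names, not the exact functions added by this change.

```go
// Sketch of the Replica-level concurrency retry loop with intent discovery,
// reusing the hypothetical types from the earlier snippets.
func (r *Replica) executeWithDiscovery(ctx context.Context, req Request) (Response, error) {
	for {
		g, err := r.concMgr.SequenceReq(ctx, req)
		if err != nil {
			return Response{}, err
		}
		resp, discovered, err := r.evaluateDiscovering(ctx, req)
		if err == nil && len(discovered) > 0 {
			// Evaluation hit replicated locks (intents) that are not yet in
			// the lock table. Record them and retry; the next SequenceReq
			// will wait on them like any other lock.
			r.concMgr.HandleDiscoveredLocks(g, discovered)
			r.concMgr.FinishReq(g)
			continue
		}
		r.concMgr.FinishReq(g)
		return resp, err
	}
}

// evaluateDiscovering is a stub that would report intents encountered during
// evaluation alongside the response.
func (r *Replica) evaluateDiscovering(ctx context.Context, req Request) (Response, []Intent, error) {
	return Response{}, nil, nil
}
```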
Performance numbers will be published shortly. This will be followed by performance numbers using the SELECT FOR UPDATE locking (#40205) improvements that this change enables.