Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor sql/executor to support retries of batches of statements. #4576

Merged
merged 2 commits into from
Mar 1, 2016

Conversation

andreimatei
Copy link
Contributor

Previously the only SQL txns that we automatically retry were implicit
txns (single-statements outside of a user txn). With this change, we're
retrying batches of statements.

In a future change, we'll bubble up retriable errors for txns that we
don't auto-retry to the user, and allow her to manually retry.

  • unify the handling of implicit and explicit sql txns
  • automatically retry the batch of statements that contain BEGIN (the
    prefix of a txn sent as one batch)
  • made the executor and pgwire not look inside the kv txn proto any more
    to determine txn state. This was ugly. Instead, the executor and
    planner cooperate to maintain their own view of the sql txn status
  • made schema change errors from async processing insert errors into the
    Results at the appropriate position (the position of the statement
    that queued up the schema change)

Review on Reviewable

@andreimatei
Copy link
Contributor Author

I will add a test for restarts in the current PR. Turns out deterministically triggering restarts of txns for which all statements are received in one batch is not so easy.

@tamird
Copy link
Contributor

tamird commented Feb 22, 2016

Looks like a number of changes are combo'd up into one commit here.


Reviewed 16 of 18 files at r1.
Review status: 16 of 18 files reviewed at latest revision, 14 unresolved discussions.


client/txn.go, line 404 [r1] (raw file):
the closure


client/txn.go, line 406 [r1] (raw file):
double space?


client/txn.go, line 408 [r1] (raw file):
space?


client/txn.go, line 422 [r1] (raw file):
if the transaction has aborted, why is the closure being called again?


client/txn.go, line 428 [r1] (raw file):
typically we call things fn rather than closure.


client/txn.go, line 439 [r1] (raw file):
%t is not the right verb.


client/txn.go, line 449 [r1] (raw file):
instead of using firstTry, you could put


client/txn.go, line 465 [r1] (raw file):
combine the "retrying ...." log you added above with this one (and the one below).


sql/drop_test.go, line 189 [r1] (raw file):
extract "foo"


sql/plan.go, line 61 [r1] (raw file):
why change it to a pointer, then?


sql/session.proto, line 31 [r1] (raw file):
why maintain this extra information instead of reading from the kv txn proto? your commit message says it is "ugly", but that's subjective. I think duplicating this state is uglier.


sql/testdata/datetime, line 187 [r1] (raw file):
cool, but what does this have to do with the rest of this change?


sql/testdata/txn, line 116 [r1] (raw file):
what is this testing?


sql/txn.go, line 41 [r1] (raw file):
padding the git stats?


Comments from the review on Reviewable.io

@RaduBerinde
Copy link
Member

You can try using the new parallel test. Having multiple clients run transactions that modify the same data in parallel will trigger restarts. I am assuming you can have a multiple statements on a single line with ; in a logic test file.


Comments from the review on Reviewable.io

@tbg
Copy link
Member

tbg commented Feb 22, 2016

Review status: 16 of 18 files reviewed at latest revision, 46 unresolved discussions.


client/db.go, line 464 [r1] (raw file):
s/opt/_/ and remove the comment below.


client/txn.go, line 412 [r1] (raw file):
s/cl/client/?


client/txn.go, line 434 [r1] (raw file):
can't you set this unconditionally?


client/txn.go, line 479 [r1] (raw file):
when would txn be nil here?


sql/executor.go, line 225 [r1] (raw file):
why pass a pointer in? If Session needs to be mutated, probably better to return it.


sql/executor.go, line 281 [r1] (raw file):
:GoRename schemaChanger sc.


sql/executor.go, line 295 [r1] (raw file):
Use prose instead of Args: results: stuff.


sql/executor.go, line 306 [r1] (raw file):

if len(schemaChangers.schemaChangers) == 0 || disableSyncSchemaChangeExec {
  return
}
// saved indentation everywhere

sql/executor.go, line 310 [r1] (raw file):
factor this out into a var at the top of this file while you're at it.


sql/executor.go, line 340 [r1] (raw file):
what happens if there are multiple results per statement? Could we end up writing to the wrong result?


sql/executor.go, line 365 [r1] (raw file):
what about something like

type sqlTransactionState int
const (
  noTransaction sqlTransactionState = iota
  openTransaction
  failingTransaction // waiting for COMMIT/ROLLBACK
  completedTransaction // is this needed? Could just be noTransaction?
)

func (s txnState) state() sqlTransactionState {
  // return corresponding state
}

sql/executor.go, line 372 [r1] (raw file):
I'm hoping that we don't need this (#4394), but we can see about that in that issue.


sql/executor.go, line 458 [r1] (raw file):
s/exec/and exec/


sql/executor.go, line 461 [r1] (raw file):
s/may/will/?


sql/executor.go, line 470 [r1] (raw file):
not a big fan of this Args: thing.


sql/executor.go, line 502 [r1] (raw file):
can't you update the remaining statements for the next loop right here?


sql/executor.go, line 532 [r1] (raw file):
s/CL/CK/


sql/executor.go, line 534 [r1] (raw file):
s/Resume/Exec/


sql/executor.go, line 536 [r1] (raw file):
is .aborted set? Who sets it? Kind of would have expected that to happen here.


sql/executor.go, line 544 [r1] (raw file):
why?


sql/executor.go, line 552 [r1] (raw file):
but it will still have to read the descriptor. Can't you just not execute them in the first place if you know there's an error? That seems preferable.


sql/executor.go, line 560 [r1] (raw file):
I like the confused ?!?.


sql/executor.go, line 579 [r1] (raw file):
if you want an inline comment, try results *[]Result /* return values */.


sql/executor.go, line 604 [r1] (raw file):
don't need the parentheses.


sql/executor.go, line 609 [r1] (raw file):
can the case planMaker.txn == nil && opt.AutoCommit occur?


sql/executor.go, line 623 [r1] (raw file):
Ok, with these argument lists structure makes more sense. The indentation seems random though. Mixing tabs and spaces perhaps?


sql/executor.go, line 635 [r1] (raw file):
s/an/a/


sql/executor.go, line 643 [r1] (raw file):
Hmm, don't think this ought to be counting retries individually. Put a TODO here for Cuong.


sql/executor.go, line 659 [r1] (raw file):
why use time.Now() here? I thought you had a transaction already. Won't that give multi-phase txns varying timestamps (and is that how it ought to be)?


sql/executor.go, line 679 [r1] (raw file):
var remaining parser.StatementList


sql/executor.go, line 680 [r1] (raw file):
i+1 < len(stmts) for symmetry with the line below


sql/session.proto, line 31 [r1] (raw file):
I think this makes sense as long as txnAborted == true implies txn==nil - the KV transaction and SQL transaction have different lifetimes. The former ends before the latter if there was an error. I encouraged Andrei to keep this state separate because I think that trying to encode SQL-state on the KV-proto is confusing and prone to mishandling.


sql/session.proto, line 46 [r1] (raw file):
why isn't this in the transaction message?


Comments from the review on Reviewable.io

@tbg
Copy link
Member

tbg commented Feb 22, 2016

@radu I'd like to see some more controlled unit tests. Randomized testing is welcome too, but only as a next step.


Reviewed 18 of 18 files at r1.
Review status: all files reviewed at latest revision, 46 unresolved discussions.


Comments from the review on Reviewable.io

@andreimatei
Copy link
Contributor Author

Review status: 10 of 18 files reviewed at latest revision, 46 unresolved discussions, some commit checks failed.


client/db.go, line 464 [r1] (raw file):
done


client/txn.go, line 404 [r1] (raw file):
Done.


client/txn.go, line 406 [r1] (raw file):
Done.


client/txn.go, line 408 [r1] (raw file):
Done.


client/txn.go, line 412 [r1] (raw file):
Done.


client/txn.go, line 422 [r1] (raw file):
useful when the SQL txn is still alive in an aborted state, but the KV txn is long gone. Added a comment.


client/txn.go, line 428 [r1] (raw file):
Done.


client/txn.go, line 434 [r1] (raw file):
perhaps... but that'd be misleading, suggesting that it will be used.


client/txn.go, line 439 [r1] (raw file):
Done.


client/txn.go, line 449 [r1] (raw file):
see if you like it


client/txn.go, line 465 [r1] (raw file):
done


client/txn.go, line 479 [r1] (raw file):
expanded the method comment about txn being nil


sql/drop_test.go, line 189 [r1] (raw file):
extract it from where?


sql/executor.go, line 225 [r1] (raw file):
a session is by definition an object that needs to be mutated by various people along the way. Why copy it 100 times, instead of sharing it?


sql/executor.go, line 281 [r1] (raw file):
Done.


sql/executor.go, line 295 [r1] (raw file):
It's much more clear this way. To structure the same thing as a phrase would be awkward. Like... what? "Takes in the results from all..."? Or "results is the results of..."?


sql/executor.go, line 306 [r1] (raw file):
Done.


sql/executor.go, line 310 [r1] (raw file):
Done.


sql/executor.go, line 340 [r1] (raw file):
there is one result per stmt


sql/executor.go, line 365 [r1] (raw file):
Done.


sql/executor.go, line 458 [r1] (raw file):
don't you like the tension-building crescendo?


sql/executor.go, line 461 [r1] (raw file):
Done.


sql/executor.go, line 502 [r1] (raw file):
i tried, got more spaghetti


sql/executor.go, line 532 [r1] (raw file):
Done.


sql/executor.go, line 534 [r1] (raw file):
Done.


sql/executor.go, line 536 [r1] (raw file):
can't set it here, sadly. The error might have happened when executing a COMMIT/ROLLBACK, in which case the state is not aborted.


sql/executor.go, line 544 [r1] (raw file):
cleanup. I've removed it.


sql/executor.go, line 552 [r1] (raw file):
I don't really know. The code used to always run them. @vivekmenezes should I not run them?


sql/executor.go, line 579 [r1] (raw file):
the comment is supposed to apply to all subsequent params


sql/executor.go, line 604 [r1] (raw file):
Done.


sql/executor.go, line 609 [r1] (raw file):
no. What rewrite do you suggest?


sql/executor.go, line 623 [r1] (raw file):
Done.


sql/executor.go, line 635 [r1] (raw file):
Done.


sql/executor.go, line 643 [r1] (raw file):
yeah, I was just talking to @cuongdo about this the other day. You're the proud the owner of a TODO, if you don't mind :)

Btw, we use @cockroach names in TODOs, not github names, right?


sql/executor.go, line 659 [r1] (raw file):
that's the statement timestamp, != txn timestamp


sql/executor.go, line 679 [r1] (raw file):
Done.


sql/executor.go, line 680 [r1] (raw file):
Done.


sql/plan.go, line 61 [r1] (raw file):
because I don't want to copy the session into the planner and then out from the planner. The point of the session is to be a per-connection object to be shared and updated by randos.


sql/session.proto, line 31 [r1] (raw file):
yeah... It's about separation of sql txns and KV txns. Munging them too much was making the code confusing.


sql/session.proto, line 46 [r1] (raw file):
in the what now?


sql/testdata/datetime, line 187 [r1] (raw file):
plenty - the restoration of txn_timestamp is handled by this change


sql/testdata/txn, line 116 [r1] (raw file):
improved the comment


sql/txn.go, line 41 [r1] (raw file):
:) rolled back


Comments from the review on Reviewable.io

@tamird
Copy link
Contributor

tamird commented Feb 23, 2016

Reviewed 1 of 18 files at r1, 8 of 9 files at r2.
Review status: 17 of 18 files reviewed at latest revision, 37 unresolved discussions, some commit checks failed.


sql/drop_test.go, line 189 [r1] (raw file):
meaning extract a local variable - you are repeating yourself here


sql/pgwire/v3.go, line 199 [r1] (raw file):
as stated elsewhere, i'm 👎 on this.


sql/plan.go, line 61 [r1] (raw file):
so it's about saving copies? sounds dubious to me.


sql/testdata/datetime, line 187 [r1] (raw file):
hm, ok. i haven't reviewed the meat of the change yet, but at least i don't see it from the title.


Comments from the review on Reviewable.io

@RaduBerinde
Copy link
Member

Pretty sweet change! I like the core idea of having execStmtsInCurrentTxn figure out where it needs to stop. Only minor comments so far. I still have to understand a couple of things better, esp. the Session stuff.


Reviewed 10 of 18 files at r1, 8 of 9 files at r2.
Review status: all files reviewed at latest revision, 50 unresolved discussions, some commit checks failed.


client/txn.go, line 436 [r2] (raw file):
Can txn be nil here or we will never have AutoRetry set in that case? (if the latter, explain the requirement in the fn comment)


client/txn.go, line 459 [r2] (raw file):
Could just do if !opt.AutoRetry { break } and add a default: break in the switch. Then we won't need willRetry and the code below is simplified. It also clarifies the different conditions which cause us to not retry.


client/txn.go, line 469 [r2] (raw file):
If you keep this continue block , I think it would be more clear to also move the r.Next() call here.


sql/drop_test.go, line 189 [r1] (raw file):
I suggest you implement a TestInvolvingIndex interface with a GetTestIndexName method. j/k


sql/executor.go, line 680 [r1] (raw file):
What's the point of the check? i will never be more than len(stmts)-1 and stmts[len(stmts):] is a valid expression returning an empty slice.


sql/executor.go, line 16 [r2] (raw file):
Shouldn't have forgotten that password.. :)


sql/executor.go, line 276 [r2] (raw file):
"for statements that have been received from the client in the same batch, the group consists of all statements in the same transaction."


sql/executor.go, line 280 [r2] (raw file):
"within its group" instead of "relative to.."


sql/executor.go, line 318 [r2] (raw file):
We're not really using the map as a map. Why not just store a []struct { idx int , sc SchemaChanger }? (If they get added in statement order we won't even need to sort these)


sql/executor.go, line 474 [r2] (raw file):
Do we no longer need testingWaitForMetadata? (If so, remove this variable altogether)


sql/executor.go, line 520 [r2] (raw file):
I would move this inside the if !inTxn check below


sql/executor.go, line 575 [r2] (raw file):
StatementList(nil)


sql/executor.go, line 577 [r2] (raw file):
can't we do stmts = stmts[1:] all the time?


sql/executor.go, line 636 [r2] (raw file):
This is a bit ambiguous - it could be interpreted that we require stmts to have statements belonging to a single SQL transactions. Maybe say that it consumes a prefix of stmts or something to that effect.


sql/executor.go, line 698 [r2] (raw file):
StatementList(nil) (or just nil might do)


Comments from the review on Reviewable.io

@petermattis
Copy link
Collaborator

Review status: all files reviewed at latest revision, 55 unresolved discussions, some commit checks failed.


sql/drop_test.go, line 189 [r1] (raw file):
The repetition is minimal and doesn't seem worth fixing.


sql/executor.go, line 225 [r1] (raw file):
Is the session ever mutated by Prepare? That seems wrong.

Performing a shallow-copy of the Session struct should be very fast, though I could see an argument for convenient sharing of state. It is strange that Prepare now takes a *Session while ExecuteStatements takes Session.


sql/executor.go, line 233 [r2] (raw file):
FYI, this was avoiding an allocation when creating a closure from session.getLocation. Not terribly important.


sql/executor.go, line 379 [r2] (raw file):
We're in the sql package, so this type is sql.sqlTransactionState which stutters (go vet would complain if it was exported). Perhaps remove the sql prefix: s/sqlTransactionState/transactionState/g.


sql/executor.go, line 415 [r2] (raw file):
This causes an allocation. I think ExecuteStatements should take a *Session and the Response.Session should be removed. Or you should undo the change to make planner.session a pointer.


sql/plan.go, line 63 [r2] (raw file):
I imagine you're setting a non-empty session here so that direct calls to makePlanner have the session field initialized. But releasePlanner clears the session field which means that planners retrieved using plannerPool.Get() will sometimes have a session and sometimes the field will be clear.


sql/session.proto, line 33 [r2] (raw file):
Some tabs snuck there way into this .proto file.


Comments from the review on Reviewable.io

@andreimatei
Copy link
Contributor Author

Review status: 11 of 20 files reviewed at latest revision, 54 unresolved discussions, some commit checks failed.


client/txn.go, line 436 [r2] (raw file):
right, autoRetry and txn==nil cannot happen. Added an assertion and a method comment.


client/txn.go, line 459 [r2] (raw file):
true that. Except I needed a labeled break in the switch


client/txn.go, line 469 [r2] (raw file):
no more continue


sql/executor.go, line 225 [r1] (raw file):
Hopefully the session is indeed not mutated by Prepare, as you'll notice that we weren't passing the session to the planner before. However, I don't see a fundamental reason why prepare should never mutate it. Anyway, if it'd be truly wrong for it to mutate it, we should have a mechanism for trapping changes, rather than silently ignoring them.

Principally, I think the session should be shared by everybody. A shallow copy is not doing anybody any favors.


sql/executor.go, line 680 [r1] (raw file):
you're right; I didn't realize s[len(s):] works


sql/executor.go, line 16 [r2] (raw file):
if you remember the password for andreimatei@gmail.com, let me know


sql/executor.go, line 276 [r2] (raw file):
Done.


sql/executor.go, line 280 [r2] (raw file):
Done.


sql/executor.go, line 318 [r2] (raw file):
done. anonymous structs - nice. Although the append in []anonymous becomes exciting


sql/executor.go, line 379 [r2] (raw file):
Done.


sql/executor.go, line 415 [r2] (raw file):
yup, session is now a pointer and no longer returned through Response


sql/executor.go, line 474 [r2] (raw file):
Hmm good catch. Removing this was not intentional. But I don't really know what this is and how to get it back. I'll talk to @tamird who I think introduced it.


sql/executor.go, line 520 [r2] (raw file):
Done.


sql/executor.go, line 575 [r2] (raw file):
Done.


sql/executor.go, line 577 [r2] (raw file):
Done.


sql/executor.go, line 636 [r2] (raw file):
Done.


sql/executor.go, line 698 [r2] (raw file):
yeah, just nil works. Not sure I fully understand why :)


sql/plan.go, line 61 [r1] (raw file):
it's not about saving copies as a performance optimization, it's about sharing this session as God intended. This way we make sure we don't miss updates to the session.


sql/plan.go, line 63 [r2] (raw file):
In the executor, where we get planners from the executor-specific pool, we always overwrite this session, so we don't care about it being set here. This is about all the weirdos outside of the executor that create planners, for questionable reasons.


sql/session.proto, line 31 [r1] (raw file):
The difference in lifetimes will show up most visibly when txn.Exec() will pass up retriable errors. At that point, the executor, after calling txn.Exec() will have a Txn in a non-aborted state (the Txn has a new epoch and is ready to be used). However, the sql state will be something like "waiting to a RETRY statement`.

Even without this, I've thought about it again. Introducing this struct makes code cleaner. pgwire, for example, should not be looking into a KV txn proto; it should know nothing about KV.


sql/session.proto, line 46 [r1] (raw file):
oh right, moved


sql/session.proto, line 33 [r2] (raw file):
Done.


Comments from the review on Reviewable.io

@andreimatei
Copy link
Contributor Author

Brought back testingWaitForMetadata, added a test.


Review status: 10 of 23 files reviewed at latest revision, 43 unresolved discussions, some commit checks failed.


sql/executor.go, line 474 [r2] (raw file):
brought this guy back, except now the mutex is held in reader mode to not synchronize statements.


Comments from the review on Reviewable.io

@RaduBerinde
Copy link
Member

LGTM


Review status: 10 of 23 files reviewed at latest revision, 45 unresolved discussions, some commit checks failed.


sql/executor.go, line 541 [r4] (raw file):
that


sql/txn_restart_test.go, line 63 [r4] (raw file):
This is pretty clever. But we probably need some better infrastructure for these kinds of tests going forward.


Comments from the review on Reviewable.io

@andreimatei andreimatei force-pushed the sql-batch-retry branch 2 times, most recently from 003bd06 to b10ab02 Compare February 26, 2016 19:27
@andreimatei
Copy link
Contributor Author

Review status: 10 of 24 files reviewed at latest revision, 45 unresolved discussions, some commit checks failed.


sql/executor.go, line 541 [r4] (raw file):
Done.


Comments from the review on Reviewable.io

@andreimatei
Copy link
Contributor Author

@tamird I had to relax that testingWaitForGossipUpdate guy to not bork if it had to process more than one gossip message until the callback verifies. This is because it sometimes has to because when we start waiting there's a leftover gossip update message that hasn't been processed yet. For example, this comes from an GRANT that doesn't use this mechanism to block for the update to be processed. And it'd be a bit hard to add a check for GRANT, particularly since it sometimes doesn't actually change anything in the descriptors. We could probably invent some mechanism to flush that channel on which updates come before we start waiting, but I don't think it's worth it.


Reviewed 1 of 21 files at r5.
Review status: 6 of 27 files reviewed at latest revision, 44 unresolved discussions.


Comments from the review on Reviewable.io

Previously the only SQL txns that we automatically retry were implicit
txns (single-statements outside of a user txn). With this change, we're
retrying batches of statements.

In a future change, we'll bubble up retriable errors for txns that we
don't auto-retry to the user, and allow her to manually retry.

- unify the handling of implicit and explicit sql txns
- automatically retry the batch of statements that contain BEGIN (the
  prefix of a txn sent as one batch)
- made the executor and pgwire not look inside the kv txn proto any more
  to determine txn state. This was ugly. Instead, the executor and
  planner cooperate to maintain their own view of the sql txn status
- made schema change errors from async processing insert errors into the
  Results at the appropriate position (the position of the statement
  that queued up the schema change)
andreimatei added a commit that referenced this pull request Mar 1, 2016
Refactor sql/executor to support retries of batches of statements.
@andreimatei andreimatei merged commit fed42a2 into cockroachdb:master Mar 1, 2016
@tbg
Copy link
Member

tbg commented Mar 1, 2016

Reviewed 1 of 9 files at r2, 1 of 9 files at r3, 19 of 21 files at r5, 1 of 1 files at r7, 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 28 unresolved discussions.


sql/executor.go, line 225 [r1] (raw file):

Principally, I think the session should be shared by everybody. A shallow copy is not doing anybody any favors.

Having a global which is shared among potentially concurrent (maybe not now, but soon?) actors can also be surprising and/or dangerous. Passing it around forces you to join up potentially conflicting information, which is something that's easy to just not do when sharing. I think it's ok so far, but should keep an eye on it.


sql/executor.go, line 340 [r1] (raw file):
The comment above is slightly misleading then since it makes it sound like one result per statement is just a sketchy assumption.


sql/executor.go, line 643 [r1] (raw file):
I don't think we ever clarified that. I think of it as my Github handle.


sql/executor.go, line 361 [r8] (raw file):
s/default/defaultSchemaChanger/


sql/executor.go, line 426 [r8] (raw file):
feels like this should move below s.txn == nil. Sure, they're not both going to be true, but leaving s.aborted = true after cleaning out the txn seems reasonable. It also seems like - sigh - that maybe it is better to just use s.txn.Status == roachpb.ABORTED instead of s.aborted. After all, you've encapsulated this logic away in transactionState, and you always have a non-nil s.txn when you're in state noTransaction, and to get into that state you could have (*txnState).setAborted().


sql/executor.go, line 738 [r8] (raw file):
This was motivated by Postgres' behavior, right? Can't hurt to mention that.


sql/executor.go, line 749 [r8] (raw file):
s/In/When/


sql/executor.go, line 773 [r8] (raw file):
this stuff looks like test code - what's the plan?


sql/metric_test.go, line 12 [r8] (raw file):
nit: a != e


sql/metric_test.go, line 18 [r8] (raw file):
nit: a < e


sql/txn_restart_test.go, line 64 [r8] (raw file):
I think you already figured this out, but TestingCommandFilter needs to be set before anything starts.


sql/txn_restart_test.go, line 78 [r8] (raw file):
you're saying there is code to do that already? Why not do what you describe then?


sql/txn_restart_test.go, line 133 [r8] (raw file):
testutils.IsError


testutils/sqlutils/pg_url.go, line 41 [r8] (raw file):
nit territory, but still not terribly fond of this format unless there's a reasonable amount of args complexity.


Comments from the review on Reviewable.io

@andreimatei andreimatei deleted the sql-batch-retry branch April 27, 2018 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants