Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: perform local calls in a goroutine #11196

Merged

Conversation

petermattis
Copy link
Collaborator

@petermattis petermattis commented Nov 23, 2016

Perform local calls in a goroutine so that they can be cancelled and so
that they don't block the caller indefinitely.

See #10427


This change is Reviewable

@petermattis
Copy link
Collaborator Author

I'm going to try this PR on delta sans #10931.

@petermattis petermattis force-pushed the pmattis/cancellable-local-calls branch from 13379af to 88eb869 Compare November 23, 2016 14:10
@tamird
Copy link
Contributor

tamird commented Nov 23, 2016

Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 1 unresolved discussion.


pkg/kv/transport.go, line 184 at r1 (raw file):

		}

		go func() {

it would be mildly cleaner to share this with the below go routine

batchFn := client.client.Batch
if localServer := gt.rpcContext.GetLocalInternalServerForAddr(addr); enableLocalCalls && localServer != nil {
		batchFn = localServer.Batch
		// Clone the request. At the time of writing, Replica may mutate it
		// during command execution which can lead to data races.
		//
		// TODO(tamird): we should clone all of client.args.Header, but the
		// assertions in protoutil.Clone fire and there seems to be no
		// reasonable workaround.
		origTxn := client.args.Txn
		if origTxn != nil {
			clonedTxn := origTxn.Clone()
			client.args.Txn = &clonedTxn
		}
}
go func() {
  reply, err := batchFn(..)
  ...
}()

Comments from Reviewable

@petermattis
Copy link
Collaborator Author

Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 1876 at r2 (raw file):

				// Synchronously process any intents that need resolving here in order
				// to apply back pressure on the client which generated them.
				r.store.intentResolver.processIntents(r, propResult.Intents)

This is actually semi-synchronous as it uses RunLimitedAsyncTask internally.


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/cancellable-local-calls branch from 7a6dd5c to d0bda68 Compare November 23, 2016 15:12
@petermattis
Copy link
Collaborator Author

I'm going to push a build with this second commit to delta.


Review status: 1 of 3 files reviewed at latest revision, 2 unresolved discussions.


Comments from Reviewable

Perform local calls in a goroutine so that they can be cancelled and so
that they don't block the caller indefinitely.

See cockroachdb#10427
@petermattis petermattis force-pushed the pmattis/cancellable-local-calls branch from d0bda68 to 377e23d Compare November 23, 2016 15:13
@tamird
Copy link
Contributor

tamird commented Nov 23, 2016

second commit seems good, but hard to predict what effect it will have!


Reviewed 2 of 2 files at r3, 2 of 2 files at r4.
Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 1669 at r4 (raw file):

	}
	if result.Local.intents != nil && len(*result.Local.intents) > 0 {
		log.Eventf(ctx, "submitting %d intents to asynchronous processing",

this has rotted


pkg/storage/replica.go, line 1671 at r4 (raw file):

		log.Eventf(ctx, "submitting %d intents to asynchronous processing",
			len(*result.Local.intents))
		r.store.intentResolver.processIntents(r, *result.Local.intents)

should this be result.Local.detachIntents()? if so the check above should probably be simplified: if intents := result.Local.detachIntents(); len(intents) > 0 { ...


pkg/storage/replica.go, line 1873 at r4 (raw file):

			// in processRaftCommand.
			endCmds = nil
			if propResult.Intents != nil {

any reason this doesn't check length?


pkg/storage/replica.go, line 4239 at r4 (raw file):

		// to nudge the intents in case they're expired; next time around we'll
		// hopefully have more luck.
		r.store.intentResolver.processIntents(r, *result.Local.intents)

should this be result.Local.detachIntents()? if so the check above should probably be simplified: if intents := result.Local.detachIntents(); len(intents) > 0 { ...


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/cancellable-local-calls branch 2 times, most recently from 2e532e9 to 6e31b79 Compare November 23, 2016 15:35
@petermattis
Copy link
Collaborator Author

Review status: 2 of 5 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


pkg/kv/transport.go, line 184 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote… > it would be mildly cleaner to share this with the below go routine > ``` > batchFn := client.client.Batch > if localServer := gt.rpcContext.GetLocalInternalServerForAddr(addr); enableLocalCalls && localServer != nil { > batchFn = localServer.Batch > // Clone the request. At the time of writing, Replica may mutate it > // during command execution which can lead to data races. > // > // TODO(tamird): we should clone all of client.args.Header, but the > // assertions in protoutil.Clone fire and there seems to be no > // reasonable workaround. > origTxn := client.args.Txn > if origTxn != nil { > clonedTxn := origTxn.Clone() > client.args.Txn = &clonedTxn > } > } > go func() { > reply, err := batchFn(..) > ... > }() > ```
That

pkg/storage/replica.go, line 1669 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote… > this has rotted
Its only sort of inaccurate. `processIntents` is semi-synchronous (or is that semi-asynchronous). I've renamed that method back to `processIntentsAsync` now that `processIntentsAsync` was no longer being used.

pkg/storage/replica.go, line 1671 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote… > should this be `result.Local.detachIntents()`? if so the check above should probably be simplified: `if intents := result.Local.detachIntents(); len(intents) > 0 { ...`
Done.

pkg/storage/replica.go, line 1873 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote… > any reason this doesn't check length?
Done.

pkg/storage/replica.go, line 4239 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote… > should this be `result.Local.detachIntents()`? if so the check above should probably be simplified: `if intents := result.Local.detachIntents(); len(intents) > 0 { ...`
Done.

Comments from Reviewable

@petermattis
Copy link
Collaborator Author

Review status: 2 of 5 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


pkg/kv/transport.go, line 184 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote… > That
`client.client.Batch` and `localServer.Batch` have different signatures.

Comments from Reviewable

@tamird
Copy link
Contributor

tamird commented Nov 23, 2016

:lgtm:


Reviewed 3 of 3 files at r5.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/kv/transport.go, line 184 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote… > `client.client.Batch` and `localServer.Batch` have different signatures.
True, it would have to begin with ``` batchFn := func(ctx context.Context, args *roachpb.BatchRequest) error { return client.client.Batch(ctx, args) } ```

Comments from Reviewable

@@ -600,22 +609,6 @@ func (r *Replica) handleLocalEvalResult(
// Non-state updates and actions.
// ======================

if originReplica.StoreID == r.store.StoreID() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe a more direct assertion here makes sense. Otherwise we're just going to assert the lResult is empty, and then only if shouldAssert is true. Might be better to call it out directly and add a comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call what out directly? That lResult.intents should be nil?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, suggestion was to make sure it's detached here with an assertion and a comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1869,6 +1869,14 @@ func (r *Replica) tryAddWriteCmd(
// Set endCmds to nil because they have already been invoked
// in processRaftCommand.
endCmds = nil
if len(propResult.Intents) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A problem is that this isn't always the path that a finished request follows. Original proposals which succeed but are re-proposed and cancelled commands will both leave intents to the GC. Maybe that's OK, although knowing our luck, there'll be some degenerate use case that creates massive amounts of intent garbage. After some thought I'm not coming up with any good solutions. However, let's add a comment to the effect that these kinds of requests will generate GC-ready intents.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added.

reply, err := localServer.Batch(gt.opts.ctx, &client.args)
gt.setPending(client.args.Replica, false)
done <- BatchCall{Reply: reply, Err: err}
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have any impact on the single-node benchmarks? Just curious; I think this is the right thing to do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check.

@@ -302,26 +265,6 @@ func (ir *intentResolver) maybePushTransactions(
// differently and would be better served by different entry points,
// but combining them simplifies the plumbing necessary in Replica.
func (ir *intentResolver) processIntentsAsync(r *Replica, intents []intentsWithArg) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, we're passing wait=true here to RunLimitedAsyncTask means that something which horks the intent queue will starve other work. I think you might consider what's being done DistSender.sendPartialBatchAsync. There, we pass wait=false to Stopper.RunLimitedAsyncTask, and on stop.ErrThrottled, we synchronously send the partial batch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there isn't much point in not using the blocked goroutine. I'd like to do this in a follow-up PR, though. I'll file an issue.

@petermattis petermattis force-pushed the pmattis/cancellable-local-calls branch from 6e31b79 to 99c8f31 Compare November 23, 2016 16:45
@petermattis
Copy link
Collaborator Author

The change to local call handling is across the board negative for the single-node benchmarks:

name               old time/op  new time/op  delta
KVInsert1_SQL-8     341µs ± 3%   377µs ± 5%  +10.57%   (p=0.000 n=9+10)
KVInsert10_SQL-8    497µs ± 1%   521µs ± 2%   +4.74%    (p=0.000 n=8+9)
KVInsert100_SQL-8  1.49ms ± 1%  1.54ms ± 1%   +2.88%  (p=0.000 n=10+10)
KVUpdate1_SQL-8     510µs ± 2%   579µs ± 1%  +13.49%   (p=0.000 n=10+8)
KVUpdate10_SQL-8    758µs ± 1%   836µs ± 1%  +10.22%   (p=0.000 n=10+8)
KVUpdate100_SQL-8  2.96ms ± 3%  3.02ms ± 2%   +1.92%  (p=0.007 n=10+10)
KVDelete1_SQL-8     459µs ± 4%   513µs ± 4%  +11.74%  (p=0.000 n=10+10)
KVDelete10_SQL-8    664µs ± 1%   736µs ± 1%  +10.77%   (p=0.000 n=10+9)
KVDelete100_SQL-8  2.40ms ± 3%  2.50ms ± 1%   +3.87%   (p=0.000 n=10+8)
KVScan1_SQL-8       215µs ± 3%   227µs ± 1%   +5.62%   (p=0.000 n=10+9)
KVScan10_SQL-8      246µs ± 2%   252µs ± 1%   +2.37%   (p=0.000 n=10+9)
KVScan100_SQL-8     476µs ± 2%   518µs ± 1%   +8.80%  (p=0.000 n=10+10)

I'm actually somewhat surprised the effect is this large. And I expected the effect to diminish with larger batch sizes such as the 10 and 100 benchmarks. I also looked at disabling the local call optimization completely, but that is a 30% perf hit.

@petermattis petermattis force-pushed the pmattis/cancellable-local-calls branch from 99c8f31 to 2fa544e Compare November 23, 2016 17:13
Process intents synchronously on the goroutine which generated them.
@petermattis petermattis force-pushed the pmattis/cancellable-local-calls branch from 2fa544e to 9858253 Compare November 23, 2016 17:17
@spencerkimball
Copy link
Member

Wow, that's really a nasty performance hit. On the other hand, optimizing this seems like a low priority.

@petermattis
Copy link
Collaborator Author

I'll file an issue about revisiting the performance hit. It is somewhat higher than I expected. And more variable than I expected. I would have expected an across-the-board impact on these benchmarks which are almost always performing either 1PC transactions or read-only operations. Why does KVScan1 only slow down by 12µs while KVDelete1 slows down by 54µs?

@petermattis petermattis merged commit 9045576 into cockroachdb:master Nov 23, 2016
@petermattis petermattis deleted the pmattis/cancellable-local-calls branch November 23, 2016 17:57
@bdarnell
Copy link
Contributor

:lgtm:


Reviewed 1 of 1 files at r1, 2 of 3 files at r5, 2 of 2 files at r6.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


Comments from Reviewable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants