
kv,storage: use a BytesMonitor to track memory allocations for scans #66362

Merged
merged 1 commit on Aug 4, 2021

Conversation

sumeerbhola
Collaborator

@sumeerbhola sumeerbhola commented Jun 11, 2021

The goal is to track memory allocations for:

  • non-local SQL=>KV requests: this can happen with joins,
    multi-tenant clusters, and if ranges move between DistSQL
    planning and execution.
  • local SQL=>KV requests for the first request by a fetcher:
    in this case the fetcher reserves a modest 1KB which can
    be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.
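
A rough sketch of the wrapper's shape (illustrative only; it assumes the usual
context and util/mon imports, and the exact field names differ):

// ScannerMemoryMonitor exposes only Grow, so scan code cannot shrink or
// close the underlying account prematurely. A zero value performs no
// accounting and is safe to use.
type ScannerMemoryMonitor struct {
	acc *mon.BoundAccount // nil when this request does no accounting
}

func (m ScannerMemoryMonitor) Grow(ctx context.Context, n int64) error {
	if m.acc == nil {
		return nil
	}
	return m.acc.Grow(ctx, n)
}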

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting: the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs #19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.

@cockroach-teamcity
Member

This change is Reviewable

Collaborator Author

@sumeerbhola sumeerbhola left a comment

This lacks tests -- I'll add after some initial comments.


@jordanlewis
Member

Wow, excellent. Thanks for doing this, I'm very excited to see it. I will review shortly. Conceptually, this is protection for the remote nodes in a scan, either a misplanned table reader or a lookup, index, or zigzag join, which we've never had.

As an idea for a full system test, let's try creating a table with very large rows, and pin the leaseholders of its primary index to node 1. Then, create a secondary index and pin it to node 2. Then, run a workload that forces an index join over the entire table and point it at node 2, so the db will be forced to retrieve all of the data from node 1 via remote batches.

I think you should be able to induce an OOM with this workload against node 1 without too much trouble by turning up the concurrency of this workload. After your change, I'd expect node 1 to be protected from OOMs.

Though you might have trouble making this work without running into the memory protection on node 2, actually. Anyway, I think there is something here hopefully.

@ajwerner
Contributor

For testing can't we do something much simpler? Consider creating two servers with very different memory budgets, put all the data on the server with a small budget, disable distsql, issue the queries from the big budget server, and make sure you get failures?

Collaborator Author

@sumeerbhola sumeerbhola left a comment

Conceptually, this is protection for the remote nodes in a scan, either a misplanned table reader or a lookup, index, or zigzag join, which we've never had.

Also, it should hopefully help for multi-tenant settings where there is no local sql execution, though a BytesMonitor for that is not hooked up yet.

There was also the repeated OOMing issue with high number of bytes in pebbleMVCCScanner https://github.com/cockroachlabs/support/issues/930#issuecomment-827798775 that was 20.2 and we never got to a proper root cause. @jordanlewis I believe your change #52496 was in 20.2, yes? So maybe it was the fact that the first fetch only reserved 1KB that was to blame. Or somehow things became non-local -- long running query where ranges moved after the plan. I'm speculating wildly.


@sumeerbhola
Collaborator Author

This tracking is only done for non-local requests. This is to
avoid double-counting: the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response.

I think this first request under-accounting for local requests can be narrowed by passing a no-memory-reserved-at-client bit, using the AdmissionHeader (this would happen in txnKVFetcher.Fetch). Well, 1KB is reserved but we can treat it as 0 unless the number of concurrently executing queries N is high enough that N*1KB becomes a significant fraction of the sql memory limit. There will still be a gap when KV releases the memory and before txnKVFetcher acquires it, but fixing that requires more intrusive changes that need to understand fanout, and I think belongs as part of a comprehensive memory-based flow control that will hopefully happen as part of #54680.
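
As a sketch of what that client-side marking could look like (the firstBatch flag and the ba variable are hypothetical placeholders):

// In txnKVFetcher (sketch): mark the first batch as having effectively no
// memory reserved at the client, so the server accounts for the response.
if f.firstBatch {
	ba.AdmissionHeader.NoMemoryReservedAtSource = true
}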

@jordanlewis
Member

There was also the repeated OOMing issue with high number of bytes in pebbleMVCCScanner cockroachlabs/support#930 (comment) that was 20.2 and we never got to a proper root cause.

In 20.2, we had further bugs that hadn't been discovered yet, such as #65881 and #66376 which I only just found yesterday. Honestly I think that most of our OOM issues are caused by imprecise lifetime control of batch pointers on the SQL side - there are a lot of bugs here that we're just discovering and I feel like if those hadn't existed we might not be in such an OOM-y place.

@jordanlewis
Member

I think this first request under-accounting for local requests can be narrowed by passing a no-memory-reserved-at-client bit, using the AdmissionHeader (this would happen in txnKVFetcher.Fetch). Well, 1KB is reserved but we can treat it as 0 unless the number of concurrently executing queries N is high enough that N*1KB becomes a significant fraction of the sql memory limit.

This would be great. Let's do it.

Member

@jordanlewis jordanlewis left a comment



pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):

		rootMonitor = r.store.getRootMemoryMonitorForKV()
		if rootMonitor == nil {
			log.Infof(ctx, "pebbleMVCCScanner: no rootMonitor")

nit: assuming this is debug leftovers?


pkg/kv/kvserver/replica_read.go, line 197 at r1 (raw file):

	var boundAccount mon.BoundAccount
	if rootMonitor != nil {
		monitor := mon.NewMonitorInheritWithLimit("scanner-mem", 0 /* limit */, rootMonitor)

I think that it might be cleaner to create a monitor that lives for the lifetime of some long-lived object (like a replica? a range?) and create and release bound accounts based on that long-lived monitor. It feels like a slight smell to have a monitor created and destroyed on every KV request. We do it that way for SQL operators, to be fair, but those are longer lived.
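
Roughly what I have in mind, as a sketch (monitor construction as in the current diff, but hoisted to startup; rootSQLMemoryMonitor, kvMemoryMonitor, and responseBytes are placeholder names):

// Once, at server startup: a long-lived child of the root monitor.
kvMemoryMonitor := mon.NewMonitorInheritWithLimit("kv-mem", 0 /* limit */, rootSQLMemoryMonitor)
kvMemoryMonitor.Start(ctx, rootSQLMemoryMonitor, mon.BoundAccount{})

// Per read-only batch: a short-lived account against that monitor.
acc := kvMemoryMonitor.MakeBoundAccount()
defer acc.Clear(ctx)
if err := acc.Grow(ctx, responseBytes); err != nil {
	// Budget exceeded: fail the request rather than risk an OOM.
	return err
}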

Let's make sure we don't get any kind of performance hit due to the slight bit of serialization this will cause on the root monitor, at least. I think a basic kv workload would help determine this.


pkg/kv/kvserver/store.go, line 2899 at r1 (raw file):

func (s *Store) getRootMemoryMonitorForKV() *mon.BytesMonitor {
	if s.cfg.MemoryMonitorForKVProvider != nil {
		return s.cfg.MemoryMonitorForKVProvider.GetRootMemoryMonitor()

I think we should give the Store its own monitor. It feels wrong to have the scan monitors be connected directly to the root monitor. As I'm saying this I don't have a strong justification, just that it feels smelly to not have a part of the hierarchy that is explicitly KV-related.


pkg/roachpb/api.proto, line 2021 at r1 (raw file):

    FROM_SQL = 1;
    ROOT_KV = 2;
    LOCAL = 4;

This is very helpful, there are other circumstances where I wished I'd had this (I'm thinking of #52863, which stalled because the extra serialization was removing all advantages in the local-only path - I needed to find a way to pass a batch pointer instead)


pkg/storage/pebble_mvcc_scanner.go, line 472 at r1 (raw file):

		if p.err = p.memMonitor.Grow(
			context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
			log.Infof(context.TODO(), "intent memory monitor error: %s", p.err.Error())

Will the error be propagated here? I'd be interested to see what the output error in SQL-land would look like if this is triggered. We might conceivably want to annotate this error with which node had the OOM condition? Which range? More details on the op being executed?


pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):

		if p.err = p.memMonitor.Grow(
			context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
			log.Infof(context.TODO(), "intent memory monitor error: %s", p.err.Error())

ditto above

Member

@tbg tbg left a comment

This looks good, thank you Sumeer!
I think @jordanlewis is right about the monitors not being set up in just the right place. Unfortunately I'm not familiar with them myself, but naively I would have hung one monitor into *StoreConfig (shared across all Stores, i.e. it's really owned by the Node). The upshot of that will be that we can adopt memory monitoring in the rest of KV easily. This is already almost the code you have, except that you're hanging s.sqlServer.rootSQLMemoryMonitor off the store config. I'm curious why you did that instead of making a leaf monitor right there.

	if !roachpb.IsAdmissionHeaderLOCAL(ba.AdmissionHeader.Source) {
		rootMonitor = r.store.getRootMemoryMonitorForKV()
		if rootMonitor == nil {
			log.Infof(ctx, "pebbleMVCCScanner: no rootMonitor")
Member

This is worthy of a return roachpb.NewError(errors.AssertionFailedf("x")), isn't it? If the monitor is missing that's bad, and logging at high frequency doesn't make it better. If this is truly optional, we shouldn't log.

// least the batch is fully evaluated. Ideally we would like to release
// after grpc has sent the response, but there are no interceptors at that
// stage. The interceptors execute before the response is marshaled in
// Server.processUnaryRPC by calling sendResponse.
Member

In theory we could use finalizers for this purpose, but I'd much prefer it if we didn't. I think this imperfect tracking is about as good as it gets.

It might be worth mentioning in this comment that we're intentionally not using finalizers because they delay GC and because they have had bugs in the past and can also prevent GC of objects with cyclic references.

@@ -1527,3 +1527,11 @@ func (c *ContentionEvent) SafeFormat(w redact.SafePrinter, _ rune) {
func (c *ContentionEvent) String() string {
	return redact.StringWithoutMarkers(c)
}

func IsAdmissionHeaderOTHER(source AdmissionHeader_Source) bool {
Member

It's intentional that IsAdmissionHeaderOTHER(100), i.e. with the LOCAL flag set, is true, right?

// BatchRequests can cause other BatchRequests (e.g. to PushTransaction),
// which may result in yet more BatchRequests (e.g. to
// ResolveIndeterminateCommit). These can cause a deadlock when using slot
// based admission control. Only FROM_SQL and ROOT_KV requests are subject
// to admission control. This behavior errs on the side of no deadlock, if
// we overlooked an instrumentation point, since the default value of OTHER
// bypasses admission control.
// LOCAL is a special bit that is only set on codepaths that are calling the
// KV api from the same node (not using gRPC). If FROM_SQL and ROOT_KV are
// not set, it represents OTHER. This allows codepaths that don't initialize
Member

"it represents OTHER" as in "LOCAL represents OTHER"? I get what you mean (100 == OTHER == 000) but it sounds a bit weird. What do you mean by "This allows codepaths that don't initialize the source to function as other"? If you don't initialize the source you have int32(0) and the comment above already explains this is OTHER. You could just say

LOCAL is a special bit that indicates that the Source is local to the KV node (i.e. bypassing the network). It has no effect if neither FROM_SQL nor ROOT_KV are set.

Member

It's also generally confusing, to have the flags partially match the meaning of the bitfield representation (I would've expected type Source int32 and then the flags are just consts). Also, is this worth keeping as a bitfield?

Why not

message Source {
  enum {
    OTHER = 0;
    FROM_SQL = 1;
    ROOT_KV = 2;
  } s = 1;
  bool local = 2;

Is the expectation that we'll pick up "many" other sources? But then won't we run out of bits and the above is better anyway?

// tracking in general is optimistic in when it stops tracking. So this one
// case of pessimism is fine and simplifies the code. An empty initialized
// ScannerMemoryMonitor is safe to use.
type ScannerMemoryMonitor struct {
Member

Doesn't matter, but you could use

type ScannerMemoryMonitor mon.BoundAccount

and then

func (b *ScannerMemoryMonitor) Grow(ctx context.Context, x int64) error {
  if b == nil { return nil }
  return (*mon.BoundAccount)(b).Grow(ctx, x)
}

and it would look maybe a tiny bit cleaner but also you don't have to worry about accidentally moving ScannerMemoryMonitor to the heap. In particular, you can pass it through an interface "for free", so you could also hand it deeper into pebble should that become useful one day.
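
For example (sketch, with ctx and monitor as stand-ins), the nil case and attaching an existing account are both trivial:

var m *ScannerMemoryMonitor       // nil receiver is fine; Grow is a no-op
_ = m.Grow(ctx, 123)

acc := monitor.MakeBoundAccount() // monitor is some *mon.BytesMonitor
m = (*ScannerMemoryMonitor)(&acc) // pointer conversion, no copy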

@@ -76,6 +78,10 @@ func (p *pebbleResults) put(key []byte, value []byte) {
	if len(p.repr) > 0 {
		p.bufs = append(p.bufs, p.repr)
	}
	if err := monitor.Grow(context.TODO(), int64(newSize)); err != nil {
		log.Infof(context.TODO(), "put memory monitor error: %s", err.Error())
Member

This is debug logging, right? You're already returning the error.

@@ -459,6 +467,11 @@ func (p *pebbleMVCCScanner) getAndAdvance() bool {
			// that lie before the resume key.
			return false
		}
		if p.err = p.memMonitor.Grow(
			context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
			log.Infof(context.TODO(), "intent memory monitor error: %s", p.err.Error())
Member

ditto

@@ -479,6 +492,11 @@ func (p *pebbleMVCCScanner) getAndAdvance() bool {
		// Note that this will trigger an error higher up the stack. We
		// continue scanning so that we can return all of the intents
		// in the scan range.
		if p.err = p.memMonitor.Grow(
			context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
			log.Infof(context.TODO(), "intent memory monitor error: %s", p.err.Error())
Member

ditto

Member

Whenever we return an error from the account, we should wrap it with information about the mvccScanner: the number of bytes in the pebbleResults and the p.intents. I hope the error from the monitor already includes the current allocations and how much was requested. We can then check that these are in line with expectations when we see such an error (and spot bugs if there are any).

I also agree with @jordanlewis that this should be a structured error, so that SQL can annotate it programmatically if desired.

@@ -479,6 +492,11 @@ func (p *pebbleMVCCScanner) getAndAdvance() bool {
		// Note that this will trigger an error higher up the stack. We
		// continue scanning so that we can return all of the intents
		// in the scan range.
		if p.err = p.memMonitor.Grow(
Member

This looked wrong to me until I realized that p.intents is a pebble.Batch and not a pebbleResults. Might be worth pointing out in a comment, I liked that we didn't have calls to Grow all over the place and I assume this is the only place where we write to p.intents so it's fine.

Collaborator Author

@sumeerbhola sumeerbhola left a comment

TFTRs!
I've addressed all the comments.



pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is worthy of a return roachpb.NewError(errors.AssertionFailedf("x")), isn't it? If the monitor is missing that's bad, and logging at high frequency doesn't make it better. If this is truly optional, we shouldn't log.

I've made this optional since (a) I am not sure if I've captured all StoreConfigs by the code in NewServer, and more importantly (b) Server is relying on there being a sqlServer -- is that lacking for serverless multi-tenant KV nodes?, (c) I am not sure if some KV commands will start executing at some point in NewServer prior to construction of the sqlServer. Though if (c) is true there is a race condition with GetRootMemoryMonitor which needs to be fixed.


pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

nit: assuming this is debug leftovers?

Removed


pkg/kv/kvserver/replica_read.go, line 197 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think that it might be cleaner to create a monitor that lives for the lifetime of some long-lived object (like a replica? a range?) and create and release bound accounts based on that long-lived monitor. It feels like a slight smell to have a monitor created and destroyed on every KV request. We do it that way for SQL operators, to be fair, but those are longer lived.

Let's make sure we don't get any kind of performance hit due to the slight bit of serialization this will cause on the root monitor, at least. I think a basic kv workload would help determine this.

Changed to Server creating a long-lived child kvMemoryMonitor.


pkg/kv/kvserver/replica_read.go, line 204 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In theory we could use finalizers for this purpose, but I'd much prefer it if we didn't. I think this imperfect tracking is about as good as it gets.

It might be worth mentioning in this comment that we're intentionally not using finalizers because they delay GC and because they have had bugs in the past and can also prevent GC of objects with cyclic references.

Done


pkg/kv/kvserver/store.go, line 2899 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think we should give the Store its own monitor. It feels wrong to have the scan monitors be connected directly to the root monitor. As I'm saying this I don't have a strong justification, just that it feels smelly to not have a part of the hierarchy that is explicitly KV-related.

Done


pkg/roachpb/api.proto, line 2015 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It's also generally confusing, to have the flags partially match the meaning of the bitfield representation (I would've expected type Source int32 and then the flags are just consts). Also, is this worth keeping as a bitfield?

Why not

message Source {
  enum {
    OTHER = 0;
    FROM_SQL = 1;
    ROOT_KV = 2;
  } s = 1;
  bool local = 2;

Is the expectation that we'll pick up "many" other sources? But then won't we run out of bits and the above is better anyway?

I was not feeling clean about this either and glad you pushed on this. I've moved it to a separate field.


pkg/roachpb/api.proto, line 2021 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

This is very helpful, there are other circumstances where I wished I'd had this (I'm thinking of #52863, which stalled because the extra serialization was removing all advantages in the local-only path - I needed to find a way to pass a batch pointer instead)

Ack


pkg/storage/mvcc.go, line 2497 at r1 (raw file):

but also you don't have to worry about accidentally moving ScannerMemoryMonitor to the heap

because it is already on the heap? Versus what I did here to place BoundAccount on the heap while leaving ScannerMemoryMonitor as pass-by-value? I'm not clear on why one is better than the other -- could you clarify?

so you could also hand it deeper into pebble should that become useful one day

If you are suggesting prolonging its lifetime, I'm not sure about that since we want to close it when the batch finishes executing. If not, I am not sure what prevents passing the current ScannerMemoryMonitor deeper -- it would need to be wrapped in an interface of course. I guess I don't understand what is meant by "pass it through an interface for free". Could you elaborate?


pkg/storage/pebble_mvcc_scanner.go, line 82 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is debug logging, right? You're already returning the error.

Yes, removed


pkg/storage/pebble_mvcc_scanner.go, line 472 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto

Removed


pkg/storage/pebble_mvcc_scanner.go, line 472 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Will the error be propagated here? I'd be interested to see what the output error in SQL-land would look like if this is triggered. We might conceivably want to annotate this error with which node had the OOM condition? Which range? More details on the op being executed?

I've added the start key of the scan and the used bytes in the BoundAccount. The error from the BoundAccount includes some state from the underlying BytesMonitor.
You'll recognize this example :)

root@:26257/defaultdb> SELECT * FROM foo@foo_attribute_idx WHERE attribute=10 AND blob LIKE 'blah%';
ERROR: scan with start key /Table/64/1/2/0: used bytes 67108864: kv-mem: memory budget exceeded: 67108864 bytes requested, 201326592 currently allocated, 209715200 bytes in budget
SQLSTATE: 53200

(this was with a hacky version that reduced the memory for the KV BytesMonitor to 200MB)
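
The annotation itself is just an error wrap along these lines (sketch; p.start and acc stand in for the scanner's start key and its bound account):

// Wrap the budget error with the scan's start key and current usage before
// propagating it up the stack.
p.err = errors.Wrapf(p.err, "scan with start key %s: used bytes %d",
	p.start, acc.Used())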


pkg/storage/pebble_mvcc_scanner.go, line 495 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This looked wrong to me until I realized that p.intents is a pebble.Batch and not a pebbleResults. Might be worth pointing out in a comment, I liked that we didn't have calls to Grow all over the place and I assume this is the only place where we write to p.intents so it's fine.

There are two places -- there is the one just above this. I've added a comment in both.


pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Whenever we return an error from the account, we should wrap it with information about the mvccScanner: the number of bytes in the pebbleResults and the p.intents. I hope the error from the monitor already includes the current allocations and how much was requested. We can then check that these are in line with expectations when we see such an error (and spot bugs if there are any).

I also agree with @jordanlewis that this should be a structured error, so that SQL can annotate it programmatically if desired.

I've added the start key of the scan and the used bytes in the BoundAccount. The error from the BoundAccount includes some state from the underlying BytesMonitor.
Do you mean a structured error defined in roachpb/errors.proto?


pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

ditto above

Removed


pkg/roachpb/api.go, line 1531 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It's intentional that IsAdmissionHeaderOTHER(100), i.e. with the LOCAL flag set, is true, right?

Yes, that was intentional. This is cleaned up now.

@sumeerbhola sumeerbhola force-pushed the kv_mem branch 2 times, most recently from 156c3d0 to 3cdf06b on June 26, 2021 01:15
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Added a simple logic test based on an example that @jordanlewis used in an issue. We don't need a distributed setup since the first request by a txnKVFetcher sets AdmissionHeader.NoMemoryReservedAtSource.


Member

@jordanlewis jordanlewis left a comment

Curious what happens if you try a TPCH workload with this patch. #64906 has some instructions to set that up. Try playing with query 4 (which uses lookup joins) at high concurrency; this has reliably OOMed a cluster quickly in my testing. If you don't feel like doing this it's fine; I will try it out sooner or later, but I'd be interested in getting your eyes on the workload as well.



pkg/sql/logictest/testdata/logic_test/mem_limit, line 40 at r3 (raw file):


query error scan with start key .* memory budget exceeded
SELECT * FROM foo@foo_attribute_idx WHERE attribute=10 AND blob LIKE 'blah%'

Thanks for constructing this example. This highlights something I didn't realize earlier - that this behavior change can lead to individual queries being uncompleteable even in a non-concurrent environment.

I think this is undesired behavior. In the ideal world, 2 things would be true:

  1. the server would never OOM
  2. queries would never return budget exceeded errors, by virtue of either spilling to disk or using smaller batches and more roundtrips.

I think this case should, ideally, be handled by having KV return a partial result, so SQL can process the partial result and ask for more when it's flushed memory back to the client.

That being said, this is probably impossible currently because of #54680, which is preventing us from returning partial results in cases like index join where SQL requests a parallel scan...

Thoughts? I guess in practice this will only kick in when the server is running low on memory anyway, so probably it won't be easily observable.


pkg/storage/pebble_mvcc_scanner.go, line 82 at r1 (raw file):

Previously, sumeerbhola wrote…

Yes, removed

nit: should we plumb a real context?


pkg/storage/pebble_mvcc_scanner.go, line 472 at r1 (raw file):

Previously, sumeerbhola wrote…

I've added the start key of the scan and the used bytes in the BoundAccount. The error from the BoundAccount includes some state from the underlying BytesMonitor.
You'll recognize this example :)

root@:26257/defaultdb> SELECT * FROM foo@foo_attribute_idx WHERE attribute=10 AND blob LIKE 'blah%';
ERROR: scan with start key /Table/64/1/2/0: used bytes 67108864: kv-mem: memory budget exceeded: 67108864 bytes requested, 201326592 currently allocated, 209715200 bytes in budget
SQLSTATE: 53200

(this was with a hacky version that reduced the memory for the KV BytesMonitor to 200MB)

👍


pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):

Previously, sumeerbhola wrote…

I've added the start key of the scan and the used bytes in the BoundAccount. The error from the BoundAccount includes some state from the underlying BytesMonitor.
Do you mean a structured error defined in roachpb/errors.proto?

I'm continually confused about error redaction, but do we need to worry that we're logging a raw key here? That would be PII.

Contributor

@andreimatei andreimatei left a comment

Thanks for working on this Sumeer!

local SQL=>KV requests for the first request by a fetcher:

Have you considered tracking this kind of allocation in the respective SQL session's monitor? We already have per-session monitors, and within those we have a per-txn one I believe. I think it'd be good to track as much of the session's usage as possible in those.



pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):

Previously, sumeerbhola wrote…

I've made this optional since (a) I am not sure if I've captured all StoreConfigs by the code in NewServer, and more importantly (b) Server is relying on there being a sqlServer -- is that lacking for serverless multi-tenant KV nodes?, (c) I am not sure if some KV commands will start executing at some point in NewServer prior to construction of the sqlServer. Though if (c) is true there is a race condition with GetRootMemoryMonitor which needs to be fixed.

I, for one, think we should resist as much as reasonable these kinds of fields being optional. I think there's such a thing as an unlimited monitor (if we really need to sometimes have stores that don't have this limit, which I also question), so the fact that some field can be nil I think only makes the code harder to read. Perhaps I can help answer some of the questions above to make this non-nilable?


pkg/kv/kvserver/replica_read.go, line 184 at r3 (raw file):

}

type evalContextWithMonitor struct {

I find the need for this struct dubious. At the very least, add a good comment pls.
But, why can't the replica's implementation of EvalContext supply the right monitor always? I think it might be worth pushing on that front.


pkg/kv/kvserver/replica_read.go, line 210 at r3 (raw file):

	// locally, or for local request has reserved no memory. Local requests
	// (typically DistSQL, though we may not have instrumented the source as SQL
	// in all cases, so some may be flowing in as OTHER), do memory accounting

I no longer see an "OTHER". Did this comment rot?


pkg/kv/kvserver/replica_read.go, line 214 at r3 (raw file):

	// request in fetcher is small (the NoMemoryReservedAtSource=true case),
	// subsequent ones use the size of the response for subsequent requests (see
	// https://github.com/cockroachdb/cockroach/pull/52496). This scheme could

you sure you want to put a link to a PR in this comment? I haven't seen that before... Consider replacing the link with more words or code references.


pkg/kv/kvserver/replica_read.go, line 238 at r3 (raw file):

	for retries := 0; ; retries++ {
		if retries > 0 {
			// It is safe to call Clear on an uninitialized BoundAccount.

nit: I don't think this comment is necessary; if someone really has that question, they can answer it very quickly by reading the Clear code. But as I was saying above, I think the code would look better if this account always existed.


pkg/kv/kvserver/store.go, line 728 at r3 (raw file):

	ProtectedTimestampCache protectedts.Cache

	// MemoryMonitorForKVProvider can be nil.

please add more to this comment, saying what this thing is.
And also say what happens if it's nil.


pkg/roachpb/api.proto, line 2036 at r3 (raw file):

  Source source = 3;

  // SourceLocation specifies physically where the call originated. LOCAL is

I think this field would benefit from more hints about how it's used


pkg/roachpb/api.proto, line 2045 at r3 (raw file):

  SourceLocation source_location = 4;

  // NoMemoryReservedAtSource is read only when SourceLocation=LOCAL, where

please add more comments to this; what you have doesn't really explain what the deal is


pkg/server/server.go, line 182 at r3 (raw file):

	gcoord          *admission.GrantCoordinator
	kvMemoryMonitor *mon.BytesMonitor

please put a comment on this guy explaining how it's tied into the other monitors, and what it serves for


pkg/server/server.go, line 612 at r3 (raw file):

		ExternalStorageFromURI:     externalStorageFromURI,
		ProtectedTimestampCache:    protectedtsProvider,
		MemoryMonitorForKVProvider: lateBoundServer,

Do we really need this dependency from the StoreConfig to the Server? If possible, I think it'd be better to initialize the monitor on all stores at Server.Start time, or until the Server is constructed below. I hope it's not needed until then, and it would save the need for the Provider interface.
Also see below for another orchestration suggestion.


pkg/server/server.go, line 757 at r3 (raw file):

	kvMemoryMonitor := mon.NewMonitorInheritWithLimit(
		"kv-mem", 0 /* limit */, sqlServer.rootSQLMemoryMonitor)
	kvMemoryMonitor.Start(ctx, sqlServer.rootSQLMemoryMonitor, mon.BoundAccount{})

if you're tying this to the "sql" monitor, consider sprinkling some comments around that guy explaining that it's not just "sql" any more


pkg/server/server.go, line 757 at r3 (raw file):

	kvMemoryMonitor := mon.NewMonitorInheritWithLimit(
		"kv-mem", 0 /* limit */, sqlServer.rootSQLMemoryMonitor)
	kvMemoryMonitor.Start(ctx, sqlServer.rootSQLMemoryMonitor, mon.BoundAccount{})

instead of tying this initialization of the Server to the kvServer, I think it'd be better if we lifted the initialization of the rootSQLMemoryMonitor out of the sqlServer, and into the orchestration code around here. Have the sqlServer take in its monitor, just like the Server now does. So, change SQLConfig to take in a monitor instead of a MemoryPoolSize.


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

}

// ScannerMemoryMonitor is used to track memory allocations when using

Do we really need this struct? Even if the scan doesn't use all of BoundAccount, would it be harmful to have access to it? BoundAccount is not a heavy-weight structure.
Between the MemoryMonitor and the BoundAccount, our memory tracking infra is confusing enough. I for one would not introduce more structs. The fact that you're naming this a "monitor" but it's actually an "account" doesn't help :)


pkg/storage/pebble_mvcc_scanner.go, line 82 at r1 (raw file):

nit: should we plumb a real context?

At the very least, I think you can easily lift the context.TODO several layers up, which is always better in my book.


pkg/storage/pebble_mvcc_scanner.go, line 472 at r3 (raw file):

		// by only accounting for the key and value bytes.
		if p.err = p.memMonitor.Grow(
			context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {

I think it's not too hard to plumb a ctx here. Try to fight against context.TODO() :P

@sumeerbhola sumeerbhola force-pushed the kv_mem branch 2 times, most recently from 2e91cdd to 54891ff on June 28, 2021 20:31
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Have you considered tracking this kind of allocation in the respective SQL session's monitor? We already have per-session monitors, and within those we have a per-txn one I believe. I think it'd be good to track as much of the session's usage as possible in those.

Do we have per-txn ones even for the leaf nodes in DistSQL evaluation? For multi-tenant KV I think we need a child that is per-tenant -- I've added a TODO in replica_read.go.

I still owe an attempt with TPCH.



pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I, for one, think we should resist as much as reasonable this kinds of fields being optional. I think there's such a thing as an unlimited monitor (if we really need to sometimes have stores that don't have this limit, which I also question), so the fact that some field can be nil I think only makes the code harder to read. Perhaps I can help answer some of the questions above to make this non-nilable?

Made this non-optional in StoreConfig and added a panic here.


pkg/kv/kvserver/replica_read.go, line 184 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I find the need for this struct dubious. At the very least, add a good comment pls.
But, why can't the replica's implementation of EvalContext supply the right monitor always? I think it might be worth pushing on that front.

There are two reasons:

  • we do not want a real ScannerMemoryMonitor for all requests (e.g. don't want it for LOCAL request where there is memory reserved at source).
  • the ScannerMemoryMonitor needs to be specific to this request so we can close it in executeReadOnlyBatchWithServersideRefreshes and we need to hang that per-request monitor somewhere, since it can't be Replica.

I've added a comment.


pkg/kv/kvserver/replica_read.go, line 210 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I no longer see an "OTHER". Did this comment rot?

It hasn't rotted but the mention of OTHER was probably confusing. The original intention was to answer the question in a reader's mind on why we additionally did not gate this on FROM_SQL. I've expanded the comment to make it clearer.


pkg/kv/kvserver/replica_read.go, line 214 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you sure you want to put a link to a PR in this comment? I haven't seen that before... Consider replacing the link with more words or code references.

Replaced with a reference to txnKVFetcher.


pkg/kv/kvserver/replica_read.go, line 238 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: I don't think this comment is necessary; if someone really has that question, they can answer it very quickly by reading the Clear code. But as I was saying above, I think the code would look better if this account always existed.

I'm not sure what we gain with an unlimited BoundAccount compared to a zero-initialized one. I think this issue is separable from a store not having a BytesMonitor.


pkg/kv/kvserver/store.go, line 728 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please add more to this comment, saying what this thing is.
And also say what happens if it's nil.

Done


pkg/roachpb/api.proto, line 2036 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think this field would benefit from more hints about how it's used

Won't that become stale and unwieldy as we add more use cases?
@jordanlewis would like to use this to optimize not serializing vectorized execution batches when constructing those batches inside the storage package.


pkg/roachpb/api.proto, line 2045 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please add more comments to this; what you have doesn't really explain what the deal is

Done


pkg/server/server.go, line 182 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please put a comment on this guy explaining how it's tied into the other monitors, and what it serves for

Done


pkg/server/server.go, line 612 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Do we really need this dependency from the StoreConfig to the Server? If possible, I think it'd be better to initialize the monitor on all stores at Server.Start time, or until the Server is constructed below. I hope it's not needed until then, and it would save the need for the Provider interface.
Also see below for another orchestration suggestion.

Done (the suggestion below)


pkg/server/server.go, line 757 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

if you're tying this to the "sql` monitor, consider sprinkling some comments around that guy explaining that it's not just "sql " any more

Done (in server_sql.go)


pkg/server/server.go, line 757 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of tying this initialization of the Server to the kvServer, I think it'd be better if we lifted the initialization of the rootSQLMemoryMonitor out of the sqlServer, and into the orchestration code around here. Have the sqlServer take in its monitor, just like the Server now does. So, change SQLConfig to take in a monitor instead of a MemoryPoolSize.

Done


pkg/sql/logictest/testdata/logic_test/mem_limit, line 40 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Thanks for constructing this example. This highlights something I didn't realize earlier - that this behavior change can lead to individual queries being uncompleteable even in a non-concurrent environment.

I think this is undesired behavior. In the ideal world, 2 things would be true:

  1. the server would never OOM
  2. queries would never return budget exceeded errors, by virtue of either spilling to disk or using smaller batches and more roundtrips.

I think this case should, ideally, be handled by having KV return a partial result, so SQL can process the partial result and ask for more when it's flushed memory back to the client.

That being said, this is probably impossible currently because of #54680, which is preventing us from returning partial results in cases like index join where SQL requests a parallel scan...

Thoughts? I guess in practice this will only kick in when the server is running low on memory anyway, so probably it won't be easily observable.

Yes, I think we need #54680. The test reflects current behavior. I've added a comment stating that this test will need to be rewritten.


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Do we really need this struct? Even if the scan doesn't use all of BoundAccount, would it be harmful to have access to it? BoundAccount is not a heavy-weight structure.

It's not heavyweight but it is confusing to have all the methods accessible when we want to make sure the caller only calls Grow. And there is the nil case, for cases (like the local one) where we are not doing memory accounting.
I am happy to take suggestions for better names. I've always been confused by the BoundAccount terminology. This is more like a QueryMemoryReserver.


pkg/storage/pebble_mvcc_scanner.go, line 82 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: should we plumb a real context?

At the very least, I think you can easily lift the context.TODO sever layers up, which is always better in my book.

I should not have been lazy. Fixed.


pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I'm continually confused about error redaction, but do we need to worry that we're logging a raw key here? That would be PII.

The code comment for Wrapf says to use errors.Wrapf(err, "%s", <unsafestring>) for error strings that may contain PII information, so I believe we are ok.


pkg/storage/pebble_mvcc_scanner.go, line 472 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think it's not too hard to plumb a ctx here. Try to fight against context.TODO() :P

Done

Member

@jordanlewis jordanlewis left a comment

Dismissed @tbg from a discussion.


pkg/sql/logictest/testdata/logic_test/mem_limit, line 40 at r3 (raw file):

Previously, sumeerbhola wrote…

Yes, I think we need #54680. The test reflects current behavior. I've added a comment stating that this test will need to be rewritten.

What I am anxious about is whether releasing this change will, in the next release, manifest as worse behavior for a user. The trouble is that if we make a situation in which a single query can, on its own, cause a budget exceeded error, there is really no recourse for a user that I can think of, which makes me uncomfortable.

Such a query, today, would either OOM a server or "get lucky" and simply succeed. Typically it would be getting lucky because 25% of system memory gives quite a bit of headroom. It doesn't seem incredibly unlikely that we have use cases in the wild that are happily running, getting lucky, without any trouble - and we'd be ending those happy cases with this patch.

I really wish we could get #54680 in for this reason, since a failure to reserve memory would simply manifest in that case as extra round trips. I almost think we should disable this patch if, by next release, we don't have #54680. But I suppose that's up for debate.


pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):

Previously, sumeerbhola wrote…

The code comment for Wrapf says Use errors.Wrapf(err, "%s", <unsafestring>) for errors strings that may contain PII information., so I believe we are ok.

👍

Collaborator Author

@sumeerbhola sumeerbhola left a comment

(ftr, adding stuff from side discussion with @jordanlewis on TPCH)

Running ./workload run tpch --queries 4 --concurrency=1024 --tolerate-errors immediately OOMs the node. The probable cause for query 4 OOMs was and continues to be in the SQL layer. KV accounting clearly was not the magic bullet to fix it.
Even though most OOMs are due to accounting gaps in SQL, I think it is still worthwhile to have this PR since it introduces plumbing for memory accounting where there was none, and can be used as a building block for introducing more memory accounting in KV.



pkg/sql/logictest/testdata/logic_test/mem_limit, line 40 at r3 (raw file):

Such a query, today, would either OOM a server or "get lucky" and simply succeed. Typically it would be getting lucky because 25% of system memory gives quite a bit of headroom. It doesn't seem incredibly unlikely that we have use cases in the wild that are happily running, getting lucky, without any trouble - and we'd be ending those happy cases with this patch.

I'm fine with a decision to hide this behind a cluster setting that is off by default if #54680 is not addressed. And another option would be to decrease the 25% headroom, though I'm not sure what all (including non-per-query state like the ts cache, latches, lockTable etc.) are relying on that and whether decreasing that is safe either.

Collaborator Author

@sumeerbhola sumeerbhola left a comment



pkg/kv/txn.go, line 962 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

(reviewable is terrible at PR-level threads, so using this instead)

I think I have finally understood the deal with the "local SQL=>KV requests for the first request by a fetcher" case, and with the new NoMemoryReservedAtSource field. That field is not explained enough, and I think that's the case because it is pretty hard to explain. I think we can simplify the tracking model a lot if we simply pass an account from the client to the server. The BoundAccount can model reservations (it has a reserved field and a separate used field). So, instead of the NoMemoryReservedAtSource field and the SourceLocation field, wouldn't it be both simpler and better for BatchRequests to take an optional reservation in the form of a *BoundAccount?
This way:

  1. The distinction between local/remote sources is turned into the distinction between a reservation existing or not existing. It's possible for the DistSender to split a request into multiple local parts, so the question becomes how to deal with the reservation then. I think we need to either split it or we'd need a thread-safe version of BoundAccount.
  2. The distinction between Local+NoMemoryReservedAtSource and Local goes away. They're both requests with a reservation, which reservation is higher or lower.
  3. In the Local cases, the scanner allocations will be accounted to the correct flow (and separately we should make flows running on the gateway account to their SQL sessions)
  4. In the case that's currently represented by a reservation on the client and no tracking on the server, we'd no longer be just hoping that the reservation is large enough. Instead, once the reservation is exhausted, the server would try to grow it and fail if it can't.

BatchRequest being a proto API, we have to use a trick to essentially pass a pointer to a Go object through it (or otherwise expand the RPC interface for the local-caller case), but it's all doable.

You raise some interesting alternatives.
I am inclined to not pass an account from client to server through a proto API. It seems hacky. The advantage of accounting to the "correct flow" I think is small enough, and only works for the local case. If we want txn level accounting in general, we need it to work for both local and remote. The main advantage is eliminating the NoMemoryReservedAtSource which admittedly is crude, and not something we would necessarily want to be the long term solution. But I am also not keen to start making DistSender changes, given that we are going to layer Streamer+DistSender in the medium term.

Long term, I think the NoMemoryReservedAtSource field goes away, since the caller should have reserved memory locally for N bytes and told the callee to not send more than N bytes (via TargetBytes or some other field). So if the callee is local it knows how many bytes are reserved. We cannot have callers not reserving up front and passing that info to the callee -- this will mean coordination across Streamer and DistSender and universal adoption (TargetBytes is currently ignored for things like retrieving intents). I've documented this in the comment for NoMemoryReservedAtSource. Is this a decent compromise?


pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
		// rootMonitor will not be nil in production settings, but we don't check
		// here due to tests that do not have a monitor.

I don't quite understand what this comment is telling me. Is it saying that, in the tests where it's nil, the code still somehow works? Consider rephrasing, cause "but we don't check" is weird (why would we check? we don't generally check pointers). Or just delete it.

I've simplified the comment.


pkg/kv/kvserver/replica_read.go, line 184 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I would find it much cleaner if we switched away from Replica implementing EvalContext and, instead, creating a dedicated struct for it that delegates to a Replica for most things (but not for this monitor). Setting or not setting a field on that struct versus switching or not switching the EvalContext implementation seems a lot simpler to trace.

Is this something optional for this PR? I am asking since I am worried about over-engineering abstractions for unknown future uses. We don't know where else we will add memory accounting, and when we do we can generalize then. I've added a TODO here with your suggestion.


pkg/kv/kvserver/replica_read.go, line 247 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

could we take this line out of the if and perform it unconditionally by creating an unbounded account if rootMonitor is nil? I think that'd be better; it'd avoid the implementation of the evalCtx being different between whatever tests and production. The less we let tests diverge, the better.

The implementation of evalCtx is different not just between tests and production but also within the latter depending on whether we are going to use a real account or not. I don't know what an "unbounded account" is -- there is MakeStandaloneBudget but it is explicitly for root monitors and calls to Grow do not check that the BytesMonitor is non-nil (they will work, with a huge value, but it seems an abuse).
Is this something you feel strongly about? If yes, we should discuss this offline, since I don't quite understand why it matters much.


pkg/kv/kvserver/store.go, line 728 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: s/Must not be nil/Is not nil

But I'm confused, since above you seemed to say that you couldn't get all tests to make this non-nil. Perhaps a better comment would be "Can be nil in some tests" or such.

Done


pkg/kv/kvserver/store.go, line 2898 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this new hero deserves a comment

Done


pkg/kv/kvserver/batcheval/eval_context.go, line 135 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Is referring to MVCC scans in this general-purpose interface a good idea? Can't we talk more generally about memory accounting? E.g. simply GetMemoryAccount().
If the fact that this monitor is only sometimes initialized makes it only suitable for scans and not for accounting any other memory (which I don't think is the case), please explain that in a comment.

And I think we should name the method, as well as the type Account rather than Monitor. Because it's an Account, not a Monitor :).

Renamed to ResponseMemoryAccount and stated that this is currently only used for scans.


pkg/roachpb/api.proto, line 2036 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Well, in an API that's generally supposed to be agnostic to the collocation of the client and the server, I feel like you've got to give some explanation about why such a surprising abstraction-breaking feature exists. We don't have to keep the comment up to date with every new use, but at least a hint would help. A natural question when seeing this field is what happens if the field is not set right. We have tests and tools that do RPCs in many ways; the internalClientAdapter is not supposed to be a load-bearing component.

Btw, I was thinking that, instead of this, if we introduced the client's node id in the BatchHeader, I wouldn't have to ask many questions. It seems that that'd be more generally useful. And it could also be set at a higher level (in the DistSender instead of being set in that weird internalClientAdapter).

Is Header.GatewayNodeID already doing the job, and slightly misnamed in that it may not be the gateway node for leaf distsql flows?


pkg/roachpb/api.proto, line 2037 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: s/LOCAL is set on codepaths that are calling the KV API from the same node (not using gRPC)./LOCAL means the client is collocated on the same node as the server.

Done

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/txn.go, line 962 at r6 (raw file):

Previously, sumeerbhola wrote…

You raise some interesting alternatives.
I am inclined to not pass an account from client to server through a proto API. It seems hacky. The advantage of accounting to the "correct flow" I think is small enough, and only works for the local case. If we want txn level accounting in general, we need it to work for both local and remote. The main advantage is eliminating the NoMemoryReservedAtSource which admittedly is crude, and not something we would necessarily want to be the long term solution. But I am also not keen to start making DistSender changes, given that we are going to layer Streamer+DistSender in the medium term.

Long term, I think the NoMemoryReservedAtSource field goes away, since the caller should have reserved memory locally for N bytes and told the callee to not send more than N bytes (via TargetBytes or some other field). So if the callee is local it knows how many bytes are reserved. We cannot have callers not reserving up front and passing that info to the callee -- this will mean coordination across Streamer and DistSender and universal adoption (TargetBytes is currently ignored for things like retrieving intents). I've documented this in the comment for NoMemoryReservedAtSource. Is this a decent compromise?

sounds good
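
Concretely, the compromise settled on here amounts to a decision like the one sketched below, using hypothetical stand-in types; the real fields are AdmissionHeader.SourceLocation and AdmissionHeader.NoMemoryReservedAtSource in pkg/roachpb:

package example

type sourceLocation int

const (
	remoteSource sourceLocation = iota
	localSource
)

type admissionHeader struct {
	SourceLocation           sourceLocation
	NoMemoryReservedAtSource bool
}

// shouldTrackResponseMemory sketches the server-side decision: charge the
// response to the root monitor for every non-local request, and for a local
// request only when the client has not already reserved memory for the
// response (i.e. the fetcher's first, 1KB-reserved request).
func shouldTrackResponseMemory(h admissionHeader) bool {
	return h.SourceLocation != localSource || h.NoMemoryReservedAtSource
}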


pkg/kv/kvserver/replica_read.go, line 184 at r3 (raw file):

Previously, sumeerbhola wrote…

Is this something optional for this PR? I am asking since I am worried about over-engineering abstractions for unknown future uses. We don't know where else we will add memory accounting, and when we do we can generalize then. I've added a TODO here with your suggestion.

optional :)


pkg/roachpb/api.proto, line 2036 at r3 (raw file):

Previously, sumeerbhola wrote…

Is Header.GatewayNodeID already doing the job, and slightly misnamed in that it may not be the gateway node for leaf distsql flows?

I forgot that existed. I think it might indicate the gateway even for requests from leaves, based on this comment at the txn level:

cockroach/pkg/kv/txn.go

Lines 52 to 56 in 4e4f31a

// gatewayNodeID, if != 0, is the ID of the node on whose behalf this
// transaction is running. Normally this is the current node, but in the case
// of Txns created on remote nodes by DistSQL this will be the gateway.
// It will be attached to all requests sent through this transaction.
gatewayNodeID roachpb.NodeID

anyway, as discussed offline, I think these fields are fine for now.


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, sumeerbhola wrote…

Do we really need this struct? Even if the scan doesn't use all of BoundAccount, would it be harmful to have access to it? BoundAccount is not a heavy-weight structure.

It's not heavyweight but it is confusing to have all the methods accessible when we want to make sure the caller only calls Grow. And there is the nil case for cases (like the local one) where we are not doing memory accounting.
I am happy to take suggestions for better names. I've always been confused by the BoundAccount terminology. This is more like a QueryMemoryReserver.

discussed offline; maybe we can make this a BoundAccount after all.

@sumeerbhola sumeerbhola force-pushed the kv_mem branch 2 times, most recently from a690006 to 1b97b29 on July 14, 2021 12:47
Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

discussed offline; maybe we can make this a BoundAccount after all.

@andreimatei @jordanlewis

Jordan pointed me to NewUnlimitedMonitor, which is a BytesMonitor and not a BoundAccount. I'm assuming it does not need to be stopped.

But there are many difficulties there:

I don't wish to create one for every read that does not need memory accounting, since it is wasted allocations. So I'd need to create and stash one that lives longer. Potentially put it in StoreConfig itself, since we are already remembering to create one for it in server.go. That means plumbing to grab it from a Replica, which means adding another method. We'd still pay the BoundAccount allocation for paths that don't need to track memory, but that isn't a big deal. It's the unnecessary plumbing that is most bothersome. Also the plumbing will be missing for some tests -- the same tests for which Replica.store.getRootMemoryMonitorForKV() currently returns nil. So we'd need to find all those and fix them.

Then we also need to make sure that no one calls one of these mvcc methods without initializing MVCC{Scan,Get}Options.MemoryAccount to a *BoundAccount (this includes tests for the MVCC methods and pebbleMVCCScanner). This all seems unnecessarily complicated if a default-initialized ResponseMemoryAccount works. I think the big objection to this was the long comment on the lack of a Shrink method -- I've added one to address that objection (even though it isn't used). IMO, given a choice between complex plumbing and a simple wrapper that achieve the same default behavior, one should choose the latter.

I've added some of this in the code comment for ResponseMemoryAccount.
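
To make the trade-off concrete, here is a rough sketch of the wrapper being argued for, with a stand-in for mon.BoundAccount and hypothetical names: a narrow Grow/Shrink surface whose zero value does no accounting, so code paths without a monitor need no plumbing at all.

package example

import (
	"context"
	"errors"
)

var errBudgetExceeded = errors.New("memory budget exceeded")

// boundAccount is a stand-in for mon.BoundAccount.
type boundAccount struct {
	used, limit int64
}

func (b *boundAccount) grow(n int64) error {
	if b.used+n > b.limit {
		return errBudgetExceeded
	}
	b.used += n
	return nil
}

func (b *boundAccount) shrink(n int64) {
	b.used -= n
}

// ResponseMemoryAccount exposes only Grow and Shrink; its zero value
// (wrapped == nil) tracks nothing.
type ResponseMemoryAccount struct {
	wrapped *boundAccount
}

func (a ResponseMemoryAccount) Grow(ctx context.Context, n int64) error {
	if a.wrapped == nil {
		return nil
	}
	return a.wrapped.grow(n)
}

func (a ResponseMemoryAccount) Shrink(ctx context.Context, n int64) {
	if a.wrapped != nil {
		a.wrapped.shrink(n)
	}
}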

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, sumeerbhola wrote…

@andreimatei @jordanlewis

Jordan pointed me to NewUnlimitedMonitor, which is a BytesMonitor and not a BoundAccount. I'm assuming it does not need to be stopped.

But there are many difficulties there:

I don't wish to create one for every read that does not need memory accounting, since it is wasted allocations. So I'd need to create and stash one that lives longer. Potentially put it in StoreConfig itself, since we are already remembering to create one for it in server.go. That means plumbing to grab it from a Replica, which means adding another method. We'd still pay the BoundAccount allocation for paths that don't need to track memory, but that isn't a big deal. It's the unnecessary plumbing that is most bothersome. Also the plumbing will be missing for some tests -- the same tests for which Replica.store.getRootMemoryMonitorForKV() currently returns nil. So we'd need to find all those and fix them.

Then we also need to make sure that no one calls one of these mvcc methods without initializing MVCC{Scan,Get}Options.MemoryAccount to a *BoundAccount (this includes tests for the MVCC methods and pebbleMVCCScanner). This all seems unnecessarily complicated if a default-initialized ResponseMemoryAccount works. I think the big objection to this was the long comment on the lack of a Shrink method -- I've added one to address that objection (even though it isn't used). IMO, given a choice between complex plumbing and a simple wrapper that achieve the same default behavior, one should choose the latter.

I've added some of this in the code comment for ResponseMemoryAccount.

I'm fine with it like this, since it's reached the point where it's not worth arguing.
But let me take a crack at it myself. Will report back soon.

Contributor

@andreimatei andreimatei left a comment


:LGTM: but is the new functionality tested at all? Because at some point when I was hacking around I made scans use unlimited accounts instead of accounts tied to the kvMonitor, and nothing seemed to fail. It's possible that I was holding it wrong but, if not, I do think some test should be added checking that account limits are respected. At least a unit test at the level of the pebbleMVCCScanner.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm fine with it like this, since it's reached the point where it's not worth arguing.
But let me take a crack at it myself. Will report back soon.

See the last two commits here; squash what you like out of them

I've gotten rid of ResponseMemoryAccount by making the zero value of *BoundAccount act as a no-op. There's a difference between a no-op account and an unlimited account - you'd be able to query the latter for its current usage - and for a while I thought about having both, but I let go of the unbounded one because it's not really needed at the moment.

I've made my peace with evalContextWithAccount and left it alone - but added a sync.Pool for it.
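
Roughly, the two changes look like the sketch below (stand-in types, not the real mon.BoundAccount or evalContextWithAccount): a nil account is a no-op, and the per-request wrapper is recycled through a sync.Pool instead of being allocated on every read.

package example

import (
	"context"
	"sync"
)

// account is a stand-in for mon.BoundAccount after this change.
type account struct {
	used int64
}

// Grow is safe to call on a nil receiver; a nil account simply does not
// track anything, so readers without accounting can pass nil.
func (a *account) Grow(ctx context.Context, n int64) error {
	if a == nil {
		return nil
	}
	a.used += n
	return nil
}

// evalContextWithAccount pairs the per-request account with the underlying
// eval context (other fields omitted); instances are recycled via a
// sync.Pool so wrapping a request does not allocate every time.
type evalContextWithAccount struct {
	acct *account
}

var evalCtxPool = sync.Pool{
	New: func() interface{} { return &evalContextWithAccount{} },
}

func newEvalCtxWithAccount(acct *account) *evalContextWithAccount {
	ec := evalCtxPool.Get().(*evalContextWithAccount)
	ec.acct = acct
	return ec
}

func (ec *evalContextWithAccount) release() {
	ec.acct = nil // don't retain the account beyond the request
	evalCtxPool.Put(ec)
}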

Collaborator Author

@sumeerbhola sumeerbhola left a comment


is the new functionality tested at all?

There is the new logic test that fails due to the memory accounting, which validates some of the end-to-end plumbing.
I was waiting for the discussion to settle before adding a test for pebbleMVCCScanner.

I am not sure if/where to add KV tests for testing get/scan/reverse-scan -- what do you think?
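
One shape such a test could take, sketched with hypothetical stand-ins rather than the real pebbleMVCCScanner: give the account a tiny limit and assert that a scan over more data than the limit fails.

package example

import (
	"errors"
	"testing"
)

var errBudgetExceeded = errors.New("memory budget exceeded")

// limitedAccount is a stand-in for a BoundAccount with a small limit.
type limitedAccount struct {
	used, limit int64
}

func (a *limitedAccount) Grow(n int64) error {
	if a.used+n > a.limit {
		return errBudgetExceeded
	}
	a.used += n
	return nil
}

// scanKVs stands in for the scanner loop: each key/value pair is charged to
// the account before it is buffered in the response.
func scanKVs(kvs [][]byte, acct *limitedAccount) error {
	for _, kv := range kvs {
		if err := acct.Grow(int64(len(kv))); err != nil {
			return err
		}
	}
	return nil
}

func TestScanRespectsMemoryLimit(t *testing.T) {
	acct := &limitedAccount{limit: 16}
	kvs := [][]byte{make([]byte, 10), make([]byte, 10)} // 20 bytes total
	if err := scanKVs(kvs, acct); !errors.Is(err, errBudgetExceeded) {
		t.Fatalf("expected budget exceeded error, got %v", err)
	}
}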

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

See the last two commits here; squash what you like out of them

Where?

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, sumeerbhola wrote…

See the last two commits here; squash what you like out of them

Where?

here, sorry

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

here, sorry

@jordanlewis do you have any concerns about the BoundAccount changes in andreimatei@8c23104#diff-dc061ae803d00d0e6b1857ea6479a03f89f1562f844b09066dcec4238481d1ca

@jordanlewis
Member

@jordanlewis do you have any concerns about the BoundAccount changes in andreimatei/cockroach@8c23104#diff-dc061ae803d00d0e6b1857ea6479a03f89f1562f844b09066dcec4238481d1ca

I'm uncomfortable with changing BoundAccount to be pass by reference as part of this PR, that's a pretty big change. The memory accounting stuff is fragile and I'm frankly scared of a big change like that. We can do it if you insist but maybe in another PR? I'm also a little concerned w/ performance - maybe changing to pass by reference will cause a bunch of these things to escape to the heap? We do store a ton of them in various structs (like aggregation buckets).

This is definitely fearmongering so I apologize but I don't think it's a trivial thing to do this change.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):
(copying comment from wrong place here)

I'm uncomfortable with changing BoundAccount to be pass by reference as part of this PR, that's a pretty big change. The memory accounting stuff is fragile and I'm frankly scared of a big change like that. We can do it if you insist but maybe in another PR? I'm also a little concerned w/ performance - maybe changing to pass by reference will cause a bunch of these things to escape to the heap? We do store a ton of them in various structs (like aggregation buckets).

This is definitely fearmongering so I apologize but I don't think it's a trivial thing to do this change.

Just to make sure we're on the same page - what pass-by-reference are you talking about?

Member

@jordanlewis jordanlewis left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

(copying comment from wrong place here)

I'm uncomfortable with changing BoundAccount to be pass by reference as part of this PR, that's a pretty big change. The memory accounting stuff is fragile and I'm frankly scared of a big change like that. We can do it if you insist but maybe in another PR? I'm also a little concerned w/ performance - maybe changing to pass by reference will cause a bunch of these things to escape to the heap? We do store a ton of them in various structs (like aggregation buckets).

This is definitely fearmongering so I apologize but I don't think it's a trivial thing to do this change.

Just to make sure we're on the same page - what pass-by-reference are you talking about?

The BoundAccount receiver was changed to a pointer receiver. This means that the BoundAccount is passed to the methods as a pointer, not a struct. That's the change I'm referring to.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

The BoundAccount receiver was changed to a pointer receiver. This means that the BoundAccount is passed to the methods as a pointer, not a struct. That's the change I'm referring to.

the methods on the BoundAccount that matter always had a pointer receiver; they have to as the thing is obviously mutable. I've only changed Used/Monitor/allocated - minuscule leaf methods that are probably inlined everywhere. I've checked, and neither of the 3 causes the receiver (well, technically the value pointed-to by the receiver) to escape.
My patch was rushed - the way I had written it, there was no point in making the 3 methods take a ptr receiver because they were not protecting against a nil pointer. That's fixed; now they do (and I switched one to a value receiver because it was an internal method).
Btw, I don't technically have to move Used/Monitor to ptr receiver because nobody's currently calling them on a possibly-nil pointer (for the purposes of this PR, only Grow matters). But, if we give these no-op semantics to the nil ptr, I think we should do it all the way.

Makes sense?

Collaborator Author

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, @tbg, and @yuzefovich)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

@jordanlewis do you have any concerns about the BoundAccount changes in andreimatei/cockroach@8c23104#diff-dc061ae803d00d0e6b1857ea6479a03f89f1562f844b09066dcec4238481d1ca

@yuzefovich would you be able to chime in with thoughts on this (@jordanlewis is swamped with other stuff, and said you were an expert on such memory accounting matters)?

Member

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/storage/mvcc.go, line 2486 at r3 (raw file):

Previously, sumeerbhola wrote…

@jordanlewis do you have any concerns about the BoundAccount changes in andreimatei/cockroach@8c23104#diff-dc061ae803d00d0e6b1857ea6479a03f89f1562f844b09066dcec4238481d1ca

@yuzefovich would you be able to chime in with thoughts on this (@jordanlewis is swamped with other stuff, and said you were an expert on such memory accounting matters)?

I don't see a problem in changing the BoundAccount methods to have a pointer receiver since we actually store the accounts both ways (i.e. by value and by pointer). In fact, in the newer vectorized engine we use the pointers, and I don't think we've seen any evidence of that having some negative performance impact. Andrei's justification above also sounds reasonable to me, so I'm on board with this change.

Collaborator Author

@sumeerbhola sumeerbhola left a comment


@andreimatei

PTAL
I plan to merge when CI is green since merge conflicts with other PRs are quite frequent.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)

The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
passed via the EvalContext interface. The other
alternative would be for the engine to have a BytesMonitor at
initialization time that it can use to construct a BoundAccount
for each MVCC scan, and pass it back via MVCCScanResult. This
would mean multiple BoundAccounts for a batch (since we don't
want to release memory until all the requests in the batch are
processed), and would be harder to extend to track additional
request types compared to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
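
For readers skimming the commit message, the intended lifecycle is roughly the following sketch, using stand-ins for BytesMonitor and BoundAccount; the real release point is executeReadOnlyBatchWithServersideRefreshes.

package example

import "context"

// account and monitor are stand-ins for mon.BoundAccount and mon.BytesMonitor.
type account struct {
	used int64
}

func (a *account) Grow(ctx context.Context, n int64) error {
	a.used += n
	return nil
}

// Clear releases everything charged to the account.
func (a *account) Clear(ctx context.Context) {
	a.used = 0
}

type monitor struct{}

func (m *monitor) MakeBoundAccount() account {
	return account{}
}

// executeReadOnlyBatch sketches the lifecycle: one account per read batch,
// grown while MVCC scans buffer results, and released when batch evaluation
// returns, since we cannot hook into the point where gRPC actually releases
// the marshalled response.
func executeReadOnlyBatch(ctx context.Context, m *monitor, eval func(*account) error) error {
	acct := m.MakeBoundAccount()
	defer acct.Clear(ctx)
	return eval(&acct)
}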
@sumeerbhola
Collaborator Author

bors r+

@craig
Contributor

craig bot commented Aug 4, 2021

Build succeeded:
