kv,storage: use a BytesMonitor to track memory allocations for scans #66362
Conversation
This lacks tests -- I'll add after some initial comments.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jordanlewis, @nvanbenschoten, and @tbg)
Wow, excellent. Thanks for doing this, I'm very excited to see it. I will review shortly. Conceptually, this is protection for the remote nodes in a scan, either a misplanned table reader or a lookup, index, or zigzag join, which we've never had. As an idea for a full system test, let's try creating a table with very large rows, and pin the leaseholders of its primary index to node 1. Then, create a secondary index and pin it to node 2. Then, run a workload that forces an index join over the entire table and point it at node 2, so the db will be forced to retrieve all of the data from node 1 via remote batches. I think you should be able to induce an OOM with this workload against node 1 without too much trouble by turning up the concurrency of this workload. After your change, I'd expect node 1 to be protected from OOMs. Though you might have trouble making this work without running into the memory protection on node 2, actually. Anyway, I think there is something here hopefully.
For testing, can't we do something much simpler? Consider creating two servers with very different memory budgets, put all the data on the server with a small budget, disable distsql, issue the queries from the big-budget server, and make sure you get failures?
Conceptually, this is protection for the remote nodes in a scan, either a misplanned table reader or a lookup, index, or zigzag join, which we've never had.
Also, it should hopefully help for multi-tenant settings where there is no local sql execution, though a BytesMonitor for that is not hooked up yet.
There was also the repeated OOMing issue with a high number of bytes in pebbleMVCCScanner
https://github.com/cockroachlabs/support/issues/930#issuecomment-827798775 that was 20.2 and we never got to a proper root cause. @jordanlewis I believe your change #52496 was in 20.2, yes? So maybe it was the fact that the first fetch only reserved 1KB that was to blame. Or somehow things became non-local -- long running query where ranges moved after the plan. I'm speculating wildly.
I think this first-request under-accounting for local requests can be narrowed by passing a no-memory-reserved-at-client bit, using the AdmissionHeader (this would happen in txnKVFetcher.Fetch). Well, 1KB is reserved, but we can treat it as 0 unless the number of concurrently executing queries N is high enough that N*1KB becomes a significant fraction of the sql memory limit. There will still be a gap when KV releases the memory and before …
In 20.2, we had further bugs that hadn't been discovered yet, such as #65881 and #66376 which I only just found yesterday. Honestly I think that most of our OOM issues are caused by imprecise lifetime control of batch pointers on the SQL side - there are a lot of bugs here that we're just discovering and I feel like if those hadn't existed we might not be in such an OOM-y place.
This would be great. Let's do it.
pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):
    rootMonitor = r.store.getRootMemoryMonitorForKV()
    if rootMonitor == nil {
      log.Infof(ctx, "pebbleMVCCScanner: no rootMonitor")
nit: assuming this is debug leftovers?
pkg/kv/kvserver/replica_read.go, line 197 at r1 (raw file):
    var boundAccount mon.BoundAccount
    if rootMonitor != nil {
      monitor := mon.NewMonitorInheritWithLimit("scanner-mem", 0 /* limit */, rootMonitor)
I think that it might be cleaner to create a monitor that lives for the lifetime of some long-lived object (like a replica? a range?) and create and release bound accounts based on that long-lived monitor. It feels like a slight smell to have a monitor created and destroyed on every KV request. We do it that way for SQL operators, to be fair, but those are longer lived.
Let's make sure we don't get any kind of performance hit due to the slight bit of serialization this will cause on the root monitor, at least. I think a basic kv workload would help determine this.
pkg/kv/kvserver/store.go, line 2899 at r1 (raw file):
    func (s *Store) getRootMemoryMonitorForKV() *mon.BytesMonitor {
      if s.cfg.MemoryMonitorForKVProvider != nil {
        return s.cfg.MemoryMonitorForKVProvider.GetRootMemoryMonitor()
I think we should give the Store its own monitor. It feels wrong to have the scan monitors be connected directly to the root monitor. As I'm saying this I don't have a strong justification, just that it feels smelly to not have a part of the hierarchy that is explicitly KV-related.
pkg/roachpb/api.proto, line 2021 at r1 (raw file):
    FROM_SQL = 1;
    ROOT_KV = 2;
    LOCAL = 4;
This is very helpful, there are other circumstances where I wished I'd had this (I'm thinking of #52863, which stalled because the extra serialization was removing all advantages in the local-only path - I needed to find a way to pass a batch pointer instead)
pkg/storage/pebble_mvcc_scanner.go, line 472 at r1 (raw file):
    if p.err = p.memMonitor.Grow(
      context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
      log.Infof(context.TODO(), "intent memory monitor error: %s", p.err.Error())
Will the error be propagated here? I'd be interested to see what the output error in SQL-land would look like if this is triggered. We might conceivably want to annotate this error with which node had the OOM condition? Which range? More details on the op being executed?
pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):
    if p.err = p.memMonitor.Grow(
      context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
      log.Infof(context.TODO(), "intent memory monitor error: %s", p.err.Error())
ditto above
This looks good, thank you Sumeer!
I think @jordanlewis is right about the monitors not being set up in just the right place. Unfortunately I'm not familiar with them myself, but naively I would have hung one monitor into *StoreConfig (shared across all Stores, i.e. it's really owned by the Node). The upshot of that will be that we can adopt memory monitoring in the rest of KV easily. This is already almost the code you have, except that you're hanging s.sqlServer.rootSQLMemoryMonitor off the store config. I'm curious why you did that instead of making a leaf monitor right there.
pkg/kv/kvserver/replica_read.go (outdated):

    if !roachpb.IsAdmissionHeaderLOCAL(ba.AdmissionHeader.Source) {
      rootMonitor = r.store.getRootMemoryMonitorForKV()
      if rootMonitor == nil {
        log.Infof(ctx, "pebbleMVCCScanner: no rootMonitor")
This is worthy of a return roachpb.NewError(errors.AssertionFailedf("x")), isn't it? If the monitor is missing that's bad, and logging at high frequency doesn't make it better. If this is truly optional, we shouldn't log.
pkg/kv/kvserver/replica_read.go (outdated):

    // least the batch is fully evaluated. Ideally we would like to release
    // after grpc has sent the response, but there are no interceptors at that
    // stage. The interceptors execute before the response is marshaled in
    // Server.processUnaryRPC by calling sendResponse.
In theory we could use finalizers for this purpose, but I'd much prefer it if we didn't. I think this imperfect tracking is about as good as it gets.
It might be worth mentioning in this comment that we're intentionally not using finalizers because they delay GC and because they have had bugs in the past and can also prevent GC of objects with cyclic references.
pkg/roachpb/api.go (outdated):

    @@ -1527,3 +1527,11 @@ func (c *ContentionEvent) SafeFormat(w redact.SafePrinter, _ rune) {
    func (c *ContentionEvent) String() string {
      return redact.StringWithoutMarkers(c)
    }

    func IsAdmissionHeaderOTHER(source AdmissionHeader_Source) bool {
It's intentional that IsAdmissionHeaderOTHER(100), i.e. with the LOCAL flag set, is true, right?
pkg/roachpb/api.proto (outdated):

    // BatchRequests can cause other BatchRequests (e.g. to PushTransaction),
    // which may result in yet more BatchRequests (e.g. to
    // ResolveIndeterminateCommit). These can cause a deadlock when using slot
    // based admission control. Only FROM_SQL and ROOT_KV requests are subject
    // to admission control. This behavior errs on the side of no deadlock, if
    // we overlooked an instrumentation point, since the default value of OTHER
    // bypasses admission control.
    // LOCAL is a special bit that is only set on codepaths that are calling the
    // KV api from the same node (not using gRPC). If FROM_SQL and ROOT_KV are
    // not set, it represents OTHER. This allows codepaths that don't initialize
"it represents OTHER" as in "LOCAL represents OTHER"? I get what you mean (100 == OTHER == 000) but it sounds a bit weird. What do you mean by "This allows codepaths that don't initialize the source to function as other"? If you don't initialize the source you have int32(0) and the comment above already explains this is OTHER. You could just say:

LOCAL is a special bit that indicates that the Source is local to the KV node (i.e. bypassing the network). It has no effect if neither FROM_SQL nor ROOT_KV are set.
It's also generally confusing to have the flags partially match the meaning of the bitfield representation (I would've expected type Source int32 and then the flags are just consts). Also, is this worth keeping as a bitfield? Why not

    message Source {
      enum {
        OTHER = 0;
        FROM_SQL = 1;
        ROOT_KV = 2;
      } s = 1;
      bool local = 2;
    }

Is the expectation that we'll pick up "many" other sources? But then won't we run out of bits and the above is better anyway?
pkg/storage/mvcc.go (outdated):

    // tracking in general is optimistic in when it stops tracking. So this one
    // case of pessimism if fine and simplifies the code. An empty initialized
    // ScannerMemoryMonitor is safe to use.
    type ScannerMemoryMonitor struct {
Doesn't matter, but you could use

    type ScannerMemoryMonitor mon.BoundAccount

and then

    func (b *ScannerMemoryMonitor) Grow(ctx context.Context, x int64) error {
      if b == nil {
        return nil
      }
      return (*mon.BoundAccount)(b).Grow(ctx, x)
    }

and it would look maybe a tiny bit cleaner, but also you don't have to worry about accidentally moving ScannerMemoryMonitor to the heap. In particular, you can pass it through an interface "for free", so you could also hand it deeper into pebble should that become useful one day.
pkg/storage/pebble_mvcc_scanner.go (outdated):

    @@ -76,6 +78,10 @@ func (p *pebbleResults) put(key []byte, value []byte) {
      if len(p.repr) > 0 {
        p.bufs = append(p.bufs, p.repr)
      }
      if err := monitor.Grow(context.TODO(), int64(newSize)); err != nil {
        log.Infof(context.TODO(), "put memory monitor error: %s", err.Error())
This is debug logging, right? You're already returning the error.
pkg/storage/pebble_mvcc_scanner.go (outdated):

    @@ -459,6 +467,11 @@ func (p *pebbleMVCCScanner) getAndAdvance() bool {
      // that lie before the resume key.
      return false
    }
    if p.err = p.memMonitor.Grow(
      context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
      log.Infof(context.TODO(), "intent memory monitor error: %s", p.err.Error())
ditto
pkg/storage/pebble_mvcc_scanner.go (outdated):

    @@ -479,6 +492,11 @@ func (p *pebbleMVCCScanner) getAndAdvance() bool {
      // Note that this will trigger an error higher up the stack. We
      // continue scanning so that we can return all of the intents
      // in the scan range.
      if p.err = p.memMonitor.Grow(
        context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
        log.Infof(context.TODO(), "intent memory monitor error: %s", p.err.Error())
ditto
Whenever we return an error from the account, we should wrap it with information about the mvccScanner: the number of bytes in the pebbleResults and the p.intents. I hope the error from the monitor already includes the current allocations and how much was requested. We can then check that these are in line with expectations when we see such an error (and spot bugs if there are any).
I also agree with @jordanlewis that this should be a structured error, so that SQL can annotate it programmatically if desired.
pkg/storage/pebble_mvcc_scanner.go (outdated):

    @@ -479,6 +492,11 @@ func (p *pebbleMVCCScanner) getAndAdvance() bool {
      // Note that this will trigger an error higher up the stack. We
      // continue scanning so that we can return all of the intents
      // in the scan range.
      if p.err = p.memMonitor.Grow(
This looked wrong to me until I realized that p.intents is a pebble.Batch and not a pebbleResults. Might be worth pointing out in a comment. I liked that we didn't have calls to Grow all over the place, and I assume this is the only place where we write to p.intents so it's fine.
(force-pushed 32b1c42 to 76c95ff)
TFTRs!
I've addressed all the comments.
pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is worthy of a return roachpb.NewError(errors.AssertionFailedf("x")), isn't it? If the monitor is missing that's bad, and logging at high frequency doesn't make it better. If this is truly optional, we shouldn't log.
I've made this optional since (a) I am not sure if I've captured all StoreConfigs by the code in NewServer, and more importantly (b) Server is relying on there being a sqlServer -- is that lacking for serverless multi-tenant KV nodes?, and (c) I am not sure if some KV commands will start executing at some point in NewServer prior to construction of the sqlServer. Though if (c) is true there is a race condition with GetRootMemoryMonitor which needs to be fixed.
pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
nit: assuming this is debug leftovers?
Removed
pkg/kv/kvserver/replica_read.go, line 197 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I think that it might be cleaner to create a monitor that lives for the lifetime of some long-lived object (like a replica? a range?) and create and release bound accounts based on that long-lived monitor. It feels like a slight smell to have a monitor created and destroyed on every KV request. We do it that way for SQL operators, to be fair, but those are longer lived.
Let's make sure we don't get any kind of performance hit due to the slight bit of serialization this will cause on the root monitor, at least. I think a basic kv workload would help determine this.
Changed to Server creating a long-lived child kvMemoryMonitor.
pkg/kv/kvserver/replica_read.go, line 204 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
In theory we could use finalizers for this purpose, but I'd much prefer it if we didn't. I think this imperfect tracking is about as good as it gets.
It might be worth mentioning in this comment that we're intentionally not using finalizers because they delay GC and because they have had bugs in the past and can also prevent GC of objects with cyclic references.
Done
pkg/kv/kvserver/store.go, line 2899 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I think we should give the Store its own monitor. It feels wrong to have the scan monitors be connected directly to the root monitor. As I'm saying this I don't have a strong justification, just that it feels smelly to not have a part of the hierarchy that is explicitly KV-related.
Done
pkg/roachpb/api.proto, line 2015 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's also generally confusing to have the flags partially match the meaning of the bitfield representation (I would've expected type Source int32 and then the flags are just consts). Also, is this worth keeping as a bitfield? Why not

    message Source { enum { OTHER = 0; FROM_SQL = 1; ROOT_KV = 2; } s = 1; bool local = 2;

Is the expectation that we'll pick up "many" other sources? But then won't we run out of bits and the above is better anyway?
I was not feeling clean about this either and glad you pushed on this. I've moved it to a separate field.
pkg/roachpb/api.proto, line 2021 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
This is very helpful, there are other circumstances where I wished I'd had this (I'm thinking of #52863, which stalled because the extra serialization was removing all advantages in the local-only path - I needed to find a way to pass a batch pointer instead)
Ack
pkg/storage/mvcc.go, line 2497 at r1 (raw file):
but also you don't have to worry about accidentally moving ScannerMemoryMonitor to the heap
because it is already on the heap? Versus what I did here to place BoundAccount on the heap while leaving ScannerMemoryMonitor as pass-by-value? I'm not clear on why one is better than the other -- could you clarify?
so you could also hand it deeper into pebble should that become useful one day
If you are suggesting prolonging its lifetime, I'm not sure about that since we want to close it when the batch finishes executing. If not, I am not sure what prevents passing the current ScannerMemoryMonitor deeper -- it would need to be wrapped in an interface of course. I guess I don't understand what is meant by "pass it through an interface for free". Could you elaborate?
pkg/storage/pebble_mvcc_scanner.go, line 82 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is debug logging, right? You're already returning the error.
Yes, removed
pkg/storage/pebble_mvcc_scanner.go, line 472 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto
Removed
pkg/storage/pebble_mvcc_scanner.go, line 472 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Will the error be propagated here? I'd be interested to see what the output error in SQL-land would look like if this is triggered. We might conceivably want to annotate this error with which node had the OOM condition? Which range? More details on the op being executed?
I've added the start key of the scan and the used bytes in the BoundAccount. The error from the BoundAccount includes some state from the underlying BytesMonitor.
You'll recognize this example :)
root@:26257/defaultdb> SELECT * FROM foo@foo_attribute_idx WHERE attribute=10 AND blob LIKE 'blah%';
ERROR: scan with start key /Table/64/1/2/0: used bytes 67108864: kv-mem: memory budget exceeded: 67108864 bytes requested, 201326592 currently allocated, 209715200 bytes in budget
SQLSTATE: 53200
(this was with a hacky version that reduced the memory for the KV BytesMonitor to 200MB)
pkg/storage/pebble_mvcc_scanner.go, line 495 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This looked wrong to me until I realized that p.intents is a pebble.Batch and not a pebbleResults. Might be worth pointing out in a comment. I liked that we didn't have calls to Grow all over the place, and I assume this is the only place where we write to p.intents so it's fine.
There are two places -- there is the one just above this. I've added a comment in both.
pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Whenever we return an error from the account, we should wrap it with information about the mvccScanner: the number of bytes in the pebbleResults and the p.intents. I hope the error from the monitor already includes the current allocations and how much was requested. We can then check that these are in line with expectations when we see such an error (and spot bugs if there are any). I also agree with @jordanlewis that this should be a structured error, so that SQL can annotate it programmatically if desired.
I've added the start key of the scan and the used bytes in the BoundAccount. The error from the BoundAccount includes some state from the underlying BytesMonitor.
Do you mean a structured error defined in roachpb/errors.proto?
pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
ditto above
Removed
pkg/roachpb/api.go, line 1531 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's intentional that IsAdmissionHeaderOTHER(100), i.e. with the LOCAL flag set, is true, right?
Yes, that was intentional. This is cleaned up now.
(force-pushed 156c3d0 to 3cdf06b)
Added a simple logic test based on an example that @jordanlewis used in an issue. We don't need a distributed setup since the first request by a txnKVFetcher sets AdmissionHeader.NoMemoryReservedAtSource.
Curious what happens if you try a TPCH workload with this patch. #64906 has some instructions to set that up. Try playing with query 4 (which uses lookup joins) at high concurrency, this has reliably OOMed a cluster quickly in my testing. If you don't feel like doing this it's fine, I will try it out sooner or later, but I'd be interested in getting your eyes on the workload as well.
pkg/sql/logictest/testdata/logic_test/mem_limit, line 40 at r3 (raw file):
    query error scan with start key .* memory budget exceeded
    SELECT * FROM foo@foo_attribute_idx WHERE attribute=10 AND blob LIKE 'blah%'
Thanks for constructing this example. This highlights something I didn't realize earlier - that this behavior change can lead to individual queries being uncompleteable even in a non-concurrent environment.
I think this is undesired behavior. In the ideal world, 2 things would be true:
- the server would never OOM
- queries would never return budget exceeded errors, by virtue of either spilling to disk or using smaller batches and more roundtrips.
I think this case should, ideally, be handled by having KV return a partial result, so SQL can process the partial result and ask for more when it's flushed memory back to the client.
That being said, this is probably impossible currently because of #54680, which is preventing us from returning partial results in cases like index join where SQL requests a parallel scan...
Thoughts? I guess in practice this will only kick in when the server is running low on memory anyway, so probably it won't be easily observable.
pkg/storage/pebble_mvcc_scanner.go, line 82 at r1 (raw file):
Previously, sumeerbhola wrote…
Yes, removed
nit: should we plumb a real context?
pkg/storage/pebble_mvcc_scanner.go, line 472 at r1 (raw file):
Previously, sumeerbhola wrote…
I've added the start key of the scan and the used bytes in the BoundAccount. The error from the BoundAccount includes some state from the underlying BytesMonitor. You'll recognize this example :)

    root@:26257/defaultdb> SELECT * FROM foo@foo_attribute_idx WHERE attribute=10 AND blob LIKE 'blah%';
    ERROR: scan with start key /Table/64/1/2/0: used bytes 67108864: kv-mem: memory budget exceeded: 67108864 bytes requested, 201326592 currently allocated, 209715200 bytes in budget
    SQLSTATE: 53200

(this was with a hacky version that reduced the memory for the KV BytesMonitor to 200MB)
👍
pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):
Previously, sumeerbhola wrote…
I've added the start key of the scan and the used bytes in the BoundAccount. The error from the BoundAccount includes some state from the underlying BytesMonitor. Do you mean a structured error defined in roachpb/errors.proto?
I'm continually confused about error redaction, but do we need to worry that we're logging a raw key here? That would be PII.
Thanks for working on this Sumeer!
local SQL=>KV requests for the first request by a fetcher:
Have you considered tracking this kind of allocation in the respective SQL session's monitor? We already have per-session monitors, and within those we have a per-txn one I believe. I think it'd be good to track as much of the session's usage as possible in those.
pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):
Previously, sumeerbhola wrote…
I've made this optional since (a) I am not sure if I've captured all StoreConfigs by the code in NewServer, and more importantly (b) Server is relying on there being a sqlServer -- is that lacking for serverless multi-tenant KV nodes?, and (c) I am not sure if some KV commands will start executing at some point in NewServer prior to construction of the sqlServer. Though if (c) is true there is a race condition with GetRootMemoryMonitor which needs to be fixed.
I, for one, think we should resist as much as reasonable these kinds of fields being optional. I think there's such a thing as an unlimited monitor (if we really need to sometimes have stores that don't have this limit, which I also question), so the fact that some field can be nil I think only makes the code harder to read. Perhaps I can help answer some of the questions above to make this non-nilable?
pkg/kv/kvserver/replica_read.go, line 184 at r3 (raw file):
    }

    type evalContextWithMonitor struct {
I find the need for this struct dubious. At the very least, add a good comment pls.
But why can't the replica's implementation of EvalContext supply the right monitor always? I think it might be worth pushing on that front.
pkg/kv/kvserver/replica_read.go, line 210 at r3 (raw file):
    // locally, or for local request has reserved no memory. Local requests
    // (typically DistSQL, though we may not have instrumented the source as SQL
    // in all cases, so some may be flowing in as OTHER), do memory accounting
I no longer see an "OTHER". Did this comment rot?
pkg/kv/kvserver/replica_read.go, line 214 at r3 (raw file):
    // request in fetcher is small (the NoMemoryReservedAtSource=true case),
    // subsequent ones use the size of the response for subsequent requests (see
    // https://github.com/cockroachdb/cockroach/pull/52496). This scheme could
you sure you want to put a link to a PR in this comment? I haven't seen that before... Consider replacing the link with more words or code references.
pkg/kv/kvserver/replica_read.go, line 238 at r3 (raw file):
    for retries := 0; ; retries++ {
      if retries > 0 {
        // It is safe to call Clear on an uninitialized BoundAccount.
nit: I don't think this comment is necessary; if someone really has that question, they can answer it very quickly by reading the Clear code. But as I was saying above, I think the code would look better if this account always existed.
pkg/kv/kvserver/store.go, line 728 at r3 (raw file):
    ProtectedTimestampCache protectedts.Cache
    // MemoryMonitorForKVProvider can be nil.
please add more to this comment, saying what this thing is.
And also say what happens if it's nil.
pkg/roachpb/api.proto, line 2036 at r3 (raw file):
    Source source = 3;
    // SourceLocation specifies physically where the call originated. LOCAL is
I think this field would benefit from more hints about how it's used
pkg/roachpb/api.proto, line 2045 at r3 (raw file):
    SourceLocation source_location = 4;
    // NoMemoryReservedAtSource is read only when SourceLocation=LOCAL, where
please add more comments to this; what you have doesn't really explain what the deal is
pkg/server/server.go, line 182 at r3 (raw file):
    gcoord *admission.GrantCoordinator
    kvMemoryMonitor *mon.BytesMonitor
please put a comment on this guy explaining how it's tied into the other monitors, and what it's for
pkg/server/server.go, line 612 at r3 (raw file):
    ExternalStorageFromURI: externalStorageFromURI,
    ProtectedTimestampCache: protectedtsProvider,
    MemoryMonitorForKVProvider: lateBoundServer,
Do we really need this dependency from the StoreConfig to the Server? If possible, I think it'd be better to initialize the monitor on all stores at Server.Start time, or when the Server is constructed below. I hope it's not needed until then, and it would save the need for the Provider interface.

Also see below for another orchestration suggestion.
pkg/server/server.go, line 757 at r3 (raw file):
    kvMemoryMonitor := mon.NewMonitorInheritWithLimit(
      "kv-mem", 0 /* limit */, sqlServer.rootSQLMemoryMonitor)
    kvMemoryMonitor.Start(ctx, sqlServer.rootSQLMemoryMonitor, mon.BoundAccount{})
if you're tying this to the "sql" monitor, consider sprinkling some comments around that guy explaining that it's not just "sql" any more
pkg/server/server.go, line 757 at r3 (raw file):
kvMemoryMonitor := mon.NewMonitorInheritWithLimit( "kv-mem", 0 /* limit */, sqlServer.rootSQLMemoryMonitor) kvMemoryMonitor.Start(ctx, sqlServer.rootSQLMemoryMonitor, mon.BoundAccount{})
instead of tying this initialization of the Server to the kvServer, I think it'd be better if we lifted the initialization of the rootSQLMemoryMonitor out of the sqlServer and into the orchestration code around here. Have the sqlServer take in its monitor, just like the Server now does. So, change SQLConfig to take in a monitor instead of a MemoryPoolSize.
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
} // ScannerMemoryMonitor is used to track memory allocations when using
Do we really need this struct? Even if the scan doesn't use all of BoundAccount, would it be harmful to have access to it? BoundAccount is not a heavy-weight structure.
Between the MemoryMonitor and the BoundAccount, our memory tracking infra is confusing enough. I for one would not introduce more structs. The fact that you're naming this a "monitor" but it's actually an "account" doesn't help :)
pkg/storage/pebble_mvcc_scanner.go, line 82 at r1 (raw file):
nit: should we plumb a real context?
At the very least, I think you can easily lift the context.TODO several layers up, which is always better in my book.
pkg/storage/pebble_mvcc_scanner.go, line 472 at r3 (raw file):
// by only accounting for the key and value bytes. if p.err = p.memMonitor.Grow( context.TODO(), int64(len(p.curRawKey)+len(p.curValue))); p.err != nil {
I think it's not too hard to plumb a ctx here. Try to fight against context.TODO() :P
Force-pushed 2e91cdd to 54891ff (Compare)
Have you considered tracking this kind of allocation in the respective SQL session's monitor? We already have per-session monitors, and within those we have a per-txn one I believe. I think it'd be good to track as much of the session's usage as possible in those.
Do we have per-txn ones even for the leaf nodes in DistSQL evaluation? For multi-tenant KV I think we need a child that is per-tenant -- I've added a TODO in replica_read.go.
I still owe an attempt with TPCH.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I, for one, think we should resist as much as reasonable this kinds of fields being optional. I think there's such a thing as an unlimited monitor (if we really need to sometimes have stores that don't have this limit, which I also question), so the fact that some field can be nil I think only makes the code harder to read. Perhaps I can help answer some of the questions above to make this non-nilable?
Made this non-optional in StoreConfig and added a panic here.
pkg/kv/kvserver/replica_read.go, line 184 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I find the need for this struct dubious. At the very least, add a good comment pls.
But, why can't the replica's implementation of EvalContext supply the right monitor always? I think it might be worth pushing on that front.
There are two reasons:
- we do not want a real ScannerMemoryMonitor for all requests (e.g. we don't want it for a LOCAL request where there is memory reserved at the source).
- the ScannerMemoryMonitor needs to be specific to this request so we can close it in executeReadOnlyBatchWithServersideRefreshes, and we need to hang that per-request monitor somewhere, which can't be the Replica.
I've added a comment.
pkg/kv/kvserver/replica_read.go, line 210 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I no longer see an "OTHER". Did this comment rot?
It hasn't rotted but the mention of OTHER was probably confusing. The original intention was to answer the question in a reader's mind on why we additionally did not gate this on FROM_SQL. I've expanded the comment to make it clearer.
pkg/kv/kvserver/replica_read.go, line 214 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
you sure you want to put a link to a PR in this comment? I haven't seen that before... Consider replacing the link with more words or code references.
Replaced with a reference to txnKVFetcher.
pkg/kv/kvserver/replica_read.go, line 238 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: I don't think this comment is necessary; if someone really has that question, they can answer it very quickly by reading the Clear code. But as I was saying above, I think the code would look better if this account always existed.
I'm not sure what we gain with an unlimited BoundAccount compared to a zero-initialized one. I think this issue is separable from a store not having a BytesMonitor.
pkg/kv/kvserver/store.go, line 728 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please add more to this comment, saying what this thing is.
And also say what happens if it's nil.
Done
pkg/roachpb/api.proto, line 2036 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think this field would benefit from more hints about how it's used
Won't that become stale and unwieldy as we add more use cases?
@jordanlewis would like to use this to optimize not serializing vectorized execution batches when constructing those batches inside the storage package.
pkg/roachpb/api.proto, line 2045 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please add more comments to this; what you have doesn't really explain what the deal is
Done
pkg/server/server.go, line 182 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please put a comment on this guy explaining how it's tied into the other monitors, and what it serves for
Done
pkg/server/server.go, line 612 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Do we really need this dependency from the StoreConfig to the Server? If possible, I think it'd be better to initialize the monitor on all stores at Server.Start time, or to wait until the Server is constructed below. I hope it's not needed until then, and it would save the need for the Provider interface.
Also see below for another orchestration suggestion.
Done (the suggestion below)
pkg/server/server.go, line 757 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
if you're tying this to the "sql" monitor, consider sprinkling some comments around that guy explaining that it's not just "sql" any more.
Done (in server_sql.go)
pkg/server/server.go, line 757 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
instead of tying this initialization of the Server to the kvServer, I think it'd be better if we lifted the initialization of the rootSQLMemoryMonitor out of the sqlServer and into the orchestration code around here. Have the sqlServer take in its monitor, just like the Server now does. So, change SQLConfig to take in a monitor instead of a MemoryPoolSize.
Done
pkg/sql/logictest/testdata/logic_test/mem_limit, line 40 at r3 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Thanks for constructing this example. This highlights something I didn't realize earlier - that this behavior change can lead to individual queries being uncompleteable even in a non-concurrent environment.
I think this is undesired behavior. In the ideal world, 2 things would be true:
- the server would never OOM
- queries would never return budget exceeded errors, by virtue of either spilling to disk or using smaller batches and more roundtrips.
I think this case should, ideally, by handled by having KV return a partial result, so SQL can process the partial result and ask for more when it's flushed memory back to the client.
That being said, this is probably impossible currently because of #54680, which is preventing us from returning partial results in cases like index join where SQL requests a parallel scan...
Thoughts? I guess in practice this will only kick in when the server is running low on memory anyway, so probably it won't be easily observable.
Yes, I think we need #54680. The test reflects current behavior. I've added a comment stating that this test will need to be rewritten.
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Do we really need this struct? Even if the scan doesn't use all of BoundAccount, would it be harmful to have access to it? BoundAccount is not a heavy-weight structure.
It's not heavyweight, but it is confusing to have all the methods accessible when we want to make sure the caller only calls Grow. And there is the nil case for requests (like the local one) where we are not doing memory accounting.
I am happy to take suggestions for better names. I've always been confused by the BoundAccount terminology. This is more like a QueryMemoryReserver.
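To make the trade-off concrete, here is a minimal sketch of the wrapper idea under discussion: an account-like type that exposes only Grow, whose zero value is a safe no-op for paths that do no accounting. All names below are illustrative stand-ins, not the real storage package code.

```go
package main

import "fmt"

// account is a toy stand-in for mon.BoundAccount.
type account struct {
	used, limit int64
}

// scannerMemoryMonitor wraps an account so the scanner can only grow it.
// A zero-value monitor (nil account) silently accepts all allocations,
// matching the "no accounting for LOCAL requests" case.
type scannerMemoryMonitor struct {
	acc *account // nil means accounting is disabled
}

// Grow is the only operation exposed to the scanner.
func (m scannerMemoryMonitor) Grow(n int64) error {
	if m.acc == nil {
		return nil // zero value: no accounting
	}
	if m.acc.used+n > m.acc.limit {
		return fmt.Errorf("scan memory budget exceeded: %d+%d > %d",
			m.acc.used, n, m.acc.limit)
	}
	m.acc.used += n
	return nil
}

func main() {
	var noop scannerMemoryMonitor // default-initialized: always succeeds
	fmt.Println(noop.Grow(1 << 20))

	limited := scannerMemoryMonitor{acc: &account{limit: 1024}}
	fmt.Println(limited.Grow(512))
	fmt.Println(limited.Grow(1024)) // exceeds the 1024-byte budget
}
```

The design choice being debated is exactly whether this restricted surface is worth a new type, or whether handing the scanner the full account would do.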
pkg/storage/pebble_mvcc_scanner.go, line 82 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: should we plumb a real context?
At the very least, I think you can easily lift the context.TODO several layers up, which is always better in my book.
I should not have been lazy. Fixed.
pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I'm continually confused about error redaction, but do we need to worry that we're logging a raw key here? That would be PII.
The code comment for Wrapf says "Use errors.Wrapf(err, "%s", <unsafestring>) for errors strings that may contain PII information", so I believe we are ok.
pkg/storage/pebble_mvcc_scanner.go, line 472 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think it's not too hard to plumb a ctx here. Try to fight against context.TODO() :P
Done
Dismissed @tbg from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/sql/logictest/testdata/logic_test/mem_limit, line 40 at r3 (raw file):
Previously, sumeerbhola wrote…
Yes, I think we need #54680. The test reflects current behavior. I've added a comment stating that this test will need to be rewritten.
What I am anxious about is whether releasing this change will, in the next release, manifest as worse behavior for a user. The trouble is that if we make a situation in which a single query can, on its own, cause a budget exceeded error, there is really no recourse for a user that I can think of, which makes me uncomfortable.
Such a query, today, would either OOM a server or "get lucky" and simply succeed. Typically it would be getting lucky because 25% of system memory gives quite a bit of headroom. It doesn't seem incredibly unlikely that we have use cases in the wild that are happily running, getting lucky, without any trouble - and we'd be ending those happy cases with this patch.
I really wish we could get #54680 in for this reason, since a failure to reserve memory would simply manifest in that case as extra round trips. I almost think we should disable this patch if, by next release, we don't have #54680. But I suppose that's up for debate.
pkg/storage/pebble_mvcc_scanner.go, line 497 at r1 (raw file):
Previously, sumeerbhola wrote…
The code comment for Wrapf says "Use errors.Wrapf(err, "%s", <unsafestring>) for errors strings that may contain PII information", so I believe we are ok.
👍
(ftr, adding stuff from side discussion with @jordanlewis on TPCH)
Running ./workload run tpch --queries 4 --concurrency=1024 --tolerate-errors immediately OOMs the node. The probable cause for query 4 OOMs was and continues to be in the SQL layer. KV accounting clearly was not the magic bullet to fix it.
Even though most OOMs are due to accounting gaps in SQL, I think it is still worthwhile to have this PR, since it introduces plumbing for memory accounting where there was none, and can be used as a building block for introducing more memory accounting in KV.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, and @tbg)
pkg/sql/logictest/testdata/logic_test/mem_limit, line 40 at r3 (raw file):
Such a query, today, would either OOM a server or "get lucky" and simply succeed. Typically it would be getting lucky because 25% of system memory gives quite a bit of headroom. It doesn't seem incredibly unlikely that we have use cases in the wild that are happily running, getting lucky, without any trouble - and we'd be ending those happy cases with this patch.
I'm fine with a decision to hide this behind a cluster setting that is off by default if #54680 is not addressed. And another option would be to decrease the 25% headroom, though I'm not sure what all (including non-per-query state like the ts cache, latches, lockTable etc.) are relying on that and whether decreasing that is safe either.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, and @tbg)
pkg/kv/txn.go, line 962 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
(reviewable is terrible at PR-level threads, so using this instead)
I think I have finally understood the deal with the "local SQL=>KV requests for the first request by a fetcher" case, and with the new NoMemoryReservedAtSource field. That field is not explained enough, and I think that's the case because it is pretty hard to explain. I think we can simplify the tracking model a lot if we simply pass an account from the client to the server. The BoundAccount can model reservations (it has a reserved field and a separate used field). So, instead of the NoMemoryReservedAtSource field and the SourceLocation field, wouldn't it be both simpler and better for BatchRequests to take an optional reservation in the form of a *BoundAccount?
This way:
- The distinction between local/remote sources is turned into the distinction between a reservation existing or not existing. It's possible for the DistSender to split a request into multiple local parts, so the question becomes how to deal with the reservation then. I think we need to either split it or we'd need a thread-safe version of BoundAccount.
- The distinction between Local+NoMemoryReservedAtSource and Local goes away. They're both requests with a reservation, which reservation is higher or lower.
- In the Local cases, the scanner allocations will be accounted to the correct flow (and separately we should make flows running on the gateway account to their SQL sessions).
- In the case that's currently represented by a reservation on the client and no tracking on the server, we'd no longer be just hoping that the reservation is large enough. Instead, once the reservation is exhausted, the server would try to grow it and fail if it can't.
BatchRequest being a proto API, we have to use a trick to essentially pass a pointer to a Go object through it (or otherwise expand the RPC interface for the local-caller case), but it's all doable.
You raise some interesting alternatives.
I am inclined to not pass an account from client to server through a proto API. It seems hacky. The advantage of accounting to the "correct flow" I think is small enough, and it only works for the local case. If we want txn-level accounting in general, we need it to work for both local and remote. The main advantage is eliminating the NoMemoryReservedAtSource field, which admittedly is crude, and not something we would necessarily want to be the long-term solution. But I am also not keen to start making DistSender changes, given that we are going to layer Streamer+DistSender in the medium term.
Long term, I think the NoMemoryReservedAtSource field goes away, since the caller should have reserved memory locally for N bytes and told the callee not to send more than N bytes (via TargetBytes or some other field). So if the callee is local it knows how many bytes are reserved. We cannot yet have every caller reserve up front and pass that info to the callee -- this would mean coordination across Streamer and DistSender and universal adoption (TargetBytes is currently ignored for things like retrieving intents). I've documented this in the comment for NoMemoryReservedAtSource. Is this a decent compromise?
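As a rough illustration of the long-term direction sketched above (the caller reserves N bytes locally and the callee stops once the response would exceed them), here is a toy scan loop honoring a TargetBytes-style limit and returning a resume point. scanWithBudget and kvPair are hypothetical helpers, not the real KV API.

```go
package main

import "fmt"

// kvPair is a toy key/value entry.
type kvPair struct {
	key, val string
}

// scanWithBudget stops adding results once the response would exceed
// targetBytes, returning the index where a resume scan should start
// (-1 when the scan completed). At least one result is always returned
// so the scan makes progress, mirroring how TargetBytes behaves.
func scanWithBudget(kvs []kvPair, targetBytes int64) (resp []kvPair, resumeIdx int) {
	var size int64
	for i, e := range kvs {
		entry := int64(len(e.key) + len(e.val))
		if len(resp) > 0 && size+entry > targetBytes {
			return resp, i
		}
		resp = append(resp, e)
		size += entry
	}
	return resp, -1
}

func main() {
	kvs := []kvPair{{"a", "xxxx"}, {"b", "yyyy"}, {"c", "zzzz"}}
	resp, resume := scanWithBudget(kvs, 10)
	fmt.Println(len(resp), resume) // prints "2 2": two 5-byte entries fit; resume at index 2
}
```

With this shape, a local callee never needs a NoMemoryReservedAtSource hint: the byte budget itself carries the reservation.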
pkg/kv/kvserver/replica_read.go, line 192 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
// rootMonitor will not be nil in production settings, but we don't check // here due to tests that do not have a monitor.
I don't quite understand what this comment is telling me. Is it saying that, in the tests where it's nil, the code still somehow works? Consider rephrasing, cause "but we don't check" is weird (why would we check? we don't generally check pointers). Or just delete it.
I've simplified the comment.
pkg/kv/kvserver/replica_read.go, line 184 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I would find it much cleaner if we switched away from Replica implementing EvalContext and, instead, created a dedicated struct for it that delegates to a Replica for most things (but not for this monitor). Setting or not setting a field on that struct versus switching or not switching the EvalContext implementation seems a lot simpler to trace.
Is this something optional for this PR? I am asking since I am worried about over-engineering abstractions for unknown future uses. We don't know where else we will add memory accounting, and when we do we can generalize then. I've added a TODO here with your suggestion.
pkg/kv/kvserver/replica_read.go, line 247 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
could we take this line out of the if and perform it unconditionally by creating an unbounded account if rootMonitor is nil? I think that'd be better; it'd avoid the implementation of the evalCtx being different between whatever tests and production. The less we let tests diverge, the better.
The implementation of evalCtx is different not just between tests and production but also within the latter, depending on whether we are going to use a real account or not. I don't know what an "unbounded account" is -- there is MakeStandaloneBudget, but it is explicitly for root monitors, and calls to Grow do not check that the BytesMonitor is non-nil (they will work, with a huge value, but it seems an abuse).
Is this something you feel strongly about? If yes, we should discuss this offline, since I don't quite understand why it matters much.
pkg/kv/kvserver/store.go, line 728 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: s/Must not be nil/Is not nil
But I'm confused, since the above you seemed to say that you couldn't get all tests to make this non-nil. Perhaps a better comment would be "Can be nil in some tests" or such.
Done
pkg/kv/kvserver/store.go, line 2898 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this new hero deserves a comment
Done
pkg/kv/kvserver/batcheval/eval_context.go, line 135 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Is referring to MVCC scans in this general-purpose interface a good idea? Can't we talk more generally about memory accounting? E.g. simply GetMemoryAccount(). If the fact that this monitor is only sometimes initialized makes it only suitable for scans and not for accounting any other memory (which I don't think is the case), please explain that in a comment.
And I think we should name the method, as well as the type, Account rather than Monitor. Because it's an Account, not a Monitor :).
Renamed to ResponseMemoryAccount and stated that this is currently only used for scans.
pkg/roachpb/api.proto, line 2036 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Well, in an API that's generally supposed to be agnostic to the collocation of the client and the server, I feel like you've got to give some explanation about why such a surprising abstraction-breaking feature exists. We don't have to keep the comment up to date with every new use, but at least a hint would help. A natural question when seeing this field is what happens if the field is not set right. We have tests and tools that do RPCs in many ways; the internalClientAdapter is not supposed to be a load-bearing component.
Btw, I was thinking that, instead of this, if we introduced the client's node id in the BatchHeader, I wouldn't have to ask many questions. It seems that that'd be more generally useful. And it could also be set at a higher level (in the DistSender instead of being set in that weird internalClientAdapter).
Is Header.GatewayNodeID already doing the job, and slightly misnamed in that it may not be the gateway node for leaf distsql flows?
pkg/roachpb/api.proto, line 2037 at r6 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: s/LOCAL is set on codepaths that are calling the KV API from the same node (not using gRPC)./LOCAL means the client is collocated on the same node as the server.
Done
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/kv/txn.go, line 962 at r6 (raw file):
Previously, sumeerbhola wrote…
You raise some interesting alternatives.
I am inclined to not pass an account from client to server through a proto API. It seems hacky. The advantage of accounting to the "correct flow" I think is small enough, and it only works for the local case. If we want txn-level accounting in general, we need it to work for both local and remote. The main advantage is eliminating the NoMemoryReservedAtSource field, which admittedly is crude, and not something we would necessarily want to be the long-term solution. But I am also not keen to start making DistSender changes, given that we are going to layer Streamer+DistSender in the medium term.
Long term, I think the NoMemoryReservedAtSource field goes away, since the caller should have reserved memory locally for N bytes and told the callee not to send more than N bytes (via TargetBytes or some other field). So if the callee is local it knows how many bytes are reserved. We cannot yet have every caller reserve up front and pass that info to the callee -- this would mean coordination across Streamer and DistSender and universal adoption (TargetBytes is currently ignored for things like retrieving intents). I've documented this in the comment for NoMemoryReservedAtSource. Is this a decent compromise?
sounds good
pkg/kv/kvserver/replica_read.go, line 184 at r3 (raw file):
Previously, sumeerbhola wrote…
Is this something optional for this PR? I am asking since I am worried about over-engineering abstractions for unknown future uses. We don't know where else we will add memory accounting, and when we do we can generalize then. I've added a TODO here with your suggestion.
optional :)
pkg/roachpb/api.proto, line 2036 at r3 (raw file):
Previously, sumeerbhola wrote…
Is Header.GatewayNodeID already doing the job, and slightly misnamed in that it may not be the gateway node for leaf distsql flows?
I forgot that existed. I think it might indicate the gateway even from requests from leaves based on this comment at the txn level:
Lines 52 to 56 in 4e4f31a:
// gatewayNodeID, if != 0, is the ID of the node on whose behalf this
// transaction is running. Normally this is the current node, but in the case
// of Txns created on remote nodes by DistSQL this will be the gateway.
// It will be attached to all requests sent through this transaction.
gatewayNodeID roachpb.NodeID
anyway, as discussed offline, I think these fields are fine for now.
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, sumeerbhola wrote…
Do we really need this struct? Even if the scan doesn't use all of BoundAccount, would it be harmful to have access to it? BoundAccount is not a heavy-weight structure.
It's not heavyweight, but it is confusing to have all the methods accessible when we want to make sure the caller only calls Grow. And there is the nil case for requests (like the local one) where we are not doing memory accounting.
I am happy to take suggestions for better names. I've always been confused by the BoundAccount terminology. This is more like a QueryMemoryReserver.
discussed offline; maybe we can make this a BoundAccount after all.
Force-pushed a690006 to 1b97b29 (Compare)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
discussed offline; maybe we can make this a BoundAccount after all.
Jordan pointed me to NewUnlimitedMonitor, which is a BytesMonitor and not a BoundAccount. I'm assuming it does not need to be stopped.
But there are many difficulties there:
I don't wish to create one for every read that does not need memory accounting, since that is wasted allocations. So I'd need to create and stash one that lives longer, potentially in StoreConfig itself, since we are already remembering to create one for it in server.go. That means plumbing to grab it from a Replica, which means adding another method. We'd still pay the BoundAccount allocation for paths that don't need to track memory, but that isn't a big deal. It's the unnecessary plumbing that is most bothersome. Also, the plumbing will be missing for some tests -- the same tests for which Replica.store.getRootMemoryMonitorForKV() currently returns nil. So we'd need to find all those and fix them.
Then we also need to make sure that no one calls one of these mvcc methods without initializing MVCC{Scan,Get}Options.MemoryAccount to a *BoundAccount (this includes tests for the MVCC methods and pebbleMVCCScanner). This all seems unnecessarily complicated if a default-initialized ResponseMemoryAccount works. I think the big objection to this was the long comment on the lack of a Shrink method -- I've added one to address that objection (even though it isn't used). IMO, given the choice between complex plumbing and a simple wrapper that achieve the same default behavior, one should choose the latter.
I've added some of this in the code comment for ResponseMemoryAccount.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, sumeerbhola wrote…
Jordan pointed me to NewUnlimitedMonitor, which is a BytesMonitor and not a BoundAccount. I'm assuming it does not need to be stopped. But there are many difficulties there:
I don't wish to create one for every read that does not need memory accounting, since that is wasted allocations. So I'd need to create and stash one that lives longer, potentially in StoreConfig itself, since we are already remembering to create one for it in server.go. That means plumbing to grab it from a Replica, which means adding another method. We'd still pay the BoundAccount allocation for paths that don't need to track memory, but that isn't a big deal. It's the unnecessary plumbing that is most bothersome. Also, the plumbing will be missing for some tests -- the same tests for which Replica.store.getRootMemoryMonitorForKV() currently returns nil. So we'd need to find all those and fix them.
Then we also need to make sure that no one calls one of these mvcc methods without initializing MVCC{Scan,Get}Options.MemoryAccount to a *BoundAccount (this includes tests for the MVCC methods and pebbleMVCCScanner). This all seems unnecessarily complicated if a default-initialized ResponseMemoryAccount works. I think the big objection to this was the long comment on the lack of a Shrink method -- I've added one to address that objection (even though it isn't used). IMO, given the choice between complex plumbing and a simple wrapper that achieve the same default behavior, one should choose the latter.
I've added some of this in the code comment for ResponseMemoryAccount.
I'm fine with it like this, since it's reached the point where it's not worth arguing.
But let me take a crack at it myself. Will report back soon.
but is the new functionality tested at all? Cause at some point when I was hacking around I made scans use unlimited accounts instead of accounts tied to the kvMonitor, and nothing seemed to fail. It's possible that I was holding it wrong but, if not, I do think some test should be added checking that account limits are respected. At least a unit test at the level of the pebbleMVCCScanner.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I'm fine with it like this, since it's reached the point where it's not worth arguing.
But let me take a crack at it myself. Will report back soon.
See the last two commits here; squash what you like out of them
I've gotten rid of ResponseMemoryAccount by making the zero value of *BoundAccount act as a no-op. There's a difference between a no-op account and an unlimited account - you'd be able to query the latter for its current usage - and for a while I thought about having both, but I let go of the unbounded one cause it's not really needed at the moment.
I've made my peace with evalContextWithAccount and left it alone - but added a sync.Pool for it.
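A minimal sketch of the "zero value of *BoundAccount acts as a no-op" approach: pointer-receiver methods that tolerate a nil receiver, so callers that do no accounting pass nil instead of switching implementations. boundAccount here is a stand-in, not the real mon.BoundAccount, and the sync.Pool mentioned above is a separate optimization not shown.

```go
package main

import "fmt"

// boundAccount is a toy byte account. The zero value of *boundAccount is
// nil, and every method checks for nil so a nil account is a no-op.
type boundAccount struct {
	used, limit int64
}

func (b *boundAccount) Grow(n int64) error {
	if b == nil {
		return nil // no-op account: accounting disabled
	}
	if b.used+n > b.limit {
		return fmt.Errorf("budget exceeded: %d+%d > %d", b.used, n, b.limit)
	}
	b.used += n
	return nil
}

func (b *boundAccount) Clear() {
	if b == nil {
		return // safe on the no-op account too
	}
	b.used = 0
}

func main() {
	var acc *boundAccount          // zero value: no-op
	fmt.Println(acc.Grow(1 << 30)) // prints "<nil>"

	real := &boundAccount{limit: 10}
	fmt.Println(real.Grow(20)) // budget error
	real.Clear()
}
```

This removes the wrapper type while keeping a single code path on the caller side: the account pointer is either nil or real, and the scanner calls Grow either way.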
is the new functionality tested at all?
There is the new logic test that fails due to the memory accounting, which validates some of the end-to-end plumbing.
I was waiting for the discussion to settle before adding a test for pebbleMVCCScanner.
I am not sure if/where to add KV tests for testing get/scan/reverse-scan -- what do you think?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
See the last two commits here; squash what you like out of them
Where?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, sumeerbhola wrote…
See the last two commits here; squash what you like out of them
Where?
here, sorry
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
here, sorry
@jordanlewis do you have any concerns about the BoundAccount changes in andreimatei@8c23104#diff-dc061ae803d00d0e6b1857ea6479a03f89f1562f844b09066dcec4238481d1ca
I'm uncomfortable with changing BoundAccount to be pass by reference as part of this PR, that's a pretty big change. The memory accounting stuff is fragile and I'm frankly scared of a big change like that. We can do it if you insist but maybe in another PR? I'm also a little concerned w/ performance - maybe changing to pass by reference will cause a bunch of these things to escape to the heap? We do store a ton of them in various structs (like aggregation buckets).

This is definitely fearmongering so I apologize but I don't think it's a trivial thing to do this change.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
(copying comment from wrong place here)
I'm uncomfortable with changing BoundAccount to be pass by reference as part of this PR, that's a pretty big change. The memory accounting stuff is fragile and I'm frankly scared of a big change like that. We can do it if you insist but maybe in another PR? I'm also a little concerned w/ performance - maybe changing to pass by reference will cause a bunch of these things to escape to the heap? We do store a ton of them in various structs (like aggregation buckets).
This is definitely fearmongering so I apologize but I don't think it's a trivial thing to do this change.
Just to make sure we're on the same page - what pass-by-reference are you talking about?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
(copying comment from wrong place here)
I'm uncomfortable with changing BoundAccount to be pass by reference as part of this PR, that's a pretty big change. The memory accounting stuff is fragile and I'm frankly scared of a big change like that. We can do it if you insist but maybe in another PR? I'm also a little concerned w/ performance - maybe changing to pass by reference will cause a bunch of these things to escape to the heap? We do store a ton of them in various structs (like aggregation buckets).
This is definitely fearmongering so I apologize but I don't think it's a trivial thing to do this change.
Just to make sure we're on the same page - what pass-by-reference are you talking about?
The BoundAccount receiver was changed to a pointer receiver. This means that the BoundAccount is passed to the methods as a pointer, not a struct. That's the change I'm referring to.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
The BoundAccount receiver was changed to a pointer receiver. This means that the BoundAccount is passed to the methods as a pointer, not a struct. That's the change I'm referring to.
The methods on the BoundAccount that matter always had a pointer receiver; they have to, as the thing is obviously mutable. I've only changed Used/Monitor/allocated - minuscule leaf methods that are probably inlined everywhere. I've checked, and none of the 3 causes the receiver (well, technically the value pointed to by the receiver) to escape.

My patch was rushed - the way I had written it, there was no point in making the 3 methods take a ptr receiver because they were not protecting against a nil pointer. That's fixed; now they do (and I switched one to a value receiver because it was an internal method).

Btw, I don't technically have to move Used/Monitor to ptr receiver because nobody's currently calling them on a possibly-nil pointer (for the purposes of this PR, only Grow matters). But, if we give these no-op semantics to the nil ptr, I think we should do it all the way.

Makes sense?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, @tbg, and @yuzefovich)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
@jordanlewis do you have any concerns about the BoundAccount changes in andreimatei/cockroach@8c23104#diff-dc061ae803d00d0e6b1857ea6479a03f89f1562f844b09066dcec4238481d1ca
@yuzefovich would you be able to chime in with thoughts on this (@jordanlewis is swamped with other stuff, and said you were an expert on such memory accounting matters)?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/mvcc.go, line 2486 at r3 (raw file):
Previously, sumeerbhola wrote…
@jordanlewis do you have any concerns about the BoundAccount changes in andreimatei/cockroach@8c23104#diff-dc061ae803d00d0e6b1857ea6479a03f89f1562f844b09066dcec4238481d1ca
@yuzefovich would you be able to chime in with thoughts on this (@jordanlewis is swamped with other stuff, and said you were an expert on such memory accounting matters)?
I don't see a problem in changing the BoundAccount methods to have a pointer receiver since we actually store the accounts both ways (i.e. by value and by pointer). In fact, in the newer vectorized engine we use the pointers, and I don't think we've seen any evidence of that having some negative performance impact. Andrei's justification above also sounds reasonable to me, so I'm on board with this change.
- The ResponseMemoryAccount is no more and we are instead using *BoundAccount as you had suggested (mon: nil BoundAccount acts as an unlimited account #68340 made the relevant nil BoundAccount change).
- There is a unit test for pebbleMVCCScanner.

PTAL
I plan to merge when CI is green since merge conflicts with other PRs are quite frequent.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @itsbilal, @jordanlewis, @nvanbenschoten, @sumeerbhola, and @tbg)
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins, multi-tenant clusters, and if ranges move between DistSQL planning and execution.
- local SQL=>KV requests for the first request by a fetcher: in this case the fetcher reserves a modest 1KB which can be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are tracked. The memory is released when returning from executeReadOnlyBatchWithServersideRefreshes since the chain of returns will end up in gRPC response processing and we can't hook into where that memory is released. This should still help for some cases of OOMs, and give some signal of memory overload that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is passed via the EvalContext interface. The other alternative would be for the engine to have a BytesMonitor at initialization time that it can use to construct a BoundAccount for each MVCC scan, and pass it back via MVCCScanResult. This would mean multiple BoundAccounts for a batch (since we don't want to release memory until all the requests in the batch are processed), and would be harder to extend to track additional request types compared to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking. This tracking is always done for non-local requests, and for the first request by a fetcher for a local request. This is to avoid double-counting: the first request issued by a SQL fetcher only reserves 1KB, but subsequent ones have already reserved what was returned in the first response. So there is room to tighten this if we knew what had been reserved on the local client (there are complications because the batch may have been split to send to different nodes, only one of which was local).

The AdmissionHeader.SourceLocation field is used to mark local requests and is set in rpc.internalClientAdapter. The first request is marked using the AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now also used to cover KV memory used for scans.
bors r+
Build succeeded:
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
multi-tenant clusters, and if ranges move between DistSQL
planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
in this case the fetcher reserves a modest 1KB which can
be significantly exceeded by KV allocations.
Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).
The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.
The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).
The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.
Informs #19721
Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.