row: fix intra-query memory leaks in kvFetcher and txnKVFetcher #65881

Merged
merged 2 commits on Jun 3, 2021

Conversation

jordanlewis
Member

@jordanlewis jordanlewis commented May 29, 2021

Updates #64906. The critical change is the first patch, the kvfetcher one.
The kvbatchfetcher one is theoretically good, but I haven't found it to make
as large a difference as the first patch.

With the first patch applied alone, I can no longer cause OOM conditions with
1024 concurrent TPCH query 18s sent at a single machine, which is a major
improvement. Prior to that patch, such a workload would overwhelm the machine
within 1-2 minutes.

This bug was found with the help of the new tooling I've been adding to
viewcore, mostly the new pprof output format, the existing HTML object
explorer, and the new type explorer. You can see these updates at
https://github.com/jordanlewis/debug/tree/crl-stuff.


The KVFetcher is the piece of code that does the first level of decoding
of a KV batch response, doling out slices of keys and values to
higher-level code that further decodes the key values into formats that
the SQL engine can operate on.

The KVFetcher uses a slice into the batch response to keep track of
where it is during the decoding process. Once the slice is empty, it's
finished until someone asks it for a new batch.

However, the KVFetcher used to keep around that empty slice pointer for
its lifetime, or until it was asked for a new batch. This causes the
batch response to be un-garbage-collectable, since there is still a
slice pointing at it, even though the slice is empty.
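
As a rough sketch of the pattern described above (hypothetical names, not the
actual fetcher code), the bug looks roughly like this:

    // Sketch only: a fetcher-style struct that slices into its batch response.
    type fetcher struct {
        batchResponse []byte // cursor into the current batch response
    }

    func (f *fetcher) decodeAll() {
        for len(f.batchResponse) > 0 {
            // ... hand the next key/value out to higher-level code ...
            f.batchResponse = f.batchResponse[1:]
        }
        // f.batchResponse is now empty but non-nil: its header still points
        // into the batch response, so the GC cannot reclaim that memory.
    }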

This causes queries to use up to 2x their accounted-for batch memory,
since the memory accounting system assumes that once data is transferred
out of the batch response into the SQL representation, the batch
response is freed; it assumes there's just 1 "copy" of this batch
response memory.

This is especially problematic for long queries (since they will not
allow that KVFetcher memory to be freed until they're finished).

In effect, this causes 1 extra batch per KVFetcher per query to be
retained in memory. This doesn't sound too bad, since a batch is of
fixed size. But the max batch size is 1 megabyte, so with 1024
concurrent queries, each with 3 KVFetchers, like we see in a TPCH
workload with 1024 concurrent query 18s, that's 1024 * 1MB * 3 = 3GB of
unaccounted-for memory. This is easily enough memory to push a node over
and cause it to OOM.

This patch nils out the batch response pointer once the KVFetcher is
finished decoding it, which allows it to be garbage collected as soon as
possible. In practice, this seems to allow at least a single-node
concurrency-1024 query 18 TPCH workload to survive indefinitely (all
queries return out-of-budget errors) without OOMing.
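
In sketch form (matching the snippet quoted in the review below), the change
amounts to clearing that reference as soon as the response is consumed:

    // Once decoding reaches the end of the batch response, drop the
    // reference so the response can be garbage collected right away.
    if len(f.batchResponse) == 0 {
        f.batchResponse = nil
    }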

Release note (bug fix): queries use up to 1MB less actual system memory
per scan, lookup join, index join, zigzag join, or inverted join in
their query plans. This will result in improved memory performance for
workloads with concurrent OLAP-style queries.


Previously, we could leave some dangling references to batch responses
around in the txnKVFetcher when we were fetching more than one batch at
a time. This would cause a delay in reclamation of memory for the
lifetime of a given query.

Release note (bug fix): use less memory in some queries, primarily
lookup joins.
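
A hedged sketch of the pattern involved (field names follow the review
snippets below, not necessarily the exact diff): when popping the head of a
retained slice of batch responses, the popped slot is nil'd out so the
garbage collector can reclaim that response before the query finishes.

    // Pop the next buffered batch response and clear the slot it came from,
    // so the retained slice no longer keeps the popped response alive.
    batchResp := f.remainingBatches[0]
    f.remainingBatches[0] = nil
    f.remainingBatches = f.remainingBatches[1:]
    // batchResp is then handed to the decoding layer as usual.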

@cockroach-teamcity
Member

This change is Reviewable

@jordanlewis jordanlewis requested a review from a team June 1, 2021 12:18
Member

@yuzefovich yuzefovich left a comment


Nice finds! :lgtm:

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/sql/row/kv_fetcher.go, line 123 at r1 (raw file):

			}
			if len(f.batchResponse) == 0 {
				f.batchResponse = nil

super nit: might be worth leaving a quick comment about this.

Collaborator

@mgartner mgartner left a comment


Great find! :lgtm:

Reviewed 2 of 2 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)


pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):

		f.responses = f.responses[1:]
		origSpan := f.requestSpans[0]
		f.requestSpans[0] = roachpb.Span{}

nit: pull out this popping pattern into helper functions like f.popFromRemainingBatches, f.popFromResponses, etc.

Member Author

@jordanlewis jordanlewis left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)


pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: pull out this popping pattern into helper functions like f.popFromRemainingBatches, f.popFromResponses, etc.

I don't see a way to reduce any code by doing this, since each use is subtly different. I guess I could see an argument for making helper functions that are used just once, but I'm not sure I buy it. But, any suggestions for making this subtle GC poking less subtle are appreciated.

@rytaft
Collaborator

rytaft commented Jun 2, 2021


pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):

				batchResp = t.BatchResponses[0]
				f.remainingBatches = t.BatchResponses[1:]
				t.BatchResponses[0] = nil

is this really necessary since we just assign the whole slice to nil on the next line? that's surprising...

Member Author

@jordanlewis jordanlewis left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rytaft)


pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

is this really necessary since we just assign the whole slice to nil on the next line? that's surprising...

It's necessary because the underlying t.BatchResponses slice is retained by f.remainingBatches. Even though we assign f.remainingBatches to t.BatchResponses[1:], the 0th element will still be retained from the perspective of the garbage collector unless we nil it out.

Perhaps a more obvious way to phrase this would be to say:

f.remainingBatches = t.BatchResponses
f.remainingBatches[0] = nil
f.remainingBatches = f.remainingBatches[1:]

Thoughts?

@rytaft
Collaborator

rytaft commented Jun 2, 2021


pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

It's necessary because the underlying t.BatchResponses slice is retained by f.remainingBatches. Even though we assign f.remainingBatches to t.BatchResponses[1:], the 0th element will still be retained from the perspective of the garbage collector unless we nil it out.

Perhaps a more obvious way to phrase this would be to say:

f.remainingBatches = t.BatchResponses
f.remainingBatches[0] = nil
f.remainingBatches = f.remainingBatches[1:]

Thoughts?

Oh didn't catch that. Yea that change might be clearer. Thanks!

@mgartner
Collaborator

mgartner commented Jun 2, 2021


pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I don't see a way to reduce any code by doing this, since each use is subtly different. I guess I could see an argument for making helper functions that are used just once, but I'm not sure I buy it. But, any suggestions for making this subtle GC poking less subtle are appreciated.

Ya there would be no code reduction. I was thinking it would make this function less dense - you have to grok a few action-packed lines to understand that these are simply pop operations. Separating the mechanics of the pop might aid in legibility. Feel free to leave as-is!

Member

@RaduBerinde RaduBerinde left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)


pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Ya there would be no code reduction. I was thinking it would make this function less dense - you have to grok a few action-packed lines to understand that these are simply pop operations. Separating the mechanics of the pop might aid in legibility. Feel free to leave as-is!

A popRemainingBatches function would be used at least three times. Below we would just do f.remainingBatches, t.BatchResponses = t.BatchResponses, nil before calling it.
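
For illustration, such a helper might look like the following (a sketch with
assumed field names; the helper that was actually extracted may differ):

    // popRemainingBatches returns the next buffered batch response and drops
    // the fetcher's reference to it so it can be GC'd once it is consumed.
    func (f *txnKVFetcher) popRemainingBatches() []byte {
        batchResp := f.remainingBatches[0]
        f.remainingBatches[0] = nil
        f.remainingBatches = f.remainingBatches[1:]
        return batchResp
    }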

Collaborator

@mgartner mgartner left a comment


LGTM

@jordanlewis
Member Author

TFTRs, I was able to extract a little helper.

bors r+

@craig
Contributor

craig bot commented Jun 3, 2021

Build failed (retrying...):

@mgartner
Collaborator

mgartner commented Jun 3, 2021

@jordanlewis I added backport labels, because I'm assuming this should be backported to as many versions as possible.

@craig
Contributor

craig bot commented Jun 3, 2021

Build succeeded:
