Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rowflow: account for more memory usage of the row buffer #55877

Merged
merged 1 commit into from
Oct 27, 2020

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Oct 22, 2020

Previously, we accounted for the memory usage of the row buffer from
which the rows are pushed from in the routers only when copying the rows
from the row container, but we also have another in-memory row buffer
that we can move the rows from which was missing the memory accounting.
This is now fixed which also deflakes the router disk spill test.

Fixes: #55848.

Release note: None

@yuzefovich yuzefovich requested review from asubiotto and a team October 22, 2020 23:15
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich
Copy link
Member Author

I'm open to suggestions on how to actually de-flake the test without removing it, but I can't seem to figure it out. Locally the tests fails after a few hundred of iterations under stress as is.

Copy link
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this. Does this also fix the race in #55848?

Could you describe the reason this fails in more detail? What does a rare case where this error does not occur look like? Even if the producer and consumer goroutines run concurrently, why does that affect the occurence of the error?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)

@yuzefovich
Copy link
Member Author

Yeah, it does fix the race too.

To be honest, I don't understand the reason why the test as is is flaky - sometimes the expected memory error doesn't occur which (sometimes). I'd appreciate if you took a look.

I thought that one explanation for the flakiness could be that the limit on test-monitor is set pretty high so that we can add a bout half of all rows to the in-memory container before spilling to disk. Once the spilling occurs, then the memory account of the row container is cleared, and the consumer goroutine never buffers up enough rows in rowBufToPushFrom account to hit that limit. I tried this diff to prototype a separate monitor with a limit of 1, but that still is flaky :/

Never mind, I think I figured it out here. Will kick off a nightly stress on a gceworker.

Previously, we accounted for the memory usage of the row buffer from
which the rows are pushed from in the routers only when copying the rows
from the row container, but we also have another in-memory row buffer
that we can move the rows from which was missing the memory accounting.
This is now fixed which also deflakes the router disk spill test.

Release note: None
@yuzefovich yuzefovich changed the title rowflow: de-flake router disk spill test rowflow: account for more memory usage of the row buffer Oct 27, 2020
@yuzefovich
Copy link
Member Author

No failures in 400k run in over an hour, so I assume that fixes it, RFAL.

Copy link
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@yuzefovich
Copy link
Member Author

The build is red because of a logic test flake.

TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 27, 2020

Build succeeded:

@craig craig bot merged commit b319a0b into cockroachdb:master Oct 27, 2020
@yuzefovich yuzefovich deleted the fix-routers branch October 27, 2020 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql/rowflow: TestRouterDiskSpill failed
3 participants