
kvserver: allow empty ranges to be merged regardless of QPS #71899

Closed


ajwerner (Contributor)

I noticed while running some ORM tests that empty ranges never got merged
away. When I manually ran them through the queue, I got the report:

```
skipping merge: LHS QPS measurement not yet reliable
```

All of the ranges in question were totally empty. This small
patch seems to fix the problem.

Release note (bug fix): Empty ranges will now sometimes be merged away
regardless of QPS data.
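
For illustration only, a minimal sketch of the gate this patch adjusts; the type, field, and function names below are stand-ins, not the actual kvserver merge-queue code:

```go
// Minimal sketch of the idea in this patch; all names here are illustrative
// stand-ins, not the actual kvserver merge-queue code.
package sketch

// rangeStats stands in for the MVCC stats the merge queue consults.
type rangeStats struct {
	LiveBytes int64 // live (non-tombstone) bytes in the range
}

// canMergeAwayLHS: an empty LHS no longer has to wait for a reliable QPS
// measurement before it is eligible to be merged away.
func canMergeAwayLHS(stats rangeStats, qpsReliable bool) (ok bool, reason string) {
	if stats.LiveBytes == 0 {
		return true, "" // empty ranges merge regardless of QPS data
	}
	if !qpsReliable {
		return false, "skipping merge: LHS QPS measurement not yet reliable"
	}
	return true, ""
}
```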

ajwerner requested a review from a team as a code owner, October 22, 2021 23:15
cockroach-teamcity (Member)

This change is Reviewable

aayushshah15 (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

nvanbenschoten (Member) left a comment

This seems reasonable once we fix the test. :lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner)

ajwerner (Contributor, Author)

Alright, turning back to this, I'm not sure I understand some of the nuances of the failing tests. Namely, why are we running a ClearRange before the attempt to run the queue? @nvanbenschoten maybe you recall? Would it be just as reasonable not to run a ClearRange?

t.Run("unreliable-lhs-qps", func(t *testing.T) {
resetForLoadBasedSubtest(t)
lhs().LoadBasedSplitter().Reset(tc.Servers[0].Clock().PhysicalTime())
clearRange(t, lhsStartKey, rhsEndKey)
store.MustForceMergeScanAndProcess()
verifyUnmerged(t, store, lhsStartKey, rhsStartKey)
})
t.Run("unreliable-rhs-qps", func(t *testing.T) {
resetForLoadBasedSubtest(t)
rhs().LoadBasedSplitter().Reset(tc.Servers[1].Clock().PhysicalTime())
clearRange(t, lhsStartKey, rhsEndKey)
store.MustForceMergeScanAndProcess()
verifyUnmerged(t, store, lhsStartKey, rhsStartKey)
})

andreimatei (Contributor)

Excuse my ignorance, but why do we want to treat empty ranges differently than non-empty ones? Why merge them when we wouldn't have merged them had they not been empty?
In particular, can this lead to an empty range being merged, then split back based on QPS, then merged back, etc? (In other words, do we split away empty ranges? Cause it would seem reasonable to me if we did.)

ajwerner (Contributor, Author)

> Excuse my ignorance, but why do we want to treat empty ranges differently than non-empty ones? Why merge them when we wouldn't have merged them had they not been empty? In particular, can this lead to an empty range being merged, then split back based on QPS, then merged back, etc? (In other words, do we split away empty ranges? Cause it would seem reasonable to me if we did.)

The problem I ran into is that it seemed like ranges which never receive traffic or never get unquiesced never get reliable QPS. The motivation for this PR is ORM tests which create and drop tens of thousands of tables very rapidly, with very short TTLs. None of the ranges we split off were getting merged away. The reason given by the queue was the lack of reliable QPS data. I was looking for a hack to deal with it.

andreimatei (Contributor)

I guess I'd be curious what this "lack of reliable QPS data" is about.

ajwerner (Contributor, Author)

Yeah, that's legit. I'll re-run my experiments. Maybe I picked badly.

aayushshah15 (Contributor)

> I guess I'd be curious what this "lack of reliable QPS data" is about.

Seems related to #68840.

nvanbenschoten (Member)

The lack of reliable QPS data is related to #64201. In that PR, we introduced a mandatory 5-minute (by default, see kv.range_split.by_load_merge_delay) QPS measurement period before merging away splits. If we don't have at least 5 minutes of load information on a range, we err on the side of not merging it away, out of fear of 1) thrashing with load-based splitting, and 2) creating merge cascades where merges slow down QPS enough to trigger more merges, creating serious instability in cases like #62700 (comment) (and in the associated customer cluster).

Those issues are mostly unaffected by this change because this PR only applies to completely empty ranges. I think it is possible to have an empty range that performs a load-based split due to heavy read traffic, but that seems unlikely. And furthermore, rebalancing such a range would be very cheap. So this change seems reasonable to me.

However, writing this out does make me wonder whether we should change the definition of "empty". We're using LiveBytes right now. Do we want this behavior to apply to a 512MB range full of tombstones? Probably not. So we should probably switch that to KeyBytes. Does that undermine the reason to make the change?

Regarding the question of why we use ClearRange in this test, I don't have a complete answer. Some of the tests pass without it, some don't. It looks like the reset function writes a key-value pair to each side of the split whose size is equal to RangeMinBytes, so the clearRange calls are needed to reduce the size of the ranges below RangeMinBytes.
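
To make the measurement window and the two candidate definitions of "empty" above concrete, here is a toy sketch; the constant, type, and function names are assumptions, and none of this is the actual kvserver code:

```go
// Toy sketch of the QPS measurement window and the "empty" predicates
// discussed above; names are illustrative, not actual kvserver code.
package sketch

import "time"

// byLoadMergeDelay stands in for kv.range_split.by_load_merge_delay (5m default).
const byLoadMergeDelay = 5 * time.Minute

// qpsReliable: QPS is only trusted once the range has been measured for at
// least the merge delay; before that the queue errs on the side of not merging.
func qpsReliable(measurementStart, now time.Time) bool {
	return now.Sub(measurementStart) >= byLoadMergeDelay
}

// mvccStats stands in for the stats used to decide whether a range is "empty".
type mvccStats struct {
	LiveBytes int64 // excludes deleted data: a range full of tombstones reports 0
	KeyBytes  int64 // includes tombstoned keys, so the same range would not be "empty"
}

// isEmpty shows the two candidate predicates: the patch currently uses
// LiveBytes; switching to KeyBytes would exclude tombstone-only ranges.
func isEmpty(s mvccStats, useKeyBytes bool) bool {
	if useKeyBytes {
		return s.KeyBytes == 0
	}
	return s.LiveBytes == 0
}
```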

ajwerner (Contributor, Author) commented Jan 5, 2022

Indeed lowering kv.range_split.by_load_merge_delay did the trick.
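
For reference, a minimal sketch of that kind of adjustment from a SQL client; the connection string and the 30s value below are placeholders, not recommendations:

```go
// Sketch of lowering the merge delay from a client; the connection string and
// the chosen duration are placeholders, not recommendations.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Shorten the mandatory QPS measurement window so recently split,
	// now-idle ranges become eligible for merging sooner.
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.range_split.by_load_merge_delay = '30s'",
	); err != nil {
		log.Fatal(err)
	}
}
```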

ajwerner closed this Jan 5, 2022