-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: adopt for settings rangefeed-backed settingswatcher, remove g… #69269
server: adopt for settings rangefeed-backed settingswatcher, remove g… #69269
Conversation
Actually there's a bit more to do here. |
Grafted from cockroachdb#69269. This seems like a useful primitive for users of this library. We intend to use it in cockroachdb#69661 and cockroachdb#69614. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Grafted from cockroachdb#69269. This seems like a useful primitive for users of this library. We intend to use it in cockroachdb#69661 and cockroachdb#69614. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Grafted from cockroachdb#69269. This seems like a useful primitive for users of this library. We intend to use it in cockroachdb#69661 and cockroachdb#69614. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Grafted from cockroachdb#69269. This seems like a useful primitive for users of this library. We intend to use it in cockroachdb#69661 and cockroachdb#69614. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Grafted from cockroachdb#69269. This seems like a useful primitive for users of this library. We intend to use it in cockroachdb#69661 and cockroachdb#69614. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
dec8f50
to
1c00077
Compare
Release note: None
Release note: None
…ossip This commit removes the code which connected the settings to their backing table via the gossipped system config. Instead it unconditionally enables the rangefeed-backed `settingswatcher` which was developed to support tenants. Note that it is rather tested code that has been used in multi-tenant sql pods for about a year now and all the existing tests still pass. Release note: None
1c00077
to
2e8c2c4
Compare
@RaduBerinde I've rebased this and I think it's ready for a pass. I suspect there's missing testing somewhere. Please let me know what you're looking for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this.
It feels like we're processing each rangefeed event in two different places and in two different ways (one indirectly, after buffering). What's the benefit of buffering events? Why not just keep mu.data
up to date in in the main rangefeed callback? That would make everything a lot simpler. I don't think we'd even need to keep track of frontier timestamps anymore - whenever we get an event, we either spawn the async storage task, or if it is running already, we set a flag indicating that it needs to run again (and the task can check that flag and restart).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)
pkg/server/settingswatcher/settings_watcher.go, line 66 at r3 (raw file):
// bootstrap settings state. type Storage interface { WriteKVs(ctx context.Context, kvs []roachpb.KeyValue) error
[nit] This suggests that we may be writing KVs in batches, whereas IIUC each call is a full snapshot. Maybe SaveKVs
or SnapshotKVs
?
We guarantee that one instance of the call will be running at any one time, right? We should advertise that here.
pkg/server/settingswatcher/settings_watcher.go, line 255 at r3 (raw file):
return } }
[nit] add a comment here saying that a call was already running and we need to try again.
This singleflight+retry mechanism feels awkward to me (perhaps because each call still spawns an async task separately waiting for what should really be a single process). Wouldn't it be simpler to have at most one async task running, along the lines of:
- invariant: if
frontierToSave < frontierSaved
then there is an async task running or starting. IffrontierToSave >= frontierSaved
the async task is not running (or it's exiting). - Before we forward
frontierToSave
, we check the above condition and if we didn't have an async task running, we start it after the forward. - In the async task, we run a loop until frontierSaved >= frontierToSave. The latter can change during the loop, causing more iterations.
pkg/server/settingswatcher/settings_watcher_external_test.go, line 186 at r3 (raw file):
func (f *fakeStorage) WriteKVs(ctx context.Context, kvs []roachpb.KeyValue) error { f.Lock() defer f.Unlock()
[nit] should we introduce a random delay here? I want to make sure to test the situation where an event comes in while the async storage task is running. If we do that, we should also assert that only one instance of the method is running at a time (we can increment and defer(decrement) an atomic counter and check that it was 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit of buffering events? Why not just keep mu.data up to date in in the main rangefeed callback? That would make everything a lot simpler.
The complexity exists to ensure that whenever we write out a snapshot of settings, it corresponds to a snapshot which actually existed in the settings table at some point in time. The problem is that updates may come out-of-order. The buffer is a hack to avoid needing to maintain a versioned store for data. I'm not sure it's saving much in the way of complexity.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)
Huh, the range feed events don't have monotonically increasing timestamps?? That is a huge thing, it should be documented in at least 3 places (the RangeFeed Internal API itself, If we care about the ordering in the settings snapshot, it feels like we should care about it in memory as well (even though the window of potential problems would be much smaller in practice). I assume that more often than not, we will care about the proper ordering when we write a range feed "client" (which means each use will have to get involved with buffering and frontiers). At the very least, it's much easier to reason about things if you can just get the nice semantics. So I think we should build a small layer into |
It has very specific ordering guarantees. It guarantees for that any individual key you will see writes for the first time in increasing timestamp order. It makes no statement at all about ordering of events corresponding to different rows. In cockroach, you can have a txn write to row A at t2 and then subsequently a txn writes to row B at t1. To not emit the t2 event until a t1 event becomes impossible would effectively negate the design of rangefeeds.
I don't disagree that it is under-documented. Here's a comment which says some things but is far afield from where one might look. cockroach/pkg/ccl/changefeedccl/sink_cloudstorage.go Lines 84 to 100 in 707af75
It's a tradeoff. If we wanted to wait for a snapshot, we'd have to wait for the closed timestamp which is on the order of seconds. Even today we don't update the settings atomically with the gossip update, though the window for things to be out of sync is extremely small; the updater is non-atomic. One approach to reducing the delay is #73399
This is what @irfansharif was setting out to do with cockroach/pkg/kv/kvclient/rangefeed/rangefeedbuffer/buffer.go Lines 32 to 36 in 24465b8
Yeah. The data structure can in principle maintain exactly one entry per key while waiting for a checkpoint. Hitting a memory error for these use cases we're intending to use this for ought to be extremely rare and indicative of something pathological. I'm all for it existing as a guard rail, but at least for settings, it feels like the sort of thing where if we use too much ram, we ought to be crashing the server or something drastic like that. Thinking through more complex handling doesn't seem worth it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we wanted to wait for a snapshot, we'd have to wait for the closed timestamp which is on the order of seconds.
This is another critical detail that should be better advertised, e.g. in the comments for WithOnFrontierAdvance / OnFrontierAdvance
.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @RaduBerinde)
pkg/server/settingswatcher/settings_watcher.go, line 53 at r4 (raw file):
// State used to store settings values to disk. buffer *rangefeedbuffer.Buffer
Can you add a comment here explaining that we need the buffer because the range feed callbacks can be out-of-order? Also mention that the buffer will hold a few seconds worth of changes in practice.
For posterity another would be setting a more aggressive |
That's a particularly legit idea for the |
Replacing this with some form of #74612. It's much cleaner. |
74612: rangefeedcache,settingswatcher,systemcfgwatcher: lose gossip deps r=ajwerner a=ajwerner This is a pile of commits which supersedes #69269 and pretty much puts in place all of the pieces to move on #70560. 75726: sql: refactor pg_has_role to remove privilege.GRANT usage r=RichardJCai a=ecwall refs #73129 Also combines some layers of privilege checking code. Release note: None 75770: vendor: bump cockroachdb/apd to v3.1.0, speed up decimal division r=nvanbenschoten a=nvanbenschoten Picks up two PRs that improved the performance of `Quo`, `Sqrt`, `Cbrt`, `Exp`, `Ln`, `Log`, and `Pow`: - cockroachdb/apd#114 - cockroachdb/apd#115 Almost all of the testing changes here are due to the rounding behavior in cockroachdb/apd#115. This brings us closer to PG's behavior, but also creates a lot of noise in this diff. To verify that this noise wasn't hiding any correctness regressions caused by the rewrite of `Context.Quo` in the first PR, I created #75757, which only includes the first PR. #75757 passes CI with minimal testing changes. The testing changes that PR did require all have to do with trailing zeros, and most of them are replaced in this PR. Release note (performance improvement): The performance of many DECIMAL arithmetic operators has been improved by as much as 60%. These operators include division (`/`), `sqrt`, `cbrt`, `exp`, `ln`, `log`, and `pow`. ---- ### Speedup on TPC-DS dataset The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them. Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (web_sales, row count = 7,197,566): ```sql # q1 select sum(ws_wholesale_cost / ws_ext_list_price) from web_sales; # q2 select sum(ws_wholesale_cost / ws_ext_list_price - sqrt(ws_net_paid_inc_tax)) from web_sales; ``` Here's the difference in runtime of these two queries before and after this change on an `n2-standard-8` instance: ``` name old s/op new s/op delta TPC-DS/custom/q1 22.4 ± 0% 8.6 ± 0% -61.33% (p=0.002 n=6+6) TPC-DS/custom/q2 91.4 ± 0% 32.1 ± 0% -64.85% (p=0.008 n=5+5) ``` 75771: colexec: close the ordered sync used by the external sorter r=yuzefovich a=yuzefovich **colexec: close the ordered sync used by the external sorter** Previously, we forgot to close the ordered synchronizer that is used by the external sorter to merge already sorted partitions. This could result in some tracing spans never being finished and is now fixed. Release note: None **colexec: return an error rather than logging it on Close in some cases** This error eventually will be logged anyway, but we should try to propagate it up the stack as much as possible. Release note: None 75807: ui: Add CircleFilled component r=ericharmeling a=ericharmeling Previously, there was no component for a filled circle icon in the `ui` package. Recent product designs have requested this icon for the DB Console (see #67510, #73463). This PR adds a `CircleFilled` component to the `ui` package. Release note: None 75813: sql: fix flakey TestShowRangesMultipleStores r=ajwerner a=ajwerner Fixes #75699 Release note: None 75836: dev: add generate protobuf r=ajwerner a=ajwerner Convenient, fast. Release note: None Co-authored-by: Andrew Werner <awerner32@gmail.com> Co-authored-by: Evan Wall <wall@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
…ossip
This commit removes the code which connected the settings to their backing
table via the gossipped system config. Instead it unconditionally enables the
rangefeed-backed
settingswatcher
which was developed to support tenants.Note that it is rather tested code that has been used in multi-tenant sql
pods for about a year now and all the existing tests still pass.
Release justification: Low risk, high benefit change to existing functionality
Release note: None