015-storage-messaging #16
base: main
Conversation
I agree with the approach of having a separate metadata coordination service. I think it is better to have a separate service to manage this state. There are at least two reasons. One that you've already mentioned is fault tolerance: when storage and console are tied this way, we would have to make the console as available as any other infrastructure service (like, for example, S3). So in my opinion storage should avoid a hard dependency on the console. The other reason is that users who run zenith themselves wouldn't have our console available. So I think the approach with a separate metadata service is preferable. etcd is fine, but I think it is worth mentioning/discussing alternatives, e.g. Consul.
I agree that in the cloud some way of centralized config broadcast from the console is a good thing; however, I'm quite hesitant to add etcd or whatever as one more hard dependency, required in tests and local setups. What do you think about drawing the line differently, or combining the approaches and using the centralized service only for console-originated info and service discovery, that is, for
? So, safekeepers would learn their timelines and the peers for these timelines from the broker and then gossip directly. A fully centralized service allows us to avoid managing these connections, but OTOH we get a more decoupled system (e.g. in case of problems you mainly have to check 2 places, not 3) and the possibility of running tests/locally without a broker at all. I still think that the best way of doing safekeeper recovery is direct WAL fetching from peers, so we need to know them anyway.

As for callmemaybe, we can rework it close to what is proposed here: safekeepers would know the pageserver(s) for each of their timelines from the broker and would continuously push data for active timelines to pageservers, providing them with info about LSNs and such, allowing them to choose whom to pull from and inviting them to do so. Safekeepers here are the initiators, as they learn about compute appearance first. I don't think n-to-n connections are a big problem here, as we don't normally (unless in recovery) carry WAL in such a horizontal way, and the gossiped info will be multiplexed into a single physical connection. I mean, yes, we still have n of them instead of 1, but n won't be too big in the short-mid term (even 10k async conns won't generate tremendous load), and if it becomes so, we can indeed assign safekeepers more smartly (which is actually good for load balancing anyway, I think, as the load of each tli on all its safekeepers is the same; it also provides logical grouping, e.g. it would be immediately clear whether we have problems with whole safekeepers or with individual timelines).

Also, though the RFC explicitly says it is not about service discovery -- and by that I mean learning which timelines reside on which safekeepers/pageserver, and their addresses -- I feel it would be very weird and duplicating to have the proposed broker and not use it for SD. We need to (reliably, i.e. with retries) broadcast console-originated tli assignment and config to server nodes anyway.

I'm fine with etcd. It is a somewhat strange choice given that we don't need persistency (there might be related performance issues, btw), but anyway. Really we need something highly available, not strongly consistent, with a get/put + pub/sub by prefix interface.
I think both centralized and gossip approaches can be useful in different cases. I also agree that the centralized approach is simpler for solving our current problems, so we can start by implementing a centralized metadata broker and implement gossip later if we need it. Then it will be possible to choose freely between both approaches.
But with an etcd we are in a bit different situation:

1. We don't need persistency and strong consistency guarantees for the data we store in the etcd
I don't feel great about running a single-node in-memory etcd, because we'd probably be the first to try that in production. If we need a third-party in-memory kv store, Redis would run faster and be easier to manage.
Yeah, I don't suggest running it in a single-node setup in production. My point was that if we later want to drop etcd as a dependency for self-hosted installations, it would be easier to do so if we do not store any persistent data there -- we can change it to Redis or swap it for a custom gRPC service. But once we start keeping reliable info there, it would be way harder to switch to something in-house.
### Safekeeper behavior
For each timeline, the safekeeper periodically broadcasts the `compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/*` fields. It subscribes to changes of `compute_#{tenant}_#{timeline}` -- that way the safekeeper will have information about the peering safekeepers.
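Not part of the RFC text, but to make the shape of this concrete, here is a minimal sketch of the broadcast/subscribe loop using the Rust `etcd-client` crate; the key names, values, and endpoint are illustrative -- only the put-then-watch-by-prefix pattern matters:

```rust
use etcd_client::{Client, WatchOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = Client::connect(["127.0.0.1:2379"], None).await?;

    // Advertise one of this safekeeper's per-timeline fields
    // (key layout follows the pattern quoted above; the value is made up).
    client
        .put("compute_ten1_tl1/safekeepers/sk_1/commit_lsn", "0/16B3748", None)
        .await?;

    // Subscribe to everything under the timeline prefix to learn about peers.
    let (_watcher, mut stream) = client
        .watch("compute_ten1_tl1/", Some(WatchOptions::new().with_prefix()))
        .await?;
    while let Some(resp) = stream.message().await? {
        for ev in resp.events() {
            if let Some(kv) = ev.kv() {
                println!("peer update: {} = {}", kv.key_str()?, kv.value_str()?);
            }
        }
    }
    Ok(())
}
```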
This is not specific to the safekeeper, but what is the period? And how can the reader identify that the data is stale or that its advertiser is down? There could be some threshold value, e.g. no updates in 2 * update_period. Do we need to guard against possible clock skew?
> what is the period?
The period would delay WAL trimming / S3 pushdown switchover / pageserver reconnection time (only for stale connections, though; usually the pageserver will receive a TCP error). I think 1 second is fine.
> And how can the reader identify that the data is stale or that its advertiser is down?
Safekeepers only care about the median(lsn) among the group, so they can avoid that.
The pageserver needs to check that the safekeeper it is connected to is not extremely behind the peering safekeepers. We can try to do that based on the amount of LSN lag between our safekeeper and the rest of them, but that approach can lead to a livelock. Alternatively, we can advertise the time of the last heartbeat from the compute in each safekeeper (right now heartbeats are sent each second) and have quite a big timeout to force a re-connection in the pageserver (let's say 30 sec) to defend against time skew. That should be fine, since it is a last-resort measure -- with most of the usual problems on the safekeeper (crash, restart, OOM, etc.) we should be able to detect the disconnect on the pageserver, and that 30 sec timeout would kick in only in rare connectivity / freeze scenarios. IIRC protocol-level heartbeats between safekeeper and pageserver are not enabled right now -- we can enable them as well and have smaller disconnect timeouts there (since we may just measure the delay without looking at the exact value of the timestamp).
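As a rough sketch of that last-resort check (field names and units are made up; only the generous 30-second slack against clock skew comes from the comment above):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Last-resort check on the pageserver: the safekeeper advertises when it
/// last heard a heartbeat from the compute; if that is more than ~30s in the
/// past by our own wall clock, force a reconnect. The large timeout leaves
/// room for clock skew.
fn should_force_reconnect(advertised_compute_heartbeat_ms: u64) -> bool {
    const RECONNECT_TIMEOUT: Duration = Duration::from_secs(30);
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the unix epoch")
        .as_millis() as u64;
    now_ms.saturating_sub(advertised_compute_heartbeat_ms)
        > RECONNECT_TIMEOUT.as_millis() as u64
}
```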
> median(lsn)
MIN for commit and MAX to calculate the lag, right?
> to defend against time skew
I think there is one more option: we can defend even against a connection thread on the safekeeper being stuck, not only against clock skew. We can compare the last heartbeat time not with the pageserver's current system time, but with the values advertised by the other safekeepers. In that case the safekeeper should forward the system time from the compute, avoiding generating its own timestamp. Then all the compared values are from the same time source. The only exception is when there is a different compute, which shouldn't be a problem, and even if it is, we can add some compute id to avoid comparing timestamps from different computes. What do you think?
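A tiny sketch of that comparison, assuming each safekeeper forwards the compute-originated heartbeat timestamp in milliseconds (names and the lag threshold are illustrative):

```rust
/// Decide whether the safekeeper we are streaming from looks stuck, by
/// comparing the compute heartbeat timestamp it advertises against the newest
/// timestamp advertised by its peers. All values come from the same time
/// source (the compute), so the pageserver's own clock never enters the check.
fn looks_stuck(our_sk_heartbeat_ms: u64, peer_heartbeats_ms: &[u64], max_lag_ms: u64) -> bool {
    let newest = peer_heartbeats_ms
        .iter()
        .copied()
        .max()
        .unwrap_or(our_sk_heartbeat_ms);
    newest.saturating_sub(our_sk_heartbeat_ms) > max_lag_ms
}
```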
I've tried to make some sketches; this is the version with the livelock that I used to imagine different scenarios.
I can fix it to use timestamp comparison, and if you have some failure scenarios in mind I can add them too. I think it is a good idea to add a strict sequence of actions for different kinds of failures.
The other possible diagram might describe concurrent checkpoint handling by pageservers. I'm still thinking about what type of coordination we need here. Can we even avoid it somehow?
For that case, the pageserver which is currently streaming to S3 can advertise a flag that it is streaming, the latest checkpoint LSN, and a timestamp; this timestamp may be the max value of the compute heartbeat timestamp it has seen. This way the other pageserver can detect the difference between the current max timestamp and the advertised one, and take over the streaming role. Thoughts?
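For illustration, the takeover check could look roughly like this (struct and field names are hypothetical, not from the RFC):

```rust
/// State advertised by the pageserver that currently streams checkpoints to S3.
struct S3StreamerState {
    streaming: bool,
    latest_checkpoint_lsn: u64,
    /// Max compute heartbeat timestamp the streamer has seen, in ms.
    last_seen_compute_ts_ms: u64,
}

/// A standby pageserver takes over if the advertised streamer looks dead:
/// the newest compute timestamp we observe is far ahead of the one the
/// streamer last acknowledged. The threshold here is arbitrary.
fn should_take_over(advertised: &S3StreamerState, current_max_compute_ts_ms: u64) -> bool {
    const TAKEOVER_LAG_MS: u64 = 30_000;
    advertised.streaming
        && current_max_compute_ts_ms.saturating_sub(advertised.last_seen_compute_ts_ms)
            > TAKEOVER_LAG_MS
}
```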
I doubt we need to fight clock skew here. How about the following straw man for choosing the safekeeper to fetch WAL from:
- If there is no live connection to a safekeeper, choose the one with the max commit_lsn.
- Otherwise, switch to the safekeeper with a higher commit_lsn iff either 1) a timeout (tracked locally on the pageserver) has passed with no data coming through the current connection (prevents livelock), or 2) the diff between the commit_lsn of our safekeeper and the max commit_lsn exceeds some threshold -- we are talking to a deeply lagging safekeeper.
This procedure can be run regularly, plus on interesting events like a lost connection to the safekeeper or new data arriving from the broker.
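A sketch of this straw man as a pure selection function (names and thresholds are illustrative):

```rust
/// Per-safekeeper state as seen through the broker.
struct SafekeeperInfo {
    id: u64,
    commit_lsn: u64,
}

/// Returns the id of the safekeeper the pageserver should fetch WAL from.
fn choose_safekeeper(
    current: Option<&SafekeeperInfo>, // the one we are connected to, if any
    peers: &[SafekeeperInfo],         // latest state from the broker
    no_data_timeout_expired: bool,    // tracked locally on the pageserver
    max_lag: u64,                     // allowed commit_lsn lag behind the best peer
) -> Option<u64> {
    let best = peers.iter().max_by_key(|sk| sk.commit_lsn)?;
    match current {
        // No live connection: take the peer with the max commit_lsn.
        None => Some(best.id),
        // Connected: switch only on timeout or when our safekeeper lags too far.
        Some(cur)
            if no_data_timeout_expired
                || best.commit_lsn.saturating_sub(cur.commit_lsn) > max_lag =>
        {
            Some(best.id)
        }
        // Otherwise stick with the current connection.
        Some(cur) => Some(cur.id),
    }
}
```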
Generally, tracking live servers without bothering about time skew can be done by pushing keys to etcd with an expiration period, but I'm not sure where we need that currently.
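If we ever do want that, etcd leases give keys a TTL so they disappear when the owner stops refreshing them. A hypothetical sketch with the Rust `etcd-client` crate (endpoint, key, and TTL are illustrative):

```rust
use etcd_client::{Client, PutOptions};

/// Liveness tracking via an etcd lease: the key vanishes automatically if the
/// safekeeper stops refreshing the lease.
async fn advertise_alive(client: &mut Client, sk_id: u64) -> Result<(), etcd_client::Error> {
    // Grant a lease with a 10-second TTL.
    let lease = client.lease_grant(10, None).await?;
    // Attach the liveness key to the lease.
    client
        .put(
            format!("safekeepers/sk_{sk_id}/alive"),
            "1",
            Some(PutOptions::new().with_lease(lease.id())),
        )
        .await?;
    // The safekeeper would keep refreshing the lease from its main loop;
    // lease_keep_alive returns a (keeper, response stream) pair for that.
    let (mut keeper, _responses) = client.lease_keep_alive(lease.id()).await?;
    keeper.keep_alive().await?;
    Ok(())
}
```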
Motivation
As in 014-safekeeper-gossip we need to solve the following problems:
This RFC suggests a more generic and hopefully more manageable way to address those problems. However, unlike 014-safekeeper-gossip, it does not bring us any closer to safekeeper-to-safekeeper recovery, but rather unties the two different sets of issues we previously wanted to solve with gossip.
Also, with this approach, we would not need "call me maybe" anymore, and the pageserver will have all the data required to understand that it needs to reconnect to another safekeeper.
Summary
Instead of p2p gossip, let's have a centralized broker where all the storage nodes report per-timeline state. Each storage node should have a `--broker-url=1.2.3.4` CLI param.
Here I propose two ways to do that. After a lot of arguing with myself, I'm leaning towards the etcd approach. My arguments for it are in the pros/cons section. Both options require adding a gRPC client to our codebase, either directly or as a dependency of the etcd client.
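Purely for illustration, the flag could be wired up like this (clap's derive API assumed; the real CLI may be structured differently):

```rust
use clap::Parser;

/// Hypothetical sketch of the new flag; not the actual zenith CLI.
#[derive(Parser, Debug)]
struct StorageNodeArgs {
    /// Endpoint of the metadata broker (etcd or a custom gRPC service).
    #[arg(long = "broker-url")]
    broker_url: String,
}

fn main() {
    let args = StorageNodeArgs::parse();
    println!("will report per-timeline state to {}", args.broker_url);
}
```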