This repository has been archived by the owner on Apr 14, 2022. It is now read-only.

015-storage-messaging #16

Open · kelvich wants to merge 2 commits into main
Conversation

@kelvich (Contributor) commented Jan 19, 2022

Motivation

As in 014-safekeeper-gossip we need to solve the following problems:

  • Trim WAL on safekeepers
  • Decide on which SK should push WAL to the S3
  • Decide on which SK should forward WAL to the pageserver
  • Decide on when to shut down SK<->pageserver connection

This RFC suggests a more generic and hopefully more manageable way to address those problems. However, unlike 014-safekeeper-gossip, it does not bring us any closer to safekeeper-to-safekeeper recovery but rather unties two sets of different issues we previously wanted to solve with gossip.

Also, with this approach, we would not need "call me maybe" anymore, and the pageserver will have all the data required to understand that it needs to reconnect to another safekeeper.

Summary

Instead of p2p gossip, let's have a centralized broker where all the storage nodes report per-timeline state. Each storage node should have a --broker-url=1.2.3.4 CLI param.

Here I propose two ways to do that. After a lot of arguing with myself, I'm leaning towards the etcd approach. My arguments for it are in the pros/cons section. Both options require adding a gRPC client to our codebase, either directly or as a dependency of the etcd client.

read more
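To make "per-timeline state in a centralized broker" a bit more concrete, here is a minimal sketch of a key-layout helper, assuming the `compute_#{tenant}_#{timeline}/...` scheme discussed further down in this thread; the `commit_lsn` field and the example values are illustrative assumptions, not the final schema:

```rust
/// Minimal sketch of the broker key layout a storage node might publish to.
/// The `compute_{tenant}_{timeline}/...` scheme follows the RFC discussion
/// below; the concrete field name used here is illustrative only.
fn sk_field_key(tenant: &str, timeline: &str, sk_id: u64, field: &str) -> String {
    format!("compute_{tenant}_{timeline}/safekeepers/sk_{sk_id}/{field}")
}

fn main() {
    // Example: safekeeper 1 advertising its commit LSN for one timeline.
    let key = sk_field_key("<tenant>", "<timeline>", 1, "commit_lsn");
    println!("{key} = 0/16B3748");
}
```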

@kelvich changed the title from "015-storage-messaging first version" to "015-storage-messaging" on Jan 19, 2022
@kelvich added the c/pageserver (Component: pageserver), c/safekeeper (Component: safekeeper), and t/tech_design_rfc (Type: tech design RFC) labels on Jan 20, 2022
@LizardWizzard commented:

I agree with the approach of having a separate metadata coordination service.

I think it is better to have a separate service to manage this state. There are at least two reasons.

One that you've already mentioned is fault tolerance. If storage and console are tied this way, we would have to make the console as available as other infrastructure services (like S3, for example). So in my opinion storage should avoid having a hard dependency on the console.

The other reason is that users who run zenith themselves wouldn't have our console available.

So I think the approach with a separate metadata service is preferable. etcd is fine, but I think it is worth mentioning/discussing alternatives, e.g. Consul.

@arssher (Contributor) commented Jan 24, 2022

I agree that in the cloud some way of centralized config broadcast from the console is a good thing; however, I'm quite hesitant to add etcd or whatever as one more hard dependency, required in tests and local setups. What do you think about drawing the line differently, or combining the approaches, and using the centralized service only for console-originated info and service discovery, that is, for:

  1. Info on which servers which timelines reside on;
  2. Addresses of these servers;
  3. Configuration (once we start wanting to change it on the fly)?

So, safekeepers would learn their timelines and peers for these timelines from the broker and then gossip directly. A fully centralized service lets us avoid managing these connections, but OTOH with the hybrid approach we get a more decoupled system (e.g. in case of problems you mainly have to check 2 places, not 3) and the possibility of running tests/local setups without a broker at all.
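To make the "console-originated info and service discovery" scope concrete, here is a minimal sketch of the kind of per-timeline record the broker could hold under this hybrid approach; the struct, field names, and addresses are my assumptions, not an agreed schema:

```rust
/// Sketch of a console-originated assignment record the broker could hold
/// per timeline under the hybrid approach: which safekeepers/pageserver a
/// timeline lives on and how to reach them. All names are illustrative.
#[derive(Debug, Clone)]
struct TimelineAssignment {
    tenant_id: String,
    timeline_id: String,
    /// Addresses of the safekeepers hosting this timeline (points 1 + 2).
    safekeepers: Vec<String>,
    /// Address of the pageserver serving this timeline.
    pageserver: String,
}

fn main() {
    let assignment = TimelineAssignment {
        tenant_id: "<tenant>".into(),
        timeline_id: "<timeline>".into(),
        safekeepers: vec![
            "sk-1.local:5454".into(),
            "sk-2.local:5454".into(),
            "sk-3.local:5454".into(),
        ],
        pageserver: "ps-1.local:6400".into(),
    };
    // Safekeepers would read this from the broker and then gossip directly
    // with the listed peers instead of routing everything through the broker.
    println!("{assignment:?}");
}
```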

I still think that the best way to do safekeeper recovery is direct WAL fetching from peers, so we need to know them anyway.

As for callmemaybe, we can rework it into something close to what is proposed here: safekeepers would know the pageserver(s) for each of their timelines from the broker and would continuously push data for active timelines to the pageservers, providing them with info about LSNs and such, allowing them to choose from whom to pull and inviting them to do so. Safekeepers here are the initiators, as they learn about compute appearance first.
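A rough sketch of the kind of status a safekeeper could push to pageservers under this scheme; all field names (`commit_lsn`, `flush_lsn`, `last_compute_heartbeat_ms`) are assumptions, the point is only "enough LSN/heartbeat info for the pageserver to choose from whom to pull":

```rust
/// Sketch of the per-timeline status a safekeeper might push to pageservers,
/// inviting them to pull WAL. Field names are illustrative assumptions.
#[derive(Debug, Clone, Copy)]
struct SafekeeperTimelineStatus {
    safekeeper_id: u64,
    /// Last LSN known to be committed by a quorum of safekeepers.
    commit_lsn: u64,
    /// Last LSN flushed to local disk on this safekeeper.
    flush_lsn: u64,
    /// Last compute heartbeat seen, as reported by the compute (unix millis).
    last_compute_heartbeat_ms: u64,
}

fn main() {
    let status = SafekeeperTimelineStatus {
        safekeeper_id: 1,
        commit_lsn: 0x16B3748,
        flush_lsn: 0x16B3760,
        last_compute_heartbeat_ms: 1_643_000_000_000,
    };
    // The pageserver compares statuses from all safekeepers of a timeline and
    // decides from which one to pull (and whether to reconnect).
    println!("{status:?}");
}
```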

I don't think n-to-n connections are a big problem here, as we don't normally (unless in recovery) carry WAL in such a horizontal way, and the gossiped info will be multiplexed into a single physical connection. I mean, yes, we still have n of them instead of 1, but n won't be too big in the short-to-mid term (even 10k async connections won't generate tremendous load), and if it becomes so, we can indeed assign safekeepers more smartly. That is actually good for load balancing anyway, I think, as the load of each timeline is the same on all of its safekeepers. It also provides logical grouping, e.g. it would be immediately clear whether we have problems with whole safekeepers or with individual timelines.

Also, though the RFC explicitly says it is not about service discovery -- and by that I mean learning which timelines reside on which safekeepers/pageservers, and their addresses -- I feel it would be very weird and duplicative to have the proposed broker and not use it for SD. We need to (reliably, i.e. with retries) broadcast console-originated timeline assignments and config to server nodes anyway.

I'm fine with etcd. It is a somewhat strange choice given that we don't need persistency (there might be related performance issues, btw), but anyway. Really, what we need is something highly available, not strongly consistent, with a get/put + pub/sub-by-prefix interface.

@petuhovskiy (Member) left a comment:

I think both the centralized and the gossip approach can be useful in different cases. I also agree that the centralized approach is simpler for solving our current problems, so we can start by implementing a centralized metadata broker and implement gossip afterwards if we need it. Then it will be possible to choose freely between the two approaches.


Inline review comment on 015-storage-messaging.md, quoting the RFC:

> But with etcd we are in a bit of a different situation:
>
> 1. We don't need persistency and strong consistency guarantees for the data we store in etcd

I don't feel great about running a single-node in-memory etcd, because we'd probably be the first to try that in production. If we need a third-party in-memory KV store, Redis would run faster and be easier to manage.

@kelvich (Contributor, Author) replied:

Yeah, I don't suggest running it in a single-node setup in production. My point was that if we later want to drop etcd as a dependency for self-hosted installations, it would be easier to do so if we do not store any persistent data there -- we could change it to Redis or swap it for a custom gRPC service. But once we start keeping reliable info there, it would be much harder to switch to something in-house.


Another inline comment, quoting the RFC:

> ### Safekeeper behavior
>
> For each timeline, the safekeeper periodically broadcasts the `compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/*` fields. It subscribes to changes of `compute_#{tenant}_#{timeline}` -- that way the safekeeper will have information about the peering safekeepers.
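A minimal sketch of what this periodic broadcast could look like, with the broker hidden behind a made-up `Broker` trait so as not to pin down a concrete etcd client API; the key layout follows the quoted scheme, while the field names and the ~1 second period are assumptions taken from the discussion below:

```rust
/// Abstract broker interface so this sketch does not commit to a concrete
/// client API (in practice this would be an etcd/gRPC client).
trait Broker {
    fn put(&self, key: &str, value: &str);
}

/// Sketch of the per-timeline broadcast a safekeeper could run periodically.
/// Key layout follows the RFC; field names are illustrative assumptions.
fn broadcast_timeline_state(
    broker: &dyn Broker,
    tenant: &str,
    timeline: &str,
    sk_id: u64,
    commit_lsn: u64,
    flush_lsn: u64,
) {
    let prefix = format!("compute_{tenant}_{timeline}/safekeepers/sk_{sk_id}");
    broker.put(&format!("{prefix}/commit_lsn"), &commit_lsn.to_string());
    broker.put(&format!("{prefix}/flush_lsn"), &flush_lsn.to_string());
}

struct StdoutBroker;
impl Broker for StdoutBroker {
    fn put(&self, key: &str, value: &str) {
        println!("PUT {key} = {value}");
    }
}

fn main() {
    let broker = StdoutBroker;
    // In a real safekeeper this would run on a timer (~1 s, see the reply
    // below) and there would also be a watch on `compute_{tenant}_{timeline}`
    // to learn about peer safekeepers.
    broadcast_timeline_state(&broker, "<tenant>", "<timeline>", 1, 100, 120);
}
```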


This is not specific to the safekeeper, but what is the period? And how can a reader identify that the data is stale or that its advertiser is down? There could be some threshold value, e.g. no updates in 2 * update_period. Do we need to guard against possible clock skew?

@kelvich (Contributor, Author) replied:

> what is the period?

The period would delay WAL trimming / S3 pushdown switchover / pageserver reconnection time (only for stale connections, though; usually the pageserver will receive a TCP error). I think 1 second is fine.

> And how can the reader identify that the data is stale or its advertiser is down?

Safekeepers only care about the median(lsn) among the group, so they can avoid that.

The pageserver needs to check that the safekeeper it is connected to is not extremely behind the peering safekeepers. We could do that based on the amount of LSN lag between our safekeeper and the rest of them, but that approach can lead to a livelock. Alternatively, we can advertise the time of the last heartbeat from the compute in each safekeeper (right now heartbeats are sent every second) and have a fairly big timeout to force a reconnection in the pageserver (let's say 30 sec) to defend against clock skew. That should be fine since it is a last-resort measure -- for most of the usual problems on a safekeeper (crash, restart, OOM, etc.) we should be able to detect the disconnect on the pageserver, and the 30 sec timeout would only kick in in rare connectivity/freeze scenarios. IIRC protocol-level heartbeats between safekeeper and pageserver are not enabled right now -- we can enable them as well and use smaller disconnect timeouts there (since we can just measure the delay without looking at the exact value of the timestamp).
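A minimal sketch of that last-resort check on the pageserver side, assuming the advertised value is the compute heartbeat time in unix milliseconds and the threshold is the 30 s mentioned above:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Sketch of the last-resort reconnect check: reconnect if the currently
/// connected safekeeper has not advertised a compute heartbeat for a long
/// time. The wall-clock comparison and 30 s threshold follow the discussion
/// above; the field meaning is an assumption.
fn should_force_reconnect(last_compute_heartbeat_ms: u64, timeout: Duration) -> bool {
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before unix epoch")
        .as_millis() as u64;
    now_ms.saturating_sub(last_compute_heartbeat_ms) > timeout.as_millis() as u64
}

fn main() {
    let timeout = Duration::from_secs(30);
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_millis() as u64;
    // A heartbeat advertised one minute ago is past the 30 s threshold.
    assert!(should_force_reconnect(now_ms - 60_000, timeout));
    println!("stale connection detected, reconnect to another safekeeper");
}
```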

@LizardWizzard commented Jan 26, 2022:

> median(lsn)

MIN for commit and MAX to calculate the lag, right?

> to defend against the time skew

I think there is one more option: we can defend even against the connection thread on the safekeeper being stuck, not only against clock skew. We can compare the last heartbeat time not with the pageserver's current system time, but with the values advertised by the other safekeepers. In that case the safekeeper should forward the system time from the compute, avoiding generating its own timestamp. Then all the compared values come from the same time source. The only exception is when there is a different compute, which shouldn't be a problem, and even if it is, we can add some compute id to avoid comparing timestamps from different computes. What do you think?
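A minimal sketch of that peer-relative check, assuming each safekeeper advertises the compute-originated heartbeat timestamp; the lag threshold and names are illustrative:

```rust
/// Sketch of the clock-skew-free check proposed here: compare the compute
/// timestamp advertised by the connected safekeeper against the maximum
/// compute timestamp advertised by its peers, so all values come from the
/// same time source (the compute). Threshold and names are assumptions.
fn is_lagging_behind_peers(
    connected_sk_heartbeat_ms: u64,
    peer_heartbeats_ms: &[u64],
    max_lag_ms: u64,
) -> bool {
    let max_peer = peer_heartbeats_ms.iter().copied().max().unwrap_or(0);
    max_peer.saturating_sub(connected_sk_heartbeat_ms) > max_lag_ms
}

fn main() {
    // The connected safekeeper last relayed a compute heartbeat at t=1000 ms,
    // while some peer already saw t=40000 ms: the connection (or the
    // safekeeper's connection thread) is likely stuck.
    assert!(is_lagging_behind_peers(1_000, &[35_000, 40_000], 10_000));
    println!("connected safekeeper is stuck/lagging, consider switching");
}
```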

I've tried to make some sketches; this is the version with the livelock that I used to imagine different scenarios:

[mermaid-diagram-20220126193556]

I can fix it to use timestamp comparison, and if you have some failure scenarios in mind I can add them too. I think it is a good idea to write down a strict sequence of actions for different kinds of failures.

The other possible diagram might describe concurrent checkpoint handling by pageservers. I'm still thinking about what type of coordination we need here. Can we even avoid it somehow?

For that case, the pageserver which is currently streaming to S3 can advertise a flag saying that it is streaming, the latest checkpoint LSN, and a timestamp; this timestamp may be the max compute heartbeat timestamp it has seen. This way the other pageserver can detect the difference between the current max timestamp and the advertised one and take over the streaming role. Thoughts?
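A minimal sketch of that takeover rule, assuming the streaming pageserver advertises exactly those three fields; the struct, field names, and lag threshold are illustrative assumptions:

```rust
/// Sketch of the suggested takeover rule: the pageserver currently streaming
/// to S3 advertises its latest checkpoint LSN together with the max compute
/// heartbeat timestamp it has seen; another pageserver takes over if that
/// advertised timestamp falls too far behind the current maximum.
struct S3StreamerAd {
    is_streaming: bool,
    latest_checkpoint_lsn: u64,
    max_seen_compute_ts_ms: u64,
}

fn should_take_over(ad: &S3StreamerAd, current_max_compute_ts_ms: u64, max_lag_ms: u64) -> bool {
    !ad.is_streaming
        || current_max_compute_ts_ms.saturating_sub(ad.max_seen_compute_ts_ms) > max_lag_ms
}

fn main() {
    let ad = S3StreamerAd {
        is_streaming: true,
        latest_checkpoint_lsn: 0x16B3748,
        max_seen_compute_ts_ms: 1_000,
    };
    // The advertised streamer has not seen fresh compute heartbeats for a
    // while: another pageserver should take over streaming to S3.
    assert!(should_take_over(&ad, 100_000, 30_000));
    println!(
        "taking over S3 streaming from checkpoint LSN {:#x}",
        ad.latest_checkpoint_lsn
    );
}
```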

@arssher (Contributor) commented Feb 3, 2022:

I doubt we need to fight clock skew here. How about the following straw man for choosing the safekeeper to fetch WAL from:

  1. If there is no live connection to a safekeeper, choose the one with the max commit_lsn.
  2. Otherwise, switch to a safekeeper with a higher commit_lsn iff either 1) a timeout (tracked locally on the pageserver) has passed with no data coming through the current connection (prevents livelock), or 2) the diff between the commit_lsn of our safekeeper and the max commit_lsn exceeds some threshold -- i.e. we are talking to a deeply lagging safekeeper.

This procedure can be run regularly, plus on interesting events like a lost connection to a safekeeper or new data arriving from the broker.
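A minimal sketch of this straw man as a pure selection function; the types, the lag threshold, and how the no-data timeout is tracked are assumptions:

```rust
/// Sketch of the straw man above: pick which safekeeper to fetch WAL from,
/// based only on commit_lsn, a locally tracked no-data timeout, and a lag
/// threshold (no clock comparison).
#[derive(Debug, Clone, Copy)]
struct SkInfo {
    id: u64,
    commit_lsn: u64,
}

fn choose_safekeeper(
    candidates: &[SkInfo],
    current: Option<SkInfo>,
    no_data_timeout_expired: bool,
    lag_threshold: u64,
) -> Option<u64> {
    let best = candidates.iter().copied().max_by_key(|sk| sk.commit_lsn)?;
    match current {
        // 1. No live connection: take the safekeeper with the max commit_lsn.
        None => Some(best.id),
        // 2. Switch only if the current connection stalled or lags too much.
        Some(cur) => {
            let deeply_lagging = best.commit_lsn.saturating_sub(cur.commit_lsn) > lag_threshold;
            if no_data_timeout_expired || deeply_lagging {
                Some(best.id)
            } else {
                Some(cur.id)
            }
        }
    }
}

fn main() {
    let sks = [
        SkInfo { id: 1, commit_lsn: 1_000 },
        SkInfo { id: 2, commit_lsn: 5_000 },
        SkInfo { id: 3, commit_lsn: 4_900 },
    ];
    // Connected to safekeeper 1, which is far behind safekeeper 2: switch.
    let next = choose_safekeeper(&sks, Some(sks[0]), false, 1_000);
    assert_eq!(next, Some(2));
    println!("fetch WAL from safekeeper {:?}", next);
}
```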

Generally, tracking of live servers without bothering about clock skew can be done by pushing keys to etcd with an expiration period, but I'm not sure where we need that currently.

@LizardWizzard replied:

I think it is worthwhile to avoid clock skew issues early on, during initial system design.

I've modified the diagram to match the picture we've agreed upon:

[updated sequence diagram]

Let me know if I missed something or if you spot any errors.

@LizardWizzard added:

This is my understanding of the tenant migration process with the metadata service:

[tenant migration sequence diagram]
