015-storage-messaging #16
This is not specific to safekeeper, but what is the period? And how can the reader identify the data as stale or its advertiser as down? There could be some threshold value, e.g. no updates in 2 * update_period. Do we need to guard against possible clock skew?
The period would delay WAL trimming / S3 pushdown switchover / pageserver reconnection time (only for stale connections, though; usually the pageserver will receive a TCP error). I think 1 second is fine.
Safekeepers only care about median(lsn) among the group, so they can avoid that.
The pageserver needs to check that the safekeeper it is connected to is not extremely behind the peer safekeepers. We can try to do that based on the amount of LSN lag between our safekeeper and the rest of them, but that approach can lead to a livelock. Alternatively, we can advertise the time of the last heartbeat from the compute in each safekeeper (right now heartbeats are sent every second) and use quite a big timeout to force a reconnection in the pageserver (let's say 30 sec) to defend against clock skew. That should be fine since it is a last-resort measure -- for most of the usual problems on the safekeeper (crash, restart, OOM, etc.) we should be able to detect the disconnect on the pageserver, and the 30 sec timeout would kick in only in rare connectivity / freeze scenarios. IIRC protocol-level heartbeats between safekeeper and pageserver are not enabled right now -- we could enable them as well and use smaller disconnect timeouts there (since we may just measure the delay without looking at the exact value of the timestamp).
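A minimal sketch of that pageserver-side check, assuming a hypothetical `SafekeeperInfo` shape for the data advertised through the broker; the `HEARTBEAT_TIMEOUT` and `MAX_LSN_LAG` constants are illustrative, not values from the RFC:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical per-safekeeper state as advertised through the broker.
struct SafekeeperInfo {
    safekeeper_id: u64,
    commit_lsn: u64,
    /// Time of the last compute heartbeat seen by this safekeeper.
    last_compute_heartbeat: SystemTime,
}

/// Last-resort timeout; generous on purpose to tolerate clock skew.
const HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(30);

/// Arbitrary lag bound for the sketch: 128 MB of WAL.
const MAX_LSN_LAG: u64 = 128 * 1024 * 1024;

/// Decide whether the pageserver should drop its current safekeeper
/// connection and pick another one.
fn should_reconnect(current: &SafekeeperInfo, peers: &[SafekeeperInfo]) -> bool {
    // Usually a dead safekeeper is detected via a TCP error; this check only
    // covers stale-but-open connections.
    let stale = SystemTime::now()
        .duration_since(current.last_compute_heartbeat)
        .map(|age| age > HEARTBEAT_TIMEOUT)
        .unwrap_or(false);

    // Also reconnect if some peer is far ahead of us in terms of LSN.
    let lagging = peers
        .iter()
        .any(|p| p.commit_lsn.saturating_sub(current.commit_lsn) > MAX_LSN_LAG);

    stale || lagging
}
```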
MIN for commit and MAX to calculate the lag, right?
I think there is one more option: we can defend even against a connection thread on the safekeeper being stuck, not only against clock skew. We can compare the last heartbeat time not with the pageserver's current system time, but with the values advertised by the other safekeepers. In that case the safekeeper should forward the system time from the compute instead of generating its own timestamp, so all the compared values come from the same time source. The only exception is when a different compute is involved, which shouldn't be a problem; and even if it is, we can add some compute id to avoid comparing timestamps from different computes. What do you think?
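A sketch of that variant, assuming a hypothetical `AdvertisedState` entry in which the safekeeper forwards the compute's timestamp verbatim; the threshold is arbitrary:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical broker entry: the safekeeper forwards the compute's own
/// timestamp instead of generating one from its local clock.
struct AdvertisedState {
    safekeeper_id: u64,
    compute_heartbeat_ts: SystemTime, // taken verbatim from the compute
}

const MAX_HEARTBEAT_LAG: Duration = Duration::from_secs(30);

/// Compare the current safekeeper's advertised compute timestamp against the
/// freshest timestamp advertised by its peers; all values come from the same
/// clock (the compute), so pageserver-side clock skew does not matter.
fn stuck_behind_peers(current: &AdvertisedState, peers: &[AdvertisedState]) -> bool {
    let freshest = peers.iter().map(|p| p.compute_heartbeat_ts).max();
    match freshest {
        Some(ts) => ts
            .duration_since(current.compute_heartbeat_ts)
            .map(|lag| lag > MAX_HEARTBEAT_LAG)
            .unwrap_or(false),
        None => false, // no peer data to compare against
    }
}
```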
I've tried to make some sketches; this is the version with the livelock that I used to walk through different scenarios: [diagram]
I can fix it to use timestamp comparison, and if you have some failure scenarios in mind I can add them too. I think it is a good idea to define a strict sequence of actions for different kinds of failures.
The other possible diagram might describe concurrent checkpoint handling by pageservers. I'm still thinking about what type of coordination we need here. Can we even avoid it somehow?
For that case, the pageserver which is currently streaming to S3 can advertise a flag that it is streaming, the latest checkpoint LSN, and a timestamp; this timestamp may be the max compute heartbeat timestamp it has seen. This way another pageserver can detect the difference between the current max timestamp and the advertised one, and take over the streaming role. Thoughts?
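Roughly what that advertisement and takeover check could look like, with `S3StreamerAd` and the threshold invented here for illustration:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical record the "streaming" pageserver publishes to the broker.
struct S3StreamerAd {
    pageserver_id: u64,
    streaming: bool,
    latest_checkpoint_lsn: u64,
    /// Max compute heartbeat timestamp this pageserver has observed.
    max_compute_heartbeat_ts: SystemTime,
}

/// How long the advertisement may go without fresh compute heartbeats before
/// a standby pageserver takes over S3 streaming; arbitrary value for the sketch.
const TAKEOVER_THRESHOLD: Duration = Duration::from_secs(60);

/// Run on a standby pageserver: compare the streamer's advertised timestamp
/// with the freshest compute timestamp we have seen ourselves, so both sides
/// of the comparison originate from the compute's clock.
fn should_take_over(ad: &S3StreamerAd, our_max_compute_ts: SystemTime) -> bool {
    ad.streaming
        && our_max_compute_ts
            .duration_since(ad.max_compute_heartbeat_ts)
            .map(|lag| lag > TAKEOVER_THRESHOLD)
            .unwrap_or(false)
}
```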
I doubt we need to fight clock skew here. How about the following straw man for choosing the safekeeper to fetch WAL from:
This procedure can be run regularly, plus on interesting events like losing the connection to a safekeeper or getting new data from the broker.
Generally, tracking live servers without worrying about time skew can be done by pushing keys to etcd with an expiration period (a lease), but I'm not sure where we need that currently.
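For reference, a minimal sketch of that lease idea using the etcd-client crate: the key disappears automatically when the lease expires, so readers never have to compare timestamps. The key name, value, and TTLs here are made up:

```rust
// Cargo.toml (illustrative): etcd-client = "0.12", tokio = { version = "1", features = ["full"] }
use etcd_client::{Client, PutOptions};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = Client::connect(["localhost:2379"], None).await?;

    // Grant a lease with a short TTL; the key vanishes automatically if we
    // stop refreshing it, so liveness does not depend on synchronized clocks.
    let lease = client.lease_grant(5, None).await?;
    let lease_id = lease.id();

    // Advertise this safekeeper's state under the lease (key/value invented).
    client
        .put(
            "safekeepers/sk-1/alive",
            "commit_lsn=0/16B9188",
            Some(PutOptions::new().with_lease(lease_id)),
        )
        .await?;

    // Keep the lease alive as long as the process is healthy.
    let (mut keeper, _responses) = client.lease_keep_alive(lease_id).await?;
    loop {
        keeper.keep_alive().await?;
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
}
```

A reader would then just watch the `safekeepers/` prefix; absence of the key means the safekeeper is considered down.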
I think it is worthwhile to avoid clock skew issues early on, during the initial system design.
I've modified the diagram to match the picture we've agreed upon: [diagram]
Let me know if I missed something or you spot any errors.
This is my understanding of the tenant migration process with the metadata service: [diagram]
I don't feel great about running a single-node in-memory etcd, because we'd probably be the first to try that in production. If we need a third-party in-memory KV store, Redis would run faster and be easier to manage.
Yeah, I don't suggest running it in a single-node setup in production. My point was that if we later want to drop etcd as a dependency for self-hosted installations, it would be easier to do so if we do not store any persistent data there -- we could change it to Redis or swap it for a custom gRPC service. But once we start keeping data there that has to be reliable, it would be much harder to switch to something in-house.