014-safekeeper-gossip #13

kelvich · 2022-01-17T22:47:33Z

Safekeeper gossip

Motivation

In some situations, a safekeeper (SK) needs to coordinate with the other SKs that serve the same tenant:

WAL deletion. An SK needs to know which WAL has already been safely replicated before it can delete it. Currently we keep WAL indefinitely.
Deciding which SK sends WAL to the pageserver. Currently, a crash of the sending SK may lead to a livelock where nobody sends WAL to the pageserver.
Enabling direct SK-to-SK recovery without involving the compute node.

Summary

The compute node has connection strings for each safekeeper. Each time it establishes a compute->safekeeper connection, the compute node should pass all of those connection strings down to that safekeeper. With that information, safekeepers can establish Postgres connections to each other and periodically send ping messages with an LSN payload.
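To make the summary concrete, here is a minimal sketch of what a peer table built from those connection strings might look like, and how the LSN payload of incoming pings could feed the WAL deletion decision. All names (`Ping`, `PeerTable`, `flush_lsn`, `replicated_everywhere_lsn`) are hypothetical, as is the assumption that each safekeeper is identified by an integer node id. Nothing here is the actual safekeeper code, and the real deletion rule may depend on more than peer LSNs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Ping:
    """Hypothetical gossip message: who sent it and how far its WAL reaches."""
    sender_id: int
    flush_lsn: int  # assumed payload: last WAL position the sender has written


@dataclass
class PeerTable:
    """One safekeeper's view of its peers, built from the connection
    strings handed down by the compute node."""
    my_id: int
    peers: Dict[int, str]  # node id -> connection string
    known_lsn: Dict[int, int] = field(default_factory=dict)

    def on_ping(self, ping: Ping) -> None:
        # LSNs only move forward, so ignore stale or reordered pings.
        prev = self.known_lsn.get(ping.sender_id, 0)
        self.known_lsn[ping.sender_id] = max(prev, ping.flush_lsn)

    def replicated_everywhere_lsn(self, my_lsn: int) -> int:
        """Lowest LSN across this node and all peers.

        Peers we have not heard from yet count as 0, so nothing is
        reported as replicated until every peer has sent a ping.
        """
        lsns = [my_lsn] + [self.known_lsn.get(pid, 0) for pid in self.peers]
        return min(lsns)
```

In this sketch, WAL below `replicated_everywhere_lsn` is known to be present on every safekeeper, which is one possible input to the deletion decision rather than the decision itself.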

kelvich · 2022-01-19T12:16:36Z

I'm going to open a new one in the zenith repo.

bojanserafimov · 2022-01-25T00:20:33Z

text/014-safekeeper-gossip.md

+
+## Proposed implementation
+
+Each safekeeper can periodically ping all its peers and share connectivity and liveness info. If no ping is received from a peer for, let's say, four ping periods, we may consider that safekeeper dead. That would mean one of the alive safekeepers should connect to the pageserver. One way to decide which one exactly: `make_connection = my_node_id == min(alive_nodes)`


If a safekeeper fails, isn't that a human-intervention scenario anyway? Or do we have a membership change implementation? It's a tricky thing to get right.
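For reference while discussing this, the rule quoted from the proposal can be sketched as below. This is only an illustration under stated assumptions: `PING_PERIOD_SEC`, `alive_nodes`, `should_connect_to_pageserver`, and the `last_ping_at` map are hypothetical names, and the one-second period is a placeholder since the proposal does not fix a value.

```python
from typing import Dict, Set

PING_PERIOD_SEC = 1.0   # assumed value; the proposal does not specify a period
DEAD_AFTER_PERIODS = 4  # "four ping periods" from the proposal


def alive_nodes(my_node_id: int, last_ping_at: Dict[int, float], now: float) -> Set[int]:
    """Ids of nodes this safekeeper considers alive, always including itself."""
    deadline = DEAD_AFTER_PERIODS * PING_PERIOD_SEC
    alive = {node for node, seen in last_ping_at.items() if now - seen <= deadline}
    alive.add(my_node_id)
    return alive


def should_connect_to_pageserver(
    my_node_id: int, last_ping_at: Dict[int, float], now: float
) -> bool:
    """The proposal's rule: make_connection = my_node_id == min(alive_nodes)."""
    return my_node_id == min(alive_nodes(my_node_id, last_ping_at, now))
```

Each safekeeper evaluates this against its own local view, so two nodes that cannot reach each other may both conclude they are the minimum. The sketch does not address that case or membership changes.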

Safekeeper gossip description

e62d021

stepashka changed the title ~~Safekeeper gossip description~~ Safekeepers coordination (via gossip) Jan 18, 2022

stepashka added the c/safekeeper (Component: safekeeper), p/wal (Pageserver: relates to WAL processing), and a/reliability (Area: relates to reliability of the service) labels Jan 18, 2022

stepashka assigned kelvich Jan 18, 2022

stepashka added the t/tech_design_rfc (Type: tech design RFC) label Jan 18, 2022

kelvich closed this Jan 19, 2022

kelvich reopened this Jan 19, 2022

kelvich changed the title ~~Safekeepers coordination (via gossip)~~ 014-safekeeper-gossip Jan 19, 2022

kelvich mentioned this pull request Jan 19, 2022

015-storage-messaging #16

Open

bojanserafimov reviewed Jan 25, 2022

SomeoneToIgnore mentioned this pull request Mar 21, 2022

add safekepeers gossip annd storage messaging rfcs neondatabase/neon#1384

Merged
