
Faster remote room joins: Support partial join re-syncing on workers other than the master #14544

Open
matrixbot opened this issue Dec 20, 2023 · 0 comments

matrixbot commented Dec 20, 2023

This issue has been migrated from #14544.


An enhancement of: #12994 (worker-mode support for Faster Remote Room Joins).

Instead of relying on the master to perform the re-syncing of the rooms, we should allow other workers to be involved.
Part of the difficulty is in choosing a worker to perform the re-sync for a room, ensuring that even after a crash/restart, exactly one worker will pick up the job of re-syncing that room again.
We should also bear in mind that workers can be taken out of service in a real deployment: a room shouldn't be locked to one worker forever, as that would mean the re-sync never progresses if that worker goes away.

Aside: in future we should consider moving the /send_join request out of the master process. The obvious candidate is the "client reader" that receives the client-side /join request (and hence currently makes the request to ReplicationRemoteJoinRestServlet). The main thing to worry about then is locking (to ensure that we don't have multiple workers all trying to do the remote-join dance at once). For prior art in that department, we should look at the code that handles incoming events received over federation (https://github.com/matrix-org/synapse/blob/v1.69.0rc2/synapse/federation/federation_server.py#L1108-L1116), which uses a database row to hold a lock: we can simply call try_acquire_lock before starting a resync operation.
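To make the "database row as a lock" idea concrete, here is a rough sketch of a lease-style lock keyed on the room ID. It is not Synapse's implementation: the table name `resync_locks`, the function `try_acquire_resync_lock`, the `LEASE_SECONDS` value and the use of SQLite are all assumptions made for illustration. The lease expiry is what stops a room being pinned to a worker that has been taken out of service.

```python
import sqlite3
import time

# Hypothetical lease length: how long a lock stays valid without renewal.
LEASE_SECONDS = 120


def create_lock_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table holding one lock row per room."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS resync_locks (
            room_id TEXT PRIMARY KEY,
            worker_name TEXT NOT NULL,
            expires_at REAL NOT NULL
        )
        """
    )


def try_acquire_resync_lock(conn, room_id, worker_name, now=None):
    """Return True if this worker now holds the re-sync lock for the room.

    An expired row is treated as abandoned (e.g. its worker crashed or was
    removed) and is cleared before the attempt, so another worker can take over.
    """
    now = time.time() if now is None else now
    with conn:  # one transaction: clear a stale lease, then try to insert
        conn.execute(
            "DELETE FROM resync_locks WHERE room_id = ? AND expires_at <= ?",
            (room_id, now),
        )
        cur = conn.execute(
            "INSERT OR IGNORE INTO resync_locks (room_id, worker_name, expires_at) "
            "VALUES (?, ?, ?)",
            (room_id, worker_name, now + LEASE_SECONDS),
        )
        # The primary key on room_id means at most one row, and so one holder.
        return cur.rowcount == 1
```

In this sketch the holder would also need to renew its lease periodically while the re-sync is running, and delete the row when it finishes; both are omitted for brevity.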

That still leaves us with the problem of making sure we resume the partial-state resync if the client reader that is currently processing it gets restarted (or, worse, turned off, never to return). Again following the example of incoming events: in that case, we kick off a processing job as soon as a worker discovers itself to be a "federation inbound" worker by receiving a /send request. Probably we could do the same here on a /_matrix/client/v3/rooms/.*/(send|join|invite|leave|ban|unban|kick) request?
matrix-org/synapse#12994 (comment)
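A minimal sketch of the "kick off the job when the worker first sees a relevant request" idea from the comment above. The class `ResyncResumer` and its two callbacks are hypothetical names, not existing Synapse code; only the path pattern is taken from the comment. A real `start_resync` would still need to take the per-room lock before doing any work, so that several client readers waking up at once do not duplicate the re-sync.

```python
import re

# Path pattern quoted from the comment above, anchored at the start of the path.
ROOM_REQUEST_PATH = re.compile(
    r"^/_matrix/client/v3/rooms/.*/(send|join|invite|leave|ban|unban|kick)"
)


class ResyncResumer:
    """Resume pending partial-state re-syncs the first time a matching request arrives."""

    def __init__(self, list_partial_state_rooms, start_resync):
        # Both callbacks are assumptions for this sketch:
        #   list_partial_state_rooms() -> iterable of room IDs still partially joined
        #   start_resync(room_id)      -> begins (or attempts to begin) the re-sync
        self._list_partial_state_rooms = list_partial_state_rooms
        self._start_resync = start_resync
        self._resumed = False

    def on_request(self, path: str) -> bool:
        """Call for every incoming request; returns True if re-syncs were kicked off."""
        if self._resumed or not ROOM_REQUEST_PATH.match(path):
            return False
        # Only trigger once per process lifetime.
        self._resumed = True
        for room_id in self._list_partial_state_rooms():
            self._start_resync(room_id)
        return True
```

Note that this only covers the restart case: if no worker ever receives a matching request, nothing resumes the re-sync, which is the same limitation the comment raises.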

@matrixbot matrixbot changed the title Dummy issue Faster remote room joins: Support partial join re-syncing on workers other than the master Dec 21, 2023
@matrixbot matrixbot reopened this Dec 21, 2023