Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster joins: don't stall when a user joins during a fast join #14606

Merged
merged 23 commits into from
Feb 10, 2023

Conversation

MatMaul
Copy link
Contributor

@MatMaul MatMaul commented Dec 2, 2022

Fixes #12801.
Complement tests are at
matrix-org/complement#567.

Avoid blocking on full state when handling a subsequent join into a
partial state room.

Also always perform a remote join into partial state rooms, since we do
not know whether the joining user has been banned and want to avoid
leaking history to banned users.

Pull Request Checklist

@MatMaul MatMaul force-pushed the mv/join-during-fast-join branch 9 times, most recently from 78bdd06 to 0215ca7 Compare December 9, 2022 15:49
@MatMaul MatMaul force-pushed the mv/join-during-fast-join branch 2 times, most recently from c0087ef to 1e928ab Compare December 15, 2022 19:38
@MatMaul MatMaul force-pushed the mv/join-during-fast-join branch 3 times, most recently from 9672775 to 5dae42f Compare December 19, 2022 15:59
@MatMaul MatMaul changed the title WIP Join during fast join Faster joins: don't stale when an user joins during a fast join Dec 19, 2022
Signed-off-by: Mathieu Velten <mathieuv@matrix.org>
@MatMaul MatMaul force-pushed the mv/join-during-fast-join branch from 5dae42f to 384ccf2 Compare December 19, 2022 16:03
@MatMaul MatMaul marked this pull request as ready for review December 19, 2022 16:05
@MatMaul MatMaul requested a review from a team as a code owner December 19, 2022 16:05
@MatMaul
Copy link
Contributor Author

MatMaul commented Dec 20, 2022

We had a discussion about do_invite_join here.
TLDR:

  1. this method currently removes current room extremities as a first step. If the second send_join is not accepted by the resident server, we don't get any new extremity and we can't send messages anymore.
  2. un-partial-stating during this is not handled and should fail

For 1 we could store the membership event received from send_join (if succeeded) in the same transaction as deleting the old extremities.

@squahtx
Copy link
Contributor

squahtx commented Dec 20, 2022

One of the ways an un-partial-stating during join could manifest is in a PartialStateConflictError (note the todo).

try:
max_stream_id = (
await self._federation_event_handler.process_remote_join(
origin,
room_id,
auth_chain,
state,
event,
room_version_obj,
partial_state=ret.partial_state,
)
)
except PartialStateConflictError as e:
# The homeserver was already in the room and it is no longer partial
# stated. We ought to be doing a local join instead. Turn the error into
# a 429, as a hint to the client to try again.
# TODO(faster_joins): `_should_perform_remote_join` suggests that we may
# do a remote join for restricted rooms even if we have full state.
logger.error(
"Room %s was un-partial stated while processing remote join.",
room_id,
)
raise LimitExceededError(msg=e.msg, errcode=e.errcode, retry_after_ms=0)

We also need to be careful that we don't return a room to partial state, because that breaks a lot of assumptions.

@MatMaul
Copy link
Contributor Author

MatMaul commented Dec 29, 2022

  1. this method currently removes current room extremities as a first step. If the second send_join is not accepted by the resident server, we don't get any new extremity and we can't send messages anymore.

For 1 we could store the membership event received from send_join (if succeeded) in the same transaction as deleting the old extremities.

It's not possible since persisting events can potentially be done in another worker.

What I've done instead to mitigate the issue:

  • do not remove forward extremities if the room is partially joined (in SQL to avoid races)
  • move _clean_room_for_join call after send_join, so we avoid cleaning up if the join is failing anyway

@MatMaul
Copy link
Contributor Author

MatMaul commented Jan 3, 2023

I've added a way to fallback to local join if a room is unpartial stated during remote join.

However I am not sure this is really useful: retrying the remote join is a lot simpler, and I think that should be enough ?

Actually it's possible we will never hit the case: the event context is calculated by another resident server in a make_join remote call. It can contain local state (if a message was send to the resident server from the joining-already-joined server in the meantime), but unsure if/how this plays out with the local unpartial state, since it's coming from the remote resident server.

I think we should just retry the whole make/send_join in case of and that should be enough, in the same principle as 54c012c.

@DMRobertson DMRobertson changed the title Faster joins: don't stale when an user joins during a fast join Faster joins: don't stall when an user joins during a fast join Jan 17, 2023
@squahtx
Copy link
Contributor

squahtx commented Jan 20, 2023

I tried to address the issues with restricted join in #14882, but it seems to have ended up second join-proofing do_invite_join and replacing that part of this PR. Please ignore all my comments about that method if #14882 lands.

Sean Quah added 5 commits January 27, 2023 15:16
* the filter for partial state rooms in `_clean_room_for_join_txn` has
  been replaced with a host membership check in `_do_invite_join`.
* `_do_invite_join` can no longer raise a `PartialStateConflictError`.
@squahtx
Copy link
Contributor

squahtx commented Jan 27, 2023

Now that #14882 has landed, I've gone and resolved the merge conflicts.

All the PR comments plus changes for do_invite_join have gone into #14882, so that method is now compatible with second joins.
Also do_invite_join will no longer raise a PartialStateConflictError because there is locking now.

changelog.d/14606.misc Outdated Show resolved Hide resolved
synapse/handlers/event_auth.py Show resolved Hide resolved
synapse/handlers/room_member.py Outdated Show resolved Hide resolved
synapse/handlers/room_member.py Outdated Show resolved Hide resolved
@@ -881,11 +902,11 @@ async def update_membership_locked(
if action == "kick":
raise AuthError(403, "The target user is not in the room")

is_host_in_room = await self._is_host_in_room(state_before_join)
is_host_in_room = await self._is_host_in_room(partial_state_before_join)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think knocking (line +1052) is going to be broken in that we can end up sending a knock event from a user with a ban we don't know about.
We can't exactly await full state, since we're holding the room membership lock. Any thoughts on what we should do in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can end up sending a knock event from a user with a ban we don't know about.

Isn't it possible for us to be in this situation already, if we try to knock on a room that this HS isn't joined to?

I'm tempted to ignore this case and let fully-joined homeservers in the room reject the event. (Partially-joined HSes may accept the knock if they too have not seen the ban, but the resync process should clear that up?)

Copy link
Contributor

@squahtx squahtx Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit different. Previously, we would do a remote /make_knock+/send_knock and the remote homeserver would tell us to go away if we were banned.
Now, since we are in the room but do not have full state, we can add a local knock into the DAG that would otherwise be rejected and cite it as a prev/auth event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Would it be worth requiring knocks to go over federation during the resync? (Not necessarily in this PR)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not wholly convinced the remote knock code is suitable. It persists the knock event as an outlier. I'm not sure what happens if we then receive the knock event over federation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to land this as-is and kick any knocking cans down the road.

@squahtx squahtx requested a review from DMRobertson February 6, 2023 10:15
@DMRobertson DMRobertson changed the title Faster joins: don't stall when an user joins during a fast join Faster joins: don't stall when a user joins during a fast join Feb 9, 2023
Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this LGTM and we should land it, provided that we have reasonable complement coverage to land with it.

synapse/handlers/federation_event.py Outdated Show resolved Hide resolved
synapse/handlers/federation_event.py Show resolved Hide resolved
@@ -881,11 +902,11 @@ async def update_membership_locked(
if action == "kick":
raise AuthError(403, "The target user is not in the room")

is_host_in_room = await self._is_host_in_room(state_before_join)
is_host_in_room = await self._is_host_in_room(partial_state_before_join)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can end up sending a knock event from a user with a ban we don't know about.

Isn't it possible for us to be in this situation already, if we try to knock on a room that this HS isn't joined to?

I'm tempted to ignore this case and let fully-joined homeservers in the room reject the event. (Partially-joined HSes may accept the knock if they too have not seen the ban, but the resync process should clear that up?)

Comment on lines +1152 to +1153
if is_partial_state_room and previous_membership != Membership.JOIN:
return True, remote_room_hosts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, checking the previous membership ensures that profile changes don't trigger another remote join. (Maybe worth writing a complement test that you can change your profile during the resync? Don't think that should block this though.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test as part of matrix-org/complement#567. Synapse develop currently blocks and fails the test.

@DMRobertson
Copy link
Contributor

Also: is the description of this PR still accurate to what it does, or should we update it?

DMRobertson referenced this pull request in matrix-org/complement Feb 9, 2023
* Theoretical test for leave after partial join

* Check that leaves during resync work today

* Fixes to tests
@squahtx
Copy link
Contributor

squahtx commented Feb 10, 2023

Also: is the description of this PR still accurate to what it does, or should we update it?

Er, let's update it.

@squahtx squahtx requested a review from DMRobertson February 10, 2023 15:48
@squahtx squahtx merged commit 6cddf24 into develop Feb 10, 2023
@squahtx squahtx deleted the mv/join-during-fast-join branch February 10, 2023 23:31
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Mar 5, 2023
Synapse 1.78.0 (2023-02-28)
===========================

Bugfixes
--------

- Fix a bug introduced in Synapse 1.76 where 5s delays would occasionally occur in deployments using workers. ([\#15150](matrix-org/synapse#15150))


Synapse 1.78.0rc1 (2023-02-21)
==============================

Features
--------

- Implement the experimental `exact_event_match` push rule condition from [MSC3758](matrix-org/matrix-spec-proposals#3758). ([\#14964](matrix-org/synapse#14964))
- Add account data to the command line [user data export tool](https://matrix-org.github.io/synapse/v1.78/usage/administration/admin_faq.html#how-can-i-export-user-data). ([\#14969](matrix-org/synapse#14969))
- Implement [MSC3873](matrix-org/matrix-spec-proposals#3873) to disambiguate push rule keys with dots in them. ([\#15004](matrix-org/synapse#15004))
- Allow Synapse to use a specific Redis [logical database](https://redis.io/commands/select/) in worker-mode deployments. ([\#15034](matrix-org/synapse#15034))
- Tag opentracing spans for federation requests with the name of the worker serving the request. ([\#15042](matrix-org/synapse#15042))
- Implement the experimental `exact_event_property_contains` push rule condition from [MSC3966](matrix-org/matrix-spec-proposals#3966). ([\#15045](matrix-org/synapse#15045))
- Remove spurious `dont_notify` action from the defaults for the `.m.rule.reaction` pushrule. ([\#15073](matrix-org/synapse#15073))
- Update the error code returned when user sends a duplicate annotation. ([\#15075](matrix-org/synapse#15075))


Bugfixes
--------

- Prevent clients from reporting nonexistent events. ([\#13779](matrix-org/synapse#13779))
- Return spec-compliant JSON errors when unknown endpoints are requested. ([\#14605](matrix-org/synapse#14605))
- Fix a long-standing bug where the room aliases returned could be corrupted. ([\#15038](matrix-org/synapse#15038))
- Fix a bug introduced in Synapse 1.76.0 where partially-joined rooms could not be deleted using the [purge room API](https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#delete-room-api). ([\#15068](matrix-org/synapse#15068))
- Fix a long-standing bug where federated joins would fail if the first server in the list of servers to try is not in the room. ([\#15074](matrix-org/synapse#15074))
- Fix a bug introduced in Synapse v1.74.0 where searching with colons when using ICU for search term tokenisation would fail with an error. ([\#15079](matrix-org/synapse#15079))
- Reduce the likelihood of a rare race condition where rejoining a restricted room over federation would fail. ([\#15080](matrix-org/synapse#15080))
- Fix a bug introduced in Synapse 1.76 where workers would fail to start if the `health` listener was configured. ([\#15096](matrix-org/synapse#15096))
- Fix a bug introduced in Synapse 1.75 where the [portdb script](https://matrix-org.github.io/synapse/release-v1.78/postgres.html#porting-from-sqlite) would fail to run after a room had been faster-joined. ([\#15108](matrix-org/synapse#15108))


Improved Documentation
----------------------

- Document how to start Synapse with Poetry. Contributed by @thezaidbintariq. ([\#14892](matrix-org/synapse#14892), [\#15022](matrix-org/synapse#15022))
- Update delegation documentation to clarify that SRV DNS delegation does not eliminate all needs to serve files from .well-known locations. Contributed by @williamkray. ([\#14959](matrix-org/synapse#14959))
- Fix a mistake in registration_shared_secret_path docs. ([\#15078](matrix-org/synapse#15078))
- Refer to a more recent blog post on the [Database Maintenance Tools](https://matrix-org.github.io/synapse/latest/usage/administration/database_maintenance_tools.html) page. Contributed by @jahway603. ([\#15083](matrix-org/synapse#15083))


Internal Changes
----------------

- Re-type hint some collections as read-only. ([\#13755](matrix-org/synapse#13755))
- Faster joins: don't stall when another user joins during a partial-state room resync. ([\#14606](matrix-org/synapse#14606))
- Add a class `UnpersistedEventContext` to allow for the batching up of storing state groups. ([\#14675](matrix-org/synapse#14675))
- Add a check to ensure that locked dependencies have source distributions available. ([\#14742](matrix-org/synapse#14742))
- Tweak comment on `_is_local_room_accessible` as part of room visibility in `/hierarchy` to clarify the condition for a room being visible. ([\#14834](matrix-org/synapse#14834))
- Prevent `WARNING: there is already a transaction in progress` lines appearing in PostgreSQL's logs on some occasions. ([\#14840](matrix-org/synapse#14840))
- Use `StrCollection` to avoid potential bugs with `Collection[str]`. ([\#14929](matrix-org/synapse#14929))
- Improve performance of `/sync` in a few situations. ([\#14973](matrix-org/synapse#14973))
- Limit concurrent event creation for a room to avoid state resolution when sending bursts of events to a local room. ([\#14977](matrix-org/synapse#14977))
- Skip calculating unread push actions in /sync when enable_push is false. ([\#14980](matrix-org/synapse#14980))
- Add a schema dump symlinks inside `contrib`, to make it easier for IDEs to interrogate Synapse's database schema. ([\#14982](matrix-org/synapse#14982))
- Improve type hints. ([\#15008](matrix-org/synapse#15008), [\#15026](matrix-org/synapse#15026), [\#15027](matrix-org/synapse#15027), [\#15028](matrix-org/synapse#15028), [\#15031](matrix-org/synapse#15031), [\#15035](matrix-org/synapse#15035), [\#15052](matrix-org/synapse#15052), [\#15072](matrix-org/synapse#15072), [\#15084](matrix-org/synapse#15084))
- Update [MSC3952](matrix-org/matrix-spec-proposals#3952) support based on changes to the MSC. ([\#15037](matrix-org/synapse#15037))
- Avoid mutating a cached value in `get_user_devices_from_cache`. ([\#15040](matrix-org/synapse#15040))
- Fix a rare exception in logs on start up. ([\#15041](matrix-org/synapse#15041))
- Update pyo3-log to v0.8.1. ([\#15043](matrix-org/synapse#15043))
- Avoid mutating cached values in `_generate_sync_entry_for_account_data`. ([\#15047](matrix-org/synapse#15047))
- Refactor arguments of `try_unbind_threepid` and `_try_unbind_threepid_with_id_server` to not use dictionaries. ([\#15053](matrix-org/synapse#15053))
- Merge debug logging from the hotfixes branch. ([\#15054](matrix-org/synapse#15054))
- Faster joins: omit device list updates originating from partial state rooms in /sync responses without lazy loading of members enabled. ([\#15069](matrix-org/synapse#15069))
- Fix clashing database transaction name. ([\#15070](matrix-org/synapse#15070))
- Upper-bound frozendict dependency. This works around us being unable to test installing our wheels against Python 3.11 in CI. ([\#15114](matrix-org/synapse#15114))
- Tweak logging for when a worker waits for its view of a replication stream to catch up. ([\#15120](matrix-org/synapse#15120))

<details><summary>Locked dependency updates</summary>

- Bump bleach from 5.0.1 to 6.0.0. ([\#15059](matrix-org/synapse#15059))
- Bump cryptography from 38.0.4 to 39.0.1. ([\#15020](matrix-org/synapse#15020))
- Bump ruff version from 0.0.230 to 0.0.237. ([\#15033](matrix-org/synapse#15033))
- Bump dtolnay/rust-toolchain from 9cd00a88a73addc8617065438eff914dd08d0955 to 25dc93b901a87e864900a8aec6c12e9aa794c0c3. ([\#15060](matrix-org/synapse#15060))
- Bump systemd-python from 234 to 235. ([\#15061](matrix-org/synapse#15061))
- Bump serde_json from 1.0.92 to 1.0.93. ([\#15062](matrix-org/synapse#15062))
- Bump types-requests from 2.28.11.8 to 2.28.11.12. ([\#15063](matrix-org/synapse#15063))
- Bump types-pillow from 9.4.0.5 to 9.4.0.10. ([\#15064](matrix-org/synapse#15064))
- Bump sentry-sdk from 1.13.0 to 1.15.0. ([\#15065](matrix-org/synapse#15065))
- Bump types-jsonschema from 4.17.0.3 to 4.17.0.5. ([\#15099](matrix-org/synapse#15099))
- Bump types-bleach from 5.0.3.1 to 6.0.0.0. ([\#15100](matrix-org/synapse#15100))
- Bump dtolnay/rust-toolchain from 25dc93b901a87e864900a8aec6c12e9aa794c0c3 to e12eda571dc9a5ee5d58eecf4738ec291c66f295. ([\#15101](matrix-org/synapse#15101))
- Bump dawidd6/action-download-artifact from 2.24.3 to 2.25.0. ([\#15102](matrix-org/synapse#15102))
- Bump types-pillow from 9.4.0.10 to 9.4.0.13. ([\#15104](matrix-org/synapse#15104))
- Bump types-setuptools from 67.1.0.0 to 67.3.0.1. ([\#15105](matrix-org/synapse#15105))


</details>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Faster joins: Handle a second join while syncing state for the first one
4 participants