Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Test cancellation at every await during request handling #12674

Merged
merged 26 commits into from
Jun 7, 2022

Conversation

squahtx
Copy link
Contributor

@squahtx squahtx commented May 9, 2022

I'm throwing up this PR for discussion rather than to get it merged as-is.
The code's pretty magical and I'm not sure if it's something we want to merge.
EDIT: Consensus seems to be that we're happy to have integration tests, even if they're magical.
I'm putting this up for a real review now.

The interesting commit is "Add _test_cancellation_at_every_await helper method".

Motivation

Once we start enabling cancellation for various endpoints, it would be desirable to have some sort of integration testing for it as part of CI.

Improper handling of cancellation can manifest as anything from logging contexts going missing to stuck requests to certain endpoints. As it's uncommon for clients to disconnect before a request finishes processing, any bugs won't be immediately obvious after deployment to production and will bite at random times.

Testing Approach

It turns out that when you await a Deferred, Python calls Deferred.__await__().__next__(). Deferred.__await__() just returns self.

And __next__ controls what happens to the await:

We can patch Deferred.__next__ to intercept awaits and force them to block. Which gives us time to throw a cancellation into the request processing!

With that power in hand, we can construct a test that will run a request up to the first await (on a Deferred) and cancel the request. And then we can re-run the request, allow it to get a little further, then cancel it again once it hits the next await, and so on. ie. we can test a cancellation at every await on a Deferred along a single code path.

We identify when we see a new awaits by looking at the stack. If it's a stack we've seen before, we let the await behave as normal.

This PR implements this approach.

Logging

The test process is very magical and it's not clear what it's actually doing for a given endpoint. So I've added some debug logging at INFO level that will print a call tree for the endpoint under test, showing every await the test sees. There's an example below for the GET /rooms/$room_id/members endpoint. PartialStateEventsTracker.await_full_state shows up, as you might hope, and gets cancel-tested.

Logging from a run through `GET /rooms/$room_id/members`

Only the new parts of the call tree are printed for each request.

request-0  Running _test_cancellation_at_every_await for test_get_member_list_cancellation...
request-1  JsonResource wrapped_async_request_handler:275 (/home/squah/repos/synapse/synapse/http/server.py:275)
request-1    preserve_fn.<locals>.g:353 (/home/squah/repos/synapse/synapse/http/server.py:353)
request-1      JsonResource._async_render:559 (/home/squah/repos/synapse/synapse/http/server.py:559)
request-1        RoomMemberListRestServlet.on_GET:511 (/home/squah/repos/synapse/synapse/rest/client/room.py:511)
request-1          MessageHandler.get_state_events:225 (/home/squah/repos/synapse/synapse/handlers/message.py:225)
request-1            Auth.check_user_in_room_or_world_readable:693 (/home/squah/repos/synapse/synapse/api/auth.py:693)
request-1              Auth.check_user_in_room:109 (/home/squah/repos/synapse/synapse/api/auth.py:109)
request-1                > *force-blocked await* at StateHandler.get_current_state:175 (/home/squah/repos/synapse/synapse/state/__init__.py:175)
request-1  --- disconnected ---
request-2                StateHandler.get_current_state:179 (/home/squah/repos/synapse/synapse/state/__init__.py:179)
request-2                  StateHandler measured_func:106 (/home/squah/repos/synapse/synapse/util/metrics.py:106)
request-2                    StateHandler.resolve_state_groups_for_events:424 (/home/squah/repos/synapse/synapse/state/__init__.py:424)
request-2                      StateGroupStorage.get_state_groups_ids:587 (/home/squah/repos/synapse/synapse/storage/state.py:587)
request-2                        StateGroupStorage._get_state_group_for_events:806 (/home/squah/repos/synapse/synapse/storage/state.py:806)
request-2                          PartialStateEventsTracker.await_full_state:70 (/home/squah/repos/synapse/synapse/storage/util/partial_state_events_tracker.py:70)
request-2                            DeferredCacheListDescriptor wrapped:511 (/home/squah/repos/synapse/synapse/util/caches/descriptors.py:511)
request-2                              g:753 (/home/squah/repos/synapse/synapse/logging/context.py:753)
request-2                                run_in_background:814 (/home/squah/repos/synapse/synapse/logging/context.py:814)
request-2                                  EventsWorkerStore.get_partial_state_events:2001 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:2001)
request-2                                    DatabasePool.simple_select_many_batch:1719 (/home/squah/repos/synapse/synapse/storage/database.py:1719)
request-2                                      DatabasePool.runInteraction:796 (/home/squah/repos/synapse/synapse/storage/database.py:796)
request-2                                        delay_cancellation:750 (/home/squah/repos/synapse/synapse/util/async_helpers.py:750)
request-2                                          DatabasePool _runInteraction:768 (/home/squah/repos/synapse/synapse/storage/database.py:768)
request-2                                            > *blocking await* at DatabasePool.runWithConnection:891 (/home/squah/repos/synapse/synapse/storage/database.py:891)
request-2                                      > *blocking await* at DatabasePool.runInteraction:796 (/home/squah/repos/synapse/synapse/storage/database.py:796)
request-2                          > *blocking await* at PartialStateEventsTracker.await_full_state:70 (/home/squah/repos/synapse/synapse/storage/util/partial_state_events_tracker.py:70)
request-2  --- disconnected ---
request-3                        > *force-blocked await* at StateGroupStorage._get_state_group_for_events:808 (/home/squah/repos/synapse/synapse/storage/state.py:808)
request-3  --- disconnected ---
request-4                    StateHandler.resolve_state_groups_for_events:433 (/home/squah/repos/synapse/synapse/state/__init__.py:433)
request-4                      StateGroupStorage.get_state_group_delta:565 (/home/squah/repos/synapse/synapse/storage/state.py:565)
request-4                        DeferredCacheDescriptor _wrapped:351 (/home/squah/repos/synapse/synapse/util/caches/descriptors.py:351)
request-4                          g:753 (/home/squah/repos/synapse/synapse/logging/context.py:753)
request-4                            run_in_background:814 (/home/squah/repos/synapse/synapse/logging/context.py:814)
request-4                              StateGroupDataStore.get_state_group_delta:155 (/home/squah/repos/synapse/synapse/storage/databases/state/store.py:155)
request-4                                DatabasePool.runInteraction:796 (/home/squah/repos/synapse/synapse/storage/database.py:796)
request-4                                  delay_cancellation:750 (/home/squah/repos/synapse/synapse/util/async_helpers.py:750)
request-4                                    DatabasePool _runInteraction:768 (/home/squah/repos/synapse/synapse/storage/database.py:768)
request-4                                      > *blocking await* at DatabasePool.runWithConnection:891 (/home/squah/repos/synapse/synapse/storage/database.py:891)
request-4                                > *blocking await* at DatabasePool.runInteraction:796 (/home/squah/repos/synapse/synapse/storage/database.py:796)
request-4                      > *blocking await* at StateGroupStorage.get_state_group_delta:565 (/home/squah/repos/synapse/synapse/storage/state.py:565)
request-4  --- disconnected ---
request-5                StateHandler.get_current_state:186 (/home/squah/repos/synapse/synapse/state/__init__.py:186)
request-5                  EventsWorkerStore.get_event:388 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:388)
request-5                    EventsWorkerStore.get_events_as_list:480 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:480)
request-5                      EventsWorkerStore._get_events_from_cache_or_db:690 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:690)
request-5                        delay_cancellation:750 (/home/squah/repos/synapse/synapse/util/async_helpers.py:750)
request-5                          DataStore get_missing_events_from_db:670 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:670)
request-5                            EventsWorkerStore._get_events_from_db:985 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:985)
request-5                              EventsWorkerStore._enqueue_events:1135 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:1135)
request-5                                EventsWorkerStore._maybe_start_fetch_thread:817 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:817)
request-5                                  run_as_background_process:256 (/home/squah/repos/synapse/synapse/metrics/background_process_metrics.py:256)
request-5                                    run:243 (/home/squah/repos/synapse/synapse/metrics/background_process_metrics.py:243)
request-5                                      EventsWorkerStore._fetch_thread:823 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:823)
request-5                                        > *blocking await* at DatabasePool.runWithConnection:891 (/home/squah/repos/synapse/synapse/storage/database.py:891)
request-5                              > *blocking await* at EventsWorkerStore._enqueue_events:1139 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:1139)
request-5                      > *blocking await* at EventsWorkerStore._get_events_from_cache_or_db:690 (/home/squah/repos/synapse/synapse/storage/databases/main/events_worker.py:690)
request-5  --- disconnected ---
request-6          MessageHandler.get_state_events:230 (/home/squah/repos/synapse/synapse/handlers/message.py:230)
request-6            StateGroupWorkerStore.get_filtered_current_state_ids:268 (/home/squah/repos/synapse/synapse/storage/databases/main/state.py:268)
request-6              DatabasePool.runInteraction:796 (/home/squah/repos/synapse/synapse/storage/database.py:796)
request-6                delay_cancellation:750 (/home/squah/repos/synapse/synapse/util/async_helpers.py:750)
request-6                  DatabasePool _runInteraction:768 (/home/squah/repos/synapse/synapse/storage/database.py:768)
request-6                    > *blocking await* at DatabasePool.runWithConnection:891 (/home/squah/repos/synapse/synapse/storage/database.py:891)
request-6              > *blocking await* at DatabasePool.runInteraction:796 (/home/squah/repos/synapse/synapse/storage/database.py:796)
request-6  --- disconnected ---
request-7    preserve_fn.<locals>.g:357 (/home/squah/repos/synapse/synapse/http/server.py:357)
request-7      DirectServeJsonResource._send_response:426 (/home/squah/repos/synapse/synapse/http/server.py:426)
request-7        CallableMixin.__call__:1081 (/usr/lib/python3.8/unittest/mock.py:1081)
request-7          CallableMixin._mock_call:1085 (/usr/lib/python3.8/unittest/mock.py:1085)
request-7            CallableMixin._execute_mock_call:1155 (/usr/lib/python3.8/unittest/mock.py:1155)
request-7              respond_with_json:798 (/home/squah/repos/synapse/synapse/http/server.py:798)
request-7                run_in_background:814 (/home/squah/repos/synapse/synapse/logging/context.py:814)
request-7                  > *blocking await* at _async_write_json_to_request_in_thread:871 (/home/squah/repos/synapse/synapse/http/server.py:871)

Limitations

  • We can only test a single code path for a request, usually the happy path.
  • We can't easily test code paths with contention, for example, in ReadWriteLocks.

Pros of merging something like this

Cons of merging something like this

  • The code is complex and there is an unusual amount of magic involved, such as patching Deferred.__next__ and inspecting the stack.
  • It still misses bugs, such as when I revert Handle cancellation in DatabasePool.runInteraction() #12199.
  • It's not clear what's being tested, unless you turn on logging and manually inspect the test logs.

Alternatives

  • We could manually construct tests for each endpoint? They'll have less coverage than the approach in this PR.
  • We could forgo integration tests entirely?
  • We could patch defer.inlineCallbacks instead, like we do for _check_yield_points? We won't be able to force awaits to block if we go for this.

Sean Quah added 5 commits May 13, 2022 12:38
Signed-off-by: Sean Quah <seanq@element.io>
Signed-off-by: Sean Quah <seanq@element.io>
Signed-off-by: Sean Quah <seanq@element.io>
Signed-off-by: Sean Quah <seanq@element.io>
@squahtx squahtx force-pushed the squah/test_cancellation_at_every_await branch from bc61e33 to b7aa039 Compare May 13, 2022 11:40
@squahtx squahtx changed the base branch from squah/add_endpoint_cancellation_flag to develop May 13, 2022 11:40
@erikjohnston
Copy link
Member

Wow, this is quite magical. Broadly, I think if this is the least-worst option for testing this stuff then we should merge it. In my mind awful testing strategies are better than no testing strategies? 🤷

@reivilibre
Copy link
Contributor

It is quite magical, but I can still make sense of it so it's not too bad. For a testing strategy I think this is reasonable and I wouldn't turn it down if the alternative is weaker or no coverage.
With this kind of helper, it's important to consider what the experience will be like for someone who ends up breaking one of these tests and has to figure out what's going on. I think the logging output is very helpful here so I don't think I have much to complain about.

I think it'd be better knowing about these problems rather than covering them up by not having such a testing scheme, so I'd say 'merge it'. If we really regret it for some reason later, we can just rip them out — they're only tests after all.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As others have said: it's magical, but not awful, and I don't think you're digging into anything that is likely to change in the near future.

It's definitely a net win.

@squahtx squahtx marked this pull request as ready for review May 27, 2022 15:27
@squahtx squahtx requested a review from a team May 27, 2022 15:27
@squahtx
Copy link
Contributor Author

squahtx commented May 27, 2022

Thanks for your thoughts everyone! Sounds like we think the benefits outweigh the drawbacks, so I'll put this up for real review now.

@richvdh richvdh self-assigned this May 30, 2022
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a start on looking at this and made some suggestions. Could you have a look at what I've said so far, and then I'll take a closer look?

@@ -98,3 +111,413 @@ def _test_disconnect(
self.assertEqual(code, expected_code)
self.assertEqual(request.code, expected_code)
self.assertEqual(body, expected_body)

@logcontext_clean
def _test_cancellation_at_every_await(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd avoid the underscore here, if this is meant to be used by test methods - otherwise it's hard to see what is meant to be the usable api, and what are just the helper methods like _log_await_stack.

In fact, as far as I can tell the only reason this should be in a mixin rather than a completely separate module is so that you can call self.assertEqual and friends. All assertEqual does is raise an AssertionError if the condition fails, so you could do that yourself without having to inherit from TestCase.

Comment on lines 190 to 192
with mock.patch(
"synapse.http.server.respond_with_json", wraps=respond_with_json
) as respond_mock:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to patch this? Can we not just check the status of channel (in particular, check channel.is_finished instead of respond_mock.called)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no, disconnecting detaches the channel from the Request and so it never gets the response.

message = f"{' ' * i}> *{note}* at {frame}"
self._log_for_request(request_number, message)

def _format_stack_frame(self, frame_info: inspect.FrameInfo) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this and the other methods below can be turned into regular functions. There's no need for it all to be part of the same object.

Comment on lines 279 to 284
) -> Tuple[
Callable[["Deferred[T]"], "Deferred[T]"],
Callable[[], None],
Callable[[], int],
Callable[[], bool],
]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest making this whole thing an object, and turn create_deferred___next___patch into its constructor.

self.assertFalse(
respond_mock.called,
"Request finished before we could disconnect - "
"was `await_result=False` passed to `make_request`?",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood this the first time I read it, and thought that await_result=False was a bad thing to do.

Suggested change
"was `await_result=False` passed to `make_request`?",
"ensure `await_result=False` is passed to `make_request`.",

while not respond_mock.called and not has_seen_new_await():
previous_awaits_seen = get_awaits_seen()

reactor.pump([0.0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd inline this (and the other calls to reactor.pump) for clarity:

Suggested change
reactor.pump([0.0])
reactor.advance(0.0)

Comment on lines 239 to 240
self.assertEqual(channel.code, expected_code)
return channel.json_body
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just return channel and leave these up to the caller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly? There may be an unfortunate interaction where we block the awaits in _async_write_json_to_request_in_thread.
I'll have to test it.

def _test_cancellation_at_every_await(
self,
reactor: ThreadedMemoryReactorClock,
make_request: Callable[[], FakeChannel],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than passing this in, can we just take a *args, **kwargs to pass through to make_request ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that we might want to test federation requests in future, which have their own make_signed_federation_request function

def make_signed_federation_request(

though there is nothing stopping us from defining a make_signed_federation_request_with_cancellation_test variant.

@@ -98,3 +111,413 @@ def _test_disconnect(
self.assertEqual(code, expected_code)
self.assertEqual(request.code, expected_code)
self.assertEqual(body, expected_body)

@logcontext_clean
def _test_cancellation_at_every_await(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this should be called make_request_with_cancellation_test or something


self.assertEqual(channel.code, expected_code)
return channel.json_body
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can drop this else, since there's a return above.

@squahtx squahtx requested a review from richvdh May 31, 2022 18:36
@squahtx
Copy link
Contributor Author

squahtx commented May 31, 2022

Thanks for the feedback. I think I've addressed all of it now. Mind taking another look?

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks generally good. A few comments.

tests/http/server/_base.py Outdated Show resolved Hide resolved
tests/http/server/_base.py Outdated Show resolved Hide resolved
tests/http/server/_base.py Outdated Show resolved Hide resolved
tests/http/server/_base.py Outdated Show resolved Hide resolved
tests/http/server/_base.py Outdated Show resolved Hide resolved
tests/http/server/_base.py Outdated Show resolved Hide resolved
tests/http/server/_base.py Outdated Show resolved Hide resolved
@squahtx squahtx requested a review from richvdh June 1, 2022 15:17
@squahtx squahtx marked this pull request as draft June 1, 2022 15:40
@squahtx
Copy link
Contributor Author

squahtx commented Jun 1, 2022

The tests seem to have broken after #12872

@squahtx
Copy link
Contributor Author

squahtx commented Jun 1, 2022

The tests seem to have broken after #12872

It looks like a bug in the test.

Before #12872, when running RoomsMemberListTestCase.test_get_member_list_cancellation, we saw:

...
request-5  --- disconnected ---
request-6          MessageHandler.get_state_events:220 (/home/squah/repos/synapse/synapse/handlers/message.py:220)
request-6            StateGroupWorkerStore.get_filtered_current_state_ids:333 (/home/squah/repos/synapse/synapse/storage/databases/main/state.py:333)
request-6              DatabasePool.runInteraction:834 (/home/squah/repos/synapse/synapse/storage/database.py:834)
request-6                delay_cancellation:762 (/home/squah/repos/synapse/synapse/util/async_helpers.py:762)
request-6                  DatabasePool _runInteraction:806 (/home/squah/repos/synapse/synapse/storage/database.py:806)
request-6                    > *blocking await* at DatabasePool.runWithConnection:929 (/home/squah/repos/synapse/synapse/storage/database.py:929)
request-6              > *blocking await* at DatabasePool.runInteraction:834 (/home/squah/repos/synapse/synapse/storage/database.py:834)
request-6  --- disconnected ---
request-7    preserve_fn.<locals>.g:370 (/home/squah/repos/synapse/synapse/http/server.py:370)
request-7      DirectServeJsonResource._send_response:439 (/home/squah/repos/synapse/synapse/http/server.py:439)
request-7        CallableMixin.__call__:1081 (/usr/lib/python3.8/unittest/mock.py:1081)
request-7          CallableMixin._mock_call:1085 (/usr/lib/python3.8/unittest/mock.py:1085)
request-7            CallableMixin._execute_mock_call:1155 (/usr/lib/python3.8/unittest/mock.py:1155)
request-7              respond_with_json:811 (/home/squah/repos/synapse/synapse/http/server.py:811)
request-7                run_in_background:814 (/home/squah/repos/synapse/synapse/logging/context.py:814)
request-7                  > *blocking await* at _async_write_json_to_request_in_thread:884 (/home/squah/repos/synapse/synapse/http/server.py:884)

After #12872, we see:

...
request-5  --- disconnected ---
request-6          MessageHandler.get_state_events:220 (/home/squah/repos/synapse/synapse/handlers/message.py:220)
request-6            StateStorageController.get_current_state_ids:392 (/home/squah/repos/synapse/synapse/storage/controllers/state.py:392)
request-6              PartialCurrentStateTracker.await_full_state:162 (/home/squah/repos/synapse/synapse/storage/util/partial_state_events_tracker.py:162)
request-6                RoomWorkerStore.is_partial_state_room:1150 (/home/squah/repos/synapse/synapse/storage/databases/main/room.py:1150)
request-6                  DatabasePool.simple_select_one_onecol:1566 (/home/squah/repos/synapse/synapse/storage/database.py:1566)
request-6                    DatabasePool.runInteraction:834 (/home/squah/repos/synapse/synapse/storage/database.py:834)
request-6                      delay_cancellation:762 (/home/squah/repos/synapse/synapse/util/async_helpers.py:762)
request-6                        DatabasePool _runInteraction:806 (/home/squah/repos/synapse/synapse/storage/database.py:806)
request-6                          > *blocking await* at DatabasePool.runWithConnection:929 (/home/squah/repos/synapse/synapse/storage/database.py:929)
request-6                    > *blocking await* at DatabasePool.runInteraction:834 (/home/squah/repos/synapse/synapse/storage/database.py:834)
request-6  --- disconnected ---
request-7  <lambda>:414 (/home/squah/repos/synapse/tests/server.py:414)
request-7    DatabasePool _runInteraction:806 (/home/squah/repos/synapse/synapse/storage/database.py:806)
request-7      > *force-blocked await* at DatabasePool.runWithConnection:929 (/home/squah/repos/synapse/synapse/storage/database.py:929)
request-7  --- disconnected ---

and the test got stuck here, since the await we've forced to block sits within a delay_cancellation.
synapse/tests/server.py:414 is the callFromThread trampoline that gets queued after a database transaction completes. What's strange is that it does get called in both the before and after cases, but:

  • runs Deferred.send in the before case
  • and runs our Deferred.__next__ patch in the after case

for reasons I cannot explain.

I've added code to unblock force-blocked awaits while waiting for a cancellation response.

@richvdh
Copy link
Member

richvdh commented Jun 1, 2022

  • runs Deferred.send in the before case

  • and runs our Deferred.__next__ patch in the after case

... which implies that in the before case, this is happening after our patch has been removed, implying some sort of timing change? Do you know which callback callFromThread is trying to run?

I'm not sure it matters too much, if you have something that works.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@squahtx
Copy link
Contributor Author

squahtx commented Jun 1, 2022

Do you know which callback callFromThread is trying to run?

It's running Deferred.callback, to resume the _runInteraction > runWithConnection coroutine.

... which implies that in the before case, this is happening after our patch has been removed, implying some sort of timing change?

The patch is definitely still installed in both cases. If I add the following debug to callFromThread:

`callFromThread` debugging hack
--- a/tests/server.py
+++ b/tests/server.py
@@ -411,7 +411,18 @@ class ThreadedMemoryReactorClock(MemoryReactorClock):
         """
         Make the callback fire in the next reactor iteration.
         """
-        cb = lambda: callback(*args, **kwargs)
+        import types
+        def cb():
+            if isinstance(callback, types.MethodType) and callback.__func__ == Deferred.callback:
+                d = callback.__self__
+                gotResult_callbacks = d.callbacks[1]  # `d.callbacks[0]` is `_set_context_cb`
+                gotResult_callback, gotResult_errback = gotResult_callbacks
+                gotResult, _args, _kwargs = gotResult_callback
+                coroutine = gotResult.__closure__[1].cell_contents
+                import sys
+                sys.stdout.write(f"    callFromThread: resuming coroutine {coroutine}\n")
+                sys.stdout.write(f"        d.__next__ = {d.__next__.__qualname__}\n")
+            callback(*args, **kwargs)
         # it's not safe to call callLater() here, so we append the callback to a

I get:

Before #12872:

    callFromThread: resuming coroutine <coroutine object DatabasePool.runInteraction.<locals>._runInteraction at 0x7f4fab21edc0>
        d.__next__ = Deferred__next__Patch.patch.<locals>.Deferred___next__

After #12872:

    callFromThread: resuming coroutine <coroutine object DatabasePool.runInteraction.<locals>._runInteraction at 0x7f0f1b3c7240>
        d.__next__ = Deferred__next__Patch.patch.<locals>.Deferred___next__

More logs with extra debugging:
Before #12872:

request-6  --- disconnected ---
    callFromThread: resuming coroutine <coroutine object DatabasePool.runInteraction.<locals>._runInteraction at 0x7f4fab647d40>
        d.__next__ = Deferred.send
    Deferred.send() got called at /home/squah/repos/synapse/synapse/storage/database.py:929
request-7  --- make_request() ---
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/state/__init__.py:176
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/util/partial_state_events_tracker.py:70
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/state.py:868
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/state.py:608
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/database.py:929
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/database.py:834
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    callFromThread: resuming coroutine <coroutine object DatabasePool.runInteraction.<locals>._runInteraction at 0x7f4fab21edc0>
        d.__next__ = Deferred__next__Patch.patch.<locals>.Deferred___next__
    Deferred.send() got called at /home/squah/repos/synapse/synapse/storage/database.py:929
    Deferred.send() got called at /home/squah/repos/synapse/synapse/storage/database.py:834
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/http/server.py:884
request-7    preserve_fn.<locals>.g:370 (/home/squah/repos/synapse/synapse/http/server.py:370)
request-7      DirectServeJsonResource._send_response:439 (/home/squah/repos/synapse/synapse/http/server.py:439)
request-7        CallableMixin.__call__:1081 (/usr/lib/python3.8/unittest/mock.py:1081)
request-7          CallableMixin._mock_call:1085 (/usr/lib/python3.8/unittest/mock.py:1085)
request-7            CallableMixin._execute_mock_call:1155 (/usr/lib/python3.8/unittest/mock.py:1155)
request-7              respond_with_json:811 (/home/squah/repos/synapse/synapse/http/server.py:811)
request-7                run_in_background:814 (/home/squah/repos/synapse/synapse/logging/context.py:814)
request-7                  > *blocking await* at _async_write_json_to_request_in_thread:884 (/home/squah/repos/synapse/synapse/http/server.py:884)

After #12872:

request-6  --- disconnected ---
    callFromThread: resuming coroutine <coroutine object DatabasePool.runInteraction.<locals>._runInteraction at 0x7f0f1b3a9ec0>
        d.__next__ = Deferred.send
    Deferred.send() got called at /home/squah/repos/synapse/synapse/storage/database.py:929
request-7  --- make_request() ---
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/state/__init__.py:176
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/util/partial_state_events_tracker.py:71
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/controllers/state.py:340
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/controllers/state.py:80
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/database.py:929
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/database.py:834
    Deferred.send() got called at /home/squah/repos/synapse/tests/http/server/_base.py:379
    callFromThread: resuming coroutine <coroutine object DatabasePool.runInteraction.<locals>._runInteraction at 0x7f0f1b3c7240>
        d.__next__ = Deferred__next__Patch.patch.<locals>.Deferred___next__
    Deferred.__next__ patch got called at /home/squah/repos/synapse/synapse/storage/database.py:929
request-7  cb:425 (/home/squah/repos/synapse/tests/server.py:425)
request-7    DatabasePool _runInteraction:806 (/home/squah/repos/synapse/synapse/storage/database.py:806)
request-7      > *force-blocked await* at DatabasePool.runWithConnection:929 (/home/squah/repos/synapse/synapse/storage/database.py:929)

I genuinely cannot see a difference between the before/after except whether __next__ or send get called.

@squahtx squahtx marked this pull request as ready for review June 1, 2022 21:17
@squahtx
Copy link
Contributor Author

squahtx commented Jun 1, 2022

Anyway I'll get this merged after the long weekend.

@squahtx squahtx merged commit a10cc5f into develop Jun 7, 2022
@squahtx squahtx deleted the squah/test_cancellation_at_every_await branch June 7, 2022 16:14
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Aug 23, 2022
Synapse 1.62.0 (2022-07-05)
===========================

No significant changes since 1.62.0rc3.

Authors of spam-checker plugins should consult the [upgrade notes](https://github.com/matrix-org/synapse/blob/release-v1.62/docs/upgrade.md#upgrading-to-v1620) to learn about the enriched signatures for spam checker callbacks, which are supported with this release of Synapse.

Synapse 1.62.0rc3 (2022-07-04)
==============================

Bugfixes
--------

- Update the version of the [ldap3 plugin](https://github.com/matrix-org/matrix-synapse-ldap3/) included in the `matrixdotorg/synapse` DockerHub images and the Debian packages hosted on `packages.matrix.org` to 0.2.1. This fixes [a bug](matrix-org/matrix-synapse-ldap3#163) with usernames containing uppercase characters. ([\matrix-org#13156](matrix-org#13156))
- Fix a bug introduced in Synapse 1.62.0rc1 affecting unread counts for users on small servers. ([\matrix-org#13168](matrix-org#13168))

Synapse 1.62.0rc2 (2022-07-01)
==============================

Bugfixes
--------

- Fix unread counts for users on large servers. Introduced in v1.62.0rc1. ([\matrix-org#13140](matrix-org#13140))
- Fix DB performance when deleting old push notifications. Introduced in v1.62.0rc1. ([\matrix-org#13141](matrix-org#13141))

Synapse 1.62.0rc1 (2022-06-28)
==============================

Features
--------

- Port the spam-checker API callbacks to a new, richer API. This is part of an ongoing change to let spam-checker modules inform users of the reason their event or operation is rejected. ([\matrix-org#12857](matrix-org#12857), [\matrix-org#13047](matrix-org#13047))
- Allow server admins to customise the response of the `/.well-known/matrix/client` endpoint. ([\matrix-org#13035](matrix-org#13035))
- Add metrics measuring the CPU and DB time spent in state resolution. ([\matrix-org#13036](matrix-org#13036))
- Speed up fetching of device list changes in `/sync` and `/keys/changes`. ([\matrix-org#13045](matrix-org#13045), [\matrix-org#13098](matrix-org#13098))
- Improve URL previews for sites which only provide Twitter Card metadata, e.g. LWN.net. ([\matrix-org#13056](matrix-org#13056))

Bugfixes
--------

- Update [MSC3786](matrix-org/matrix-spec-proposals#3786) implementation to check `state_key`. ([\matrix-org#12939](matrix-org#12939))
- Fix a bug introduced in Synapse 1.58 where Synapse would not report full version information when installed from a git checkout. This is a best-effort affair and not guaranteed to be stable. ([\matrix-org#12973](matrix-org#12973))
- Fix a bug introduced in Synapse 1.60 where Synapse would fail to start if the `sqlite3` module was not available. ([\matrix-org#12979](matrix-org#12979))
- Fix a bug where non-standard information was required when requesting the `/hierarchy` API over federation. Introduced
  in Synapse v1.41.0. ([\matrix-org#12991](matrix-org#12991))
- Fix a long-standing bug which meant that rate limiting was not restrictive enough in some cases. ([\matrix-org#13018](matrix-org#13018))
- Fix a bug introduced in Synapse 1.58 where profile requests for a malformed user ID would ccause an internal error. Synapse now returns 400 Bad Request in this situation. ([\matrix-org#13041](matrix-org#13041))
- Fix some inconsistencies in the event authentication code. ([\matrix-org#13087](matrix-org#13087), [\matrix-org#13088](matrix-org#13088))
- Fix a long-standing bug where room directory requests would cause an internal server error if given a malformed room alias. ([\matrix-org#13106](matrix-org#13106))

Improved Documentation
----------------------

- Add documentation for how to configure Synapse with Workers using Docker Compose. Includes example worker config and docker-compose.yaml. Contributed by @Thumbscrew. ([\matrix-org#12737](matrix-org#12737))
- Ensure the [Poetry cheat sheet](https://matrix-org.github.io/synapse/develop/development/dependencies.html) is available in the online documentation. ([\matrix-org#13022](matrix-org#13022))
- Mention removed community/group worker endpoints in upgrade.md. Contributed by @olmari. ([\matrix-org#13023](matrix-org#13023))
- Add instructions for running Complement with `gotestfmt`-formatted output locally. ([\matrix-org#13073](matrix-org#13073))
- Update OpenTracing docs to reference the configuration manual rather than the configuration file. ([\matrix-org#13076](matrix-org#13076))
- Update information on downstream Debian packages. ([\matrix-org#13095](matrix-org#13095))
- Remove documentation for the Delete Group Admin API which no longer exists. ([\matrix-org#13112](matrix-org#13112))

Deprecations and Removals
-------------------------

- Remove the unspecced `DELETE /directory/list/room/{roomId}` endpoint, which hid rooms from the [public room directory](https://spec.matrix.org/v1.3/client-server-api/#listing-rooms). Instead, `PUT` to the same URL with a visibility of `"private"`. ([\matrix-org#13123](matrix-org#13123))

Internal Changes
----------------

- Add tests for cancellation of `GET /rooms/$room_id/members` and `GET /rooms/$room_id/state` requests. ([\matrix-org#12674](matrix-org#12674))
- Report login failures due to unknown third party identifiers in the same way as failures due to invalid passwords. This prevents an attacker from using the error response to determine if the identifier exists. Contributed by Daniel Aloni. ([\matrix-org#12738](matrix-org#12738))
- Merge the Complement testing Docker images into a single, multi-purpose image. ([\matrix-org#12881](matrix-org#12881), [\matrix-org#13075](matrix-org#13075))
- Simplify the database schema for `event_edges`. ([\matrix-org#12893](matrix-org#12893))
- Clean up the test code for client disconnection. ([\matrix-org#12929](matrix-org#12929))
- Remove code generating comments in configuration. ([\matrix-org#12941](matrix-org#12941))
- Add `Cross-Origin-Resource-Policy: cross-origin` header to content repository's thumbnail and download endpoints. ([\matrix-org#12944](matrix-org#12944))
- Replace noop background updates with `DELETE` delta. ([\matrix-org#12954](matrix-org#12954), [\matrix-org#13050](matrix-org#13050))
- Use lower isolation level when inserting read receipts to avoid serialization errors. Contributed by Nick @ Beeper. ([\matrix-org#12957](matrix-org#12957))
- Reduce the amount of state we pull from the DB. ([\matrix-org#12963](matrix-org#12963))
- Enable testing against PostgreSQL databases in Complement CI. ([\matrix-org#12965](matrix-org#12965), [\matrix-org#13034](matrix-org#13034))
- Fix an inaccurate comment. ([\matrix-org#12969](matrix-org#12969))
- Remove the `delete_device` method and always call `delete_devices`. ([\matrix-org#12970](matrix-org#12970))
- Use a GitHub form for issues rather than a hard-to-read, easy-to-ignore template. ([\matrix-org#12982](matrix-org#12982))
- Move [MSC3715](matrix-org/matrix-spec-proposals#3715) behind an experimental config flag. ([\matrix-org#12984](matrix-org#12984))
- Add type hints to tests. ([\matrix-org#12985](matrix-org#12985), [\matrix-org#13099](matrix-org#13099))
- Refactor macaroon tokens generation and move the unsubscribe link in notification emails to `/_synapse/client/unsubscribe`. ([\matrix-org#12986](matrix-org#12986))
- Fix documentation for running complement tests. ([\matrix-org#12990](matrix-org#12990))
- Faster joins: add issue links to the TODO comments in the code. ([\matrix-org#13004](matrix-org#13004))
- Reduce DB usage of `/sync` when a large number of unread messages have recently been sent in a room. ([\matrix-org#13005](matrix-org#13005), [\matrix-org#13096](matrix-org#13096), [\matrix-org#13118](matrix-org#13118))
- Replaced usage of PyJWT with methods from Authlib in `org.matrix.login.jwt`. Contributed by Hannes Lerchl. ([\matrix-org#13011](matrix-org#13011))
- Modernize the `contrib/graph/` scripts. ([\matrix-org#13013](matrix-org#13013))
- Remove redundant `room_version` parameters from event auth functions. ([\matrix-org#13017](matrix-org#13017))
- Decouple `synapse.api.auth_blocking.AuthBlocking` from `synapse.api.auth.Auth`. ([\matrix-org#13021](matrix-org#13021))
- Add type annotations to `synapse.storage.databases.main.devices`. ([\matrix-org#13025](matrix-org#13025))
- Set default `sync_response_cache_duration` to two minutes. ([\matrix-org#13042](matrix-org#13042))
- Rename CI test runs. ([\matrix-org#13046](matrix-org#13046))
- Increase timeout of complement CI test runs. ([\matrix-org#13048](matrix-org#13048))
- Refactor entry points so that they all have a `main` function. ([\matrix-org#13052](matrix-org#13052))
- Refactor the Dockerfile-workers configuration script to use Jinja2 templates in Synapse workers' Supervisord blocks. ([\matrix-org#13054](matrix-org#13054))
- Add headers to individual options in config documentation to allow for linking. ([\matrix-org#13055](matrix-org#13055))
- Make Complement CI logs easier to read. ([\matrix-org#13057](matrix-org#13057), [\matrix-org#13058](matrix-org#13058), [\matrix-org#13069](matrix-org#13069))
- Don't instantiate modules with keyword arguments. ([\matrix-org#13060](matrix-org#13060))
- Fix type checking errors against Twisted trunk. ([\matrix-org#13061](matrix-org#13061))
- Allow MSC3030 `timestamp_to_event` calls from anyone on world-readable rooms. ([\matrix-org#13062](matrix-org#13062))
- Add a CI job to check that schema deltas are in the correct folder. ([\matrix-org#13063](matrix-org#13063))
- Avoid rechecking event auth rules which are independent of room state. ([\matrix-org#13065](matrix-org#13065))
- Reduce the duplication of code that invokes the rate limiter. ([\matrix-org#13070](matrix-org#13070))
- Add a Subject Alternative Name to the certificate generated for Complement tests. ([\matrix-org#13071](matrix-org#13071))
- Add more tests for room upgrades. ([\matrix-org#13074](matrix-org#13074))
- Pin dependencies maintained by matrix.org to [semantic version](https://semver.org/) bounds. ([\matrix-org#13082](matrix-org#13082))
- Correctly report prometheus DB stats for `get_earliest_token_for_stats`. ([\matrix-org#13085](matrix-org#13085))
- Fix a long-standing bug where a finished logging context would be re-started when Synapse failed to persist an event from federation. ([\matrix-org#13089](matrix-org#13089))
- Simplify the alias deletion logic as an application service. ([\matrix-org#13093](matrix-org#13093))
- Add type annotations to `tests.test_server`. ([\matrix-org#13124](matrix-org#13124))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants