Conversation
@@ -1462,68 +1468,68 @@ async def have_seen_events(
            event_ids: events we are looking for

        Returns:
-           The set of events we have already seen.
+           The remaining set of events we haven't seen.
Changed from returning the events we saw to the events we didn't see. In big rooms like #matrixhq, we've seen most of the events already; only a small piece of the state is new.
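A minimal illustration of the semantic change (a standalone sketch with hypothetical names, not the real Synapse signatures): returning the unseen set means the common case in a large, mostly-known room returns a small set.

```python
from typing import Collection, Set

# Hypothetical stand-in for "events the server already knows about".
_ALREADY_SEEN: Set[str] = {"$a", "$b", "$c"}


def have_seen_events_old(event_ids: Collection[str]) -> Set[str]:
    """Old behaviour: return the events we *have* seen (usually the big set)."""
    return {eid for eid in event_ids if eid in _ALREADY_SEEN}


def have_seen_events_new(event_ids: Collection[str]) -> Set[str]:
    """New behaviour: return the events we have *not* seen (usually the small set)."""
    return {eid for eid in event_ids if eid not in _ALREADY_SEEN}


print(have_seen_events_old(["$a", "$b", "$new"]))  # {'$a', '$b'} (order may vary)
print(have_seen_events_new(["$a", "$b", "$new"]))  # {'$new'}
```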
        results: Set[str] = set()
        for chunk in batch_iter(event_ids, 500):
            r = await self._have_seen_events_dict(
                [(room_id, event_id) for event_id in chunk]
Just a bunch of avoiding creating new intermediary lists and using `set`s for lookups.
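Roughly what that buys (illustrative only, not the Synapse code path): `set` membership is a hash lookup, while scanning freshly built intermediate lists is linear per check.

```python
# Illustrative only: list membership scans linearly, set membership hashes.
event_ids = [f"$event{i}" for i in range(100_000)]

as_list = list(event_ids)  # `in` walks the list, O(n) per check
as_set = set(event_ids)    # `in` is a single hash lookup, O(1) on average

needle = "$event99999"
assert needle in as_list   # slow path
assert needle in as_set    # fast path
```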
        # we deliberately do *not* query the database for room_id, to make the
        # query an index-only lookup on `events_event_id_key`.
        #
        # We therefore pull the events from the database into a set...

        sql = "SELECT event_id FROM events AS e WHERE "
        clause, args = make_in_list_sql_clause(
-           txn.database_engine, "e.event_id", [eid for (_rid, eid) in remaining]
+           txn.database_engine, "e.event_id", remaining_event_ids
Any random Postgres SQL tricks to make looking up 1000 events faster? Like this sort of stuff:
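For what it's worth, one common Postgres-side approach (and, if memory serves, roughly what `make_in_list_sql_clause` already generates on Postgres) is to bind the whole ID list as a single array parameter and use `= ANY(...)` instead of expanding thousands of placeholders. A hedged sketch, assuming psycopg2 and an illustrative connection string:

```python
import psycopg2

event_ids = ["$a", "$b", "$c"]  # imagine ~1000 of these

conn = psycopg2.connect("dbname=synapse")  # illustrative DSN
with conn, conn.cursor() as cur:
    # psycopg2 adapts a Python list to a Postgres array, so the whole
    # lookup is a single statement with one bound parameter.
    cur.execute(
        "SELECT event_id FROM events WHERE event_id = ANY(%s)",
        (event_ids,),
    )
    seen = {row[0] for row in cur.fetchall()}
print(seen)
```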
    @cachedList(cached_method_name="have_seen_event", list_name="keys")
Is there any max size of a `@cachedList`? It seems like it grows forever.
It ought to be the same cache as `have_seen_event`, which has `max_entries=100000` (before multiplying by the global cache factor).
Some discussion in backend internal. It seems like `have_seen_events` currently uses the default of 1000 from the `DeferredCache`. And it's not one of the `per_cache_factors` that we configure for matrix.org, so it's just some small multiple of that in any case with the `global_factor`.
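Back-of-the-envelope arithmetic for the effective size under those assumptions (the real resizing lives in Synapse's cache code; the factor values here are made up):

```python
from typing import Optional

DEFAULT_MAX_ENTRIES = 1000   # the DeferredCache default mentioned above
GLOBAL_CACHE_FACTOR = 2.0    # hypothetical deployment value


def effective_size(
    max_entries: int, per_cache_factor: Optional[float], global_factor: float
) -> int:
    # Per-cache factor wins if configured; otherwise the global factor applies.
    factor = per_cache_factor if per_cache_factor is not None else global_factor
    return int(max_entries * factor)


print(effective_size(DEFAULT_MAX_ENTRIES, None, GLOBAL_CACHE_FACTOR))  # 2000
print(effective_size(100_000, None, GLOBAL_CACHE_FACTOR))              # 200000
```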
> It ought to be the same cache as have_seen_event, which has max_entries=100000 (before multiplying by the global cache factor).

@squahtx Ahhh, I think I see what you're trying to say. `_have_seen_events_dict()` re-uses the `have_seen_event` cache, which is defined with `max_entries=100000`:
synapse/synapse/storage/databases/main/events_worker.py, lines 1481 to 1482 in f9f0342:

    @cachedList(cached_method_name="have_seen_event", list_name="keys")
    async def _have_seen_events_dict(
synapse/synapse/storage/databases/main/events_worker.py, lines 1523 to 1524 in f9f0342:

    @cached(max_entries=100000, tree=True)
    async def have_seen_event(self, room_id: str, event_id: str) -> bool:
In any case, it seems to have a 0% cache hit rate:
https://grafana.matrix.org/d/000000012/synapse?orgId=1&from=1660891259284&to=1660934459284&viewPanel=1
And the cache size doesn't get above 6k.
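Conceptually, the relationship looks something like this simplified sketch (not the real `@cachedList` machinery): the batched lookup consults the same per-event cache that backs the single lookup, and only the misses go to the database.

```python
from typing import Dict, Iterable

# Simplified stand-in for the shared cache behind have_seen_event.
_have_seen_event_cache: Dict[str, bool] = {}


def _fetch_from_db(event_ids: Iterable[str]) -> Dict[str, bool]:
    # Placeholder for the real batched database query.
    return {eid: True for eid in event_ids}


def have_seen_events_dict(event_ids: Iterable[str]) -> Dict[str, bool]:
    results: Dict[str, bool] = {}
    misses = []
    for eid in event_ids:
        if eid in _have_seen_event_cache:
            results[eid] = _have_seen_event_cache[eid]
        else:
            misses.append(eid)
    if misses:
        fetched = _fetch_from_db(misses)
        _have_seen_event_cache.update(fetched)  # written back into the shared cache
        results.update(fetched)
    return results
```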
This cache still seems so weird to me. Why doesn't it grow bigger? We should at least see 200k state events from Matrix HQ show up and completely fill up the cache sometimes. Or at least the cache should keep growing until full?
https://grafana.matrix.org/d/000000012/synapse?orgId=1&from=1664997921975&to=1666293921975&viewPanel=8
… two events (#13586): Split off from #13561. Part of #13356. Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh).
            )
            results.update(eid for ((_rid, eid), have_event) in r.items() if have_event)
        remaining_event_ids: Set[str] = set()
        for chunk in batch_iter(event_ids, 1000):
Why the number 1000 specifically? If it's supposed to match `DeferredCache.max_entries`, could they be made to refer to the same declaration somehow?
It's just arbitrarily doubled to have less round-trip time to the database with the really big rooms. I don't know of a good heuristic to justify a better value, but I know we don't want the SQL queries to get too big, which is why we're batching in the first place.

Good to know that we shouldn't exceed the `max_entries` of the cache though 👍
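For context, the batching just chunks the iterable so a single SQL statement never carries tens of thousands of parameters. A minimal `batch_iter`-style helper (Synapse has its own in its utils; this is a generic sketch):

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def batch_iter(iterable: Iterable[T], size: int) -> Iterator[Tuple[T, ...]]:
    """Yield successive tuples of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, size))
        if not chunk:
            return
        yield chunk


for chunk in batch_iter(range(2500), 1000):
    print(len(chunk))  # 1000, 1000, 500
```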
> Good to know that we shouldn't exceed the max_entries of the cache though

Oh, that was a very tentative "if"; I mentioned it as a guess because it came up in the PR discussion.

If (again, if ;)) they are not supposed to be coupled, perhaps consider moving it to a file-local constant like `EVENT_QUEUE_BATCH_SIZE` or similar and moving its definition next to the existing ones?
        # if the event cache contains the event, obviously we've seen it.
        event_cache_entry_map = self._get_events_from_local_cache(event_ids)
        event_ids_in_cache = event_cache_entry_map.keys()
        remaining_event_ids = {
Are these enumerations a significant part of the runtime? Unless I'm missing something being lazy or whatnot here, this:
`setdiff(events, subset(eventcache, events))`
can be simplified to:
`setdiff(events, eventcache)`
It should only be necessary to perform one full enumeration of the event IDs, in a single pass. That'd require complementing or modifying `_get_events_from_local_cache`, though. Unless you already checked that the interpreter is smart enough about it (I think things have changed a lot there in Python since I actually knew this stuff), inlining that logic could also bring improvements.
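In other words (a hedged sketch of the suggestion, with made-up data), the remaining set can be computed directly against the cache's key view in one conceptual pass:

```python
from typing import Dict, Set

event_ids: Set[str] = {"$a", "$b", "$c", "$d"}
event_cache: Dict[str, object] = {"$a": object(), "$c": object(), "$zzz": object()}

# Current shape: build the cached subset first, then subtract it.
cached_subset = {eid for eid in event_ids if eid in event_cache}
remaining_two_pass = event_ids - cached_subset

# Suggested shape: subtract the cache's key view directly.
remaining_one_pass = event_ids - event_cache.keys()

assert remaining_two_pass == remaining_one_pass == {"$b", "$d"}
```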
> Are these enumerations a significant part of the runtime?

Maybe, but it's practically nothing compared to `@cachedList` being bad. I'm waiting for changes there to see how necessary these kinds of optimizations are.
Fix #13625
Part of #13356
Mentioned in internal doc

Optimize `have_seen_events` because when backfilling #matrixhq, 20s is just calling `have_seen_events` on the 200k state and auth events in the room. And in the future, hopefully a PR to avoid the `have_seen_events` call altogether or do it in the background.

- `have_seen_events` (157 `db.have_seen_events`) takes 6.62s to process 77k events
- `have_seen_events` (246 `db.have_seen_events`) takes 13.19s to process 122k events

With 50k events, for a cold query, we're seeing a 3.7s -> 0.33s improvement. And the new version doesn't even have a cache yet and it still performs almost as good.
Dev notes

Todo
- `have_seen_events`: use the returned remaining events we didn't see (previously, it returned the events we saw)

Pull Request Checklist
- Changelog entry is a short, user-facing description of the change (e.g. not "Moved X method from EventStore to EventWorkerStore.")
- Use markdown where necessary, mostly for code blocks.
- Code style is correct (run the linters)