This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

monthly_active_users table gets a ton of locks #4318

Closed
turt2live opened this issue Dec 21, 2018 · 14 comments
Labels
z-bug (Deprecated Label)

Comments

@turt2live
Member

 blocked_pid | blocked_user | blocking_pid | blocking_user |                 blocked_statement                 |                            current_statement_in_blocking_process
-------------+--------------+--------------+---------------+---------------------------------------------------+---------------------------------------------------------------------------------------------
       10476 | t2botsynapse |        10405 | t2botsynapse  | LOCK TABLE monthly_active_users in EXCLUSIVE MODE | LOCK TABLE monthly_active_users in EXCLUSIVE MODE
       10471 | t2botsynapse |        10405 | t2botsynapse  | LOCK TABLE monthly_active_users in EXCLUSIVE MODE | LOCK TABLE monthly_active_users in EXCLUSIVE MODE
       10473 | t2botsynapse |        10405 | t2botsynapse  | LOCK TABLE monthly_active_users in EXCLUSIVE MODE | LOCK TABLE monthly_active_users in EXCLUSIVE MODE
       10418 | t2botsynapse |        10405 | t2botsynapse  | LOCK TABLE monthly_active_users in EXCLUSIVE MODE | LOCK TABLE monthly_active_users in EXCLUSIVE MODE
       10441 | t2botsynapse |        10405 | t2botsynapse  | LOCK TABLE monthly_active_users in EXCLUSIVE MODE | LOCK TABLE monthly_active_users in EXCLUSIVE MODE
       10436 | t2botsynapse |        10405 | t2botsynapse  | LOCK TABLE monthly_active_users in EXCLUSIVE MODE | LOCK TABLE monthly_active_users in EXCLUSIVE MODE
       10481 | t2botsynapse |        10405 | t2botsynapse  | LOCK TABLE monthly_active_users in EXCLUSIVE MODE | LOCK TABLE monthly_active_users in EXCLUSIVE MODE
       10475 | t2botsynapse |        10405 | t2botsynapse  | LOCK TABLE monthly_active_users in EXCLUSIVE MODE | LOCK TABLE monthly_active_users in EXCLUSIVE MODE

~1100 records not shown
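
The output above has the column layout of the standard lock-monitoring query (a join of pg_locks and pg_stat_activity). For anyone wanting to reproduce the view, a simplified sketch of that query is below; it matches waiters to holders on relation only, so treat it as illustrative rather than exact:

-- Simplified lock-monitoring query: which backend is waiting on which,
-- and the statements involved on each side.
SELECT blocked_locks.pid          AS blocked_pid,
       blocked_activity.usename   AS blocked_user,
       blocking_locks.pid         AS blocking_pid,
       blocking_activity.usename  AS blocking_user,
       blocked_activity.query     AS blocked_statement,
       blocking_activity.query    AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;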

This is with:

track_appservice_user_ips: true
mau_stats_only: true
@turt2live
Member Author

Metrics-wise, it does this:
[metrics graph]

(dip is where the upgrade to 0.34 happened)

The server is on Python 2.7 at this time. Requests are failing because the master process is effectively deadlocked waiting for the MAU transactions.

@turt2live
Member Author

Previously there were limited locks on the table; they were introduced in the fix for #3854.

@hawkowl
Contributor

hawkowl commented Dec 21, 2018

@turt2live if you're feeling adventurous, could you try #4306 and see if that reduces the locking? It makes upserts atomic on Pg9.5+.
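
For context, a minimal sketch of the difference (the values are placeholders; the real statements Synapse issues appear in the logs later in this thread):

-- Emulated upsert: serialise writers with a table lock, then check whether
-- the row exists and INSERT or UPDATE accordingly.
BEGIN;
LOCK TABLE monthly_active_users IN EXCLUSIVE MODE;
-- SELECT ... ; then INSERT or UPDATE
COMMIT;

-- Native upsert (Postgres 9.5+): a single atomic statement, no table lock.
INSERT INTO monthly_active_users (user_id, timestamp)
VALUES ('@placeholder:example.org', 0)
ON CONFLICT (user_id) DO UPDATE SET timestamp = EXCLUDED.timestamp;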

@turt2live
Member Author

#4306 does indeed appear to fix the issue, and in theory would also allow lock=False to be added back in, since the upsert should lock the row appropriately. The lock might still be needed for people on old postgres or sqlite, though (then again, do we really care at that point? That postgres would be ancient, and sqlite is sqlite).

@turt2live
Member Author

Turns out I was wrong about #4306 working. #4306 adds the native upsert on _simple_upsert but not on _simple_upsert_txn (which is what the MAU code uses). Also, the native upsert always returns true, even when the operation was an update rather than an insert.

Out of fear of breaking the rest of synapse, I just modified the MAU code to use _simple_upsert_txn_native directly instead of fixing _simple_upsert_txn to use the native form too. Because the MAU code only invalidates some caches, it is relatively safe to let it always think the user was inserted; however, this might not be okay for the rest of synapse.

Additionally, my postgres is still angry for some dumb reason:

2018-12-21 05:32:58.823 CET [12434] t2botsynapse@t2b_synapse ERROR:  duplicate key value violates unique constraint "monthly_active_users_users"
2018-12-21 05:32:58.823 CET [12434] t2botsynapse@t2b_synapse DETAIL:  Key (user_id)=(@redacted:t2bot.io) already exists.
2018-12-21 05:32:58.823 CET [12434] t2botsynapse@t2b_synapse STATEMENT:  INSERT INTO monthly_active_users (timestamp, user_id) VALUES (1545366778821, '@redacted:t2bot.io') ON CONFLICT (user_id) DO UPDATE SET timestamp=EXCLUDED.timestamp

Dropping the index and recreating it appears to fix it (I haven't seen the error in the ~10 minutes synapse has been running), although it is questionable how the index got into such a broken state in the first place.
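
For reference, a sketch of the rebuild, assuming the index named in the error is the unique index on user_id implied by the constraint violation (the DDL in Synapse's actual schema may differ):

-- Rebuild the existing index in place:
REINDEX INDEX monthly_active_users_users;

-- Or drop and recreate it explicitly (assumed definition):
-- DROP INDEX monthly_active_users_users;
-- CREATE UNIQUE INDEX monthly_active_users_users
--     ON monthly_active_users (user_id);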

@turt2live
Member Author

and yet another update:
Recreating the index does indeed fix the postgres problem, although the new error is:

2018-12-21 05:56:40.455 CET [12805] t2botsynapse@t2b_synapse ERROR:  could not serialize access due to concurrent update
2018-12-21 05:56:40.455 CET [12805] t2botsynapse@t2b_synapse STATEMENT:  INSERT INTO monthly_active_users (timestamp, user_id) VALUES (1545368200453, '@redacted:t2bot.io') ON CONFLICT (user_id) DO UPDATE SET timestamp=EXCLUDED.timestamp

@richvdh
Member

richvdh commented Dec 21, 2018

I'm a bit confused about what introduced this change. @turt2live says

Previously there were limited locks on the table; they were introduced in the fix for #3854.

but the fix to #3854 (#3961) landed in 0.33.6, so unless you've just upgraded from something earlier than that (?), it doesn't really explain why it suddenly started now.

@turt2live
Member Author

I was upgrading from 0.34-rc2 to develop. I think what happened was that a large queue of events got drained as soon as the server came back alive. Combined with synapse's MAU startup routines, that deadlocked it.

@neilisfragile
Contributor

neilisfragile commented Dec 21, 2018

Okay, so it sounds like the strategy to fix this is to wait for #4306 to land, then either modify just upsert_monthly_active_user_txn to call the native form, or modify _simple_upsert_txn if always returning true regardless of insert or update can be shown to be safe.

@neilisfragile neilisfragile added the z-bug (Deprecated Label) label Dec 21, 2018
@turt2live
Member Author

@neilisfragile do you mean #4306?

@neilisfragile
Contributor

Yes I do, updated, thanks.

@richvdh
Member

richvdh commented Dec 27, 2018

I think what happened was a large queue of events got drained as soon as the server came back alive.

Ok so this isn't anything specific to 0.34?

@turt2live
Member Author

In retrospect, probably not. It was just a convenient time for it all to fall apart. I haven't tested or reproduced the issue on earlier versions, however, and I haven't been able to replicate the exact conditions since upgrading.

@richvdh richvdh changed the title from "0.34: monthly_active_users table gets a ton of locks" to "monthly_active_users table gets a ton of locks" Dec 27, 2018
@hawkowl
Contributor

hawkowl commented Jan 9, 2019

I am going to close this in favour of #4015, which tracks the root cause of locks in this area.

@hawkowl hawkowl closed this as completed Jan 9, 2019