feat(db): add postgres triggers to cleanup expired rows on ttl-enabled schema tables #10389
Conversation
This is a clever approach. Are we fine with the random response time increase that we see for those insertions that perform a garbage collection?
This programs the database in a way that is very hard to control when something goes wrong or needs tweaking. I must say that the cleverness is appealing, especially since it offloads all the work to the database server!
I think that if we have a mechanism for that, scheduling the garbage collection from the CP is better than randomly invoking it from insertions. The concern that I have is that when we piggyback the GC onto insertions, we'd see the additional cost as latency increases in our proxy path. I'm sure that we have other non-obvious contributing factors, but I'd generally want to keep clear of adding cost to the proxy path that would be very difficult to analyze when looking at the latency profile.
All this needs to be measured. Perhaps we don't have any problems with the solution I am proposing (it has parameters for the probability and the number of rows to be deleted). Perhaps having a cluster of 50 traditional nodes, or 10 control planes scheduled to bombard the database, is way worse. Perhaps adding a cluster mutex lock there only makes it more complex (the cluster mutex is still an additional query to the database, and then you need to deal with situations where the lock is not released). Inserting new access tokens and authorization codes is not that frequent; it usually means one per user per certain amount of time (e.g. per session). If there are, say, 100-1000 logins per second (which is quite a bit already), this will only affect the login of one or two users, depending on how we set up the variables. And login in particular is not that sensitive to latency: it is usually latency-heavy anyway, as you need to do all sorts of password hashing etc. that should be computationally intensive and, yes, take time; most probably much more than it takes Postgres to delete some rows. Outside the OAuth2 plugin, nobody has reported any problems.
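To make the trade-off concrete, here is a minimal sketch of a trigger along those lines, assuming Kong's Postgres migration format and the `ttl` timestamp column used by TTL-enabled tables; the function and trigger names are hypothetical, and the 0.01 probability and 100-row cap stand in for the tunable parameters mentioned above:

```lua
-- Hypothetical migration sketch: an AFTER INSERT trigger that, with a
-- small probability, garbage-collects a bounded batch of expired rows.
return {
  postgres = {
    up = [[
      CREATE OR REPLACE FUNCTION ttl_cleanup_oauth2_tokens() RETURNS TRIGGER AS $$
      BEGIN
        -- only roughly 1 in 100 insert statements pays the cleanup cost
        IF random() < 0.01 THEN
          -- Postgres has no DELETE ... LIMIT, so bound the batch via ctid
          DELETE FROM oauth2_tokens
                WHERE ctid IN (SELECT ctid
                                 FROM oauth2_tokens
                                WHERE ttl < CURRENT_TIMESTAMP
                                LIMIT 100);
        END IF;
        RETURN NULL; -- return value is ignored for AFTER triggers
      END;
      $$ LANGUAGE plpgsql;

      DROP TRIGGER IF EXISTS oauth2_tokens_ttl_trigger ON oauth2_tokens;
      CREATE TRIGGER oauth2_tokens_ttl_trigger
        AFTER INSERT ON oauth2_tokens
        FOR EACH STATEMENT
        EXECUTE PROCEDURE ttl_cleanup_oauth2_tokens();
    ]],
  },
}
```

With a statement-level trigger like this, the cleanup cost is amortized: most inserts pay nothing, and the unlucky ones pay for at most one bounded batch delete.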
Data planes and DB-less proxy nodes don't insert anything. The CP inserts, and there we don't have the latency issue that much (on CPs, insertions are a low factor). The only problem we have is the OAuth2 plugin in TRADITIONAL mode (where there is no official separation of CP and DP; an easy fix is to make the timer detect whether ADMIN_LISTEN is present and only enable it on nodes that have it).
I've been thinking about this the whole day, and so far I still have some concerns here:
Do you think it is a good idea that we remove the …
@hanshuebner and @hbagdi, thanks for the great feedback. With help from @windmgc I modified this PR so that the trigger now removes a maximum of 2 rows on each insert statement. I also removed the usage of RANDOM. The penalty of deleting a maximum of 2 rows per insert statement should be minuscule.
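Under those constraints, the revised embedded SQL might look like the following sketch (names hypothetical; the `random()` gate is gone, and the ctid subquery caps the work at two expired rows per insert statement):

```lua
-- Hypothetical revised trigger function: unconditional, bounded cleanup.
local revised_trigger_sql = [[
  CREATE OR REPLACE FUNCTION ttl_cleanup_oauth2_tokens() RETURNS TRIGGER AS $$
  BEGIN
    DELETE FROM oauth2_tokens
          WHERE ctid IN (SELECT ctid
                           FROM oauth2_tokens
                          WHERE ttl < CURRENT_TIMESTAMP
                          LIMIT 2);
    RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;
]]
```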
I have forever been told by my DBA to not use triggers, or to use them minimally and be careful with them. I've never gotten a good answer when I push on why not.
I could be wrong, but isn't this change about all tables that have TTLs? At least that was what the original PR (#10331) was about.
Write traffic via the CP is generally bursty in nature, and latency differences do matter. I do see your point that the latency matters less here. Anyway, this change looks much better than it was before.
After considering it, I'd like to keep the original timer running in the background. It seems a bit hard to tell whether an entity is a "custom" one; from Kong's perspective, "oauth2_tokens" is also a custom entity loaded by a plugin. And it also seems unreliable to judge by … @bungle @hbagdi, if you all agree with keeping the original timer code, I'll go this way.
Yes, I approve this approach. Now that we don't need it that much for core anymore, perhaps also change it so that it runs less often. I do not know what would be a good value. Once per 5 minutes? Once per hour? But I feel once per minute is too much.
I don't have strong opinions either way.
I think increasing it to once per 5 minutes is enough, and it won't cause a huge behavior change. One thing to note is that from 3.x we replaced the timer library with timer-ng, which can also guarantee that, if a timer is created by …
@bungle Could you please review it again? I've restored the original timer code and increased the interval to 300s.
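For illustration, this is the shape of such a recurring cleanup timer at the new interval, sketched with OpenResty's stock `ngx.timer.every` (Kong's actual code routes timers through timer-ng, so names and structure here are illustrative only):

```lua
-- Illustrative only: one background sweep every five minutes.
local CLEANUP_INTERVAL = 300 -- seconds; previously 60

local function cleanup_expired_rows(premature)
  if premature then
    return -- worker is shutting down, skip the sweep
  end
  -- issue the batched DELETEs for each TTL-enabled table here
end

assert(ngx.timer.every(CLEANUP_INTERVAL, cleanup_expired_rows))
```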
Can we call out this change of frequency in the changelog, please?
@windmgc Please cherry-pick this PR to EE.
ping @windmgc
As it was not yet cherry-picked, this is now covered by the CE2EE master merge.
@outsinre, note that this has now been cherry-picked to EE.
Summary
This is an alternative to #10331. @windmgc please check this out. What do you think? If OK, please take over and do the rest of the changes (i.e. add migrations so that the cleanup applies to other entities too, similar to what I did for `oauth2_tokens` here).

Fixes FTI-4746 and KAG-516