RN-635: Introduce database locking mechanism to prevent change handlers executing concurrently #4139
Conversation
@@ -69,6 +69,8 @@ export async function postChanges(req, res) {
await ACTION_HANDLERS[action](models, translatedPayload);

if (action === SUBMIT_SURVEY_RESPONSE) {
  // TODO: Rework this functionality, as directly calling an analytics refresh here is both inefficient
  // and may create duplicate records in the analytics table
So the explicit call to `AnalyticsRefresher.refreshAnalytics()` here is dangerous, as it's not wrapped in an advisory lock.
We could consider ways to handle this (eg. move the locking down to the implemented change handler level, use a different mechanism to listen for analytics refreshes, etc.)
However, I sort of wanna call into question whether we should be doing this at all...
I know why we did this: to avoid bugs in PSSS where a user would submit data and then wouldn't immediately see it. However, as our system grows in complexity, there are more and more processes that occur asynchronously after data is submitted before it's integrated into the system (analytics refresh, survey response outdating, etc.)
If we wanna support seamless data integration I feel a deeper rework is required to move to an events-based system and have the caller listen to some kind of 'data_submission_success' event.
So putting it out there that we just wear this bug for now; if the reviewer agrees then I'll put it to Erin 👍
@@ -132,13 +132,13 @@ export class SurveyResponseOutdater extends ChangeHandler {
 * This method processes the changed records and ensures that all responses in the affected
 * dimension combos get a correct `outdated` status
 */
async handleChanges(changedResponses) {
  const surveysById = await this.fetchSurveysById(changedResponses);
async handleChanges(transactionModels, changedResponses) {
So this pattern of passing in `transactionModels` and passing it all over the place is kinda awkward huh?
I don't love it, it really makes things look really bloated. But I also don't love some of the other options that come to mind:
- A `transactionModels` property on the class, which `ChangeHandler` sets prior to calling `handleChanges()` and removes afterwards (sketched just below). I don't like that it's unclear when `transactionModels` is set from the child class's perspective.
- Similar to above but replacing the `models` property with the `transactionModels`. I don't like that it hides the reality of this work being done in a transaction (it's important for the developer to know imo)

Overall I think I do like this method most... but still really don't love it 😅 Any other ideas?
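To make the trade-off concrete, here is a rough sketch of that first alternative (a property the base class sets around the call). It is purely illustrative; the class, method names, and signatures are assumptions rather than the real `ChangeHandler` API.

```ts
// Sketch only: the "transactionModels property" alternative discussed above.
// Names and signatures are assumptions, not the real ChangeHandler API.
type Models = unknown; // stand-in for the Tupaia model registry

abstract class ChangeHandlerSketch {
  protected transactionModels: Models | null = null;

  // child classes implement this and read this.transactionModels
  protected abstract handleChanges(changes: unknown[]): Promise<void>;

  async executeChanges(transactionModels: Models, changes: unknown[]): Promise<void> {
    this.transactionModels = transactionModels; // set just before the child runs
    try {
      await this.handleChanges(changes);
    } finally {
      // removed afterwards; from the child's perspective it's unclear when this is safe to read
      this.transactionModels = null;
    }
  }
}
```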
Oh haha same thing as my last comment! I was wondering about capturing everything that happens with the transacting models in a class, which I think works in most cases?
const start = Date.now();
await driver.runSql(`SELECT mv$refreshMaterializedView('analytics', 'public', true);`);
const end = Date.now();
console.log(`Analytics refresh took: ${end - start}ms`);
Removed this as we now run the migrations via the central-server, so the `AnalyticsRefresher` should be able to handle this 👍
@@ -35,13 +35,16 @@ import {
  HIERARCHY_WIND_AFTER_CANONICAL_TYPES_CHANGED,
} from './EntityHierarchyCacher.fixtures';

const TEST_DEBOUNCE_TIME = 50; // short debounce time so tests run more quickly
Thank you sooo much for these excellent tests @edmofro!! They caught so many bugs that I created when re-writing the `EntityHierarchyCacher`. Truly would have been a nightmare without them!
}
// eslint-disable-next-line no-bitwise
return hash >>> 0;
};
Ah `@tupaia/tsutils`... Originally I had imagined this package would be for typescript-specific code (eg. type guards, useful generic types, etc.)
However, at this point it really seems like it's just the same as `@tupaia/utils`, and honestly I'd much rather create new utils code in typescript than in javascript.
I've created a ticket to merge these two packages into a single `@tupaia/utils` package which is written in typescript.
This kind of small but dense algorithm is the sort of thing I like unit tests around. Not sure if we even care about the expected output, but perhaps we care that e.g. it's a somewhat sensible output, the same input always hashes to the same int, and we don't get conflicts between similar inputs? Totally not required if we trust the source of the algo :-)
Hehehe I had actually been toying with how to test this in the back of my mind for a while.
The properties we want the function to hold are:
a) It will convert any string into a 64bit integer
b) It will always convert the same string into the same integer
It's the kind of function that would really benefit from a concept I've explored before: property-based testing.
I might give that a try, otherwise a simple 'generate a random string, hash it a bunch of times and ensure they're all equal' should do the trick? 😉
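For what it's worth, a minimal Jest-style sketch of that "hash it a bunch of times and ensure they're all equal" check; `hashStringToInt` and its module path are placeholder names for the hashing util in the diff above, not the real export.

```ts
// Determinism check sketch; `hashStringToInt` is a placeholder for the real hashing util.
import { hashStringToInt } from './hashStringToInt';

// build a random printable-ASCII string of the given length
const randomString = (length = 32): string =>
  Array.from({ length }, () => String.fromCharCode(32 + Math.floor(Math.random() * 95))).join('');

describe('hashStringToInt', () => {
  it('always hashes the same input to the same integer', () => {
    const input = randomString();
    const results = Array.from({ length: 100 }, () => hashStringToInt(input));
    expect(new Set(results).size).toBe(1); // every run produced the identical hash
  });

  it('produces a non-negative integer for arbitrary strings', () => {
    const hash = hashStringToInt(randomString());
    expect(Number.isInteger(hash)).toBe(true);
    expect(hash).toBeGreaterThanOrEqual(0);
  });
});
```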
const REBUILD_DEBOUNCE_TIME = 1000; // wait 1 second after changes before rebuilding, to avoid double-up

export class EntityHierarchyCacher {
export class EntityHierarchyCacher extends ChangeHandler {
I hadn't realised that `EntityHierarchyCacher` wasn't a `ChangeHandler` when I took on this work 😅
It was a bit of a tricky refactor, but the unit tests really saved my life. It's good to have consistency, and I feel things are a bit simpler now that it doesn't do any of the promise management stuff itself!
So much simpler, I love it! Thanks for doing this. Also really like that the change handlers are no longer async, that was clever.
One thing that feels a little like clutter is passing `transactingModels` to so many places. I'm not sure if it would make things more or less clear, but did you think of trying a separate class that handles processing a rebuild, and gets constructed at that time with the `transactingModels`, so each function can still refer to `this.transactingModels`?
Yeah that's a good call, I'll break it out and see how it looks 👍
Force-pushed from b5b8deb to 3f53587
- Rewrote ChangeHandlers to execute changes in a transaction
- Now each change handler acquires a transactional advisory lock on the database while it executes
- This prevents multiple services from executing the change handlers concurrently
Force-pushed from 3f53587 to 19dff97
…grations are run by the central-server
- This means the AnalyticsRefresher should be able to ensure analytics are refreshed
Force-pushed from 19dff97 to d80ce80
This is a really tidy mechanism, thanks for doing the research to work it out!
/**
 * Acquires an advisory lock for the current transaction
 * Lock will be immediately released once the transaction ends
 * (https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS)
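For readers unfamiliar with the mechanism, acquiring a transaction-level advisory lock can look roughly like this with a knex-style transaction object; the function name and the way the lock key is derived are assumptions, not the actual implementation in this PR.

```ts
// Sketch of acquiring a Postgres transaction-level advisory lock via knex.
// The function name and lock key derivation are assumptions.
import { Knex } from 'knex';

export const acquireAdvisoryLockForTransaction = async (
  transaction: Knex.Transaction,
  lockKey: number, // e.g. an integer hash of the change handler's name
): Promise<void> => {
  // pg_advisory_xact_lock blocks until the lock is free; Postgres releases it
  // automatically when the surrounding transaction commits or rolls back
  await transaction.raw('SELECT pg_advisory_xact_lock(?)', [lockKey]);
};
```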
Great comment!
@@ -47,10 +46,10 @@ export class MeditrakSyncQueue extends ChangeHandler {
  this.changeTranslators.survey = change => [change];
}

async refreshPermissionsBasedView() {
async refreshPermissionsBasedView(transactionModels) {
We use `transactingModels` elsewhere?

- async refreshPermissionsBasedView(transactionModels) {
+ async refreshPermissionsBasedView(transactingModels) {
We do... I'll fix that up 👍 though I do find the term `transactingModels` a tad confusing...
Yeah that's fair, it's a bit of a weird one. Shall we change them all to `modelsInTransaction`?
- EntityHierarchyCacher acts as the ChangeHandler
- EntityHierarchySubtreeRebuilder does the actual subtree rebuild logic

- SurveyResponseOutdater acts as the ChangeHandler
- OutdatedResponseFlagger flags the new outdated responses for a given survey

- MeditrakSyncQueue acts as the ChangeHandler
- MeditrakSyncRecordUpdater ensures the sync queue records are updated when new changes come in
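For anyone following the thread, a rough sketch of the shape this split takes: the change handler stays thin and constructs a helper per run with the transacting models, so the helper's methods can refer to `this.models`. Only the class names come from the commit messages above; the internals and signatures here are assumptions.

```ts
// Illustrative sketch of the ChangeHandler / helper-class split described in the commits above.
// Everything other than the class names is an assumption, not the actual code.
type Models = unknown; // stand-in for the Tupaia model registry

class EntityHierarchySubtreeRebuilder {
  constructor(private readonly models: Models) {}

  async rebuildSubtrees(hierarchyIds: string[]): Promise<void> {
    // every method here can refer to this.models, so there's no need to thread a
    // transactionModels parameter through each internal call
    // ...actual subtree rebuild logic lives here...
  }
}

class EntityHierarchyCacher /* extends ChangeHandler */ {
  async handleChanges(transactionModels: Models, changedHierarchyIds: string[]): Promise<void> {
    const rebuilder = new EntityHierarchySubtreeRebuilder(transactionModels);
    await rebuilder.rebuildSubtrees(changedHierarchyIds);
  }
}
```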
lgtm, thanks!
@edmofro did a bit more exploring into this one today. So it turns out, in the case of the […]

I'm still thinking of keeping the Advisory Lock stuff in place though, as it seems safer overall. Postgres's default transaction isolation level is READ COMMITTED, so in theory it could be possible for two concurrent transactions to interfere with one another, if one writes data to a table and commits it while the other is halfway through and then re-reads from that table.
Issue RN-635:
Ever since we switched to the blue/green deployment system, we've created a situation where it's quite common for multiple server instances to point to a single database. This can create issues when the database change handlers for each server execute concurrently for the same change. Issues are discussed further in the ticket.
To prevent this from occurring we need some kind of database level mechanism to limit the number of change handler executions to one at a time. Introducing Postgres Advisory Locks!
Advisory locks seem like the perfect tool for the job, as our change handlers may operate over multiple tables (which we don't want to lock) and it's really just concurrent executions of the change handlers that we care about (we don't need to lock anything else out).
In order to get the advisory locking working I had to rewrite our change handlers to execute in a transaction. It seemed like the only way to get the locking mechanism to work, as the knex connection pool means that you can't reliably unlock session-level advisory locks. Also, running our change handlers in a transaction seems like a good move regardless... so I consider it a win-win 😄
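A small sketch of why the transaction matters, assuming a plain knex pool (the connection string and lock key are placeholders): with a pool, a session-level lock/unlock pair can land on different connections, whereas a transaction-level lock pins itself to the transaction's connection and is released automatically.

```ts
// Sketch contrasting session-level and transaction-level advisory locks under a knex pool.
// The connection string and lock key are placeholders.
import knex from 'knex';

const db = knex({ client: 'pg', connection: 'postgres://localhost/tupaia' /* placeholder */ });
const lockKey = 12345; // placeholder advisory lock key

const demo = async () => {
  // Session-level: the lock belongs to whichever pooled connection ran the query,
  // so the matching unlock may run on a different connection and quietly do nothing.
  await db.raw('SELECT pg_advisory_lock(?)', [lockKey]);
  await db.raw('SELECT pg_advisory_unlock(?)', [lockKey]); // unreliable with a pool

  // Transaction-level: held on the transaction's own connection and released
  // automatically at commit/rollback, which is why the change handlers now run in a transaction.
  await db.transaction(async trx => {
    await trx.raw('SELECT pg_advisory_xact_lock(?)', [lockKey]);
    // ...change handler work happens here, protected from concurrent executions...
  });
};
```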
Also made an effort to clean up instances where we have been explicitly calling `AnalyticsRefresher.refreshAnalytics()` where we should have been leaving it up to the change handler. See more in the PR comments.