Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RN-635: Introduce database locking mechanism to prevent change handlers executing concurrently #4139

Merged
merged 11 commits into from
Sep 13, 2022

Conversation

rohan-bes
Copy link
Collaborator

Issue RN-635:

Ever since we switched to the blue/green deployment system, we've created a situation where it's quite common for multiple server instances to point to a single database. This can create issues when the database change handlers for each server execute concurrently for the same change. Issues are discussed further in the ticket.

To prevent this from occurring we need some kind of database level mechanism to limit the number of change handler executions to one at a time. Introducing Postgres Advisory Locks!

Advisory locks seems like the perfect tool for the job, as our change handlers may operate over multiple tables (which we don't want to lock on) and it's really just concurrent executions of the change handlers that we care about (we don't care to lock anything else out).

In order to get the advisory locking working I had to re-write our change handlers to execute in a transaction. It seemed like the only way to get the locking mechanism to work, as the knex connection pool means that you can't reliably unlock session level advisory locks. Also, running our change handlers in a transaction seems like a good move regardless... so I consider it a win-win 😄

Also made an effort to clean up instances where we have been explicitly calling AnalyticsRefresher.refreshAnalytics() where we should have been leaving it up to the change handler. See more in the PR comments.

@@ -69,6 +69,8 @@ export async function postChanges(req, res) {
await ACTION_HANDLERS[action](models, translatedPayload);

if (action === SUBMIT_SURVEY_RESPONSE) {
// TODO: Rework this functionality, as directly calling an analytics refresh here is both inefficient
// and may create duplicate records in the analytics table
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the explicit call to AnalyticsRefresher.refreshAnalytics() here is dangerous, as it's not wrapped in an advisory lock.

We could consider ways to handle this (eg. move the locking down to the implemented change handler level, use a different mechanism to listen for analytics refreshes, etc.)

However, I sort of wanna call into question if we should be doing this at all...
I know why we did this, to avoid bugs in PSSS where a user would submit data and then wouldn't immediately see it. However, as our system grows in complexity, there are more and more processes that occur asynchronously after data is submitted before it's integrated into the system (analytics refresh, survey response outdate, etc.)
If we wanna support seemless data integration I feel a deeper rework is required to move to an events based system and have the caller listen to some kind of 'data_submission_success' event.

So putting it out there that we just wear this bug for now, if reviewer agrees then I'll put it to Erin 👍

@@ -132,13 +132,13 @@ export class SurveyResponseOutdater extends ChangeHandler {
* This method processes the changed records and ensures that all responses in the affected
* dimension combos get a correct `outdated` status
*/
async handleChanges(changedResponses) {
const surveysById = await this.fetchSurveysById(changedResponses);
async handleChanges(transactionModels, changedResponses) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this pattern of passing in transactionModels and passing it all over the place is kinda awkward huh?

I don't love it, it really makes things look really bloated. But I also don't love some of the other options that come to mind:

  1. A transactionModels property on the class, which ChangeHandler sets prior to calling handleChanges() and removes afterwards. I don't like that it's unclear when transactionModels is set from the child class's perspective.
  2. Similar to above but replacing the models property with the transactionModels. I don't like that it hides the reality of this work being done in a transaction (it's important for the developer to know imo)

Overall I think I do like this method most... but still really don't love it 😅 Any other ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh haha same thing as my last comment! I was wondering about capturing everything that happens with a transacting models in a class, which I think works in most cases?

const start = Date.now();
await driver.runSql(`SELECT mv$refreshMaterializedView('analytics', 'public', true);`);
const end = Date.now();
console.log(`Analytics refresh took: ${end - start}ms`);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this as now we run the migrations via the central-server so the AnalyticsRefresher should be able to handle this 👍

@@ -35,13 +35,16 @@ import {
HIERARCHY_WIND_AFTER_CANONICAL_TYPES_CHANGED,
} from './EntityHierarchyCacher.fixtures';

const TEST_DEBOUNCE_TIME = 50; // short debounce time so tests run more quickly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you sooo much for these excellent tests @edmofro!! They caught so many bugs that I created when re-writing the EntityHierarchyCacher. Truly would have been a nightmare without them!

}
// eslint-disable-next-line no-bitwise
return hash >>> 0;
};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah @tupaia/tsutils... Originally I had imagined this package would be for typescript specific code (eg. type guards, useful generic types, etc.)

However, at this point it really seems like it's just the same as @tupaia/utils, and honestly I'd much rather create new utils code in typescript than in javascript.

I've created a ticket to merge these two packages into a single @tupaia/utils package which is written in typescript

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of small but dense algorithm is the sort of thing I like unit tests around. Not sure if we even care about the expected output, but perhaps we care that e.g. it's a somewhat sensible output, the same input always hashes to the same int, and we don't get conflicts between similar inputs? Totally not required if we trust the source of the algo :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehehe I had actually been toying with how to test this in the back of my mind for a while.

The properties we want the function to hold are:
a) It will convert any string into a 64bit integer
b) It will always convert the same string into the same integer

It's the kind of function that would really benefit from a concept I've explored before, property based testing.

It might give that a try, otherwise a simple 'generate random string, hash it a bunch of a times and ensure they're all equal' should do the trick? 😉

const REBUILD_DEBOUNCE_TIME = 1000; // wait 1 second after changes before rebuilding, to avoid double-up

export class EntityHierarchyCacher {
export class EntityHierarchyCacher extends ChangeHandler {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't realised that EntityHierarchyCacher wasn't a ChangeHandler when I took on this work 😅

It was a bit of a tricky refactor, but the unit tests really saved my life. It's good to have consistency, and I feel things are a bit simpler now that it doesn't do any of the promise management stuff itself!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much simpler, I love it! Thanks for doing this. Also really like that the change handlers are no longer async, that was clever.

One thing that feels a little like clutter is passing transactingModels so many places. I'm not sure if it would make things more or less clear, but did you think of trying a separate class that handles processing a rebuild, and gets constructed at that time with the transactingModels so each function can still refer to this.transactingModels?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good call, I'll break it out and see how it looks 👍

@rohan-bes rohan-bes force-pushed the rn-635-duplicates-in-derived-data branch from b5b8deb to 3f53587 Compare September 2, 2022 02:16
- Rewrote ChangeHandlers to execute changes in a transaction
- Now each change handler acquires a transactional advisory lock on the database while it executes
- This prevents multiple services from executing the change handlers concurrently
@rohan-bes rohan-bes force-pushed the rn-635-duplicates-in-derived-data branch from 3f53587 to 19dff97 Compare September 2, 2022 02:23
@rohan-bes rohan-bes force-pushed the rn-635-duplicates-in-derived-data branch from 19dff97 to d80ce80 Compare September 2, 2022 02:59
Copy link
Member

@edmofro edmofro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really tidy mechanism, thanks for doing the research to work it out!

/**
* Acquires an advisory lock for the current transaction
* Lock will be immediately released once the transaction ends
* (https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment!

@@ -47,10 +46,10 @@ export class MeditrakSyncQueue extends ChangeHandler {
this.changeTranslators.survey = change => [change];
}

async refreshPermissionsBasedView() {
async refreshPermissionsBasedView(transactionModels) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use transactingModels elsewhere?

Suggested change
async refreshPermissionsBasedView(transactionModels) {
async refreshPermissionsBasedView(transactingModels) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do... I'll fix that up 👍 though I do find the term transactingModels a tad confusing...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fair, it's a bit of a weird one. Shall we change them all to modelsInTransaction?

const REBUILD_DEBOUNCE_TIME = 1000; // wait 1 second after changes before rebuilding, to avoid double-up

export class EntityHierarchyCacher {
export class EntityHierarchyCacher extends ChangeHandler {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much simpler, I love it! Thanks for doing this. Also really like that the change handlers are no longer async, that was clever.

One thing that feels a little like clutter is passing transactingModels so many places. I'm not sure if it would make things more or less clear, but did you think of trying a separate class that handles processing a rebuild, and gets constructed at that time with the transactingModels so each function can still refer to this.transactingModels?

@@ -132,13 +132,13 @@ export class SurveyResponseOutdater extends ChangeHandler {
* This method processes the changed records and ensures that all responses in the affected
* dimension combos get a correct `outdated` status
*/
async handleChanges(changedResponses) {
const surveysById = await this.fetchSurveysById(changedResponses);
async handleChanges(transactionModels, changedResponses) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh haha same thing as my last comment! I was wondering about capturing everything that happens with a transacting models in a class, which I think works in most cases?

}
// eslint-disable-next-line no-bitwise
return hash >>> 0;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of small but dense algorithm is the sort of thing I like unit tests around. Not sure if we even care about the expected output, but perhaps we care that e.g. it's a somewhat sensible output, the same input always hashes to the same int, and we don't get conflicts between similar inputs? Totally not required if we trust the source of the algo :-)

- EntityHierarchyCacher acts as the ChangeHandler
- EntityHierarchySubtreeRebuilder does the actual subtree rebuild logic
- SurveyResponseOutdater acts as the ChangeHandler
- OutdatedResponseFlagger flags the new outdated responses or a given survey
- MeditrakSyncQueue acts as the ChangeHandler
- MeditrakSyncRecordUpdater ensures the sync queue records are updated when new changes come in
@rohan-bes rohan-bes requested a review from edmofro September 9, 2022 07:38
Copy link
Member

@edmofro edmofro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

@rohan-bes
Copy link
Collaborator Author

@edmofro did a bit more exploring into this one today.

So it turns out, in the case of the EntityHierarchyCacher atleast, just wrapping the change handler behaviour in a transaction is sufficient to avoid the duplicate entries from appearing. I guess that it would come down to the details of how the logic works, but in this case the transaction keeps things consistent.

I'm still thinking of keeping the Advisory Lock stuff in place though, as it seems safer overall. Postgres's default transaction isolation level is READ COMMITTED. In theory it could be possible for two concurrent transactions to interfere with one another, if one writes data to a table and commits it and the other is halfway through and then re-reads from that table.

@rohan-bes rohan-bes merged commit 33afab4 into dev Sep 13, 2022
@rohan-bes rohan-bes deleted the rn-635-duplicates-in-derived-data branch September 13, 2022 07:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants