RN-635: Introduce database locking mechanism to prevent change handlers executing concurrently #4139
Conversation
@@ -69,6 +69,8 @@ export async function postChanges(req, res) {
await ACTION_HANDLERS[action](models, translatedPayload);

if (action === SUBMIT_SURVEY_RESPONSE) {
  // TODO: Rework this functionality, as directly calling an analytics refresh here is both inefficient
  // and may create duplicate records in the analytics table
So the explicit call to `AnalyticsRefresher.refreshAnalytics()` here is dangerous, as it's not wrapped in an advisory lock.
We could consider ways to handle this (eg. move the locking down to the implemented change handler level, use a different mechanism to listen for analytics refreshes, etc.)
However, I sort of wanna call into question whether we should be doing this at all...
I know why we did this: to avoid bugs in PSSS where a user would submit data and then wouldn't immediately see it. However, as our system grows in complexity, there are more and more processes that occur asynchronously after data is submitted before it's integrated into the system (analytics refresh, survey response outdating, etc.)
If we wanna support seamless data integration I feel a deeper rework is required to move to an events-based system and have the caller listen to some kind of 'data_submission_success' event.
So putting it out there that we just wear this bug for now; if the reviewer agrees then I'll put it to Erin 👍
@@ -132,13 +132,13 @@ export class SurveyResponseOutdater extends ChangeHandler {
 * This method processes the changed records and ensures that all responses in the affected
 * dimension combos get a correct `outdated` status
 */
async handleChanges(changedResponses) {
  const surveysById = await this.fetchSurveysById(changedResponses);
async handleChanges(transactionModels, changedResponses) {
So this pattern of passing in `transactionModels` and passing it all over the place is kinda awkward huh?
I don't love it, it really makes things look really bloated. But I also don't love some of the other options that come to mind:
- A `transactionModels` property on the class, which `ChangeHandler` sets prior to calling `handleChanges()` and removes afterwards (sketched just below). I don't like that it's unclear when `transactionModels` is set from the child class's perspective.
- Similar to above but replacing the `models` property with the `transactionModels`. I don't like that it hides the reality of this work being done in a transaction (it's important for the developer to know imo)

Overall I think I do like this method most... but still really don't love it 😅 Any other ideas?
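To make the trade-off concrete, here is a rough sketch of that first alternative (a property the base class sets around the call). It is purely illustrative; the class, method names, and signatures are assumptions rather than the real `ChangeHandler` API.

```ts
// Sketch only: the "transactionModels property" alternative discussed above.
// Names and signatures are assumptions, not the real ChangeHandler API.
type Models = unknown; // stand-in for the Tupaia model registry

abstract class ChangeHandlerSketch {
  protected transactionModels: Models | null = null;

  // child classes implement this and read this.transactionModels
  protected abstract handleChanges(changes: unknown[]): Promise<void>;

  async executeChanges(transactionModels: Models, changes: unknown[]): Promise<void> {
    this.transactionModels = transactionModels; // set just before the child runs
    try {
      await this.handleChanges(changes);
    } finally {
      // removed afterwards; from the child's perspective it's unclear when this is safe to read
      this.transactionModels = null;
    }
  }
}
```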
Oh haha same thing as my last comment! I was wondering about capturing everything that happens with the transacting models in a class, which I think works in most cases?
const start = Date.now();
await driver.runSql(`SELECT mv$refreshMaterializedView('analytics', 'public', true);`);
const end = Date.now();
console.log(`Analytics refresh took: ${end - start}ms`);
Removed this as we now run the migrations via the central-server, so the `AnalyticsRefresher` should be able to handle this 👍
@@ -35,13 +35,16 @@ import {
  HIERARCHY_WIND_AFTER_CANONICAL_TYPES_CHANGED,
} from './EntityHierarchyCacher.fixtures';

const TEST_DEBOUNCE_TIME = 50; // short debounce time so tests run more quickly
Thank you sooo much for these excellent tests @edmofro!! They caught so many bugs that I created when re-writing the `EntityHierarchyCacher`. Truly would have been a nightmare without them!
}
// eslint-disable-next-line no-bitwise
return hash >>> 0;
};
Ah `@tupaia/tsutils`... Originally I had imagined this package would be for typescript-specific code (eg. type guards, useful generic types, etc.)
However, at this point it really seems like it's just the same as `@tupaia/utils`, and honestly I'd much rather create new utils code in typescript than in javascript.
I've created a ticket to merge these two packages into a single `@tupaia/utils` package which is written in typescript.
This kind of small but dense algorithm is the sort of thing I like unit tests around. Not sure if we even care about the expected output, but perhaps we care that e.g. it's a somewhat sensible output, the same input always hashes to the same int, and we don't get conflicts between similar inputs? Totally not required if we trust the source of the algo :-)
Hehehe I had actually been toying with how to test this in the back of my mind for a while.
The properties we want the function to hold are:
a) It will convert any string into a 64bit integer
b) It will always convert the same string into the same integer
It's the kind of function that would really benefit from a concept I've explored before: property-based testing.
I might give that a try, otherwise a simple 'generate a random string, hash it a bunch of times and ensure they're all equal' should do the trick? 😉
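For what it's worth, a minimal Jest-style sketch of that "hash it a bunch of times and ensure they're all equal" check; `hashStringToInt` and its module path are placeholder names for the hashing util in the diff above, not the real export.

```ts
// Determinism check sketch; `hashStringToInt` is a placeholder for the real hashing util.
import { hashStringToInt } from './hashStringToInt';

// build a random printable-ASCII string of the given length
const randomString = (length = 32): string =>
  Array.from({ length }, () => String.fromCharCode(32 + Math.floor(Math.random() * 95))).join('');

describe('hashStringToInt', () => {
  it('always hashes the same input to the same integer', () => {
    const input = randomString();
    const results = Array.from({ length: 100 }, () => hashStringToInt(input));
    expect(new Set(results).size).toBe(1); // every run produced the identical hash
  });

  it('produces a non-negative integer for arbitrary strings', () => {
    const hash = hashStringToInt(randomString());
    expect(Number.isInteger(hash)).toBe(true);
    expect(hash).toBeGreaterThanOrEqual(0);
  });
});
```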
const REBUILD_DEBOUNCE_TIME = 1000; // wait 1 second after changes before rebuilding, to avoid double-up

export class EntityHierarchyCacher {
export class EntityHierarchyCacher extends ChangeHandler {
I hadn't realised that `EntityHierarchyCacher` wasn't a `ChangeHandler` when I took on this work 😅
It was a bit of a tricky refactor, but the unit tests really saved my life. It's good to have consistency, and I feel things are a bit simpler now that it doesn't do any of the promise management stuff itself!
So much simpler, I love it! Thanks for doing this. Also really like that the change handlers are no longer async, that was clever.
One thing that feels a little like clutter is passing `transactingModels` to so many places. I'm not sure if it would make things more or less clear, but did you think of trying a separate class that handles processing a rebuild, and gets constructed at that time with the `transactingModels`, so each function can still refer to `this.transactingModels`?
Yeah that's a good call, I'll break it out and see how it looks 👍
Force-pushed from b5b8deb to 3f53587
- Rewrote ChangeHandlers to execute changes in a transaction
- Now each change handler acquires a transactional advisory lock on the database while it executes
- This prevents multiple services from executing the change handlers concurrently
Force-pushed from 3f53587 to 19dff97
…grations are run by the central-server
- This means the AnalyticsRefresher should be able to ensure analytics are refreshed
Force-pushed from 19dff97 to d80ce80
This is a really tidy mechanism, thanks for doing the research to work it out!
/**
 * Acquires an advisory lock for the current transaction
 * Lock will be immediately released once the transaction ends
 * (https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS)
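For readers unfamiliar with the mechanism, acquiring a transaction-level advisory lock can look roughly like this with a knex-style transaction object; the function name and the way the lock key is derived are assumptions, not the actual implementation in this PR.

```ts
// Sketch of acquiring a Postgres transaction-level advisory lock via knex.
// The function name and lock key derivation are assumptions.
import { Knex } from 'knex';

export const acquireAdvisoryLockForTransaction = async (
  transaction: Knex.Transaction,
  lockKey: number, // e.g. an integer hash of the change handler's name
): Promise<void> => {
  // pg_advisory_xact_lock blocks until the lock is free; Postgres releases it
  // automatically when the surrounding transaction commits or rolls back
  await transaction.raw('SELECT pg_advisory_xact_lock(?)', [lockKey]);
};
```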
Great comment!
@@ -47,10 +46,10 @@ export class MeditrakSyncQueue extends ChangeHandler {
  this.changeTranslators.survey = change => [change];
}

async refreshPermissionsBasedView() {
async refreshPermissionsBasedView(transactionModels) {
We use `transactingModels` elsewhere?

- async refreshPermissionsBasedView(transactionModels) {
+ async refreshPermissionsBasedView(transactingModels) {
We do... I'll fix that up 👍 though I do find the term `transactingModels` a tad confusing...
Yeah that's fair, it's a bit of a weird one. Shall we change them all to `modelsInTransaction`?
- EntityHierarchyCacher acts as the ChangeHandler
- EntityHierarchySubtreeRebuilder does the actual subtree rebuild logic

- SurveyResponseOutdater acts as the ChangeHandler
- OutdatedResponseFlagger flags the new outdated responses for a given survey

- MeditrakSyncQueue acts as the ChangeHandler
- MeditrakSyncRecordUpdater ensures the sync queue records are updated when new changes come in
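For anyone following the thread, a rough sketch of the shape this split takes: the change handler stays thin and constructs a helper per run with the transacting models, so the helper's methods can refer to `this.models`. Only the class names come from the commit messages above; the internals and signatures here are assumptions.

```ts
// Illustrative sketch of the ChangeHandler / helper-class split described in the commits above.
// Everything other than the class names is an assumption, not the actual code.
type Models = unknown; // stand-in for the Tupaia model registry

class EntityHierarchySubtreeRebuilder {
  constructor(private readonly models: Models) {}

  async rebuildSubtrees(hierarchyIds: string[]): Promise<void> {
    // every method here can refer to this.models, so there's no need to thread a
    // transactionModels parameter through each internal call
    // ...actual subtree rebuild logic lives here...
  }
}

class EntityHierarchyCacher /* extends ChangeHandler */ {
  async handleChanges(transactionModels: Models, changedHierarchyIds: string[]): Promise<void> {
    const rebuilder = new EntityHierarchySubtreeRebuilder(transactionModels);
    await rebuilder.rebuildSubtrees(changedHierarchyIds);
  }
}
```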
lgtm, thanks!
@edmofro did a bit more exploring into this one today. So it turns out, in the case of the […]

I'm still thinking of keeping the Advisory Lock stuff in place though, as it seems safer overall. Postgres's default transaction isolation level is READ COMMITTED, so in theory it could be possible for two concurrent transactions to interfere with one another, if one writes data to a table and commits it while the other is halfway through and then re-reads from that table.
Issue RN-635:
Ever since we switched to the blue/green deployment system, we've created a situation where it's quite common for multiple server instances to point to a single database. This can create issues when the database change handlers for each server execute concurrently for the same change. Issues are discussed further in the ticket.
To prevent this from occurring we need some kind of database level mechanism to limit the number of change handler executions to one at a time. Introducing Postgres Advisory Locks!
Advisory locks seem like the perfect tool for the job, as our change handlers may operate over multiple tables (which we don't want to lock) and it's really just concurrent executions of the change handlers that we care about (we don't need to lock anything else out).
In order to get the advisory locking working I had to rewrite our change handlers to execute in a transaction. It seemed like the only way to get the locking mechanism to work, as the knex connection pool means that you can't reliably unlock session-level advisory locks. Also, running our change handlers in a transaction seems like a good move regardless... so I consider it a win-win 😄
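A small sketch of why the transaction matters, assuming a plain knex pool (the connection string and lock key are placeholders): with a pool, a session-level lock/unlock pair can land on different connections, whereas a transaction-level lock pins itself to the transaction's connection and is released automatically.

```ts
// Sketch contrasting session-level and transaction-level advisory locks under a knex pool.
// The connection string and lock key are placeholders.
import knex from 'knex';

const db = knex({ client: 'pg', connection: 'postgres://localhost/tupaia' /* placeholder */ });
const lockKey = 12345; // placeholder advisory lock key

const demo = async () => {
  // Session-level: the lock belongs to whichever pooled connection ran the query,
  // so the matching unlock may run on a different connection and quietly do nothing.
  await db.raw('SELECT pg_advisory_lock(?)', [lockKey]);
  await db.raw('SELECT pg_advisory_unlock(?)', [lockKey]); // unreliable with a pool

  // Transaction-level: held on the transaction's own connection and released
  // automatically at commit/rollback, which is why the change handlers now run in a transaction.
  await db.transaction(async trx => {
    await trx.raw('SELECT pg_advisory_xact_lock(?)', [lockKey]);
    // ...change handler work happens here, protected from concurrent executions...
  });
};
```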
Also made an effort to clean up instances where we have been explicitly calling `AnalyticsRefresher.refreshAnalytics()` where we should have been leaving it up to the change handler. See more in the PR comments.