Fix initial database connection issue #348

WadeBarnes · 2022-03-01T17:38:50Z

If the issuer kit database (mongo) is unavailable for a period of time when the issuer kit api attempts to connect (https://github.com/bcgov/issuer-kit/blob/main/api/src/app.ts#L38), a connection error can be thrown by the MongoClient driver (https://github.com/bcgov/issuer-kit/blob/main/api/src/mongodb.ts), resulting in the following log message:

error: Unhandled Rejection at: Promise  {"name":"MongoServerSelectionError","reason":{"type":"Single","setName":null,"maxSetVersion":null,"maxElectionId":null,"servers":{},"stale":false,"compatible":true,"compatibilityError":null,"logicalSessionTimeoutMinutes":null,"heartbeatFrequencyMS":10000,"localThresholdMS":15,"commonWireVersion":null}}

At this stage the driver has failed to make an initial connection to the database and will never attempt to connect with the database again. This renders the api completely inoperable. It manifests as an inability to issue credentials to a wallet: the initial connections are made, but the credential issuing flow appears to hang on the wallet and issuer-web side due to this error. In contrast, when the api is able to make the initial connection, the database can become unavailable and available again and the driver is able to reconnect. It is only the initial connection that appears critical.

This issue occurs in our OpenShift environments during rollouts and evacuations because the API startup process is much faster than the database's startup process. Therefore, in the majority of cases the API has started while the database is still unavailable.

There are a couple of ways we can deal with this:

1. Have the api shut down when it is unable to connect with the database, logging the reason for the shutdown (the fact it could not connect). In addition, the serverSelectionTimeoutMS parameter (which defaults to 30 seconds) could be made configurable so it can be adjusted to give the connection attempt a bit more time. In OpenShift the shutdown will cause a new pod to be started, which will try to connect to the database.
2. Come up with a retry process, where the api would attempt the initial connection a number of times before finally failing. This looks like it becomes complicated quickly and may have other side effects.
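
For comparison, the retry option could be sketched roughly as below. This is only an illustration under assumptions: `connectWithRetry`, `maxAttempts`, and `delayMs` are hypothetical names, not part of issuer-kit, and `connect` stands in for whatever function performs the initial MongoClient connection. It does not address the side effects mentioned above, such as the api accepting requests while still retrying.

```typescript
// Hypothetical helper (not part of issuer-kit): try an async connect function
// a bounded number of times, waiting a fixed delay between attempts.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      console.warn(`Database connection attempt ${attempt}/${maxAttempts} failed`);
      if (attempt < maxAttempts) {
        // Wait before the next attempt; no wait after the final failure.
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError ?? new Error("No connection attempts were made");
}
```

The caller still has to decide what to do once the last attempt fails, which leads back to the shutdown behaviour of the first option.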

Therefore it looks like the most appropriate way to handle this issue is using the first option.
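
A minimal sketch of the first option follows, assuming the timeout is supplied in seconds (for example through an environment variable) and converted to the milliseconds that serverSelectionTimeoutMS expects. The names `serverSelectionTimeoutMs`, `connectOrShutDown`, `connect`, and `shutDown` are hypothetical and not taken from the issuer-kit code; the connect and shutdown functions are passed in so the sketch runs without the driver installed.

```typescript
// Hypothetical helper: read a timeout given in seconds and convert it to
// milliseconds, falling back to the driver's 30 second default when the
// value is missing or not a positive number.
function serverSelectionTimeoutMs(raw: string | undefined, defaultMs = 30000): number {
  const seconds = Number(raw);
  if (raw === undefined || raw.trim() === "" || !Number.isFinite(seconds) || seconds <= 0) {
    return defaultMs;
  }
  return Math.round(seconds * 1000);
}

// Hypothetical helper: attempt the initial connection once; on failure, log
// the reason and ask the process to shut down with a non-zero exit code.
async function connectOrShutDown<T>(
  connect: (timeoutMs: number) => Promise<T>,
  timeoutMs: number,
  shutDown: (exitCode: number) => void,
): Promise<T | undefined> {
  try {
    return await connect(timeoutMs);
  } catch (err) {
    console.error("Database connection failed", err);
    shutDown(1);
    return undefined;
  }
}
```

In the api, `connect` would presumably wrap the MongoClient connection call and pass `timeoutMs` as the serverSelectionTimeoutMS option, and `shutDown` would be `process.exit`, so that OpenShift restarts the pod.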

WadeBarnes · 2022-03-15T14:31:11Z

@wadeking98, the changes in PR #351 don't seem to be working as expected. The initial server selection timeout does not appear to be configurable. I can set the SERVER_SELECTION_TIMEOUT environment variable to 300 seconds (5 minutes), and the initial database connection will still time out after 30 seconds when the database is not available.

Error message is:

Database connection failed MongoServerSelectionError: connect ECONNREFUSED 10.98.165.109:27017
    at Timeout._onTimeout (/opt/app-root/src/node_modules/mongodb/lib/core/sdam/topology.js:438:30)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7) {
  reason: TopologyDescription {
    type: 'Single',
    setName: null,
    maxSetVersion: null,
    maxElectionId: null,
    servers: Map { 'issuer-kit-db:27017' => [ServerDescription] },
    stale: false,
    compatible: true,
    compatibilityError: null,
    logicalSessionTimeoutMinutes: null,
    heartbeatFrequencyMS: 10000,
    localThresholdMS: 15,
    commonWireVersion: null
  }
}
npm info lifecycle api@1.0.0~start: Failed to exec start script
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! api@1.0.0 start: `node lib/`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the api@1.0.0 start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm timing npm Completed in 34098ms

npm ERR! A complete log of this run can be found in:
npm ERR!     /opt/app-root/src/.npm/_logs/2022-03-15T14_29_04_436Z-debug.log

esune · 2022-03-15T16:23:12Z

@WadeBarnes a custom /health endpoint that returns failure and triggers a redeploy might help here. When throwing the db connection exception you could set an app variable that drives which result (success or failure) the endpoint returns when it is called.

WadeBarnes · 2022-03-16T20:51:48Z

@wadeking98, @esune's suggestion is a good one. It would resolve other issues, like the pod being made available before the database connection is actually established, and would give k8s more control over the app in general. The current solution helps, but the pod will still appear to be running before a database connection is available.

On startup the /health endpoint would return a 503 error until the initial database connection is established. Once it is established, the /health endpoint would return a 200.

This signaling would replace the app shutdown on initial database connection failure.
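
As a rough sketch of that signaling, assuming a simple in-memory flag: `DatabaseReadiness`, `healthHandler`, and `ResponseLike` are hypothetical names, and `ResponseLike` only mirrors the two members of Node's `http.ServerResponse` that the handler uses, so the example runs without a web framework.

```typescript
// Minimal response shape for this sketch (hypothetical); it mirrors the
// statusCode property and end() method of Node's http.ServerResponse.
interface ResponseLike {
  statusCode: number;
  end(body?: string): void;
}

// Hypothetical readiness flag, flipped once the initial database connection
// has been established.
class DatabaseReadiness {
  private connected = false;

  markConnected(): void {
    this.connected = true;
  }

  isConnected(): boolean {
    return this.connected;
  }
}

// Build a /health handler: 503 until the database is connected, 200 afterwards.
function healthHandler(readiness: DatabaseReadiness) {
  return (_req: unknown, res: ResponseLike): void => {
    const ready = readiness.isConnected();
    res.statusCode = ready ? 200 : 503;
    res.end(ready ? "ok" : "database unavailable");
  };
}
```

The api would call `markConnected()` after the initial connection succeeds, and the pod's health probe would point at /health so the pod is not treated as available before then.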

An easy feature for you to add?

wadeking98 · 2022-03-16T21:56:28Z

@WadeBarnes did you still want this feature now that we've fixed the timeout issue?

WadeBarnes · 2022-03-16T23:26:43Z

Yes please

WadeBarnes assigned wadeking98 Mar 1, 2022

wadeking98 mentioned this issue Mar 14, 2022

added sleep env variable for db connect timeout #351

Merged

WadeBarnes closed this as completed in #351 Mar 15, 2022

WadeBarnes reopened this Mar 15, 2022

wadeking98 closed this as completed Mar 23, 2022
