This repository has been archived by the owner on Jul 12, 2024. It is now read-only.

Fix initial database connection issue #348

Closed
WadeBarnes opened this issue Mar 1, 2022 · 5 comments · Fixed by #351

@WadeBarnes
Member

If the issuer kit database (mongo) is unavailable when the issuer kit api attempts to connect (https://github.com/bcgov/issuer-kit/blob/main/api/src/app.ts#L38), the MongoClient driver (https://github.com/bcgov/issuer-kit/blob/main/api/src/mongodb.ts) can throw a connection error, resulting in the following log message:

error: Unhandled Rejection at: Promise  {"name":"MongoServerSelectionError","reason":{"type":"Single","setName":null,"maxSetVersion":null,"maxElectionId":null,"servers":{},"stale":false,"compatible":true,"compatibilityError":null,"logicalSessionTimeoutMinutes":null,"heartbeatFrequencyMS":10000,"localThresholdMS":15,"commonWireVersion":null}}

At this stage the driver has failed to make an initial connection to the database and will never attempt to connect to it again. This renders the api completely inoperable. It manifests as an inability to issue credentials to a wallet: the initial connections are made, but the credential issuing flow appears to hang on the wallet and issuer-web side due to this error. In contrast, when the api is able to make the initial connection, the database can become unavailable and available again and the driver is able to reconnect. It is only the initial connection that appears critical.

This issue occurs in our OpenShift environments during rollouts and evacuations because the API's startup process is much faster than the database's. In the majority of cases, therefore, the API has started while the database is still unavailable.

There are a couple ways we can deal with this:

  1. Have the api shut down when it is unable to connect to the database, logging the reason for the shutdown (the fact it could not connect). In addition, the serverSelectionTimeoutMS parameter (which defaults to 30 seconds) could be made configurable so it can be adjusted to give the connection attempt a bit more time. In OpenShift the shutdown will cause a new pod to be started, which will try to connect to the database again.
  2. Come up with a retry process, where the api would attempt the initial connection a number of times before finally failing. This looks like it becomes complicated quickly and may have other side effects.

Therefore it looks like the most appropriate way to handle this issue is using the first option.
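
As a rough illustration of the first option, here is a minimal sketch, not the actual issuer-kit code. The helper names `resolveServerSelectionTimeoutMs` and `connectOrExit` are hypothetical, the `SERVER_SELECTION_TIMEOUT` variable is assumed to hold seconds, and the `connect` callback stands in for a call such as `MongoClient.connect(url, { serverSelectionTimeoutMS })`. The environment and the exit function are passed in so the logic can be exercised without a database.

```typescript
// Driver default for serverSelectionTimeoutMS, as cited in the issue (30 seconds).
const DEFAULT_SERVER_SELECTION_TIMEOUT_MS = 30000;

// Assumption: SERVER_SELECTION_TIMEOUT is expressed in seconds.
// Missing, empty, non-numeric, or non-positive values fall back to the default.
function resolveServerSelectionTimeoutMs(
  env: Record<string, string | undefined>
): number {
  const raw = env["SERVER_SELECTION_TIMEOUT"];
  const seconds = Number(raw);
  if (
    raw === undefined ||
    raw.trim() === "" ||
    !Number.isFinite(seconds) ||
    seconds <= 0
  ) {
    return DEFAULT_SERVER_SELECTION_TIMEOUT_MS;
  }
  return Math.round(seconds * 1000);
}

// Attempts the initial connection once. On failure it logs the reason and
// calls exit(1), so the orchestrator (OpenShift here) can start a fresh pod.
async function connectOrExit<T>(
  connect: (serverSelectionTimeoutMS: number) => Promise<T>,
  env: Record<string, string | undefined>,
  exit: (code: number) => void
): Promise<T | undefined> {
  const timeoutMs = resolveServerSelectionTimeoutMs(env);
  try {
    return await connect(timeoutMs);
  } catch (err) {
    console.error(
      `Database connection failed after waiting up to ${timeoutMs} ms:`,
      err
    );
    exit(1);
    return undefined;
  }
}
```

In the real app the caller would presumably pass `process.env` and `process.exit`; injecting them here just keeps the sketch independent of the driver and the runtime.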

@WadeBarnes
Member Author

@wadeking98, the changes in PR #351 don't seem to be working as expected. The initial server selection timeout does not appear to be configurable. I can set the SERVER_SELECTION_TIMEOUT environment variable to 300 seconds (5 minutes), and the initial database connection will still time out after 30 seconds when the database is not available.

Error message is:

Database connection failed MongoServerSelectionError: connect ECONNREFUSED 10.98.165.109:27017
    at Timeout._onTimeout (/opt/app-root/src/node_modules/mongodb/lib/core/sdam/topology.js:438:30)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7) {
  reason: TopologyDescription {
    type: 'Single',
    setName: null,
    maxSetVersion: null,
    maxElectionId: null,
    servers: Map { 'issuer-kit-db:27017' => [ServerDescription] },
    stale: false,
    compatible: true,
    compatibilityError: null,
    logicalSessionTimeoutMinutes: null,
    heartbeatFrequencyMS: 10000,
    localThresholdMS: 15,
    commonWireVersion: null
  }
}
npm info lifecycle api@1.0.0~start: Failed to exec start script
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! api@1.0.0 start: `node lib/`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the api@1.0.0 start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm timing npm Completed in 34098ms

npm ERR! A complete log of this run can be found in:
npm ERR!     /opt/app-root/src/.npm/_logs/2022-03-15T14_29_04_436Z-debug.log

@esune
Member

esune commented Mar 15, 2022

@WadeBarnes a custom /health endpoint that returns failure and triggers a redeploy might help here. When throwing the db connection exception you could set an app variable that determines which result (success or failure) the endpoint returns when called.

@WadeBarnes
Member Author

@wadeking98, @esune's suggestion is a good one. It would resolve other issues, like the pod being made available before the database connection is actually established, and would give k8s more control over the app in general. The current solution helps, but the pod will still appear to be running before a database connection is available.

On startup the /health endpoint would return a 503 error until the initial database connection is established. Once it is established, the /health endpoint would return a 200.

This signaling would replace the app shutdown on initial database connection failure.

An easy feature for you to add?
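
The signaling described above could look roughly like the following sketch. Only the /health path and the 503/200 status codes come from this thread; `DatabaseReadiness`, `HealthResponse`, and `healthResponse` are hypothetical names, and wiring the result into the actual web framework route is left out.

```typescript
// Tracks whether the initial database connection has been established.
// This plays the role of the "app variable" suggested earlier in the thread.
class DatabaseReadiness {
  private connected = false;

  markConnected(): void {
    this.connected = true;
  }

  markDisconnected(): void {
    this.connected = false;
  }

  isConnected(): boolean {
    return this.connected;
  }
}

interface HealthResponse {
  statusCode: 200 | 503;
  body: { database: "connected" | "unavailable" };
}

// Computes what a /health handler would return: 503 until the database
// connection is established, 200 afterwards.
function healthResponse(readiness: DatabaseReadiness): HealthResponse {
  return readiness.isConnected()
    ? { statusCode: 200, body: { database: "connected" } }
    : { statusCode: 503, body: { database: "unavailable" } };
}
```

A route handler would call `healthResponse(readiness)` and send the status code and body; the connection code would call `markConnected()` once the initial connection succeeds. A k8s readiness probe pointed at /health would then keep the pod out of rotation until the database is reachable.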

@wadeking98
Contributor

@WadeBarnes did you still want this feature now that we've fixed the timeout issue?

@WadeBarnes
Member Author

Yes please
