Abort HA Realization Logic After Timeout #800
After going over this with @lippserd, I reworked this PR. The changes are, in a nutshell, that the potentially blocking |
Please rebase and add a comment about COMMIT possibly not respecting context deadlines in ExecTx() and NamedExecTx().
A key finding of Icinga/icingadb#800 was that committing a transaction does not necessarily have to respect the context of the transaction, depending on the database driver. As @lippserd suggested there, I have added notes to the documentation of all relevant database functions.
Apart from the conflict, everything looks fine to me now. Sorry for that :)
Although I approved earlier, I now remember something I had previously forgotten. It's not about the code itself, but about the commit(s). The changes here should be split across multiple commits:
- Changes in main
- Increasing the "heartbeat from future" log to warning
- Explicit rollback
- Inserting the environment inside the transaction
- Introduction and use of LastMessageTime() (plus the corresponding godoc paragraph of realize)
- Executing commit in a separate goroutine (plus its godoc paragraph of realize)
The main loop select cases for hactx.Done() and ctx.Done() were unified, as hactx is derived from ctx. With two separate cases, a closed ctx could go unnoticed because select might have chosen the hactx case instead.
Timing issues may be the root of future failures. Thus, it is important to be aware if the timing seems to be out of sync.
Each transaction is created within the retryable function, but this function may be exited prematurely before committing. A deferred rollback ensures that the transaction will be rolled back and cleaned up in this case, or will be a noop when performed after the commit.
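A minimal sketch of the deferred rollback pattern follows. The `tx` interface and `fakeTx` type are hypothetical stand-ins so that the example runs without a database; with database/sql, calling Rollback() on an already committed `*sql.Tx` returns sql.ErrTxDone and changes nothing, which is the noop the message refers to.

```go
package main

import (
	"errors"
	"fmt"
)

// tx is a hypothetical stand-in for the subset of *sql.Tx used here.
type tx interface {
	Commit() error
	Rollback() error
}

// errTxDone mirrors the role of sql.ErrTxDone in database/sql.
var errTxDone = errors.New("transaction has already been committed or rolled back")

// fakeTx is a hypothetical in-memory transaction that records its outcome.
type fakeTx struct {
	done       bool
	rolledBack bool
}

func (t *fakeTx) Commit() error {
	if t.done {
		return errTxDone
	}
	t.done = true
	return nil
}

func (t *fakeTx) Rollback() error {
	if t.done {
		return errTxDone // noop after a successful commit
	}
	t.done = true
	t.rolledBack = true
	return nil
}

// attempt mimics one run of a retryable function. The deferred Rollback
// cleans up on every early return and is harmless after Commit.
func attempt(t tx, failEarly bool) error {
	defer func() { _ = t.Rollback() }()

	if failEarly {
		return errors.New("aborted before commit")
	}

	return t.Commit()
}

func main() {
	early := &fakeTx{}
	fmt.Println(attempt(early, true), early.rolledBack)

	ok := &fakeTx{}
	fmt.Println(attempt(ok, false), ok.rolledBack)
}
```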
The HA.insertEnvironment() method was inlined into the retryable function to use the deadlined context. Otherwise, this might block afterwards, as it was used within HA.realize(), but without the passed context.
Since the retryable HA function may be executed a few times before succeeding, a heartbeat value captured once up front may already be outdated by the time it is inserted. The heartbeat logic was slightly altered to always use the latest heartbeat time value.
A strange HA behavior was reported in #787 (Competing HA takeover results in both instances becoming active). The logs contained an entry of the previously active instance exiting the HA.realize() method successfully after 1m9s. This should not be possible, as the method's context has a deadline of one minute after the heartbeat was received. As it turns out, however, executing COMMIT on a database transaction is not necessarily bound to the transaction's context, so the call can outlive the deadline. To mitigate this, another context watch was introduced. This allows the instance to hand over immediately, while the other instance can take over because the heartbeat in the database has expired.
Perfect, thanks a lot @oxzi!
Shouldn't this pull request close issue #787?
This was the initial motivation, yes. However, as I was unable to reproduce the reported behavior, I cannot verify this. My intention was to make the observed bug impossible with this change, but maybe there is still another bug lurking around the corner. However, with a bit more trust in this change, I have linked this PR to the issue. Otherwise, we can reopen it.
Closes #787.