Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abort HA Realization Logic After Timeout #800

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

oxzi
Copy link
Member

@oxzi oxzi commented Sep 5, 2024

  • icingadb: Unify select cases for derived contexts
    The main loop select cases for hactx.Done() and ctx.Done() were unified, as hactx is a derived ctx. A closed ctx case may be lost as the hactx case could have been chosen.
  • HA: Increase log level for heartbeats from the future
    Timing issues may be the root of future failures. Thus, it is important to be aware if the timing seems to be out of sync.
  • HA: Deferred SQL Transaction Rollback
    Each transaction is created within the retryable function, but this function may be exited prematurely before committing. A deferred rollback ensures that the transaction will be rolled back and cleaned up in this case, or will be a noop when performed after the commit.
  • HA: Insert environment within retryable function
    The HA.insertEnvironment() method was inlined into the retryable function to use the deadlined context. Otherwise, this might block afterwards, as it was used within HA.realize(), but without the passed context.
  • HA/Heartbeat: Use last message's timestamp
    Since the retryable HA function may be executed a few times before succeeding, the inserted heartbeat value will be directly outdated. The heartbeat logic was slightly altered to always use the latest heartbeat time value.
  • HA: Abort Transaction Commit after Timeout
    A strange HA behavior was reported in Competing HA takeover results in both instances becoming active #787, resulting in both instances being active.
    The logs contained an entry of the previous active instance exiting the HA.realize() method successfully after 1m9s. This, however, should not be possible as the method's context is deadlined to a minute after the heartbeat was received.
    However, as it turns out, executing COMMIT on a database transaction is not bound to the transaction's context, allowing to survive longer. To mitigate this, another context watch was introduced. Doing so allows directly handing over, while the other instance can now take over due to the expired heartbeat in the database.

Closes #787.

@oxzi oxzi added the bug Something isn't working label Sep 5, 2024
@oxzi oxzi added this to the 1.3.0 milestone Sep 5, 2024
@cla-bot cla-bot bot added the cla/signed label Sep 5, 2024
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
@oxzi
Copy link
Member Author

oxzi commented Sep 23, 2024

After going over this with @lippserd, I reworked this PR. The changes are, in a nutshell, that the potentially blocking COMMIT query will be guarded with a select watching at the context and that always the latest heartbeat timestamp will be used, as otherwise an already heartbeat will be retried. The description in the initial post was updated accordingly.

@oxzi oxzi marked this pull request as ready for review September 23, 2024 13:19
@oxzi oxzi requested a review from lippserd September 23, 2024 13:20
Copy link
Member

@lippserd lippserd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase and add a comment about COMMIT possibly not respecting context deadlines in ExecTx() and NamedExecTx().

pkg/icingadb/ha.go Outdated Show resolved Hide resolved
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
@oxzi oxzi requested a review from lippserd September 24, 2024 12:00
oxzi added a commit to Icinga/icinga-go-library that referenced this pull request Sep 24, 2024
A key finding of Icinga/icingadb#800 was that committing a transaction
does not necessarily have to respect the context of the transaction,
depending on the database driver. As @lippserd suggested there, I have
added notes to the documentation of all relevant database functions.
lippserd
lippserd previously approved these changes Sep 30, 2024
oxzi added a commit to Icinga/icinga-go-library that referenced this pull request Oct 8, 2024
A key finding of Icinga/icingadb#800 was that committing a transaction
does not necessarily have to respect the context of the transaction,
depending on the database driver. As @lippserd suggested there, I have
added notes to the documentation of all relevant database functions.
pkg/icingaredis/heartbeat.go Outdated Show resolved Hide resolved
pkg/icingaredis/heartbeat.go Outdated Show resolved Hide resolved
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the conflict, everything looks fine to me now. Sorry for that :)

Copy link
Member

@lippserd lippserd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I approved earlier, I now have something that I had previously forgotten. It's not about the code itself, but about the commit(s). The changes here are subject to multiple commits:

  • Changes in main
  • Increasing the "heartbeat from future" log to warning
  • Explicit rollback
  • Inserting the environment inside the transaction
  • Introduction and use of LastMessageTime() (plus the corresponding godoc paragraph of realize)
  • Executing commit in a separate Go routine (plus its godoc paragraph of realize)

The main loop select cases for hactx.Done() and ctx.Done() were unified,
as hactx is a derived ctx. A closed ctx case may be lost as the hactx
case could have been chosen.
Timing issues may be the root of future failures. Thus, it is important
to be aware if the timing seems to be out of sync.
Each transaction is created within the retryable function, but this
function may be exited prematurely before committing. A deferred
rollback ensures that the transaction will be rolled back and cleaned up
in this case, or will be a noop when performed after the commit.
The HA.insertEnvironment() method was inlined into the retryable
function to use the deadlined context. Otherwise, this might block
afterwards, as it was used within HA.realize(), but without the passed
context.
Since the retryable HA function may be executed a few times before
succeeding, the inserted heartbeat value will be directly outdated. The
heartbeat logic was slightly altered to always use the latest heartbeat
time value.
A strange HA behavior was reported in #787, resulting in both instances
being active.

The logs contained an entry of the previous active instance exiting the
HA.realize() method successfully after 1m9s. This, however, should not
be possible as the method's context is deadlined to a minute after the
heartbeat was received.

However, as it turns out, executing COMMIT on a database transaction is
not bound to the transaction's context, allowing to survive longer. To
mitigate this, another context watch was introduced. Doing so allows
directly handing over, while the other instance can now take over due to
the expired heartbeat in the database.
Copy link
Member

@lippserd lippserd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks a lot @oxzi!

@yhabteab
Copy link
Member

Shouldn't this pull request close issue #787?

@oxzi oxzi linked an issue Oct 25, 2024 that may be closed by this pull request
@oxzi
Copy link
Member Author

oxzi commented Oct 25, 2024

Shouldn't this pull request close issue #787?

This was the initial motivation, yes. However, as I was unable to reproduce the reported behavior, I cannot verify this. My intention was to make the observed bug impossible with this change, but maybe there is still another bug lurking around the corner.

However, with a bit more trust in this change, I have linked this PR to the issue. Otherwise, we can reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cla/signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Competing HA takeover results in both instances becoming active
4 participants