
Dev: sbd: Adjust timeout related values #890

Merged

Conversation

liangxin1300 (Collaborator) commented Nov 23, 2021

Changes

  • Consolidate sbd timeout related methods/constants/formulas into class SBDTimeout

  • Adjust stonith-timeout value (see the sketch after this list); formulas are:

    stonith-timeout >= 1.2 * (pcmk_delay_max + msgwait)                             # for disk-based sbd
    stonith-timeout >= 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
    stonith-timeout >= max(STONITH_TIMEOUT_DEFAULT, token+consensus)                # for ALL situations

  • Adjust SBD_DELAY_START value; formulas are:

    SBD_DELAY_START = no                                               # for non-virtualization environments (the system default)
    SBD_DELAY_START >= (token + consensus + pcmk_delay_max + msgwait)  # for disk-based sbd
    SBD_DELAY_START >= (token + consensus + 2*SBD_WATCHDOG_TIMEOUT)    # for disk-less sbd

  • Adjust pcmk_delay_max:

    pcmk_delay_max=30        # only for a single stonith device in a 2-node cluster without qdevice
    delete pcmk_delay_max    # only for a single stonith device, when not in a 2-node cluster without qdevice

  • Dev: behave: Add functional tests for the changes above
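
A minimal, illustrative Python sketch of how these lower bounds combine as initially proposed (parameter names stand for values read from corosync.conf and the sbd disk metadata; the 60 s default is an assumption, and the thread below refines the formula further — this is not the crmsh implementation):

    def expected_stonith_timeout(disk_based, token, consensus, msgwait=0,
                                 pcmk_delay_max=0, stonith_watchdog_timeout=0,
                                 sbd_watchdog_timeout=0, stonith_timeout_default=60):
        """Combine the proposed stonith-timeout lower bounds into one value."""
        if disk_based:
            value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)
        else:
            value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2 * sbd_watchdog_timeout)
        # The third formula is a floor that applies in ALL situations:
        return int(max(value_from_sbd, stonith_timeout_default, token + consensus))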

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch 3 times, most recently from 7f08657 to 3e221bc Compare November 24, 2021 02:20
zzhou1 (Contributor) commented Nov 24, 2021

> stonith-timeout >= max(STONITH_TIMEOUT_DEFAULT, token+consensus)  # for other situations

It is not "for other situations". It is "for ALL situations".
e.g. there could be the situation (token+consensus) > (pcmk_delay_max + msgwait)
e.g. there could be the situation (token+consensus) > max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)

Also, enforce ">=" to leave room for manual changes made outside of crmsh. That is, don't use "=" to set the value directly, since that would likely override an intentional configuration by the sysadmin. A sketch of this rule follows below.
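
A minimal sketch of that "only raise, never lower" rule (get_property/set_property stand in for crmsh helpers; illustrative only, not the actual implementation):

    def raise_property_if_needed(name, calculated_sec, default_sec=0):
        """Only ever increase the property, keeping any larger value a sysadmin set."""
        current = get_property(name)  # assumed helper, e.g. returns "60s" or None
        current_sec = int(current.strip("s")) if current else default_sec
        if calculated_sec > current_sec:
            set_property(name, "{}s".format(calculated_sec))  # assumed helper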

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch 2 times, most recently from 57f1456 to e16e31f Compare November 25, 2021 03:11
zzhou1 (Contributor) left a comment


One major change is needed for the conversation around get_sbd_delay_start()

zzhou1 (Contributor) commented Nov 25, 2021

crm cluster remove tw3 works nicely so far.

Well, I expect the following alternative will do the same job.
crm node delete tw3

liangxin1300 (Collaborator, Author) commented Nov 25, 2021

> crm cluster remove tw3 works nicely so far.
>
> Well, I expect the following alternative will do the same job. crm node delete tw3

Do you mean that crm cluster remove <node> and crm node delete should share the same code and execute the same process?
Then we need another PR to do this.

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch 2 times, most recently from 116f7ab to 37addb7 Compare November 26, 2021 02:07
zzhou1 (Contributor) left a comment


Minor changes are expected to resolve conversations.

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 37addb7 to 619fcfc Compare November 26, 2021 04:17
zzhou1 (Contributor) commented Nov 26, 2021

>> crm cluster remove tw3 works nicely so far.
>> Well, I expect the following alternative will do the same job. crm node delete tw3
>
> Do you mean that crm cluster remove <node> and crm node delete should share the same code and execute the same process? Then we need another PR to do this.

If the code can be reused directly, that will be great. Behavior-wise, crm node delete <node> should adjust_pcmk_delay_max here. Maybe most of the other behavior should be the same too. Anyway, this is worth an issue to track.

gao-yan (Member) commented Nov 29, 2021

>> stonith-timeout >= max(STONITH_TIMEOUT_DEFAULT, token+consensus) # for other situations
>
> It is not "for other situations". It is "for ALL situations". e.g. there could be the situation (token+consensus) > (pcmk_delay_max + msgwait); e.g. there could be the situation (token+consensus) > max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)

I'm a little lost here...

Fencing starts (meaning the stonith timer starts) only when a new membership has been reformed, which is after "token + consensus" have timed out. Why does "token + consensus" matter for determining the value of stonith-timeout, or need to be compared with the time that will be spent on fencing?

wenningerk commented:

>> It is not "for other situations". It is "for ALL situations". e.g. there could be the situation (token+consensus) > (pcmk_delay_max + msgwait); e.g. there could be the situation (token+consensus) > max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)
>
> I'm a little lost here...
>
> Fencing starts (meaning the stonith timer starts) only when a new membership has been reformed, which is after "token + consensus" have timed out. Why does "token + consensus" matter for determining the value of stonith-timeout, or need to be compared with the time that will be spent on fencing?

That corresponds to the final conclusion Jan Friesse and I came to,
and is the reason why I finally didn't implement that test in sbd itself, as was done with sync-timeout.
https://bugzilla.redhat.com/show_bug.cgi?id=1877434 was closed a while ago.
Let me copy here a statement from the private part of that bug:

    After reconsideration it looks as if there are no relevant cases that would
    require a relationship between the token timeout and the sbd watchdog timeout
    (and the stonith-watchdog-timeout derived from that).

    All fencing actions are determined from the current CIB state. Due to
    corosync's virtual synchrony, there should be no situations where the
    target node is in the membership, and the DC sees a CIB change that
    requires fencing but the target node does not also see that change.
    Therefore, the DC and sbd on the target node should always be in sync
    as far as whether fencing is required, and it should happen within the
    sbd timeout.

Sorry for not interfering here earlier.

gao-yan (Member) commented Nov 29, 2021

> That corresponds to the final conclusion Jan Friesse and I came to, and is the reason why I finally didn't implement that test in sbd itself, as was done with sync-timeout. https://bugzilla.redhat.com/show_bug.cgi?id=1877434 was closed a while ago. Let me copy here a statement from the private part of that bug. [...]

Thanks for the information, @wenningerk!

IIUC it means that, for the case of diskless sbd, only one requirement in regard to SBD_WATCHDOG_TIMEOUT remains: a value of SBD_WATCHDOG_TIMEOUT is safe as long as it's longer than the qdevice sync timeout, if qdevice is used. Right?

wenningerk commented:

> Thanks for the information, @wenningerk!
>
> IIUC it means that, for the case of diskless sbd, only one requirement in regard to SBD_WATCHDOG_TIMEOUT remains: a value of SBD_WATCHDOG_TIMEOUT is safe as long as it's longer than the qdevice sync timeout, if qdevice is used. Right?

Yes, as of my current understanding at least ;-)

gao-yan (Member) commented Nov 29, 2021

>>> stonith-timeout >= max(STONITH_TIMEOUT_DEFAULT, token+consensus) # for other situations
>>
>> It is not "for other situations". It is "for ALL situations". [...]
>
> I'm a little lost here...
>
> Fencing starts (meaning the stonith timer starts) only when a new membership has been reformed, which is after "token + consensus" have timed out. Why does "token + consensus" matter for determining the value of stonith-timeout, or need to be compared with the time that will be spent on fencing?

@liangxin1300 @zzhou1 What's confusing me here is that you seem to be talking about determining stonith-timeout for non-sbd/generic cases? But token + consensus is not relevant even for the cases of sbd with or without a shared disk.

I mean, yes, token + consensus should be added into SBD_DELAY_START, but I don't see them as a factor of the generic stonith-timeout.

zzhou1 (Contributor) commented Nov 30, 2021

> @liangxin1300 @zzhou1 What's confusing me here is that you seem to be talking about determining stonith-timeout for non-sbd/generic cases? But token + consensus is not relevant even for the cases of sbd with or without a shared disk.
>
> I mean, yes, token + consensus should be added into SBD_DELAY_START, but I don't see them as a factor of the generic stonith-timeout.

@gao-yan I treat it as a generic situation. I observed this in my lab and refer to the detailed discussion here:
https://bugzilla.redhat.com/show_bug.cgi?id=1941108

Well, this triggers me to correlate with the "Pacemaker Explained" document: "It (stonith-timeout) has been replaced by the pcmk_reboot_timeout and pcmk_off_timeout properties." I might have overlooked stonith-timeout here. Maybe stonith-timeout is not used without SBD at all? Thanks for clarifying this.

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 0544ee9 to 487cd4f Compare November 30, 2021 08:56
gao-yan (Member) commented Nov 30, 2021

> @gao-yan I treat it as a generic situation. I observed this in my lab and refer to the detailed discussion here: https://bugzilla.redhat.com/show_bug.cgi?id=1941108

Ah, now I recall it's about the case where the fencing target is still technically in the membership when the fencing action returns, and how to prevent an unnecessary second fencing...

To address this case, token + consensus needs to be generally added into stonith-timeout.

> Well, this triggers me to correlate with the "Pacemaker Explained" document: "It (stonith-timeout) has been replaced by the pcmk_reboot_timeout and pcmk_off_timeout properties." I might have overlooked stonith-timeout here. Maybe stonith-timeout is not used without SBD at all? Thanks for clarifying this.

I think this is about the parameters of a fencing resource rather than the global stonith-timeout.

And of course any such parameter configured for a specific fencing resource takes precedence over the global one.

zzhou1 (Contributor) commented Dec 1, 2021

> I think this is about the parameters of a fencing resource rather than the global stonith-timeout.
>
> And of course any such parameter configured for a specific fencing resource takes precedence over the global one.

Super! This does clear my mind so far.

However, one more confusing bit of text, from man pacemaker-schedulerd, says: "This value (stonith-timeout) is not used by Pacemaker, but is kept for backward compatibility." ???

gao-yan (Member) left a comment


Thanks for the good work, @liangxin1300 !

    -        if SBDManager.is_delay_start():
    -            pacemaker_start_msg += "(waiting for sbd {}s)".format(SBDManager.get_suitable_sbd_systemd_timeout())
    +        if SBDTimeout.is_sbd_delay_start():
    +            pacemaker_start_msg += "(waiting for sbd {}s)".format(SBDTimeout.get_sbd_delay_start_sec_from_sysconfig())
The message would probably be better if it indicated that this is about delayed start? For example "delaying start of sbd for {}s".

Besides, have we lost the logic for disk-based sbd where, in case SBD_DELAY_START was enabled with a boolean true, msgwait would be retrieved from the disk metadata?

        if not res or int(res) < SBDTimeout.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE:
            sbd_watchdog_timeout_qdevice = SBDTimeout.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE
        SBDManager.update_configuration({"SBD_WATCHDOG_TIMEOUT": str(sbd_watchdog_timeout_qdevice)})
        utils.set_property(stonith_timeout=int(1.2*2*sbd_watchdog_timeout_qdevice))
Looks like something could be put into get_stonith_timeout_runtime() and commonly formulated there ...

crmsh/sbd.py

        self.disk_based = False
        self.sbd_watchdog_timeout = SBDTimeout.get_sbd_watchdog_timeout()
        self.stonith_watchdog_timeout = SBDTimeout.get_stonith_watchdog_timeout()
        self.sbd_delay_start_value_runtime = self.get_sbd_delay_start_runtime() if utils.detect_virt() else "no"
I've been kind of confused by the usage of the term "runtime" :-)

It's alright, but would it be clearer with something like "expected" rather than "runtime"?

        current_num = len(list_cluster_nodes())
        remove_num = 1 if removing else 0
        qdevice_num = 1 if is_qdevice_configured() else 0
        return (current_num - remove_num + qdevice_num) == 2
Could a 1-node cluster with qdevice fall into this case this way? Would it be more sensible to check current_num - remove_num and qdevice_num separately?
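
A hypothetical sketch of the separated check being suggested (names reused from the snippet above; illustrative only, not the actual fix):

    def is_2node_cluster_without_qdevice(removing=False):
        """Require exactly two real nodes and no qdevice, so that a
        1-node cluster with qdevice no longer matches."""
        node_num = len(list_cluster_nodes()) - (1 if removing else 0)
        return node_num == 2 and not is_qdevice_configured()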

        Set cluster property if the calculated value is larger than the current cib value
        """
        _value = get_property(property_name)
        value_from_cib = int(_value.strip('s')) if _value else 0
If the property is defaulted, should the default value (for example STONITH_TIMEOUT_DEFAULT) be passed into the function and referenced, rather than 0, for the case of stonith-timeout?

        Check if SBD_DELAY_START is neither "no" nor unset
        """
        res = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
        return res and res != "no"
Technically, only an explicit true or a numeric value means it's enabled:

https://github.com/ClusterLabs/sbd/blob/master/src/sbd-inquisitor.c#L991-L998

Similarly, is_boolean_true() should be used.
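
A minimal sketch of the check along those lines (utils.is_boolean_true is the crmsh helper named here; treating a bare integer as enabled is an assumption based on the sbd code linked above):

    def is_sbd_delay_start():
        """Enabled only for an explicit boolean true or an integer delay value."""
        res = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
        return bool(res) and (utils.is_boolean_true(res) or res.isdigit())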

crmsh/sbd.py

        """
        # TODO 5ms, 5us, 5s, 5m, 5h are also valid for sbd sysconfig
        value = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
        if value in ["yes", "1"]:
IIRC, we used to use is_boolean_true() for this?

        Adjust the start timeout for sbd when SBD_DELAY_START is set
        """
        sbd_delay_start_value = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
        if sbd_delay_start_value == "no":
Better to use the is_sbd_delay_start() function here?

crmsh/sbd.py

    @@ -26,14 +275,11 @@ class SBDManager(object):
             specify here will be destroyed.
             """
         SBD_WARNING = "Not configuring SBD - STONITH will be disabled."
    -    DISKLESS_SBD_WARNING = """Diskless SBD requires cluster with three or more nodes.
    -If you want to use diskless SBD for two-nodes cluster, should be combined with QDevice."""
    +    DISKLESS_SBD_WARNING = "Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for two-nodes cluster, should be combined with QDevice."
Nitpicking: two-node or 2-node rather than two-nodes :-)

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 487cd4f to 4bf6742 Compare December 2, 2021 07:02
liangxin1300 (Collaborator, Author) commented:

@gao-yan @zzhou1 Thanks for your nice suggestions!

Basically, token + consensus needs to be generally added into the value for that.

I renamed get_stonith_timeout_runtime to get_stonith_timeout_expected, and compared the value from sbd with max(STONITH_TIMEOUT_DEFAULT, token+consensus), returning the larger one:

    def get_stonith_timeout_expected(self):
        """
        Get stonith-timeout value for sbd cases, formulas are:

        stonith-timeout >= 1.2 * (pcmk_delay_max + msgwait) # for disk-based sbd
        stonith-timeout >= 1.2 * max (stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT) # for disk-less sbd

        And above value will compare with max(STONITH_TIMEOUT_DEFAULT, token+consensus),
        then return the larger one
        """
        if self.disk_based:
            value_from_sbd = int(1.2*(self.pcmk_delay_max + self.msgwait))
        else:
            value_from_sbd = int(1.2*max(self.stonith_watchdog_timeout, 2*self.sbd_watchdog_timeout))

        value = max(value_from_sbd, bootstrap.get_stonith_timeout_generally_expected())
        logger.debug("Result of SBDTimeout.get_stonith_timeout_expected %d", value)
        return value

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 4bf6742 to d8fc1f7 Compare December 2, 2021 07:27
gao-yan (Member) commented Dec 2, 2021

> @gao-yan @zzhou1 Thanks for your nice suggestions!
>
> Basically, token + consensus needs to be generally added into the value for that.
>
> I renamed get_stonith_timeout_runtime to get_stonith_timeout_expected, and compared the value from sbd with max(STONITH_TIMEOUT_DEFAULT, token+consensus), returning the larger one [...]

If to address
https://bugzilla.redhat.com/show_bug.cgi?id=1941108
, then token+consensus generally needs to be added on top of the time that fencing itself may potentially need. Meaning, for the default:

stonith-timeout = STONITH_TIMEOUT_DEFAULT + (token+consensus)

For the calculated:
stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from d8fc1f7 to 36a3966 Compare December 2, 2021 07:59
crmsh/sbd.py

        """
        Adjust SBD_DELAY_START in /etc/sysconfig/sbd
        """
        run_time_value = str(self.sbd_delay_start_value_expected)
Given that you've replaced "runtime" with "expected", probably also "run_time" here? :-)

@gao-yan gao-yan closed this Dec 2, 2021
@gao-yan gao-yan reopened this Dec 2, 2021
zzhou1 (Contributor) commented Dec 2, 2021

@gao-yan

See my view down below.

> If to address: https://bugzilla.redhat.com/show_bug.cgi?id=1941108

Over there, actually, I didn't figure out how to illustrate the precise rationale for setting "pcmk_reboot_timeout to a value more than (token) + (consensus) + (how long they expect the reboot to take)". At least I couldn't prove the additional part for how_long_they_expect_the_reboot_to_take. Note that, per the description of brc#1941108#c0, node2 won't start the cluster stack at all. With that, how_long_they_expect_the_reboot_to_take can be an infinite number in theory ;)

My understanding and expectation is that as long as the pacemaker fencer op returns "OK" successfully, the fence failure can be avoided, hence no double fencing. And I observe that pacemaker-fenced does ack 'reboot' OK once the new corosync membership is formed, at the time of token+consensus. That is, as long as stonith-timeout > token+consensus, there won't be a fence failure, hence no double fencing. BTW, the sbd poison pill is special and can avoid double fencing after all; this is true. But other normal fence devices are not. I can prove this double fencing with fence_virsh in my lab with the cluster service disabled.

My wild guess is whether brc#1941108 might pull in the similar concept of SBD_DELAY_START and mix up two different concepts?! (BTW, as a side note, this brings up one more finding: the challenge when the cluster node reboots too fast without sbd configured. I agree that needs another thread elsewhere to discuss in the future ;) )

> , then token+consensus generally needs to be added on top of the time that fencing itself may potentially need. Meaning, for the default:
>
> stonith-timeout = STONITH_TIMEOUT_DEFAULT + (token+consensus)
>
> For the calculated: stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)

All in all, with my above narrative, in theory I stick with
stonith-timeout >= max(value_from_sbd, STONITH_TIMEOUT_DEFAULT, token+consensus)
instead of adding them together as above. Though, practically, the bigger number surely adds more than enough buffer.

gao-yan (Member) commented Dec 2, 2021

Considering the current default/calculated stonith-timeout alone, it's already the time that a fencing may be expected to take to return. A fencing target can safely be assumed to be down only when the fencing has returned. And only then can we assume the node starts leaving the corosync membership; it will then take (token+consensus) for us to get a new membership reformed.

As you can see from the bug entry, although the fencing action itself succeeded very quickly, it could not be acked until the new membership was reformed. So if the stonith-timeout timer pops before that, it will cause a fencing failure/timeout and result in another fencing.

Fencing itself only took a couple of seconds to return, which might make you think it's insignificant. But it consumed stonith-timeout. It'd be more obvious with sbd, of course.

>> , then token+consensus generally needs to be added on top of the time that fencing itself may potentially need. Meaning, for the default:
>> stonith-timeout = STONITH_TIMEOUT_DEFAULT + (token+consensus)
>> For the calculated: stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)
>
> All in all, with my above narrative, in theory I stick with stonith-timeout >= max(value_from_sbd, STONITH_TIMEOUT_DEFAULT, token+consensus) instead of adding them together as above. Though, practically, the bigger number surely adds more than enough buffer.

I don't really get what this is for :-)

To me, it's like either not addressing the case, with:

stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT)

, or addressing it, with:

stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)

Frankly I don't think having a longer stonith-timeout could do harm as long as we have a good reason.

zzhou1 (Contributor) commented Dec 3, 2021

> Considering the current default/calculated stonith-timeout alone, it's already the time that a fencing may be expected to take to return. [...]

From my experience reading the sbd and fence_virsh logs, I could see two places where pacemaker-fenced tries to return "OK" or "error":

  1. once stonith-timeout is reached,
  2. once token+consensus is reached

> As you can see from the bug entry, although the fencing action itself succeeded very quickly, it could not be acked until the new membership was reformed. So if the stonith-timeout timer pops before that, it will cause a fencing failure/timeout and result in another fencing.

Yes, I think I understand this part. This is also what leads me to stick with stonith-timeout > token+consensus.

> Fencing itself only took a couple of seconds to return, which might make you think it's insignificant. But it consumed stonith-timeout. It'd be more obvious with sbd, of course.
>
> [...] To me, it's like either not addressing the case, with:
>
>     stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT)
>
> , or addressing it, with:
>
>     stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)

Let me put it this way, using the use cases:

a) sbd use cases:
   stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT)

   This is safe even without taking "token+consensus" into account. The ugly situation is acceptable when it is less than (token+consensus): because of SBD_DELAY_START, two poison pills make no difference compared with one.

b) fence_virsh use cases, as an example of the non-sbd situation:
   stonith-timeout = max(STONITH_TIMEOUT_DEFAULT, (token+consensus))

   There is no influence from value_from_sbd at all.

> Frankly I don't think having a longer stonith-timeout could do harm as long as we have a good reason.

I kind of agree, because there is no obvious harm. However, pacemaker internally multiplies the user-end stonith-timeout by 1.2, and that does give enough buffer. I don't insist on adding too much buffer.

Another point in my technical debate is that I couldn't find a real example that illustrates adding all the above values together.

That is,
stonith-timeout >= max(value_from_sbd, STONITH_TIMEOUT_DEFAULT, token+consensus)
covers both the sbd and non-sbd situations as the generic formula.

BTW, here is my test configuration with fence_virsh:

- Bad configuration causing fencing failure (aka. double fencing):
  totem.token (u32) = 30000  => token+consensus will be 66
  stonith-timeout=50  => internally pacemaker will get 60, which is < 66

- Good configuration with only one single fencing:
  totem.token (u32) = 30000  => token+consensus will be 66
  stonith-timeout=70  => internally pacemaker will get 84, which is > 66
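
For reference, the arithmetic behind these numbers (corosync's consensus defaults to 1.2 × token):

    token     = 30000 ms = 30 s
    consensus = 1.2 * 30 s = 36 s   => token + consensus = 66 s
    50 s * 1.2 = 60 s < 66 s        => timer pops before the ack: double fencing
    70 s * 1.2 = 84 s > 66 s        => ack arrives in time: single fencing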

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 36a3966 to af2a30a Compare December 3, 2021 07:59
gao-yan (Member) commented Dec 3, 2021

> From my experience reading the sbd and fence_virsh logs, I could see two places where pacemaker-fenced tries to return "OK" or "error":
>
>   1. once stonith-timeout is reached,
>   2. once token+consensus is reached

The point is, we wouldn't want stonith-timeout to pop unnecessarily only because we are still reforming the new membership and hence the DC is not ready to acknowledge a fencing that was even successful ...

> b) fence_virsh use cases, as an example of the non-sbd situation: stonith-timeout = max(STONITH_TIMEOUT_DEFAULT, (token+consensus))

Things might not always be as easy as with fence_virsh, for example in public clouds purely with their own fencing mechanisms...

> There is no influence from value_from_sbd at all.

>> Frankly I don't think having a longer stonith-timeout could do harm as long as we have a good reason.
>
> I kind of agree, because there is no obvious harm. However, pacemaker internally multiplies the user-end stonith-timeout by 1.2, and that does give enough buffer. I don't insist on adding too much buffer.

Bear in mind, the margin is meant for the internal overheads (broadcasting/processing/acknowledging of fencing requests/replies among cluster nodes), which could be significant depending on the environment/situation ...

Just because we know there's an internal margin that usually isn't consumed up shouldn't encourage us to use it for another purpose.

> Another point in my technical debate is that I couldn't find a real example that illustrates adding all the above values together.

Well, that's because it's based on the experience/assumptions of normal/good situations :-)

A lot of fencing mechanisms usually take less than 1 second to finish fencing, but rather than giving them only a 1-second stonith-timeout, assuming they can always do the job with a single quick shot, we give them longer for possible bad situations and even for retries ...

It's not that I think the case must or must not be addressed; it just doesn't make sense to me to go with a halfway solution, otherwise we could get a conversation like:

Q: Why is my stonith-timeout bootstrapped to be equal to "token + consensus"?
A: To try to prevent an unnecessary second fencing.
Q: So does it prevent that?
A: Well, mostly, probably ...

zzhou1 (Contributor) commented Dec 3, 2021

> The point is, we wouldn't want stonith-timeout to pop unnecessarily only because we are still reforming the new membership and hence the DC is not ready to acknowledge a fencing that was even successful ...

Agree.

>> b) fence_virsh use cases, as an example of the non-sbd situation: stonith-timeout = max(STONITH_TIMEOUT_DEFAULT, (token+consensus))
>
> Things might not always be as easy as with fence_virsh, for example in public clouds purely with their own fencing mechanisms...

Thanks, yes, the public cloud and openstack fencing mechanisms on their own are good examples. They need a very long stonith-timeout. This reinforces using ">=" in the code instead of "=".

[...]

> A lot of fencing mechanisms usually take less than 1 second to finish fencing, but rather than giving them only a 1-second stonith-timeout, assuming they can always do the job with a single quick shot, we give them longer for possible bad situations and even for retries ...

That's a valid angle in the sense of bad situations. I would correlate this with public cloud fence agents again. The final appropriate stonith-timeout should leverage the knowledge of the sysadmin or of external orchestration software, for example. That's beyond the scope of crmsh so far.

> It's not that I think the case must or must not be addressed; it just doesn't make sense to me to go with a halfway solution, otherwise we could get a conversation like:
>
> Q: Why is my stonith-timeout bootstrapped to be equal to "token + consensus"? A: To try to prevent an unnecessary second fencing. Q: So does it prevent that? A: Well, mostly, probably ...

Now I can see your point, and where I stand. My point is based on ">=", which tries to stretch stonith-timeout to a reasonable minimum, not necessarily to blindly force "=" to cover all situations.

Anyway, toward closing this debate on stonith-timeout, I defer the final decision to you and/or Xin in the code ;)

Thanks, I enjoy this kind of debate. Have a nice weekend!

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from af2a30a to 145c456 Compare December 4, 2021 15:20
liangxin1300 (Collaborator, Author) commented:

Thanks for the nice discussion and suggestions! I've updated the code for stonith-timeout; the formulas are:

      value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)                             # for disk-based sbd
      value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
      stonith_timeout >= max(value_from_sbd, constants.STONITH_TIMEOUT_DEFAULT) + token + consensus
      stonith-timeout >= STONITH_TIMEOUT_DEFAULT + token + consensus                # for all situations

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 145c456 to ebd318e Compare December 6, 2021 08:20
gao-yan (Member) left a comment


Generally looks good to me.

Besides a few more nitpickings here, there are still unresolved conversations including the ones in is_sbd_delay_start(), adjust_systemd_start_timeout() and is_2node_cluster_without_qdevice(). Feel free to resolve or improve in the future.

"""
Adjust stonith-timeout for all scenarios, formula is:

stonith-timeout >= STONITH_TIMEOUT_DEFAULT + token + consensus
We are actually going with "=" rather than ">=" here, right? If so, we could make the comment consistent with the fact.

crmsh/sbd.py

        value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)                             # for disk-based sbd
        value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd

        stonith_timeout >= max(value_from_sbd, constants.STONITH_TIMEOUT_DEFAULT) + token + consensus
Similarly about ">=" in the comment.

crmsh/sbd.py

        Get the value for SBD_DELAY_START, formulas are:

        SBD_DELAY_START >= (token + consensus + pcmk_delay_max + msgwait)  # for disk-based sbd
        SBD_DELAY_START >= (token + consensus + 2*SBD_WATCHDOG_TIMEOUT)    # for disk-less sbd
Similarly.

crmsh/sbd.py

        cmd = "systemctl show -p TimeoutStartUSec sbd --value"
        out = utils.get_stdout_or_raise_error(cmd)
        start_timeout = utils.get_systemd_timeout_start_in_sec(out)
        if start_timeout >= int(sbd_delay_start_value):
Allowing "==" would be a little tricky... Probably better to require ">".
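
When the check fails, the unit's start timeout has to be raised; a hypothetical sketch of doing that via a systemd drop-in (the drop-in file name is an assumption, not necessarily what crmsh writes; utils.get_stdout_or_raise_error is the helper from the snippet above):

    import os

    def raise_sbd_start_timeout(seconds):
        """Give sbd.service enough systemd start timeout to cover SBD_DELAY_START."""
        dropin_dir = "/etc/systemd/system/sbd.service.d"
        os.makedirs(dropin_dir, exist_ok=True)
        with open(os.path.join(dropin_dir, "sbd_delay_start.conf"), "w") as f:
            f.write("[Service]\nTimeoutStartSec={}\n".format(seconds))
        utils.get_stdout_or_raise_error("systemctl daemon-reload")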

* Consolidate sbd timeout related methods/constants/formulas into class SBDTimeout

* Adjust stonith-timeout value, formulas are:
  value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)                             # for disk-based sbd
  value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
  stonith-timeout = max(value_from_sbd, constants.STONITH_TIMEOUT_DEFAULT) + token + consensus
  stonith-timeout = STONITH_TIMEOUT_DEFAULT + token + consensus                 # for all situations

* Adjust SBD_DELAY_START value, formulas are:
  SBD_DELAY_START = no                                               # for non-virtualization environments or non-2-node clusters (the system default)
  SBD_DELAY_START = (token + consensus + pcmk_delay_max + msgwait)   # for disk-based sbd
  SBD_DELAY_START = (token + consensus + 2*SBD_WATCHDOG_TIMEOUT)     # for disk-less sbd

* pcmk_delay_max=30       # only for a single stonith device in a 2-node cluster without qdevice
  delete pcmk_delay_max   # only for a single stonith device, when not in a 2-node cluster without qdevice
@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from ebd318e to ed7dbb6 Compare December 6, 2021 14:15
liangxin1300 (Collaborator, Author) commented:

@zzhou1 @gao-yan Thanks for reviewing!
