
Dev: sbd: Adjust timeout related values #890

Merged

Conversation

liangxin1300 (Collaborator) commented Nov 23, 2021

Changes

  • Consolidate sbd timeout related methods/constants/formulas into class SBDTimeout

  • Adjust stonith-timeout value (see the sketch after this list); formulas are:

    stonith-timeout >= 1.2 * (pcmk_delay_max + msgwait)                             # for disk-based sbd
    stonith-timeout >= 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
    stonith-timeout >= max(STONITH_TIMEOUT_DEFAULT, token+consensus)                # for ALL situations

  • Adjust SBD_DELAY_START value; formulas are:

    SBD_DELAY_START = no                                               # for non-virtualization environments (the system default)
    SBD_DELAY_START >= (token + consensus + pcmk_delay_max + msgwait)  # for disk-based sbd
    SBD_DELAY_START >= (token + consensus + 2*SBD_WATCHDOG_TIMEOUT)    # for disk-less sbd

  • Adjust pcmk_delay_max:

    pcmk_delay_max=30        # only for a single stonith device in a 2-node cluster without qdevice
    delete pcmk_delay_max    # only for a single stonith device, when not in a 2-node cluster without qdevice

  • Dev: behave: Add functional tests for the changes above
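
A minimal, illustrative Python sketch of how these lower bounds combine as initially proposed (parameter names stand for values read from corosync.conf and the sbd disk metadata; the 60 s default is an assumption, and the thread below refines the formula further — this is not the crmsh implementation):

    def expected_stonith_timeout(disk_based, token, consensus, msgwait=0,
                                 pcmk_delay_max=0, stonith_watchdog_timeout=0,
                                 sbd_watchdog_timeout=0, stonith_timeout_default=60):
        """Combine the proposed stonith-timeout lower bounds into one value."""
        if disk_based:
            value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)
        else:
            value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2 * sbd_watchdog_timeout)
        # The third formula is a floor that applies in ALL situations:
        return int(max(value_from_sbd, stonith_timeout_default, token + consensus))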

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch 3 times, most recently from 7f08657 to 3e221bc Compare November 24, 2021 02:20
zzhou1 (Contributor) commented Nov 24, 2021

> stonith-timeout >= max(STONITH_TIMEOUT_DEFAULT, token+consensus)  # for other situations

It is not "for other situations". It is "for ALL situations".
e.g. there could be the situation (token+consensus) > (pcmk_delay_max + msgwait)
e.g. there could be the situation (token+consensus) > max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)

Also, enforce ">=" to leave room for manual changes made outside of crmsh. That is, don't use "=" to set the value directly, since that would likely override an intentional configuration by the sysadmin. A sketch of this rule follows below.
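
A minimal sketch of that "only raise, never lower" rule (get_property/set_property stand in for crmsh helpers; illustrative only, not the actual implementation):

    def raise_property_if_needed(name, calculated_sec, default_sec=0):
        """Only ever increase the property, keeping any larger value a sysadmin set."""
        current = get_property(name)  # assumed helper, e.g. returns "60s" or None
        current_sec = int(current.strip("s")) if current else default_sec
        if calculated_sec > current_sec:
            set_property(name, "{}s".format(calculated_sec))  # assumed helper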

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch 2 times, most recently from 57f1456 to e16e31f Compare November 25, 2021 03:11
zzhou1 (Contributor) left a comment


One major change is needed for the conversation around get_sbd_delay_start()

zzhou1 (Contributor) commented Nov 25, 2021

crm cluster remove tw3 works nicely so far.

Well, I expect the following alternative will do the same job.
crm node delete tw3

liangxin1300 (Collaborator, Author) commented Nov 25, 2021

> crm cluster remove tw3 works nicely so far.
>
> Well, I expect the following alternative will do the same job. crm node delete tw3

Do you mean that crm cluster remove <node> and crm node delete should share the same code and execute the same process?
Then we need another PR to do this.

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch 2 times, most recently from 116f7ab to 37addb7 Compare November 26, 2021 02:07
zzhou1 (Contributor) left a comment


Minor changes are expected to resolve conversations.

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 37addb7 to 619fcfc Compare November 26, 2021 04:17
zzhou1 (Contributor) commented Nov 26, 2021

>> crm cluster remove tw3 works nicely so far.
>> Well, I expect the following alternative will do the same job. crm node delete tw3
>
> Do you mean that crm cluster remove <node> and crm node delete should share the same code and execute the same process? Then we need another PR to do this.

If the code can be reused directly, that will be great. Behavior-wise, crm node delete <node> should adjust_pcmk_delay_max here. Maybe most of the other behavior should be the same too. Anyway, this is worth an issue to track.

gao-yan (Member) commented Nov 29, 2021

>> stonith-timeout >= max(STONITH_TIMEOUT_DEFAULT, token+consensus) # for other situations
>
> It is not "for other situations". It is "for ALL situations". e.g. there could be the situation (token+consensus) > (pcmk_delay_max + msgwait); e.g. there could be the situation (token+consensus) > max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)

I'm a little lost here...

Fencing starts (meaning the stonith timer starts) only when a new membership has been reformed, which is after "token + consensus" have timed out. Why does "token + consensus" matter for determining the value of stonith-timeout, or need to be compared with the time that will be spent on fencing?

wenningerk commented:

>> It is not "for other situations". It is "for ALL situations". e.g. there could be the situation (token+consensus) > (pcmk_delay_max + msgwait); e.g. there could be the situation (token+consensus) > max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)
>
> I'm a little lost here...
>
> Fencing starts (meaning the stonith timer starts) only when a new membership has been reformed, which is after "token + consensus" have timed out. Why does "token + consensus" matter for determining the value of stonith-timeout, or need to be compared with the time that will be spent on fencing?

That corresponds to the final conclusion Jan Friesse and I came to,
and is the reason why I finally didn't implement that test in sbd itself, as was done with sync-timeout.
https://bugzilla.redhat.com/show_bug.cgi?id=1877434 was closed a while ago.
Let me copy here a statement from the private part of that bug:

    After reconsideration it looks as if there are no relevant cases that would
    require a relationship between the token timeout and the sbd watchdog timeout
    (and the stonith-watchdog-timeout derived from that).

    All fencing actions are determined from the current CIB state. Due to
    corosync's virtual synchrony, there should be no situations where the
    target node is in the membership, and the DC sees a CIB change that
    requires fencing but the target node does not also see that change.
    Therefore, the DC and sbd on the target node should always be in sync
    as far as whether fencing is required, and it should happen within the
    sbd timeout.

Sorry for not interfering here earlier.

gao-yan (Member) commented Nov 29, 2021

> That corresponds to the final conclusion Jan Friesse and I came to, and is the reason why I finally didn't implement that test in sbd itself, as was done with sync-timeout. https://bugzilla.redhat.com/show_bug.cgi?id=1877434 was closed a while ago. Let me copy here a statement from the private part of that bug. [...]

Thanks for the information, @wenningerk!

IIUC it means that, for the case of diskless sbd, only one requirement in regard to SBD_WATCHDOG_TIMEOUT remains: a value of SBD_WATCHDOG_TIMEOUT is safe as long as it's longer than the qdevice sync timeout, if qdevice is used. Right?

wenningerk commented:

> Thanks for the information, @wenningerk!
>
> IIUC it means that, for the case of diskless sbd, only one requirement in regard to SBD_WATCHDOG_TIMEOUT remains: a value of SBD_WATCHDOG_TIMEOUT is safe as long as it's longer than the qdevice sync timeout, if qdevice is used. Right?

Yes, as of my current understanding at least ;-)

gao-yan (Member) commented Nov 29, 2021

>>> stonith-timeout >= max(STONITH_TIMEOUT_DEFAULT, token+consensus) # for other situations
>>
>> It is not "for other situations". It is "for ALL situations". [...]
>
> I'm a little lost here...
>
> Fencing starts (meaning the stonith timer starts) only when a new membership has been reformed, which is after "token + consensus" have timed out. Why does "token + consensus" matter for determining the value of stonith-timeout, or need to be compared with the time that will be spent on fencing?

@liangxin1300 @zzhou1 What's confusing me here is that you seem to be talking about determining stonith-timeout for non-sbd/generic cases? But token + consensus is not relevant even for the cases of sbd with or without a shared disk.

I mean, yes, token + consensus should be added into SBD_DELAY_START, but I don't see them as a factor of the generic stonith-timeout.

zzhou1 (Contributor) commented Nov 30, 2021

> @liangxin1300 @zzhou1 What's confusing me here is that you seem to be talking about determining stonith-timeout for non-sbd/generic cases? But token + consensus is not relevant even for the cases of sbd with or without a shared disk.
>
> I mean, yes, token + consensus should be added into SBD_DELAY_START, but I don't see them as a factor of the generic stonith-timeout.

@gao-yan I treat it as a generic situation. I observed this in my lab and refer to the detailed discussion here:
https://bugzilla.redhat.com/show_bug.cgi?id=1941108

Well, this triggers me to correlate with the "Pacemaker Explained" document: "It (stonith-timeout) has been replaced by the pcmk_reboot_timeout and pcmk_off_timeout properties." I might have overlooked stonith-timeout here. Maybe stonith-timeout is not used without SBD at all? Thanks for clarifying this.

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 0544ee9 to 487cd4f Compare November 30, 2021 08:56
gao-yan (Member) commented Nov 30, 2021

> @gao-yan I treat it as a generic situation. I observed this in my lab and refer to the detailed discussion here: https://bugzilla.redhat.com/show_bug.cgi?id=1941108

Ah, now I recall it's about the case where the fencing target is still technically in the membership when the fencing action returns, and how to prevent an unnecessary second fencing...

To address this case, token + consensus needs to be generally added into stonith-timeout.

> Well, this triggers me to correlate with the "Pacemaker Explained" document: "It (stonith-timeout) has been replaced by the pcmk_reboot_timeout and pcmk_off_timeout properties." I might have overlooked stonith-timeout here. Maybe stonith-timeout is not used without SBD at all? Thanks for clarifying this.

I think this is about the parameters of a fencing resource rather than the global stonith-timeout.

And of course any such parameter configured for a specific fencing resource takes precedence over the global one.

zzhou1 (Contributor) commented Dec 1, 2021

> I think this is about the parameters of a fencing resource rather than the global stonith-timeout.
>
> And of course any such parameter configured for a specific fencing resource takes precedence over the global one.

Super! This does clear my mind so far.

However, one more confusing bit of text, from man pacemaker-schedulerd, says: "This value (stonith-timeout) is not used by Pacemaker, but is kept for backward compatibility." ???

gao-yan (Member) left a comment


Thanks for the good work, @liangxin1300 !

    -        if SBDManager.is_delay_start():
    -            pacemaker_start_msg += "(waiting for sbd {}s)".format(SBDManager.get_suitable_sbd_systemd_timeout())
    +        if SBDTimeout.is_sbd_delay_start():
    +            pacemaker_start_msg += "(waiting for sbd {}s)".format(SBDTimeout.get_sbd_delay_start_sec_from_sysconfig())
The message would probably be better if it indicated that this is about delayed start? For example "delaying start of sbd for {}s".

Besides, have we lost the logic for disk-based sbd where, in case SBD_DELAY_START was enabled with a boolean true, msgwait would be retrieved from the disk metadata?

        if not res or int(res) < SBDTimeout.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE:
            sbd_watchdog_timeout_qdevice = SBDTimeout.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE
        SBDManager.update_configuration({"SBD_WATCHDOG_TIMEOUT": str(sbd_watchdog_timeout_qdevice)})
        utils.set_property(stonith_timeout=int(1.2*2*sbd_watchdog_timeout_qdevice))
Looks like something could be put into get_stonith_timeout_runtime() and commonly formulated there ...

crmsh/sbd.py

        self.disk_based = False
        self.sbd_watchdog_timeout = SBDTimeout.get_sbd_watchdog_timeout()
        self.stonith_watchdog_timeout = SBDTimeout.get_stonith_watchdog_timeout()
        self.sbd_delay_start_value_runtime = self.get_sbd_delay_start_runtime() if utils.detect_virt() else "no"
I've been kind of confused by the usage of the term "runtime" :-)

It's alright, but would it be clearer with something like "expected" rather than "runtime"?

        current_num = len(list_cluster_nodes())
        remove_num = 1 if removing else 0
        qdevice_num = 1 if is_qdevice_configured() else 0
        return (current_num - remove_num + qdevice_num) == 2
Could a 1-node cluster with qdevice fall into this case this way? Would it be more sensible to check current_num - remove_num and qdevice_num separately?
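
A hypothetical sketch of the separated check being suggested (names reused from the snippet above; illustrative only, not the actual fix):

    def is_2node_cluster_without_qdevice(removing=False):
        """Require exactly two real nodes and no qdevice, so that a
        1-node cluster with qdevice no longer matches."""
        node_num = len(list_cluster_nodes()) - (1 if removing else 0)
        return node_num == 2 and not is_qdevice_configured()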

        Set cluster property if the calculated value is larger than the current cib value
        """
        _value = get_property(property_name)
        value_from_cib = int(_value.strip('s')) if _value else 0
If the property is defaulted, should the default value (for example STONITH_TIMEOUT_DEFAULT) be passed into the function and referenced, rather than 0, for the case of stonith-timeout?

        Check if SBD_DELAY_START is neither "no" nor unset
        """
        res = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
        return res and res != "no"
Technically, only an explicit true or a numeric value means it's enabled:

https://github.com/ClusterLabs/sbd/blob/master/src/sbd-inquisitor.c#L991-L998

Similarly, is_boolean_true() should be used.
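
A minimal sketch of the check along those lines (utils.is_boolean_true is the crmsh helper named here; treating a bare integer as enabled is an assumption based on the sbd code linked above):

    def is_sbd_delay_start():
        """Enabled only for an explicit boolean true or an integer delay value."""
        res = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
        return bool(res) and (utils.is_boolean_true(res) or res.isdigit())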

crmsh/sbd.py

        """
        # TODO 5ms, 5us, 5s, 5m, 5h are also valid for sbd sysconfig
        value = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
        if value in ["yes", "1"]:
IIRC, we used to use is_boolean_true() for this?

        Adjust the start timeout for sbd when SBD_DELAY_START is set
        """
        sbd_delay_start_value = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
        if sbd_delay_start_value == "no":
Better to use the is_sbd_delay_start() function here?

crmsh/sbd.py

    @@ -26,14 +275,11 @@ class SBDManager(object):
             specify here will be destroyed.
             """
         SBD_WARNING = "Not configuring SBD - STONITH will be disabled."
    -    DISKLESS_SBD_WARNING = """Diskless SBD requires cluster with three or more nodes.
    -If you want to use diskless SBD for two-nodes cluster, should be combined with QDevice."""
    +    DISKLESS_SBD_WARNING = "Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for two-nodes cluster, should be combined with QDevice."
Nitpicking: two-node or 2-node rather than two-nodes :-)

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 487cd4f to 4bf6742 Compare December 2, 2021 07:02
liangxin1300 (Collaborator, Author) commented:

@gao-yan @zzhou1 Thanks for your nice suggestions!

Basically, token + consensus needs to be generally added into the value for that.

I renamed get_stonith_timeout_runtime to get_stonith_timeout_expected, and compared the value from sbd with max(STONITH_TIMEOUT_DEFAULT, token+consensus), returning the larger one:

    def get_stonith_timeout_expected(self):
        """
        Get stonith-timeout value for sbd cases, formulas are:

        stonith-timeout >= 1.2 * (pcmk_delay_max + msgwait) # for disk-based sbd
        stonith-timeout >= 1.2 * max (stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT) # for disk-less sbd

        And above value will compare with max(STONITH_TIMEOUT_DEFAULT, token+consensus),
        then return the larger one
        """
        if self.disk_based:
            value_from_sbd = int(1.2*(self.pcmk_delay_max + self.msgwait))
        else:
            value_from_sbd = int(1.2*max(self.stonith_watchdog_timeout, 2*self.sbd_watchdog_timeout))

        value = max(value_from_sbd, bootstrap.get_stonith_timeout_generally_expected())
        logger.debug("Result of SBDTimeout.get_stonith_timeout_expected %d", value)
        return value

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 4bf6742 to d8fc1f7 Compare December 2, 2021 07:27
gao-yan (Member) commented Dec 2, 2021

> @gao-yan @zzhou1 Thanks for your nice suggestions!
>
> Basically, token + consensus needs to be generally added into the value for that.
>
> I renamed get_stonith_timeout_runtime to get_stonith_timeout_expected, and compared the value from sbd with max(STONITH_TIMEOUT_DEFAULT, token+consensus), returning the larger one [...]

If to address
https://bugzilla.redhat.com/show_bug.cgi?id=1941108
, then token+consensus generally needs to be added on top of the time that fencing itself may potentially need. Meaning, for the default:

stonith-timeout = STONITH_TIMEOUT_DEFAULT + (token+consensus)

For the calculated:
stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from d8fc1f7 to 36a3966 Compare December 2, 2021 07:59
crmsh/sbd.py

        """
        Adjust SBD_DELAY_START in /etc/sysconfig/sbd
        """
        run_time_value = str(self.sbd_delay_start_value_expected)
Given that you've replaced "runtime" with "expected", probably also "run_time" here? :-)

@gao-yan gao-yan closed this Dec 2, 2021
@gao-yan gao-yan reopened this Dec 2, 2021
zzhou1 (Contributor) commented Dec 2, 2021

@gao-yan

See my view down below.

> If to address: https://bugzilla.redhat.com/show_bug.cgi?id=1941108

Over there, actually, I didn't figure out how to illustrate the precise rationale for setting "pcmk_reboot_timeout to a value more than (token) + (consensus) + (how long they expect the reboot to take)". At least I couldn't prove the additional part for how_long_they_expect_the_reboot_to_take. Note that, per the description of brc#1941108#c0, node2 won't start the cluster stack at all. With that, how_long_they_expect_the_reboot_to_take can be an infinite number in theory ;)

My understanding and expectation is that as long as the pacemaker fencer op returns "OK" successfully, the fence failure can be avoided, hence no double fencing. And I observe that pacemaker-fenced does ack 'reboot' OK once the new corosync membership is formed, at the time of token+consensus. That is, as long as stonith-timeout > token+consensus, there won't be a fence failure, hence no double fencing. BTW, the sbd poison pill is special and can avoid double fencing after all; this is true. But other normal fence devices are not. I can prove this double fencing with fence_virsh in my lab with the cluster service disabled.

My wild guess is whether brc#1941108 might pull in the similar concept of SBD_DELAY_START and mix up two different concepts?! (BTW, as a side note, this brings up one more finding: the challenge when the cluster node reboots too fast without sbd configured. I agree that needs another thread elsewhere to discuss in the future ;) )

> , then token+consensus generally needs to be added on top of the time that fencing itself may potentially need. Meaning, for the default:
>
> stonith-timeout = STONITH_TIMEOUT_DEFAULT + (token+consensus)
>
> For the calculated: stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)

All in all, with my above narrative, in theory I stick with
stonith-timeout >= max(value_from_sbd, STONITH_TIMEOUT_DEFAULT, token+consensus)
instead of adding them together as above. Though, practically, the bigger number surely adds more than enough buffer.

gao-yan (Member) commented Dec 2, 2021

Considering the current default/calculated stonith-timeout alone, it's already the time that a fencing may be expected to take to return. A fencing target can safely be assumed to be down only when the fencing has returned. And only then can we assume the node starts leaving the corosync membership; it will then take (token+consensus) for us to get a new membership reformed.

As you can see from the bug entry, although the fencing action itself succeeded very quickly, it could not be acked until the new membership was reformed. So if the stonith-timeout timer pops before that, it will cause a fencing failure/timeout and result in another fencing.

Fencing itself only took a couple of seconds to return, which might make you think it's insignificant. But it consumed stonith-timeout. It'd be more obvious with sbd, of course.

>> , then token+consensus generally needs to be added on top of the time that fencing itself may potentially need. Meaning, for the default:
>> stonith-timeout = STONITH_TIMEOUT_DEFAULT + (token+consensus)
>> For the calculated: stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)
>
> All in all, with my above narrative, in theory I stick with stonith-timeout >= max(value_from_sbd, STONITH_TIMEOUT_DEFAULT, token+consensus) instead of adding them together as above. Though, practically, the bigger number surely adds more than enough buffer.

I don't really get what this is for :-)

To me, it's like either not addressing the case, with:

stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT)

, or addressing it, with:

stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)

Frankly I don't think having a longer stonith-timeout could do harm as long as we have a good reason.

zzhou1 (Contributor) commented Dec 3, 2021

> Considering the current default/calculated stonith-timeout alone, it's already the time that a fencing may be expected to take to return. [...]

From my experience reading the sbd and fence_virsh logs, I could see two places where pacemaker-fenced tries to return "OK" or "error":

  1. once stonith-timeout is reached,
  2. once token+consensus is reached

> As you can see from the bug entry, although the fencing action itself succeeded very quickly, it could not be acked until the new membership was reformed. So if the stonith-timeout timer pops before that, it will cause a fencing failure/timeout and result in another fencing.

Yes, I think I understand this part. This is also what leads me to stick with stonith-timeout > token+consensus.

> Fencing itself only took a couple of seconds to return, which might make you think it's insignificant. But it consumed stonith-timeout. It'd be more obvious with sbd, of course.
>
> [...] To me, it's like either not addressing the case, with:
>
>     stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT)
>
> , or addressing it, with:
>
>     stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + (token+consensus)

Let me put it this way, using the use cases:

a) sbd use cases:
   stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT)

   This is safe even without taking "token+consensus" into account. The ugly situation is acceptable when it is less than (token+consensus): because of SBD_DELAY_START, two poison pills make no difference compared with one.

b) fence_virsh use cases, as an example of the non-sbd situation:
   stonith-timeout = max(STONITH_TIMEOUT_DEFAULT, (token+consensus))

   There is no influence from value_from_sbd at all.

> Frankly I don't think having a longer stonith-timeout could do harm as long as we have a good reason.

I kind of agree, because there is no obvious harm. However, pacemaker internally multiplies the user-end stonith-timeout by 1.2, and that does give enough buffer. I don't insist on adding too much buffer.

Another point in my technical debate is that I couldn't find a real example that illustrates adding all the above values together.

That is,
stonith-timeout >= max(value_from_sbd, STONITH_TIMEOUT_DEFAULT, token+consensus)
covers both the sbd and non-sbd situations as the generic formula.

BTW, here is my test configuration with fence_virsh:

- Bad configuration causing fencing failure (aka. double fencing):
  totem.token (u32) = 30000  => token+consensus will be 66
  stonith-timeout=50  => internally pacemaker will get 60, which is < 66

- Good configuration with only one single fencing:
  totem.token (u32) = 30000  => token+consensus will be 66
  stonith-timeout=70  => internally pacemaker will get 84, which is > 66
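
For reference, the arithmetic behind these numbers (corosync's consensus defaults to 1.2 × token):

    token     = 30000 ms = 30 s
    consensus = 1.2 * 30 s = 36 s   => token + consensus = 66 s
    50 s * 1.2 = 60 s < 66 s        => timer pops before the ack: double fencing
    70 s * 1.2 = 84 s > 66 s        => ack arrives in time: single fencing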

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 36a3966 to af2a30a Compare December 3, 2021 07:59
gao-yan (Member) commented Dec 3, 2021

> From my experience reading the sbd and fence_virsh logs, I could see two places where pacemaker-fenced tries to return "OK" or "error":
>
>   1. once stonith-timeout is reached,
>   2. once token+consensus is reached

The point is, we wouldn't want stonith-timeout to pop unnecessarily only because we are still reforming the new membership and hence the DC is not ready to acknowledge a fencing that was even successful ...

> b) fence_virsh use cases, as an example of the non-sbd situation: stonith-timeout = max(STONITH_TIMEOUT_DEFAULT, (token+consensus))

Things might not always be as easy as with fence_virsh, for example in public clouds purely with their own fencing mechanisms...

> There is no influence from value_from_sbd at all.

>> Frankly I don't think having a longer stonith-timeout could do harm as long as we have a good reason.
>
> I kind of agree, because there is no obvious harm. However, pacemaker internally multiplies the user-end stonith-timeout by 1.2, and that does give enough buffer. I don't insist on adding too much buffer.

Bear in mind, the margin is meant for the internal overheads (broadcasting/processing/acknowledging of fencing requests/replies among cluster nodes), which could be significant depending on the environment/situation ...

Just because we know there's an internal margin that usually isn't consumed up shouldn't encourage us to use it for another purpose.

> Another point in my technical debate is that I couldn't find a real example that illustrates adding all the above values together.

Well, that's because it's based on the experience/assumptions of normal/good situations :-)

A lot of fencing mechanisms usually take less than 1 second to finish fencing, but rather than giving them only a 1-second stonith-timeout, assuming they can always do the job with a single quick shot, we give them longer for possible bad situations and even for retries ...

It's not that I think the case must or must not be addressed; it just doesn't make sense to me to go with a halfway solution, otherwise we could get a conversation like:

Q: Why is my stonith-timeout bootstrapped to be equal to "token + consensus"?
A: To try to prevent an unnecessary second fencing.
Q: So does it prevent that?
A: Well, mostly, probably ...

zzhou1 (Contributor) commented Dec 3, 2021

> The point is, we wouldn't want stonith-timeout to pop unnecessarily only because we are still reforming the new membership and hence the DC is not ready to acknowledge a fencing that was even successful ...

Agree.

>> b) fence_virsh use cases, as an example of the non-sbd situation: stonith-timeout = max(STONITH_TIMEOUT_DEFAULT, (token+consensus))
>
> Things might not always be as easy as with fence_virsh, for example in public clouds purely with their own fencing mechanisms...

Thanks, yes, the public cloud and openstack fencing mechanisms on their own are good examples. They need a very long stonith-timeout. This reinforces using ">=" in the code instead of "=".

[...]

> A lot of fencing mechanisms usually take less than 1 second to finish fencing, but rather than giving them only a 1-second stonith-timeout, assuming they can always do the job with a single quick shot, we give them longer for possible bad situations and even for retries ...

That's a valid angle in the sense of bad situations. I would correlate this with public cloud fence agents again. The final appropriate stonith-timeout should leverage the knowledge of the sysadmin or of external orchestration software, for example. That's beyond the scope of crmsh so far.

> It's not that I think the case must or must not be addressed; it just doesn't make sense to me to go with a halfway solution, otherwise we could get a conversation like:
>
> Q: Why is my stonith-timeout bootstrapped to be equal to "token + consensus"? A: To try to prevent an unnecessary second fencing. Q: So does it prevent that? A: Well, mostly, probably ...

Now I can see your point, and where I stand. My point is based on ">=", which tries to stretch stonith-timeout to a reasonable minimum, not necessarily to blindly force "=" to cover all situations.

Anyway, toward closing this debate on stonith-timeout, I defer the final decision to you and/or Xin in the code ;)

Thanks, I enjoy this kind of debate. Have a nice weekend!

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from af2a30a to 145c456 Compare December 4, 2021 15:20
liangxin1300 (Collaborator, Author) commented:

Thanks for the nice discussion and suggestions! I've updated the code for stonith-timeout; the formulas are:

      value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)                             # for disk-based sbd
      value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
      stonith_timeout >= max(value_from_sbd, constants.STONITH_TIMEOUT_DEFAULT) + token + consensus
      stonith-timeout >= STONITH_TIMEOUT_DEFAULT + token + consensus                # for all situations

@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from 145c456 to ebd318e Compare December 6, 2021 08:20
gao-yan (Member) left a comment


Generally looks good to me.

Besides a few more nitpickings here, there are still unresolved conversations including the ones in is_sbd_delay_start(), adjust_systemd_start_timeout() and is_2node_cluster_without_qdevice(). Feel free to resolve or improve in the future.

"""
Adjust stonith-timeout for all scenarios, formula is:

stonith-timeout >= STONITH_TIMEOUT_DEFAULT + token + consensus
We are actually going with "=" rather than ">=" here, right? If so, we could make the comment consistent with the fact.

crmsh/sbd.py

        value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)                             # for disk-based sbd
        value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd

        stonith_timeout >= max(value_from_sbd, constants.STONITH_TIMEOUT_DEFAULT) + token + consensus
Similarly about ">=" in the comment.

crmsh/sbd.py

        Get the value for SBD_DELAY_START, formulas are:

        SBD_DELAY_START >= (token + consensus + pcmk_delay_max + msgwait)  # for disk-based sbd
        SBD_DELAY_START >= (token + consensus + 2*SBD_WATCHDOG_TIMEOUT)    # for disk-less sbd
Similarly.

crmsh/sbd.py

        cmd = "systemctl show -p TimeoutStartUSec sbd --value"
        out = utils.get_stdout_or_raise_error(cmd)
        start_timeout = utils.get_systemd_timeout_start_in_sec(out)
        if start_timeout >= int(sbd_delay_start_value):
Allowing "==" would be a little tricky... Probably better to require ">".
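
When the check fails, the unit's start timeout has to be raised; a hypothetical sketch of doing that via a systemd drop-in (the drop-in file name is an assumption, not necessarily what crmsh writes; utils.get_stdout_or_raise_error is the helper from the snippet above):

    import os

    def raise_sbd_start_timeout(seconds):
        """Give sbd.service enough systemd start timeout to cover SBD_DELAY_START."""
        dropin_dir = "/etc/systemd/system/sbd.service.d"
        os.makedirs(dropin_dir, exist_ok=True)
        with open(os.path.join(dropin_dir, "sbd_delay_start.conf"), "w") as f:
            f.write("[Service]\nTimeoutStartSec={}\n".format(seconds))
        utils.get_stdout_or_raise_error("systemctl daemon-reload")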

* Consolidate sbd timeout related methods/constants/formulas into class SBDTimeout

* Adjust stonith-timeout value, formulas are:
  value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)                             # for disk-based sbd
  value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
  stonith-timeout = max(value_from_sbd, constants.STONITH_TIMEOUT_DEFAULT) + token + consensus
  stonith-timeout = STONITH_TIMEOUT_DEFAULT + token + consensus                 # for all situations

* Adjust SBD_DELAY_START value, formulas are:
  SBD_DELAY_START = no                                               # for non-virtualization environments or non-2-node clusters (the system default)
  SBD_DELAY_START = (token + consensus + pcmk_delay_max + msgwait)   # for disk-based sbd
  SBD_DELAY_START = (token + consensus + 2*SBD_WATCHDOG_TIMEOUT)     # for disk-less sbd

* pcmk_delay_max=30       # only for a single stonith device in a 2-node cluster without qdevice
  delete pcmk_delay_max   # only for a single stonith device, when not in a 2-node cluster without qdevice
@liangxin1300 liangxin1300 force-pushed the 20211103_sbd_timeout_start_delay branch from ebd318e to ed7dbb6 Compare December 6, 2021 14:15
liangxin1300 (Collaborator, Author) commented:

@zzhou1 @gao-yan Thanks for reviewing!
