Dev: sbd: Adjust timeout related values #890
Conversation
Force-pushed: 7f08657 → 3e221bc
It is not "for other situations"; it is "for ALL situations". Also enforce ">=" to leave room for any manual change made outside of crmsh. That is, don't use "=" to set the value directly, since that would likely override an intentional configuration by the sysadmin.
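For illustration, a minimal pure-logic sketch of that ">=" policy; the function name and the injected getter/setter are hypothetical stand-ins for crmsh's real property helpers:

```python
def bump_if_larger(get_property, set_property, name, calculated_sec):
    """
    Hypothetical sketch of the ">=" policy: only raise the property to the
    calculated value, never lower it, so a larger value set manually by the
    sysadmin outside of crmsh survives.
    """
    current = get_property(name)  # e.g. "83", "83s", or None
    current_sec = int(current.strip('s')) if current else 0
    if calculated_sec > current_sec:
        set_property(name, str(calculated_sec))
```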
Force-pushed: 57f1456 → e16e31f
One major change is needed for the conversation around get_sbd_delay_start()
Well, I expect the following alternative will do the same job.
Do you mean …
Force-pushed: 116f7ab → 37addb7
Minor changes are expected to resolve conversations.
Force-pushed: 37addb7 → 619fcfc
If the code can be reused directly, that would be great. Behavior-wise, …
Force-pushed: d864eda → 0544ee9
I'm a little lost here... Fencing starts, meaning the stonith timer starts, only when a new membership has been formed, which is after "token + consensus" has timed out. Why does "token + consensus" matter for determining the value of stonith-timeout, or need to be compared with the time that will be spent on fencing?
That corresponds to the final conclusion Jan Friesse and I came to.
Sorry for not chiming in here earlier.
Thanks for the information, @wenningerk! IIUC it means that, for the case of diskless sbd, only one requirement regarding SBD_WATCHDOG_TIMEOUT remains: a value of SBD_WATCHDOG_TIMEOUT is safe as long as it's longer than the qdevice sync timeout, if qdevice is used, right?
yes, as of my current understanding at least ;-)
@liangxin1300 @zzhou1 What's confusing me here is, you seem to be talking about the determination of … I mean, yes, …
@gao-yan I treat it as a generic situation. I observed this in my lab and referred to the detailed discussion here. Well, this triggers me to correlate it to the "Pacemaker Explained" document: "It (stonith-timeout) has been replaced by the pcmk_reboot_timeout and pcmk_off_timeout properties." I might be overlooking stonith-timeout here. Maybe stonith-timeout is not used without SBD at all? Thanks for clarifying this.
Force-pushed: 0544ee9 → 487cd4f
Ah, now I recall it's about the case where the fencing target is still technically in the membership when the fencing action returns, and how to prevent the unnecessary second fencing... If to address this case, …
I think this is about the parameters of the fencing resource rather than the global stonith-timeout. And of course any such parameter configured for a specific fencing resource takes precedence over the global one.
Super! This does clear my mind so far. However, one more confusing text from …
Thanks for the good work, @liangxin1300 !
crmsh/bootstrap.py (outdated):

```diff
-        SBDManager.is_delay_start():
-            pacemaker_start_msg += "(waiting for sbd {}s)".format(SBDManager.get_suitable_sbd_systemd_timeout())
+        SBDTimeout.is_sbd_delay_start():
+            pacemaker_start_msg += "(waiting for sbd {}s)".format(SBDTimeout.get_sbd_delay_start_sec_from_sysconfig())
```
The message would probably be better if it indicated that it's about delayed start, for example "delaying start of sbd for {}s".
Besides, have we lost the logic for disk-based sbd where, in case SBD_DELAY_START was enabled with a boolean true, msgwait would be retrieved from the disk metadata?
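For reference, a sketch of how msgwait could be read back from the on-disk metadata, assuming the standard `sbd -d <device> dump` output format (the helper name is hypothetical):

```python
import re
import subprocess

def get_msgwait_from_disk(device):
    """
    Hypothetical helper: parse the "Timeout (msgwait)" line out of
    `sbd -d <device> dump` to recover msgwait from the disk metadata.
    """
    out = subprocess.run(["sbd", "-d", device, "dump"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"Timeout \(msgwait\)\s*:\s*(\d+)", out)
    if not match:
        raise ValueError("msgwait not found in sbd dump output for " + device)
    return int(match.group(1))
```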
crmsh/bootstrap.py (outdated):

```python
if not res or int(res) < SBDTimeout.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE:
    sbd_watchdog_timeout_qdevice = SBDTimeout.SBD_WATCHDOG_TIMEOUT_DEFAULT_WITH_QDEVICE
SBDManager.update_configuration({"SBD_WATCHDOG_TIMEOUT": str(sbd_watchdog_timeout_qdevice)})
utils.set_property(stonith_timeout=int(1.2*2*sbd_watchdog_timeout_qdevice))
```
Looks like something could be put into get_stonith_timeout_runtime() and commonly formulated there ...
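As a rough illustration of that consolidation (the default of 60 seconds is an assumption, not a value asserted here; the signature is hypothetical):

```python
STONITH_TIMEOUT_DEFAULT = 60  # assumption: crmsh's built-in default, in seconds

def get_stonith_timeout_expected(sbd_watchdog_timeout, token, consensus):
    """
    Sketch: formulate the diskless-sbd stonith-timeout in one place instead
    of inlining 1.2 * 2 * SBD_WATCHDOG_TIMEOUT at the call site, following
    the formulas quoted later in this PR.
    """
    value_from_sbd = int(1.2 * 2 * sbd_watchdog_timeout)
    return max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + token + consensus
```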
crmsh/sbd.py (outdated):

```python
self.disk_based = False
self.sbd_watchdog_timeout = SBDTimeout.get_sbd_watchdog_timeout()
self.stonith_watchdog_timeout = SBDTimeout.get_stonith_watchdog_timeout()
self.sbd_delay_start_value_runtime = self.get_sbd_delay_start_runtime() if utils.detect_virt() else "no"
```
Been kind of confused by the usages of the term "runtime" :-)
It's alright, but would it be clearer with something like "expected" rather than "runtime"?
```python
current_num = len(list_cluster_nodes())
remove_num = 1 if removing else 0
qdevice_num = 1 if is_qdevice_configured() else 0
return (current_num - remove_num + qdevice_num) == 2
```
Could a 1-node cluster with qdevice fall into this case this way? Would it be more sensible to check current_num - remove_num and qdevice_num separately?
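A sketch of the separated check, following the reviewer's suggestion rather than the merged code (helper names are taken from the snippet quoted above):

```python
def is_2node_cluster_without_qdevice(removing=False):
    """
    Check node count and qdevice presence separately, so a 1-node cluster
    plus qdevice (1 - 0 + 1 == 2) no longer satisfies the
    "2-node cluster without qdevice" predicate.
    """
    current_num = len(list_cluster_nodes())
    remove_num = 1 if removing else 0
    return (current_num - remove_num) == 2 and not is_qdevice_configured()
```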
```python
    Set cluster property if the calculated value is larger than the current CIB value
    """
    _value = get_property(property_name)
    value_from_cib = int(_value.strip('s')) if _value else 0
```
If the property is left at its default, should that default value, for example STONITH_TIMEOUT_DEFAULT, be passed into the function and referenced rather than 0 in the case of stonith-timeout?
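A sketch of what that could look like (get_property is the helper used in the snippet above; the function name is hypothetical):

```python
def value_from_cib_or_default(property_name, default_sec):
    """
    Fall back to the property's documented default (e.g.
    STONITH_TIMEOUT_DEFAULT for stonith-timeout) instead of 0 when the
    property is absent from the CIB, so the "larger than current"
    comparison is made against what the cluster actually uses.
    """
    _value = get_property(property_name)
    return int(_value.strip('s')) if _value else default_sec
```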
```python
    Check if SBD_DELAY_START is set and not "no"
    """
    res = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
    return res and res != "no"
```
Technically, only an explicit true or a numeric value means being enabled:
https://github.com/ClusterLabs/sbd/blob/master/src/sbd-inquisitor.c#L991-L998
Similarly, is_boolean_true() should be used.
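A sketch of the suggested predicate, assuming crmsh's utils.is_boolean_true helper and the sbd parsing linked above, where a numeric value also enables the delay:

```python
def is_sbd_delay_start():
    """
    Only an explicit boolean true (yes/true/1/...) or a numeric delay
    value enables delayed start; anything else counts as disabled.
    """
    res = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
    if not res:
        return False
    return utils.is_boolean_true(res) or res.isdigit()
```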
crmsh/sbd.py (outdated):

```python
    """
    # TODO 5ms, 5us, 5s, 5m, 5h are also valid for sbd sysconfig
    value = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
    if value in ["yes", "1"]:
```
IIRC, we used to use is_boolean_true() for this?
```python
    Adjust start timeout for sbd when SBD_DELAY_START is set
    """
    sbd_delay_start_value = SBDManager.get_sbd_value_from_config("SBD_DELAY_START")
    if sbd_delay_start_value == "no":
```
Better to use the is_sbd_delay_start() function here?
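For instance, a sketch of the suggested reuse (body elided; names as in this PR):

```python
def adjust_systemd_start_timeout():
    # Gate on the shared predicate instead of re-reading SBD_DELAY_START
    # and string-comparing it against "no" here.
    if not SBDTimeout.is_sbd_delay_start():
        return
    # ... continue with the systemd TimeoutStartSec adjustment as in this PR
```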
crmsh/sbd.py (outdated):

```diff
@@ -26,14 +275,11 @@ class SBDManager(object):
     specify here will be destroyed.
     """
     SBD_WARNING = "Not configuring SBD - STONITH will be disabled."
-    DISKLESS_SBD_WARNING = """Diskless SBD requires cluster with three or more nodes.
-If you want to use diskless SBD for two-nodes cluster, should be combined with QDevice."""
+    DISKLESS_SBD_WARNING = "Diskless SBD requires cluster with three or more nodes. If you want to use diskless SBD for two-nodes cluster, should be combined with QDevice."
```
Nitpicking: two-node or 2-node rather than two-nodes :-)
Force-pushed: 487cd4f → 4bf6742
@gao-yan @zzhou1 Thanks for your nice suggestions!
I changed …
Force-pushed: 4bf6742 → d8fc1f7
If to address: …
For the calculated: …
Force-pushed: d8fc1f7 → 36a3966
crmsh/sbd.py (outdated):

```python
    """
    Adjust SBD_DELAY_START in /etc/sysconfig/sbd
    """
    run_time_value = str(self.sbd_delay_start_value_expected)
```
Given that you've replaced "runtime" with "expected", probably also "run_time" here? :-)
See my view down below.
Over there, actually, I didn't figure out how to illustrate the precise rationale for "pcmk_reboot_timeout to a value more than (token) + (consensus) + (how long they expect the reboot to take)". At least I couldn't prove the additional how_long_they_expect_the_reboot_to_take part. Note that, per the description of brc#1941108#c0, node2 won't start the cluster stack at all. With that, how_long_they_expect_the_reboot_to_take could be an infinite number in theory ;)

My understanding and expectation is that as long as the pacemaker fencer op returns "OK" successfully, this fence failure can be avoided, hence no double fencing. And I observe that pacemaker-fenced does ack 'reboot' OK once the new corosync membership is formed at token+consensus time. That says, as long as stonith-timeout > token+consensus, there won't be a fence failure, hence no double fencing.

BTW, the sbd poison pill is special and can avoid double fencing after all. This is true. But other normal fence devices are not. I can prove this double fencing with fence_virsh in my lab, with the cluster service disabled. My wild guess is whether brc#1941108 might pull in the similar concept of SBD_DELAY_START, and mix up two different concepts?! (BTW, as a side note, this brings up one more finding about the challenge when a cluster node reboots too fast without sbd configured. I agree that needs another thread elsewhere to discuss in the future ;) )
All in all, with my above narrative, in theory, I stick with …
Considering the current default/calculated … As you can see from the bug entry, although the fencing action itself succeeded very quickly, it could not be acked until the new membership was reformed. So as long as the stonith-timeout timer pops before that, it will anyway cause a fencing failure/timeout and result in another fencing. Fencing itself only took a couple of seconds to return, which might make you think it's insignificant, but it consumed stonith-timeout. It'd be more obvious with sbd, of course.
I don't really get what this is for :-) To me, it's like either not addressing the case, with: …
…, or addressing it, with: …
Frankly, I don't think having a longer stonith-timeout could do harm as long as we have a good reason.
From my experience reading the sbd and fence_virsh logs, I can see two places where pacemaker-fenced tries to return "OK" or "error".
Yes, I think I understand this part. This also leads me to stick with stonith-timeout > token+consensus.
Let me put it this way, by using the use cases:
a) sbd use cases: this is safe even without taking "token+consensus" into account. The ugly …
b) fence_virsh use cases, as the example for the non-sbd situation: there is no influence from value_from_sbd at all.
I kind of agree, because there is no obvious harm. However, pacemaker internally multiplies the user-facing stonith-timeout by 1.2, and that does give enough buffer. I don't insist on adding too much buffer. Another point from my side of the technical debate is that I couldn't find a real example that illustrates adding all the above values together. That said, … BTW, here is my test configuration with fence_virsh: …
Force-pushed: 36a3966 → af2a30a
The point is, we wouldn't want to unnecessarily pop stonith-timeout only because we are still reforming the new membership and hence the DC is not ready to acknowledge a fencing that was actually successful ...
Things might not always be as easy as with fence_virsh, for example in public clouds purely with their own fencing mechanisms...
Bear in mind, the margin is meant for the internal overheads, the broadcasting/processing/acknowledging of fencing requests/replies among cluster nodes, which could be significant depending on the environment/situation ... Just because we know there's the internal margin, and usually it's not consumed up, shouldn't encourage us to consider it for other purposes.
Well, because it's based on the experience/assumptions of normal/good situations :-) A lot of fencing mechanisms usually take less than 1 second to finish fencing, but rather than just giving it a 1-second stonith-timeout, assuming it can always do the job with a single quick shot, we give it longer time for possible bad situations and for it to even retry ... It's not that I think the case must or must not be addressed; it just doesn't make sense to me to go with a halfway solution, otherwise we could get a conversation like: Q: Why is my stonith-timeout bootstrapped to be equal to "token + consensus"?
Agree.
Thanks, yes, the public cloud and OpenStack fence mechanisms on their own are good examples. They need a very long stonith-timeout. This reinforces using ">=" in the code instead of "=". [...]
That's a valid angle in the sense of the bad situation. I would correlate this to public cloud fence agents again. The final appropriate stonith-timeout should leverage the knowledge of the sysadmin or the external orchestration software, for example. That's beyond the scope of crmsh so far.
Now I can see your point, and where I stand. My point is based on ">=", which tries to stretch stonith-timeout to a reasonable minimum, not necessarily to blindly force "=" to cover all situations. Anyway, to bring this stonith-timeout debate to a close, I defer the final decision to you and/or Xin in the code ;) Thanks, I enjoy this kind of debate. Have a nice weekend!
Force-pushed: af2a30a → 145c456
Thanks for the nice discussion and suggestions! I've updated the code about stonith-timeout; the formula:

stonith-timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + token + consensus
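For illustration, plugging assumed numbers (not defaults asserted by this PR) into the disk-based value_from_sbd formula from the commit message:

```python
# Illustrative arithmetic only; all numbers are assumptions for the example.
token, consensus = 5, 6                  # corosync totem values, in seconds
pcmk_delay_max, msgwait = 30, 10         # disk-based sbd example
STONITH_TIMEOUT_DEFAULT = 60             # assumed crmsh default

value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)  # 48.0
stonith_timeout = max(value_from_sbd, STONITH_TIMEOUT_DEFAULT) + token + consensus
print(stonith_timeout)                   # 71.0 -> stonith-timeout=71
```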
Force-pushed: 145c456 → ebd318e
Generally looks good to me.
Besides a few more nitpicks here, there are still unresolved conversations, including the ones in is_sbd_delay_start(), adjust_systemd_start_timeout() and is_2node_cluster_without_qdevice(). Feel free to resolve or improve them in the future.
crmsh/bootstrap.py (outdated):

```python
    """
    Adjust stonith-timeout for all scenarios, formula is:

    stonith-timeout >= STONITH_TIMEOUT_DEFAULT + token + consensus
```
We are actually going with "=" rather than ">=" here, right? If so, we could make the comment consistent with the fact.
crmsh/sbd.py (outdated):

```python
    value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)  # for disk-based sbd
    value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd

    stonith_timeout >= max(value_from_sbd, constants.STONITH_TIMEOUT_DEFAULT) + token + consensus
```
Similarly about ">=" in the comment.
crmsh/sbd.py (outdated):

```python
    Get the value for SBD_DELAY_START, formulas are:

    SBD_DELAY_START >= (token + consensus + pcmk_delay_max + msgwait)  # for disk-based sbd
    SBD_DELAY_START >= (token + consensus + 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
```
Similarly.
crmsh/sbd.py (outdated):

```python
    cmd = "systemctl show -p TimeoutStartUSec sbd --value"
    out = utils.get_stdout_or_raise_error(cmd)
    start_timeout = utils.get_systemd_timeout_start_in_sec(out)
    if start_timeout >= int(sbd_delay_start_value):
```
Allowing "==" would be a little tricky... Probably better to require ">".
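A sketch with the strict comparison applied (helpers are the ones from the snippet quoted above):

```python
def sbd_systemd_start_timeout_is_sufficient(sbd_delay_start_value):
    """
    Treat the systemd start timeout as sufficient only when it is strictly
    greater than SBD_DELAY_START; "==" leaves no margin.
    """
    cmd = "systemctl show -p TimeoutStartUSec sbd --value"
    out = utils.get_stdout_or_raise_error(cmd)
    start_timeout = utils.get_systemd_timeout_start_in_sec(out)
    return start_timeout > int(sbd_delay_start_value)
```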
* Consolidate sbd timeout related methods/constants/formulas into class SBDTimeout
* Adjust stonith-timeout value, formulas are:
    value_from_sbd = 1.2 * (pcmk_delay_max + msgwait)  # for disk-based sbd
    value_from_sbd = 1.2 * max(stonith_watchdog_timeout, 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
    stonith_timeout = max(value_from_sbd, constants.STONITH_TIMEOUT_DEFAULT) + token + consensus
    stonith-timeout = STONITH_TIMEOUT_DEFAULT + token + consensus  # for all situations
* Adjust SBD_DELAY_START value, formulas are:
    SBD_DELAY_START = no  # for non-virtualization environments or non-2-node clusters, which is the system default
    SBD_DELAY_START = (token + consensus + pcmk_delay_max + msgwait)  # for disk-based sbd
    SBD_DELAY_START = (token + consensus + 2*SBD_WATCHDOG_TIMEOUT)  # for disk-less sbd
* pcmk_delay_max=30  # only for the single stonith device in the 2-node cluster without qdevice
* pcmk_delay_max deletion  # only for the single stonith device, not in the 2-node cluster without qdevice
Force-pushed: ebd318e → ed7dbb6