
neutron: set a failure-timeout on neutron-ha-tool #2063

Open · wants to merge 1 commit into base: master

Conversation

@dirkmueller dirkmueller commented Mar 18, 2019

We don't want the l3 ha tool service to be stopped after 3 weeks of weekly
patching and rebooting of the rabbitmq cluster. Set a failure-timeout so that
a failure expires if it happened more than 10 minutes ago.

@@ -154,6 +154,9 @@
   agent "systemd:neutron-l3-ha-service"
   op node[:neutron][:ha][:neutron_l3_ha_resource][:op]
   action :update
+  meta ({
Inline review comment on the added meta ({ line:
Lint/ParenthesesAsGroupedExpression: (...) interpreted as grouped expression. (https://github.com/bbatsov/ruby-style-guide#parens-no-spaces)
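
(The hunk above is truncated at the inline-comment anchor. Given the "10 minutes" in the commit message, the three added lines are presumably a meta block along these lines; the exact attribute value is an assumption. Dropping the space before the parenthesis would also satisfy the lint warning above.)

meta ({
  # failure-timeout expires recorded failures after this many seconds;
  # 600 s matches the 10 minutes mentioned in the commit message (assumed value)
  "failure-timeout" => "600"
})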

@aspiers (Member) left a comment

The commit message references the l3 agent but the change affects neutron-l3-ha-service. It's not clear to me what the exact problem is or why timing out a failure of neutron-l3-ha-service would address it. I'm guessing there is some missing detail regarding the interaction between the two - please can you clarify in the commit message?

Revised commit message:

We don't want the neutron-ha-tool service to be stopped after 3 weeks of weekly
patching and rebooting of the rabbitmq cluster. Set a failure-timeout so that
a failure expires if it happened more than 10 minutes ago.
@dirkmueller (Contributor, Author) commented:

@aspiers sorry, fixed the typo. This is about the neutron-l3-ha-service, which randomly but regularly gets stopped by pacemaker because of some sequence of consecutive errors.

For example, somebody recently broke keystone for 15 minutes, and that caused pacemaker to stop the service due to repeated failures. This is not helpful for achieving high availability when pacemaker just kills the service that is supposed to take care of availability.
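
(For context: pacemaker stops running a resource on a node once its fail-count reaches migration-threshold, and failure-timeout lets old failures expire so the count can recover on its own. A minimal sketch of inspecting and resetting the count by hand, assuming the standard pacemaker CLI tools:)

# Show per-resource fail-counts; this is the same command the CI check
# below runs to detect lingering failures.
crm_mon --failcounts -1

# Clear the recorded failures for the resource so pacemaker will try to
# start it again (resource name taken from this PR).
crm_resource --cleanup --resource neutron-l3-ha-service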

@aspiers (Member) commented Mar 31, 2019

> @aspiers sorry, fixed the typo. This is about the neutron-l3-ha-service, which randomly but regularly gets stopped by pacemaker because of some sequence of consecutive errors.
>
> For example, somebody recently broke keystone for 15 minutes, and that caused pacemaker to stop the service due to repeated failures. This is not helpful for achieving high availability when pacemaker just kills the service that is supposed to take care of availability.

OK thanks, that makes sense now. Ideally I would prefer that info to be in the commit message too, since the commit message doesn't feel entirely self-explanatory yet. But the main problem seems to be that the CI is currently failing:

+(qa_crowbarsetup.sh:3967) oncontroller_check_crm_failcounts(): [[ 1 = 1 ]]
+(qa_crowbarsetup.sh:3967) oncontroller_check_crm_failcounts(): [[ disallowskipfailcount = \d\i\s\a\l\l\o\w\s\k\i\p\f\a\i\l\c\o\u\n\t ]]
+(qa_crowbarsetup.sh:3968) oncontroller_check_crm_failcounts(): crm_mon --failcounts -1
+(qa_crowbarsetup.sh:3968) oncontroller_check_crm_failcounts(): grep fail-count=
+(qa_crowbarsetup.sh:3968) oncontroller_check_crm_failcounts(): complain 55 'Cluster resources'\'' failures detected'
+(mkcloud-common.sh:114) complain(): local ex=55
+(mkcloud-common.sh:114) complain(): shift
   neutron-l3-ha-service: migration-threshold=3 fail-count=3 last-failure='Mon Mar 25 21:27:23 2019'
+(mkcloud-common.sh:115) complain(): printf 'Error (55): %s\n' 'Cluster resources'\'' failures detected'
Error (55): Cluster resources' failures detected
+(mkcloud-common.sh:116) complain(): [[ 55 = - ]]
+(mkcloud-common.sh:116) complain(): exit 55

I guess that's probably related to this change somehow.

@aspiers (Member) commented Mar 31, 2019

@aspiers commented on March 31, 2019 1:26 PM:

> But the main problem seems to be that the CI is currently failing:
>
> [snipped]

I'm going to see if logreduce can help with this...
