nagios hangs on reload while sending external command to cmd file #319

Closed
xaoc-krsk opened this issue Jan 26, 2017 · 9 comments

@xaoc-krsk

Original ticket: http://tracker.nagios.org/view.php?id=548

When you send a command to the command file and reload nagios at the same time, you end up in command_input_handler, spinning in this block indefinitely:

                while (sigrestart) {
#ifdef USE_NANOSLEEP
                        ts.tv_sec = 0;
                        ts.tv_nsec = 10000000;
                        nanosleep(&ts, NULL);
#else
                        sleep(1);
#endif
                }

Since this loop runs in the main nagios process itself, sigrestart is never set back to false.

The proposed patch fixes this by returning immediately from command_input_handler while a reload is in progress. No commands are read during the reload; the next command is read as soon as the reload has finished.

The attached patch is for version 4.2.4.
base-commands.c.txt
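
To illustrate the idea (this is a minimal sketch, not the attached patch or the actual Nagios source): the handler checks the restart flag first and returns without reading, so the pending command stays in the pipe until after the reload. The function name read_command_sketch, its signature, and its return convention are assumptions made for this example; only sigrestart and command_input_handler come from the report above.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* In the real daemon this flag is set when SIGHUP is caught.
 * Here it is a plain file-scope flag for illustration. */
static volatile sig_atomic_t sigrestart = 0;

/* Hypothetical stand-in for command_input_handler.
 * Returns the number of bytes read, 0 if the read was skipped
 * because a reload is pending, or -1 on a read error. */
static ssize_t read_command_sketch(int fd, char *buf, size_t len)
{
    assert(buf != NULL);

    /* Do not wait for the flag to clear: the waiting code would be
     * running in the same process that has to perform the reload.
     * Return at once and leave the data in the pipe. */
    if (sigrestart)
        return 0;

    return read(fd, buf, len);
}
```

Because nothing is consumed while the flag is set, a command written during the reload is not lost; it is picked up by the first call made after the flag has been cleared.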

@jfrickson jfrickson self-assigned this Jan 30, 2017
@jfrickson jfrickson added the Bug label Jan 30, 2017
jfrickson pushed a commit that referenced this issue Feb 22, 2017
Fix for issue #319

In an earlier fix, I was apparently trying to be overly clever.
This fix, proposed by xaoc-krsk, does the trick.
@jfrickson
Contributor

Fixed in branch maint via commit cde8780

@wleese

wleese commented Jun 29, 2017

Considering Nagios 4.3.x is unstable with mod_gearman at the moment (sni/mod_gearman#110), do you think we can get a 4.2.x release with this patch?

@hedenface
Contributor

@wleese Are you confirming that John's patch in 4.3.3 did not work?

@wleese

wleese commented Jun 29, 2017

No, sorry. I'm saying that we cannot run 4.3.3 due to sni/mod_gearman#110 (which is an assumption, it's not something that I have tested personally).
So being stuck on 4.2.4, I'd like a 4.2.5 with this patch included.

@hedenface
Contributor

@wleese I could create a branch for you to compile from source, but I cannot release a 4.2.5 publicly as we do not maintain multiple minor revisions simultaneously.

This would be beneficial as I could actually have someone verify that the fix worked :)

Does this sound ok to you?

@wleese

wleese commented Jun 29, 2017

I'm afraid I'm working on too many assumptions, but I can reproduce 'my' issue by repeating these commands:

/etc/init.d/nagios reload; echo "[1386672918] PROCESS_SERVICE_CHECK_RESULT;local-xxxx-001.localdomain;check_nrpe_status_on_local-xxxx-001.localdomain;0;I'm OK"

After about 3 attempts, the nagios startup hangs:

[1498742722] Caught SIGHUP, restarting...
[1498742723] Event broker module 'NERD' deinitialized successfully.
[1498742723] Warning: use_embedded_perl_implicitly is deprecated and will be removed.
[1498742723] Warning: enable_embedded_perl is deprecated and will be removed.
[1498742723] Warning: p1_file is deprecated and will be removed.
[1498742723] Warning: sleep_time is deprecated and will be removed.
[1498742723] Warning: external_command_buffer_slots is deprecated and will be removed. All commands are always processed upon arrival
[1498742723] Warning: command_check_interval is deprecated and will be removed. Commands are always handled on arrival
[1498742723] Nagios 4.2.4 starting... (PID=8288)
[1498742723] Local time is Thu Jun 29 15:25:23 CEST 2017
[1498742723] LOG VERSION: 2.0
[1498742723] qh: Socket '/var/spool/nagios/cmd/nagios.qh' successfully initialized
[1498742723] qh: core query handler registered
[1498742723] nerd: Channel hostchecks registered successfully
[1498742723] nerd: Channel servicechecks registered successfully
[1498742723] nerd: Channel opathchecks registered successfully
[1498742723] nerd: Fully initialized and ready to rock!
[1498742723] wproc: Successfully registered manager as @wproc with query handler

I also end up with 2 nagios processes (which kinda makes sense?):

nagios    7155  0.5  1.0 574960 16104 ?        Ss   15:18   0:00 /usr/sbin/nagios -xd /etc/nagios/nagios.cfg
nagios    7157  0.0  0.0      0     0 ?        Z    15:18   0:00  \_ [nagios] <defunct>
nagios    7158  0.0  0.0      0     0 ?        Z    15:18   0:00  \_ [nagios] <defunct>
nagios    7159  0.0  0.0      0     0 ?        Z    15:18   0:00  \_ [nagios] <defunct>
nagios    7160  0.0  0.0      0     0 ?        Z    15:18   0:00  \_ [nagios] <defunct>
nagios    7163  0.0  0.9  49024 14036 ?        S    15:18   0:00  \_ /usr/sbin/nagios -xd /etc/nagios/nagios.cfg

/var/log/nagios/nagios.debug is empty, despite:

# DEBUG
debug_level=1
debug_verbosity=1
debug_file=/var/log/nagios/nagios.debug
max_debug_file_size=1000000

nagios-4.2.4

I'm willing to test the patch to see if it helps, again assuming this is the same problem. Whatever is happening, I've ruled out mod_gearman and livestatus by disabling them.

@wleese

wleese commented Jun 30, 2017

Confirmed that this issue affects us.
Backtrace of the stuck nagios process:

(gdb) bt
#0  0x00007fa4d6bb866d in nanosleep () from /lib64/libc.so.6
#1  0x00007fa4d761346c in command_input_handler ()
#2  0x00007fa4d765b6d3 in iobroker_poll ()
#3  0x00007fa4d75f777d in main ()

@wleese

wleese commented Jul 3, 2017

Tested cde8780 .. works like a charm

@hedenface
Contributor

Glad to hear it! Also, I'll be digging into the mod_gearman issue on Wednesday, so hopefully nobody will be held back from upgrading.
