Nagios 4.3.1 crashes when using mod_gearman #110

Closed
dan-m-joh opened this issue Mar 3, 2017 · 18 comments

Comments

@dan-m-joh

dan-m-joh commented Mar 3, 2017

I upgraded Nagios from 4.2.4 to 4.3.1 (luckily only on my development box), and now it crashes repeatedly with SIGSEGV / SIGTERM, about once a minute.
To me it looks like a problem that occurs when a broker module sends data "back" to Nagios.

I base this on the following facts.

  1. If I disable mod_gearman in nagios.cfg, everything works OK.
  2. If I enable mod_gearman in nagios.cfg, but do not use it for host-/service-checks, everything works OK.
  3. If I enable mod_gearman and use it for host-/service-checks it starts crashing.

Sadly, the only thing I can see in the Nagios log is:
Caught SIGSEGV, shutting down...
Caught SIGTERM, shutting down...

In the debug-log I do not see anything strange.
Here are my SW releases:
OS: RHEL 7.3
Nagios 4.3.1 (built from source)
mod_gearman 3.0.1-1 (labs.consol.de)
gearmand 0.33-5 (labs.consol.de)

Running nagios under gdb I see the following when it crashes:

Program received signal SIGSEGV, Segmentation fault.
clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
2851 my_free(this_customvariablesmember->variable_name);
Missing separate debuginfos, use: debuginfo-install boost-system-1.53.0-26.el7.x86_64 gearmand-0.33-5.x86_64 glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 sssd-client-1.14.0-43.el7_3.11.x86_64
(gdb) bt
#0 clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
#1 0x00005555555916bc in clear_contact_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:3001
#2 0x00005555555918b7 in clear_volatile_macros_r (mac=mac@entry=0x7ffffffed2e0) at ../common/macros.c:2870
#3 0x00007ffff64aaa9e in handle_svc_check (event_type=, data=0x7fffffffda30) at neb_module_nagios4/../neb_module/mod_gearman.c:851
#4 0x000055555556bb2f in neb_make_callbacks (callback_type=callback_type@entry=6, data=data@entry=0x7fffffffda30) at nebmods.c:529
#5 0x0000555555569f10 in broker_service_check (type=type@entry=704, flags=flags@entry=0, attr=attr@entry=0, svc=svc@entry=0x555555e97310, check_type=check_type@entry=0,
start_time=..., end_time=..., cmd=, latency=0, exectime=exectime@entry=0, timeout=timeout@entry=0, early_timeout=early_timeout@entry=0,
retcode=retcode@entry=0, cmdline=cmdline@entry=0x0, timestamp=timestamp@entry=0x0, cr=cr@entry=0x0) at broker.c:326
#6 0x000055555557172f in run_async_service_check (svc=svc@entry=0x555555e97310, check_options=check_options@entry=0, latency=latency@entry=0.0008800000068731606,
scheduled_check=scheduled_check@entry=1, reschedule_check=reschedule_check@entry=1, time_is_valid=time_is_valid@entry=0x7fffffffe29c,
preferred_time=preferred_time@entry=0x7fffffffe2a8) at checks.c:199
#7 0x0000555555571cb1 in run_scheduled_service_check (svc=svc@entry=0x555555e97310, check_options=0, latency=latency@entry=0.0008800000068731606) at checks.c:90
#8 0x0000555555587adb in handle_timed_event (event=event@entry=0x555555e8fc20) at events.c:1171
#9 0x0000555555588623 in event_execution_loop () at events.c:1110
#10 0x0000555555568a56 in main (argc=, argv=) at nagios.c:814

I hope you see something there to help you find the issue.
If you need more debugging info, I would be glad to help.

Regards,
D/\N

@sni
Owner

sni commented Mar 3, 2017

@hedenface do you want to have a look?

@dan-m-joh
Author

dan-m-joh commented Mar 3, 2017

I have also done a diff between the Nagios headers that you ship for nagios4 and the "real" ones from nagios-4.3.1.
Here is the result:

diff -r nagios4/macros.h nagios-4.3.1/include/macros.h
41c41
< #define MACRO_X_COUNT                         156     /* size of macro_x[] array */
---
> #define MACRO_X_COUNT                         157     /* size of macro_x[] array */
219a220
> #define MACRO_HOSTGROUPMEMBERADDRESSES          156
diff -r nagios4/nagios.h nagios-4.3.1/include/nagios.h
533c534
< void clear_service_flap(service *, double, double, double);   /* handles a service that has stopped flapping */
---
> void clear_service_flap(service *, double, double, double, int);      /* handles a service that has stopped flapping */
535c536
< void clear_host_flap(host *, double, double, double);         /* handles a host that has stopped flapping */
---
> void clear_host_flap(host *, double, double, double, int);            /* handles a host that has stopped flapping */
diff -r nagios4/nebstructs.h nagios-4.3.1/include/nebstructs.h
521a521
>       char            *longoutput;
diff -r nagios4/objects.h nagios-4.3.1/include/objects.h
34c34
< #define CURRENT_OBJECT_STRUCTURE_VERSION        402     /* increment when changes are made to data structures... */
---
> #define CURRENT_OBJECT_STRUCTURE_VERSION        403     /* increment when changes are made to data structures... */
diff -r nagios4/lib/libnagios.h nagios-4.3.1/lib/libnagios.h
24a25
> #include "nwrite.h"
diff -r nagios4/lib/runcmd.h nagios-4.3.1/lib/runcmd.h
105a106,113
>
> /**
>  * If you're using libnagios to execute a remote command, the
>  * static pid_t pids is not freed after runcmd_open
>  * You can call this function when you're sure pids is no longer
>  * in use, to keep down memory leaks
>  */
> extern void runcmd_free_pids(void);

D/\N

@hedenface
Contributor

I'll take a look today. @dan-m-joh Can I see your contact definitions, please?

@dan-m-joh
Author

Of course you can... (email redacted)

###############################################################################
###############################################################################
#
# CONTACTS
#
###############################################################################
###############################################################################

define contact{
        name                            generic-contact
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        service_notification_commands   notify-service-by-email 
        host_notification_commands      notify-host-by-email    
        register                        0
        }

define contact{
        contact_name                    nagiosadmin
        use                             generic-contact
        alias                           Nagios Admin
        email                           my.email@comp.org
        }

###############################################################################
###############################################################################
#
# CONTACT GROUPS
#
###############################################################################
###############################################################################

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }

@hedenface
Contributor

This looks like it may be a Core bug. I was able to replicate with pre-built and compiled from source ModGearman modules. Keep this issue open if you want, and I'll post the relevant fix when/if discovered.

@dan-m-joh
Author

Great to hear, not that we have a bug, but that you could replicate it. Now I at least know that it is not just in my environment.
OK, I'll keep this open and wait for feedback.

D/\N

@BoomerET

We've had this issue for months. We're testing moving to Naemon, but sure wish this would work with Nagios 4 Core.

@hedenface
Contributor

@dan-m-joh Did you by any chance compile mod-gearman with the proper Nagios headers? I'll get it set up on Wednesday and try to get this thing fixed.

@dan-m-joh
Author

No, sorry, I have not had a chance to test with the "new" Nagios headers.
Is it as simple as copying the "new" Nagios headers into the nagios4 header directory?

@sni sni closed this as completed Jul 14, 2017
@dan-m-joh
Author

dan-m-joh commented Jul 24, 2017

F.Y.I. Compiling mod_gearman with the Nagios 4.3.2 headers (replacing all headers in include/ and include/lib/, except epn_utils.h, with the ones from the Nagios sources) seems to fix the issue for me. I will let it run on my test rig for a few days, then I will update my production rig.

D/\N

@rcgreenw

Was this ever fixed? I know it is closed, but there was no comment on the closing. I'm getting the same behavior with the following:

CentOS 6.9
Nagios 4.3.4 (EPEL RPMs)
mod_gearman 3.0.6.20170929 (ConSol Labs RPMs)
gearmand 0.33-6 (ConSol Labs RPMs)

It happened with mod_gearman 3.0.6 from the stable repo too, so I moved to the testing repo to see if it was fixed there. Everything works fine until I enable active checks, then it dies with SIGSEGV.

@hedenface
Contributor

@rcgreenw I believe the problem is the headers that were used to compile the binaries in the package you mention. What happens if you compile using the Nagios 4.3.4 headers? I suspect the issue will go away.

@rcgreenw

rcgreenw commented Dec 8, 2017

I haven't had a chance to try that yet; the machine really isn't set up for development. I was hoping for updated packages so I wouldn't have to build my own. I'll see if I can get everything needed to build it installed. Thanks.

@smallsam

We have a similar setup to rcgreenw, in terms of RPM package sources. What's the recommended solution here given we want to upgrade easily with RPMs?
Can mod_gearman be enhanced to deal with nagios 4.3.x automatically? It sounds like one of the best options in order to maintain automatic RPM patching is to move to naemon, unless mod_gearman can be patched.

@rcgreenw

I was able to get an RPM built with minor modifications. I pulled from git, then removed the include/nagios4 directory and replaced it with a symlink to /usr/include/nagios (from the nagios-devel rpm). Then, I did an rpmbuild using the spec file in the support directory. There is a copy of the rpm here, but don't count on updates in the future.

http://mirror.tausd.org/tausd/RHEL/6/tausd/x86_64/mod_gearman-3.0.5-9.1.el6.x86_64.rpm

@sni
Owner

sni commented Jan 27, 2018

How about changing the configure script to detect /usr/include/nagios and only use the shipped nagios4 folder as a fallback? I am open to pull requests that update the nagios4 folder as well.

@smallsam

smallsam commented Feb 2, 2018

It sounds like mod_gearman no longer supports Nagios Core now that Nagios Core has changed its interface. I see a few options:

  1. Build a different module for Naemon and for Nagios 4.x, as statusengine have done with their module: https://github.com/statusengine/module/tree/master/src. The binary releases for mod_gearman could then package and distribute differently named binaries for Naemon, Nagios, etc.
  2. Drop support for Nagios Core.
  3. @sni's suggestion: users compile mod_gearman against the headers of their choice.

I'd prefer 1, because I tend to avoid compiling software and encourage sysadmins to use supported binary repositories whenever possible (e.g. ConSol Labs' yum repo).

A cursory look at the folders in the repo suggests you already have some structure to support different NEB module versions. Perhaps this would be an extension of that structure to support the new Nagios interface?

@sni
Owner

sni commented Feb 3, 2018

  1. That's the case already. We already build three NEB modules: for Nagios 3, Nagios 4 and Naemon.
    Nagios 3 does not change anymore, so that one is easy: we just ship the headers and build against them. Naemon is easy as well: there is a naemon-devel package containing the headers, and it just works. Nagios 4 is difficult and error-prone because there is no nagios4-devel package available for all supported systems. So we need to ship headers again, but this breaks as soon as the ABI changes.
    So right now, the only way for Nagios 4 is to compile the plugin yourself with the headers from your setup. Well, or switch to Naemon.
