Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Multiple ni_process_reap: issue in wickedd logs #741

Open
MikhailKasimov opened this issue Mar 11, 2018 · 7 comments
Open

[Question] Multiple ni_process_reap: issue in wickedd logs #741

MikhailKasimov opened this issue Mar 11, 2018 · 7 comments

Comments

@MikhailKasimov
Copy link
Contributor

Hello!

Have Leap 42.3 installed on VirtualBox.

Wicked version: 0.6.40.

Multiple ni_process_reap: issue in logs, such like:

-- Reboot --
мар 11 18:02:25 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 18:02:25 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 18:02:28 linux-mk500 wickedd[870]: ni_process_reap: process 1172 has not exited yet; now doing a blocking waitpid()
мар 11 18:13:27 linux-mk500 systemd[1]: Stopping wicked network management service daemon...
мар 11 18:13:27 linux-mk500 systemd[1]: Stopped wicked network management service daemon.
-- Reboot --
мар 11 18:13:46 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 18:13:46 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 18:13:49 linux-mk500 wickedd[876]: ni_process_reap: process 1176 has not exited yet; now doing a blocking waitpid()
мар 11 19:15:47 linux-mk500 systemd[1]: Stopping wicked network management service daemon...
мар 11 19:15:47 linux-mk500 systemd[1]: Stopped wicked network management service daemon.
-- Reboot --
мар 11 19:17:01 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 19:17:01 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 19:17:03 linux-mk500 wickedd[892]: ni_process_reap: process 1188 has not exited yet; now doing a blocking waitpid()

journalctl -e _PID=1188 command gives no result. So, I can't catch what the problem process (1176, 1188, etc) is.

Can you, please, give a hint what ni_process_reap: is and how to solve it, if it is problem indeed?

Thanks!
wickedd_0_6_40_ni_process_reap.txt

@archiecobbs
Copy link

Still happening 3.5 years later... lots of errors... very verbose...

openSUSE Leap 15.3
wicked-0.6.65-2.1.x86_64

$ tail -n 12 /var/log/warn
Sep 14 14:25:49 demo wickedd[985]: ni_process_reap: process 7495 has not exited yet; now doing a blocking waitpid()
Sep 14 14:28:25 demo wickedd[985]: ni_process_reap: process 5925 has not exited yet; now doing a blocking waitpid()
Sep 14 14:45:08 demo wickedd[985]: ni_process_reap: process 10878 has not exited yet; now doing a blocking waitpid()
Sep 14 14:55:49 demo wickedd[985]: ni_process_reap: process 8612 has not exited yet; now doing a blocking waitpid()
Sep 14 15:09:21 demo wickedd[985]: ni_process_reap: process 8084 has not exited yet; now doing a blocking waitpid()
Sep 14 15:19:17 demo wickedd[985]: ni_process_reap: process 29526 has not exited yet; now doing a blocking waitpid()
Sep 14 15:32:16 demo wickedd[985]: ni_process_reap: process 22557 has not exited yet; now doing a blocking waitpid()
Sep 14 15:34:41 demo wickedd[985]: ni_process_reap: process 19040 has not exited yet; now doing a blocking waitpid()
Sep 14 15:47:04 demo wickedd[985]: ni_process_reap: process 4441 has not exited yet; now doing a blocking waitpid()
Sep 14 15:57:26 demo wickedd[985]: ni_process_reap: process 30766 has not exited yet; now doing a blocking waitpid()

What information do you need from me to debug this??

mtomaschewski added a commit to mtomaschewski/wicked that referenced this issue Sep 15, 2021
@mtomaschewski
Copy link
Member

It's about the handle_hangup callback called on poll(2) hangup event:

       POLLHUP
              Hang up (only returned in revents; ignored in events).  Note that when reading from
              a channel such as a pipe or a stream socket, this event merely indicates  that  the
              peer  closed its end of the channel.  Subsequent reads from the channel will return
              0 (end of file) only after all outstanding data in the channel has been consumed.

In src/process.c, there is a:

        rv = waitpid(pi->pid, &pi->status, WNOHANG);
        if (rv == 0) {
                /* This is an ugly workaround. Sometimes, we seem to get a hangup on the socket even
                 * though the script (provably) still has its end of the socket pair open for writing. */
                ni_error("%s: process %u has not exited yet; now doing a blocking waitpid()",
                                __func__, pi->pid);
                rv = waitpid(pi->pid, &pi->status, 0);
        }

I've pushed some test code to https://github.com/mtomaschewski/wicked/tree/pipe-test branch which may be helpful to narrow down what is causing it -- it shows the command that were executed and there is a skeleton in testing using a very simple ni_process exec.

# journalctl | grep ni_process_reap
# journalctl | wc -l
114384

I didn't found it in the logs on my machines. As you're getting them, could you please provide the logs with the command?

@archiecobbs
Copy link

Here's what I see (so far) after applying the patch and rebooting:

Sep 15 13:07:38 demo wickedd[5818]: ni_process_reap: process 5822 (/etc/wicked/extensions/firewall up) has not exited yet; now doing a blocking waitpid()
Sep 15 13:07:38 demo wickedd[5818]: ni_process_reap: process 5822 (/etc/wicked/extensions/firewall up) reaped in blocking waitpid after 0m0.000s
Sep 15 13:07:47 demo wickedd[5818]: ni_process_reap: process 7890 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:07:47 demo wickedd[5818]: ni_process_reap: process 7890 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Sep 15 13:30:15 demo wickedd[973]: ni_process_reap: process 4533 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:30:15 demo wickedd[973]: ni_process_reap: process 4533 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Sep 15 13:32:36 demo wickedd[973]: ni_process_reap: process 30420 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:32:36 demo wickedd[973]: ni_process_reap: process 30420 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s

Looking at the code and noting that the wait times above are all less that one millisecond, it appears that the bug in the code is simply a mistaken assumption: Just because a socket that is connected to process X has been closed, that doesn't mean that process X has completed exiting.

So it's possible that the first invocation of wait() will be too soon; if so, instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

@mtomaschewski
Copy link
Member

mtomaschewski commented Sep 16, 2021

mistaken assumption

Yes. A process can close it's fd's and still do something (e.g. cleanup) before it exits. Further, it can also be, that the process already exited, just the kernel didn't cleaned up the process resources yet and is not ready to report the exit status yet. We use the ni_process functions for many things, to send or to receive some data (or nothing), so it's a quite important place with several use-cases. And "almost always" we need the data + error code at the end.

(When you look into e.g. extensions/firewall and schema/firewall.xml, you'll see that we send data to the scripts, in firewall case it's the <firewall><zone>foobar</zone></firewall> config node that is passed as <arguments><config><zone>foobar</zone></config></arguments>)

instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

Yes, I'd tend keep it as is (no fixed "deadline" or e.g. 100ms what is already "very long time"), but to change the error msg to debug, call blocking waitpid as we did until now and then either log debug message again (so we see in debug how long it took) or when it took 1sec or more, log this as error or warning?

mtomaschewski added a commit to mtomaschewski/wicked that referenced this issue Sep 16, 2021
Changed error log message to debug with an additional warning
when it took 1sec or more.
@mtomaschewski
Copy link
Member

Better would be to carefully change the code to use a signal handler (use signalfd, which allows to poll them too) and finally call the notify_callback, but this would be much more intrusive than the above and needs extensive tests.

@archiecobbs
Copy link

instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

Yes, I'd tend keep it as is (no fixed "deadline" or e.g. 100ms what is already "very long time"), but to change the error msg to debug, call blocking waitpid as we did until now and then either log debug message again (so we see in debug how long it took) or when it took 1sec or more, log this as error or warning?

Yes I think we're saying more or less the same thing. Your approach is slightly more airtight because it guarantees the process has fully exited. But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning wold be generated.

Thanks.

@mtomaschewski
Copy link
Member

Yes I think we're saying more or less the same thing.

Yes, I think so too.

But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning wold be generated.

Yes, when you prefer we can keep it an error. When we run into cases where it needs >=1s to get the exit status, there is something fishy and it's needed to take a much closer look at all this (rewrite to use signalfd IMO).

Thanks!

mtomaschewski added a commit to mtomaschewski/wicked that referenced this issue Sep 29, 2021
Changed error log message to debug with an additional warning
when it took 1sec or more.
cfconrad added a commit that referenced this issue Sep 29, 2021
process: log command of reaped child (gh##741)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants