[Question] Multiple ni_process_reap: issue in wickedd logs #741

MikhailKasimov · 2018-03-11T17:33:40Z

Hello!

Have Leap 42.3 installed on VirtualBox.

Wicked version: 0.6.40.

Multiple ni_process_reap: issue in logs, such like:

-- Reboot --
мар 11 18:02:25 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 18:02:25 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 18:02:28 linux-mk500 wickedd[870]: ni_process_reap: process 1172 has not exited yet; now doing a blocking waitpid()
мар 11 18:13:27 linux-mk500 systemd[1]: Stopping wicked network management service daemon...
мар 11 18:13:27 linux-mk500 systemd[1]: Stopped wicked network management service daemon.
-- Reboot --
мар 11 18:13:46 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 18:13:46 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 18:13:49 linux-mk500 wickedd[876]: ni_process_reap: process 1176 has not exited yet; now doing a blocking waitpid()
мар 11 19:15:47 linux-mk500 systemd[1]: Stopping wicked network management service daemon...
мар 11 19:15:47 linux-mk500 systemd[1]: Stopped wicked network management service daemon.
-- Reboot --
мар 11 19:17:01 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 19:17:01 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 19:17:03 linux-mk500 wickedd[892]: ni_process_reap: process 1188 has not exited yet; now doing a blocking waitpid()

journalctl -e _PID=1188 command gives no result. So, I can't catch what the problem process (1176, 1188, etc) is.

Can you, please, give a hint what ni_process_reap: is and how to solve it, if it is problem indeed?

Thanks!
wickedd_0_6_40_ni_process_reap.txt

The text was updated successfully, but these errors were encountered:

archiecobbs · 2021-09-14T21:38:27Z

Still happening 3.5 years later... lots of errors... very verbose...

openSUSE Leap 15.3
wicked-0.6.65-2.1.x86_64

$ tail -n 12 /var/log/warn
Sep 14 14:25:49 demo wickedd[985]: ni_process_reap: process 7495 has not exited yet; now doing a blocking waitpid()
Sep 14 14:28:25 demo wickedd[985]: ni_process_reap: process 5925 has not exited yet; now doing a blocking waitpid()
Sep 14 14:45:08 demo wickedd[985]: ni_process_reap: process 10878 has not exited yet; now doing a blocking waitpid()
Sep 14 14:55:49 demo wickedd[985]: ni_process_reap: process 8612 has not exited yet; now doing a blocking waitpid()
Sep 14 15:09:21 demo wickedd[985]: ni_process_reap: process 8084 has not exited yet; now doing a blocking waitpid()
Sep 14 15:19:17 demo wickedd[985]: ni_process_reap: process 29526 has not exited yet; now doing a blocking waitpid()
Sep 14 15:32:16 demo wickedd[985]: ni_process_reap: process 22557 has not exited yet; now doing a blocking waitpid()
Sep 14 15:34:41 demo wickedd[985]: ni_process_reap: process 19040 has not exited yet; now doing a blocking waitpid()
Sep 14 15:47:04 demo wickedd[985]: ni_process_reap: process 4441 has not exited yet; now doing a blocking waitpid()
Sep 14 15:57:26 demo wickedd[985]: ni_process_reap: process 30766 has not exited yet; now doing a blocking waitpid()

What information do you need from me to debug this??

mtomaschewski · 2021-09-15T14:53:17Z

It's about the handle_hangup callback called on poll(2) hangup event:

       POLLHUP
              Hang up (only returned in revents; ignored in events).  Note that when reading from
              a channel such as a pipe or a stream socket, this event merely indicates  that  the
              peer  closed its end of the channel.  Subsequent reads from the channel will return
              0 (end of file) only after all outstanding data in the channel has been consumed.

In src/process.c, there is a:

        rv = waitpid(pi->pid, &pi->status, WNOHANG);
        if (rv == 0) {
                /* This is an ugly workaround. Sometimes, we seem to get a hangup on the socket even
                 * though the script (provably) still has its end of the socket pair open for writing. */
                ni_error("%s: process %u has not exited yet; now doing a blocking waitpid()",
                                __func__, pi->pid);
                rv = waitpid(pi->pid, &pi->status, 0);
        }

I've pushed some test code to https://github.com/mtomaschewski/wicked/tree/pipe-test branch which may be helpful to narrow down what is causing it -- it shows the command that were executed and there is a skeleton in testing using a very simple ni_process exec.

# journalctl | grep ni_process_reap
# journalctl | wc -l
114384

I didn't found it in the logs on my machines. As you're getting them, could you please provide the logs with the command?

archiecobbs · 2021-09-15T18:51:25Z

Here's what I see (so far) after applying the patch and rebooting:

Sep 15 13:07:38 demo wickedd[5818]: ni_process_reap: process 5822 (/etc/wicked/extensions/firewall up) has not exited yet; now doing a blocking waitpid()
Sep 15 13:07:38 demo wickedd[5818]: ni_process_reap: process 5822 (/etc/wicked/extensions/firewall up) reaped in blocking waitpid after 0m0.000s
Sep 15 13:07:47 demo wickedd[5818]: ni_process_reap: process 7890 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:07:47 demo wickedd[5818]: ni_process_reap: process 7890 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Sep 15 13:30:15 demo wickedd[973]: ni_process_reap: process 4533 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:30:15 demo wickedd[973]: ni_process_reap: process 4533 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Sep 15 13:32:36 demo wickedd[973]: ni_process_reap: process 30420 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:32:36 demo wickedd[973]: ni_process_reap: process 30420 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s

Looking at the code and noting that the wait times above are all less that one millisecond, it appears that the bug in the code is simply a mistaken assumption: Just because a socket that is connected to process X has been closed, that doesn't mean that process X has completed exiting.

So it's possible that the first invocation of wait() will be too soon; if so, instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

mtomaschewski · 2021-09-16T01:27:28Z

mistaken assumption

Yes. A process can close it's fd's and still do something (e.g. cleanup) before it exits. Further, it can also be, that the process already exited, just the kernel didn't cleaned up the process resources yet and is not ready to report the exit status yet. We use the ni_process functions for many things, to send or to receive some data (or nothing), so it's a quite important place with several use-cases. And "almost always" we need the data + error code at the end.

(When you look into e.g. extensions/firewall and schema/firewall.xml, you'll see that we send data to the scripts, in firewall case it's the <firewall><zone>foobar</zone></firewall> config node that is passed as <arguments><config><zone>foobar</zone></config></arguments>)

instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

Yes, I'd tend keep it as is (no fixed "deadline" or e.g. 100ms what is already "very long time"), but to change the error msg to debug, call blocking waitpid as we did until now and then either log debug message again (so we see in debug how long it took) or when it took 1sec or more, log this as error or warning?

Changed error log message to debug with an additional warning when it took 1sec or more.

mtomaschewski · 2021-09-16T01:45:42Z

Better would be to carefully change the code to use a signal handler (use signalfd, which allows to poll them too) and finally call the notify_callback, but this would be much more intrusive than the above and needs extensive tests.

archiecobbs · 2021-09-16T13:49:26Z

instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

Yes, I'd tend keep it as is (no fixed "deadline" or e.g. 100ms what is already "very long time"), but to change the error msg to debug, call blocking waitpid as we did until now and then either log debug message again (so we see in debug how long it took) or when it took 1sec or more, log this as error or warning?

Yes I think we're saying more or less the same thing. Your approach is slightly more airtight because it guarantees the process has fully exited. But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning wold be generated.

Thanks.

mtomaschewski · 2021-09-17T09:40:19Z

Yes I think we're saying more or less the same thing.

Yes, I think so too.

But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning wold be generated.

Yes, when you prefer we can keep it an error. When we run into cases where it needs >=1s to get the exit status, there is something fishy and it's needed to take a much closer look at all this (rewrite to use signalfd IMO).

Thanks!

Changed error log message to debug with an additional warning when it took 1sec or more.

process: log command of reaped child (gh##741)

mtomaschewski added a commit to mtomaschewski/wicked that referenced this issue Sep 15, 2021

process: show command of reaped child pid (openSUSE#741)

dbfc015

mtomaschewski added a commit to mtomaschewski/wicked that referenced this issue Sep 16, 2021

process: log command of reaped child (gh#openSUSE#741)

00d38d7

Changed error log message to debug with an additional warning when it took 1sec or more.

mtomaschewski added a commit to mtomaschewski/wicked that referenced this issue Sep 29, 2021

process: log command of reaped child (gh#openSUSE#741)

3de9772

Changed error log message to debug with an additional warning when it took 1sec or more.

cfconrad added a commit that referenced this issue Sep 29, 2021

Merge pull request #883 from mtomaschewski/process-reap-error

da10480

process: log command of reaped child (gh##741)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question] Multiple ni_process_reap: issue in wickedd logs #741

[Question] Multiple ni_process_reap: issue in wickedd logs #741

MikhailKasimov commented Mar 11, 2018

archiecobbs commented Sep 14, 2021

mtomaschewski commented Sep 15, 2021

archiecobbs commented Sep 15, 2021

mtomaschewski commented Sep 16, 2021 •

edited

Loading

mtomaschewski commented Sep 16, 2021

archiecobbs commented Sep 16, 2021

mtomaschewski commented Sep 17, 2021

[Question] Multiple ni_process_reap: issue in wickedd logs #741

[Question] Multiple ni_process_reap: issue in wickedd logs #741

Comments

MikhailKasimov commented Mar 11, 2018

archiecobbs commented Sep 14, 2021

mtomaschewski commented Sep 15, 2021

archiecobbs commented Sep 15, 2021

mtomaschewski commented Sep 16, 2021 • edited Loading

mtomaschewski commented Sep 16, 2021

archiecobbs commented Sep 16, 2021

mtomaschewski commented Sep 17, 2021

mtomaschewski commented Sep 16, 2021 •

edited

Loading