Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race in libzfs_run_process_impl #16801

Merged
merged 1 commit into from
Dec 4, 2024
Merged

Conversation

shodanshok
Copy link
Contributor

@shodanshok shodanshok commented Nov 22, 2024

When replacing a disk, a child process is forked to run a script called zfs_prepare_disk (which can be useful for disk firmware update or health check). The parent than calls waitpid and checks the child error/status code.

However, the ZED _reap_children thread (created from zed_exec_process to manage zedlets) also waits for all children with the same PGID and can stole the signal, causing the replace operation to be aborted.

As waitpid returns -1, the parent incorrectly assume that the child process had an error or was killed. This, in turn, leaves the newly added disk in REMOVED or UNAVAIL status rather than completing the replace process.

This patch changes the PGID of the child process execuing the prepare script, shielding it from the _reap_children thread.

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 23, 2024
@tonyhutter
Copy link
Contributor

I see the waitpid() man page example code (https://linux.die.net/man/2/waitpid) is a little different from the way we do things in libzfs_run_process_impl(). If we just adapt that code, does it fix the issue you're seeing?:

diff --git a/lib/libzfs/libzfs_util.c b/lib/libzfs/libzfs_util.c
index 1f7e7b0e6..951feb1a0 100644
--- a/lib/libzfs/libzfs_util.c
+++ b/lib/libzfs/libzfs_util.c
@@ -963,12 +963,14 @@ libzfs_run_process_impl(const char *path, char *argv[], char *env[], int flags,
        } else if (pid > 0) {
                /* Parent process */
                int status;
-
-               while ((error = waitpid(pid, &status, 0)) == -1 &&
-                   errno == EINTR)
-                       ;
-               if (error < 0 || !WIFEXITED(status))
-                       return (-1);
+               do {
+                       error = waitpid(pid, &status, WUNTRACED | WCONTINUED);
+                       if (error == -1)
+                               return (-1);
+                       if (WIFEXITED(status) || WIFSIGNALED(status) ||
+                           WIFSTOPPED(status) || WIFCONTINUED(status))
+                               return (-1);
+               } while (!WIFEXITED(status) && !WIFSIGNALED(status));
 
                if (lines != NULL) {
                        close(link[1]);

@shodanshok
Copy link
Contributor Author

@tonyhutter I don't think it would improve the issue at hand.

error = waitpid(pid, &status, WUNTRACED | WCONTINUED);
if (error == -1)
        return (-1);

This code would return error if the child exited before the parent had a chance to check it - the same as current code. While this kind of check is correct for many cases (ie: when a child exiting so fast is not expected), for this specific operation (replacing a disk with an empty prepare script) it is not.

This is how I understand it, at least.
Thanks.

@amotin
Copy link
Member

amotin commented Nov 30, 2024

This code would return error if the child exited before the parent had a chance to check it

@shodanshok I think you misunderstand how it works. There should be no race. Please see the "Notes" section of the man page. Besides I am not sure it is correct to check status if waitpid() returned error. Checking FreeBSD kernel it seems the status is not set in case of syscall error.

@shodanshok
Copy link
Contributor Author

@amotin I see what do you mean, and I think you are right. Upon further inspection, I suspect the issue is related to the double-wait done via zed event handler and libzfs_util.

If I am not mistaken, when replacing a disk via zed the following happens:

  • a disk add event is handled byzfs_process_add, which waits inside libzfs_run_process_impl for the prepare script
  • libzfs_run_process_impl forks and exec the prepare script
  • concurrently, a syslog event handler is started by_zed_exec_fork_child, then waits inside_reap_children

Something seems to go wrong between these two forks/waits. I added some debug printf to libzfs_util.c, inside libzfs_run_process_impl just after waiting for the child process to return:

printf ("DEBUG: error: %d, errno: %d, status: %d, normal: %d, code: %d\n", error, errno, status, WIFEXITED(status), WEXITSTATUS(status));

zed -v -F shows the following output:

Finished "(null)" eid=0 pid=71724 time=0.000979s exit=0
DEBUG: error: -1, errno: 10, status: 0, normal: 1, code: 0

Notice how:

  • zed shows a Finished "(null)" line, meaning some memory was corrupted / zeroed
  • error == -1 even if status == 0, WIFEXITED(status) == 1 and WEXITSTATUS(status) == 0

@shodanshok shodanshok force-pushed the replace branch 3 times, most recently from 72cea01 to 80f09a7 Compare December 2, 2024 12:22
@shodanshok
Copy link
Contributor Author

Indeed, the wait4 call inside _reap_children seems to stole the signal for the waitpid call inside libzfs_run_process_impl (it is timing dependent). This is because the wait4 call waits for all childrens with the same process group ID.

I have updated the patch with a possible solution. Thanks.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 3, 2024
@behlendorf behlendorf requested a review from amotin December 3, 2024 00:56
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My memories in the area are a bit rusty, but rafter reading some man pages seems to make sense.

When replacing a disk, a child process is forked to run a script called
zfs_prepare_disk (which can be useful for disk firmware update or health
check). The parent than calls waitpid and checks the child error/status
code.

However, the _reap_children thread (created from zed_exec_process to
manage zedlets) also waits for all children with the same PGID and can
stole the signal, causing the replace operation to be aborted.

As waitpid returns -1, the parent incorrectly assume that the child
process had an error or was killed. This, in turn, leaves the newly
added disk in REMOVED or UNAVAIL status rather than completing the
replace process.

This patch changes the PGID of the child process execuing the
prepare script, shielding it from the _reap_children thread.

Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
@shodanshok
Copy link
Contributor Author

shodanshok commented Dec 3, 2024

Rebased.

EDIT: I missed that rebasing would remove the accepted label, sorry.

@github-actions github-actions bot removed the Status: Accepted Ready to integrate (reviewed, tested) label Dec 3, 2024
@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Dec 4, 2024
@amotin amotin merged commit 1cd2419 into openzfs:master Dec 4, 2024
24 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Dec 5, 2024
When replacing a disk, a child process is forked to run a script called
zfs_prepare_disk (which can be useful for disk firmware update or health
check). The parent than calls waitpid and checks the child error/status
code.

However, the _reap_children thread (created from zed_exec_process to
manage zedlets) also waits for all children with the same PGID and can
stole the signal, causing the replace operation to be aborted.

As waitpid returns -1, the parent incorrectly assume that the child
process had an error or was killed. This, in turn, leaves the newly
added disk in REMOVED or UNAVAIL status rather than completing the
replace process.

This patch changes the PGID of the child process execuing the
prepare script, shielding it from the _reap_children thread.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes openzfs#16801
arter97 pushed a commit to arter97/zfs that referenced this pull request Dec 9, 2024
When replacing a disk, a child process is forked to run a script called
zfs_prepare_disk (which can be useful for disk firmware update or health
check). The parent than calls waitpid and checks the child error/status
code.

However, the _reap_children thread (created from zed_exec_process to
manage zedlets) also waits for all children with the same PGID and can
stole the signal, causing the replace operation to be aborted.

As waitpid returns -1, the parent incorrectly assume that the child
process had an error or was killed. This, in turn, leaves the newly
added disk in REMOVED or UNAVAIL status rather than completing the
replace process.

This patch changes the PGID of the child process execuing the
prepare script, shielding it from the _reap_children thread.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes openzfs#16801
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants