Fix race in libzfs_run_process_impl #16801

shodanshok · 2024-11-22T18:41:28Z

When replacing a disk, a child process is forked to run a script called zfs_prepare_disk (which can be useful for a disk firmware update or health check). The parent then calls waitpid and checks the child's error/status code.

However, the ZED _reap_children thread (created from zed_exec_process to manage zedlets) also waits for all children with the same PGID and can steal the signal, causing the replace operation to be aborted.

As waitpid returns -1, the parent incorrectly assumes that the child process had an error or was killed. This, in turn, leaves the newly added disk in REMOVED or UNAVAIL status rather than completing the replace process.

This patch changes the PGID of the child process executing the prepare script, shielding it from the _reap_children thread.
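As a rough illustration of the shielding idea described above, here is a minimal sketch, not the actual libzfs change. It assumes the second waiter uses a group-wide wait (pid 0, meaning "any child in my process group"), which is how the description characterizes _reap_children; a single-threaded waitpid(0, ...) call stands in for that thread. The function name run_shielded_child is hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the libzfs code: fork a child in its own
 * process group, run a group-wide wait, then wait for the child by pid.
 * Returns the child's exit code, or -1 if the child could not be reaped.
 */
int
run_shielded_child(int exit_code)
{
    int status;
    pid_t pid, w;

    pid = fork();
    if (pid == -1)
        return (-1);

    if (pid == 0) {
        /* Child: become the leader of a new process group. */
        (void) setpgid(0, 0);
        _exit(exit_code);
    }

    /*
     * Parent: make the same call so the child is in its own group
     * no matter which side runs first. Failure is ignored because
     * the child may already have done it (or already exited).
     */
    (void) setpgid(pid, pid);

    /*
     * Stand-in for a second waiter: pid 0 means "any child in the
     * caller's process group", so the shielded child is not matched.
     */
    w = waitpid(0, &status, WNOHANG);
    if (w == pid)
        return (-1);

    /* The targeted wait still finds the child and its exit status. */
    do {
        w = waitpid(pid, &status, 0);
    } while (w == -1 && errno == EINTR);

    if (w != pid || !WIFEXITED(status))
        return (-1);
    return (WEXITSTATUS(status));
}
```

Calling run_shielded_child(7) from a process with no other children should return 7: the group-wide wait does not match the child, so its exit status is still there for the targeted waitpid.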


tonyhutter · 2024-11-26T19:13:01Z

I see that the waitpid() man page example code (https://linux.die.net/man/2/waitpid) is a little different from the way we do things in libzfs_run_process_impl(). If we just adapt that code, does it fix the issue you're seeing?

diff --git a/lib/libzfs/libzfs_util.c b/lib/libzfs/libzfs_util.c
index 1f7e7b0e6..951feb1a0 100644
--- a/lib/libzfs/libzfs_util.c
+++ b/lib/libzfs/libzfs_util.c
@@ -963,12 +963,14 @@ libzfs_run_process_impl(const char *path, char *argv[], char *env[], int flags,
        } else if (pid > 0) {
                /* Parent process */
                int status;
-
-               while ((error = waitpid(pid, &status, 0)) == -1 &&
-                   errno == EINTR)
-                       ;
-               if (error < 0 || !WIFEXITED(status))
-                       return (-1);
+               do {
+                       error = waitpid(pid, &status, WUNTRACED | WCONTINUED);
+                       if (error == -1)
+                               return (-1);
+                       if (WIFEXITED(status) || WIFSIGNALED(status) ||
+                           WIFSTOPPED(status) || WIFCONTINUED(status))
+                               return (-1);
+               } while (!WIFEXITED(status) && !WIFSIGNALED(status));
 
                if (lines != NULL) {
                        close(link[1]);

shodanshok · 2024-11-27T07:34:51Z

@tonyhutter I don't think it would improve the issue at hand.

error = waitpid(pid, &status, WUNTRACED | WCONTINUED);
if (error == -1)
        return (-1);

This code would return an error if the child exited before the parent had a chance to check it, the same as the current code. While this kind of check is correct in many cases (i.e., when a child exiting that quickly is not expected), it is not correct for this specific operation (replacing a disk with an empty prepare script).

This is how I understand it, at least.
Thanks.

amotin · 2024-11-30T15:44:31Z

> This code would return error if the child exited before the parent had a chance to check it

@shodanshok I think you misunderstand how it works. There should be no race. Please see the "Notes" section of the man page. Besides, I am not sure it is correct to check status if waitpid() returned an error. Checking the FreeBSD kernel, it seems the status is not set when the syscall fails.

shodanshok · 2024-12-01T15:59:15Z

@amotin I see what you mean, and I think you are right. Upon further inspection, I suspect the issue is related to the double wait done by the zed event handler and libzfs_util.

If I am not mistaken, when replacing a disk via zed the following happens:

a disk add event is handled by zfs_process_add, which waits inside libzfs_run_process_impl for the prepare script
libzfs_run_process_impl forks and execs the prepare script
concurrently, a syslog event handler is started by _zed_exec_fork_child, then waits inside _reap_children

Something seems to go wrong between these two forks/waits. I added a debug printf to libzfs_util.c, inside libzfs_run_process_impl, just after waiting for the child process to return:

printf ("DEBUG: error: %d, errno: %d, status: %d, normal: %d, code: %d\n", error, errno, status, WIFEXITED(status), WEXITSTATUS(status));

zed -v -F shows the following output:

Finished "(null)" eid=0 pid=71724 time=0.000979s exit=0
DEBUG: error: -1, errno: 10, status: 0, normal: 1, code: 0

Notice how:

zed shows a Finished "(null)" line, meaning some memory was corrupted / zeroed
error == -1 even though status == 0, WIFEXITED(status) == 1 and WEXITSTATUS(status) == 0

shodanshok · 2024-12-02T12:29:36Z

Indeed, the wait4 call inside _reap_children seems to steal the signal from the waitpid call inside libzfs_run_process_impl (it is timing dependent). This is because the wait4 call waits for all children with the same process group ID.

I have updated the patch with a possible solution. Thanks.
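The failure mode described here can be reproduced in isolation. The sketch below is only a single-threaded approximation: a waitpid(0, ...) call plays the role of the group-wide waiter, and the hypothetical function targeted_wait_after_steal reports the errno seen by the later targeted wait. It assumes the calling process has no other children. On Linux, ECHILD has the value 10, which matches the errno in the debug output quoted earlier in this thread.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: a group-wide wait reaps the child first, so the
 * wait by pid fails. Returns the errno of the failed targeted wait,
 * 0 if it unexpectedly succeeded, or -1 if the setup itself failed.
 */
int
targeted_wait_after_steal(void)
{
    int status;
    pid_t pid, w;

    pid = fork();
    if (pid == -1)
        return (-1);
    if (pid == 0)
        _exit(0);

    /* Stand-in for the second waiter: reap any child in our group. */
    do {
        w = waitpid(0, &status, 0);
    } while (w == -1 && errno == EINTR);
    if (w != pid)
        return (-1);

    /* The exit status has been consumed; the wait by pid now fails. */
    errno = 0;
    w = waitpid(pid, &status, 0);
    if (w == -1)
        return (errno);
    return (0);
}
```

Here targeted_wait_after_steal() is expected to return ECHILD. The status value seen by the second caller is not meaningful in that case, which is why macros such as WIFEXITED should not be trusted after a failed wait.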

amotin

My memories in this area are a bit rusty, but after reading some man pages this seems to make sense.


shodanshok · 2024-12-03T12:11:57Z

Rebased.

EDIT: I missed that rebasing would remove the accepted label, sorry.

When replacing a disk, a child process is forked to run a script called zfs_prepare_disk (which can be useful for a disk firmware update or health check). The parent then calls waitpid and checks the child's error/status code.

However, the _reap_children thread (created from zed_exec_process to manage zedlets) also waits for all children with the same PGID and can steal the signal, causing the replace operation to be aborted.

As waitpid returns -1, the parent incorrectly assumes that the child process had an error or was killed. This, in turn, leaves the newly added disk in REMOVED or UNAVAIL status rather than completing the replace process.

This patch changes the PGID of the child process executing the prepare script, shielding it from the _reap_children thread.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes openzfs#16801

behlendorf requested a review from tonyhutter November 23, 2024 22:33

behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label Nov 23, 2024

shodanshok force-pushed the replace branch 3 times, most recently from 72cea01 to 80f09a7 Compare December 2, 2024 12:22

shodanshok force-pushed the replace branch from 80f09a7 to ecdb5f9 Compare December 2, 2024 12:30

tonyhutter approved these changes Dec 2, 2024

behlendorf approved these changes Dec 3, 2024

behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Dec 3, 2024

behlendorf requested a review from amotin December 3, 2024 00:56

amotin approved these changes Dec 3, 2024

shodanshok force-pushed the replace branch from ecdb5f9 to 44b6024 Compare December 3, 2024 12:03

github-actions bot removed the "Status: Accepted" (Ready to integrate: reviewed, tested) label Dec 3, 2024

behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label Dec 4, 2024

amotin merged commit 1cd2419 into openzfs:master Dec 4, 2024
24 checks passed
