Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZTS: Fix mmp_interval failure #8906

Merged
merged 1 commit into from
Jun 19, 2019
Merged

Conversation

behlendorf
Copy link
Contributor

Motivation and Context

After updating the Fedora CI builder from 29 to 30 the ZTS mmp_interval
test case was observed to consistently fail. This was causing the CI to
mark all PRs as failed.

Description

The mmp_interval test case was failing on Fedora 30 due to the built-in
echo command terminating the script when it was unable to write to
the sysfs module parameter. This change in behavior was observed with
ksh-2020.0.0-alpha1. Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

How Has This Been Tested?

Locally tested on Fedora 30. Pending full run of the ZTS via the CI.

$ ./scripts/zfs-tests.sh -t tests/functional/mmp/mmp_interval
Test: tests/functional/mmp/setup (run as root) [00:00] [PASS]
Test: tests/functional/mmp/mmp_interval (run as root) [00:00] [PASS]
Test: tests/functional/mmp/cleanup (run as root) [00:00] [PASS]

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added Component: Test Suite Indicates an issue with the test framework or a test case Status: Code Review Needed Ready for review and testing labels Jun 14, 2019
@behlendorf behlendorf requested a review from ofaaland June 14, 2019 21:03
@codecov
Copy link

codecov bot commented Jun 15, 2019

Codecov Report

Merging #8906 into master will increase coverage by 0.24%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8906      +/-   ##
==========================================
+ Coverage   78.51%   78.76%   +0.24%     
==========================================
  Files         382      382              
  Lines      117840   117818      -22     
==========================================
+ Hits        92526    92797     +271     
+ Misses      25314    25021     -293
Flag Coverage Δ
#kernel 79.27% <ø> (+0.03%) ⬆️
#user 67.41% <ø> (+0.38%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1b4ac2...9f4afa3. Read the comment docs.

Copy link
Contributor

@ofaaland ofaaland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ofaaland ofaaland added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jun 18, 2019
@behlendorf behlendorf merged commit 4ca457b into openzfs:master Jun 19, 2019
@krader1961
Copy link

@behlendorf I'm one of the primary maintainers of ksh and am worried this issue represents a regression to ksh behavior. Any chance I can impose on you to instrument libtest.shlib to report the value of $PATH at the point the echo -n "$value" would have been executed. And do the same echo but redirect its output somewhere else and report whether the output is just the value or -n value. If it's behaving like the BSD echo the -n will suppress the newline. If it's behaving like the SysV echo the -n will be literally echoed.

We recently removed the getconf builtin and I'm worried that refactoring may have resulted in a change to the behavior of echo. That's because whether or not echo behaves like BSD/GNU or SysV/ATT depends on the content of $PATH. That logic used to be in the getconf code. We moved that logic into a new function so that all the other getconf baggage could be jettisoned.

@behlendorf
Copy link
Contributor Author

@krader1961 more than happy to. What I'm seeing after instrumenting libtest.shlib is PATH=/home/behlendo/src/git/zfs/bin, which is what I'd expect since the script explicitly constrains the path when running the test suite. And the logged value does not contain a trailing newline.

Here's the small change I made, and the exact file contents for comparison. Let me know if there's something else you'd like me to instrument.

        Linux)
                typeset zfs_tunables="/sys/module/$module/parameters"
                [[ -w "$zfs_tunables/$tunable" ]] || return 1
+               echo "$PATH" >/var/tmp/PATH.log
+               echo -n "$value" >/var/tmp/echo.log
                echo -n "$value" > "$zfs_tunables/$tunable"
                return "$?"
                ;;
$ hexdump -C /var/tmp/PATH.log 
00000000  2f 68 6f 6d 65 2f 62 65  68 6c 65 6e 64 6f 2f 73  |/home/behlendo/s|
00000010  72 63 2f 67 69 74 2f 7a  66 73 2f 62 69 6e 0a     |rc/git/zfs/bin.|
0000001f
$ hexdump -C /var/tmp/echo.log 
00000000  30                                                |0|
00000001

@krader1961
Copy link

Thanks, @behlendorf. That info shows it's not a regression we introduced recently. I suspect it was introduced by that AT&T team after the ksh93u+ release and before the project was moved to Github. I'll try to confirm that. Replacing the echo with cat shouldn't be necessary.

@krader1961
Copy link

I can't reproduce this using ksh93u+, ksh93v-, or the current ksh beta release on OpenSuse when doing echo 0 > /sys/module/sysrq/parameters/sysrq_downtime_ms; i.e., I can successfully modify that parameter with all three ksh versions.

@behlendorf Can you provide more information regarding the nature of the failure when using echo? Was there an error message? What does ls -l "$zfs_tunables/$tunable" output? Is there any reason to think that path would behave differently than the sysrq path I used for my test?

@behlendorf
Copy link
Contributor Author

@krader1961 I suspect what might be relevant here is that this test case is unusual in that it is testing that the a module option cannot be set to an invalid value. I'm also unable to reproduce the issue when setting it to an allowed value. Here's the error message I'm seeing logged:

13:05:42.64 /home/behlendo/src/git/zfs/tests/zfs-tests/tests/functional/mmp/mmp_interval.ksh[43]: set_tunable64[3468]: set_tunable_impl[3499]: echo: write to 1 failed [Invalid argument]
13:05:42.65 ### 10375 Function backtrace:
13:05:42.65 1   ?? ??:0
13:05:42.65 2   ?? ??:0
13:05:42.65 3   ?? ??:0

I tried installing the Fedora ksh-debuginfo package hoping to get a better backtrace (which I assume is from ksh?) but didn't have any luck. Please let me know if there's a way I can enable some more detailed debugging.

I also verified that replacing ksh's built-in echo with the system's /bin/echo binary behaves as expected. The command cleanly fails. e.g.

13:08:28.02 /bin/echo: write error: Invalid argument

@ikozhukhov
Copy link
Contributor

i think, we have 2 options:

  1. try to use printf instead of echo
  2. move to BASH

@richardelling
Copy link
Contributor

+1 for printf

@krader1961
Copy link

Okay, I can reproduce the write error by doing sudo ksh -c 'echo >/sys/module/sysrq/parameters/reset_seq'. Using cat also results in a write error. The real issue here is that your ksh process is dying. Presumably from a SIGSEGV or SIGABRT (i.e., an abort() or assert() call) since those are the only two situations the current ksh will attempt to emit a function backtrace. Presumably it's not an assert() since we're not seeing the associated error message.

While my attempt to reproduce the problem on x86 Linux results in the write error it does not cause the ksh process to exit. Which is going to make debugging this more difficult if I can't reproduce the failure. And I don't have access to a Linux on Zseries system.

@behlendorf I hate to continue imposing on you but any chance you can enable core dumps (ulimit -c unlimited) and reproduce the failure? Then, as a starting point use gdb's bt command to get us a useful backtrace?

P.S., The ksh backtrace facility I added simply uses the libc backtrace() function. I'm guessing it's not producing useful output because your ksh binary is stripped.

@krader1961
Copy link

Also, I'm more than happy to debug this myself if I can get access to a Linux on Zseries instance.

@behlendorf
Copy link
Contributor Author

@krader1961 happy to help. Using your example I was able to easily cause the failure now. This was using a clean install of Fedora 30 (x86_64) running under libvirt. Nothing fancy, so you should be able to reproduce it the same way.

$ sudo /bin/bash
$ ulimit -c  unlimited

$ ksh -c 'echo >/sys/module/sysrq/parameters/reset_seq'
ksh: echo: write to 1 failed [Invalid argument]
### 18766 Function backtrace:
1   (null) + 94355638756193
2   (null) + 139689715015424
3   (null) + 94355639741216
Aborted (core dumped)

$ coredumpctl debug
           PID: 18766 (ksh)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 6 (ABRT)
     Timestamp: Fri 2019-06-21 15:21:08 PDT (17s ago)
  Command Line: ksh -c echo >/sys/module/sysrq/parameters/reset_seq
    Executable: /usr/bin/ksh93
 Control Group: /user.slice/user-1000.slice/session-1.scope
          Unit: session-1.scope
         Slice: user-1000.slice
       Session: 1
     Owner UID: 1000 (behlendo)
       Boot ID: 91d71c6c20cd4f48a9ed5a161d706a9e
    Machine ID: c3f950cadcc34cbf8192b6dc221e7d33
      Hostname: localhost.localdomain
       Storage: /var/lib/systemd/coredump/core.ksh.0.91d71c6c20cd4f48a9ed5a161d706a9e.18766.1561155668000000.lz4
       Message: Process 18766 (ksh) of user 0 dumped core.
                
                Stack trace of thread 18766:
                #0  0x00007f0c0bd79e75 __GI_raise (libc.so.6)
                #1  0x00007f0c0bd64895 __GI_abort (libc.so.6)
                #2  0x000055d0e254a366 handle_sigsegv (ksh93)
                #3  0x00007f0c0bd79f00 __restore_rt (libc.so.6)
                #4  0x000055d0e263ab20 _Sfstdout (ksh93)

GNU gdb (GDB) Fedora 8.3-3.fc30
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/ksh93...
Reading symbols from /usr/lib/debug/usr/bin/ksh93-2020.0.0-0.2.fc30.x86_64.debug...

warning: core file may not match specified executable file.
[New LWP 18766]

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments
Core was generated by `ksh -c echo >/sys/module/sysrq/parameters/reset_seq'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	  return ret;
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f0c0bd64895 in __GI_abort () at abort.c:79
#2  0x000055d0e254a366 in handle_sigsegv (signo=<optimized out>, info=<optimized out>, context=<optimized out>) at ../src/cmd/ksh93/sh/fault.c:73
#3  <signal handler called>
#4  0x000055d0e263ab20 in _Sfstdout ()
#5  0x000055d0e263ab20 in ?? ()
#6  0x0000000000000020 in ?? ()
#7  0x000055d0e263ab20 in ?? ()
#8  0x0000000000008000 in ?? ()
#9  0x000055d0e25c8e36 in _sfexcept (f=0x7f0c0bd3f6c0, f@entry=0x55d0e263ab20 <_Sfstdout>, type=type@entry=2, io=<optimized out>, io@entry=-1, 
    disc=disc@entry=0x55d0e2d77b60) at ../src/lib/libast/sfio/sfexcept.c:53
#10 0x000055d0e2591c86 in sfwr (f=f@entry=0x55d0e263ab20 <_Sfstdout>, buf=buf@entry=0x55d0e2d79870, n=n@entry=1, disc=0x55d0e2d77b60)
    at ../src/lib/libast/sfio/sfwr.c:214
#11 0x000055d0e25826c7 in _sfflsbuf (f=f@entry=0x55d0e263ab20 <_Sfstdout>, c=<optimized out>, c@entry=-1) at ../src/lib/libast/sfio/sfflsbuf.c:90
#12 0x000055d0e258b3d8 in sfsync (f=f@entry=0x55d0e263ab20 <_Sfstdout>) at ../src/lib/libast/sfio/sfsync.c:124
#13 0x000055d0e25711f6 in b_print (argc=argc@entry=0, argv=0x55d0e2d698f8, argv@entry=0x55d0e2d698f0, context=context@entry=0x7ffd12fb10e0)
    at ../src/cmd/ksh93/bltins/print.c:403
#14 0x000055d0e257171d in B_echo (argc=<optimized out>, argv=0x55d0e2d698f0, context=<optimized out>) at ../src/cmd/ksh93/bltins/print.c:142
#15 0x000055d0e252c8ae in sh_exec () at ../src/cmd/ksh93/sh/xec.c:1218
#16 0x000055d0e24ff478 in exfile (shp=shp@entry=0x55d0e263f400 <sh>, iop=iop@entry=0x55d0e2d76f80, fno=-1, fno@entry=0) at ../src/cmd/ksh93/sh/main.c:515
#17 0x000055d0e25001fb in sh_main (ac=<optimized out>, av=<optimized out>, userinit=<optimized out>) at ../src/cmd/ksh93/sh/main.c:313
#18 0x00007f0c0bd65f33 in __libc_start_main (main=0x55d0e24feef0 <main>, argc=3, argv=0x7ffd12fb1bf8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7ffd12fb1be8) at ../csu/libc-start.c:308
#19 0x000055d0e24fef9e in _start ()
(gdb) 

@krader1961
Copy link

@behlendorf Much thanks for bringing this to our attention. This bug was introduced by me nearly a year ago as a consequence of resolving a lot of lint warnings involving the errormsg() and error() functions. It only manifests when an atypical I/O error occurs and even then only with a non-debug build of the ksh binary. That is, it's essentially a heisenbug . I expect to merge a fix in the next 48 hours after more thought about those functions and how best to undo the previous lint cleanups without introducing more bugs or lint warnings.

FWIW, We would have jumped on this earlier, before you created this PR, had an issue been opened against the ksh project. The only reason I dug into this is because @siteshwar, who works for RedHat, noticed it and brought it to my attention.

@siteshwar
Copy link

siteshwar commented Jun 22, 2019

@behlendorf Can we find a way to continuously test upstream ksh with upstream zfs on linux ? I have been doing package builds through copr for Fedora and Cent OS, you should be able to use them in your CI.

@behlendorf
Copy link
Contributor Author

@krader1961 I'm glad you were able to identify it, and happier to see it fixed upstream. Thanks for the pointer to the ksh project, if we manage uncover any other strange issues I'll make sure to open a new issue.

@siteshwar we could easily enough have the Fedora builder pull the latest packages from copr. But I'm not sure how much benefit you'd really be able to derive from our additional testing.

@siteshwar
Copy link

we could easily enough have the Fedora builder pull the latest packages from copr. But I'm not sure how much benefit you'd really be able to derive from our additional testing.

It would help a lot. The problem is that ksh source code has evolved over more than 30 years and that makes it extremely hard to maintain. As we move forward, we would like to do more code refactoring to make it maintainable. Directly working with projects like zfs will help us ensure a smoother transition to new ksh, and we will be more confident that our changes do not introduce regressions.

@behlendorf
Copy link
Contributor Author

behlendorf commented Jun 28, 2019

@siteshwar in that case, I think we can give it a try. I'll see about updating at least one of our CI bots to install your latest copr packages for testing.

@siteshwar
Copy link

@behlendorf Awesome. Thanks!

tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 13, 2019
The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8906
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 22, 2019
The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8906
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Aug 23, 2019
The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8906
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 17, 2019
The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8906
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 18, 2019
The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8906
tonyhutter pushed a commit that referenced this pull request Sep 26, 2019
The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8906
@behlendorf behlendorf deleted the mmp_interval branch April 19, 2021 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Test Suite Indicates an issue with the test framework or a test case Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants