zfs-8.1 corrupt kernel stack kernel crash on s390x #8992
I bisected and double-checked the bisect point, and found the commit causing the issue: d99a015 is the first bad commit. |
It concerns me that one of the commit messages states: "Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow". I'm speculating that s390 is being bitten by stack overflow issues that these tests could have found if they had not been disabled. |
So the chain of events is:
8K stacks on s390 are not big enough for the new Lua handler. Sigh. |
zcp_eval() has a nice chunky 128-byte errmsg buffer in it. That's not helpful for small stacks. |
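For context, one common way to keep a buffer like this off a small stack is to allocate it from the heap instead. The following is only a userspace sketch under stated assumptions: `eval_with_heap_errmsg` and `ERRMSG_LEN` are hypothetical names, `malloc` stands in for whatever allocator kernel code would use, and none of this is the actual zcp_eval() code.

```c
#include <stdio.h>
#include <stdlib.h>

#define ERRMSG_LEN 128 /* same size as the buffer mentioned above */

/*
 * Hypothetical stand-in for an evaluation routine. It formats an error
 * message in a heap buffer instead of a 128-byte stack array, then copies
 * it to the caller. Returns 0 on success, -1 if the allocation fails.
 */
static int
eval_with_heap_errmsg(int status, char *out, size_t outlen)
{
    char *errmsg = malloc(ERRMSG_LEN);

    if (errmsg == NULL)
        return (-1);

    (void) snprintf(errmsg, ERRMSG_LEN, "evaluation failed: status %d",
        status);
    (void) snprintf(out, outlen, "%s", errmsg);
    free(errmsg);
    return (0);
}
```

The trade-off is an allocation on each call in exchange for 128 fewer bytes in the stack frame.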
It seems the Lua setjmp/longjmp buffer size is causing the stack corruption; this helps. Not sure what the correct size should be, though.
|
So, in conclusion, the stack size is fine, but the jmp buf size was the root cause. |
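To make that conclusion concrete, here is a minimal sketch of the arithmetic. It assumes the jump buffer is an array of register-sized slots, that 18 slots are needed (the value confirmed to work later in this thread), and that the undersized buffer had half that many, 9. The helper names are made up for illustration.

```c
#include <stddef.h>

/* Bytes available in a buffer of `cnt` slots of `slot_size` bytes each. */
static size_t
buf_bytes(size_t cnt, size_t slot_size)
{
    return (cnt * slot_size);
}

/* Bytes that a save of `needed` bytes would write past the buffer's end. */
static size_t
overrun_bytes(size_t needed, size_t available)
{
    return (needed > available ? needed - available : 0);
}
```

With 8-byte slots, `buf_bytes(18, 8)` is 144 and `buf_bytes(9, 8)` is 72, so `overrun_bytes(144, 72)` reports 72 bytes written beyond the buffer. If the buffer is a stack local, those bytes land on neighbouring stack memory, which would fit the "Corrupt kernel stack" symptom.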
I wonder if the same issue exists for the Sparc arch |
Thanks for running these tests on s390x. When this feature was merged all of the new ZTS tests were run on the supported platforms... with the exception of s390 since we didn't have access to a test system. Do I understand correctly that increasing the
It's not quite as bad as the commit message implies. In fact, the two tests which are skipped both explicitly test the maximum stack depth. Though I completely agree it needs to be addressed, https://github.com/zfsonlinux/zfs/blob/master/tests/zfs-tests/cmd/nvlist_to_lua/nvlist_to_lua.c#L265 |
I haven't yet tested this on the full set of ZFS regression tests, nor do I know if the change I made is optimal. I think we need an s390x kernel specialist to dig into this to determine the optimal size; however, the allocation I provided has plenty of slop, and it worked fine against the smaller set of regression tests we use when packaging ZFS for Ubuntu. |
BTW, I'm on vacation for 2 weeks from today, so I won't be able to provide much more info during this period. |
@ColinIanKing sounds good, thanks for letting me know. @don-brady do you recall how this value was determined for s390x? |
@behlendorf
I'm guessing I may have assumed the So perhaps the fix is:
|
Based on the documentation I could find, it looks like it's 4 bytes for s390 and 8 bytes for s390x. So your fix looks right to me, but we'll need @ColinIanKing to test it for us. |
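A compile-time check is one way to catch this kind of sizing mistake before it reaches a running kernel. The sketch below is illustrative only: the `SKETCH_*` names and `sketch_label_t` are hypothetical, the slot counts (18 for 8-byte longs, 9 for 4-byte longs) and the save-area sizes derived from them are assumptions, and the real ZFS Lua sources may be laid out differently.

```c
#include <limits.h>
#include <stddef.h>

/* Assumed values for illustration; pick the count by the width of long. */
#if ULONG_MAX > 0xFFFFFFFFUL
#define SKETCH_JMP_BUF_CNT 18   /* 8-byte unsigned long */
#define SKETCH_SAVE_BYTES  144  /* assumed save area: 18 * 8 */
#else
#define SKETCH_JMP_BUF_CNT 9    /* 4-byte unsigned long */
#define SKETCH_SAVE_BYTES  36   /* assumed save area: 9 * 4 */
#endif

/* Hypothetical jump buffer: an array of register-sized slots. */
typedef struct sketch_label {
    unsigned long val[SKETCH_JMP_BUF_CNT];
} sketch_label_t;

/* Fails the build if the buffer cannot hold the assumed save area. */
_Static_assert(sizeof (sketch_label_t) >= SKETCH_SAVE_BYTES,
    "jump buffer smaller than the register save area");

/* Size of the jump buffer in bytes, for callers that want to log it. */
static size_t
sketch_label_size(void)
{
    return (sizeof (sketch_label_t));
}
```

Had the count been 9 on a platform with 8-byte longs, the buffer would be 72 bytes against a required 144 and the assertion would stop the build.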
I'll be back at work this coming week so I'll give it a test. |
I can confirm that setting JMP_BUF_CNT to 18 works fine and passes all the Ubuntu ZFS regression tests, so I'm confident that this is a good fix. #elif defined(s390x) |
Please add my Reported-by and Tested-by: Colin Ian King <canonical.com> sign-offs to the fix :-) |
When adapting the original sources for s390x the JMP_BUF_CNT was mistakenly halved due to an incorrect assumption of the size of an unsigned long. They are 8 bytes for the s390x architecture. Increase JMP_BUF_CNT accordingly. Authored-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: Colin Ian King <canonical.com> Tested-by: Colin Ian King <canonical.com> Issue openzfs#8992
When adapting the original sources for s390x the JMP_BUF_CNT was mistakenly halved due to an incorrect assumption of the size of an unsigned long. They are 8 bytes for the s390x architecture. Increase JMP_BUF_CNT accordingly. Authored-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: Colin Ian King <canonical.com> Tested-by: Colin Ian King <canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#8992
@ColinIanKing thanks for verifying the fix, I've opened PR #9080. |
When adapting the original sources for s390x the JMP_BUF_CNT was mistakenly halved due to an incorrect assumption of the size of an unsigned long. They are 8 bytes for the s390x architecture. Increase JMP_BUF_CNT accordingly. Authored-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: Colin Ian King <canonical.com> Tested-by: Colin Ian King <canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#8992 Closes openzfs#9080
Summary: On s390x with Linux 5.0 and 5.2, ZFS 7.12 works fine when creating a snapshot; with ZFS 8.1, however, taking the snapshot crashes the s390 kernel with a "Corrupt kernel stack, can't continue" error and a stack dump (that I can't capture).
Running the scrub test script below with ZFS 8.1 on a 5.0 or 5.2 kernel on an s390x instance makes the kernel dump its stack with the error message "Corrupt kernel stack, can't continue". This does not occur with ZFS 7.12. The test works fine with the 5.0/5.2 kernels on arm64, arm64, ppc64 but not s390x.
Script:
It's getting late here in the UK, so I will pick this up tomorrow and try and bisect the issue down.