list_del corruption. next->prev should be ffff9df513f39368, but was dead000000000122 and freeze in abd_free_gang_abd #10401
Comments
Could this be related to the recent commit fb82226? |
Adding the gang ABD type, which allows for linear and scatter ABDs to be chained together into a single ABD. This can be used to avoid doing memory copies to/from ABDs. An example of this can be found in vdev_queue.c in the vdev_queue_aggregate() function. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian <bwa@clemson.edu> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10069
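To make the idea in that commit message concrete, here is a minimal userspace sketch of chaining separate buffers behind one handle so they can be walked in order without first copying them into a contiguous buffer. It is not the ZFS implementation: `toy_child`, `toy_gang`, `toy_gang_add` and `toy_gang_checksum` are hypothetical names invented for illustration, and the real ABD code is considerably more involved.

```c
#include <stddef.h>

/* Hypothetical stand-in for a child buffer (linear or scatter in real ZFS). */
struct toy_child {
    const unsigned char *buf;   /* borrowed data, never copied */
    size_t len;
    struct toy_child *next;     /* single link: one chain at a time */
};

/* Hypothetical stand-in for a gang: an ordered chain of children. */
struct toy_gang {
    struct toy_child *head;
    struct toy_child *tail;
    size_t size;                /* total bytes across all children */
};

/* Append a child to the chain; only pointers are updated. */
static void toy_gang_add(struct toy_gang *g, struct toy_child *c)
{
    c->next = NULL;
    if (g->tail != NULL)
        g->tail->next = c;
    else
        g->head = c;
    g->tail = c;
    g->size += c->len;
}

/* Walk every child in order and sum its bytes, with no intermediate copy. */
static unsigned long toy_gang_checksum(const struct toy_gang *g)
{
    unsigned long sum = 0;
    for (const struct toy_child *c = g->head; c != NULL; c = c->next)
        for (size_t i = 0; i < c->len; i++)
            sum += c->buf[i];
    return sum;
}
```

Note that each child in this sketch carries a single link, so it can belong to only one chain at a time. That design constraint is worth keeping in mind when reading the rest of the thread.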
@bwatkinson or @mmaybee may want to take a look. |
@dioni21 What vdevs were in your zpool when this occurred? The only place, currently, that a gang ABD is allocated is in Even that confuses me though. Somehow next->prev for the list entry is set to LIST_POISON2 when looking at the kernel 5.6 code (include/linux/poison.h shows how LIST_POISON2 evaluates to the 0xdead address + 0x122). Even though the trace does not show it, this error appears to come from |
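As a rough illustration of how the value in the issue title decodes, the sketch below reproduces the poison arithmetic in userspace. It assumes the x86_64 default of 0xdead000000000000 for the kernel's illegal-pointer offset and the 0x100/0x122 offsets used by include/linux/poison.h in kernel 5.6; the helper `looks_like_list_poison` is a hypothetical name, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86_64 default for CONFIG_ILLEGAL_POINTER_VALUE. */
#define POISON_POINTER_DELTA 0xdead000000000000ULL

/* Offsets as in include/linux/poison.h for kernel 5.6 (expressed as integers here). */
#define LIST_POISON1 (0x100ULL + POISON_POINTER_DELTA)
#define LIST_POISON2 (0x122ULL + POISON_POINTER_DELTA)

/* Hypothetical helper: does a pointer value from a trace match a list poison? */
static bool looks_like_list_poison(uint64_t ptr)
{
    return ptr == LIST_POISON1 || ptr == LIST_POISON2;
}
```

Under these assumptions, `dead000000000122` from the title matches LIST_POISON2, which the kernel writes into the `prev` field of an entry when it is removed with `list_del()`. In other words, the neighbouring entry had already been deleted from a list.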
@bwatkinson My pool is composed of a single 3-way mirror vdev (please, don’t ask), a mirror special, a mirror slog and a two-disk cache. I can post specifics if really needed. My kernel is just the default from Fedora 31, no local compilation, but with some dynamic modules added. By |
Yes, the copies are what I was referring to as ditto blocks. This is still super strange to me, and I haven’t been able to figure out how this happens. I will experiment with setting up a mirror with multiple copies for the data blocks to see if I can replicate this with the compile flags you posted. Please let me know if you encounter this again in the meantime. Maybe we can get more stack traces if it does occur again. |
Note that the error was not immediate, and my home dir (my busiest dataset) has copies set. You may need to do some disk burn-in to trigger the bug. Other than another stack trace, is there something else I can do to help? Some specific compile flag or runtime parameter? |
I think if it is triggered again, any stack traces you could gather would be helpful. Other than that, I will see if I can get this to occur myself. It is just really strange, because ASSERTs in the ABD code should have been hit/failed before the validation code in the kernel for the list link set off this error. |
@bwatkinson It just happened again. This time I have a suspect: integration with kernel-mode NFS. I almost never use NFS to access my ZFS data at home, but sometimes I do. Last time it crashed, I was using the NFS client. This may explain the rarity of occurrences and maybe the lack of detection in testing. Here's today's stack trace. We can see two dumps: one at 21:18 in the ABD code, and another at 21:31 for NFS, probably trying to access already frozen ZFS data.
I'll try to use NFS more now, just in case. |
Note: I do not use encryption. |
Good to know. Unfortunately I am not sure when I'll be able to test. Due to the Covid pandemic I'm a bit far from my main machine. I'm curious: how did you find it? Could you give a quick debrief? |
So @sdimitro was able to get a stack trace where we were able to see all the link values in the gang ABD after this crash occurred. We found that each of the child ABDs' link prev values was pointing at LIST_POISON2. After reasoning this out, we both came to the conclusion that a race condition existed when adding and removing a single ABD to and from multiple gang ABDs at the same time. I previously was not locking the child ABDs when removing them from a gang ABD, but we are now locking the children before removing them so there is a consistent view of the ABD link statuses. If you are able to apply this patch with your previous setup, let me know if the issue occurs again. I am also going to fire up a VM replicating your setup, but I have not been able to get the LIST_POISON2 bug to occur previously in my VMs. I am hoping this patch will resolve the issue. |
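The following userspace toy model shows one interleaving that would produce the reported symptom, plus a removal helper that takes a per-child lock first. It is only a sketch under stated assumptions: the list helpers imitate the kernel's poison-on-delete behaviour, `struct child`, its `linked` flag and `child_remove` are hypothetical, and none of this is the code from the actual patch.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct node { struct node *next, *prev; };

/* Small stand-in poison values (the kernel adds an architecture offset). */
#define POISON1 ((struct node *)(uintptr_t)0x100)
#define POISON2 ((struct node *)(uintptr_t)0x122)

static void list_init(struct node *head) { head->next = head->prev = head; }

static void list_add_tail(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Imitates the sanity check that reported "next->prev should be ...". */
static bool del_entry_valid(const struct node *n)
{
    if (n->next == POISON1 || n->prev == POISON2)
        return false;               /* entry was already deleted */
    return n->next->prev == n && n->prev->next == n;
}

/* Unlink the entry and poison its pointers, as list_del() does. */
static void list_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = POISON1;
    n->prev = POISON2;
}

/* Hypothetical child: one link, guarded by its own lock. */
struct child {
    struct node link;
    pthread_mutex_t lock;
    bool linked;
};

/* Remove under the child's lock so the link state is seen consistently. */
static bool child_remove(struct child *c)
{
    bool removed = false;
    pthread_mutex_lock(&c->lock);
    if (c->linked && del_entry_valid(&c->link)) {
        list_del(&c->link);
        c->linked = false;
        removed = true;
    }
    pthread_mutex_unlock(&c->lock);
    return removed;
}
```

In this model, if entries `a` and `b` sit on one list and `b`'s single link is then added to and deleted from a second list, `a.next` still points at `b` while `b.prev` holds the poison value, so `del_entry_valid(&a)` fails in the same way as the message in the issue title. Serialising changes to a child's link behind its lock is one way to rule out that interleaving.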
@bwatkinson I see that there are still some changes being made. Where can I get the latest version? https://github.com/bwatkinson/zfs/tree/gang_abd_debug ? The latest commit seems to be a277205, which is referred to as the last force push on #10511, but its date is Jun 9, not Jul 8! I am still remote, but my computer keeps freezing from the same bug. I'd love to try your patch. |
You should be able to pull in the changes from the PR now if you would like to try them out. I have just been making minor changes that will have no effect on the functionality of the patch. |
Resolved by #10511 |
System information
Describe the problem you're observing
I just had to reboot because ZFS was frozen
Describe how to reproduce the problem
I don't know what triggered the bug, but my system is compiled with debug options:
./configure --enable-silent-rules --enable-dependency-tracking --enable-asan --enable-debuginfo --enable-debug --enable-debug-kmem --enable-debug-kmem-tracking
Include any warning/errors/backtraces from the system logs
From /var/log/messages (it is not stored on ZFS, so it could be written):