Copy files with xattr kills my box #3143
Comments
@tobias-k which server is crashing, A or B? Are you able to cause this problem when using two newly created pools? What version of ZoL are you using? |
I do copies with zfs send and that works fine (xattr=sa); perhaps the extra load of the rsync is the real cause? |
Though I haven't looked into the code, this could be the rwsem version of the mutex_unlock/free race. |
@behlendorf the destination server B crashed. And yes, same behavior on newly created pools. I'm running v0.6.3-5~trusty from the Ubuntu PPA repo. |
@tuxoko now that you mention it, it sure does look similar to the mutex issue. The code must be spinning on the |
@behlendorf Yeah, I also took a quick look. Both the ZAP and the ZAP leaf are freed by the dbuf evict path, and I haven't found anything suspicious. Nonetheless, there must be some sort of memory corruption of the spinlock, because every other CPU is idle, so there's no way any of them can hold a spinlock. |
One possible cause is that we're somehow accidentally damaging the leaves in memory. There have been a very small number of bugs reported (#1445, #2861) which show that somehow a trashed leaf was written to disk. If it did somehow get overwritten in memory that is one way the spinlock could be damaged. It's in part what prompted me to open #3138 to make checking for this even more rigorous. However, because the reports have been rare we've never been able to identify a root cause. |
I think you're right. #2861 has a very similar lockup, and it consistently occurs on only a few machines. It's quite possibly caused by a damaged pool. |
I decided to take a peek at recent changes to the rwsem code in the kernel. I was initially made curious by the realization that openzfs/spl@46aa7b3 and its successors are pretty much always enabled these days (in the mainly x86_64 world), because torvalds/linux@29671f2, to which it implicitly refers (first appearing in 2.6.33), only modified the generic rwsem code and didn't do anything for the machine-specific version which is always used by x86_64. In any case, I wanted to raise the possibility that torvalds/linux@c8de2fa might be an interesting inflection point (first landing in 3.10) because it pulled in a whole lot of patches to the rwsem code. I've not reviewed any of it in detail yet, but a cursory overview makes me wonder if something happened there that may have an impact (positive or negative) on ZoL. |
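To see which side of that 3.10 boundary a given box sits on, a small check like the following could help. This is only a sketch: the 3.10 threshold is taken from the comment above, the function names are hypothetical, and whether the rwsem changes actually matter for this lockup is unconfirmed.

```python
import re


def kernel_version(release):
    """Parse the leading 'major.minor' out of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_rwsem_rework(release):
    """True if the kernel is 3.10 or newer (threshold assumed from the thread)."""
    return kernel_version(release) >= (3, 10)


# Sample release strings for illustration only; not taken from the issue.
for sample in ("3.2.0-4-amd64", "3.13.0-46-generic"):
    print(sample, has_rwsem_rework(sample))
```

On a live system the release string can be obtained with `uname -r`, or with `platform.release()` from Python.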
@dweeezil |
@tobias-k @tomposmiko If possible, please try #3223 to see if it helps with this issue. Thanks |
@nedbass |
@nedbass |
@tomposmiko thanks for trying the patch. You must be running into a different issue. It would be helpful if you collect any stack traces that appear in |
@tomposmiko I see, that does look like the same issue. I suspect the directory is corrupt on disk in your case, so my patch wouldn't help with that, but it may prevent the corruption from happening in the first place. |
@nedbass
If so, how can I test the patch for sure?
|
@tomposmiko Unfortunately your situation is not a useful test for the patch. The patch should prevent the ZAP object from getting damaged in memory before it is written out to disk, but in your case it seems to already be damaged on disk. I'm hoping it helps @tobias-k since the lockup occurs every time his rsync job runs, so it must be getting damaged in memory. |
@nedbass actually I'm not able to do any further testing. The system is hot. Next week I can create a lab environment and test the patch. |
@tobias-k @tomposmiko is this still an outstanding issue in either the 0.6.4.2 or master branch? |
I disabled xattr and acltype properties after a while. |
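For readers who want to apply the same workaround, the commands would look roughly like the ones built below. This is a sketch under assumptions: the comment does not say which values were used, so `xattr=off` and `acltype=noacl` are assumed, and `tank/data` is a placeholder dataset name.

```python
import shlex


def build_disable_cmds(dataset):
    """Return `zfs set` argv lists that turn off xattrs and POSIX ACLs."""
    return [
        ["zfs", "set", "xattr=off", dataset],      # assumed value
        ["zfs", "set", "acltype=noacl", dataset],  # assumed value
    ]


# 'tank/data' is a hypothetical dataset; substitute your own.
for cmd in build_disable_cmds("tank/data"):
    print(shlex.join(cmd))
```

The commands are only printed here, not executed, so they can be reviewed before being run as root.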
Yes if you're running 0.6.4 or newer. All the known SA issues have been addressed and we haven't seen any new ones. |
Based on the above comments, it sounds as if all the SA xattr fixes have resolved this issue.
Situation:
Server A has a zpool and zfs volumes with the following settings
Server B is a clean installation with a new zpool and zfs volumes. Same settings as A.
Copying files from Server A to B including xattrs (rsync) causes an instant crash of the B box.
Copying files from Server A to B WITHOUT xattrs works fine. Setting xattrs after the copy works fine.
Once the box has crashed while copying, you can still access the volumes and browse them. Deleting files copied from box A with xattrs causes the same crash. The only option is deleting the entire zfs and starting again.
100% reproducible in my case.
Looks like a bug in handling xattrs in this specific combination of settings.
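The two copy variants described above can be sketched as follows. The exact rsync invocation is not given in the report, so the flags are an assumption (`-a` for archive mode, plus `-X` to preserve extended attributes), and the paths and host name are placeholders.

```python
import shlex


def build_rsync_cmd(src, dest, with_xattrs):
    """Return the rsync argv for copying src to dest.

    -a is rsync's archive mode; -X (--xattrs) additionally preserves
    extended attributes.
    """
    flags = ["-a"]
    if with_xattrs:
        flags.append("-X")
    return ["rsync", *flags, src, dest]


# Hypothetical paths; the report does not name the datasets or hosts.
with_x = build_rsync_cmd("/tank/data/", "serverB:/tank/data/", with_xattrs=True)
without_x = build_rsync_cmd("/tank/data/", "serverB:/tank/data/", with_xattrs=False)
print(shlex.join(with_x))     # variant reported to crash server B
print(shlex.join(without_x))  # variant reported to work
```

Only the command lines are printed; running the first one against an affected pool may lock up the destination machine.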