filesystems disappearing on reboot #2224
Thanks for taking the time to open the issue so we can track this.
Update... this just happened again today on reboot after upgrading from 0.6.2 to 0.6.3. Again, there was very likely a `zfs recv` in progress at the time. However, this time, when the system rebooted, instead of a filesystem being missing, it was renamed. The normal filesystem name is "ds/ds2", but `zfs list` showed this:

```
ds/ds2recv-3577-1   1.39T   1.11T   589M   /ds2
```

I then tried `zfs rename ds/ds2recv-3577-1 ds/ds2`, and that succeeded, but then only ds/ds2 existed; in other words, the subordinate filesystems ds/ds2/backups and ds/ds2/vm2 were gone:

```
ds/ds2   589M   2.50T   589M   /ds2
```

As you can see, the space used by the missing filesystems was freed up.

Then I decided to re-create the filesystem, so I ran `zfs destroy -r ds/ds2`, followed by `zfs create ds/ds2`. However, at this point, `zfs list` does not show ds/ds2 existing. Running `zfs create` would appear to succeed, but the filesystem was not created. If I run `zfs create ds/ds2` twice in very rapid succession, I get:

```
cannot create 'ds/ds2': dataset already exists
```

However, if I pause more than about a second between the two `zfs create`s, then it appears to succeed, but the filesystem isn't created. As before, the only apparent way to recover was to destroy the whole pool and re-create everything from scratch.
One more update. Sorry for the lack of a crash dump, but the receiving system just became inaccessible, i.e., there was no way to log into it and it was effectively "hung" while doing a recv. After pushing the hardware reset button, it came back up, but one of the ZFS filesystems is missing. Below is the kernel log from the first indication that it was stuck. Thanks,

```
Jul 21 08:38:27 ss2 kernel: [472320.484011] INFO: task kswapd0:37 blocked for more than 120 seconds.
```
I think we're getting closer to being able to reproduce this. In our experience, 0.6.3 has been somewhat less stable than 0.6.2, having "stopped" a few times. When this happens, the system is running, but any attempt to run a `zfs` command (to create or destroy a snapshot, etc.) will hang, and the `zfs receive` in progress is totally stuck. The system won't do a clean reboot; we have to push the reset button.

This happened a couple of days ago, and I made sure to check that after rebooting, all of the filesystems were intact and nothing appeared to be missing. The system in question is the target of replication, so the source system was still hung running `zfs send`, and in fact the `rsh zfs receive` was also still hung. After killing those processes and restarting the synchronization, that is when an entire filesystem went missing and the space was freed. Re-creating the filesystem and re-sync'ing, as usual, restored normal operation.

Anyway, the thing I wanted to mention is that the system appeared OK on reboot, but a subsequent `zfs receive` somehow destroyed the target filesystem. I admit, though, these are slightly different symptoms than reported at the beginning of this thread, where a filesystem would disappear and it would then be impossible to re-create it without destroying the whole pool and starting over.
Last night, we updated the kernel and rebooted the system, and it ran less than 12 hours before having "hang" problems. I'll run through the reboot procedure, but in the meantime here is the kernel log.

```
[39840.452012] INFO: task kswapd0:37 blocked for more than 120 seconds.
```
After the reboot, everything looked fine on both systems, and both had the same set of snapshots. However, as soon as I started the `zfs send` from the primary to the backup, the target filesystem was destroyed. Here's the stdout from the sending side:

```
send from @replication-2014-07-25-09:11:03.238718041 to ds/ds2@replication-2014-07-25-09:11:25.722351302
estimated size is 0
```

At that point, ds/ds2/vm2 on the target system was destroyed.
@ldonzis Could you please post the output of
Certainly, however it's more than 280MB; is that OK to post? Or I could put it on an FTP server. Here are the entries up to where it hung this morning, followed by the reboot and restarting of the sync operation (ds1 and ds2 are running separate/independent `zfs send`/`recv` operations).

```
2014-07-25.07:05:25 zfs receive -F ds/ds2
```
@ldonzis It appears you've uncovered an unlikely but certainly possible deadlock which can occur in the memory reclaim code. If you're comfortable rebuilding from source, I strongly suspect you'll be able to resolve the issue by undefining HAVE_NR_CACHED_OBJECTS and rebuilding. You can go about this a few ways.
```diff
--- zfs_config.h~	2014-07-18 12:31:30.130388876 -0700
+++ zfs_config.h	2014-07-25 11:11:27.309866627 -0700
@@ -269,7 +269,7 @@
 #define HAVE_MOUNT_NODEV 1
 
 /* sops->nr_cached_objects() exists */
-#define HAVE_NR_CACHED_OBJECTS 1
+/* #undef HAVE_NR_CACHED_OBJECTS */
 
 /* open_bdev_exclusive() is available */
 /* #undef HAVE_OPEN_BDEV_EXCLUSIVE */
```

```diff
diff --git a/module/zfs/zpl_super.c b/module/zfs/zpl_super.c
index 45639a6..14693a9 100644
--- a/module/zfs/zpl_super.c
+++ b/module/zfs/zpl_super.c
@@ -29,6 +29,10 @@
 #include <sys/zfs_ctldir.h>
 #include <sys/zpl.h>
 
+#ifdef HAVE_NR_CACHED_OBJECTS
+#undef HAVE_NR_CACHED_OBJECTS
+#endif
+
 static struct inode *
 zpl_inode_alloc(struct super_block *sb)
```

We'll look into a proper long-term fix for this.
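For anyone who would rather script the config edit than hand-patch it, a sed one-liner can produce the same change as the first hunk. This is only a sketch: the `printf` creates a stand-in `zfs_config.h` so the edit can be demonstrated in isolation; in a real configured source tree you would skip that line, run the sed from the build root, and then rebuild (`make && sudo make install`).

```shell
# Stand-in zfs_config.h so the edit can be demonstrated; omit in a real tree.
printf '/* sops->nr_cached_objects() exists */\n#define HAVE_NR_CACHED_OBJECTS 1\n' > zfs_config.h
cp zfs_config.h zfs_config.h~   # keep a backup, matching the patch header
# Replace the define with the #undef comment, as in the diff above.
sed -i 's|^#define HAVE_NR_CACHED_OBJECTS 1$|/* #undef HAVE_NR_CACHED_OBJECTS */|' zfs_config.h
grep HAVE_NR_CACHED_OBJECTS zfs_config.h   # now shows only the #undef comment
```

(`sed -i` here is the GNU/Linux form; BSD sed would need `-i ''`.)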
At the moment, it's installed on Ubuntu via apt-get from the repository, so I'll read through the instructions on how to build it from source. Just out of curiosity, is this something recent? My sense is that it was very stable on 0.6.2. The thing with disappearing filesystems still occurred, but it didn't hang, or at least it doesn't seem like it did. Thanks!
OK, I modified the source and recompiled it, so we'll see how it goes. It sounds like there are two unrelated things going on, though: one is that it could hang, and the other is `zfs send`/`recv` destroying the filesystem. Thanks,
@ldonzis The HAVE_NR_CACHED_OBJECTS support isn't recent. But the lockup is pretty subtle, so it was likely accidentally introduced by some refactoring in 0.6.3. That's why we test, but unlikely cases like this can still be missed. You're the first to report this, and 0.6.3 has been out since June 12th.
Here's some news... I can reproduce the "lost filesystem" thing pretty much at will. I created two VMs and found that if I reboot the target while it's in the middle of a receive, then a subsequent receive destroys one or more filesystems. I think at this point I could produce a couple of scripts that would demonstrate it pretty readily, and/or package up the VMs themselves if that's easier.
@ldonzis If you could provide a script as a test case, that would be ideal.
@ldonzis Are the file systems mounted on the receiving end? If so, does the problem happen when receiving if they're not mounted?
Here's a script that appears to demonstrate the problem pretty reliably. It works by creating a filesystem with several descendent filesystems, replicating that to a destination system, and then looping on sending incremental streams. If you reboot the destination host while the send/receive is in progress, then after the destination host reboots, when the source host attempts the next transfer, that's when the filesystem(s) get destroyed. (You can also reproduce the problem without rebooting the destination. While the send/receive is in process, just Ctrl-Z to suspend the process and `kill %1` to terminate the job.)

After reducing this to a simpler test, this whole problem may be a simple matter of sending an incremental stream to a filesystem which doesn't have the origin snapshot. It appears that the entire send/receive is not completely atomic, i.e., when sending an incremental recursive stream, as each filesystem receives its part of the stream, it gets updated with the latest snapshot right away. So if the process is interrupted somehow (by rebooting or the system crashing, etc.), the destination system has the latest snapshot on some of the descendent filesystems but not all of them. If a new incremental send/receive is then attempted, the filesystems that don't have the origin snapshot get destroyed entirely. This is exacerbated somewhat by the fact that `zfs send` doesn't exit with a non-zero return code when it gets an error, otherwise the script would have aborted.

This non-atomic behavior is slightly unfortunate outside of this exercise, and I don't know whether it's proper, since I've never seen documentation that guarantees an all-or-nothing result. If you gracefully abort a send/receive operation, then it does appear to back out all of the work that had been done, but that doesn't help with the unexpected case of a machine rebooting.
Interestingly, if the source system still has the older snapshot that does exist on all descendent filesystems at the destination, then creating a third snapshot and attempting to send an incremental stream from the second to the third (where the second doesn't exist on all descendent filesystems) results in an error:

```
cannot receive incremental stream: most recent snapshot of tank/fs2 does not match incremental source
```

However, if the first snapshot is destroyed, then we see the problem of it destroying the filesystem on the destination:

```
cannot receive incremental stream: destination 'tank/fs2' does not exist
```

For example, if we have:

source:

destination:

(notice that tank/fs2@snap2 does not exist because a previous receive was aborted ungracefully.)

Now if we attempt `zfs send -R -I tank@snap2 tank@snap3 | ssh xxx zfs receive -F tank`, then we'll get an error that the destination doesn't have the proper origin snapshot, and no harm is done. However, if we `zfs destroy -r tank@snap1` and then attempt the same transfer, then tank/ds2 on the destination will get destroyed. At least that's my observation.

Here's the script:

```shell
#!/bin/bash
ssh=rsh
echo "Creating local pool and filesystems"
# Initialize local stuff
if ! zpool destroy tank 2>/dev/null ||
# Initialize the remote side
echo "Creating remote pool and filesystems"
# If we want the remote un-mounted
#if ! $ssh $dst zfs umount -a; then exit 1; fi
while true; do
```
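The quoted script above is truncated, so here is a rough, hedged reconstruction of its shape. The pool name `tank`, device `/dev/vdb`, host default `backuphost`, filesystem names, and snapshot naming are all my stand-ins, not the original's; the logic is wrapped in a function with a `run` helper so it can be dry-run (printing commands instead of executing them) without touching real pools:

```shell
#!/bin/bash
# run() prints each command and, unless DRY is set, executes it.
run() { echo "+ $*"; [ -n "$DRY" ] || eval "$*"; }

replicate_loop() {
    local ssh=${ssh:-rsh} dst=${dst:-backuphost} i=0

    echo "Creating local pool and filesystems"
    run "zpool destroy tank 2>/dev/null"
    run "zpool create tank /dev/vdb"
    for fs in fs1 fs2 fs3; do run "zfs create tank/$fs"; done

    echo "Creating remote pool and filesystems"
    run "$ssh $dst zpool create tank /dev/vdb"
    # If we want the remote un-mounted:
    #run "$ssh $dst zfs umount -a"

    while :; do
        i=$((i + 1))
        run "zfs snapshot -r tank@snap$i"
        if [ "$i" -eq 1 ]; then
            # First pass: full recursive replication.
            run "zfs send -R tank@snap$i | $ssh $dst zfs receive -F tank"
        else
            # Subsequent passes: incremental from the previous snapshot.
            run "zfs send -R -I tank@snap$((i - 1)) tank@snap$i | $ssh $dst zfs receive -F tank"
        fi
        [ "$i" = "${ITERATIONS:-}" ] && return 0   # bounded only for dry runs
        sleep 1
    done
}
```

Rebooting (or hard-killing) the destination while one of the incremental receives is in flight, then letting the loop run once more, is the sequence described above as triggering the destroyed filesystems.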
To dweeezil's question, it appears to make no difference whether the destination filesystems are mounted or not. |
And I stand corrected: it appears that just Ctrl-C'ing the send leaves the destination with some filesystems updated to the latest snapshot and some not, so it's not consistent at all. It just seems like if you attempt to send an incremental stream and the destination doesn't have the origin snapshot, it shouldn't just destroy the filesystem? I note the documented behavior of `zfs receive -F`, which states that snapshots and filesystems that don't exist at the source are destroyed at the destination, but this doesn't seem to match that case. In this case, it's destroying a filesystem that does exist on the source, but the destination doesn't have the origin snapshot.
@ldonzis In my testing, the problem seems to happen when a file system on the receiving end has no snapshots at all. Apparently, if the "does not have fromsnap" error is never triggered, the destination file system is deleted even if it exists on the source. This behavior matches your example above, in which you did `zfs destroy -r tank@snap1`. As to your question of atomicity: a complete receive of a single filesystem's stream is atomic, but a recursive stream covering descendent filesystems is received one filesystem at a time, so the operation as a whole is not.
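Until there's a proper fix, a replication script can defensively check that the destination actually has the incremental origin snapshot before running a forced receive. Below is a minimal sketch of just that decision logic; the function name and the snapshot names in the usage comment are hypothetical, and in practice the snapshot list would come from something like `zfs list -H -t snapshot -o name` run on the destination:

```shell
# safe_to_send FROMSNAP SNAP...
# Succeeds (exit 0) only if FROMSNAP appears among the destination's
# snapshots, so a caller can skip a `zfs receive -F` whose incremental
# origin is missing instead of letting it destroy the dataset.
safe_to_send() {
    fromsnap=$1; shift
    for snap in "$@"; do
        [ "$snap" = "$fromsnap" ] && return 0
    done
    return 1
}

# Hypothetical usage:
#   dest_snaps=$($ssh $dst zfs list -H -t snapshot -o name -r tank/fs2)
#   safe_to_send tank/fs2@snap2 $dest_snaps || echo "skipping tank/fs2: origin snapshot missing"
```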
Ah, that's a very good point... certainly an even simpler test case! I presume, by the way, that this is not "good" behavior, i.e., this isn't my fault for attempting to do such a thing. I think I understand what you're saying about atomicity: that for a given filesystem, a send/receive is guaranteed atomic, but when you are using -R with descendent filesystems, the entire operation is not atomic. That all sounds logical, and is pretty much what I had suspected. I guess it would be super cool if the whole operation could be guaranteed atomic, because then my scripts wouldn't have to consider a partially completed transfer. But anyway, it's not that big of a deal, as long as we understand it. Thanks!
I'd say this behavior is, at best, a violation of the POLA (principle of least astonishment) and, at worst, an outright bug. This problem, of course, doesn't have anything to do with the deadlock condition you discovered. I suppose this should be split into two issues: one for the "zfs receive -F erroneously destroys file systems" problem and the other for the deadlock.
lol. Yes, in retrospect, the two obviously have nothing to do with each other. Brian already commented about working around the deadlock, so we're running with that change applied. I am curious as to whether this is specifically a ZFS on Linux issue or if it would happen on any port. If it's a general ZFS issue, I'd be especially "astonished". (Mainly because ZFS is just so fantastically cool and practically always does what you expect.) Another (third) possible issue is the one I mentioned at the very beginning of the thread, which is that there was apparently some way for a filesystem to be destroyed such that `zfs create` was rendered inoperative. It didn't give any error; it's just that the filesystem wasn't created, and if you tried to destroy it, it said it didn't exist. However, I haven't seen this behavior in any of the recent testing with 0.6.3, so maybe it's no longer a problem.
Closing as out of date. |
Sorry if this is a duplicate, but I've searched and can't find anything exactly like this. And sorry for the lack of details, but I think I can reproduce it as it's happened twice.
We have two NAS servers that are sync'ed by almost continuous replication of snapshots, and it works very well.
However, after rebooting the secondary server (the one that's running zfs recv), a couple of filesystems have vanished. (zfs list does not show them.) However, doing a zfs create on one of the missing ones does not return any error, yet the created filesystem still doesn't show up. Attempting to destroy the filesystem says that it doesn't exist. The only recovery I've found so far is to destroy the parent filesystem, then re-create the parent and children, and then do a full replication.
It's entirely possible/likely that a recv was in progress when a "reboot" command was issued, which may or may not be related. If you have any suggestions for obtaining better information that would be useful, I'm willing to try it.
Thanks,
lew