
Incorrect spare vdev state after offline drive has been replaced by a hot spare #5766

Closed
thegreatgazoo opened this issue Feb 9, 2017 · 3 comments

Comments

@thegreatgazoo

I was able to reproduce it on current master and 0.6.3-1.3 across different hardware:

# zpool create -f -o ashift=12 tank raidz1 sdd sde sdf sdg sdh spare sdu
# zpool offline tank sdh
# zpool replace tank sdh sdu
# zpool status
  scan: resilvered 361M in 0h0m with 0 errors on Thu Feb  9 05:01:02 2017
config: 

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          raidz1-0   DEGRADED     0     0     0
            sdd      ONLINE       0     0     0
            sde      ONLINE       0     0     0
            sdf      ONLINE       0     0     0
            sdg      ONLINE       0     0     0
            spare-4  OFFLINE      0     0     0
              sdh    OFFLINE      0     0     0
              sdu    ONLINE       0     0     0
        spares
          sdu        INUSE     currently in use

The spare-4 vdev was in the OFFLINE state, which seemed incorrect (@ahrens agreed). A spare vdev is essentially a mirror, and with one child ONLINE its state should be DEGRADED. In fact, if I export and import the pool, spare-4 becomes DEGRADED even though there is no change in its children's states. In addition, OFFLINE doesn't seem to be a valid state for a non-leaf vdev at all, as seen in vdev_offline_locked():

       if (!vd->vdev_ops->vdev_op_leaf)
               return (spa_vdev_state_exit(spa, NULL, ENOTSUP));

I also found an Oracle document indicating that DEGRADED should be the correct state.
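To make the expectation concrete, here's a minimal sketch of how a mirror-like parent's state follows from its children's states (illustration only, assuming the vdev_state_t values from sys/fs/zfs.h; example_parent_state() is a made-up helper, not actual ZFS code):

    /* Illustration only: derive a mirror-like parent's state from its children. */
    static vdev_state_t
    example_parent_state(const vdev_state_t *child, int nchildren)
    {
            int healthy = 0;

            for (int c = 0; c < nchildren; c++) {
                    if (child[c] == VDEV_STATE_HEALTHY)
                            healthy++;
            }

            if (healthy == nchildren)
                    return (VDEV_STATE_HEALTHY);    /* all children ONLINE */
            if (healthy > 0)
                    return (VDEV_STATE_DEGRADED);   /* e.g. sdh OFFLINE + sdu ONLINE */
            return (VDEV_STATE_CANT_OPEN);          /* no usable child */
    }

With sdh OFFLINE and sdu ONLINE this yields DEGRADED, which matches the Oracle documentation and the state seen after export/import.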

This OFFLINE state does not affect new writes, as zio_vdev_io_start() checks vdev_accessible() only for leaf vdevs, so writes do pass through the OFFLINE spare-4 to reach sdu. This is correct. But for normal read I/O, the parent raidz1 vdev simply skips any OFFLINE child. I verified this by reading a 1.5G file that was not in the ARC: my traces showed that vdev_raidz_reconstruct() was called 8176 times during the read, and iostat showed 0 reads for sdu, which is wrong.
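Roughly speaking, the raidz read setup skips any child it doesn't consider readable, and an OFFLINE vdev fails that test because OFFLINE sorts below DEGRADED in vdev_state_t. A paraphrased sketch of that child loop (from memory, not a verbatim quote of vdev_raidz_io_start()):

    /* Paraphrased sketch: an OFFLINE child is skipped on the read path. */
    for (int c = 0; c < vd->vdev_children; c++) {
            vdev_t *cvd = vd->vdev_child[c];

            /* vdev_readable() is false for an OFFLINE vdev like spare-4 */
            if (!vdev_readable(cvd))
                    continue;       /* column rebuilt later by vdev_raidz_reconstruct() */

            /* ... otherwise a read zio is issued to cvd ... */
    }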

In spa_vdev_attach():

        /*
         * If the parent is not a mirror, or if we're replacing, insert the new
         * mirror/replacing/spare vdev above oldvd.
         */
        if (pvd->vdev_ops != pvops)
                pvd = vdev_add_parent(oldvd, pvops);

        /*
         * Extract the new device from its root and add it to pvd.
         */
        vdev_remove_child(newrootvd, newvd);
        newvd->vdev_id = pvd->vdev_children;
        newvd->vdev_crtxg = oldvd->vdev_crtxg;
        vdev_add_child(pvd, newvd);

The spare-4 vdev, i.e. pvd in the code above, was created by vdev_add_parent(oldvd, pvops) and got its state from oldvd (the OFFLINE sdh), so its state became OFFLINE. But after vdev_add_child(pvd, newvd), where newvd is the ONLINE spare sdu, its state was not refreshed. That's why it stayed OFFLINE, and why export/import fixes its state.
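The inheritance happens inside vdev_add_parent(), which copies several properties from the existing child onto the newly inserted interior vdev. Paraphrased from memory (the exact set of fields copied may vary between versions):

    /* In vdev_add_parent(): mvd is the new interior vdev (spare-4),
     * cvd is the existing child (the OFFLINE sdh). */
    mvd->vdev_asize = cvd->vdev_asize;
    mvd->vdev_ashift = cvd->vdev_ashift;
    mvd->vdev_state = cvd->vdev_state;      /* spare-4 inherits OFFLINE here */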

BTW, @don-brady mentioned he was able to reproduce it on an old illumos build, so it may affect more than just our Linux ZFS.

@thegreatgazoo
Author

This seemed easy to fix:

@@ -4787,6 +4787,11 @@ spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing)
        newvd->vdev_crtxg = oldvd->vdev_crtxg;
        vdev_add_child(pvd, newvd);
 
+       /*
+        * Reevaluate the parent vdev state.
+        */
+       vdev_propagate_state(pvd);
+
        tvd = newvd->vdev_top;
        ASSERT(pvd->vdev_top == tvd);
        ASSERT(tvd->vdev_parent == rvd);

There's a similar vdev_propagate_state() call in spa_vdev_detach() as well.
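For context, vdev_propagate_state() reevaluates an interior vdev from its children and then walks up toward the root. A simplified sketch of the idea (sketch_propagate_state() is illustrative, not the exact upstream function):

    /* Simplified sketch of what vdev_propagate_state() does: count
     * unreadable and degraded children, let the vdev's state-change op
     * recompute the parent state, then recurse toward the root. */
    static void
    sketch_propagate_state(vdev_t *vd)
    {
            int faulted = 0, degraded = 0;

            for (int c = 0; c < vd->vdev_children; c++) {
                    vdev_t *cvd = vd->vdev_child[c];

                    if (!vdev_readable(cvd))
                            faulted++;      /* the OFFLINE sdh counts here */
                    else if (cvd->vdev_state <= VDEV_STATE_DEGRADED)
                            degraded++;
            }

            /*
             * The mirror/spare state-change op maps "some but not all
             * children unreadable" to VDEV_STATE_DEGRADED, which is the
             * state spare-4 should end up in.
             */
            vd->vdev_ops->vdev_op_state_change(vd, faulted, degraded);

            if (vd->vdev_parent != NULL)
                    sketch_propagate_state(vd->vdev_parent);
    }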

@behlendorf
Contributor

@thegreatgazoo nice find! Can you open a new PR with the proposed fix, which looks reasonable, and we'll get some reviewers on it.

thegreatgazoo pushed a commit to thegreatgazoo/zfs that referenced this issue Feb 10, 2017
After a hot spare replaces an OFFLINE vdev, the new
parent spare vdev state is set incorrectly to OFFLINE.
The correct state should be DEGRADED. The incorrect
OFFLINE state will prevent top-level vdev from reading
the spare vdev, thus causing unnecessary reconstruction.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes openzfs#5766
@thegreatgazoo
Author

On second thought: if we add vdev_propagate_state(pvd) as shown above, both pvd and newvd still have empty DTLs at that point. Could a zio arrive at the top raidz1 vdev and cause spare-4 to read bad data from newvd? Then again, since we're already changing the vdev tree structure here, shouldn't the existing locking already prevent such races? I'm not sure.

Even if we moved the vdev_propagate_state(pvd) call down a bit, to after newvd's DTL_MISSING has been updated, could there still be a similar race, since vdev_dtl_reassess(spa->spa_root_vdev, 0, 0, B_FALSE) is called much later from spa_vdev_exit()?
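To spell out the concern: the mirror-style read path consults DTL_MISSING when deciding whether a child is safe to read from, so while newvd's DTL is still empty a read could be directed at the not-yet-resilvered spare. A sketch of that kind of check (simplified, assuming the vdev_dtl_contains() interface; not a verbatim quote of the mirror code):

    /* Sketch of a read-child selection check. With newvd's DTL_MISSING
     * still empty, this test would not reject it -- that's the window
     * the race concern above is about. */
    if (vdev_readable(cvd) &&
        !vdev_dtl_contains(cvd, DTL_MISSING, zio->io_txg, 1))
            return (c);     /* cvd looks safe to read from */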

behlendorf pushed a commit that referenced this issue Feb 23, 2017
After a hot spare replaces an OFFLINE vdev, the new
parent spare vdev state is set incorrectly to OFFLINE.
The correct state should be DEGRADED. The incorrect
OFFLINE state will prevent top-level vdev from reading
the spare vdev, thus causing unnecessary reconstruction.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #5766 
Closes #5770