
Improve Device Symlink detection in zpool import #4500

Closed
FireDrunk opened this issue Apr 8, 2016 · 17 comments

Comments

@FireDrunk
Contributor

This is a rework / improvement of #3892.

I've started trying to figure out how to build a better handling of the Symlink / Device Tree detection mechanisms in the zpool import functions.

Currently I have some raw code to detect the symlinks. It builds a tree of devices, each with its "best" symlink attached.

I'm looking into implementing this function into the zpool_find_import_impl function in libzfs_import.c#L1589.

That function starts by looping through all directories to find "disks"; the function I have already does this, and provides the 'correct' device node link through which ZFS should access the device.

In #3892, @ryao mentions that we need to update the respective zpool_do_{import,create,add,remove,attach,detach,replace} functions.

Wouldn't it be better if we did the device / symlink handling in a central place, and then hand off the results to the other zpool functions?

@behlendorf
Contributor

@FireDrunk great timing, enough of the groundwork is now in place that you can solve this issue the same way it was done in FreeBSD / Illumos. @don-brady's recent commit to master, 39fc0cb, added support to store the devid and phy_path keys in the vdev label. That means we can reinstate the logic in vdev_disk_open() for opening a device by devid. It will need to be modified slightly for Linux but the intent remains the same; see the upstream OpenZFS code for this. This is the central location you were suggesting.

@FireDrunk
Contributor Author

I've read the code in OpenZFS, and what they do once a device is reopened via a different device link is rewrite the VDEV with the 'new' information. Is that something we want to do in ZoL?

Practical example: someone creates a pool on /dev/sdc while booted from a USB thumb drive, reboots, and /dev/sdc is now /dev/sdb. ZFS cannot open the device, so it starts searching for symlinks, and finds the correct disk in /dev/disk/by-id/. After that, ZFS rewrites the correct device id in the VDEV physpath value.

After a zpool status, the user sees their disk under /dev/disk/by-id/.

Is this expected behaviour, or do we want to let the user still see the /dev/sdb they first used?

On another note: Where is the code located where ZFS "tastes" a given device to see if it's the correct/valid ZFS device that it's 'needing' when mounting a pool?

@behlendorf
Contributor

@FireDrunk good question. That would definitely be a change in behavior from how things behave today, but it's almost certainly a desirable one for devices like /dev/sda which change names between reboots. For other devices under /dev/disk/ that's a little more problematic since those paths can imply additional semantics a user may care about such as physical location.

As for device tasting, it depends on how a pool is imported. Either you're going to be opening a pool based on the configuration stored in the cache file, or using a configuration dynamically generated from blkid or directory scanning. In either case basic sanity checks for the configuration are validated in vdev_open() and vdev_validate(). Adding additional sanity checks wouldn't be a terrible idea.

@FireDrunk
Contributor Author

FireDrunk commented Apr 16, 2016

So what we can do is classify an array of "bad paths" for which we actively rewrite the path to a better one we find using symlinks. If the requested device is 'normally' in a good path, but we can only find it in another good path, we could notify the user on import that it is possible to import the pool, but the user would need to do a -F to rewrite/update the vdev locations. This can cause mixed pools in rare cases (by mixed I mean some links in /dev/disk/by-id/ and some via /dev/disk/by-path, for instance).

We can always allow the use of bad paths through an environment setting, and maybe it's even possible to let the user declare 'bad paths' through module settings or environment variables.

For instance, I'm thinking about /dev/mapper locations that some users might want (in the case of Fibre Channel backup LUNs on SANs), but in other cases (iSCSI?) you might not want to.

Does that seem like a feasible idea?

Now that I'm thinking about it: FreeBSD's behaviour is to just import the pool on other device nodes, so if we want to 'align' the implementations, we might have to look sideways at the other platforms.

@CurlyMoo

CurlyMoo commented Apr 16, 2016

Currently I have some raw code to detect the symlinks. It builds a tree of devices with its "best" symlink attached.

The raw code @FireDrunk speaks of is this:
https://github.com/CurlyMoo/zfs-device-tasting/blob/master/zfs-device-tasting.c

This is what it outputs:

dvd
cdrom
dm-1
- /dev/mapper/ubuntu--vg-swap_1
dm-0
- /dev/mapper/ubuntu--vg-root
sr0
- /dev/disk/by-label/Ubuntu-Server\x2014.04.1\x20LTS\x20amd64
vda5
- /dev/disk/by-id/virtio-BHYVE-4783-BDD5-21E3-part5
vda2
- /dev/disk/by-id/virtio-BHYVE-4783-BDD5-21E3-part2
vda1
- /dev/disk/by-uuid/83bad205-89fe-4cba-89fa-fb8c56493c0d
vda
- /dev/disk/by-id/virtio-BHYVE-4783-BDD5-21E3
...

So it returns the first symlink it finds, on a priority-list basis. That list is nothing more than the list already implemented in ZoL:

    { "by-vdev", "/dev/disk/by-vdev" },
    { "mapper", "/dev/mapper" },
    { "by-partlabel", "/dev/disk/by-partlabel" },
    { "by-partuuid", "/dev/disk/by-partuuid" },
    { "by-label", "/dev/disk/by-label" },
    { "by-uuid", "/dev/disk/by-uuid" },
    { "by-id", "/dev/disk/by-id" },
    { "by-path", "/dev/disk/by-path" }

After the user has created the pool normally, we import the pool again based on the default list. That can imply a mixed symlink import. Only if the user forces the search path with -F do we overwrite the list with another priority order (by using a map or something):

    { "by-label", "/dev/disk/by-label" },
    { "by-vdev", "/dev/disk/by-vdev" },
    { "mapper", "/dev/mapper" },
    { "by-partlabel", "/dev/disk/by-partlabel" },
    { "by-partuuid", "/dev/disk/by-partuuid" },
    { "by-uuid", "/dev/disk/by-uuid" },
    { "by-id", "/dev/disk/by-id" },
    { "by-path", "/dev/disk/by-path" }

So ZFS will first look for symlinks under by-label, but when that fails, it will try the other paths.

@behlendorf
Contributor

One other thing regarding "device tasting" I should mention is that we're systematically moving toward depending on libblkid. As you're probably aware, this is the standard mechanism used under Linux to determine what's on block devices as they are dynamically added to the system. In master, if you're not using a cache file, the default behavior is to ask libblkid for the list of available zpool member devices. This list will be considered authoritative when constructing a pool. To get the old probing behavior you can either specify a list of directories with -d, or use -s to scan the default paths.

Building on top of udev and libblkid allows us to start properly handling the various auto* zpool properties. It also lets us get notifications for device removal and addition.

7d11e37 Require libblkid
39fc0cb Add support for devid and phys_path keys in vdev disk labels

@FireDrunk
Contributor Author

Does libblkid 'identify' disks as being valid zpool members by probing? Or does it rely on GPT partition type information or something?

If so, we could just ask libblkid to disregard any /dev/sd* links and just ask for "the rest" and make a decent table of devices with the result.

@behlendorf
Contributor

The way blkid works is that when a disk is added to the system, a udev event is run which results in blkid tasting the device to identify it. It determines what's on the disk by accessing it and looking for certain characteristics. Based on this information it can tell if a disk contains an ext3/4, xfs, or zfs filesystem, or a myriad of other things. Once a disk has been identified it updates a cache file which can be consulted via libblkid for future reference.

As I mentioned before, we've been gradually integrating with the blkid cache. What happens today in master when zpool import is run is that the blkid cache is consulted for all possible ZFS member devices rather than directory scanning. The vdev labels are then read off each of these devices and the pool is constructed from that information.

Recently libblkid was made mandatory in the master code, and that resulted in us uncovering a few new issues. @FireDrunk since I know you've been looking into this, it would be great if you could review the patches in #4523.

It would be great if you could also post your thoughts on what still needs to be done.

@CurlyMoo

The vdev labels are then read off of each of these devices and the pool is constructed from that information.

Does this then fully rely on proper vdev information? What if that fails? The idea of device tasting is that we don't rely on just one way of identifying a disk, but check various devfs paths to search for a particular zpool member when the previous identification method failed.

@FireDrunk
Contributor Author

FireDrunk commented Apr 21, 2016

I think you are misunderstanding Brian.

udev triggers at boot for each block device, regardless of its path. This means that the blkid cache will contain all successfully tasted devices. Once ZFS comes along (during the zpool import stage) it will ask blkid for all devices that were tasted as being ZFS devices.

This results in a list of devices that ZFS is going to probe itself for 'internal' vdev information, after which it will try to assemble the VDEVs / pools.

I think that should be fine as long as the UDEV / BLKID device tasting is at the same level as ZFS' internal probing.

A situation that I'm thinking of now is a 'semi-broken' device that fails to be probed by udev but would be properly probed by ZFS.

What is the 'chance' that this is possible @behlendorf ? Can we be 'sure' that the udev / blkid probing of ZFS VDEV Members is 'good' enough?

@FireDrunk
Contributor Author

@behlendorf , i've looked into #4523 and found the related blkid functions:
764a2aa#diff-3d8928905ad81f6805b5a63f8d8b99adR1750

It looks like the code asks blkid for the entire cache, and just tries to read a zpool label from each entry. This seems fine to me, as this will increase our changes of finding zpools on block devices.

One thing though: we need to choose whether we want to keep /dev/disk/by-id/ or rework that into something fancier, like upgrading the links if we have 'better' paths.

@CurlyMoo

What is the 'chance' that this is possible @behlendorf ? Can we be 'sure' that the udev / blkid probing of ZFS VDEV Members is 'good' enough?

That's what i wanted to say but you said it better 😄

our changes

Chances you mean?

@behlendorf
Contributor

Can we be 'sure' that the udev / blkid probing of ZFS VDEV Members is 'good' enough?

Blkid has been able to reliably detect ZFS vdevs for several years now. The probing isn't quite as thorough as that done by ZFS, for example it does not read all the label information, but it shouldn't miss anything. Much older versions of blkid, such as that used in RHEL6, may not be quite as good.

But yes, with any modern distribution we can depend on it. And there's nothing preventing us from submitting patches to improve it if we find issues.

It looks like the code asks blkid for the entire cache,

It does but we filter out any devices which were not identified as a zfs_member type.

One thing tho, we need to choose whether we want to keep the /dev/disk/by-id/ or rework that to something more fancy like upgrading the links if we have 'better' paths.

With the patches in #4523 the paths used by the previous import are preferred by default. If for some reason they don't work, it will fall back to the /dev/disk/by-id/ paths, and as a last resort use the device node. So if, for example, you always import using the /dev/disk/by-vdev/ paths, those will persist.

This is certainly something which could be enhanced, and libudev does provide some nice functionality to enumerate all possible paths given just the devnode.

@FireDrunk
Contributor Author

FireDrunk commented Apr 22, 2016

Is libudev mandatory for ZFS? I mean, do we really want to fully rely on systemd-udev and libblkid? This might upset some Gentoo users :-)

@behlendorf
Contributor

libblkid was made mandatory, libudev was left optional.

@behlendorf
Contributor

@FireDrunk @CurlyMoo is there still an outstanding issue here which hasn't been sufficiently addressed in the master branch?

@behlendorf
Contributor

Closing. This was resolved by adding a dependency on libblkid.
