
Improved libblkid/libudev integration #4523

Closed
wants to merge 4 commits

Conversation

behlendorf
Contributor

When partitioning a device a name may be specified for each partition.
Internally zfs doesn't use this partition name for anything so it
has always just been set to "zfs".

However this isn't optimal because udev will create symlinks using
this name in /dev/disk/by-partlabel/. If the name isn't unique
then not all of the links can be created.

Therefore a random 64-bit value has been added to the partition
label, e.g. "zfs-1234567890abcdef". Additional information could
be encoded here, but since partitions may be reused this could
result in confusion, so it was decided against.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4517

@rlaager
Member

rlaager commented Apr 13, 2016

Is it necessary to set a name at all? What happens with the udev rules if you don't?

@behlendorf
Contributor Author

@rlaager in that case the by-partlabel links are never created, but I'm not sure if there are additional consequences. I asked the same question in #4517.

@DeHackEd
Contributor

Unfortunately I can't think of a better way of doing this. Given the ability to rename and reguid a pool, it's hard to make a unique yet persistent name which is still meaningful.

I guess this is as good as you can ask for.

@rlaager
Member

rlaager commented Apr 14, 2016

@behlendorf I could be missing it (I'm battling a cold at the moment), but I don't see the question about no label in #4517.

@AeonJJohnson

I tested #4523. It remedies the systemd-udev errors appearing in /var/log/messages but the zpool create issue still occurs. It appears systemd-udev is still interfering with the zpool create process. The partitions are made correctly and have the new style of zfs-random64 partition names.

# zpool create -f spaceballs raidz A0 A1 A2 A3
cannot create 'spaceballs': one or more devices is currently unavailable

However, if I manually create device and partition symlinks in a directory not controlled by systemd-udev, I am able to create a pool. I am also able to export it, delete the manually created links, and then import the pool. The import uses the /dev/disk/by-vdev/... device links with no errors.

I will continue to try and determine the root cause.

@AeonJJohnson

It appears that the systemd-udev-settle service may be too slow for the zpool create process.

First, run 'strace zpool create spaceballs raidz A0 A1 A2 A3'.

The zpool create fails. strace output here (1648 lines): http://pastebin.com/gL6VUwYf

Next, look at the performance of systemd using 'systemd-analyze blame':

# systemd-analyze blame | grep udev
           699ms systemd-udev-settle.service
            56ms systemd-udev-trigger.service
            33ms systemd-udevd.service

Looking at /dev/disk/by-vdev the device links are created properly, pointing to the right devices and partitions. The device partitions can be written to and read from. A zpool can be successfully created on these same devices by calling out their /dev/sdX names as opposed to dev links in /dev/disk/by-[path,vdev,id].

In the pastebin (linked above) if you look at lines 479-866, 994-1126, 1254-1383 and 1504-1632 the sequence below repeats for each drive declared in the zpool create command.

stat("/dev/disk/by-vdev/A0-part1", 0x7ffd64f59b00) = -1 ENOENT (No such file or directory)
nanosleep({0, 1000000}, NULL)           = 0

It may be that a longer sleep is needed to allow for systemd-udev-settle to finish creating the links. That or perform an exorcism on systemd forcing it to leave RHEL/CentOS 7 and return to the level of Hell where it came from.
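
(For context, each stat()/nanosleep() pair in the strace excerpt above corresponds to one iteration of a simple poll loop, roughly like the sketch below. This illustrates the pre-libudev wait behavior only; the function name and return conventions are assumptions, not the exact zpool_label_disk_wait() source.)

#include <sys/stat.h>
#include <errno.h>
#include <time.h>

static int
label_disk_wait(const char *path, int timeout_ms)
{
	struct stat st;
	struct timespec ts = { 0, 1000000 };	/* 1ms, as in the strace */
	int i;

	for (i = 0; i < timeout_ms; i++) {
		if (stat(path, &st) == 0)
			return (0);		/* the link now exists */
		if (errno != ENOENT)
			return (errno);		/* unexpected failure */
		(void) nanosleep(&ts, NULL);
	}

	return (ENOENT);	/* gave up waiting for udev */
}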

@behlendorf
Contributor Author

@AeonJJohnson based on the strace output you posted, you're definitely past all the usual partition creation logic by the time the failure occurs. If you take a close look at the log you'll see it polling for the partitions to be created, and eventually all four are.

stat("/dev/disk/by-vdev/A0-part1", {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 1), ...}) = 0
...
stat("/dev/disk/by-vdev/A1-part1", {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 113), ...}) = 0
...
stat("/dev/disk/by-vdev/A2-part1", {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 209), ...}) = 0
...
stat("/dev/disk/by-vdev/A3-part1", {st_mode=S_IFBLK|0660, st_rdev=makedev(65, 49), ...}) = 0
...

After that it attempts the ZFS_IOC_POOL_CREATE ioctl(), which oddly fails with ENOENT. That's why you're seeing such a generic error message about not being able to create the pool. Unfortunately, the strace log isn't going to show us which error path we hit in the kernel to cause this, and there are a few of them. We're going to need to get a more detailed log from ftrace.

ioctl(3, 0x5a00, 0x7ffd64f5c4f0)        = -1 ENOENT (No such file or directory)
write(2, "cannot create 'spaceballs': no s"..., 52cannot create 'spaceballs': no such pool or dataset

@AeonJJohnson

@behlendorf Here is an ftrace of the zpool create operation. (16MB)
trace.dat.zip

@behlendorf
Contributor Author

@AeonJJohnson I've pushed an updated fix which fully resolves the issue. The patch relies in part on the new libudev support in master, so it won't apply cleanly to 0.6.5.x. However, we can safely drop those hunks and apply a version to the release branch.

@behlendorf
Contributor Author

@rlaager @DeHackEd @don-brady this patch is ready for review, if you have time please look it over and provide feedback. It should address multiple issues which have been hard to reliably reproduce in the past.

@rlaager
Member

rlaager commented Apr 20, 2016

@behlendorf I don't know how much review I can add, except that I'm still unclear on why it's necessary to name the ZFS partitions at all. If it's not necessary, simply not setting a name is much simpler (both in terms of less code and for the system administrator) than setting these randomized names.

@AeonJJohnson

@behlendorf given the reliance on libudev and master, when do you expect this would land in a release? 0.7? Sooner?

@behlendorf
Contributor Author

@AeonJJohnson definitely by 0.7. No promises, but a version of it may make it in to 0.6.5.7. Based on the issue tracker this has clearly been a sore spot for some systems; it should be addressed, and it's a low-risk change.

@rlaager yeah, I'm not sold on the unique partition labels myself. They're better than the systemd warnings, but in practice I'm not convinced the /dev/disk/by-partlabel/ links are of any practical use. However, leaving that issue aside, I added a second patch to this PR which addresses a deeper issue that has been causing trouble. Thoughts on that patch would be welcome.

@@ -260,6 +260,91 @@ udev_device_is_ready(struct udev_device *dev)
}

/*
* Wait timeout milliseconds for a newly created device to be available
Member

This is really pedantic, I know, but it should be "timeout_ms milliseconds" (with the _ms).

Contributor Author

I'm good with pedantic, carry on! The more eyes on this the better. Easy fix, will do.

@AeonJJohnson

@behlendorf @rlaager /dev/disk/by-partlabel isn't really useful, but multiple devices with the same partition name trigger errors reported by systemd-udev and may be causing additional delays in getting the device links created and stable. We already know we are dealing with a sluggish and bloated systemd environment. Making unique partition names makes systemd happy and likely quicker to finish enumerating devices and creating links. I think the usefulness of the unique partition names ends there.


	/* Must have a pathname and it must be absolute. */
	if (v->vdev_path == NULL || v->vdev_path[0] != '/') {
		v->vdev_stat.vs_aux = VDEV_AUX_BAD_LABEL;
-		return (EINVAL);
+		return (SET_ERROR(EINVAL));
Member

Are the three SET_ERROR changes logically related to this patch?

Contributor Author

Yes, in the sense that it would have been far easier to identify the root cause of the problem if they had existed. Initially it wasn't at all clear what part of the create path was returning ENOENT.
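
(For readers unfamiliar with it: SET_ERROR() returns the error code unchanged while recording where it was raised, so failing return sites can be traced. Its general shape is sketched below; log_error_site() is a hypothetical stand-in for the real tracepoint hook, not the actual ZFS definition.)

/* Hypothetical logging hook; the real macro hooks a tracepoint. */
extern void log_error_site(const char *file, const char *func,
    int line, int err);

#define	SET_ERROR(err)	\
	(log_error_site(__FILE__, __func__, __LINE__, (err)), (err))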

@rlaager
Member

rlaager commented Apr 20, 2016

Is this 50ms delay per-device? (I think so.) Are devices processed serially? (I think so.) Are you concerned about this adding up to meaningful amounts of time on large pools?

This may be a dumb question (since I don't fully understand this)...
What is the advantage of using libudev to verify all of the links--as opposed to using libudev only for udev_device_is_ready() and basically using the fallback/old version of the loop all the time?

@rlaager
Member

rlaager commented Apr 20, 2016

@AeonJJohnson: When you say, "Making unique part names makes systemd happy and likely quicker to finish executing enumerating and creating links.", are you saying that unique partition names are better than non-unique partition names? If so, I don't disagree. My question is: are unique partition names better than no partition names?

@AeonJJohnson

@rlaager as someone who has had to perform crisis storage brain surgery on non-backed-up data storage, I would say that partitions having unique names would be useful in the event you had to get elbow deep in the guts to save data. Outside of those situations where you are drawing diagrams and performing trauma surgery there isn't really a need for partition names. But when you are, they're valuable.

@rlaager
Member

rlaager commented Apr 20, 2016

@AeonJJohnson Thanks. That's something I hadn't considered.

@behlendorf
Contributor Author

behlendorf commented Apr 20, 2016

Is this 50ms delay per-device? (I think so.) Are devices processed serially? (I think so.) Are you concerned about this adding up to meaningful amounts of time on large pools?

The delays in zpool_label_disk_wait() will be handled serially per-device. On the kernel side in vdev_disk_open() they're done in parallel, and an ENOENT failure should be very rare. I'm not thrilled with the additional settle time, but it should only really add up for pools with a large number of devices. Based on the testing I did, the average time for udev to create the symlinks was 150-200ms. We're adding another 50ms settle time on to that. This could be done in parallel, but that would be a bigger patch. This felt workable and resolved the reliability issues, so I'm not too concerned about adding a fraction of a second to zpool create.

What is the advantage of using libudev to verify all of the links--as opposed to using libudev only for udev_device_is_ready() and basically using the fallback/old version of the loop all the time?

From what I observed in my testing, checking all the links, as opposed to just one, gave a better indication of whether udev had fully settled. Somewhat to my surprise, I didn't see a better way to do this with libudev. We could definitely rely on a fallback version as you suggested, but my subjective impression was that it wouldn't be as accurate. My feeling was the more we could catch prior to the ioctl() the better.
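
(To make the overall flow concrete: the wait described above first requires udev to report the device initialized, then verifies all advertised links, then allows a 50ms settle. A minimal sketch follows, assuming a block-device sysname such as "sda1"; the function name, 10ms poll interval, and return conventions are assumptions, not the actual zpool_label_disk_wait() source.)

#include <libudev.h>
#include <unistd.h>

#define	SETTLE_MS	50	/* settle time discussed above */

static int
wait_for_partition(const char *sysname, int timeout_ms)
{
	struct udev *udev = udev_new();
	struct udev_device *dev;
	struct udev_list_entry *entry;
	int elapsed_ms, missing;

	if (udev == NULL)
		return (-1);

	for (elapsed_ms = 0; elapsed_ms < timeout_ms; elapsed_ms += 10) {
		dev = udev_device_new_from_subsystem_sysname(udev,
		    "block", sysname);
		if (dev != NULL) {
			/* 1. udev must report the device initialized. */
			if (udev_device_get_is_initialized(dev)) {
				/* 2. Every advertised symlink must exist. */
				missing = 0;
				udev_list_entry_foreach(entry,
				    udev_device_get_devlinks_list_entry(dev)) {
					if (access(udev_list_entry_get_name(
					    entry), F_OK) != 0)
						missing = 1;
				}
				if (!missing) {
					/* 3. Allow the links to settle. */
					usleep(SETTLE_MS * 1000);
					udev_device_unref(dev);
					udev_unref(udev);
					return (0);
				}
			}
			udev_device_unref(dev);
		}
		usleep(10 * 1000);	/* assumed 10ms poll interval */
	}

	udev_unref(udev);
	return (-1);	/* timed out */
}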

@rlaager
Member

rlaager commented Apr 20, 2016

Thanks! I have no other review questions/comments.

@behlendorf changed the title from "Create unique partition labels" to "Improved libblkid/libudev integration" Apr 20, 2016
@behlendorf
Contributor Author

@rlaager I've addressed your comments, and refreshed the patch with a small fix to vdev_disk_open() which was caught by the automated testing.

I've also added a patch to this stack which resolves the issues from #3043. I've renamed the PR since it's now something of a catch-all for libblkid/libudev integration bugs.

if (id == 0)
id = (((uint64_t)rand()) << 32) | (uint64_t)rand();

sprintf(label_name, "zfs-%016llx", (u_longlong_t) id);
Contributor

Maybe use: snprintf(label_name, label_size, "zfs-%016llx", (u_longlong_t) id);

Contributor

If you just need a unique ID, could you make id static and (atomically) increment it on every zpool_label_name() call? That would remove the need for the random number code.

zpool_label_name() {
     static uint64_t id = 0;
     id++;
     ...

Contributor Author

That would work for zpool create, but the trouble is it needs to be a persistent unique ID. We need to avoid the case where a pool ends up with two partitions using the same name. You could see this happening quickly if every zpool replace ended up generating zfs-0 as the partition name.

Contributor Author

snprintf() is a good idea.
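
(Putting the pieces of this thread together, a self-contained sketch of the label-naming helper with the suggested snprintf() change applied might look like the following. The /dev/urandom read is an assumption about where the initial id comes from, since that part isn't shown in the hunk, and the signature is illustrative.)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static void
zpool_label_name(char *label_name, int label_size)
{
	uint64_t id = 0;
	FILE *fp;

	/* Assumed random source; only the rand() fallback is shown above. */
	fp = fopen("/dev/urandom", "r");
	if (fp != NULL) {
		if (fread(&id, sizeof (id), 1, fp) != 1)
			id = 0;
		(void) fclose(fp);
	}

	if (id == 0)
		id = (((uint64_t)rand()) << 32) | (uint64_t)rand();

	/* Bounded by label_size, per the review suggestion. */
	(void) snprintf(label_name, label_size, "zfs-%016llx",
	    (unsigned long long)id);
}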

When partitioning a device a name may be specified for each partition.
Internally zfs doesn't use this partition name for anything so it
has always just been set to "zfs".

However this isn't optimal because udev will create symlinks using
this name in /dev/disk/by-partlabel/.  If the name isn't unique
then not all of the links can be created.

Therefore a random 64-bit value has been added to the partition
label, e.g. "zfs-1234567890abcdef".  Additional information could
be encoded here, but since partitions may be reused this could
result in confusion, so it was decided against.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4517
When ZFS partitions a block device it must wait for udev to create
both a device node and all the device symlinks.  This process takes
a variable length of time and depends on factors such as how many
links must be created, the complexity of the rules, etc.  Complicating
the situation further, it is not uncommon for udev to create and
then remove a link multiple times while processing the udev rules.

Given the above, the existing scheme of waiting for an expected
partition to appear by name isn't 100% reliable.  At this point
udev may still remove and recreate the link, resulting in the
kernel modules being unable to open the device.

In order to address this the zpool_label_disk_wait() function
has been updated to use libudev.  Until the registered system
device acknowledges that it is fully initialized the function
will wait.  Once fully initialized, all device links are checked
and allowed to settle for 50ms.  This makes it far more likely
that all the device nodes will exist when the kernel modules
need to open them.

For systems without libudev an alternate zpool_label_disk_wait()
was implemented which includes a settle time.  In addition, the
kernel modules were updated to include retry logic for the
ENOENT case.  Due to the improved checks in the utilities it is
unlikely this logic will be invoked; however, in the rare event
it is needed it will prevent a failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3708
Issue openzfs#4077
Issue openzfs#4144
Issue openzfs#4214
Issue openzfs#4517
Disable the additional EFI debugging in all builds.  Some users
run debug builds in production and the extra log messages can
cause confusion.  Beyond that the log messages are rarely useful.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When importing a pool using the blkid cache only the device
node path was added to the list of known paths for a device.
This results in 'zpool import' always using the sdX names
in preference to the 'path' name stored in the label.

To fix the issue the blkid import path has been updated to
add the 'path', 'devid', and 'devname' names from the
label to the known paths.  A sanity check is done to ensure
these paths refer to the same device identified by blkid.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3043
@tonyhutter
Contributor

Latest changes look good to me.

behlendorf added a commit to behlendorf/zfs that referenced this pull request Apr 25, 2016
Disable efi_debug in --enable-debug builds
behlendorf added a commit to behlendorf/zfs that referenced this pull request Apr 25, 2016
Fix 'zpool import' blkid device names
@behlendorf
Contributor Author

Merged as:

325414e Fix 'zpool import' blkid device names
b8faf0c Disable efi_debug in --enable-debug builds
2d82ea8 Use udev for partition detection
5b4136b Create unique partition labels

nedbass pushed a commit to nedbass/zfs that referenced this pull request May 6, 2016
Fix 'zpool import' blkid device names
nedbass pushed a commit to nedbass/zfs that referenced this pull request May 6, 2016
Use udev for partition detection
nedbass pushed a commit to nedbass/zfs that referenced this pull request May 6, 2016
Fix 'zpool import' blkid device names
nedbass pushed a commit to nedbass/zfs that referenced this pull request May 6, 2016
Use udev for partition detection
ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Jun 7, 2016
Use udev for partition detection
ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Jun 7, 2016
Disable efi_debug in --enable-debug builds
ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Jun 7, 2016
Fix 'zpool import' blkid device names
@behlendorf deleted the issue-4517 branch April 19, 2021 19:35