vdev_id: Support daisy-chained JBODs in multipath mode. #11526

Merged
merged 1 commit into openzfs:master on Feb 9, 2021

Conversation

arshad512
Contributor

Within the function sas_handler(), userspace commands like
'/usr/sbin/multipath' have been replaced with sourcing
device details from within sysfs, which removes a
significant amount of overhead and processing time.
Multiple JBOD enclosures and their order are sourced
from the bsg driver (/sys/class/enclosure) to isolate
chassis top-level expanders, which are then dynamically
indexed based on the host channel of the multipath
subordinate disk member device being processed.
Additionally, a "mixed" slot-identification mode was added
for environments where a ZFS server system may contain
SAS disk slots with no expander (direct connect to the HBA)
while an attached external JBOD with an expander uses a
different slot identifier method.
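
As a rough illustration of the sysfs-sourced approach (a minimal
sketch assuming the standard /sys/class/enclosure layout shown later
in this thread, not the actual vdev_id code):

# Enumerate enclosure expanders straight from sysfs instead of
# invoking /usr/sbin/multipath or lsscsi (illustrative sketch only).
for enc in /sys/class/enclosure/*; do
    [ -e "$enc/id" ] || continue
    enc_id=$(cat "$enc/id")          # unique identifier for this enclosure
    n_slots=$(cat "$enc/components") # number of slot entries it exposes
    echo "enclosure $(basename "$enc") id=$enc_id slots=$n_slots"
done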

How Has This Been Tested?
~~~~~~~~~~~~~~~~~~~~~~~~~

Testing was performed on an AMD EPYC based dual-server
high-availability multipath environment with multiple
HBAs per ZFS server and four SAS JBODs. The two primary
JBODs were multipath/cross-connected between the two
ZFS-HA servers. The secondary JBODs were daisy-chained
off of the primary JBODs using aligned SAS expander
channels (JBOD-0 expanderA--->JBOD-1 expanderA,
          JBOD-0 expanderB--->JBOD-1 expanderB, etc.).
Pools were created, exported, and re-imported globally
with 'zpool import -a -d /dev/disk/by-vdev'.
Low-level udev debug output was traced to isolate
and resolve errors.

Result:
~~~~~~~

Initial testing of a previous version of this change
showed how the cost of relying on userspace utilities like
'/usr/sbin/multipath' and '/usr/bin/lsscsi' was
exacerbated by increasing numbers of disks and JBODs.
With four 60-disk SAS JBODs and 240 disks, the time to
process a udevadm trigger was 3 minutes 30 seconds,
during which nearly all CPU cores were above 80%
utilization. By switching from those userspace
utilities to sysfs in this version, the udevadm
trigger processing time was reduced to 12.2 seconds
with negligible CPU load.

This patch also fixes a few shellcheck complaints.

Signed-off-by: Jeff Johnson <jeff.johnson@aeoncomputing.com>
Signed-off-by: Arshad Hussain <arshad.hussain@aeoncomputing.com>

<!--- Please fill out the following template, which will help other contributors review your Pull Request. -->

<!--- Provide a general summary of your changes in the Title above -->

<!---
Documentation on ZFS Buildbot options can be found at
https://openzfs.github.io/openzfs-docs/Developer%20Resources/Buildbot%20Options.html
-->

### Motivation and Context
<!--- Why is this change required? What problem does it solve? -->
<!--- If it fixes an open issue, please link to the issue here. -->

### Description
<!--- Describe your changes in detail -->
Details are given in the commit message.

### How Has This Been Tested?
<!--- Please describe in detail how you tested your changes. -->
Details are given in the commit message.
<!--- Include details of your testing environment, and the tests you ran to -->
<!--- see how your change affects other areas of the code, etc. -->
<!--- If your change is a performance enhancement, please provide benchmarks here. -->
<!--- Please think about using the draft PR feature if appropriate -->

### Types of changes
<!--- What types of changes does your code introduce? Put an `x` in all the boxes that apply: -->
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [X] Performance enhancement (non-breaking change which improves efficiency)
- [X] Code cleanup (non-breaking change which makes code smaller or more readable)
- [X] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs\_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)

### Checklist:
<!--- Go over all the following points, and put an `x` in all the boxes that apply. -->
<!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
- [ ] My code follows the OpenZFS [code style requirements](https://github.com/openzfs/zfs/blob/master/.github/CONTRIBUTING.md#coding-conventions).
- [ ] I have updated the documentation accordingly.
- [ ] I have read the [**contributing** document](https://github.com/openzfs/zfs/blob/master/.github/CONTRIBUTING.md).
- [ ] I have added [tests](https://github.com/openzfs/zfs/tree/master/tests) to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [ ] All commit messages are properly formatted and contain [`Signed-off-by`](https://github.com/openzfs/zfs/blob/master/.github/CONTRIBUTING.md#signed-off-by).

@arshad512
Contributor Author

arshad512 commented Jan 26, 2021

Hi Brian, Gabriel - Please use Jeff's latest PR for your reviews of the vdev_id improvement. Jeff's old PRs #10736 and #11520 need not be reviewed further, as all changes have been brought into the latest one. (I wanted to update #10736; however, I was ending up with multiple merges, so it was easier for me to create a new PR. Sorry for the multiple versions.)

A few comments on your last review...

There's a few checkbashism complaints

This is fixed in the latest update. Thanks for pointing this out.

And here's the current state of shellcheck, I think there's some more fixups that could be made:

Initially, running shellcheck locally gave us 169 warnings. We have brought that down to 127 with the latest PR. I would like to go with Brian's suggestion: let this patch stay focused on the functionality and improvements Jeff has coded, and make the follow-up patch purely shellcheck fixes. That would be easier. Please let me know your thoughts.

shellcheck run on Master

# shellcheck -s sh -f gcc cmd/vdev_id/vdev_id | wc
    169    1747   15179

shellcheck run on latest code (PR)

# shellcheck -s sh -f gcc cmd/vdev_id/vdev_id | wc
    127    1510   12798

@gdevenyi
Contributor

I'm okay with leaving some shellcheck warnings unfixed; however, I'd just like to highlight that I think any quoting-related shellcheck fixes need to be handled carefully now. Here's a dump of one of my JBOD's enclosure naming:

/sys/class/enclosure/1:0:4:0# ls
components    SLOT 10 09    SLOT 15 12    SLOT  2 01    SLOT 24 1B    SLOT 29 24    SLOT 33 28    SLOT 38 31    SLOT 42 35    SLOT 47 3A    SLOT 51 42    SLOT 56 47    SLOT  6 05    uevent
device        SLOT 11 0A    SLOT 16 13    SLOT 20 17    SLOT 25 20    SLOT  3 02    SLOT 34 29    SLOT 39 32    SLOT 43 36    SLOT 48 3B    SLOT 52 43    SLOT 57 48    SLOT  7 06  
id            SLOT 12 0B    SLOT 17 14    SLOT 21 18    SLOT 26 21    SLOT 30 25    SLOT 35 2A    SLOT  4 03    SLOT 44 37    SLOT 49 40    SLOT 53 44    SLOT 58 49    SLOT  8 07  
power         SLOT 13 10    SLOT 18 15    SLOT 22 19    SLOT 27 22    SLOT 31 26    SLOT 36 2B    SLOT 40 33    SLOT 45 38    SLOT  5 04    SLOT 54 45    SLOT 59 4A    SLOT  9 08  
SLOT  1 00    SLOT 14 11    SLOT 19 16    SLOT 23 1A    SLOT 28 23    SLOT 32 27    SLOT 37 30    SLOT 41 34    SLOT 46 39    SLOT 50 41    SLOT 55 46    SLOT 60 4B    subsystem

Yes, those slot names all have multiple spaces in them.
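
A minimal sketch of why that matters for quoting (hypothetical loop, assuming the layout above):

# Slot directory names such as "SLOT 10 09" contain spaces, so every
# expansion must be quoted or the names get word-split.
for slot_dir in /sys/class/enclosure/1:0:4:0/SLOT*; do
    name=$(basename "$slot_dir")   # quoted: keeps "SLOT 10 09" intact
    echo "found: $name"            # an unquoted $slot_dir would break into pieces
done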

Contributor

@behlendorf behlendorf left a comment

I agree, we should be very careful about the quoting. I also think it would be preferable to split this into two commits. The first one can add the new functionality, and the second can at least make a first pass at addressing the bashisms.

An easy way to do this with git would be to reset the changes and then add back in individual hunks. Something like this should work.

git reset HEAD^

# Add in all of the new hunks for the new feature
git add -p
git commit

# Add in the remaining chunks which are style / portability fixes.
git add -p
git commit

@arshad512
Contributor Author

I think any quoting shellchecks need to be carefully handled now...

I will take care of handling the quoting in my next push, along with the other comments.

@arshad512 arshad512 force-pushed the vdev_id-slot_mix_Jeff_01 branch from caff865 to 49711ed Compare January 27, 2021 11:06
@arshad512
Contributor Author

arshad512 commented Jan 27, 2021

I think any quoting shellchecks need to be carefully handled now...

Resolved all "SC2086: Double quote to prevent globbing and word splitting" warnings in the vdev_id code.
PS: This is lightly tested. Jeff is the author and expert here; I will take his advice on how to further test this out.

# shellcheck -s sh -f gcc cmd/vdev_id/vdev_id | grep SC2086
#
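
For reference, a typical SC2086 fix is just quoting the expansion (hypothetical example, not a specific hunk from this PR):

# Before (SC2086): unquoted expansion is subject to word splitting and globbing
#   rm -f $TMPFILE
# After: quoted, safe even if the value contains spaces or wildcards
rm -f "$TMPFILE"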

Contributor

@behlendorf behlendorf left a comment

Thanks, the code looks good to me. Though I haven't yet had a chance to test it out on a test system.

@arshad512
Contributor Author

Thanks, the code looks good to me. Though I haven't yet had a chance to test it out on a test system.

Hi Brian, thanks for the review. In that case, could you hold the merge until we have done a few more test runs?

@gdevenyi
Contributor

gdevenyi commented Jan 28, 2021

Edit: got it sorted, vdev_id wants dm-XX devices for multipath, not /dev/mapper devices.

@gdevenyi
Contributor

This update doesn't solve the logic problems underlying my JBOD described in #11095

Only 39/90 disks in the enclosure get a slot number after trying the various "slot" options.

@arshad512
Contributor Author

arshad512 commented Jan 28, 2021

This update doesn't solve the logic problems underlying my JBOD described in #11095
Only 39/90 disks in the enclosure get a slot number after trying the various "slot" options.

I will have a look. I am cleaning up a few more shellcheck warnings which I think should be done. Could you please paste here your latest output (run under '-x' (debug)) from the updated script: sh -x .<path>/vdev_id -d <device>

@gdevenyi
Contributor

gdevenyi commented Jan 28, 2021

I was a bit too quick. It seems this config will successfully find/ID all my slots, and it seems the old code also finds them; not sure how I missed this config on the first try.

topology sas_direct
multipath yes
slot id
enclosure_symlinks yes

#       PCI_ID  HBA PORT  CHANNEL NAME
channel 5e:00.0 0         A
channel 5e:00.0 1         B
channel af:00.0 0         A
channel af:00.0 1         B

slot id can properly find 90 disks; however, there is a weird off-by-one error: I don't get a slot 12 disk, so the disk IDs go from A0 to A91. Here's the -x output from the A11 and A13 devices, attached:
11.log
13.log

Edit: of course, A0 to A91 means there are more missing items; A52 is also missing.

@arshad512
Contributor Author

I was a bit too quick. It seems this config will successfully find/ID all my slots ... slot id can properly find 90 disks, however there is a weird off-by-one error ... A0 to A91 means there are more missing items, A52 is also missing

Hi Gabriel - Thanks for the log. The script is still being improved (other bugs being resolved); once we get past that, I will have a look at your issue.

@arshad512 arshad512 force-pushed the vdev_id-slot_mix_Jeff_01 branch from 49711ed to 9bfa0ca Compare January 29, 2021 13:12
@arshad512
Contributor Author

slot id can properly find 90 disks, however there is a weird off-by-one error ... A52 is also missing

Hi Gabriel - I am now looking into your issue. While I get a local reproducer ... could you please confirm for me that the issue is seen only on the latest version and that the old vdev_id works fine?

@gdevenyi
Contributor

Hi Gabriel - I am now looking into your issue. While I get a local reproducer ... could you please confirm for me that the issue is seen only on the latest version and that the old vdev_id works fine?

This issue exists in both the old (MASTER) and your version, so it's somewhere in the common logic/code.

@arshad512
Contributor Author

This issue exists in both the old (MASTER) and your version, so it's somewhere in the common logic/code.

Thanks for the info. Then could we move this particular investigation to another bug, leaving this PR for Jeff's improvement only?

However, this is what I found from your logs. The code correctly (from what I traced back) picks up and parses the info from udevadm info. As long as all disks are covered, the numbering could be non-linear (or just as reported by udevadm). Maybe this is expected?

11.log

+ ls -l --full-time /dev/mapper
+ awk /\/dm-10$/{print }
+ DM_NAME=lrwxrwxrwx 1 root root       8 2020-10-30 12:48:05.594464270 -0400 35000cca26f817bf4 -> ../dm-10
+ udevadm info -q path -p /sys/block/sdn
+ sys_path=/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/host0/port-0:0/expander-0:0/port-0:0:11/end_device-0:0:11/target0:0:11/0:0:11:0/block/sdn

+ set -- devices pci0000:5d 0000:5d:00.0 0000:5e:00.0 host0 port-0:0 expander-0:0 port-0:0:11 end_device-0:0:11 target0:0:11 0:0:11:0 block sdn
+ num_dirs=13
This is where it gets the SLOT
+ SLOT=
+ i=10
+ eval echo ${10}
+ echo target0:0:11
+ d=target0:0:11
+ echo target0:0:11
+ sed -e s/^.*://
+ SLOT=11

map_slot() called with $SLOT

+ map_slot 11 A
+ LINUX_SLOT=11

Same for 13.log

+ ls -l --full-time /dev/mapper
+ awk /\/dm-33$/{print }
+ DM_NAME=lrwxrwxrwx 1 root root       8 2020-10-30 12:48:05.542464268 -0400 35000cca26f81db84 -> ../dm-33
+ udevadm info -q path -p /sys/block/sdo
+ sys_path=/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/host0/port-0:0/expander-0:0/port-0:0:13/expander-0:1/port-0:1:0/end_device-0:1:0/target0:0:13/0:0:13:0/block/sdo
+ set -- devices pci0000:5d 0000:5d:00.0 0000:5e:00.0 host0 port-0:0 expander-0:0 port-0:0:13 expander-0:1 port-0:1:0 end_device-0:1:0 target0:0:13 0:0:13:0 block sdo
+ num_dirs=15
+ SLOT=
+ i=12
+ eval echo ${12}
+ echo target0:0:13
+ d=target0:0:13
+ echo target0:0:13
+ sed -e s/^.*://
+ SLOT=13

map_slot() called with $SLOT

+ map_slot 13 A
+ LINUX_SLOT=13
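
In other words, the slot extraction seen in both traces boils down to roughly this (a sketch reconstructed from the -x output above, not the verbatim vdev_id code):

# Take the sysfs path of the block device, split it into components, and
# pull the SCSI target number out of the "targetH:B:T" component.
DEV=sdn                                      # example device from 11.log
sys_path=$(udevadm info -q path -p "/sys/block/$DEV")

OLD_IFS=$IFS
IFS='/'
set -- $sys_path                             # one positional parameter per path component
IFS=$OLD_IFS

SLOT=
for d in "$@"; do
    case "$d" in
    target*) SLOT=$(echo "$d" | sed -e 's/^.*://') ;;   # target0:0:11 -> 11
    esac
done
echo "SLOT=$SLOT"                            # 11 for sdn, 13 for sdo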

@gdevenyi
Contributor

Thanks for the info. Then could we move this particular investigation to another bug, leaving this PR for Jeff's improvement only?

I'm happy to move this off to another issue. It will be far easier to test with your code running so much more quickly.

@gdevenyi
Contributor

P.S. Thanks for the breakdown of the code path. I have found the answer: my JBOD attaches an "enclosure" device at port 12:

   `--4x--expander-0:0  
          |--1x--end_device-0:0:12
          |      `--enclosure  

@tonyhutter
Contributor

So far this is working on two types of JBODs we have, and I'm going to test on a third type tomorrow.

@gdevenyi
Contributor

gdevenyi commented Feb 2, 2021

So far this is working on two types of JBODs we have, and I'm going to test on a third type tomorrow.

Three types! It works fine on my older model 60-bay from Newisys

@arshad512
Contributor Author

Hi Tony, Gabriel

So far this is working on two types of JBODs we have, and I'm going to test on a third type tomorrow.

Three types! It works fine on my older model 60-bay from Newisys

Thanks for the verification. We also continue to test. Will update soon on this.

Contributor

@tonyhutter tonyhutter left a comment

So far this is working on two types of JBODs we have, and I'm going to test on a third type tomorrow.

It works on our third type as well. Approved

@behlendorf
Contributor

@arshad512 if you can just confirm you don't have any additional changes to add to this I'll go ahead and get it merged. Thanks for all the testing.

@arshad512
Contributor Author

@arshad512 if you can just confirm you don't have any additional changes to add to this I'll go ahead and get it merged. Thanks for all the testing.

Hi Brian - There is one change (improvement) we are currently making, which came out of Jeff's review/testing. We should be force-pushing the new version in a day. Please hold the merge.

@arshad512 arshad512 force-pushed the vdev_id-slot_mix_Jeff_01 branch from 9bfa0ca to 9f26d55 Compare February 8, 2021 01:43
@arshad512
Contributor Author

Hi Brian, Tony, Gabriel

Please review the new version.

Most of the changes/fixes are under map_jbod(). The fix/improvement is to read '/sys/class/enclosure/<enclosure>/id', which contains a unique ID identifying the JBOD, instead of reading them serially. The final result is JBODs numbered 1 and 2 instead of 1 and 3. There was another bug, also identified by Jeff, where under a dual-HBA config only half of the disks were seen.

Before fix

1115099 lrwxrwxrwx. 1 root root   11 Feb  3 02:11 AeonComputing_TEST0-1-9 -> ../../dm-61
1114107 lrwxrwxrwx. 1 root root   11 Feb  3 02:11 AeonComputing_TEST0-3-0 -> ../../dm-63

After fix

1115099 lrwxrwxrwx. 1 root root   11 Feb  3 06:15 AeonComputing_TEST0-1-9 -> ../../dm-61
1262966 lrwxrwxrwx. 1 root root   11 Feb  3 06:15 AeonComputing_TEST0-2-0 -> ../../dm-63
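
A minimal sketch of the idea behind the fix (assuming the standard sysfs enclosure layout; map_jbod() in the PR is the authoritative version):

# Number JBODs from the unique enclosure id exposed in sysfs rather than
# from discovery order, so numbering stays stable across HBAs/paths.
jbod=0
for encl_id in $(cat /sys/class/enclosure/*/id | sort -u); do
    jbod=$((jbod + 1))
    echo "JBOD $jbod -> enclosure id $encl_id"
done

With multipath, each enclosure shows up once per path, so collapsing on the unique id is what keeps the numbering contiguous (1 and 2 rather than 1 and 3).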

@arshad512 arshad512 force-pushed the vdev_id-slot_mix_Jeff_01 branch from 9f26d55 to 9df06cc Compare February 8, 2021 03:23
@AeonJJohnson

Based on our testing and other comments in this PR, the multiple-JBOD enumeration works on:

Western Digital Ultrastar Data 60 JBOD
Western Digital Ultrastar Data 102 JBOD
Quanta/QCT JB4602
Newisys NDS-4600

@tonyhutter
Contributor

@arshad512 I'll give it another test.

@tonyhutter
Contributor

tonyhutter commented Feb 9, 2021

One of my JBODs worked with the updated vdev_id, one is not working with it, and one more is left to test. I ran out of time before I could dig into why it wasn't working, but I'll try to dig deeper tomorrow.

@arshad512
Contributor Author

arshad512 commented Feb 9, 2021

One of my JBODs worked with the updated vdev_id, one is not working with it,...

Thanks for trying the new version. As Jeff pointed out, we tested across multiple JBODs and this version works. Could you please paste the '-x' output of the run that is not working for you?

@arshad512 arshad512 force-pushed the vdev_id-slot_mix_Jeff_01 branch from 9df06cc to f2bdbce Compare February 9, 2021 10:20
@arshad512
Contributor Author

Hi Tony,

While I wait for your debug output, I went over the code again just to see if I could catch a corner case. I could not. However, I realized the function get_uniq_encl_id() could be optimized a bit; my new push does that. Nothing functional changed in the new push.

@tonyhutter
Contributor

tonyhutter commented Feb 9, 2021

@arshad512 I figured out why it wasn't working - I had copied the new vdev_id to the node without setting +x! 😬

I tried your latest push with all three of our JBODs and it works fine 👍

@behlendorf
Contributor

@arshad512 @tonyhutter great, then this PR is ready to merge. Thanks!

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Feb 9, 2021
@behlendorf behlendorf merged commit 677cdf7 into openzfs:master Feb 9, 2021
@gdevenyi
Contributor

This is a huge win, thanks @arshad512 and @AeonJJohnson !

jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Nov 2, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Nov 13, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Nov 13, 2021