[FEAT] Multipath support (handle duplicate /dev/dsX devices) #81

Closed
velmirslac opened this issue Oct 5, 2020 · 15 comments

@velmirslac

Describe the bug
After running the collector on a system with 70 hard drives, the web app displayed only 42.

Expected behavior
Either listing all drives in a single page or generating multiple pages, or at least an indication that not all reported drives are displayed.

velmirslac added the bug label on Oct 5, 2020
@AnalogJ
Owner

AnalogJ commented Oct 8, 2020

Hm, that definitely shouldn't happen. Can you run smartctl --scan directly on the host and confirm that 70 devices are detected?

Under the hood, Scrutiny uses smartctl for device detection. If disks/arrays are not detected automatically, you will be able to force detection using the collector config file (which is still in beta, see #88).

@velmirslac
Author

Running smartctl --scan shows 75 devices in total.

@AnalogJ
Owner

AnalogJ commented Oct 8, 2020

could you paste the output here?

Is there a pattern to the devices that were detected vs the ones that are missing?

@velmirslac
Author

Sure, see below.

77 devices, excuse my typo earlier.

The web app shows /dev/sda through /dev/sdal, skips to /dev/sdbr, and then skips from there to /dev/sdbw.

smartctl --scan

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device
/dev/sdr -d scsi # /dev/sdr, SCSI device
/dev/sds -d scsi # /dev/sds, SCSI device
/dev/sdt -d scsi # /dev/sdt, SCSI device
/dev/sdu -d scsi # /dev/sdu, SCSI device
/dev/sdv -d scsi # /dev/sdv, SCSI device
/dev/sdw -d scsi # /dev/sdw, SCSI device
/dev/sdx -d scsi # /dev/sdx, SCSI device
/dev/sdy -d scsi # /dev/sdy, SCSI device
/dev/sdz -d scsi # /dev/sdz, SCSI device
/dev/sdaa -d scsi # /dev/sdaa, SCSI device
/dev/sdab -d scsi # /dev/sdab, SCSI device
/dev/sdac -d scsi # /dev/sdac, SCSI device
/dev/sdad -d scsi # /dev/sdad, SCSI device
/dev/sdae -d scsi # /dev/sdae, SCSI device
/dev/sdaf -d scsi # /dev/sdaf, SCSI device
/dev/sdag -d scsi # /dev/sdag, SCSI device
/dev/sdah -d scsi # /dev/sdah, SCSI device
/dev/sdai -d scsi # /dev/sdai, SCSI device
/dev/sdaj -d scsi # /dev/sdaj, SCSI device
/dev/sdak -d scsi # /dev/sdak, SCSI device
/dev/sdal -d scsi # /dev/sdal, SCSI device
/dev/sdam -d scsi # /dev/sdam, SCSI device
/dev/sdan -d scsi # /dev/sdan, SCSI device
/dev/sdao -d scsi # /dev/sdao, SCSI device
/dev/sdap -d scsi # /dev/sdap, SCSI device
/dev/sdaq -d scsi # /dev/sdaq, SCSI device
/dev/sdar -d scsi # /dev/sdar, SCSI device
/dev/sdas -d scsi # /dev/sdas, SCSI device
/dev/sdat -d scsi # /dev/sdat, SCSI device
/dev/sdau -d scsi # /dev/sdau, SCSI device
/dev/sdav -d scsi # /dev/sdav, SCSI device
/dev/sdaw -d scsi # /dev/sdaw, SCSI device
/dev/sdax -d scsi # /dev/sdax, SCSI device
/dev/sday -d scsi # /dev/sday, SCSI device
/dev/sdaz -d scsi # /dev/sdaz, SCSI device
/dev/sdba -d scsi # /dev/sdba, SCSI device
/dev/sdbb -d scsi # /dev/sdbb, SCSI device
/dev/sdbc -d scsi # /dev/sdbc, SCSI device
/dev/sdbd -d scsi # /dev/sdbd, SCSI device
/dev/sdbe -d scsi # /dev/sdbe, SCSI device
/dev/sdbf -d scsi # /dev/sdbf, SCSI device
/dev/sdbg -d scsi # /dev/sdbg, SCSI device
/dev/sdbh -d scsi # /dev/sdbh, SCSI device
/dev/sdbi -d scsi # /dev/sdbi, SCSI device
/dev/sdbj -d scsi # /dev/sdbj, SCSI device
/dev/sdbk -d scsi # /dev/sdbk, SCSI device
/dev/sdbl -d scsi # /dev/sdbl, SCSI device
/dev/sdbm -d scsi # /dev/sdbm, SCSI device
/dev/sdbn -d scsi # /dev/sdbn, SCSI device
/dev/sdbo -d scsi # /dev/sdbo, SCSI device
/dev/sdbp -d scsi # /dev/sdbp, SCSI device
/dev/sdbq -d scsi # /dev/sdbq, SCSI device
/dev/sdbr -d scsi # /dev/sdbr, SCSI device
/dev/sdbs -d scsi # /dev/sdbs, SCSI device
/dev/sdbt -d scsi # /dev/sdbt, SCSI device
/dev/sdbu -d scsi # /dev/sdbu, SCSI device
/dev/sdbv -d scsi # /dev/sdbv, SCSI device
/dev/sdbw -d scsi # /dev/sdbw, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
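
The totals in this listing can be cross-checked from the device names alone, since Linux assigns sdX suffixes in order (sda ... sdz, sdaa ... sdaz, sdba ...). The following is a minimal Python sketch of that arithmetic; the helper name `sd_index` is made up for illustration, and the "shown" figure assumes the two megaraid entries were among the drives displayed in the web app.

```python
def sd_index(name: str) -> int:
    """Return the 1-based position of an sdX name (sda=1, sdz=26, sdaa=27)."""
    base = name.rsplit("/", 1)[-1]          # accept "/dev/sdbw" or "sdbw"
    suffix = base[2:]
    if not base.startswith("sd") or not (suffix.isalpha() and suffix.islower()):
        raise ValueError(f"not an sdX device name: {name!r}")
    index = 0
    for ch in suffix:                        # letters act as bijective base-26 digits
        index = index * 26 + (ord(ch) - ord("a") + 1)
    return index


megaraid_devices = 2                         # the two /dev/bus/0 megaraid entries
total = sd_index("/dev/sdbw") + megaraid_devices

# sda..sdal, plus sdbr and sdbw, plus (assumed) the two megaraid entries
shown = sd_index("/dev/sdal") + 2 + megaraid_devices

print(total, shown)
```

Under those assumptions the scan total and the displayed count agree with the figures reported in this thread (77 scanned, 42 displayed).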

@AnalogJ
Owner

AnalogJ commented Oct 8, 2020

Can you run the following commands and attach the output? I want to do a quick check comparing a drive that works vs. one that doesn't. It'll confirm that the issue is in the Scrutiny collector and not some missing smartctl config.

smartctl --info -j /dev/sda 
smartctl --info -j /dev/sdam 
smartctl -x -j /dev/sda 
smartctl -x -j /dev/sdam 

@velmirslac
Author

velmirslac commented Oct 8, 2020

The output is attached.

smart.txt

@AnalogJ
Owner

AnalogJ commented Oct 8, 2020

Nothing looks weird there. Looks like I'm going to need you to do a full log dump.

Can you run the container with DEBUG and *_LOG_FILE env vars specified, then attach the log files here?

docker run -it --rm -p 8080:8080 \
-v /run/udev:/run/udev:ro \
--cap-add SYS_RAWIO \
--device=/dev/sda \
--device=/dev/sdb \
-e DEBUG=true \
-e COLLECTOR_LOG_FILE=/tmp/collector.log \
-e SCRUTINY_LOG_FILE=/tmp/web.log \
--name scrutiny \
analogj/scrutiny

# in another terminal trigger the collector
docker exec scrutiny scrutiny-collector-metrics run

# then use docker cp to copy the log files out of the container.
docker cp scrutiny:/tmp/collector.log collector.log
docker cp scrutiny:/tmp/web.log web.log

@velmirslac
Author

Something I should have mentioned in the op: I'm running the web app in Docker and using the standalone collector on a separate host.

The debug logs from each are attached.

web.log
collector.log

@AnalogJ
Owner

AnalogJ commented Oct 9, 2020

Hey @velmirslac Thanks for the log files, they were incredibly helpful.
It looks like all the data is correctly sent to the API/database; however, the API call that returns a summary of all devices is missing entries. I'll replicate your environment from the log files and get back to you.

@AnalogJ
Owner

AnalogJ commented Oct 11, 2020

Hey @velmirslac can you give me a bit more information about your system?

I was able to figure out why those devices are missing: each one is detected with the same WWN as another disk.

e.g. /dev/sdam == /dev/sdd, /dev/sdan == /dev/sde, /dev/sdbq == /dev/sdag

Initially I thought that maybe the WWN was miscalculated, but then I noticed that each device pair had the same serial number. Are you sure you have 77 separate physical disks? Is this a virtualized environment with virtual disks?

Any information you can provide would be appreciated.
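
The pairing described above (two device files reporting the same WWN and serial number) can be illustrated with a short Python sketch. This is not Scrutiny's actual implementation; the function name `find_duplicates`, the flat string form of the WWN, and every WWN and serial value below are placeholders. Only the sdd/sdam and sde/sdan pairings come from this thread.

```python
from collections import defaultdict


def find_duplicates(devices):
    """Group device paths that report the same (WWN, serial number) pair."""
    groups = defaultdict(list)
    for path, info in devices.items():
        groups[(info["wwn"], info["serial_number"])].append(path)
    # Keep only identities that are reachable through more than one device file.
    return {ident: paths for ident, paths in groups.items() if len(paths) > 1}


# Placeholder identity values, not taken from the attached logs.
devices = {
    "/dev/sdc": {"wwn": "wwn-C", "serial_number": "serial-C"},
    "/dev/sdd": {"wwn": "wwn-D", "serial_number": "serial-D"},
    "/dev/sde": {"wwn": "wwn-E", "serial_number": "serial-E"},
    "/dev/sdam": {"wwn": "wwn-D", "serial_number": "serial-D"},
    "/dev/sdan": {"wwn": "wwn-E", "serial_number": "serial-E"},
}

duplicates = find_duplicates(devices)
for paths in duplicates.values():
    print(" == ".join(paths))
```

Any store that keys devices by WWN alone would keep only one record per group, which would be consistent with the missing entries in the device summary.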

@velmirslac
Author

Sure thing.

The system uses multipathd to connect two SATA controllers to the same pool of physical disks. This provides redundancy in the event one SATA controller fails.

In this configuration, each HDD gets 2 /dev/sdX assignments - one on each controller. The devices are then actually accessed by the system, such as ZFS, as /dev/mpathX.

I apologize for not adding this info to the initial post. Since Scrutiny is evaluating WWNs and serials and then de-duplicating entries, it may be more appropriate to reclassify this issue as a feature request for multipath support rather than a bug.

@AnalogJ
Owner

AnalogJ commented Oct 12, 2020

Hey @velmirslac

In that case, yeah I guess it makes sense that scrutiny sees duplicate devices.

What would you expect "multipath" support to mean exactly? Since the devices are all detected, I guess scrutiny is "working-as-intended", though it is collecting duplicate SMART data. Would you want scrutiny to use the /dev/mpathX device files rather than the duplicated /dev/sdX entries?

AnalogJ changed the title from "[BUG] Limit on number of displayed drives" to "[FEAT] Multipath support (handle duplicate /dev/dsX devices)" on Oct 12, 2020
AnalogJ added the enhancement and waiting for response labels and removed the bug label on Oct 12, 2020
@velmirslac
Author

I think it would probably be best to keep displaying the de-duplicated /dev/sdX devices, as these are closer to the bare-metal state of the hardware. Multipath is an abstraction layer, and the /dev/mpathX devices are virtual.

Showing which /dev/sdX device is a member of which /dev/mpathX would be nice to have, but not critical.

@AnalogJ
Owner

AnalogJ commented Oct 14, 2020

I think the problem is that Scrutiny doesn't have a way to determine that one device file is duplicated by another. I think the solution would be for you to add a collector.yaml file and ignore the duplicated devices.

Would that be sufficient?
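
For a system with dozens of duplicated paths, the ignore list could be generated rather than typed by hand. The Python sketch below emits one entry per duplicate path. The `version`, `devices`, `device` and `ignore` keys are an assumption about the collector config layout and should be checked against the example config shipped with the collector version in use; `ignore_config` and the sample paths are hypothetical.

```python
def ignore_config(duplicate_paths):
    """Render a collector config that marks each given device path as ignored.

    The key names used here are assumed, not confirmed by this thread.
    """
    lines = ["version: 1", "devices:"]
    for path in duplicate_paths:
        lines.append(f"  - device: {path}")
        lines.append("    ignore: true")
    return "\n".join(lines) + "\n"


# Second path of each pair mentioned earlier in the thread.
config_text = ignore_config(["/dev/sdam", "/dev/sdan", "/dev/sdbq"])
print(config_text, end="")
```

Ignoring the second path of each pair keeps the lower-lettered /dev/sdX entry, which matches what the web app was already displaying.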

@AnalogJ
Owner

AnalogJ commented Dec 23, 2020

Hey @velmirslac I'm going to close this issue, as the solution seems to be to either ignore the duplicated devices or run the collector in a container and pass only the "real" devices into the container.

If you don't think this is sufficient, feel free to re-open this issue :)
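
The container option can be sketched the same way: build the docker run command with one --device flag per path that should be visible. This Python sketch reuses only the flags from the docker run example earlier in the thread; `docker_command` and the device list are hypothetical, and which path of each multipath pair counts as "real" is left to the operator.

```python
import shlex


def docker_command(real_devices):
    """Build a docker run command line that exposes only the given device paths."""
    args = [
        "docker", "run", "-it", "--rm", "-p", "8080:8080",
        "-v", "/run/udev:/run/udev:ro",
        "--cap-add", "SYS_RAWIO",
    ]
    args += [f"--device={path}" for path in real_devices]
    args += ["--name", "scrutiny", "analogj/scrutiny"]
    return shlex.join(args)


# Hypothetical selection: one path per physical disk.
command = docker_command(["/dev/sda", "/dev/sdb", "/dev/sdd"])
print(command)
```

Because the container only sees the device files passed in, the duplicate paths never reach the collector and no ignore list is needed.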

AnalogJ closed this as completed on Dec 23, 2020