Device open failed, device did not return an IDENTIFY DEVICE structure, #91

Open
Lusitaniae opened this issue Oct 27, 2022 · 4 comments · May be fixed by #242 or prometheus-community/helm-charts#4844


@Lusitaniae

Lusitaniae commented Oct 27, 2022

Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 2"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=readjson.go:122 level=error msg="Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="smartctl returned bad data for device /dev/sdb"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="Device /dev/bus/0 unavialable"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="Device /dev/bus/0 unavialable"
/usr/local/bin/smartctl_exporter  --version
smartctl_exporter, version 0.9.0 (branch: HEAD, revision: 0f32489b4018a21747109a33d7297c1ed85e10ab)
  build user:       root@f07a6d7b35c8
  build date:       20221020-16:19:31
  go version:       go1.18.7
  platform:         linux/amd64

Constantly seeing NVMe drives fail due to heavy load.

Usually will see something like the below in dmesg

But seems smartctl_exporter doesn't pick up any of this? (could be the smart tool itself too)

At least it should report some kind of error if it can't scan the drive, no?

(metrics are not reset to 0 when the exporter can't scan again?)

[Wed Oct 26 06:18:54 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:18:59 2022] nvme nvme0: I/O 718 QID 11 timeout, aborting
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 529 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 530 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 544 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 545 QID 34 timeout, aborting
...
[Wed Oct 26 06:20:17 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] blk_update_request: I/O error, dev nvme0n1, sector 1875858760 op 0x1:(WRITE) flags 0x1800 phys_seg 1 prio class 0
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): log I/O error -5
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): xfs_do_force_shutdown(0x2) called from line 1250 of file fs/xfs/xfs_log.c. Return address = 00000000dbc93c6d
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): Log I/O Error Detected. Shutting down filesystem
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): Please unmount the filesystem and rectify the problem(s)
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
curl localhost:9633/metrics -s | grep crit | grep -v "#"
critical_warning{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
critical_warning{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_critical_warning{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
smartctl_device_critical_warning{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0

curl localhost:9633/metrics -s | grep err | grep -v "#"
media_errors{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
media_errors{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_media_errors{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
smartctl_device_media_errors{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_num_err_log_entries{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 113
smartctl_device_num_err_log_entries{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 209

Also small nitpicks:

  • typo: unavialable
  • there should be a metric with smartctl_exporter version ?
@robryk

robryk commented Jan 10, 2023

I encountered the same behaviour when smartctl_exporter was running as a user that couldn't open the device:

  • the errors were logged to stderr,
  • there was no indication of the errors in metrics (just as if I never asked it to scan that device).

Regardless of the reason for errors, I think it's a bug that they are not explicitly reported in metrics.

@NiceGuyIT
Member

Hi @Lusitaniae. "Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode" comes from smartctl itself and is documented in its list of exit codes. When that happens, smartctl_exporter logs an error and does not produce any metrics for the device. As @robryk mentioned, this can happen if you run the exporter as a user that doesn't have permission to open the device.
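
For reference, smartctl's exit status is a bitmask, and "exit status 2" in the log above means only bit 1 is set. The following is a rough Go sketch of decoding it; the bit meanings are abridged from the EXIT STATUS section of the smartctl man page, while `exitBits` and `decodeExitStatus` are names made up for this example and are not the exporter's code.

```go
package main

import "fmt"

// exitBits maps each bit of smartctl's exit status to its documented
// meaning (wording abridged from the smartctl man page).
var exitBits = []string{
	"command line did not parse",
	"device open failed, no IDENTIFY DEVICE structure, or device in low-power mode",
	"a SMART or other ATA command failed, or a SMART data checksum error occurred",
	"SMART status check returned DISK FAILING",
	"prefail attributes at or below threshold",
	"attributes were at or below threshold at some time in the past",
	"device error log contains records of errors",
	"device self-test log contains records of errors",
}

// decodeExitStatus returns the meaning of every bit set in status.
func decodeExitStatus(status int) []string {
	var reasons []string
	for bit, meaning := range exitBits {
		if status&(1<<bit) != 0 {
			reasons = append(reasons, meaning)
		}
	}
	return reasons
}

func main() {
	// "exit status 2" has only bit 1 set.
	for _, reason := range decodeExitStatus(2) {
		fmt.Println(reason)
	}
}
```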

The two nitpicks have been fixed.


> • there was no indication of the errors in metrics (just as if I never asked it to scan that device).
>
> Regardless of the reason for errors, I think it's a bug that they are not explicitly reported in metrics.

Hey @robryk, if you believe this is a bug, please open a separate issue to address it. One could argue environmental errors, such as permission errors, should not be included in the exporter output.

@nazar-pc
Contributor

I have devices in low power/standby mode, but smartctl -a and other commands still return data just fine.

The cause seems to be the --nocheck=standby argument, which results in exit status 2 and that error message. Why is it necessary? I suspect it is there to avoid waking up HDDs, but there are two problems:

  1. SSDs should still be fine, so the exporter could remember which drives are SSDs and still query them
  2. This shouldn't break reporting for other devices, but I have 2 of these low power/standby SSDs (cheap Chinese ones) and 3 Samsung SSDs that are no longer being reported because of it

@nazar-pc
Contributor

#61 requested --nocheck=standby and it was implemented in #74. I think it'd be helpful to support exceptions for that option, because I have SSDs that report that they are sleeping even though there are clearly no moving parts in them.
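
A rough sketch of how such an exception could work, assuming a per-device set of drives that should always be queried. `smartctlArgs` and `alwaysQuery` are hypothetical names, and the argument list is simplified compared with what the exporter actually passes; only `--json`, `--all` and `--nocheck=standby` are real smartctl options here.

```go
package main

import "fmt"

// smartctlArgs builds the smartctl argument list for one device. Devices
// listed in alwaysQuery (a hypothetical exception set) are queried without
// --nocheck=standby, so a reported standby state does not skip them.
func smartctlArgs(device string, alwaysQuery map[string]bool) []string {
	args := []string{"--json", "--all"}
	if !alwaysQuery[device] {
		// Default behaviour: do not wake a drive that is in standby.
		args = append(args, "--nocheck=standby")
	}
	return append(args, device)
}

func main() {
	exceptions := map[string]bool{"/dev/sda": true} // e.g. an SSD that reports standby
	fmt.Println(smartctlArgs("/dev/sda", exceptions))
	fmt.Println(smartctlArgs("/dev/sdb", exceptions))
}
```

The default stays conservative for HDDs, and only drives that were explicitly listed skip the standby check.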
