[BUG] SMARTCTL status failed on 0.4.4 but passed on 0.3.x #247

Closed
martadinata666 opened this issue May 13, 2022 · 11 comments
Labels
bug Something isn't working waiting for response

Comments

@martadinata666

Describe the bug
The SMARTCTL status changed from passed to failed.

Expected behavior
Shouldn't it be passed? On 0.3.x the result was passed.

Screenshots

2022-05-13-111308_1413x916_scrot

I checked all the details: the only attribute with a failed status is Numb Err Log Entries; everything else passed.

Log Files

docker logs -f scrutiny-collector-1 
2022/05/13 10:45:01 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                        linux.amd64-0.4.4

time="2022-05-13T10:45:01+07:00" level=info msg="Verifying required tools" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Executing command: smartctl --scan -j" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Executing command: smartctl --info -j /dev/sda" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdb" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Using WWN Fallback" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdc" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdd" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Executing command: smartctl --info -j -d nvme /dev/nvme0" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Using WWN Fallback" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2022-05-13T10:45:01+07:00" level=info msg="Executing command: smartctl -x -j /dev/sda" type=metrics
time="2022-05-13T10:45:03+07:00" level=info msg="Publishing smartctl results for 0x5000cca756cc9dce\n" type=metrics
time="2022-05-13T10:45:13+07:00" level=error msg="An error occurred while publishing SMART data for device (0x5000cca756cc9dce): Post http://app:8080/api/device/0x5000cca756cc9dce/smart: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" type=metrics
time="2022-05-13T10:45:13+07:00" level=info msg="Collecting smartctl results for sdb\n" type=metrics
time="2022-05-13T10:45:13+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdb" type=metrics
time="2022-05-13T10:45:13+07:00" level=error msg="smartctl returned an error code (4) while processing sdb\n" type=metrics
time="2022-05-13T10:45:13+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T10:45:13+07:00" level=info msg="Publishing smartctl results for 0x0000000000000000\n" type=metrics
time="2022-05-13T10:45:21+07:00" level=info msg="Collecting smartctl results for sdc\n" type=metrics
time="2022-05-13T10:45:21+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdc" type=metrics
time="2022-05-13T10:45:21+07:00" level=error msg="smartctl returned an error code (4) while processing sdc\n" type=metrics
time="2022-05-13T10:45:21+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T10:45:21+07:00" level=info msg="Publishing smartctl results for 0x50014ee65e0f488e\n" type=metrics
time="2022-05-13T10:45:28+07:00" level=info msg="Collecting smartctl results for sdd\n" type=metrics
time="2022-05-13T10:45:28+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdd" type=metrics
time="2022-05-13T10:45:29+07:00" level=error msg="smartctl returned an error code (68) while processing sdd\n" type=metrics
time="2022-05-13T10:45:29+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T10:45:29+07:00" level=info msg="Publishing smartctl results for 0x50014ee2654c6ac1\n" type=metrics
time="2022-05-13T10:45:39+07:00" level=error msg="An error occurred while publishing SMART data for device (0x50014ee2654c6ac1): Post http://app:8080/api/device/0x50014ee2654c6ac1/smart: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" type=metrics
time="2022-05-13T10:45:39+07:00" level=info msg="Collecting smartctl results for nvme0\n" type=metrics
time="2022-05-13T10:45:39+07:00" level=info msg="Executing command: smartctl -x -j -d nvme /dev/nvme0" type=metrics
time="2022-05-13T10:45:39+07:00" level=info msg="Publishing smartctl results for 182506420422\n" type=metrics
time="2022-05-13T10:45:39+07:00" level=info msg="Main: Completed" type=metrics
2022/05/13 11:00:01 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                        linux.amd64-0.4.4

time="2022-05-13T11:00:01+07:00" level=info msg="Verifying required tools" type=metrics
time="2022-05-13T11:00:01+07:00" level=info msg="Executing command: smartctl --scan -j" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdc" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdd" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Executing command: smartctl --info -j -d nvme /dev/nvme0" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Using WWN Fallback" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Executing command: smartctl --info -j /dev/sda" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdb" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Using WWN Fallback" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Collecting smartctl results for sdc\n" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdc" type=metrics
time="2022-05-13T11:00:02+07:00" level=error msg="smartctl returned an error code (4) while processing sdc\n" type=metrics
time="2022-05-13T11:00:02+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T11:00:02+07:00" level=info msg="Publishing smartctl results for 0x50014ee65e0f488e\n" type=metrics
time="2022-05-13T11:00:12+07:00" level=info msg="Collecting smartctl results for sdd\n" type=metrics
time="2022-05-13T11:00:12+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdd" type=metrics
time="2022-05-13T11:00:12+07:00" level=error msg="smartctl returned an error code (68) while processing sdd\n" type=metrics
time="2022-05-13T11:00:12+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T11:00:12+07:00" level=info msg="Publishing smartctl results for 0x50014ee2654c6ac1\n" type=metrics
time="2022-05-13T11:00:22+07:00" level=error msg="An error occurred while publishing SMART data for device (0x50014ee2654c6ac1): Post http://app:8080/api/device/0x50014ee2654c6ac1/smart: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" type=metrics
time="2022-05-13T11:00:22+07:00" level=info msg="Collecting smartctl results for nvme0\n" type=metrics
time="2022-05-13T11:00:22+07:00" level=info msg="Executing command: smartctl -x -j -d nvme /dev/nvme0" type=metrics
time="2022-05-13T11:00:22+07:00" level=info msg="Publishing smartctl results for 182506420422\n" type=metrics
time="2022-05-13T11:00:23+07:00" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2022-05-13T11:00:23+07:00" level=info msg="Executing command: smartctl -x -j /dev/sda" type=metrics
time="2022-05-13T11:00:24+07:00" level=info msg="Publishing smartctl results for 0x5000cca756cc9dce\n" type=metrics
time="2022-05-13T11:00:34+07:00" level=error msg="An error occurred while publishing SMART data for device (0x5000cca756cc9dce): Post http://app:8080/api/device/0x5000cca756cc9dce/smart: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" type=metrics
time="2022-05-13T11:00:34+07:00" level=info msg="Collecting smartctl results for sdb\n" type=metrics
time="2022-05-13T11:00:34+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdb" type=metrics
time="2022-05-13T11:00:34+07:00" level=error msg="smartctl returned an error code (4) while processing sdb\n" type=metrics
time="2022-05-13T11:00:34+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T11:00:34+07:00" level=info msg="Publishing smartctl results for 0x0000000000000000\n" type=metrics
time="2022-05-13T11:00:43+07:00" level=info msg="Main: Completed" type=metrics
2022/05/13 11:15:01 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                        linux.amd64-0.4.4

time="2022-05-13T11:15:01+07:00" level=info msg="Verifying required tools" type=metrics
time="2022-05-13T11:15:01+07:00" level=info msg="Executing command: smartctl --scan -j" type=metrics
time="2022-05-13T11:15:01+07:00" level=info msg="Executing command: smartctl --info -j /dev/sda" type=metrics
time="2022-05-13T11:15:01+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T11:15:01+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdb" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Using WWN Fallback" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdc" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Executing command: smartctl --info -j /dev/sdd" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Generating WWN" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Executing command: smartctl --info -j -d nvme /dev/nvme0" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Using WWN Fallback" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2022-05-13T11:15:02+07:00" level=info msg="Executing command: smartctl -x -j /dev/sda" type=metrics
time="2022-05-13T11:15:03+07:00" level=info msg="Publishing smartctl results for 0x5000cca756cc9dce\n" type=metrics
time="2022-05-13T11:15:12+07:00" level=info msg="Collecting smartctl results for sdb\n" type=metrics
time="2022-05-13T11:15:12+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdb" type=metrics
time="2022-05-13T11:15:12+07:00" level=error msg="smartctl returned an error code (4) while processing sdb\n" type=metrics
time="2022-05-13T11:15:12+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T11:15:12+07:00" level=info msg="Publishing smartctl results for 0x0000000000000000\n" type=metrics
time="2022-05-13T11:15:17+07:00" level=info msg="Collecting smartctl results for sdc\n" type=metrics
time="2022-05-13T11:15:17+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdc" type=metrics
time="2022-05-13T11:15:17+07:00" level=error msg="smartctl returned an error code (4) while processing sdc\n" type=metrics
time="2022-05-13T11:15:17+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T11:15:17+07:00" level=info msg="Publishing smartctl results for 0x50014ee65e0f488e\n" type=metrics
time="2022-05-13T11:15:22+07:00" level=info msg="Collecting smartctl results for sdd\n" type=metrics
time="2022-05-13T11:15:22+07:00" level=info msg="Executing command: smartctl -x -j /dev/sdd" type=metrics
time="2022-05-13T11:15:23+07:00" level=error msg="smartctl returned an error code (68) while processing sdd\n" type=metrics
time="2022-05-13T11:15:23+07:00" level=error msg="smartctl detected a checksum error" type=metrics
time="2022-05-13T11:15:23+07:00" level=info msg="Publishing smartctl results for 0x50014ee2654c6ac1\n" type=metrics
time="2022-05-13T11:15:33+07:00" level=error msg="An error occurred while publishing SMART data for device (0x50014ee2654c6ac1): Post http://app:8080/api/device/0x50014ee2654c6ac1/smart: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" type=metrics
time="2022-05-13T11:15:33+07:00" level=info msg="Collecting smartctl results for nvme0\n" type=metrics
time="2022-05-13T11:15:33+07:00" level=info msg="Executing command: smartctl -x -j -d nvme /dev/nvme0" type=metrics
time="2022-05-13T11:15:33+07:00" level=info msg="Publishing smartctl results for 182506420422\n" type=metrics
time="2022-05-13T11:15:33+07:00" level=info msg="Main: Completed" type=metrics
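The `error code (4)` and `error code (68)` values in the log above are smartctl exit statuses, which the smartctl man page documents as a bitmask. The following is a minimal sketch of decoding them; the function and dictionary names are made up for illustration and are not part of Scrutiny or smartctl.

```python
from typing import List

# Bit meanings paraphrased from the "EXIT STATUS" section of the smartctl man page.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART/ATA command failed, or a checksum error was found in a SMART data structure",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_smartctl_exit(code: int) -> List[str]:
    """Return the description of every bit set in a smartctl exit status."""
    return [msg for bit, msg in SMARTCTL_EXIT_BITS.items() if code & (1 << bit)]


# The two non-zero exit statuses that appear in the collector log.
for code in (4, 68):
    print(code, decode_smartctl_exit(code))
```

Neither 4 (bit 2) nor 68 (bits 2 and 6) sets bit 3, the "DISK FAILING" bit, which is consistent with smartctl itself still reporting the drives as passed.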

@martadinata666 martadinata666 added the bug Something isn't working label May 13, 2022
@AnalogJ
Owner

AnalogJ commented May 13, 2022

This is actually a fix not a bug.
Unfortunately Scrutiny has been incorrectly displaying "failed" thresholds as "warn" due to a bug in the display logic:

e780161#diff-2c860daa8b45fb817d063c797dc6ffdfd5ad820504195a2b553fdd9bd26f0b84

Unfortunately for NVMe & SCSI drives, there's not much data from BackBlaze about the real-world failure thresholds, so I'm just using the recommended thresholds to determine success/failure for critical attributes.
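As a rough illustration of how a threshold on a critical attribute can turn into a "failed" status, here is a minimal sketch. The `AttributeRule` type, the `evaluate` function, and the threshold values are all hypothetical; they are not Scrutiny's actual data model or its real thresholds.

```python
from dataclasses import dataclass


@dataclass
class AttributeRule:
    """Hypothetical rule: values above `threshold` count as a breach."""
    name: str
    critical: bool
    threshold: int


def evaluate(rule: AttributeRule, value: int) -> str:
    """Classify one attribute reading as passed, warn, or failed."""
    if value <= rule.threshold:
        return "passed"
    # A breach on a critical attribute fails the device; otherwise it only warns.
    return "failed" if rule.critical else "warn"


# Illustrative values only: a strict rule where any logged error is a breach.
strict = AttributeRule("Numb Err Log Entries", critical=True, threshold=0)
advisory = AttributeRule("Numb Err Log Entries", critical=False, threshold=0)

print(evaluate(strict, 5))    # breach on a critical attribute
print(evaluate(advisory, 5))  # same reading, non-critical attribute
```

Under these assumptions, the same reading is reported differently depending only on whether the attribute is treated as critical, which is why the choice of critical attributes matters so much for drive types with little real-world data.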

@AnalogJ AnalogJ closed this as completed May 13, 2022
@shamoon
Contributor

shamoon commented May 13, 2022

I'm guessing the OP is in the same situation as me (#187): smartctl is reporting "passed", so it's a bit confusing that Scrutiny says failed.

@AnalogJ
Owner

AnalogJ commented May 13, 2022

Yep, I'm writing a response to you right now @shamoon

@shamoon
Contributor

shamoon commented May 13, 2022

Ha, OK, thanks. I have one update too; I'll post it there.

@martadinata666
Author

This is actually a fix not a bug.
Unfortunately Scrutiny has been incorrectly displaying "failed" thresholds as "warn" due to a bug in the display logic:

e780161#diff-2c860daa8b45fb817d063c797dc6ffdfd5ad820504195a2b553fdd9bd26f0b84

Unfortunately for NVMe & SCSI drives, there's not much data from BackBlaze about the real-world failure thresholds, so I'm just using the recommended thresholds to determine success/failure for critical attributes.

Could you add this to the README or docs? Additional documentation on NVMe failure thresholds, the real-world data, and how Scrutiny determines fail/pass would help prevent confusion and panic over a sudden failed status. Thanks for the support.

@AnalogJ
Owner

AnalogJ commented May 13, 2022

Yeah, I added a section in the troubleshooting doc to point that out:

https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#scrutiny-detects-failure-but-smart-passed

Do you think there's anything else I should add?

@shamoon
Contributor

shamoon commented May 13, 2022

That's helpful for sure, but it really does highlight the issue for me.

Right now the UI makes it look like SMART is failing, or the device is somehow in danger of failing imminently, when in fact all it's reporting is a correlation (not causation) drawn from Backblaze data. Anyway, I know you know!

@AnalogJ
Owner

AnalogJ commented May 23, 2022

Copying a comment from the SelfHosted Discord #storage channel:

Yeah, prior to v0.4, there was a UI bug that basically meant all Backblaze data was ignored. I fixed that in v0.4.x, but it definitely scared users.

One additional thing to note is that I may need to tweak the "critical" attributes & thresholds for NVMe and SCSI drives. Backblaze doesn't have a lot of data for those drive types. I may disable "SCRUTINY" failures on those drive types until I have more hard data.

AnalogJ added a commit that referenced this issue May 23, 2022
…. More analysis needed for NVMe drives & their critical attributes.

- fixes #187
- fixes #247
@AnalogJ
Owner

AnalogJ commented May 25, 2022

Attributes and thresholds with little to no real-world Backblaze data have been loosened so they no longer cause failures.

fixed in v0.4.7 🎉

@martadinata666
Author

Do I need to wipe the data that has already been collected? I have already upgraded to 0.4.7 and waited a few hours, collecting data every 15 minutes, but the drive is still marked as failed 🤔

@AnalogJ
Owner

AnalogJ commented May 26, 2022

@martadinata666 unfortunately, because of the way the device status update code is written, it will not unset a "failed" status on a drive.
However, can you confirm that the "Numb Err Log Entries" attribute in the device details page no longer has an error?
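One way such "sticky" behavior can arise is when status is stored as bit flags that are only ever OR-ed together. The sketch below illustrates that pattern; the flag names and the `update_status` function are assumptions for illustration, not the actual Scrutiny code.

```python
from enum import IntFlag


class DeviceStatus(IntFlag):
    """Hypothetical status flags; 0 means no failure has been recorded."""
    PASSED = 0
    FAILED_SMART = 1
    FAILED_SCRUTINY = 2


def update_status(current: DeviceStatus, latest: DeviceStatus) -> DeviceStatus:
    """Merge the latest result into the stored status.

    OR-ing can set failure bits but never clears them, so a later
    passing result leaves an earlier failure in place.
    """
    return current | latest


status = DeviceStatus.PASSED
status = update_status(status, DeviceStatus.FAILED_SCRUTINY)  # a failing collection run
status = update_status(status, DeviceStatus.PASSED)           # a later passing run

# The failure bit is still set after the passing run.
print(bool(status & DeviceStatus.FAILED_SCRUTINY))
```

Under this assumption, a drive that failed once stays marked as failed until the stored status is explicitly reset, even if every later collection passes.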


3 participants