
Ignore UltraDMA CRC Error Count unless it is increasing? #364

Closed
paulmorabito opened this issue Aug 30, 2022 · 8 comments · Fixed by #547
Labels
enhancement New feature or request

Comments

@paulmorabito

Hi,

I'm running the latest image (ghcr.io/analogj/scrutiny:master-omnibus) in docker on Unraid. One of my disks had an issue a long time ago that was due to a bad cable and as a result, the "UltraDMA CRC Error Count" is elevated (87). Scrutiny reports this as a failed disk even though the value is not incrementing. Should this be reported as a failed disk when it's working fine and as far as I know, at no increased risk of failure? If so, is it possible to mark it as "accepted" and then monitor for the value incrementing?

Thanks for a great app btw.
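
For illustration, here is a minimal sketch of the "accept and monitor for increase" idea in Go (this is not Scrutiny's actual code; all names are hypothetical): record the raw value the user accepts, and only flag the attribute again if the counter later grows past that baseline.

// Minimal sketch: treat an attribute as failed only when its raw value
// has grown past an "accepted" baseline.
package main

import "fmt"

// Baseline records the raw value a user accepted for a given attribute,
// e.g. UltraDMA CRC Error Count (ID 199) stuck at 87 after a cable swap.
type Baseline map[int]int64 // attribute ID -> accepted raw value

// Failed reports whether the attribute should still be considered failing:
// only if the current raw value exceeds the accepted baseline.
func Failed(b Baseline, attrID int, raw int64) bool {
    accepted, ok := b[attrID]
    if !ok {
        return raw > 0 // no baseline recorded: any error count flags the drive
    }
    return raw > accepted
}

func main() {
    b := Baseline{199: 87}          // user accepted the historical 87 CRC errors
    fmt.Println(Failed(b, 199, 87)) // false: count has not increased
    fmt.Println(Failed(b, 199, 88)) // true: count increased, cable may be bad again
}

A persisted baseline like this would let the historical 87 CRC errors stand while still catching any new cable fault.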

@paulmorabito paulmorabito added the bug Something isn't working label Aug 30, 2022
@AnalogJ AnalogJ added enhancement New feature or request and removed bug Something isn't working labels Oct 13, 2022
@AnalogJ
Owner

AnalogJ commented Oct 13, 2022

That's interesting. There are a couple of other issues where users have requested the ability to "mute" notifications for specific SMART attributes and set custom failure thresholds. I think this falls under a similar category.

I'll keep this open for now, but I may merge/close this issue as a dupe in the future.
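
As a rough sketch of the "mute specific attributes" idea (hypothetical names, not Scrutiny's API): a notification filter could simply drop failed attributes whose IDs the user has muted, so they no longer contribute to the device's overall status.

// Minimal sketch: suppress failed attributes that the user has explicitly muted.
package main

import "fmt"

type AttributeStatus struct {
    ID     int
    Failed bool
}

// FilterMuted drops failed attributes whose IDs appear in the muted set.
func FilterMuted(attrs []AttributeStatus, muted map[int]bool) []AttributeStatus {
    var kept []AttributeStatus
    for _, a := range attrs {
        if a.Failed && muted[a.ID] {
            continue // user accepted this failure; do not report it
        }
        kept = append(kept, a)
    }
    return kept
}

func main() {
    attrs := []AttributeStatus{{ID: 199, Failed: true}, {ID: 5, Failed: false}}
    muted := map[int]bool{199: true} // user chose to mute UltraDMA CRC Error Count
    fmt.Println(FilterMuted(attrs, muted)) // [{5 false}]
}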

@Faeranne

Faeranne commented Dec 4, 2022

Subscribing to follow this, but I want to point out that this can be a show-stopping issue: a bad SATA cable can cause the CRC error count to rise, and unless I missed a flag somewhere, there is no way to reset it. In my case a bad SATA controller caused all 5 of the disks I currently use to increment this value. Since Scrutiny considers any non-zero count to be a drive failure, I get no meaningful information from the dashboard; all 5 drives have reported as faulty from the moment I spun up Scrutiny.
I mostly wanted to post this to show how this really can become a major bug for an end user. I personally know nothing about Go, so all I can do is manually inspect the Scrutiny data (which is still far easier to read than the console output) and watch this bug for updates. Excitedly waiting to see if anything comes of this, as I'm otherwise really liking the software.

@michaelkrieger

michaelkrieger commented Apr 10, 2023

Very much agree here. There are many similar metrics that need to be overridable. A CRC error is often just a bad cable (in my case I reseated the drive in an enclosure). I have a single command timeout (due to the USB bus being reset), 13 CRC errors (due to a bad cable), and oddly a spin-up time warning at 91 (though that is the normalized value and it is the same across all drives). This is identical across 8 Seagate IronWolf drives; I have no reason to believe they all suffered the identical failure.

Currently Scrutiny alarms me about failing drives, but these numbers aren't increasing now that the underlying issue has been addressed. The ability to set a new baseline threshold at the current value, so alerts only fire when the counts increase, is necessary. Otherwise Scrutiny's status gets turned off and we rely on the raw SMART data only.

(Two screenshots attached: 2023-04-10 122516 and 2023-04-10 122539.)

@aerz

aerz commented Apr 12, 2023

I agree with the comments here. I used a JMicron bridge chip (JMS561) that performed bad SATA command translation, and SMART recorded the resulting errors under the UltraDMA CRC Error Count attribute. Now every time Scrutiny starts, it raises an error from smartctl:

time="2023-04-12T11:07:56Z" level=error msg="smartctl returned an error code (64) while processing sdb\n" type=metrics
time="2023-04-12T11:07:56Z" level=error msg="smartctl detected a error log with errors" type=metrics

However, the error happened in the past; everything is working fine now. Perhaps this value could be treated as informational rather than as an error.
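
For context, smartctl's exit status is a bitmask (documented in the smartctl man page), and exit code 64 corresponds to bit 6 only: the device error log contains records of errors, i.e. historical errors rather than a currently failing disk. A small sketch of decoding it (bit meanings paraphrased from the man page; this is not Scrutiny code):

// Minimal sketch: decode smartctl's exit status bitmask.
package main

import "fmt"

var smartctlBits = []string{
    "command line did not parse",                            // bit 0
    "device open failed",                                    // bit 1
    "a SMART or other ATA command to the disk failed",       // bit 2
    "SMART status check returned DISK FAILING",              // bit 3
    "prefail attributes found at or below threshold",        // bit 4
    "some attributes were at or below threshold in the past",// bit 5
    "the device error log contains records of errors",       // bit 6
    "the device self-test log contains records of errors",   // bit 7
}

// decode returns the human-readable reasons encoded in a smartctl exit code.
func decode(exitCode int) []string {
    var reasons []string
    for bit, msg := range smartctlBits {
        if exitCode&(1<<bit) != 0 {
            reasons = append(reasons, msg)
        }
    }
    return reasons
}

func main() {
    fmt.Println(decode(64)) // [the device error log contains records of errors]
}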

@chertvl

chertvl commented Nov 26, 2023

This question is still important.
Yesterday I installed a new SSD into the server, and unfortunately I did not clean the M.2 connector contacts before installation.
Overnight (consistently once an hour), SMART logged CRC errors.
Unfortunately, Scrutiny now shows this new SSD as Failed (status: 2).
I have since cleaned the contacts and reseated the SSD, and the errors have stopped appearing, but is it marked as Failed forever?


@chertvl

chertvl commented Nov 26, 2023

> This question is still important.
> Yesterday I installed a new SSD into the server,

@AnalogJ
It would be great if I could either ignore this attribute entirely or ignore its current value as a baseline, even if that means adding literally 1% more logic to Scrutiny. This is how the attribute is currently reported:

      '199':
        attribute_id: 199
        value: 100
        thresh: 0
        worst: 100
        raw_value: 18
        raw_string: '18'
        when_failed: ''
        transformed_value: 0
        status: 4
        status_reason: Observed Failure Rate for Non-Critical Attribute is greater
          than 20%
        failure_rate: 0.20878219917249793
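
For illustration only (a sketch with made-up names, not Scrutiny's evaluation code): the record above is flagged because its observed failure rate (~0.209) exceeds the 20% cutoff for a non-critical attribute. A baseline check along the lines described in this thread could gate that verdict so an unchanged raw value no longer fails the drive.

// Minimal sketch: only apply the failure-rate verdict when the raw counter
// has actually increased past an accepted baseline.
package main

import "fmt"

type Attribute struct {
    ID          int
    RawValue    int64
    FailureRate float64
}

// evaluate fails the attribute only if its observed failure rate is above 20%
// AND its raw value has grown past the accepted baseline.
func evaluate(a Attribute, acceptedRaw int64) string {
    if a.FailureRate > 0.20 && a.RawValue > acceptedRaw {
        return "failed"
    }
    return "passed"
}

func main() {
    attr := Attribute{ID: 199, RawValue: 18, FailureRate: 0.20878219917249793}
    fmt.Println(evaluate(attr, 18)) // passed: raw value unchanged since baseline
    fmt.Println(evaluate(attr, 17)) // failed: raw value grew, keep the failure
}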

@MrMeeb

MrMeeb commented Dec 28, 2023

I've just come across exactly the same issue: I set up Scrutiny for the first time and found that one of my disks has a CRC error count of 27. This disk is over 5 years old, so that could have happened at any time. As a result, the disk is considered to be failing, and I'm notified as such with no way of dismissing the alert or setting this value as the 'new normal'. CheckMK uses the concept of setting a value as the 'new normal' for these scenarios: https://forum.checkmk.com/t/udma-crc-errors-not-resetting/32068. That way the alerts don't have to be permanently muted, and no arbitrary limit has to be set either.

This same problem was also mentioned in the more recent #553

@zwimer

zwimer commented Jan 10, 2024

This would be super useful; I currently have to keep Scrutiny's metrics off because otherwise it perpetually reports the drive as failed just because of a bad seating a few years ago.
