
Ignore UltraDMA CRC Error Count unless it is increasing? #364

Closed
paulmorabito opened this issue Aug 30, 2022 · 8 comments · Fixed by #547
Labels
enhancement New feature or request

Comments

@paulmorabito

Hi,

I'm running the latest image (ghcr.io/analogj/scrutiny:master-omnibus) in docker on Unraid. One of my disks had an issue a long time ago that was due to a bad cable and as a result, the "UltraDMA CRC Error Count" is elevated (87). Scrutiny reports this as a failed disk even though the value is not incrementing. Should this be reported as a failed disk when it's working fine and as far as I know, at no increased risk of failure? If so, is it possible to mark it as "accepted" and then monitor for the value incrementing?

Thanks for a great app btw.
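
For illustration, here is a minimal sketch of the "accept and monitor for increase" idea in Go (this is not Scrutiny's actual code; all names are hypothetical): record the raw value the user accepts, and only flag the attribute again if the counter later grows past that baseline.

// Minimal sketch: treat an attribute as failed only when its raw value
// has grown past an "accepted" baseline.
package main

import "fmt"

// Baseline records the raw value a user accepted for a given attribute,
// e.g. UltraDMA CRC Error Count (ID 199) stuck at 87 after a cable swap.
type Baseline map[int]int64 // attribute ID -> accepted raw value

// Failed reports whether the attribute should still be considered failing:
// only if the current raw value exceeds the accepted baseline.
func Failed(b Baseline, attrID int, raw int64) bool {
    accepted, ok := b[attrID]
    if !ok {
        return raw > 0 // no baseline recorded: any error count flags the drive
    }
    return raw > accepted
}

func main() {
    b := Baseline{199: 87}          // user accepted the historical 87 CRC errors
    fmt.Println(Failed(b, 199, 87)) // false: count has not increased
    fmt.Println(Failed(b, 199, 88)) // true: count increased, cable may be bad again
}

A persisted baseline like this would let the historical 87 CRC errors stand while still catching any new cable fault.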

@paulmorabito paulmorabito added the bug Something isn't working label Aug 30, 2022
@AnalogJ AnalogJ added enhancement New feature or request and removed bug Something isn't working labels Oct 13, 2022
@AnalogJ
Owner

AnalogJ commented Oct 13, 2022

That's interesting. There are a couple of other issues where users have requested the ability to "mute" notifications for specific SMART attributes and set custom failure thresholds. I think this falls under a similar category.

I'll keep this open for now, but I may merge/close this issue as a dupe in the future.
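
As a rough sketch of the "mute specific attributes" idea (hypothetical names, not Scrutiny's API): a notification filter could simply drop failed attributes whose IDs the user has muted, so they no longer contribute to the device's overall status.

// Minimal sketch: suppress failed attributes that the user has explicitly muted.
package main

import "fmt"

type AttributeStatus struct {
    ID     int
    Failed bool
}

// FilterMuted drops failed attributes whose IDs appear in the muted set.
func FilterMuted(attrs []AttributeStatus, muted map[int]bool) []AttributeStatus {
    var kept []AttributeStatus
    for _, a := range attrs {
        if a.Failed && muted[a.ID] {
            continue // user accepted this failure; do not report it
        }
        kept = append(kept, a)
    }
    return kept
}

func main() {
    attrs := []AttributeStatus{{ID: 199, Failed: true}, {ID: 5, Failed: false}}
    muted := map[int]bool{199: true} // user chose to mute UltraDMA CRC Error Count
    fmt.Println(FilterMuted(attrs, muted)) // [{5 false}]
}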

@Faeranne

Faeranne commented Dec 4, 2022

Subscribing to follow this, but I want to point out that this can be a show-stopping issue: a bad SATA cable can cause the CRC error count to rise, and unless I missed a flag somewhere, there is no way to reset it. In my case a bad SATA controller caused all 5 of the disks I currently use to increment this value. Since Scrutiny considers any non-zero count to be a drive failure, I get no meaningful information from the dashboard; all 5 drives have reported as faulty from the moment I spun up Scrutiny.
I mostly wanted to post this to show how this really can become a major bug for an end user. I personally know nothing about Go, so all I can do is manually inspect the Scrutiny data (which is still far easier to read than the console output) and watch this bug for updates. Excitedly waiting to see if anything comes of this, as I'm otherwise really liking the software.

@michaelkrieger

michaelkrieger commented Apr 10, 2023

Very much agree here. There are many similar metrics that need to be overridable. A CRC error is often just a bad cable (in my case I reseated the drive in an enclosure). I have a single command timeout (due to the USB bus being reset), 13 CRC errors (due to a bad cable), and oddly a spin-up time warning at 91 (though that is the normalized value and it is the same across all drives). This is identical across 8 Seagate IronWolf drives; I have no reason to believe they all suffered the identical failure.

Currently Scrutiny alarms me about failing drives, but these numbers aren't increasing now that the underlying issue has been addressed. The ability to set a new baseline threshold at the current value, so alerts only fire when the counts increase, is necessary. Otherwise Scrutiny's status gets turned off and we rely on the raw SMART data only.

(Two screenshots attached: 2023-04-10 122516 and 2023-04-10 122539.)

@aerz

aerz commented Apr 12, 2023

I agree with the comments here. I used a JMicron bridge chip (JMS561) that performed bad SATA command translation, and SMART recorded the resulting errors under the UltraDMA CRC Error Count attribute. Now every time Scrutiny starts, it raises an error from smartctl:

time="2023-04-12T11:07:56Z" level=error msg="smartctl returned an error code (64) while processing sdb\n" type=metrics
time="2023-04-12T11:07:56Z" level=error msg="smartctl detected a error log with errors" type=metrics

However, the error happened in the past; everything is working fine now. Perhaps this value could be treated as informational rather than as an error.
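
For context, smartctl's exit status is a bitmask (documented in the smartctl man page), and exit code 64 corresponds to bit 6 only: the device error log contains records of errors, i.e. historical errors rather than a currently failing disk. A small sketch of decoding it (bit meanings paraphrased from the man page; this is not Scrutiny code):

// Minimal sketch: decode smartctl's exit status bitmask.
package main

import "fmt"

var smartctlBits = []string{
    "command line did not parse",                            // bit 0
    "device open failed",                                    // bit 1
    "a SMART or other ATA command to the disk failed",       // bit 2
    "SMART status check returned DISK FAILING",              // bit 3
    "prefail attributes found at or below threshold",        // bit 4
    "some attributes were at or below threshold in the past",// bit 5
    "the device error log contains records of errors",       // bit 6
    "the device self-test log contains records of errors",   // bit 7
}

// decode returns the human-readable reasons encoded in a smartctl exit code.
func decode(exitCode int) []string {
    var reasons []string
    for bit, msg := range smartctlBits {
        if exitCode&(1<<bit) != 0 {
            reasons = append(reasons, msg)
        }
    }
    return reasons
}

func main() {
    fmt.Println(decode(64)) // [the device error log contains records of errors]
}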

@chertvl

chertvl commented Nov 26, 2023

This question is still important.
Yesterday I installed a new SSD into the server, and unfortunately I did not clean the M.2 connector contacts before installation.
Overnight (consistently once an hour), SMART logged CRC errors.
Unfortunately, Scrutiny now shows this new SSD as Failed (status: 2).
I have since cleaned the contacts and reseated the SSD, and the errors have stopped appearing, but is it marked as Failed forever?


@chertvl

chertvl commented Nov 26, 2023

> This question is still important.
> Yesterday I installed a new SSD into the server,

@AnalogJ
It would be great if I could either ignore this attribute entirely or ignore its current value as a baseline, even if that means adding literally 1% more logic to Scrutiny. This is how the attribute is currently reported:

      '199':
        attribute_id: 199
        value: 100
        thresh: 0
        worst: 100
        raw_value: 18
        raw_string: '18'
        when_failed: ''
        transformed_value: 0
        status: 4
        status_reason: Observed Failure Rate for Non-Critical Attribute is greater
          than 20%
        failure_rate: 0.20878219917249793
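
For illustration only (a sketch with made-up names, not Scrutiny's evaluation code): the record above is flagged because its observed failure rate (~0.209) exceeds the 20% cutoff for a non-critical attribute. A baseline check along the lines described in this thread could gate that verdict so an unchanged raw value no longer fails the drive.

// Minimal sketch: only apply the failure-rate verdict when the raw counter
// has actually increased past an accepted baseline.
package main

import "fmt"

type Attribute struct {
    ID          int
    RawValue    int64
    FailureRate float64
}

// evaluate fails the attribute only if its observed failure rate is above 20%
// AND its raw value has grown past the accepted baseline.
func evaluate(a Attribute, acceptedRaw int64) string {
    if a.FailureRate > 0.20 && a.RawValue > acceptedRaw {
        return "failed"
    }
    return "passed"
}

func main() {
    attr := Attribute{ID: 199, RawValue: 18, FailureRate: 0.20878219917249793}
    fmt.Println(evaluate(attr, 18)) // passed: raw value unchanged since baseline
    fmt.Println(evaluate(attr, 17)) // failed: raw value grew, keep the failure
}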

@MrMeeb

MrMeeb commented Dec 28, 2023

I've just come across exactly the same issue: I set up Scrutiny for the first time and found that one of my disks has a CRC error count of 27. This disk is over 5 years old, so that could have happened at any time. As a result, the disk is considered to be failing, and I'm notified as such with no way of dismissing the alert or setting this value as the 'new normal'. CheckMK uses the concept of setting a value as the 'new normal' for these scenarios: https://forum.checkmk.com/t/udma-crc-errors-not-resetting/32068. That way the alerts don't have to be permanently muted, and no arbitrary limit has to be set either.

This same problem was also mentioned in the more recent #553

@zwimer

zwimer commented Jan 10, 2024

This would be super useful; I currently have to keep Scrutiny's metrics off because otherwise it perpetually reports the drive as failed just because of a bad seating a few years ago.
