Discussing correlation vs. causation with BlackBlaze data #275

Closed
ViRb3 opened this issue Jun 2, 2022 · 6 comments · Fixed by #352

Comments

@ViRb3

ViRb3 commented Jun 2, 2022

I just installed Scrutiny and set it up for my 6 external HDDs. Some of them are more than 5 years old, and some are brand new. I noticed that two of the old disks were marked as failed:

Disk 1:
Screen Shot 2022-06-02 at 18 04 27

Disk 2:
Screen Shot 2022-06-02 at 18 04 59

This looks pretty scary, given the high failure rate and its description:

Screen Shot 2022-06-02 at 18 05 37

However, it made me think. Does the failure rate mean that 20% of the confirmed failed disks in Backblaze's dataset had this attribute at my value or worse? And does it take into account how many of the healthy disks had this attribute at the same value? If we only look at the failed disks, we're assuming that correlation is causation, which may be wrong. Ideally, I believe we'd want to report the difference between the healthy and failed disks instead. This may be what you're already doing, but I have no clue, so please excuse any assumptions I've made here.

Thanks a lot!

EDIT: Could you please share the source of the Backblaze data?
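
To make the question above concrete, here is a minimal sketch of the difference between "share of failed disks that had this attribute value" and "failure rate among disks that have this attribute value". All counts are invented for illustration and are not Backblaze figures; the `Counts` class and the function names are hypothetical and do not come from Scrutiny.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical drive counts; these are NOT Backblaze figures."""
    failed_with_attr: int      # failed drives at or above the attribute value
    failed_without_attr: int   # failed drives below it
    healthy_with_attr: int     # healthy drives at or above the attribute value
    healthy_without_attr: int  # healthy drives below it


def share_of_failed_with_attr(c: Counts) -> float:
    """P(attribute | failed): looks only at the failed drives."""
    return c.failed_with_attr / (c.failed_with_attr + c.failed_without_attr)


def failure_rate_given_attr(c: Counts) -> float:
    """P(failed | attribute): also counts healthy drives with the attribute."""
    return c.failed_with_attr / (c.failed_with_attr + c.healthy_with_attr)


def relative_risk(c: Counts) -> float:
    """Failure rate with the attribute divided by the rate without it."""
    rate_without = c.failed_without_attr / (
        c.failed_without_attr + c.healthy_without_attr
    )
    return failure_rate_given_attr(c) / rate_without


# Invented example: 20% of failed drives show the attribute, but so do
# 20% of healthy drives, so the attribute carries no extra information.
example = Counts(failed_with_attr=20, failed_without_attr=80,
                 healthy_with_attr=1980, healthy_without_attr=7920)

print(share_of_failed_with_attr(example))  # 0.2
print(failure_rate_given_attr(example))    # 0.01
print(relative_risk(example))              # 1.0
```

With these made-up numbers, a "20% failure rate" computed from failed disks alone looks alarming, while the relative risk of 1.0 shows that disks with the attribute fail no more often than disks without it.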

@shamoon
Contributor

shamoon commented Jun 2, 2022

I totally agree with you ( #187 (comment) ); see my quote from the Backblaze data there (and there's a link there). This is correlation, not causation, and I think displaying it as "Failed" is kind of misleading. At best it's a "warning" or a "red flag" that something might happen, not that it has. That's why Backblaze doesn't use most of these data points for their decisions about replacing drives, etc.

@AnalogJ
Owner

AnalogJ commented Jun 2, 2022

Yeah, as @shamoon mentioned, there's been a lot of concern about how Backblaze data is used within Scrutiny.

I'm working on some changes to the failure detection so that it'll be configurable in the UI, and you can selectively enable/disable the Backblaze-based failures and the thresholds.

@jeroengui

Yeah, same issue here. Brand new server-grade drive, and the application shows me that the drive is in status "failed".

image

@AnalogJ
Owner

AnalogJ commented Jun 10, 2022

Just wanted to give everyone an update on the status of this issue.

There are currently two tasks I'm working on:

  • Update the details page UI to display the SMART status separately from the Scrutiny status.
  • Create a setting which allows users to enable/disable Scrutiny analysis. (released in v0.5.0)

The first task is partially complete. Here's what it currently looks like:

Screen Recording 2022-06-09 at 10 50 32 PM

The Smart status and Scrutiny status are differentiated in the expanding details panel.

This is still a prototype. I think it works well, but I'd love to hear your thoughts.

@Parlane

Parlane commented Jun 14, 2022

> Yeah, same issue here. Brand new server grade drive. And the application shows me that the drive is in status "failed"
>
> image

Is that a Seagate drive? If so, it's probably related to #255, which is fixed in master so that it no longer shows as failed.

@AnalogJ
Owner

AnalogJ commented Aug 4, 2022

Took an incredibly long time, but as of v0.5.0 this functionality is now available in Scrutiny! 🥳

Screen Shot 2022-08-04 at 8 33 58 AM

On the dashboard settings panel, you can now change the "Device Status - Thresholds" between Smart, Scrutiny and Both. By default this is set to Both.

When changed to Smart, only the output of smartctl is relevant; all other Scrutiny/Backblaze-detected failures and warnings are ignored (in notifications and the UI).
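
As a rough illustration of the behaviour described above, the sketch below models the three threshold choices as a filter over which failure sources count toward the device status. This is not Scrutiny's actual code: the `Failure` and `Threshold` types and the `is_failed` function are hypothetical names, and it assumes that Both means a failure from either source marks the device as failed.

```python
from enum import Enum, Flag


class Failure(Flag):
    """Which checks reported a failure for a device (hypothetical names)."""
    NONE = 0
    SMART = 1      # smartctl reported a failure
    SCRUTINY = 2   # Scrutiny/Backblaze-based analysis reported a failure


class Threshold(Enum):
    """The three choices described for "Device Status - Thresholds"."""
    SMART = Failure.SMART
    SCRUTINY = Failure.SCRUTINY
    BOTH = Failure.SMART | Failure.SCRUTINY


def is_failed(detected: Failure, threshold: Threshold = Threshold.BOTH) -> bool:
    """Count a failure only if the selected threshold includes its source."""
    return bool(detected & threshold.value)


# A new drive flagged only by the Backblaze-based analysis:
new_drive = Failure.SCRUTINY
print(is_failed(new_drive))                   # True  (default: Both)
print(is_failed(new_drive, Threshold.SMART))  # False (Scrutiny result ignored)
```

Under these assumptions, switching the setting to Smart would stop a brand new drive from being shown as failed when only the Backblaze-based analysis flagged it.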

The description and UI for this functionality may be enhanced in the coming releases, but it is functional and working.
I'd appreciate it if you could pull down the latest image, test it out and provide any feedback you may have!

Appreciate everyone's patience - this has been a long time coming.
