[BUG] Seagate Drive Command Timeout with Huge Raw Value #522
Changing the thresholds didn't cause any problems. More likely what happened is you had timeouts before, but they just weren't >5s or >7.5s. Then when you got those longer timeouts, the incorrectly decoded value went above the thresholds, causing the error. I'm seeing the same thing on one of my drives, so I might tackle this when I have some time.
To answer your two questions:
Thanks. For point 1, the Command Timeout was giving me an error with a raw value of ~8 before I submitted the threshold change, so the behavior must've been changed in between. Regardless, with the decoding corrected, this should go away.
Also, where are you seeing the total number of operations?
You can also use the
@goproslowyo Good tip there! I used this to customise the
Which produces JSON output as:
However Scrutiny still parses the
Yep, I tried to do exactly the same thing @itsthejb, but it seems something interprets the value incorrectly anyway.
@AnalogJ - I think this would be a pretty quick and easy fix for someone who actually knows Go... can you take a look? I'm also happy to write the code and test it if you tell me what to do.
In the meantime you can set "Device status - thresholds" to just "SMART" instead of "Scrutiny" or "both", to ignore Scrutiny's interpretation. Note that this will ignore it for all attributes...
@zuavra This is what I have done so far, but I would love to revert back to "both" when this is fixed.
I took a quick look at the code, and this should be a super easy fix. Just need to add a
scrutiny/webapp/backend/pkg/thresholds/ata_attribute_metadata.go, lines 662 to 669 in 4b1d9dc
I’d love to see it too! Since Scrutiny's output has convinced me that one of my other drives is definitely expiring, it'd be nice to see green on the Seagate, which is in fact still OK.
@itsthejb @firasdib @tonyzzz321 - I've fixed this in my fork: https://github.com/kaysond/scrutiny/tree/master Can you please help test this?
@kaysond Looking good here! Good job.
Hey everyone, thanks for collaborating and figuring this one out. I don't have any Seagate drives affected by this issue, so I was depending on the community to help figure out what's going on -- and you delivered! I'll be merging the PR momentarily.
There doesn't seem to be any transformation done between the raw value and the value marked "Scrutiny".
I also have several Seagate drives with a high value for attribute 188. In my case it was caused by a problem with the HBA card when building the server; the card was replaced and the problem was eliminated. The drives are functional, but the attribute is still high on the drives that were running and tested while the HBA problem occurred -- an example value is 17180131333. It seems you should simply monitor the rate of increase of this attribute: if it stays stagnant, the problem should be considered solved.
Re-opening this since there still seems to be a problem. One thing to note,
just to confirm, @Brandoskey @zuavra
I also have this problem. Here's the output of sudo smartctl -x /dev/sda:
@Brandoskey I've added the fix for the Exos X16 models to my PR. @queeup I'm not sure what needs to be done about the 195 value, so I won't be able to include anything for that.
@OddMagnet, thank you for your work here. You can click on top of
@OddMagnet, I opened an upstream issue: smartmontools/smartmontools#248
@OddMagnet @queeup 195 should be read just like 1 and 7 in these drives:
Mine doesn't have the 195 attribute, but I checked and it's not a problem to have the extra -v flag either. Seagate docs
@IlyaDevice I've added the fix for your drive as well.
@OddMagnet same story here: 4x WDC WUH721414ALE6L4, 2 with fw LDGAW07G, 2 with fw LDGNW240

```
root@DSM7:~# smartctl -a -d sat /dev/sdc | grep 188
root@DSM7:~# smartctl -a -d sat /dev/sdd | grep 188
```
As far as I understand, the problem for Seagate drives is that the raw value for 188 represents 3 different values. I'm unsure how the raw value for WDC drives is decoded, so I'm hesitant to simply add the same modifications to my PR for your drive. Can you check what the output of
I'm having this issue with a ST18000NM002J-2TV133. This is the result of running
Let me know if I should provide any other information.
@OddMagnet, I did it:
but with this FW it's not possible to get output without '-d sat'
Same issue with ST12000NT001-3LX101. The drive is brand new. Is being in a RAID array the issue here?
@KingsleyBawuah @Barmagler @NolandTech I've added your models to my PR as well, please test the file and provide your output there. I'm currently rather busy, so I likely won't add any more models to my PR.
Details
```
sudo smartctl -x -a /dev/sda

=== START OF INFORMATION SECTION ===
Read SMART Data failed: SAT command failed

=== START OF READ SMART DATA SECTION ===
General Purpose Log Directory Version 1
SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
SMART Error Log Version: 1
SMART Extended Self-test Log Version: 1 (1 sectors)
# 1  Extended offline    Completed without error       00%       299         -
# 2  Short offline       Completed without error       00%       136         -
# 3  Short offline       Completed without error       00%        94         -
# 4  Short offline       Completed without error       00%        28         -
# 5  Vendor (0xdf)       Completed without error       00%        26         -
# 6  Short offline       Completed without error       00%        25         -
# 7  Vendor (0xdf)       Completed without error       00%         7         -
# 8  Short offline       Completed without error       00%         5         -
# 9  Vendor (0xdf)       Completed without error       00%         3         -
#10  Short offline       Completed without error       00%         1         -
SMART Self-test log structure revision number 1
# 1  Extended offline    Completed without error       00%       299         -
# 2  Short offline       Completed without error       00%       136         -
# 3  Short offline       Completed without error       00%        94         -
# 4  Short offline       Completed without error       00%        28         -
# 5  Vendor (0xdf)       Completed without error       00%        26         -
# 6  Short offline       Completed without error       00%        25         -
# 7  Vendor (0xdf)       Completed without error       00%         7         -
# 8  Short offline       Completed without error       00%         5         -
# 9  Vendor (0xdf)       Completed without error       00%         3         -
#10  Short offline       Completed without error       00%         1         -
SMART Selective self-test log data structure revision number 1
SCT Status Version: 3
SCT Temperature History Version: 2
Index Estimated Time Temperature Celsius
SCT Error Recovery Control:
Device Statistics (GP Log 0x04)
Pending Defects log (GP Log 0x0c)
SATA Phy Event Counters (GP Log 0x11)
```
Thanks for everything so far!
@OddMagnet, I'm a little bit lost about which version I need to run: still the same result. Maybe I need to update drivedb.h?
Yes, you need to use the
For safety purposes I recommend just renaming your existing file before adding the one from my PR.
Very strange situation:
DSM:

```
smartctl comes with ABSOLUTELY NO WARRANTY. This is free
smartmontools release 6.5 dated 2015-06-04 at 16:29:41 UTC
```

Container:

```
smartctl comes with ABSOLUTELY NO WARRANTY. This is free
smartmontools release 7.3 dated 2022-02-28 at 16:33:40 UTC
```

But it's a really strange situation: Scrutiny uses smartctl inside the container, so why is it still giving the error?
Scrutiny still remembers the old values. You could either edit them manually in Scrutiny's InfluxDB, or simply remove the drives in Scrutiny's WebUI and restart the container.
Just chiming in to say that I also had this issue, and following @unai-ndz's post fixed it for me, with one small caveat. For whatever reason my installation of Ubuntu 22.04 had the default location for the SMART db as follows:
but there was no file there. Instead the db was getting downloaded to /var/lib/smartmontools/drivedb/drivedb.h
I made the edits in that file and then copied it to the /etc/smart_drivedb.h location and it worked. Just letting folks know in case anyone else hits that issue.
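For anyone writing their own override file: an additional drivedb entry is a C-style record of five strings (model family, model regexp, firmware regexp, warning message, attribute presets). The sketch below is an illustration only — the family name and model regexp are examples, not taken from this thread; check your drive's reported model with smartctl -i before adapting it.

```c
/* Sketch of an /etc/smart_drivedb.h entry (example values only). */
{ "Seagate Exos X16 (example family)",  // free-text family label
  "ST16000NM001G-.*",                   // model regexp -- adjust for your drive
  "",                                   // firmware regexp (empty matches any)
  "",                                   // warning message (none)
  "-v 188,raw16"                        // preset: decode 188 as 16-bit counters
},
```

When this file is present, smartctl applies the presets automatically, so no extra command-line flags are needed.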
I just added 9x refurbished ST16000NM000H-3KW103 to Scrutiny v0.8.1, and 3 of them have this issue. One of them has a value as high as 4295033551. I think Scrutiny should mention that this metric is most commonly caused by a failing backplane or SATA cable, not the actual HDD failing. So the real alert should be if it is increasing over some period, not a threshold. None of mine have increased since I added them, even after a lot of stress testing and ZFS scrubs, so I would consider this marking of the HDD as failed a false positive. Now I can't really rely on Scrutiny to alert me if the HDD is actually failing, since it is permanently stuck as failed; I'd need to manually go in and check and compare the status.
Yes, I have the same issue with drive model ST5000DM000, with a 'Command Timeout' raw value of 8950848424068.
Follow the steps in my comment above. You may need to adjust it for your specific HDD model and error code. Ultimately this is an issue with smartctl, not Scrutiny. I agree that ideally it should warn only when values increase, but it's probable that doing the steps above will reduce the reading to zero, or a low value.
To expand on this: smartctl first looks for /etc/smart_drivedb.h before it reads /var/lib/smartmontools/drivedb/drivedb.h. The stock drivedb.h does not contain the specifics for my 8TB and 18TB Seagate drives, so I had to manually create /etc/smart_drivedb.h to handle them. This means I don't need commands/metrics_smart_args inside collector.yaml anymore, as smartctl on both my Docker instance and my actual host now gets the correct -v args from smartctl itself. My /etc/smart_drivedb.h file is this:
My docker compose file now links /etc/smart_drivedb.h inside the docker environment:
smartctl -a /dev/devk is now:
This problem is still happening on a Seagate Exos X14 (ST12000NM0538-2K2101) with CN02 firmware.
Adding
If it still shows as failed for anyone after adding
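As an alternative to editing the drive database, the decoding override discussed in this thread can also be passed through the collector's config. A minimal sketch, assuming the commands/metrics_smart_args key mentioned earlier in the thread (the base --xall argument is an assumption about the default and may differ by Scrutiny version):

```yaml
# collector.yaml sketch: append a -v override to the smartctl invocation
# so attribute 188 is decoded as three 16-bit counters.
commands:
  metrics_smart_args: "--xall -v 188,raw16"
```

Note that, as reported above, this changes what smartctl emits but older readings already stored by Scrutiny may still need to be cleared.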
Describe the bug
I have a couple of Seagate drives showing a huge raw value for 188 Command Timeout, and they are marked as failed in Scrutiny. Please see the screenshot below.
Seagate drives use this field's raw value to represent a combination of three integers (total command timeouts, commands completed between 5s and 7.5s, commands completed in >7.5s). Therefore, the raw value needs to be decoded before being used to determine the drive's failure status.
In my case, the raw value of "4295032833" represents 1 command timeout, 1 command >5s and <7.5s, and 1 command >7.5s. This does not cross the threshold to be considered a failure.
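The decoding described above can be sketched in Go (a minimal illustration, not Scrutiny's actual code: the helper name decode188 is mine, and the field order follows the Seagate spec referenced in this issue):

```go
package main

import "fmt"

// decode188 splits the 48-bit raw value of SMART attribute 188 into three
// 16-bit counters: low word = total command timeouts, next word = commands
// completed in 5-7.5s, high word = commands that took longer than 7.5s.
func decode188(raw uint64) [3]uint64 {
	return [3]uint64{
		raw & 0xFFFF,         // total command timeouts
		(raw >> 16) & 0xFFFF, // commands completed in 5-7.5s
		(raw >> 32) & 0xFFFF, // commands completed in >7.5s
	}
}

func main() {
	// Example raw value from this issue: 4295032833 == 0x1_0001_0001,
	// i.e. one event in each counter.
	fmt.Println(decode188(4295032833)) // [1 1 1]
}
```

Only the low counter (actual timeouts) is what a failure threshold should reasonably look at; the two duration buckets inflate the undecoded raw number past any sane threshold.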
Please see the related answer at https://superuser.com/a/1747851 and Seagate's SMART Attribute Spec documentation.
Expected behavior
The raw value should be decoded before being used to determine the drive's failure status.
Screenshots