[BUG] Seagate Drive Command Timeout with Huge Raw Value #522

Open
tonyzzz321 opened this issue Oct 2, 2023 · 81 comments · Fixed by #527
Labels
bug Something isn't working

Comments

@tonyzzz321

tonyzzz321 commented Oct 2, 2023

Describe the bug
I have a couple of Seagate drives showing a huge raw value for attribute 188 (Command Timeout), and they are marked as failed in Scrutiny. Please see the screenshots below.

Seagate drives use this field's raw value to represent a combination of 3 integers (total command timeouts, commands completed between 5s and 7.5s, commands completed >7.5s). Therefore, the raw value needs to be decoded before it is used to determine whether the drive has failed.

In my case, the raw value of "4295032833" represents 1 command timeout, 1 command >5s and <7.5s, and 1 command >7.5s. This does not cross the threshold to be considered as failure.
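
The decoding described above can be sketched in a few lines of Python. This is only an illustration, not Scrutiny's code: the function name `decode_command_timeout` is made up, and the bit layout (lowest 16 bits = total timeouts) is assumed from the description in this issue.

```python
def decode_command_timeout(raw: int) -> tuple:
    """Split a 48-bit attribute 188 raw value into three 16-bit counters.

    Assumed layout, taken from the description in this issue:
      bits  0-15: total command timeouts
      bits 16-31: commands completed between 5s and 7.5s
      bits 32-47: commands completed in more than 7.5s
    """
    total_timeouts = raw & 0xFFFF
    between_5s_and_7_5s = (raw >> 16) & 0xFFFF
    over_7_5s = (raw >> 32) & 0xFFFF
    return total_timeouts, between_5s_and_7_5s, over_7_5s


print(decode_command_timeout(4295032833))  # (1, 1, 1)
```

Under that assumption, 4295032833 is 0x000100010001, which is why it decodes to three counters of 1 each.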

Please see related answer at https://superuser.com/a/1747851 and Seagate SMART Attribute Spec documentation.

Expected behavior
The raw value should be decoded before it is used to determine whether the drive has failed.

Screenshots
image
image

@tonyzzz321 tonyzzz321 added the bug Something isn't working label Oct 2, 2023
@firasdib

firasdib commented Oct 4, 2023

This commit looks to try to fix it, but perhaps it's what broke it? I'm getting the same error on my end now.

I also read here that this project is more or less unmaintained, so if you want it fixed, you might have to submit your own PR or fork it.

@kaysond
Contributor

kaysond commented Oct 8, 2023

> This commit looks to try to fix it, but perhaps it's what broke it? I'm getting the same error on my end now.
>
> I also read here that this project is more or less unmaintained, so if you want it fixed, you might have to submit your own PR or fork it.

Changing the thresholds didn't cause any problems. More likely what happened is you had timeouts before, but they just weren't >5 or >7.5s. Then when you got those longer timeouts, the incorrectly decoded value went above the thresholds, causing the error. I'm seeing the same thing on one of my drives so I might tackle this when I have some time.

@kaysond
Contributor

kaysond commented Oct 8, 2023

I'm also seeing some sector errors on a drive with <1yr runtime. I'm wondering if there's some decoding error on those too? or maybe I'm just on the front end of the bathtub curve...

image

@firasdib

firasdib commented Oct 9, 2023

To answer your two questions:

  1. No, this was not the behavior before. The Command Timeouts were warnings, not errors. They only recently started turning into errors, and marking the drive as failed. Your drive, in the screenshot, has reported 1 command timeout in 65 616 operations.
  2. Your drive is failing, and you should replace it. Those values do not need additional parsing to be accurate, that's only for 1, 7, 188, 195.

@kaysond
Contributor

kaysond commented Oct 9, 2023

> To answer your two questions:
>
> 1. No, this was not the behavior before. The Command Timeouts were warnings, not errors. They only recently started turning into errors, and marking the drive as failed. Your drive, in the screenshot, has reported 1 command timeout in 65 616 operations.
>
> 2. Your drive is failing, and you should replace it. Those values do not need additional parsing to be accurate, that's only for 1, 7, 188, 195.

Thanks. For point 1, the Command Timeout was giving me an error with a raw value of ~8 before I submitted the threshold change. So the behavior must've been changed in between. Regardless, with the decoding corrected, this should go away.

@kaysond
Contributor

kaysond commented Oct 9, 2023

Also - where are you seeing the total number of operations?

@firasdib

I used this: https://www.disktuna.com/big-scary-raw-s-m-a-r-t-values-arent-always-bad-news/#21475164165

@goproslowyo

You can also use the -v flag to tell smartctl to parse the value as three raw 16-bit values to get an accurate result:

sudo smartctl -xv 188,raw16 /path/to/disk
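
To illustrate what the raw16 format does to the number, here is a small Python sketch. It assumes that smartctl prints the three 16-bit words with the most significant word first; that assumption matches the raw value and string pairs reported in this thread (8590065666 as "2 2 2" and 196608 as "0 3 0"). The helper name `format_raw16` is hypothetical.

```python
def format_raw16(raw: int) -> str:
    """Render a 48-bit raw value as three 16-bit words, most significant first.

    This mimics how `-v ID,raw16` appears to print the value; it is a sketch,
    not smartctl's implementation.
    """
    words = [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]
    return " ".join(str(word) for word in words)


print(format_raw16(8590065666))  # 2 2 2
print(format_raw16(196608))      # 0 3 0
```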

@itsthejb

@goproslowyo Good tip there! I used this to customise the metrics_smart_args command for my Seagate drive:

 - device: /dev/sde
   type: 'sat'
   commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'

Which produces JSON output as:

        "raw": {
          "value": 8590065666,
          "string": "2 2 2"
        }

However, Scrutiny still seems to parse the numeric raw value rather than the decoded string.
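
For anyone who wants to check the collector output by hand, here is a rough Python sketch that looks up attribute 188 in `smartctl --json` output. The `SAMPLE` document is a trimmed, hand-written stand-in: only the `raw` object is copied from the output above, and the helper name `find_attribute_raw` is made up.

```python
import json

# Trimmed, hand-written sample in the shape of `smartctl --json` output.
# Only the "raw" object is taken from the comment above.
SAMPLE = """
{
  "ata_smart_attributes": {
    "table": [
      {"id": 188, "name": "Command_Timeout",
       "raw": {"value": 8590065666, "string": "2 2 2"}}
    ]
  }
}
"""


def find_attribute_raw(report: dict, attr_id: int):
    """Return the "raw" object for the given attribute ID, or None if absent."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for entry in table:
        if entry.get("id") == attr_id:
            return entry.get("raw")
    return None


raw = find_attribute_raw(json.loads(SAMPLE), 188)
print(raw["value"], "->", raw["string"])  # 8590065666 -> 2 2 2
```

The numeric `value` is unchanged by `-v 188,raw16`; only `string` carries the decoded form, which would explain why a consumer that reads `value` still sees the huge number.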

@goproslowyo

2023-10-12_021052

Yep, I tried exactly the same thing, @itsthejb, but it seems something still interprets the value incorrectly.

@kaysond
Contributor

kaysond commented Oct 12, 2023

@AnalogJ - I think this would be a pretty quick and easy fix for someone who actually knows Go... can you take a look? I'm also happy to write the code and test it if you tell me what to do.

@zuavra

zuavra commented Oct 13, 2023

In the meantime you can set "Device status - thresholds" to just "SMART" instead of "Scrutiny" or "both", to ignore Scrutiny's interpretation. Note that this will ignore it for all attributes...

@firasdib

@zuavra This is what I have done so far, but I would love to revert back to "both" when this is fixed.

@kaysond
Contributor

kaysond commented Oct 14, 2023

I took a quick look at the code, and this should be a super easy fix. We just need to add a Transform() function to the ATA attribute definition below that looks at the string value. If it has 3 parts, you just grab the last one. smartctl itself already sets -v 188,raw16 for many Seagate drives (see the drivedb.h link below).

188: {
	ID:          188,
	DisplayName: "Command Timeout",
	DisplayType: AtaSmartAttributeDisplayTypeRaw,
	Ideal:       ObservedThresholdIdealLow,
	Critical:    true,
	Description: "The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero.",
	ObservedThresholds: []ObservedThreshold{

https://github.com/smartmontools/smartmontools/blob/6b9ed03b9e7c448e41755d484acaabe5db685254/smartmontools/drivedb.h#L4261
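
The logic of the proposed transform can be sketched in Python as follows. This is not the Go code from Scrutiny or from any PR; the function name and signature are hypothetical and only mirror the idea described above: if the string has three parts, take the last one, otherwise leave the raw value alone.

```python
def transform_command_timeout(raw_value: int, raw_string: str) -> int:
    """Hypothetical stand-in for the proposed Transform().

    If the raw string has three space-separated numeric parts (raw16 format),
    return the last part; otherwise return the raw value unchanged.
    """
    parts = raw_string.split()
    if len(parts) == 3 and all(part.isdecimal() for part in parts):
        return int(parts[2])
    return raw_value


print(transform_command_timeout(8590065666, "2 2 2"))  # 2
print(transform_command_timeout(8, "8"))               # 8
```

Falling back to the untouched raw value keeps drives that do not report the three-part string behaving as before.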

@itsthejb

I’d love to see it too! Since Scrutiny's output has convinced me that one of my other drives is definitely expiring, it would be nice to see green on the Seagate, which is in fact still OK.

@kaysond
Contributor

kaysond commented Oct 15, 2023

@itsthejb @firasdib @tonyzzz321 - I've fixed this in my fork: https://github.com/kaysond/scrutiny/tree/master

Can you please help test this?

@itsthejb

@kaysond Looking good here! Good job

Screenshot 2023-10-15 at 17 03 46

@AnalogJ
Owner

AnalogJ commented Oct 17, 2023

Hey everyone, thanks for collaborating and figuring this one out. I don't have any Seagate drives affected by this issue, so I was depending on the community to help figure out what's going on -- and you delivered!

I'll be merging the PR momentarily

@Brandoskey

I've updated Scrutiny web as well as the manually installed collector I have on TrueNAS, which has some drives affected by this issue. Scrutiny is still reporting the raw value and showing failed for 188. The raw value on two of my drives is 8590065666.
image

Or are these drives still failing?

@zuavra

zuavra commented Oct 17, 2023

Same here, still getting the error for docker image omnibus
sha256:d45a226d02eb38f82574a552299eb3440c3f398674e92d596e0051e85b2bab48

Screenshot_2023-10-17_17-43-17 png

@zuavra

zuavra commented Oct 17, 2023

> Or are these drives still failing?

There doesn't seem to be any transformation done between the raw value and the value marked "Scrutiny".

@zuavra

zuavra commented Oct 17, 2023

With the latest version of image beta:omnibus, the attribute shows as a warning rather than an error, but there still doesn't seem to be any transformation from the raw value, and it still causes the overall drive status to be "failed".

Screenshot_2023-10-17_18-05-06

@SaraDark

I also have several Seagate drives with a high value for attribute 188. It was caused by a problem with the HBA card when the server was being built. The card was replaced and the problem went away. The drives are functional, but the value is still high on the drives that were running and being tested while the HBA problem occurred. Example value: 17180131333.

It seems that you should simply monitor the rate of increase of this parameter. If the value stays the same, the problem should be considered solved.
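
The monitoring idea above could look roughly like this in Python: decode two readings taken at different times and check whether any counter went up. The function names are hypothetical, the 16-bit layout is the one described earlier in this issue, and counter wrap-around is ignored.

```python
def counters(raw: int) -> tuple:
    """Decode a raw value into (total timeouts, 5s-7.5s count, >7.5s count)."""
    return raw & 0xFFFF, (raw >> 16) & 0xFFFF, (raw >> 32) & 0xFFFF


def has_grown(previous_raw: int, current_raw: int) -> bool:
    """Return True if any of the three counters increased between two readings."""
    pairs = zip(counters(previous_raw), counters(current_raw))
    return any(current > previous for previous, current in pairs)


print(counters(17180131333))                # (5, 4, 4)
print(has_grown(17180131333, 17180131333))  # False
```

With the example value above, an unchanged reading reports no growth, which matches the "stagnant" case.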

@AnalogJ
Owner

AnalogJ commented Oct 17, 2023

re-opening this since there still seems to be a problem.

One thing to note, beta-omnibus is 12 commits behind main. The "fix" for this issue should be in main already, I'll be updating beta momentarily to alleviate any confusion.

@AnalogJ AnalogJ reopened this Oct 17, 2023
@AnalogJ
Owner

AnalogJ commented Oct 17, 2023

just to confirm, @Brandoskey @zuavra
are you running the scrutiny collector with a config file containing:

 - device: /dev/sd[X]
   type: 'sat'
   commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'

@queeup

queeup commented Mar 2, 2024

I also have this problem with a Seagate ST12000NM0127. The raw value of 195 Hardware ECC Recovered is a problem as well.

sudo smartctl -x /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.79] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST12000NM0127
Serial Number:    ZJV3RH3M
LU WWN Device Id: 5 000c50 0b4af3c80
Firmware Version: G005
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5387
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar  3 00:10:02 2024 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1], Master PW ID: 0xfffd
Wt Cache Reorder: Unknown

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  584) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (1086) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x50bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   082   064   044    -    157163313
  3 Spin_Up_Time            PO----   096   090   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    64
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   068   060   045    -    6412647
  9 Power_On_Hours          -O--CK   099   099   000    -    1277
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    7
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   044   036   040    Past 56 (Min/Max 30/61 #79)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    39
193 Load_Cycle_Count        -O--CK   100   100   000    -    101
194 Temperature_Celsius     -O---K   056   064   000    -    56 (0 16 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   013   007   000    -    157163313
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
200 Multi_Zone_Error_Rate   PO---K   100   100   001    -    0
240 Head_Flying_Hours       ------   100   253   000    -    314 (12 38 0)
241 Total_LBAs_Written      ------   100   253   000    -    10149912032
242 Total_LBAs_Read         ------   100   253   000    -    760040584
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O    768  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      24  Device vendor specific log
0xa2       GPL     VS   16320  Device vendor specific log
0xa4       GPL,SL  VS     160  Device vendor specific log
0xa6       GPL     VS     192  Device vendor specific log
0xa8-0xa9  GPL,SL  VS     136  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xad       GPL     VS      16  Device vendor specific log
0xaf       GPL,SL  VS       1  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc1       GPL,SL  VS       8  Device vendor specific log
0xc3       GPL,SL  VS      32  Device vendor specific log
0xc9       GPL,SL  VS       8  Device vendor specific log
0xca       GPL,SL  VS      16  Device vendor specific log
0xd1       GPL     VS     336  Device vendor specific log
0xd2       GPL     VS   10000  Device vendor specific log
0xd4       GPL     VS    2048  Device vendor specific log
0xda       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         1         -
# 2  Short offline       Completed without error       00%         0         -
# 3  Short offline       Completed without error       00%         0         -
# 4  Short offline       Completed without error       00%         0         -
# 5  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    56 Celsius
Power Cycle Min/Max Temperature:     25/61 Celsius
Lifetime    Min/Max Temperature:     15/65 Celsius
Under/Over Temperature Limit Count:   0/1091
Vendor specific:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 36 02 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         3 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:     10/40 Celsius
Min/Max Temperature Limit:            5/60 Celsius
Temperature History Size (Index):    128 (6)

Index    Estimated Time   Temperature Celsius
   7    2024-02-26 19:02    55  ************************************
   8    2024-02-26 20:01    55  ************************************
   9    2024-02-26 21:00    56  *************************************
 ...    ..(  6 skipped).    ..  *************************************
  16    2024-02-27 03:53    56  *************************************
  17    2024-02-27 04:52    55  ************************************
  18    2024-02-27 05:51    54  ***********************************
  19    2024-02-27 06:50    56  *************************************
  20    2024-02-27 07:49    56  *************************************
  21    2024-02-27 08:48    55  ************************************
  22    2024-02-27 09:47    55  ************************************
  23    2024-02-27 10:46    56  *************************************
  24    2024-02-27 11:45    56  *************************************
  25    2024-02-27 12:44    55  ************************************
  26    2024-02-27 13:43    55  ************************************
  27    2024-02-27 14:42    54  ***********************************
 ...    ..(  3 skipped).    ..  ***********************************
  31    2024-02-27 18:38    54  ***********************************
  32    2024-02-27 19:37    55  ************************************
  33    2024-02-27 20:36    55  ************************************
  34    2024-02-27 21:35    55  ************************************
  35    2024-02-27 22:34    56  *************************************
 ...    ..(  9 skipped).    ..  *************************************
  45    2024-02-28 08:24    56  *************************************
  46    2024-02-28 09:23    57  **************************************
  47    2024-02-28 10:22    57  **************************************
  48    2024-02-28 11:21    57  **************************************
  49    2024-02-28 12:20    56  *************************************
  50    2024-02-28 13:19    55  ************************************
 ...    ..(  2 skipped).    ..  ************************************
  53    2024-02-28 16:16    55  ************************************
  54    2024-02-28 17:15    56  *************************************
 ...    ..( 10 skipped).    ..  *************************************
  65    2024-02-29 04:04    56  *************************************
  66    2024-02-29 05:03    53  **********************************
  67    2024-02-29 06:02    55  ************************************
  68    2024-02-29 07:01    56  *************************************
 ...    ..( 21 skipped).    ..  *************************************
  90    2024-03-01 04:39    56  *************************************
  91    2024-03-01 05:38    53  **********************************
  92    2024-03-01 06:37    51  ********************************
  93    2024-03-01 07:36    55  ************************************
  94    2024-03-01 08:35    56  *************************************
 ...    ..(  2 skipped).    ..  *************************************
  97    2024-03-01 11:32    56  *************************************
  98    2024-03-01 12:31    55  ************************************
 ...    ..(  3 skipped).    ..  ************************************
 102    2024-03-01 16:27    55  ************************************
 103    2024-03-01 17:26    56  *************************************
 ...    ..(  5 skipped).    ..  *************************************
 109    2024-03-01 23:20    56  *************************************
 110    2024-03-02 00:19    55  ************************************
 ...    ..(  2 skipped).    ..  ************************************
 113    2024-03-02 03:16    55  ************************************
 114    2024-03-02 04:15    56  *************************************
 ...    ..(  6 skipped).    ..  *************************************
 121    2024-03-02 11:08    56  *************************************
 122    2024-03-02 12:07    55  ************************************
 ...    ..(  2 skipped).    ..  ************************************
 125    2024-03-02 15:04    55  ************************************
 126    2024-03-02 16:03    56  *************************************
 127    2024-03-02 17:02    56  *************************************
   0    2024-03-02 18:01    56  *************************************
   1    2024-03-02 19:00    50  *******************************
   2    2024-03-02 19:59    49  ******************************
   3    2024-03-02 20:58    48  *****************************
   4    2024-03-02 21:57    51  ********************************
   5    2024-03-02 22:56    55  ************************************
   6    2024-03-02 23:55    56  *************************************

SMART WRITE LOG does not return COUNT and LBA_LOW register
SCT (Get) Error Recovery Control command failed

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4               7  ---  Lifetime Power-On Resets
0x01  0x010  4            1277  ---  Power-on Hours
0x01  0x018  6     10149912032  ---  Logical Sectors Written
0x01  0x020  6         5714141  ---  Number of Write Commands
0x01  0x028  6       760011728  ---  Logical Sectors Read
0x01  0x030  6         7925241  ---  Number of Read Commands
0x01  0x038  6               -  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4      2337672378  N--  Spindle Motor Power-on Hours
0x03  0x010  4      2337672129  N--  Head Flying Hours
0x03  0x018  4             101  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               0  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4              39  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              56  ---  Current Temperature
0x05  0x010  1              54  ---  Average Short Term Temperature
0x05  0x018  1              52  ---  Average Long Term Temperature
0x05  0x020  1              65  ---  Highest Temperature
0x05  0x028  1              26  ---  Lowest Temperature
0x05  0x030  1              55  ---  Highest Average Short Term Temperature
0x05  0x038  1              45  ---  Lowest Average Short Term Temperature
0x05  0x040  1              52  ---  Highest Average Long Term Temperature
0x05  0x048  1              50  ---  Lowest Average Long Term Temperature
0x05  0x050  4             854  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               5  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4              65  ---  Number of Hardware Resets
0x06  0x010  4              63  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  ---  Vendor Specific
0xff  0x018  7               0  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           20  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Seagate FARM log (GP Log 0xa6) supported [try: -l farm]

@OddMagnet

@Brandoskey I've added the fix for the Exos X16 models to my PR

@queeup I'm not sure what needs to be done about the 195 value, so I won't be able to include anything for that.
I can add the fix for 188, but I'm not sure where to add it (or if your device should have a new entry).
Can you tell me what the output of smartctl -i /dev/sda is for your drive?

@queeup

queeup commented Mar 2, 2024

@OddMagnet, thank you for your work here. You can click on "sudo smartctl -x /dev/sda" in my last post to expand the drive info; I had collapsed it. Anyway, let me share it here too.

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.79] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST12000NM0127
Serial Number:    ZJV3RH3M
LU WWN Device Id: 5 000c50 0b4af3c80
Firmware Version: G005
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5387
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Mar  3 01:34:41 2024 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

@queeup

queeup commented Mar 2, 2024

@OddMagnet, I opened an upstream issue: smartmontools/smartmontools#248

@unai-ndz

unai-ndz commented Mar 3, 2024

@OddMagnet @queeup 195 should be read just like 1 and 7 in these drives:

-v 1,raw24/raw32,Raw_Read_Error_Rate
-v 7,raw24/raw32,Seek_Error_Rate
-v 195,raw24/raw32,Hardware_ECC_Recovered

Mine doesn't have the 195 attribute but I checked and it's not a problem to have the extra -v flag either.
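
As a rough illustration of what raw24/raw32 does, here is a Python sketch that splits a raw value into an upper part and the lower 32 bits. It assumes smartctl prints this format as "upper/lower" with the split at bit 32; the helper name is hypothetical, and the reading of the two halves (error count over operation count) is my understanding of the Seagate layout, not something confirmed in this thread.

```python
def split_raw24_raw32(raw: int) -> tuple:
    """Split a raw value into (upper part, lower 32 bits).

    Assumed to correspond to the "upper/lower" string that smartctl prints
    for the raw24/raw32 format.
    """
    return raw >> 32, raw & 0xFFFFFFFF


# Raw value reported for attributes 1 and 195 in the smartctl output above.
upper, lower = split_raw24_raw32(157163313)
print(f"{upper}/{lower}")  # 0/157163313
```

Under that assumption the large number is entirely in the lower half, so the upper (error) part would be 0.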

Seagate docs

@OddMagnet

@queeup I've added your model to the file as well. I've marked it as untested in a comment, though, since I can't verify it.

@unai-ndz I've gone ahead and added 1 and 7 as well for the Exos X16 and X20.

@IlyaDevice

Hello everyone!

I have the same issue with ST4000NM000A-2HZ100
188 field value:
image

smartctl -a /dev/sda | grep 188 returns:

188 Connamd_Timeouts                                                 0x0032   100   015   000    Old_age   Always       -       26673560

smartctl -x /dev/sda | grep 188 returns:

188 Connamd_Timeouts                                                 -O--CK   100   015   000    -    26673560

@OddMagnet

@IlyaDevice I've added the fix for your drive as well.
I've mentioned you in that PR as well, please test the fix and post your output there as well

@Barmagler

Barmagler commented Mar 8, 2024

@OddMagnet same story here with 4x WDC WUH721414ALE6L4: 2 with firmware LDGAW07G and 2 with firmware LDGNW240.
LDGNW240 has no attribute 188, but LDGAW07G shows a huge timeout value.

root@DSM7:~# smartctl -a -d sat /dev/sdc | grep 188
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 196608

root@DSM7:~# smartctl -a -d sat /dev/sdd | grep 188
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 327680

Screenshot_105
Screenshot_106

@OddMagnet

As far as I understand, the problem for Seagate drives is that the raw value for 188 represents 3 different values.

I'm unsure how the raw value for WDC drives is decoded, so I'm hesitant to simply add the same modifications to my PR for your drive.

Can you check what the output of smartctl -xv 188,raw16 /dev/sdc | grep 188 is for your drives?

@KingsleyBawuah

KingsleyBawuah commented Mar 11, 2024

I'm having this issue with a ST18000NM002J-2TV133. This is the result of running sudo smartctl -a /dev/sda | grep 188

188 Command_Timeout         0x0032   100   095   000    Old_age   Always       -       21475164165

Let me know if I should provide any other information.

@Barmagler

@OddMagnet, I did it:

root@DSM7:~# smartctl -d sat -xv 188,raw16 /dev/sdc | grep 188
188 Command_Timeout                                                  -O--CK   100   100   000    -    0 3 0

but with this firmware it is not possible to get the attribute output without '-d sat':

root@DSM7:~# smartctl -xv 188,raw16 /dev/sdc
smartctl 6.5 (build date Sep 26 2022) [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WUH721414ALE6L4
Revision:             W07G
User Capacity:        14,000,519,643,136 bytes [14.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca264ef7c05
Serial number:        9RKBG76L
Device type:          disk
Local Time is:        Tue Mar 12 20:38:58 2024 +03
SMART support is:     Unavailable - device lacks SMART capability.
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
Device does not support Background scan results logging

@NolandTech

Same issue with ST12000NT001-3LX101. The drive is brand new. Could being in a RAID array be the cause here?
smartctl -d sat -xv 188,raw16 /dev/sda | grep 188
188 Command_Timeout -O--CK 100 100 000 - 1 1 1

@OddMagnet

@KingsleyBawuah @Barmagler @NolandTech I've added your models to my PR as well, please test the file and provide your output there.

I'm currently rather busy, so I likely won't add any more models to my PR.

@KingsleyBawuah

Details

sudo smartctl -x -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-100-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate Exos X18
Device Model: ST18000NM002J-2TV133
Serial Number: ZR5A38GZ
LU WWN Device Id: 5 000c50 0e4b0ae84
Add. Product Id: DELL(tm)
Firmware Version: PAL7
User Capacity: 18,000,207,937,536 bytes [18.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Mar 18 23:53:56 2024 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Disabled
ATA Security is: Disabled, NOT FROZEN [SEC1]
Write SCT (Get) Feature Control Command failed: scsi error aborted command
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

Read SMART Data failed: SAT command failed

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 5 Ext. Comprehensive SMART error log
0x04 GPL R/O 256 Device Statistics log
0x04 SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x08 GPL R/O 2 Power Conditions log
0x09 SL R/W 1 Selective self-test log
0x0a GPL R/W 8 Device Statistics Notification
0x0c GPL R/O 2048 Pending Defects log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x13 GPL R/O 1 SATA NCQ Send and Receive log
0x21 GPL R/O 1 Write stream error log
0x22 GPL R/O 1 Read stream error log
0x24 GPL R/O 768 Current Device Internal Status Data log
0x2f GPL - 1 Set Sector Configuration
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa1 GPL,SL VS 160 Device vendor specific log
0xa2 GPL VS 16320 Device vendor specific log
0xa4 GPL,SL VS 160 Device vendor specific log
0xa6 GPL VS 192 Device vendor specific log
0xa8-0xa9 GPL,SL VS 136 Device vendor specific log
0xab GPL VS 1 Device vendor specific log
0xad GPL VS 16 Device vendor specific log
0xb1 GPL,SL VS 160 Device vendor specific log
0xb6 GPL VS 1920 Device vendor specific log
0xbe-0xbf GPL VS 65535 Device vendor specific log
0xc1 GPL,SL VS 8 Device vendor specific log
0xc3 GPL,SL VS 24 Device vendor specific log
0xc6 GPL VS 5184 Device vendor specific log
0xc7 GPL,SL VS 8 Device vendor specific log
0xc9 GPL,SL VS 8 Device vendor specific log
0xca GPL,SL VS 16 Device vendor specific log
0xcd GPL,SL VS 1 Device vendor specific log
0xce GPL VS 1 Device vendor specific log
0xcf GPL VS 512 Device vendor specific log
0xd1 GPL VS 656 Device vendor specific log
0xd2 GPL VS 10256 Device vendor specific log
0xd4 GPL VS 2048 Device vendor specific log
0xda GPL,SL VS 1 Device vendor specific log
0xdf GPL,SL VS 1 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
No Errors Logged

SMART Error Log Version: 1
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Completed without error 00% 299 -

2 Short offline Completed without error 00% 136 -

3 Short offline Completed without error 00% 94 -

4 Short offline Completed without error 00% 28 -

5 Vendor (0xdf) Completed without error 00% 26 -

6 Short offline Completed without error 00% 25 -

7 Vendor (0xdf) Completed without error 00% 7 -

8 Short offline Completed without error 00% 5 -

9 Vendor (0xdf) Completed without error 00% 3 -

#10 Short offline Completed without error 00% 1 -

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

1 Extended offline Completed without error 00% 299 -

2 Short offline Completed without error 00% 136 -

3 Short offline Completed without error 00% 94 -

4 Short offline Completed without error 00% 28 -

5 Vendor (0xdf) Completed without error 00% 26 -

6 Short offline Completed without error 00% 25 -

7 Vendor (0xdf) Completed without error 00% 7 -

8 Short offline Completed without error 00% 5 -

9 Vendor (0xdf) Completed without error 00% 3 -

#10 Short offline Completed without error 00% 1 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
Device State: Active (0)
Current Temperature: 34 Celsius
Power Cycle Min/Max Temperature: 26/40 Celsius
Lifetime Min/Max Temperature: 25/55 Celsius
Under/Over Temperature Limit Count: 0/0
SMART Status: 0xc24f (PASSED)
Vendor specific:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version: 2
Temperature Sampling Period: 4 minutes
Temperature Logging Interval: 59 minutes
Min/Max recommended Temperature: 10/40 Celsius
Min/Max Temperature Limit: 5/60 Celsius
Temperature History Size (Index): 128 (64)

Index Estimated Time Temperature Celsius
65 2024-03-13 18:24 39 ********************
66 2024-03-13 19:23 37 ******************
67 2024-03-13 20:22 34 ***************
68 2024-03-13 21:21 37 ******************
69 2024-03-13 22:20 36 *****************
70 2024-03-13 23:19 35 ****************
71 2024-03-14 00:18 35 ****************
72 2024-03-14 01:17 34 ***************
73 2024-03-14 02:16 34 ***************
74 2024-03-14 03:15 33 **************
75 2024-03-14 04:14 33 **************
76 2024-03-14 05:13 34 ***************
77 2024-03-14 06:12 39 ********************
78 2024-03-14 07:11 38 *******************
79 2024-03-14 08:10 35 ****************
80 2024-03-14 09:09 31 ************
81 2024-03-14 10:08 32 *************
82 2024-03-14 11:07 33 **************
83 2024-03-14 12:06 32 *************
84 2024-03-14 13:05 31 ************
85 2024-03-14 14:04 32 *************
86 2024-03-14 15:03 32 *************
87 2024-03-14 16:02 32 *************
88 2024-03-14 17:01 31 ************
89 2024-03-14 18:00 33 **************
90 2024-03-14 18:59 32 *************
91 2024-03-14 19:58 34 ***************
92 2024-03-14 20:57 34 ***************
93 2024-03-14 21:56 34 ***************
94 2024-03-14 22:55 32 *************
95 2024-03-14 23:54 31 ************
96 2024-03-15 00:53 30 ***********
97 2024-03-15 01:52 32 *************
98 2024-03-15 02:51 33 **************
99 2024-03-15 03:50 30 ***********
100 2024-03-15 04:49 34 ***************
101 2024-03-15 05:48 31 ************
102 2024-03-15 06:47 30 ***********
103 2024-03-15 07:46 29 **********
104 2024-03-15 08:45 28 *********
105 2024-03-15 09:44 29 **********
... ..( 6 skipped). .. **********
112 2024-03-15 16:37 29 **********
113 2024-03-15 17:36 30 ***********
114 2024-03-15 18:35 30 ***********
115 2024-03-15 19:34 31 ************
116 2024-03-15 20:33 30 ***********
... ..( 3 skipped). .. ***********
120 2024-03-16 00:29 30 ***********
121 2024-03-16 01:28 32 *************
122 2024-03-16 02:27 31 ************
123 2024-03-16 03:26 33 **************
124 2024-03-16 04:25 31 ************
125 2024-03-16 05:24 30 ***********
126 2024-03-16 06:23 30 ***********
127 2024-03-16 07:22 29 **********
0 2024-03-16 08:21 28 *********
1 2024-03-16 09:20 29 **********
2 2024-03-16 10:19 30 ***********
3 2024-03-16 11:18 31 ************
4 2024-03-16 12:17 30 ***********
5 2024-03-16 13:16 30 ***********
6 2024-03-16 14:15 30 ***********
7 2024-03-16 15:14 31 ************
... ..( 3 skipped). .. ************
11 2024-03-16 19:10 31 ************
12 2024-03-16 20:09 32 *************
13 2024-03-16 21:08 34 ***************
14 2024-03-16 22:07 33 **************
15 2024-03-16 23:06 33 **************
16 2024-03-17 00:05 30 ***********
... ..( 2 skipped). .. ***********
19 2024-03-17 03:02 30 ***********
20 2024-03-17 04:01 31 ************
21 2024-03-17 05:00 30 ***********
22 2024-03-17 05:59 30 ***********
23 2024-03-17 06:58 30 ***********
24 2024-03-17 07:57 29 **********
25 2024-03-17 08:56 30 ***********
26 2024-03-17 09:55 30 ***********
27 2024-03-17 10:54 32 *************
28 2024-03-17 11:53 29 **********
29 2024-03-17 12:52 31 ************
30 2024-03-17 13:51 29 **********
31 2024-03-17 14:50 30 ***********
32 2024-03-17 15:49 34 ***************
33 2024-03-17 16:48 32 *************
34 2024-03-17 17:47 31 ************
... ..( 2 skipped). .. ************
37 2024-03-17 20:44 31 ************
38 2024-03-17 21:43 32 *************
39 2024-03-17 22:42 34 ***************
40 2024-03-17 23:41 35 ****************
41 2024-03-18 00:40 32 *************
42 2024-03-18 01:39 31 ************
43 2024-03-18 02:38 31 ************
44 2024-03-18 03:37 33 **************
45 2024-03-18 04:36 32 *************
46 2024-03-18 05:35 32 *************
47 2024-03-18 06:34 32 *************
48 2024-03-18 07:33 31 ************
49 2024-03-18 08:32 33 **************
50 2024-03-18 09:31 32 *************
51 2024-03-18 10:30 31 ************
52 2024-03-18 11:29 32 *************
53 2024-03-18 12:28 33 **************
54 2024-03-18 13:27 32 *************
55 2024-03-18 14:26 34 ***************
56 2024-03-18 15:25 33 **************
... ..( 2 skipped). .. **************
59 2024-03-18 18:22 33 **************
60 2024-03-18 19:21 32 *************
61 2024-03-18 20:20 34 ***************
62 2024-03-18 21:19 34 ***************
63 2024-03-18 22:18 32 *************
64 2024-03-18 23:17 33 **************

SCT Error Recovery Control:
Read: 80 (8.0 seconds)
Write: 80 (8.0 seconds)

Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 31 --- Lifetime Power-On Resets
0x01 0x010 4 539 --- Power-on Hours
0x01 0x018 6 24778543399 --- Logical Sectors Written
0x01 0x020 6 26515217 --- Number of Write Commands
0x01 0x028 6 66579004967 --- Logical Sectors Read
0x01 0x030 6 141155589 --- Number of Read Commands
0x01 0x038 6 - --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 521 --- Spindle Motor Power-on Hours
0x03 0x010 4 347 --- Head Flying Hours
0x03 0x018 4 1293 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 23 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 0 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 5 --- Resets Between Cmd Acceptance and Completion
0x04 0x018 4 0 -D- Physical Element Status Changed
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 34 --- Current Temperature
0x05 0x010 1 32 --- Average Short Term Temperature
0x05 0x018 1 - --- Average Long Term Temperature
0x05 0x020 1 51 --- Highest Temperature
0x05 0x028 1 26 --- Lowest Temperature
0x05 0x030 1 45 --- Highest Average Short Term Temperature
0x05 0x038 1 28 --- Lowest Average Short Term Temperature
0x05 0x040 1 - --- Highest Average Long Term Temperature
0x05 0x048 1 - --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 60 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 5 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 17 --- Number of Hardware Resets
0x06 0x010 4 10 --- Number of ASR Events
0x06 0x018 4 0 --- Number of Interface CRC Errors
0xff ===== = = === == Vendor Specific Statistics (rev 1) ==
0xff 0x008 7 0 --- Vendor Specific
0xff 0x010 7 0 --- Vendor Specific
0xff 0x018 7 0 --- Vendor Specific
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x000a 2 7 Device-to-host register FISes sent due to a COMRESET
0x0001 2 0 Command failed due to ICRC error
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS

> @KingsleyBawuah @Barmagler @NolandTech I've added your models to my PR as well, please test the file and provide your output there.
>
> I'm currently rather busy, so I likely won't add any more models to my PR.

Thanks for everything so far!

@Barmagler

@OddMagnet, I'm a little lost about which version I need to run.
Before, it was ghcr.io/analogj/scrutiny:master-omnibus;
now I have downloaded ghcr.io/analogj/scrutiny:latest.

The result is still the same. Maybe I need to update drivedb.h?

Screenshot_108

@OddMagnet

Yes, you need to take the drivedb.h from my PR and use it as described here.

To be safe, I recommend renaming your existing file before adding the one from my PR.

@Barmagler

Very strange situation:

  1. smartctl inside container:

Screenshot_109

  2. smartctl on DSM:

Screenshot_110

DSM:
ash-4.4# smartctl -V
smartctl 6.5 (build date Sep 26 2022) [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

smartctl comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
the terms of the GNU General Public License; either
version 2, or (at your option) any later version.
See http://www.gnu.org for further details.

smartmontools release 6.5 dated 2015-06-04 at 16:29:41 UTC
smartmontools SVN rev is unknown
smartmontools build host: x86_64-pc-linux-gnu
smartmontools build with: GCC 12.2.0
smartmontools configure arguments: '--host=x86_64-pc-linux-gnu' '--target=x86_64-pc-linux-gnu' '--build=i686-pc-linux-gnu' '--prefix=/usr' 'build_alias=i686-pc-linux-gnu' 'host_alias=x86_64-pc-linux-gnu' 'target_alias=x86_64-pc-linux-gnu' 'CXX=/usr/local/x86_64-pc-linux-gnu/bin/ccache/x86_64-pc-linux-gnu-wrap-g++' 'LDFLAGS= -Wl,-z,relro -Wl,--as-needed -Wl,--no-undefined' 'CC=/usr/local/x86_64-pc-linux-gnu/bin/ccache/x86_64-pc-linux-gnu-wrap-gcc' 'CFLAGS=-DSYNOPLAT_F_X86_64 -include /usr/syno/include/platformconfig.h -DSYNO_ENVIRONMENT -DBUILD_ARCH=64 -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -DSYNO_RUNNING_DSM_BUILD_SYSTEM -g -pipe -fstack-protector-strong -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=3 -Wno-unused-result -fexceptions -Wbidi-chars=ucn -fstack-clash-protection -fcf-protection=full -mshstk -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE' 'PKG_CONFIG_PATH='

Container:
root@DSM7:/opt/scrutiny# smartctl -V
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

smartctl comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
the terms of the GNU General Public License; either
version 2, or (at your option) any later version.
See https://www.gnu.org for further details.

smartmontools release 7.3 dated 2022-02-28 at 16:33:40 UTC
smartmontools SVN rev 5338 dated 2022-02-28 at 16:34:26
smartmontools build host: x86_64-pc-linux-gnu
smartmontools build with: C++11, GCC 12.2.0
smartmontools configure arguments: [hidden in reproducible builds]
reproducible build SOURCE_DATE_EPOCH: 1665910132 (2022-10-16 08:48:52)

But it's a really strange situation: Scrutiny uses the smartctl inside the container, so why is it still reporting an error?

Screenshot_111

@OddMagnet

Scrutiny still remembers the old values. You could either edit them manually in Scrutiny's InfluxDB, or simply remove the drives in Scrutiny's WebUI and restart the container.

@goodboyrobot

Just chiming in to say that I also had this issue, and following @unai-ndz's post fixed it for me, with one small caveat. For whatever reason, my installation of Ubuntu 22.04 had the default location for the smart db as follows:

➜  ~ smartctl -h | grep 'default is'                        
        [default is +/etc/smart_drivedb.h

but there was no file there. Instead the db was getting downloaded to /var/lib/smartmontools/drivedb/drivedb.h

sudo update-smart-drivedb 
/var/lib/smartmontools/drivedb/drivedb.h updated from branches/RELEASE_7_2_DRIVEDB

I made the edits in that file and then copied it to the /etc/smart_drivedb.h location and it worked. Just letting folks know in case anyone else hits that issue.

@ChaosBlades

I just added 9 refurbished ST16000NM000H-3KW103 drives to Scrutiny v0.8.1, and 3 of them have this issue. One of them has a value as high as 4295033551.

I think Scrutiny should mention that this metric is most commonly caused by a failing backplane or SATA cable, not by the HDD itself failing. The real alert should therefore fire when the value increases over some period X, not when it crosses a threshold. None of mine have increased since I added them, even after a lot of stress testing and ZFS scrubs, so I would consider marking these HDDs as failed a false positive. As it stands, I can't really rely on Scrutiny to alert me if an HDD is actually failing, since it is permanently stuck as failed. I would need to go in manually and compare the status.
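
The "alert on increase, not on threshold" idea could look roughly like the sketch below. This is not how Scrutiny works today. `timeouts_increased`, the sample lists, and the window size are all hypothetical, and the samples are assumed to be already-decoded timeout counts in chronological order.

```python
from typing import Sequence


def timeouts_increased(samples: Sequence[int], window: int) -> bool:
    """Return True if the counter grew within the last `window` samples."""
    if window < 2:
        raise ValueError("window must cover at least two samples")
    recent = samples[-window:]
    if len(recent) < 2:
        return False  # not enough history to judge a trend
    # Compare the newest sample against the oldest one inside the window.
    return recent[-1] > recent[0]


if __name__ == "__main__":
    stable = [719, 719, 719, 719]   # hypothetical: high but unchanged
    growing = [1, 1, 2, 4]          # hypothetical: low but still rising
    print(timeouts_increased(stable, window=4))   # False
    print(timeouts_increased(growing, window=4))  # True
```

With a rule like this, a drive that arrived with a large but static count would not be flagged, while one whose count keeps climbing would be.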

@WeterPeter

Yes, I have the same issue with drive model ST5000DM000, which shows a 'Command Timeout' raw value of 8950848424068.
Please fix this, or give us an option to fix it ourselves!

@unai-ndz

Follow the steps in my comment above.
#522 (comment)

You may need to adjust it for your specific HDD model and attribute ID. Ultimately this is an issue with smartctl, not Scrutiny.

I agree that ideally it should warn only when values increase, but the steps above will probably reduce the reading to zero or to a low value.

@Parlane

Parlane commented Sep 5, 2024

> Just chiming in to say that I also had this issue and following @unai-ndz 's post fixed it for me. With one small caveat. For whatever reason my installation of ubuntu 22.04 had the default location for the smart db as follows:
>
> ➜  ~ smartctl -h | grep 'default is'
>         [default is +/etc/smart_drivedb.h
>
> but there was no file there. Instead the db was getting downloaded to /var/lib/smartmontools/drivedb/drivedb.h
>
> sudo update-smart-drivedb
> /var/lib/smartmontools/drivedb/drivedb.h updated from branches/RELEASE_7_2_DRIVEDB
>
> I made the edits in that file and then copied it to the /etc/smart_drivedb.h location and it worked. Just letting folks know in case anyone else hits that issue.

To expand on this, smartctl looks for /etc/smart_drivedb.h first, before it reads /var/lib/smartmontools/drivedb/drivedb.h.

smartmontools/drivedb/drivedb.h does not contain the specifics for my 8TB and 18TB Seagate drives, so I had to manually create /etc/smart_drivedb.h to handle them. This means I no longer need commands/metrics_smart_args inside collector.yaml, because smartctl on both my Docker instance and my actual host now gets the correct -v args from its own drive database.

My /etc/smart_drivedb.h file is this:

/* Each entry: model family, model regex, firmware regex, warning message, -v presets. */
{ "Seagate Exos X18 Hard Drive",
  "ST18000NM000J-2TV103",
  "SN02",
  "",
  "-v 1,raw48:54 "
  "-v 7,raw48:54 "
  "-v 188,raw16"
},
{ "Seagate IronWolf",
  "ST(1|2|3|4|6|8|10|12)000VN00(0?[2478]|1|22|33|41)-.*",
  "SC60",
  "",
  "-v 1,raw48:54 "
  "-v 7,raw48:54 "
  "-v 18,raw48,Head_Health "
  "-v 188,raw16 "
  "-v 200,raw48,Pressure_Limit "
  "-v 240,msec24hour32"
}
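
Because these entries are keyed on model and firmware patterns, it can help to check whether a given drive would be picked up by them at all. The sketch below is a rough approximation, not smartctl's actual lookup code. It assumes that patterns must match the whole string and that an empty firmware pattern matches any firmware. It uses Python's `re` rather than POSIX extended regexes, which should behave the same for these simple patterns. `find_entry` and `ENTRIES` are hypothetical names.

```python
import re

# (model family, model regex, firmware regex) copied from the entries above.
ENTRIES = [
    ("Seagate Exos X18 Hard Drive", r"ST18000NM000J-2TV103", r"SN02"),
    ("Seagate IronWolf",
     r"ST(1|2|3|4|6|8|10|12)000VN00(0?[2478]|1|22|33|41)-.*", r"SC60"),
]


def find_entry(model: str, firmware: str):
    """Return the first family whose patterns match model and firmware, else None."""
    for family, model_re, firmware_re in ENTRIES:
        if not re.fullmatch(model_re, model):
            continue
        # Assumption: an empty firmware pattern matches any firmware.
        if firmware_re and not re.fullmatch(firmware_re, firmware):
            continue
        return family
    return None


if __name__ == "__main__":
    # Model and firmware strings taken from smartctl output in this thread.
    print(find_entry("ST18000NM000J-2TV103", "SN02"))  # Seagate Exos X18 Hard Drive
    print(find_entry("ST18000NM002J-2TV133", "PAL7"))  # None
    print(find_entry("ST12000NM0538-2K2101", "CN02"))  # None
```

The two `None` results show why the entries need to be adapted per model and firmware: the other drives reported in this thread would not be covered by this file as written.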

My docker compose file now mounts /etc/smart_drivedb.h into the Docker container:

    volumes:
      - /run/udev:/run/udev:ro
      - /docker/scrutiny/config:/opt/scrutiny/config
      - /docker/scrutiny/influxdb:/opt/scrutiny/influxdb
      - /etc/smart_drivedb.h:/etc/smart_drivedb.h:ro

smartctl -a /dev/devk is now:

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X18 Hard Drive
Device Model:     ST18000NM000J-2TV103
Serial Number:    *********
LU WWN Device Id: 5 000c50 0e36d28bf
Firmware Version: SN02
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Sep  6 09:29:16 2024 NZST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   064   044    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   045    Pre-fail  Always       -       0
...
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       2 2 2
...
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       50230319259
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       970940703083
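
If you want to check such a line from a script, the three words can be pulled out of the text output with a small parser. This is a sketch only. `parse_raw16_line` is a hypothetical helper, and it assumes the `smartctl -a` column layout shown above (ID, name, flag, value, worst, thresh, type, updated, when_failed, then the raw words). The `-x` layout has different columns and is not handled.

```python
def parse_raw16_line(line: str) -> tuple[int, str, list[int]]:
    """Parse one `smartctl -a` attribute row whose raw value is printed as raw16 words."""
    fields = line.split()
    if len(fields) < 10:
        raise ValueError("not a complete attribute row")
    attr_id = int(fields[0])
    name = fields[1]
    # The first nine columns are fixed; everything after them is the raw value.
    raw_words = [int(word) for word in fields[9:]]
    return attr_id, name, raw_words


if __name__ == "__main__":
    row = ("188 Command_Timeout         0x0032   100   100   000    "
           "Old_age   Always       -       2 2 2")
    print(parse_raw16_line(row))  # (188, 'Command_Timeout', [2, 2, 2])
```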

@likecyber

This problem is still happening on a Seagate Exos X14 (ST12000NM0538-2K2101) with CN02 firmware.

smartctl 6.5 (build date Sep 26 2022) [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X14
Device Model:     ST12000NM0538-2K2101
Serial Number:    ZHZ58W1B
LU WWN Device Id: 5 000c50 0c4381cc3
Firmware Version: CN02
User Capacity:    12,000,138,625,024 bytes [12.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   Unknown(0x0fe0), ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA >3.2 (0x1ff), 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Sep 30 18:19:37 2024 WIB
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Disabled
ATA Security is:  Disabled, frozen [SEC2]
Write SCT (Get) Feature Control Command failed: scsi error badly formed scsi parameters
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

Adding metrics_smart_args: "-xv 188,raw16 --xall --json -T permissive" did indeed fix the problem for me.

devices:
  - device: /dev/sata1
    type: "sat"
    commands:
      metrics_smart_args: "-xv 188,raw16 --xall --json -T permissive"

If the drive still shows as failed after adding metrics_smart_args,
remove every file inside the "config" and "influxdb" folders
(except collector.yaml and scrutiny.yaml, if you don't want to lose your settings).
Then delete the container and start a new one. That worked perfectly for me.

@tomini

tomini commented Dec 9, 2024

Hello,
is it normal that, out of four drives of the same model (ST16000NM001G-2KK103), two passed and the other two did not because of 188 (0xBC) Command Timeout?
Thanks
image
image
