Document Change History -
Version | Date | Comment |
---|---|---|
Current Version (v. 1) | Oct 25, 2019 23:48 | Joshua DeRush (Unlicensed) |
v. 5 | Nov 11, 2019 20:22 | Joshua DeRush (Unlicensed) |
v. 4 | Nov 07, 2019 20:46 | Joshua DeRush (Unlicensed) |
v. 3 | Nov 07, 2019 20:31 | Joshua DeRush (Unlicensed) |
v. 2 | Nov 05, 2019 01:16 | Joshua DeRush (Unlicensed) |
v. 1 | Oct 25, 2019 23:48 | Joshua DeRush (Unlicensed) |
Document Scope & Audience -
Document Scope
SMART tests are fairly involved and can be overwhelming until you know where to look. This article aims to shed some light on SMART tests: when to schedule them and what to watch out for.
Document Audience
INTERNAL USE FOR EXXACT CORPORATION PERSONNEL ONLY. DO NOT DISTRIBUTE OR DISSEMINATE OUTSIDE OF THE EXXACT CORPORATION PREMISES OR TO ANY NON-EXXACT CORPORATION AUTHORIZED PERSONNEL UNLESS SPECIFICALLY AUTHORIZED BY ANDREW.NELSON@EXXACTCORP.COM
Targeted Audience List (If Any)
List any specific targeted audience here with an at symbol (@) in the Target column with the targets name. Notes field should be used to denote WHY the person is being targeted in the document to get notifications, etc.
Date | Target | Notes |
---|---|---|
SMART Summary -
Self-Monitoring, Analysis and Reporting Technology
This monitoring system is built into disk drives (traditional HDDs, SSDs, and eMMC). Its goal is to proactively monitor drives and catch failing drives before they actually fail. It does this by reporting a large set of indicators and attributes, which can be quite overwhelming. Unfortunately, this is compounded by the fact that these indicators and attributes are not standardized across the industry, and those that appear to be the same across vendors are often interpreted differently.
We hope to bring some clarity to this subject so that this monitoring system can be of greater value to our users and help prevent data loss.
Manually Running Tests
Installation
Ensure that smartmontools is installed. If it is not, install it with the package manager for your OS, either yum (RHEL/CentOS) or apt-get (Debian/Ubuntu):
sudo yum install smartmontools
sudo apt-get install smartmontools
Confirm SMART is supported
sudo smartctl -i /dev/sda
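As a rough sketch of how the check above might be scripted, the helper below reads `smartctl -i` output and reports whether SMART is both available and enabled. The function name `smart_ready` is hypothetical, and the matched strings assume the output format smartctl prints for ATA/SATA drives; NVMe and SCSI devices report this differently.

```shell
# Hypothetical helper: succeeds (exit 0) only if the `smartctl -i` output on
# stdin contains both "SMART support is: Available" and
# "SMART support is: Enabled" (ATA/SATA output format assumed).
smart_ready() {
    awk '/^SMART support is:[[:space:]]*Available/ { a = 1 }
         /^SMART support is:[[:space:]]*Enabled/   { e = 1 }
         END { exit !(a && e) }'
}

# Example usage (needs root and a real drive; /dev/sda is only an example):
#   sudo smartctl -i /dev/sda | smart_ready || sudo smartctl -s on /dev/sda
#   sudo smartctl -t short /dev/sda      # start a short self-test
#   sudo smartctl -l selftest /dev/sda   # review the self-test log afterwards
```

If the helper fails, `smartctl -s on` enables SMART on drives that support it; the self-test commands in the comments are the manual equivalents of the scheduled tests described later.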
Testing Schedule -
The following tests are a good foundation to start with. I suggest leaving the frequency of the tests as is, but the time at which they run can be shifted to meet the needs of the system and environment. They should run during non-peak times. For example, I typically schedule short tests at midnight on Fridays and long tests at 8 PM on Sundays.
- Weekly Short Tests - Scan a small portion of the drive and typically take 15 minutes or less (possibly twice a week if the system is under heavy use and the data is critical).
- Monthly Long Tests - Scan the entire disk and can take several hours to complete (possibly twice a month if the system is under heavy use and the data is critical).
Testing more often than this can shorten the life of the disk.
Setting Tests -
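One possible way to set the schedule above is the `smartd` daemon that ships with smartmontools. This is an assumption, since the article does not say which scheduler is used. The sketch below shows a `smartd.conf` directive (in comments) for a short test every Friday at midnight and a long test on the first Sunday of each month at 8 PM, plus a hypothetical helper, `test_is_due`, for checking the schedule expression before deploying it. `/dev/sda` is only an example device.

```shell
# Possible smartd.conf directive (one line per drive). smartd matches the
# regex after -s against a string of the form T/MM/DD/d/HH, where T is the
# test type and d is the day of the week (1 = Monday ... 7 = Sunday):
#
#   /dev/sda -a -s (S/../../5/00|L/../0[1-7]/7/20)
#
#   S/../../5/00      short test every Friday at 00:00
#   L/../0[1-7]/7/20  long test on the first Sunday of each month at 20:00

SCHEDULE='(S/../../5/00|L/../0[1-7]/7/20)'

# Hypothetical helper: succeeds if a test of type $1 (S or L) would be due at
# the time given in $2, a MM/DD/d/HH string such as the output of
# `date '+%m/%d/%u/%H'`.
test_is_due() {
    printf '%s\n' "$1/$2" | grep -Eqx "$SCHEDULE"
}

# Example usage:
#   test_is_due S "$(date '+%m/%d/%u/%H')" && echo "short test due this hour"
```

After editing `smartd.conf`, the smartd service has to be restarted for the new schedule to take effect.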
What Metrics Matter -
- SMART 5 - Reallocated_Sector_Count - Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped.[24] Thus, the higher the raw value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive that has had any reallocations at all is significantly more likely to fail in the following months.
- SMART 187 - Reported_Uncorrectable_Errors - The count of errors that could not be recovered using hardware ECC.
- SMART 188 - Command_Timeout - The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero.
- SMART 197 - Current_Pending_Sector_Count - Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written.[57]
However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.
- SMART 198 - Offline_Uncorrectable - The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
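To watch the attributes listed above, something like the following filter could be applied to `smartctl -A` output. It is a minimal sketch: the function name `critical_attrs` is hypothetical, it assumes the ATA attribute table layout in which the raw value is the tenth column, and it treats any non-zero raw value as worth a look. Because raw values are vendor-specific (some drives pack several counters into one number), a hit is a prompt to investigate, not proof of failure.

```shell
# Hypothetical filter: reads `smartctl -A` output on stdin and prints the ID,
# name, and raw value of attributes 5, 187, 188, 197, and 198 whose raw value
# (assumed to be column 10) is greater than zero.
critical_attrs() {
    awk '$1 ~ /^(5|187|188|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Example usage (needs root and a real drive; /dev/sda is only an example):
#   sudo smartctl -A /dev/sda | critical_attrs
```

No output means none of the five attributes reported a non-zero raw value on that drive.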