SMART Tests
Document Change History -
Version | Date | Comment |
---|---|---|
Current Version (v. 6) | Nov 11, 2019 20:56 | Joshua DeRush |
v. 5 | Nov 11, 2019 20:22 | Joshua DeRush |
v. 4 | Nov 07, 2019 20:46 | Joshua DeRush |
v. 3 | Nov 07, 2019 20:31 | Joshua DeRush |
v. 2 | Nov 05, 2019 01:16 | Joshua DeRush |
v. 1 | Oct 25, 2019 23:48 | Joshua DeRush |
Document Scope & Audience -
Document Scope
SMART tests are involved and can be overwhelming until you know where to look. This article sheds some light on SMART tests: when to schedule them and what to watch out for.
Document Audience
INTERNAL USE FOR EXXACT CORPORATION PERSONNEL ONLY. DO NOT DISTRIBUTE OR DISSEMINATE OUTSIDE OF THE EXXACT CORPORATION PREMISES OR TO ANY NON-EXXACT CORPORATION AUTHORIZED PERSONNEL UNLESS SPECIFICALLY AUTHORIZED BY ANDREW.NELSON@EXXACTCORP.COM
Targeted Audience List (If Any)
List any specific targeted audience here with an at symbol (@) in the Target column with the targets name. Notes field should be used to denote WHY the person is being targeted in the document to get notifications, etc.
Date | Target | Notes |
---|---|---|
11/4/19 | Everyone | Useful information that applies to anyone who touches a system. |
SMART Summary -
Self-Monitoring, Analysis and Reporting Technology
This monitoring system is built into disk drives (traditional HDDs, SSDs, and eMMC). Its goal is to monitor drives proactively so that failing drives are caught before they actually fail. It does this by reporting a large set of indicators and attributes, which can be overwhelming. The difficulty is compounded by the fact that these indicators and attributes are not standardized across the industry, and those that appear to be the same across vendors are often interpreted differently.
We hope to bring some clarity to this subject so that this monitoring system becomes more valuable to our users and helps prevent data loss.
Manually Running Tests
Installation
Ensure that smartmontools is installed. If it is not, install it with the package manager for your OS, either yum or apt-get.
# sudo yum install smartmontools
# sudo apt-get install smartmontools
Confirm SMART is supported
# sudo smartctl -i /dev/<device>
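The check above can be scripted. The sketch below is a minimal example, assuming an ATA/SATA drive whose `smartctl -i` output contains the two "SMART support is:" lines; NVMe and some SCSI devices report this differently. The function name `smart_status` is hypothetical and only reads text from stdin, so it can be tried without a real device.

```shell
# Reads `smartctl -i` output on stdin and summarizes SMART support.
# Assumes the ATA-style "SMART support is:" lines are present.
smart_status() {
    awk '
        /^SMART support is:[ \t]*Available/ { available = 1 }
        /^SMART support is:[ \t]*Enabled/   { enabled = 1 }
        END {
            if (available && enabled) print "supported and enabled"
            else if (available)       print "supported but disabled"
            else                      print "not reported"
        }'
}

# Usage on a real system (needs root; /dev/sda is only an example):
#   sudo smartctl -i /dev/sda | smart_status
```

If the result is "supported but disabled", SMART can be turned on with `smartctl -s on /dev/<device>`.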
Executing Tests -
# smartctl -t short /dev/<device>
# smartctl -t long /dev/<device>
# smartctl -t conveyance /dev/<device>
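These commands only start a test; the drive runs it in the background and the result has to be read back afterwards with `smartctl -l selftest /dev/<device>`. The sketch below shows one way to summarize the most recent entry of that log. It assumes the ATA-style log layout, where the newest entry is the line starting with `# 1`, and the function name `latest_selftest` is hypothetical.

```shell
# Reads `smartctl -l selftest` output on stdin and classifies the newest
# log entry (the line numbered "# 1" in the ATA self-test log).
latest_selftest() {
    awk '
        /^# *1 / {
            found = 1
            if ($0 ~ /Completed without error/) print "pass"
            else if ($0 ~ /in progress/)        print "running"
            else                                print "attention"
            exit
        }
        END { if (!found) print "no entries" }'
}

# Usage on a real system (needs root; /dev/sda is only an example):
#   sudo smartctl -l selftest /dev/sda | latest_selftest
```

Anything other than "pass" or "running" is worth reading in full, since the status column names the kind of failure.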
The estimated run time of each test type on a given drive can be displayed with the following command.
# sudo smartctl -c /dev/<device>
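The capabilities output is long, and the run times are spread across "recommended polling time" lines. As a rough sketch, the estimates can be pulled out as shown below. This assumes the ATA output layout, in which each "... self-test routine" line is followed by a "recommended polling time" line, and the function name `test_durations` is hypothetical.

```shell
# Reads `smartctl -c` output on stdin and prints one line per test type
# with its recommended polling time in minutes.
test_durations() {
    awk '
        /self-test routine/        { name = $1 }
        /recommended polling time/ {
            gsub(/[^0-9]/, "")          # keep only the digits of the line
            if (name != "") print name, $0 " min"
            name = ""
        }'
}

# Usage on a real system (needs root; /dev/sda is only an example):
#   sudo smartctl -c /dev/sda | test_durations
```

The "Extended" line is the one to check before scheduling long tests, since it grows with disk size.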
Testing Schedule -
The following tests are a good foundation to start with. I suggest leaving the frequency of the tests as listed, but the time at which they run can be shifted to meet the needs of the system and environment. Tests should run during off-peak hours. For example, I typically schedule short tests at midnight on Fridays and long tests at 8 PM on Sundays.
Test | Description | Frequency |
---|---|---|
Short | Takes 2 minutes or less and is meant to identify a defective drive quickly. It consists of three parts: an electrical test, a mechanical test, and a read/verify of a portion of the disk. The area that is read and verified varies from manufacturer to manufacturer, but this is still a good regular test for keeping tabs on the disk. | Weekly to daily, depending on server role and criticality of data. |
Long | The same as the Short test, with two differences: there is no time limit, and the entire disk is read. The run time therefore scales with the size of the disk, and no area of usable disk space is overlooked. | Monthly to weekly, depending on server role and criticality of data. |
Conveyance | Only available on ATA drives and takes only a few minutes. Intended for disks that have been transported, so that any damage in transit can be identified before use. | Only if disks have been moved to a new location/system. |
Testing more often than this can shorten the life of the disk, but mostly it is a waste of time.
Scheduling Tests -
Testing can be scheduled and automated so that nobody has to remember to run tests manually. This can be done several ways; the smartd.conf approach is discussed below.
/dev/<device> -a -o on -m user@domain.com -s (S/../../1/05|L/../../6/01)
The example above monitors all SMART features on the specified device, enables automatic offline data collection, and emails user@domain.com when smartd detects a problem. The short test is scheduled for the 1st day of the week (Monday) at 5 AM, and the long test for the 6th day of the week (Saturday) at 1 AM. The schedule fields are test type/month/day/day-of-week/hour, and the hour must be written with two digits.
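smartd matches the `-s` expression against a string of the form T/MM/DD/d/HH, where d is the day of the week (1 = Monday through 7 = Sunday) and HH is a two-digit hour. Getting the digits right by hand is error-prone, so a small helper can build the expression. This is only a sketch; the function name `smartd_schedule` is hypothetical and it does no range checking.

```shell
# Builds a smartd -s schedule expression that fires every week.
#   $1 = test type: S (short), L (long), C (conveyance)
#   $2 = day of week, 1 (Monday) through 7 (Sunday)
#   $3 = hour, 0 through 23, written WITHOUT a leading zero
smartd_schedule() {
    printf '%s/../../%d/%02d\n' "$1" "$2" "$3"
}

# Short test at midnight on Fridays, long test at 8 PM on Sundays:
smartd_schedule S 5 0
smartd_schedule L 7 20
```

The two results can be combined in smartd.conf as `-s (S/../../5/00|L/../../7/20)`. Running `smartd -q showtests` afterwards lists the upcoming scheduled tests without starting the daemon, which is a quick way to confirm the expression does what was intended.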
Options and more explanations below.
Flag | Definition | Notes |
---|---|---|
-a | monitors ALL SMART features | |
-d | specify the interface explicitly | Scheduling a different test for "ATA" vs "SCSI" drives |
-m | specify email for notifications | |
-M | specifies type of email | Default is "once" other options are "daily", "diminishing" or "test" |
-M exec | runs the given script or program instead of the default mail command when a notification is sent | Given as -M exec PATH |
-n | prevents spin-up due to smartd polling | "never", "sleep", "standby" or "idle" |
-s | schedules self-tests using a regular expression | Matched against test type/month/day/day-of-week/hour |
-o | automatic offline data collection | "on" or "off". Would recommend this be enabled in almost every case |
-S | autosave of device vendor specific attributes | "on" or "off" |
What Metrics Matter -
SMART# | Name | Definition |
---|---|---|
5 | Reallocated_Sector_Count | Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped. Thus, the higher the raw value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive which has had any reallocations at all is significantly more likely to fail in the following months. |
10 | Spin Retry Count | Count of retry of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem. |
187 | Reported_Uncorrectable_Errors | The count of errors that could not be recovered using hardware ECC. |
188 | Command_Timeout | The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero. |
194 | Temperature | Indicates the device temperature, if the appropriate sensor is fitted |
196 | Reallocation Event Count | Count of remap operations. The raw value of this attribute shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted. |
197 | Current_Pending_Sector_Count | Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written. However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors |
198 | Offline_Uncorrectable | The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem. |
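As a quick screen, the attributes in the table above can be pulled out of `smartctl -A` output and flagged when their raw value is non-zero. The sketch below assumes the ATA attribute table layout, where the raw value is the tenth column, and it skips 194 (Temperature) because a non-zero temperature is normal. The function name `flag_attributes` is hypothetical. On some drives the raw value of Command_Timeout packs several counters into one large number, so a hit there needs a closer look before drawing conclusions.

```shell
# Reads `smartctl -A` output on stdin and prints the watched attributes
# (5, 10, 187, 188, 196, 197, 198) whose raw value is greater than zero.
flag_attributes() {
    awk '
        $1 ~ /^(5|10|187|188|196|197|198)$/ && ($10 + 0) > 0 {
            printf "%s %s raw=%s\n", $1, $2, $10
        }'
}

# Usage on a real system (needs root; /dev/sda is only an example):
#   sudo smartctl -A /dev/sda | flag_attributes
```

No output means none of the watched counters has moved off zero.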
When a Disk is Suspected Bad -
Confirm the device in question and run a long test so that the entire disk is scanned and tested. As stated earlier this can take several hours to complete but will provide a comprehensive overview of the disk's health.
The SMART log (smartctl -x /dev/<device>) can then be attached to a ticket with Exxact to validate an RMA replacement if the system is within the 3-year warranty we provide. If the system is older than that, most drive manufacturers offer a 5-year warranty and can be contacted directly.
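Saving the log to a file makes it easier to attach to the ticket. The sketch below is one way to do that; the function name `collect_smart_log`, the file naming scheme, and the `SMARTCTL` override variable are all assumptions made for illustration. Note that smartctl's exit status is a bitmask and can be non-zero simply because the disk is reporting problems, so the function passes the status back to the caller instead of treating it as a script failure.

```shell
# Writes `smartctl -x` output for a device to a dated log file and
# prints the path of that file.
#   $1 = device, for example /dev/sda
#   $2 = output directory (defaults to the current directory)
collect_smart_log() {
    device=$1
    outdir=${2:-.}
    outfile="$outdir/smart_$(basename "$device")_$(date +%Y%m%d).log"
    # SMARTCTL can be overridden for a dry run; it defaults to smartctl.
    "${SMARTCTL:-smartctl}" -x "$device" > "$outfile"
    rc=$?
    printf '%s\n' "$outfile"
    return "$rc"
}

# Usage on a real system (needs root; /dev/sda is only an example):
#   sudo sh -c '. ./collect.sh; collect_smart_log /dev/sda /tmp'
```

The usage line assumes the function was saved in a file named collect.sh, which is also hypothetical.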