SMART Tests


Document Change History - 

Version Date Comment
Current Version (v. 6) Nov 11, 2019 20:56 Joshua DeRush
v. 5 Nov 11, 2019 20:22 Joshua DeRush
v. 4 Nov 07, 2019 20:46 Joshua DeRush
v. 3 Nov 07, 2019 20:31 Joshua DeRush
v. 2 Nov 05, 2019 01:16 Joshua DeRush
v. 1 Oct 25, 2019 23:48 Joshua DeRush


Document Scope & Audience -

Document Scope

SMART tests are involved and can be overwhelming until you know where to look. This article aims to shed some light on SMART tests: when to schedule them and what to watch out for.

Document Audience


INTERNAL USE FOR EXXACT CORPORATION PERSONNEL ONLY. DO NOT DISTRIBUTE OR DISSEMINATE OUTSIDE OF THE EXXACT CORPORATION PREMISES OR TO ANY NON-EXXACT CORPORATION AUTHORIZED PERSONNEL UNLESS SPECIFICALLY AUTHORIZED BY ANDREW.NELSON@EXXACTCORP.COM

Targeted Audience List (If Any)

List any specific targeted audience here with an at symbol (@) and the target's name in the Target column. The Notes field should be used to state WHY the person is being targeted in the document (to get notifications, etc.).


Date | Target | Notes
11/4/19 | Everyone | Useful information that applies to anyone who touches a system.

SMART Summary -

Self-Monitoring, Analysis and Reporting Technology

This monitoring system is built into disk drives (traditional HDDs, SSDs, and eMMC). Its goal is to monitor drives proactively and catch failing drives before they actually fail. It does this by reporting on a large set of indicators and attributes, which can be quite overwhelming. Unfortunately, this is compounded by the fact that these indicators and attributes are not standardized across the industry, and those that appear to be the same across vendors are often interpreted differently.

We hope to bring some clarity to this subject so that this monitoring system can be of greater value to our users and help prevent data loss.

Manually Running Tests

Installation

Ensure that smartmontools is installed. If it is not, install it with the package manager for your OS, either yum or apt-get:

# yum install smartmontools
# sudo apt-get install smartmontools

Confirm SMART is supported

# sudo smartctl -i /dev/<device>
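
For scripted checks, the output of smartctl -i can be searched for the line that reports whether SMART is enabled. The sketch below assumes the ATA-style wording "SMART support is: Enabled"; NVMe and SCSI devices may word this differently. The function name smart_enabled and the device /dev/sda are only illustrative.

```shell
# Hypothetical helper: reads `smartctl -i` output on stdin and succeeds
# only if a "SMART support is: Enabled" line is present.
smart_enabled() {
    grep -Eq '^SMART support is:[[:space:]]+Enabled'
}

# Usage (run as root, replace the device name):
#   smartctl -i /dev/sda | smart_enabled && echo "SMART is enabled"
```

If the drive reports SMART as available but disabled, it can be turned on with smartctl -s on /dev/<device>.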

Executing Tests - 
# smartctl -t short /dev/<device>
# smartctl -t long /dev/<device>
# smartctl -t conveyance /dev/<device>

Estimated run times for each test on a given drive can be gathered by executing the following command.
# sudo smartctl -c /dev/<device>
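
smartctl -t starts the test on the drive and returns right away; the result is read later from the self-test log with smartctl -l selftest. As a rough sketch, the helper below checks whether the newest log entry finished cleanly. It assumes the ATA log layout, where the newest entry is numbered "# 1" and a clean result reads "Completed without error". The function name last_selftest_ok and the device /dev/sda are only illustrative.

```shell
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# succeeds only if the newest entry ("# 1") completed without error.
last_selftest_ok() {
    awk '
        /^# *1[ \t]/ {
            found = 1
            ok = ($0 ~ /Completed without error/)
            exit
        }
        END { exit !(found && ok) }
    '
}

# Usage (run as root, replace the device name):
#   smartctl -l selftest /dev/sda | last_selftest_ok || echo "check the drive"
```

The helper also fails when no entry is found, so a drive that has never run a test is reported for follow-up rather than treated as healthy.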

Testing Schedule - 

The following tests are a good foundation to start with. I would suggest that the frequency of the tests not be adjusted, but the time at which these tests occur can be shifted to meet the needs of the system and environment. They should also occur during non-peak times. For example, I typically schedule them at night: short tests at midnight on Fridays and long tests at 8pm on Sundays.


Test | Description | Frequency
Short | Short test (≤ 2 minutes) to identify a defective drive. By performing three separate tests, the disk can reliably be confirmed faulty in a short amount of time. These tests include an electrical test, a mechanical test, and a read/verify of a portion of the disk. The contents and location of the area read and verified vary from manufacturer to manufacturer, but this is still a good regular test to keep tabs on the disk. | Weekly to daily, depending on server role and criticality of data.
Long | The same as the Short test, with two distinct differences: there is no time limit, and the entire disk is read. This makes for a much longer test whose duration is directly related to the size of the disk, and it means no area of usable disk space is overlooked. | Monthly to weekly, depending on server role and criticality of data.
Conveyance | Only available on ATA drives and only takes a few minutes. Used only when disks have been transported, so that any damage in transit can be identified before use. | Only if disks have been moved to a new location/system.

Testing more often than this can shorten the life of the disk, but ultimately it is more a waste of time.


Scheduling Tests - 

Testing can be scheduled and automated to avoid having to remember to run tests manually on a regular basis. This can be done several ways; the smartd.conf approach is discussed below.

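/dev/<device> -a -m user@domain.com -o on -s (S/../../1/05|L/../../6/01)

The example above monitors the specified device with all SMART features (-a), sends a warning email to user@domain.com when a failure or new error is detected (-m), and turns on automatic offline data collection (-o on). The Short test is scheduled to run on the 1st day of the week (Monday) at 5AM, while the Long test is scheduled for the 6th day of the week (Saturday) at 1AM. Note that -o takes "on" or "off" as its argument and that the hour in the schedule is written with two digits (05, 01).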
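
To sanity-check a schedule before putting it in smartd.conf, it can help to see how a given date and hour lines up against the pattern. smartd compares a string of the form T/MM/DD/d/HH (test type, month, day, day of week with 1 = Monday, two-digit hour) against the regular expression given to -s. The sketch below approximates that comparison with grep -Ex; it is not how smartd itself is implemented, and the function name schedule_matches is only illustrative.

```shell
# Hypothetical helper: succeeds if a "T/MM/DD/d/HH" string matches a
# smartd.conf -s pattern (whole-string match, extended regex).
#   $1 = candidate string, e.g. S/11/04/1/05
#   $2 = pattern from smartd.conf
schedule_matches() {
    printf '%s\n' "$1" | grep -Eqx "$2"
}

# Build the string for a Short test at the current date and hour.
# date's %u prints the day of week as 1 (Monday) through 7 (Sunday).
now="S/$(date +%m/%d/%u/%H)"
if schedule_matches "$now" '(S/../../1/05|L/../../6/01)'; then
    echo "a short test would be due this hour"
fi
```

For example, S/11/04/1/05 (Monday, November 4 at 5AM) matches the pattern above, while S/11/05/2/05 (Tuesday) does not.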

Options and more explanations below.

Flag | Definition | Notes
-a | monitors ALL SMART features |
-d | specifies the interface explicitly | Allows scheduling a different test for "ATA" vs "SCSI" drives
-m | specifies the email address for notifications |
-M | specifies the type of email | Default is "once"; other options are "daily", "diminishing" or "test"
-M exec | specifies the path to a script that is run in place of the default mail command |
-n | prevents spin-up due to smartd polling | "never", "sleep", "standby" or "idle"
-s | schedules self-tests | test type/month/day/day-of-week/hour
-o | automatic offline data collection ("on" or "off") | Would recommend this be turned on in almost every case
-S | autosave of device vendor-specific attributes ("on" or "off") |


What Metrics Matter - 

SMART # | Name | Definition
5 | Reallocated_Sector_Count | Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped. Thus, the higher the raw value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive which has had any reallocations at all is significantly more likely to fail in the months immediately following.
10 | Spin Retry Count | Count of spin start retries. This attribute stores the total count of spin start attempts needed to reach fully operational speed (under the condition that the first attempt was unsuccessful). An increase in this value is a sign of problems in the hard disk's mechanical subsystem.
187 | Reported_Uncorrectable_Errors | The count of errors that could not be recovered using hardware ECC.
188 | Command_Timeout | The count of operations aborted due to HDD timeout. Normally this value should be equal to zero.
194 | Temperature | Indicates the device temperature, if the appropriate sensor is fitted.
196 | Reallocation Event Count | Count of remap operations. The raw value of this attribute shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.
197 | Current_Pending_Sector_Count | Count of "unstable" sectors (waiting to be remapped because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped and will remap it the next time it is written. However, some drives will not immediately remap such sectors when written; instead, the drive first attempts to write to the problem sector, and if the write succeeds the sector is marked good (in this case, the "Reallocation Event Count" (0xC4) is not increased). This is a serious shortcoming: if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write, the drive will never remap these problem sectors.
198 | Offline_Uncorrectable | The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
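
As a quick way to keep an eye on these counters, the attribute table printed by smartctl -A can be filtered for the IDs above. The sketch below assumes the usual ATA table layout, where the ID is the first column, the name the second, and the raw value the tenth; NVMe and SCSI devices report health data in a different format. Temperature (194) is left out because a nonzero raw value is normal there. The function name flag_attributes and the device /dev/sda are only illustrative.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# "ID NAME RAW" for each watched attribute whose raw value is above zero.
flag_attributes() {
    awk '$1 ~ /^(5|10|187|188|196|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Usage (run as root, replace the device name):
#   smartctl -A /dev/sda | flag_attributes
```

No output means none of the watched counters has moved off zero. Any line that does appear is worth following up with a long test.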

When a Disk is Suspected Bad -

Confirm the device in question and run a long test so that the entire disk is scanned and tested. As stated earlier, this can take several hours to complete but will provide a comprehensive overview of the disk's health.

The SMART log (smartctl -x /dev/<device>) can then be attached to a ticket with Exxact to validate an RMA replacement if the drive is within the 3-year warranty we provide. If the system is older than that, most drive manufacturers offer a 5-year warranty and can be contacted directly.