
Document Change History - 

Version Date Comment
Current Version (v. 5) Nov 11, 2019 20:22 Joshua DeRush (Unlicensed)
v. 5 Nov 11, 2019 20:22 Joshua DeRush (Unlicensed)
v. 4 Nov 07, 2019 20:46 Joshua DeRush (Unlicensed)
v. 3 Nov 07, 2019 20:31 Joshua DeRush (Unlicensed)
v. 2 Nov 05, 2019 01:16 Joshua DeRush (Unlicensed)
v. 1 Oct 25, 2019 23:48 Joshua DeRush (Unlicensed)


Document Scope & Audience -

Document Scope

SMART tests are fairly involved and can be overwhelming until you know where to look. This article aims to shed some light on SMART tests: how to run them, when to schedule them, and what to watch out for.

Document Audience


INTERNAL USE FOR EXXACT CORPORATION PERSONNEL ONLY. DO NOT DISTRIBUTE OR DISSEMINATE OUTSIDE OF THE EXXACT CORPORATION PREMISES OR TO ANY NON-EXXACT CORPORATION AUTHORIZED PERSONNEL UNLESS SPECIFICALLY AUTHORIZED BY ANDREW.NELSON@EXXACTCORP.COM

Targeted Audience List (If Any)

List any specific targeted audience here with an at symbol (@) in the Target column with the targets name. Notes field should be used to denote WHY the person is being targeted in the document to get notifications, etc.


Date | Target | Notes
11/4/19 | Everyone | Useful information that applies to anyone who touches a system.

SMART Summary -

Self-Monitoring, Analysis and Reporting Technology

This monitoring system is built into disk drives (traditional HDDs, SSDs, and eMMC). Its goal is to monitor drives proactively so that failing drives can be caught before they actually fail. It does this by reporting a large number of indicators and attributes, which can be quite overwhelming. To make matters worse, these indicators and attributes are not standardized across the industry, and those that appear to be the same across vendors are often interpreted differently.

We hope to bring some clarity to this subject so that this monitoring system can become of greater value to our users and hopefully prevent data loss. 

Manually Running Tests

Installation

Ensure that smartmontools is installed. If it is not, install it with the package manager for your OS, either yum or apt-get:

# sudo yum install smartmontools
# sudo apt-get install smartmontools

Confirm SMART is supported

# sudo smartctl -i /dev/<device>
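
As a rough sketch, this check can be scripted by looking for the "SMART support is:" lines that smartctl -i prints for ATA/SATA drives. The function name smart_enabled is hypothetical, and the exact wording of the output can differ by drive type and smartmontools version (NVMe drives, for instance, do not print these lines), so treat this as a starting point rather than a definitive check.

```shell
# smart_enabled: reads `smartctl -i` output on stdin and succeeds (exit 0)
# only if the drive reports SMART as enabled.
# Hypothetical helper; assumes the ATA-style "SMART support is:" lines.
smart_enabled() {
    grep -Eq '^SMART support is:[[:space:]]*Enabled'
}

# Usage (requires root and smartmontools):
#   sudo smartctl -i /dev/sda | smart_enabled && echo "SMART is enabled"
```

If SMART is available but disabled, it can be turned on with smartctl -s on /dev/<device>.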

Executing Tests - 
# sudo smartctl -t short /dev/<device>
# sudo smartctl -t long /dev/<device>
# sudo smartctl -t conveyance /dev/<device>

The estimated run time of each test on a given drive is reported in the output of the following command.
# sudo smartctl -c /dev/<device>
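
Self-tests run in the background on the drive itself, so the -t commands return immediately. Results can be read afterwards with smartctl -l selftest /dev/<device>, and the overall health verdict with smartctl -H /dev/<device>. The sketch below counts failed entries in the self-test log. The name count_failed_selftests is hypothetical, and the sketch assumes the ATA log layout, in which each entry starts with "#" and a failed run has a status beginning with "Completed:" (for example "Completed: read failure").

```shell
# count_failed_selftests: reads `smartctl -l selftest` output on stdin and
# prints how many log entries report a failure.
# Hypothetical helper; assumes ATA-style entries such as
#   "# 2  Extended offline    Completed: read failure  90%  1200  12345678"
count_failed_selftests() {
    awk '/^# *[0-9]+/ && /Completed: / { n++ } END { print n + 0 }'
}

# Usage (requires root and smartmontools):
#   sudo smartctl -l selftest /dev/sda | count_failed_selftests
```

Entries marked "Completed without error" are not counted, and neither are interrupted or aborted runs, which may deserve a manual look.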

Testing Schedule - 

The following tests are a good foundation to start with. I would suggest leaving the frequency of the tests as it is, but the time at which they run can be shifted to meet the needs of the system and environment. They should run during non-peak times. For example, I typically schedule short tests at midnight on Fridays and long tests at 8 PM on Sundays.


Weekly Short Tests - Scans a small section of the drive, typically in 5 minutes or less (possibly 2 times a week if the system is under heavy use and the data is critical).

Monthly Long Tests - Scans the entire disk and can take several hours to complete (Possibly 2 times a month if system is under heavy use and data is critical) 

Testing more often than this can shorten the life of the disk, but above all it is a waste of time.

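Test | Description | Frequency
Short | Takes 2 minutes or less. Runs three separate checks (an electrical test, a mechanical test, and a read/write test) so that a defective drive can be reliably identified in a short amount of time. | Weekly
Long | Scans the entire disk and can take several hours to complete. | Monthly
Conveyance | Intended to identify damage incurred while the drive was being transported (ATA drives only). |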

Scheduling Tests - 

Testing can be scheduled and automated so that nobody has to remember to run tests manually on a regular basis. This can be done in several ways; the approach discussed below uses smartd.conf, the configuration file for the smartd daemon that ships with smartmontools.

/dev/<device> -a -m user@domain.com -o on -s (S/../../1/05|L/../../6/01)

The example above monitors all SMART features on the specified device, turns on automatic offline data collection, and emails user@domain.com when a failure or new error is detected. A short test is scheduled for the 1st day of the week (Monday) at 5 AM, and a long test for the 6th day of the week (Saturday) at 1 AM. Restart smartd after editing the file so that it re-reads the configuration.

Options and more explanations below.

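-a = monitors ALL SMART features

-d = specifies the device type explicitly (useful when scheduling a different test for "ata" vs "scsi" drives)

-m = specifies the email address for notifications

-M = modifies how email is sent (default is "once"; other options are "daily", "diminishing", "test" or "exec")

-M exec = specifies the path to a script that smartd runs in place of the default mail command whenever it would send a notification

-n = skips checks while the disk is in the given low-power mode, so that smartd polling does not spin it up ("never", "sleep", "standby" or "idle")

-s = schedules self-tests using a regular expression (with smartctl, by contrast, -s on|off toggles SMART support)

-o = turns automatic offline data collection "on" or "off" (I would recommend turning this on in almost every case)

-S = turns autosave of device vendor-specific attributes "on" or "off"

Schedule format for -s: test type/month/day/day-of-week/hour, written as T/MM/DD/d/HH. The test type is S (short), L (long), C (conveyance) or O (offline). The day of the week runs from 1 (Monday) to 7 (Sunday), and the hour is two digits from 00 to 23. A "." matches any single character, so ".." in the month or day position means "every month" or "every day".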
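
To sanity-check a schedule before putting it into smartd.conf, the matching can be approximated in the shell. This is a minimal sketch, not smartd's own code: the helper name schedule_matches is hypothetical, and it assumes that a test fires when the whole string T/MM/DD/d/HH matches the extended regular expression given to -s. The sample schedule is the one mentioned earlier (short tests at midnight on Fridays, long tests at 8 PM on Sundays).

```shell
# schedule_matches REGEX TYPE WHEN
#   REGEX: the value given to -s in smartd.conf
#   TYPE:  S (short), L (long), C (conveyance) or O (offline)
#   WHEN:  a time in MM/DD/d/HH form, e.g. "$(date +%m/%d/%u/%H)"
# Succeeds (exit 0) if the whole string TYPE/WHEN matches REGEX.
# Hypothetical helper that only approximates how smartd evaluates -s.
schedule_matches() {
    printf '%s/%s\n' "$2" "$3" | grep -Eqx -- "$1"
}

# Short tests at midnight on Fridays (day 5), long tests at 8 PM on Sundays (day 7).
sched='(S/../../5/00|L/../../7/20)'
schedule_matches "$sched" S 11/08/5/00 && echo "short test would start"
schedule_matches "$sched" L 11/10/7/20 && echo "long test would start"
schedule_matches "$sched" S 11/08/5/01 || echo "nothing scheduled for this hour"
```

Note that the hour is always two digits, so a single-digit hour such as "5" in the pattern will not match "05".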

What Metrics Matter - 

  • SMART 5 - Reallocated_Sector_Count - Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped. Thus, the higher the raw value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive which has had any reallocations at all is significantly more likely to fail in the following months.
  • SMART 10 - Spin Retry Count - Count of retry of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
  • SMART 187 - Reported_Uncorrectable_Errors - The count of errors that could not be recovered using hardware ECC
  • SMART 188 - Command_Timeout - The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero.
  • SMART 194 - Temperature - Indicates the device temperature, if the appropriate sensor is fitted.
  • SMART 196 - Reallocation Event Count - Count of remap operations. The raw value of this attribute shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.
  • SMART 197 - Current_Pending_Sector_Count - Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written.

    However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.

  • SMART 198 - Offline_Uncorrectable - The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
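
The attributes above can be pulled out of the attribute table that smartctl -A /dev/<device> prints. The sketch below flags the counters from this list whose raw value is above zero; temperature (194) is left out because a non-zero value is normal. The name flag_smart_attrs is hypothetical, and the sketch assumes the ATA table layout, with the attribute ID in the first column and RAW_VALUE in the tenth. Some vendors pack several numbers into one raw value, so a hit is a prompt to look closer rather than proof of a bad drive.

```shell
# flag_smart_attrs: reads `smartctl -A` output on stdin and prints the ID,
# name and raw value of each watched attribute whose raw value is above zero.
# Hypothetical helper; assumes the ATA attribute table, where the ID is the
# first column and RAW_VALUE is the tenth.
flag_smart_attrs() {
    awk '$1 ~ /^(5|10|187|188|196|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Usage (requires root and smartmontools):
#   sudo smartctl -A /dev/sda | flag_smart_attrs
```

An empty result means none of the watched counters has moved off zero.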

When a Disk is Suspected Bad -

Confirm the device in question and run a long test so that the entire disk is scanned and tested. As stated earlier this can take several hours to complete but will provide a comprehensive overview of the disk's health.

The SMART log (smartctl -x /dev/<device>) can then be attached to a ticket with Exxact to validate an RMA replacement, provided the system is still within the 3-year warranty we provide. If the system is older than that, most drive manufacturers offer a 5-year warranty and can be contacted directly.
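
For attaching to a ticket, the log can be saved to a file named after the device and the date. This is only a sketch: the helper smart_log_name and the smart_<device>_<date>.log naming scheme are assumptions made for illustration, not an Exxact convention.

```shell
# smart_log_name DEVICE DATE
# Prints a log file name such as smart_sda_20191111.log.
# Hypothetical helper; the naming scheme is an assumption.
smart_log_name() {
    printf 'smart_%s_%s.log\n' "$(basename "$1")" "$2"
}

# Usage (requires root and smartmontools):
#   sudo smartctl -x /dev/sda > "$(smart_log_name /dev/sda "$(date +%Y%m%d)")"
```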

