S.M.A.R.T.


S.M.A.R.T. is a monitoring system included in computer hard disk drives, solid-state drives, and eMMC drives. Its primary function is to detect and report various indicators of drive reliability with the intent of anticipating imminent hardware failures.
When S.M.A.R.T. data indicates a possible imminent drive failure, software running on the host system may notify the user so that action can be taken to prevent data loss and the failing drive can be replaced while data integrity is maintained.

Background

Hard disk and other storage drives are subject to failures which can be classified within two basic classes:
Mechanical failures account for about 60% of all drive failures.
While the eventual failure may be catastrophic, most mechanical failures result from gradual wear and there are usually certain indications that failure is imminent. These may include increased heat output, increased noise level, problems with reading and writing of data, or an increase in the number of damaged disk sectors.
PCTechGuide's page on S.M.A.R.T. comments that the technology has gone through three phases:

Accuracy

A field study at Google covering over 100,000 consumer-grade drives from December 2005 to August 2006 found correlations between certain S.M.A.R.T. information and annualized failure rates:
An early hard disk monitoring technology was introduced by IBM in 1992 in its IBM 9337 Disk Arrays for AS/400 servers using IBM 0662 SCSI-2 disk drives. It was later named Predictive Failure Analysis technology. It measured several key device health parameters and evaluated them within the drive firmware. Communications between the physical unit and the monitoring software were limited to a binary result: either "device is OK" or "drive is likely to fail soon".
Later, another variant, which was named IntelliSafe, was created by computer manufacturer Compaq and disk drive manufacturers Seagate, Quantum, and Conner. The disk drives would measure the disk's "health parameters", and the values would be transferred to the operating system and user-space monitoring software. Each disk drive vendor was free to decide which parameters were to be included for monitoring, and what their thresholds should be. The unification was at the protocol level with the host.
Compaq submitted IntelliSafe to the Small Form Factor committee for standardization in early 1995. It was supported by IBM, by Compaq's development partners Seagate, Quantum, and Conner, and by Western Digital, which did not have a failure prediction system at the time. The Committee chose IntelliSafe's approach, as it provided more flexibility. Compaq placed IntelliSafe into the public domain. The resulting jointly developed standard was named S.M.A.R.T.
That SFF standard described a communication protocol for an ATA host to use and control monitoring and analysis in a hard disk drive, but did not specify any particular metrics or analysis methods. Later, "S.M.A.R.T." came to be understood to refer to a variety of specific metrics and methods and to apply to protocols unrelated to ATA for communicating the same kinds of things.

Provided information

The technical documentation for S.M.A.R.T. is in the AT Attachment standard. First introduced in 2004, it has undergone regular revisions, the latest being in 2011. Standardization of similar features on SCSI is scarcer, and the SCSI standards do not use the name, although vendors and consumers alike refer to these similar features as S.M.A.R.T. too.
The most basic information that S.M.A.R.T. provides is the S.M.A.R.T. status. It provides only two values: "threshold not exceeded" and "threshold exceeded". Often these are represented as "drive OK" or "drive fail" respectively. A "threshold exceeded" value is intended to indicate that there is a relatively high probability that the drive will not be able to honor its specification in the future: that is, the drive is "about to fail". The predicted failure may be catastrophic or may be something as subtle as the inability to write to certain sectors, or perhaps slower performance than the manufacturer's declared minimum.
The S.M.A.R.T. status does not necessarily indicate the drive's past or present reliability. If a drive has already failed catastrophically, the S.M.A.R.T. status may be inaccessible. Alternatively, if a drive has experienced problems in the past, but the sensors no longer detect such problems, the S.M.A.R.T. status may, depending on the manufacturer's programming, suggest that the drive is now healthy.
The inability to read some sectors is not always an indication that a drive is about to fail. One way that unreadable sectors may be created, even when the drive is functioning within specification, is through a sudden power failure while the drive is writing. Also, even if the physical disk is damaged at one location, such that a certain sector is unreadable, the disk may be able to use spare space to replace the bad area, so that the sector can be overwritten.
More detail on the health of the drive may be obtained by examining the S.M.A.R.T. Attributes. S.M.A.R.T. Attributes were included in some drafts of the ATA standard, but were removed before the standard became final. The meaning and interpretation of the attributes vary between manufacturers, and are sometimes considered a trade secret by one manufacturer or another. Attributes are further discussed below.
Drives with S.M.A.R.T. may optionally maintain a number of 'logs'. The error log records information about the most recent errors that the drive has reported back to the host computer. Examining this log may help one to determine whether computer problems are disk-related or caused by something else.
A drive that implements S.M.A.R.T. may optionally implement a number of self-test or maintenance routines, and the results of the tests are kept in the self-test log. The self-test routines may be used to detect any unreadable sectors on the disk, so that they may be restored from back-up sources. This helps to reduce the risk of incurring permanent loss of data.

Standards and implementation

Lack of common interpretation

Many motherboards display a warning message when a disk drive is approaching failure. Although an industry standard exists among most major hard drive manufacturers, issues remain due to attributes intentionally left undocumented to the public in order to differentiate models between manufacturers.
From a legal perspective, the term "S.M.A.R.T." refers only to a signaling method between internal disk drive electromechanical sensors and the host computer. Because of this, the specifics of a S.M.A.R.T. implementation are left to the vendor: while many attributes have been standardized between drive vendors, others remain vendor-specific. S.M.A.R.T. implementations still differ, and in some cases may lack "common" or expected features such as a temperature sensor, or may include only a few select attributes, while still allowing the manufacturer to advertise the product as "S.M.A.R.T. compatible."

Visibility to host systems

Depending on the type of interface being used, some S.M.A.R.T.-enabled motherboards and related software may not communicate with certain S.M.A.R.T.-capable drives. For example, few external drives connected via USB and FireWire correctly send S.M.A.R.T. data over those interfaces. With so many ways to connect a hard drive, it is difficult to predict whether S.M.A.R.T. reports will function correctly in a given system.
Even with a hard drive and interface that implements the specification, the computer's operating system may not see the S.M.A.R.T. information because the drive and interface are encapsulated in a lower layer. For example, they may be part of a RAID subsystem in which the RAID controller sees the S.M.A.R.T.-capable drive, but the host computer sees only a logical volume generated by the RAID controller.
On the Windows platform, many programs designed to monitor and report S.M.A.R.T. information will function only under an administrator account.

Access

For a list of various programs that allow reading of S.M.A.R.T. data, see Comparison of S.M.A.R.T. tools.

ATA S.M.A.R.T. attributes

Each drive manufacturer defines a set of attributes and sets threshold values beyond which attributes should not pass under normal operation. Each attribute has three values: a raw value, which can be decimal or hexadecimal and whose meaning is entirely up to the drive manufacturer; a normalized value, which ranges from 1 to 253; and a worst value, which represents the lowest recorded normalized value. The initial default value of attributes is 100 but can vary between manufacturers.
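The relationship between the normalized value, the worst value, and the threshold can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical, and the comparison rule (an attribute counts as failing once its normalized value drops to or below the threshold) is an assumption based on the convention that higher normalized values are better. Real drives may also distinguish between attribute types when deriving the overall status, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One ATA S.M.A.R.T. attribute (field names are illustrative only)."""
    attr_id: int      # attribute code, e.g. 5 for Reallocated Sectors Count
    name: str
    normalized: int   # current normalized value, 1..253
    worst: int        # lowest normalized value recorded so far
    threshold: int    # manufacturer-defined limit
    raw: int          # vendor-specific raw value

    def update(self, new_normalized: int) -> None:
        """Record a new normalized reading and keep track of the worst one."""
        if not 1 <= new_normalized <= 253:
            raise ValueError("normalized value must be in 1..253")
        self.normalized = new_normalized
        self.worst = min(self.worst, new_normalized)

    def threshold_exceeded(self) -> bool:
        # Assumption: higher normalized values are better, so the attribute
        # is treated as failing once it drops to or below its threshold.
        return self.normalized <= self.threshold


def overall_status(attributes: list[SmartAttribute]) -> str:
    """Collapse the per-attribute checks into the two-valued status."""
    if any(a.threshold_exceeded() for a in attributes):
        return "threshold exceeded"
    return "threshold not exceeded"
```

Note that the worst value never recovers: if a reading later improves, `normalized` rises again but `worst` keeps the lowest value seen.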
Manufacturers that have implemented at least one S.M.A.R.T. attribute in various products include Samsung, Seagate, IBM, Fujitsu, Maxtor, Toshiba, Intel, sTec, Inc., Western Digital and ExcelStor Technology.

Known ATA S.M.A.R.T. attributes

The following chart lists some S.M.A.R.T. attributes and the typical meaning of their raw values. Normalized values are usually mapped so that higher values are better, but higher raw attribute values may be better or worse depending on the attribute and manufacturer. For example, the "Reallocated Sectors Count" attribute's normalized value decreases as the count of reallocated sectors increases. In this case, the attribute's raw value will often indicate the actual count of sectors that were reallocated, although vendors are in no way required to adhere to this convention.
As manufacturers do not necessarily agree on precise attribute definitions and measurement units, the following list of attributes is a general guide only.
Drives do not support all attribute codes. Some codes are specific to particular drive types. Drives may use different codes for the same parameter, e.g., see codes 193 and 225.
ID | Hex | Attribute name | Description
01 | 0x01 | Read Error Rate | Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has a different structure for different vendors and is often not meaningful as a decimal number.
02 | 0x02 | Throughput Performance | Overall throughput performance of a hard disk drive. If the value of this attribute is decreasing, there is a high probability that there is a problem with the disk.
03 | 0x03 | Spin-Up Time | Average time of spindle spin-up.
04 | 0x04 | Start/Stop Count | A tally of spindle start/stop cycles. The spindle turns on, and hence the count is increased, both when the hard disk is turned on after having been turned entirely off and when the hard disk returns from sleep mode.
05 | 0x05 | Reallocated Sectors Count | Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped. Thus, the higher the raw value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive which has had any reallocations at all is significantly more likely to fail in the immediate months.
06 | 0x06 | Read Channel Margin | Margin of a channel while reading data. The function of this attribute is not specified.
07 | 0x07 | Seek Error Rate | Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, then seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. The raw value has a different structure for different vendors and is often not meaningful as a decimal number.
08 | 0x08 | Seek Time Performance | Average performance of seek operations of the magnetic heads. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.
09 | 0x09 | Power-On Hours | Count of hours in power-on state. The raw value of this attribute shows the total count of hours in power-on state. "By default, the total expected lifetime of a hard disk in perfect condition is defined as 5 years. This is equal to 1825 days in 24/7 mode or 43800 hours." On some pre-2005 drives, this raw value may advance erratically and/or "wrap around".
10 | 0x0A | Spin Retry Count | Count of retries of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed. An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
11 | 0x0B | Recalibration Retries or Calibration Retry Count | This attribute indicates the number of times recalibration was requested. An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
12 | 0x0C | Power Cycle Count | This attribute indicates the count of full hard disk power on/off cycles.
13 | 0x0D | Soft Read Error Rate | Uncorrected read errors reported to the operating system.
22 | 0x16 | Current Helium Level | Specific to He8 drives from HGST. This value measures the helium inside the drive and is specific to this manufacturer. It is a pre-fail attribute that trips once the drive detects that the internal environment is out of specification.
170 | 0xAA | Available Reserved Space | See attribute 232 (0xE8).
171 | 0xAB | SSD Program Fail Count | The total number of flash program operation failures since the drive was deployed. Identical to attribute 181.
172 | 0xAC | SSD Erase Fail Count | The total number of flash erase operation failures since the drive was deployed. Identical to attribute 182.
173 | 0xAD | SSD Wear Leveling Count | Counts the maximum worst erase count on any block.
174 | 0xAE | Unexpected Power Loss Count | Also known as "Power-off Retract Count" per conventional HDD terminology. Raw value reports the number of unclean shutdowns, cumulative over the life of an SSD, where an "unclean shutdown" is the removal of power without STANDBY IMMEDIATE as the last command. Normalized value is always 100.
175 | 0xAF | Power Loss Protection Failure | Last test result as microseconds to discharge cap, saturated at its maximum value. Also logs minutes since last test and lifetime number of tests. Raw value contains the following data: bytes 0-1: last test result as microseconds to discharge cap, saturates at max value (test result expected in range 25 <= result <= 5000000, lower indicates specific error code); bytes 2-3: minutes since last test, saturates at max value; bytes 4-5: lifetime number of tests, not incremented on power cycle, saturates at max value. Normalized value is set to 1 on test failure or 11 if the capacitor has been tested in an excessive temperature condition, otherwise 100.
176 | 0xB0 | Erase Fail Count | Indicates the number of flash erase command failures.
177 | 0xB1 | Wear Range Delta | Delta between most-worn and least-worn flash blocks. It describes, in technical terms, how well or badly the wear leveling of the SSD works.
179 | 0xB3 | Used Reserved Block Count Total | "Pre-Fail" attribute used at least in Samsung devices.
180 | 0xB4 | Unused Reserved Block Count Total | "Pre-Fail" attribute used at least in HP devices.
181 | 0xB5 | Program Fail Count Total or Non-4K Aligned Access Count | Total number of flash program operation failures since the drive was deployed, or the number of user data accesses where LBAs are not 4 KiB aligned or where the size is not a multiple of 4 KiB, assuming logical block size = 512 B.
182 | 0xB6 | Erase Fail Count | "Pre-Fail" attribute used at least in Samsung devices.
183 | 0xB7 | SATA Downshift Error Count or Runtime Bad Block | Western Digital, Samsung or Seagate attribute: either the number of downshifts of link speed or the total number of data blocks with detected, uncorrectable errors encountered during normal operation. Although degradation of this parameter can be an indicator of drive aging and/or potential electromechanical problems, it does not directly indicate imminent drive failure.
184 | 0xB8 | End-to-End error / IOEDC | This attribute is a part of Hewlett-Packard's SMART IV technology, as well as part of other vendors' IO Error Detection and Correction schemas, and it contains a count of parity errors which occur in the data path to the media via the drive's cache RAM.
185 | 0xB9 | Head Stability | Western Digital attribute.
186 | 0xBA | Induced Op-Vibration Detection | Western Digital attribute.
187 | 0xBB | Reported Uncorrectable Errors | The count of errors that could not be recovered using hardware ECC.
188 | 0xBC | Command Timeout | The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero.
189 | 0xBD | High Fly Writes | HDD manufacturers implement a flying height sensor that attempts to provide additional protections for write operations by detecting when a recording head is flying outside its normal operating range. If an unsafe fly height condition is encountered, the write process is stopped, and the information is rewritten or reallocated to a safe region of the hard drive. This attribute indicates the count of these errors detected over the lifetime of the drive. This feature is implemented in most modern Seagate drives and some of Western Digital's drives, beginning with the WD Enterprise WDE18300 and WDE9180 Ultra2 SCSI hard drives, and will be included on all future WD Enterprise products.
190 | 0xBE | Temperature Difference or Airflow Temperature | Value is equal to (100 - temperature in °C), allowing the manufacturer to set a minimum threshold which corresponds to a maximum temperature. This also follows the convention of 100 being a best-case value and lower values being undesirable. However, some older drives may instead report raw Temperature or Temperature minus 50 here.
191 | 0xBF | G-sense Error Rate | The count of errors resulting from externally induced shock and vibration.
192 | 0xC0 | Power-off Retract Count, Emergency Retract Cycle Count, or Unsafe Shutdown Count | Number of power-off or emergency retract cycles.
193 | 0xC1 | Load Cycle Count or Load/Unload Cycle Count | Count of load/unload cycles into head landing zone position. Some drives use 225 for Load Cycle Count instead. Western Digital rates their VelociRaptor drives for 600,000 load/unload cycles, and WD Green drives for 300,000 cycles; the latter are designed to unload heads often to conserve power. On the other hand, the WD3000GLFS is specified for only 50,000 load/unload cycles. Some laptop drives and "green power" desktop drives are programmed to unload the heads whenever there has not been any activity for a short period, to save power. Operating systems often access the file system a few times a minute in the background, causing 100 or more load cycles per hour if the heads unload: the load cycle rating may be exceeded in less than a year. There are programs for most operating systems that disable the Advanced Power Management and Automatic Acoustic Management features causing frequent load cycles.
194 | 0xC2 | Temperature or Temperature Celsius | Indicates the device temperature, if the appropriate sensor is fitted. Lowest byte of the raw value contains the exact temperature value.
195 | 0xC3 | Hardware ECC Recovered | The raw value has a different structure for different vendors and is often not meaningful as a decimal number.
196 | 0xC4 | Reallocation Event Count | Count of remap operations. The raw value of this attribute shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.
197 | 0xC5 | Current Pending Sector Count | Count of "unstable" sectors. If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately; instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it is written. However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector, and if the write operation is successful then the sector will be marked good. This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.
198 | 0xC6 | Uncorrectable Sector Count | The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
199 | 0xC7 | UltraDMA CRC Error Count | The count of errors in data transfer via the interface cable as determined by ICRC.
200 | 0xC8 | Multi-Zone Error Rate | The count of errors found when writing a sector. The higher the value, the worse the disk's mechanical condition is.
200 | 0xC8 | Write Error Rate | The total count of errors when writing a sector.
201 | 0xC9 | Soft Read Error Rate or TA Counter Detected | Count indicates the number of uncorrectable software read errors.
202 | 0xCA | Data Address Mark errors or TA Counter Increased | Count of Data Address Mark errors.
203 | 0xCB | Run Out Cancel | The number of errors caused by incorrect checksum during the error correction.
204 | 0xCC | Soft ECC Correction | Count of errors corrected by the internal error correction software.
205 | 0xCD | Thermal Asperity Rate | Count of errors due to high temperature.
206 | 0xCE | Flying Height | Height of heads above the disk surface. If too low, head crash is more likely; if too high, read/write errors are more likely.
207 | 0xCF | Spin High Current | Amount of surge current used to spin up the drive.
208 | 0xD0 | Spin Buzz | Count of buzz routines needed to spin up the drive due to insufficient power.
209 | 0xD1 | Offline Seek Performance | Drive's seek performance during its internal tests.
210 | 0xD2 | Vibration During Write | Found in Maxtor 6B200M0 200GB and Maxtor 2R015H1 15GB disks.
211 | 0xD3 | Vibration During Write | A recording of a vibration encountered during write operations.
212 | 0xD4 | Shock During Write | A recording of shock encountered during write operations.
220 | 0xDC | Disk Shift | Distance the disk has shifted relative to the spindle. Unit of measure is unknown.
221 | 0xDD | G-Sense Error Rate | The count of errors resulting from externally induced shock and vibration.
222 | 0xDE | Loaded Hours | Time spent operating under data load.
223 | 0xDF | Load/Unload Retry Count | Count of times head changes position.
224 | 0xE0 | Load Friction | Resistance caused by friction in mechanical parts while operating.
225 | 0xE1 | Load/Unload Cycle Count | Total count of load cycles. Some drives use 193 for Load Cycle Count instead. See the description for 193 for the significance of this number.
226 | 0xE2 | Load 'In'-time | Total time of loading on the magnetic heads actuator.
227 | 0xE3 | Torque Amplification Count | Count of attempts to compensate for platter speed variations.
228 | 0xE4 | Power-Off Retract Cycle | The number of power-off cycles which are counted whenever there is a "retract event" and the heads are loaded off of the media, such as when the machine is powered down, put to sleep, or is idle.
230 | 0xE6 | GMR Head Amplitude, Drive Life Protection Status | Amplitude of "thrashing". In solid-state drives, indicates whether usage trajectory is outpacing the expected life curve.
231 | 0xE7 | Life Left or Temperature | Indicates the approximate SSD life left, in terms of program/erase cycles or available reserved blocks. A normalized value of 100 represents a new drive, with a threshold value at 10 indicating a need for replacement. A value of 0 may mean that the drive is operating in read-only mode to allow data recovery. Previously occasionally used for Drive Temperature.
232 | 0xE8 | Endurance Remaining or Available Reserved Space | Number of physical erase cycles completed on the SSD as a percentage of the maximum physical erase cycles the drive is designed to endure. Intel SSDs report the available reserved space as a percentage of the initial reserved space.
233 | 0xE9 | Media Wearout Indicator or Power-On Hours | Intel SSDs report a normalized value from 100, a new drive, to a minimum of 1. It decreases while the NAND erase cycles increase from 0 to the maximum-rated cycles. Previously occasionally used for Power-On Hours.
234 | 0xEA | Average Erase Count and Maximum Erase Count | Decoded as: bytes 0-1-2 = average erase count and bytes 3-4-5 = max erase count.
235 | 0xEB | Good Block Count and System Block Count | Decoded as: bytes 0-1-2 = good block count and bytes 3-4 = system block count.
240 | 0xF0 | Head Flying Hours or 'Transfer Error Rate' | Time spent during the positioning of the drive heads. Some Fujitsu drives report the count of link resets during a data transfer.
241 | 0xF1 | Total LBAs Written | Total count of LBAs written.
242 | 0xF2 | Total LBAs Read | Total count of LBAs read. Some S.M.A.R.T. utilities will report a negative number for the raw value since in reality it has 48 bits rather than 32.
243 | 0xF3 | Total LBAs Written Expanded | The upper 5 bytes of the 12-byte total number of LBAs written to the device. The lower 7-byte value is located at attribute 0xF1.
244 | 0xF4 | Total LBAs Read Expanded | The upper 5 bytes of the 12-byte total number of LBAs read from the device. The lower 7-byte value is located at attribute 0xF2.
249 | 0xF9 | NAND Writes | Total NAND writes. Raw value reports the number of writes to NAND in 1 GB increments.
250 | 0xFA | Read Error Retry Rate | Count of errors while reading from a disk.
251 | 0xFB | Minimum Spares Remaining | Indicates the number of remaining spare blocks as a percentage of the total number of spare blocks available.
252 | 0xFC | Newly Added Bad Flash Block | Indicates the total number of bad flash blocks the drive detected since it was first initialized in manufacturing.
254 | 0xFE | Free Fall Protection | Count of "Free Fall Events" detected.
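
Several of the descriptions above pack more than one field into the raw value, or note that tools can misread it. The following Python sketch shows how such raw values might be unpacked. It treats the raw value as a 48-bit integer and assumes that "byte 0" is the least significant byte; that byte order, and all function names, are assumptions made for illustration, and actual layouts are vendor-specific.

```python
def raw_to_bytes(raw: int) -> bytes:
    """Return the 48-bit raw value as 6 bytes, byte 0 first.

    Assumption: byte 0 is the least significant byte (little-endian).
    """
    return raw.to_bytes(6, "little")


def decode_erase_counts(raw: int) -> tuple[int, int]:
    """Attribute 234: bytes 0-2 = average erase count, bytes 3-5 = max erase count."""
    b = raw_to_bytes(raw)
    average = int.from_bytes(b[0:3], "little")
    maximum = int.from_bytes(b[3:6], "little")
    return average, maximum


def decode_temperature(raw: int) -> int:
    """Attribute 194: the lowest byte of the raw value holds the temperature."""
    return raw & 0xFF


def as_signed_32(raw: int) -> int:
    """Mimic a utility that wrongly reads the raw value as a signed 32-bit integer.

    This reproduces the negative numbers mentioned for attribute 242.
    """
    low = raw & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low
```

For example, under these assumptions a raw value built as `(500 << 24) | 120` decodes with `decode_erase_counts` to an average erase count of 120 and a maximum of 500, and any raw value whose low 32 bits have the top bit set comes out negative from `as_signed_32`.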

Threshold Exceeds Condition

Threshold Exceeds Condition (TEC) is an estimated date when a critical drive statistic attribute will reach its threshold value. When Drive Health software reports a "Nearest T.E.C.", it should be regarded as a "failure date". Sometimes no date is given, in which case the drive can be expected to work without errors.
To predict the date, the drive tracks the rate at which the attribute changes. Note that TEC dates are only estimates; hard drives can and do fail much sooner or much later than the TEC date.
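One simple way to turn a rate of change into a date is linear extrapolation. The sketch below is an illustration only: the text does not say which prediction method is actually used, and the function name, the sample format, and the choice of a straight line through the first and last samples are all assumptions.

```python
from datetime import date, timedelta
from typing import Optional


def estimate_tec(samples: list[tuple[date, int]], threshold: int) -> Optional[date]:
    """Linearly extrapolate when a normalized value will reach its threshold.

    samples: (date, normalized value) pairs, oldest first.
    Returns None when no date can be given (too little data, or the
    value is not declining).
    """
    if len(samples) < 2:
        return None
    (first_day, first_value), (last_day, last_value) = samples[0], samples[-1]
    if last_value <= threshold:
        return last_day  # threshold already reached
    elapsed_days = (last_day - first_day).days
    if elapsed_days <= 0:
        return None
    rate = (last_value - first_value) / elapsed_days  # change per day
    if rate >= 0:
        return None  # stable or improving: no TEC date
    days_left = (last_value - threshold) / -rate
    return last_day + timedelta(days=round(days_left))
```

A value that fell from 100 to 95 over ten days, with a threshold of 90, would be projected to reach the threshold ten days after the last sample. As the text notes, such a date is only an estimate.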

Self-tests

S.M.A.R.T. drives may offer a number of self-tests:
  • Short
  • Long/extended
  • Conveyance
  • Selective
The self-test logs for SCSI and ATA drives are slightly different. It is possible for the long test to pass even if the short test fails.
The drive's self-test log can contain up to 21 read-only entries. When the log is filled, old entries are removed.
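The fixed-size behavior of the self-test log can be modeled with a bounded queue. The sketch below is a host-side illustration, not the drive's actual on-disk log format; the class name and the entry fields are hypothetical, and only the 21-entry capacity and the discarding of old entries come from the text.

```python
from collections import deque

MAX_ENTRIES = 21  # log capacity stated in the text


class SelfTestLog:
    """Bounded log: once full, appending drops the oldest entry."""

    def __init__(self) -> None:
        self._entries: deque = deque(maxlen=MAX_ENTRIES)

    def record(self, test_type: str, passed: bool) -> None:
        """Append one self-test result, e.g. ("short", True)."""
        self._entries.append((test_type, passed))

    def entries(self) -> list:
        """Return the stored results, newest first."""
        return list(reversed(self._entries))
```

After 25 recorded tests, only the 21 most recent remain; the first four have been discarded.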