A Google study presented at the ongoing Conference on File and Storage Technologies questions these traditional failure explanations and concludes that many more factors affect the life expectancy of a hard drive and that predicting failures is much more complex than previously thought. What makes the study interesting is that Google's server infrastructure is estimated to exceed 450,000 fairly mainstream systems, many of which use consumer-grade drives with capacities ranging from 80 to 400 GB. According to the company, the project covered "more than 100,000" drives that were put into production in or after 2001. The drives ran at platter rotation speeds of 5400 and 7200 rpm and came from "many of the largest disk drive manufacturers and from at least nine different models."
The bottom line is that temperature and high usage alone aren't responsible for failures by default, but I'm not so sure this also holds for desktop PCs. Servers usually run 24 hours a day, 7 days a week, and I expect the temperature variation of their hard drives to be rather small. In desktop PCs this isn't the case: when your system is off the hard drives may sit at 20°C, and during heavy load they might reach 40°C to 55°C depending on your case's cooling. Perhaps high temperatures aren't really a problem, but big spikes in HDD temperature might be.
Google said that it collects "vital information" about all of its systems every few minutes and stores the data for further analysis. This information includes environmental factors (such as temperatures), activity levels and SMART (Self-Monitoring, Analysis and Reporting Technology) parameters, which are commonly considered good indicators of disk drive health.
In general, Google's hard drive population saw a failure rate that increased with the age of the drive. Within the group of hard drives up to one year old, 1.7% of the devices had to be replaced due to failure. The rate jumps to 8% in year 2 and 8.6% in year 3. The failure rate levels out thereafter, but Google believes that the reliability of drives older than 4 years is influenced more by "the particular models in that vintage than by disk drive aging effects."
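To put these yearly figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, and this is my assumption rather than something the study states, that each yearly rate applies only to the drives that survived the previous years. The names `annual_failure_rate` and `cumulative_failure` are made up for this illustration.

```python
# Yearly replacement rates quoted above, keyed by drive age in years.
annual_failure_rate = {1: 0.017, 2: 0.08, 3: 0.086}


def cumulative_failure(rates):
    """Fraction of drives that fail at some point during the given years.

    Assumption (not from the study): each yearly rate applies only to
    drives that survived all earlier years.
    """
    surviving = 1.0
    for year in sorted(rates):
        surviving *= 1.0 - rates[year]
    return 1.0 - surviving


three_year = cumulative_failure(annual_failure_rate)
print(f"Estimated share of drives failing within three years: {three_year:.1%}")
```

Under that assumption roughly one drive in six would need replacing within its first three years.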
Breaking out different levels of utilization, the Google study shows an interesting result. Only drives aged six months or younger show a decidedly higher probability of failure when put into a high-activity environment. Once a drive survives its first months, the probability of failure due to high usage decreases in years 1, 2, 3 and 4, and increases significantly in year 5. Google's temperature research found an equally surprising result: "Failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend," the authors of the study found.
In contrast, the company discovered that certain SMART parameters apparently do correlate with drive failures. For example, drives typically scan the disk surface in the background and report errors as they discover them. Significant scan errors can hint at surface defects, and Google reports that fewer than 2% of its drives show scan errors. However, drives with scan errors turned out to be ten times more likely to fail than drives without scan errors. About 70% of Google's drives with scan errors survived the first eight months after the first scan error was reported.
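As a rough illustration of what these percentages could mean in absolute numbers, the sketch below applies them to a fleet of 100,000 drives. That fleet size is only the lower bound quoted for the study, 2% is the upper bound for drives with scan errors, and I am assuming that the roughly 30% of those drives that did not survive eight months all count as failures. All variable names are made up for this example.

```python
# Illustrative inputs; see the assumptions in the paragraph above.
fleet_size = 100_000            # "more than 100,000" drives, used as a lower bound
scan_error_share = 0.02         # "fewer than 2%" of drives, used as an upper bound
survival_after_8_months = 0.70  # share of scan-error drives still alive after 8 months

# Upper bound on the number of drives that report scan errors.
drives_with_scan_errors = round(fleet_size * scan_error_share)

# Of those, the ones that do not survive the following eight months.
failed_within_8_months = round(
    drives_with_scan_errors * (1.0 - survival_after_8_months)
)

print(f"Drives with scan errors (at most): {drives_with_scan_errors}")
print(f"Of those, failed within eight months (approx.): {failed_within_8_months}")
```

With these inputs that comes to at most 2,000 drives with scan errors, about 600 of which would fail within eight months of the first reported error.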
Temperature and usage barely have an effect on hard drive failure
Posted on Monday, February 19 2007 @ 00:15:25 CET by Thomas De Maesschalck