Compared to hard drives, SSDs are remarkably reliable; yet, no storage technology is perfect. Even the latest NVMe SSDs are susceptible to a sudden or gradual breakdown.
Knowing how to spot the signs of an imminent SSD failure, as well as understanding how to troubleshoot a malfunctioning SSD, can mark the difference between permanent data loss and a trouble-free recovery. Like any storage device, an NVMe SSD will eventually fail; the only variable is when. Unlike hard drives, SSDs can't send an audible warning that something may be going wrong. Yet, while the SSD may be dead, all is not necessarily lost.
Here's a look at four leading causes of SSD failure and how to resolve the problems.
1. Heat
While NVMe SSDs are the new kid on the block, the problem that plagues them the most is one of the oldest in computing: heat. "NVMe SSDs can run insanely hot, especially if you're running intense operations like high-level calculations," said Leon Adato, head geek at IT management software and monitoring tools provider SolarWinds. "Even under regular operation, NVMe [SSDs] can generate problem-causing temperatures."
Providing adequate cooling can ensure that the SSD doesn't overheat, keeping it from failing or throttling down to a slower speed. The challenge is finding a way to draw heat away from the drive. There are various approaches to this problem. "You might [use] a big chassis where you can ensure lots of direct external airflow over the chip, or you might be able to install a heat sink, fan or liquid cooling system," Adato said.
Lowering the ambient room temperature can also go a long way toward resolving heat-related SSD issues. "However you approach it, the idea is to do something to increase the cooling and/or reduce the ambient temperature inside the system chassis," Adato said.
2. Firmware failure
SSD firmware is incredibly complex, and many firmware-related SSD failures are corner cases -- problems that occur only outside of normal operating parameters. Fortunately, when a serious firmware problem reveals itself, most SSDs automatically fall into a fail-safe mode. "If the SSD can't guarantee the integrity of the data, generally the vendor implements an 'assert' or other failure mode where they take the namespace offline or put it in read-only mode to protect the host software from reading bad data," said Jonmichael Hands, senior strategic planner and product manager for Intel and a working group co-chair at NVM Express, the consortium responsible for the development of the NVMe specification.
Firmware problems happen from time to time. Last November, for instance, Hewlett Packard Enterprise issued a customer bulletin warning that its SSD Firmware Version HPD8 needed a critical fix. Organizations that fail to apply the fix will see affected drives fail at 32,768 hours of operating time. In other words, after exactly 3 years, 270 days and eight hours of powered-on time, all the data stored on the drive will be lost.
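The conversion from operating hours to calendar time is easy to check. The short Python sketch below is only an illustration: the function name is made up for this example, and it assumes 365-day years with no leap days. It also prints the fact that 32,768 equals 2 to the 15th power; the bulletin as described here does not state the root cause, so no conclusion about the underlying bug should be drawn from that alone.

```python
def split_power_on_hours(hours: int, days_per_year: int = 365) -> tuple[int, int, int]:
    """Break a power-on-hour count into (years, days, hours).

    Assumes every year has `days_per_year` days (leap days are ignored).
    """
    total_days, leftover_hours = divmod(hours, 24)
    years, leftover_days = divmod(total_days, days_per_year)
    return years, leftover_days, leftover_hours


FAILURE_HOURS = 32_768  # figure quoted in the article

years, days, hours = split_power_on_hours(FAILURE_HOURS)
print(f"{FAILURE_HOURS} hours = {years} years, {days} days, {hours} hours")
print("Equals 2**15:", FAILURE_HOURS == 2 ** 15)
```

Comparing a drive's reported power-on hours against a known failure threshold in this way can help decide how urgently a firmware fix needs to be scheduled.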
3. Misuse
The most common form of SSD misuse is wearing out a drive prematurely because it wasn't properly matched to the data center workload. "For instance, a [quad-level cell] drive with lower endurance is meant for scale-out storage or object storage, not for use as a cache drive with a high amount of random writes," Hands said.
Fortunately, endurance can be accurately predicted and modeled, so it's easy to plan ahead to mitigate SSD failure. "Know what DWPD [drive writes per day] and TBW [terabytes written] your SSD supports," Hands said. "Model your workload and figure out which SSD is best." Tools such as Intel's SSD Endurance Estimator can help predict a drive's wear-out date.
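As a rough illustration of that kind of modeling, the Python sketch below relates DWPD to TBW and projects a wear-out date from an average daily write volume. It is not how Intel's estimator or any vendor tool works; the function names and the sample figures are assumptions for this example, and it ignores write amplification, which can make real wear faster than host writes alone suggest.

```python
from datetime import date, timedelta


def rated_tbw(dwpd: float, capacity_tb: float, warranty_years: float) -> float:
    """Convert a DWPD rating into total terabytes written over the warranty period."""
    return dwpd * capacity_tb * 365 * warranty_years


def projected_wear_out(tbw_rating: float, tb_written_so_far: float,
                       tb_written_per_day: float, today: date) -> date:
    """Estimate the date the rated TBW is reached at a constant daily write rate."""
    if tb_written_per_day <= 0:
        raise ValueError("tb_written_per_day must be positive")
    remaining_tb = max(tbw_rating - tb_written_so_far, 0.0)
    return today + timedelta(days=int(remaining_tb / tb_written_per_day))


# Hypothetical drive: 2 TB capacity, rated 1 DWPD over a 5-year warranty.
rating = rated_tbw(dwpd=1.0, capacity_tb=2.0, warranty_years=5)
print(f"Rated endurance: {rating:.0f} TBW")

# Hypothetical workload: 650 TB already written, 3 TB written per day.
print("Projected wear-out:", projected_wear_out(rating, 650.0, 3.0, date.today()))
```

In this made-up case the workload writes 1.5 times the drive's capacity each day, more than the 1 DWPD rating, which is the kind of mismatch Hands warns about.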
4. Lurking problems
SSD problems usually don't become apparent until they begin causing major trouble. The sooner you know there's a problem, the faster you can respond to the situation and minimize the impact. "Make sure you use hardware monitoring software to track ... components for I/O speed, bad blocks and other failure modes so you know as soon as possible [when] something is going south," Adato said.
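The sketch below shows, in Python, the basic shape of such a check: compare a health reading against thresholds and flag anything out of range. It is a simplified illustration, not a replacement for monitoring software. The field names are hypothetical, loosely modeled on the kinds of values NVMe drives report in their health logs, and the default thresholds are assumptions that should be replaced with the drive vendor's guidance.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """One reading of drive health; field names are hypothetical."""
    temperature_c: float      # drive temperature in degrees Celsius
    percentage_used: int      # estimate of rated endurance consumed
    available_spare_pct: int  # remaining spare capacity, in percent
    media_errors: int         # count of unrecovered data integrity errors
    read_only: bool           # True if the drive has fallen back to read-only mode


def find_warnings(sample: HealthSample, max_temp_c: float = 70.0,
                  max_used_pct: int = 90, min_spare_pct: int = 10) -> list[str]:
    """Return a human-readable warning for each reading outside its threshold."""
    warnings = []
    if sample.temperature_c >= max_temp_c:
        warnings.append(f"temperature {sample.temperature_c:.0f} C is at or above {max_temp_c:.0f} C")
    if sample.percentage_used >= max_used_pct:
        warnings.append(f"endurance used {sample.percentage_used}% is at or above {max_used_pct}%")
    if sample.available_spare_pct <= min_spare_pct:
        warnings.append(f"available spare {sample.available_spare_pct}% is at or below {min_spare_pct}%")
    if sample.media_errors > 0:
        warnings.append(f"{sample.media_errors} media errors reported")
    if sample.read_only:
        warnings.append("drive is in read-only mode; copy data off now")
    return warnings


# Hypothetical reading from a drive that is running hot.
for message in find_warnings(HealthSample(82.0, 12, 100, 0, False)):
    print("WARNING:", message)
```

Running a check like this on a schedule, and alerting on the first warning rather than waiting for a failure, is one way to learn early that something is going south.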
Adato noted that it's also important to create a business environment in which end users can feel comfortable about reporting an SSD-based system that's behaving poorly, suboptimally or strangely. "IT needs to know about a failure quickly, and fixing it faster is far more important than finding a guilty party to blame," he said.
When it comes to SSD failure, addressing problems quickly is key to preventing too much damage. "The best you can hope for is a loss of the ability to write to the drive, but retaining the ability to read from it," Adato said. "Thus, you can pull all your data [to another drive] before sending the unit to the scrap heap."