[HN Gopher] SSD will fail at 40k power-on hours (2021)
___________________________________________________________________
 
SSD will fail at 40k power-on hours (2021)
 
Author : dredmorbius
Score  : 143 points
Date   : 2022-07-10 19:36 UTC (3 hours ago)
 
web link (www.cisco.com)
w3m dump (www.cisco.com)
 
| lucb1e wrote:
| Check your power-on hours:
| 
|     $ sudo smartctl -a /dev/sda | grep -e Power_On_Hours -e ^ID
|     ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
|       9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9743
| 
| Just looking at the raw value, it seems to be 9'743 hours in my
| case
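
A minimal sketch of automating the check above, assuming the ATA-style attribute table shown in the comment. The function names and the `FAILURE_HOURS` constant are illustrative, not part of smartctl; SAS drives (the ones this bug reportedly affects) print power-on time in a different format that this sketch does not handle.

```python
import re
from typing import Optional

# Threshold from the advisory's title; it applies only to affected drives
# running unpatched firmware (assumption for illustration).
FAILURE_HOURS = 40_000


def parse_power_on_hours(smartctl_output: str) -> Optional[int]:
    """Return the raw Power_On_Hours value from `smartctl -a` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Power_On_Hours":
            # Keep only the leading digits of the raw value.
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def hours_remaining(power_on_hours: int, limit: int = FAILURE_HOURS) -> int:
    """Hours left before the counter reaches the limit (negative if past it)."""
    return limit - power_on_hours


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9743
"""

hours = parse_power_on_hours(sample)
print(hours, hours_remaining(hours))
```

To use it on a live system, the output of the `smartctl` command from the comment could be passed in as a string instead of `sample`.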
 
  | borplk wrote:
  | Mine is above 53,000 hours ... time to check my backups!
 
    | lucb1e wrote:
    | Sounds like you're in the clear for this particular bug...
    | 
    | ...but always check your backups regularly for data that is
    | dear to you!
    | 
    | Protip of the day: that includes things on someone else's
    | server. I remember when Grooveshark went offline from one day
    | to the next and I lost nearly my whole library because I
    | remembered only some artists and had to go through thousands
    | of songs to find which ones I actually liked from them. My
    | browser's localStorage object contained a few playlists but I
    | didn't use those much. Or when 000webhost cancelled my
    | account because I was using the 100MB(?) to back up some
    | files that were most important to me, rather than for actual
    | webhosting (in my defense, I was 15 at the time), and so when
    | I returned from a holiday with my parents with an actual
    | crashed hard drive, that turned double sour. Backing up
    | things from what they now call the "cloud" is something I
    | learned early, as I have virtually no code I wrote before
    | that summer, only some of the music, only essays with WordArt
    | if they were printed, etc.
 
| dredmorbius wrote:
| Possibly related to recent HN issues, see:
| https://news.ycombinator.com/item?id=32031243
 
  | solardev wrote:
  | Wow, thanks for sharing. I didn't realize how closely related
  | they were.
  | 
  | (TLDR For anyone wondering, "recent HN issues" means HN very
  | likely went down yesterday because of this same bug, when two
  | (edit: two pairs, four total) enterprise SSDs with old firmware
  | died after 40,000 hours close together. An admin of HN and its
  | host both like this theory. See details in that thread.)
  | 
  | Edit: If you want to discuss that theory, it's probably better
  | to do it in that other thread directly instead... dang and a
  | person from M5 Hosting (HN's previous host) are both
  | participating there.
 
    | mkl wrote:
    | Not two SSDs, _four_ : two in the main server, and two in the
    | backup server.
 
      | solardev wrote:
      | Thanks for the correction!
 
    | [deleted]
 
    | kazen44 wrote:
    | The chance of two SSDs failing at the same time under normal
    | circumstances is extremely slim, so this bug might well be
    | the cause of this incident.
 
      | MBCook wrote:
      | Especially since one pair was a nearly unused backup server
      | that had a totally different use profile.
 
| DoneWithAllThat wrote:
| What the hell is that ridiculous "bias-free language" claptrap at
| the beginning? Man DEI is seriously out of control.
 
  | UkrainianJew wrote:
  | It's a bug in the human firmware. People seem to have an
  | instinctive and subconscious need for property - out of all
  | things occupying our attention span, being able to arbitrarily
  | change some on a whim.
  | 
  | I think, this instinct is responsible for humans figuring out
  | farming (as in developing the land near you to your liking) and
  | many cultural achievements.
  | 
  | Except, with the information society, our attention is being
  | constantly overwhelmed by the stream of information produced by
  | other people, so this instinct kicks in and makes some people
  | want to control what language others use. I don't think we will
  | see any studies of this soon, but my hunch is that there is
  | reverse correlation between the amount of one's physical
  | property and one's sensitivity to the language and content of
  | others' speech.
  | 
  | Corporations happily abused it, since letting your employees
  | "own" pronouns and acknowledgements is cheaper than paying them
  | enough to own their houses (let alone start competing
  | companies). Now it has spun into a de-facto religion where many
  | people's weight in the society depends on perpetuating (and
  | intensifying) the dogmas. Kinda similar to late USSR where most
  | people didn't believe in communism anymore, but not having a
  | Lenin's room in your office would get you labeled as an
  | American spy.
  | 
  | From what we can learn from the history, it will intensify
  | until the movement splinters into competing factions, that will
  | heavily oppose each other, and will eventually settle on some
  | common ground to avoid continuous mutual damage.
 
  | ParetoOptimal wrote:
  | It's a device to identify certain kinds of people who would
  | have a problem with less loaded language without any loss in
  | clarity.
 
    | rajamaka wrote:
    | I would love to see some examples of Cisco documentation that
    | ever offended anyone.
 
      | mlyle wrote:
      | I miss old Cisco documentation, with IP addresses and
      | router names like SanJose3 and 408 phone numbers on PRIs
      | etc.
 
      | bombcar wrote:
      | It's a warning that the documentation may refer to
      | master/slave or something like that because Cisco cares
      | enough about DEI to update documentation but not enough to
      | actually update out-of-support firmware.
 
      | 13of40 wrote:
      | OK, here's one they need to update:
      | 
      | https://blogs.cisco.com/news/digital-transformation-
      | requires...
      | 
      | The offending text:
      | 
      | Act now by adding "equality, inclusion, and diversity" as
      | an agenda item for your next staff meeting, brownbag, or
      | employee gathering.
      | 
      | What, why?
      | 
      | https://www.upi.com/Odd_News/2013/08/02/In-Seattle-the-
      | terms...
      | 
      | "Brownbag" is offensive in one context, and that means it's
      | offensive in all contexts.
 
    | mancerayder wrote:
    | Other than DEI administrators, trainers and people in
    | positions with DEI in them, who is actually getting offended?
 
      | powerhour wrote:
      | People that have to move their mouse a bit to hit the x
      | button, apparently.
 
  | deigestapo wrote:
  | It's so they can track it, compute metrics, report, etc.
  | 
  | This gets fed into indicators that can be used to boost the ESG
  | (communism) score of publicly-traded companies.
 
  | GuB-42 wrote:
  | Goes well with the legal disclaimer that follows.
  | 
  | The legal or whatever-not-technical department wanted to leave
  | their mark.
 
  | TaylorAlexander wrote:
  | Having a statement like that shows people that they are open to
  | suggestions on improvements. Since a lot of people are not so
  | open to suggestions, it makes sense to me to include this
  | language. They added a little X button so you can close it
  | easily.
 
  | [deleted]
 
  | hn_throwaway_99 wrote:
  | My reaction was "If you want to write some documentation with
  | bias-free language, just write the documentation with bias-free
  | language." Why the need for a long paragraph explaining "Look
  | how great and sensitive we are!"
  | 
  | I understand, and agree with, the desire to use inclusive
  | language, but so much of this has just devolved into
  | performative nonsense.
 
    | mlyle wrote:
    | Else you get questions, like, "why don't you say master/slave
    | like everyone else?!@!!"
 
      | kwhitefoot wrote:
      | At this stage I think such questions can just be ignored.
 
        | MarcoZavala wrote:
 
    | alpb wrote:
    | Saying that and usernames like "DoneWithAllThat" and
    | "hn_throwaway", yeah, it checks out.
 
      | hn_throwaway_99 wrote:
      | Not sure exactly what point you're trying to make, but if
      | it's "the risk of saying anything even _remotely_ critical
      | of DEI tactics is a huge, gargantuan, giant career risk
      | these days", then I wholeheartedly agree.
 
  | 0xbadcafebee wrote:
  | The docs may include "master/slave", and they don't want to get
  | sued or bad PR, so this generic notice says "we don't like bad
  | words but sometimes the industry uses bad words and that's
  | unfortunate". If you click the _Learn More_ link in the
  | paragraph, you'll learn more.
 
    | redeeman wrote:
 
      | deigestapo wrote:
 
      | zorpner wrote:
      | There is -- it's using words other than those, which is
      | both easy and considerate.
 
| civilized wrote:
| It's been over two years since this was first identified... since
| this apparently affected many makes and models of SSDs, it would
| be nice to know if my laptop could be affected and if there's
| anything I could do about it.
 
  | pmoriarty wrote:
  | One thing everyone could and should be doing is backups.
 
    | m0llusk wrote:
    | Two things: Test restores or you don't actually have backups.
    | Just saying.
 
      | chrischen wrote:
      | I got bit by this with iPhone backups. I did a phone trade
      | in and followed the backup before trading in instructions.
      | Problem is after the trade in the backup failed to restore
      | due to an unknown error. The whole manual syncing and
      | backing up with a cable workflow with Apple is super fickle
      | and riddled with bugs.
      | 
      | Luckily I had Time Machine backups of my iOS backups and I
      | managed to avoid losing too much data.
      | 
      | As a sidenote it seems like Apple has pretty much neglected
      | their offline backup and syncing workflow to drive more
      | people to just pay for iCloud storage. Half the time my
      | iPhone takes hours just to get detected by the mac when
      | _plugged in._
 
  | opencl wrote:
  | This will not affect your laptop, all of the models affected by
  | this are enterprise SAS SSDs.
  | 
  | Of course your SSD might have some _other_ firmware bug that
  | would eat your data, all you can do is search for the model
  | number and see if the manufacturer has issued any
  | notices/firmware updates.
 
    | robocat wrote:
    | > This will not affect your laptop
    | 
    | That's just your presumptive opinion, right?
 
      | Sakos wrote:
      | How likely is it that they're using an enterprise SAS SSD
      | in their laptop?
 
| yomkippur wrote:
| Crap, so it's certainly HP laptops. So which laptops are safe
| from this?
 
  | mrkramer wrote:
  | My HP laptop has Toshiba SSD. I'm not sure about other models.
  | But I think only enterprise SSDs are affected.
 
| mistrial9 wrote:
| related topic - leaving SMART control tests ON for a (non-SSD)
| drive, apparently interferes with sleep; the drive will wake up
| to test itself. For some drives, I would prefer that not to
| happen and just stay quiet. Yet, testing for this behavior seems
| elusive -- querying the disk wakes it, and most linux disk tools
| seem unaware of sleep state. I just listen for the disk spinning,
| or notice a long pause before an operation.
 
| onion2k wrote:
| Backblaze have a great blog about things they learn about hard
| drives. It's been going for years, less about firmware issues and
| more about general usage.
| https://www.backblaze.com/blog/backblaze-drive-stats-for-q1-...
 
| usr1106 wrote:
| Cisco is not an SSD manufacturer. They write "industry-wide bug".
| Does that mean that more than one SSD manufacturer is affected
| (because they partially use the same firmware)? Further down they
| mention only Sandisk. Or is "industry-wide" just their newspeak
| for saying any Sandisk of an affected model, regardless of whether
| it is installed in a Cisco box or somewhere else?
 
  | dr_zoidberg wrote:
  | I'm interested here too. I've got a Crucial SSD from 2015
  | that's been on about:
  | 
  | * 100% of 2015-2017, let's add 2 years here
  | 
  | * Aboutish 50% of days since 2018 to 2020
  | 
  | * On and off again (5%?) since then until now.
  | 
  | So it's about 3 years of full use? I'm eyeballing the use here.
  | So it may be close to the numbers that were given, but I'm not
  | sure. Guess I could check the SMART stats to get a precise
  | number and from there decide what to do about it.
  | 
  | Searching a bit it seems it's a well-known bug in "enterprise
  | SSDs"[0, 1] (which my drive certainly isn't) but there aren't
  | any real details about what causes it, other than "a firmware
  | bug".
  | 
  | [0] https://www.servethehome.com/hpe-issues-hpd7-fix-for-ssds-
  | th...
  | 
  | [1] https://www.anandtech.com/show/15673/dell-hpe-updates-
  | for-40...
 
  | dredmorbius wrote:
  | The problem seems to be widely experienced.
  | 
  | The Cisco report turned up in response to a post I'd made of
  | the HN issue on the Fediverse:
  | 
  | https://mastodon.infra.de/@galaxis/108622795822100862
 
| userbinator wrote:
| 40000 (or even 40960) seems an odd number to fail at. 64k or 32k
| would make the cause pretty obvious, but 40000 doesn't seem all
| that round in binary. Perhaps a 12-bit counter incrementing every
| 10h? This is puzzling.
| 
| Of course, I am also entertaining the possibility that no one
| thought they would be in use for this long, which would certainly
| be evidence of planned obsolescence.
 
  | twawaaay wrote:
  | Very strange understanding of the word "evidence".
  | 
  | No sane SSD manufacturer would do such thing on purpose. You do
  | it and you lose business, that's it.
  | 
  | The simplest explanation is that somebody made an honest
  | engineering mistake.
 
    | bayindirh wrote:
    | When you purchase a server (fleet), you get a long warranty
    | with it. Generally 3 to 5 years. So you expect this fleet to
    | stay in service for <=5 years mostly.
    | 
    | Unless you burn through your SSDs, you're very unlikely to
    | hit this event.
    | 
    | When these servers continue to be used and disks all start
    | to fail at the same time, this will obviously stink.
    | 
    | The bathtub curve is not like this. You can _feel_ that.
 
    | fartcannon wrote:
    | Given the power dynamic between a single customer and large
    | corporations, the smart thing to do is to assume malice until
    | proven otherwise. This puts the onus on the corporations and,
    | if we're lucky, creates an environment where they compete
    | with each other to be seen as the most honest. The worst
    | thing that happens is the single customer has to buy an SSD
    | from someone they don't trust.
    | 
    | If we do the opposite, as you say, and assume everything is
    | an honest mistake, that puts pressure on the single customer
    | to prove that the organization with a huge marketing budget
    | is doing something wrong. In this situation, the worst thing
    | that happens is we all get taken advantage of.
    | 
    | Our collective distrust is the only power we have against
    | massive marketing/PR budgets. It doesn't have to be angry, or
    | sour, or cranky, we just collectively need to not take their
    | word until we have a reason to do so.
 
      | charcircuit wrote:
      | Are you seriously saying that by default we should believe
      | they intentionally planned to cause their customers to lose
      | all of their data?
 
        | [deleted]
 
        | alliao wrote:
        | planned obsolescence is quite a thing...?
 
        | dtjb wrote:
        | In some cases, but a product must fulfill its core
        | purpose. If a SSD intentionally dumped data and self
        | destructed at a set time, that would be disastrous for
        | the brand. Same way a car doesn't adopt planned
        | obsolescence by blowing up after 200k miles.
 
        | bayindirh wrote:
        | If a spinning rust can run for ~8 years without any
        | problems, a consumer SSD can hit beyond 40K hours
        | reliably, and everything is checked and tested tens of
        | times because of the complexity of flash storage, I'd get
        | suspicious too.
        | 
        | Also, enterprise drives get firmware updates (regardless
        | of spinning or not), and this firmware is automatically
        | applied via RAID controller, so it could be remedied
        | easily before it got this big if it's an actual error.
 
  | justinsaccount wrote:
  | Someone pointed out on the other thread that it could be 2^57
  | nanoseconds:
  | 
  |     >>> 2**57/10**9/3600
  |     40031.996687737745
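
The two candidate explanations floated in this thread can be compared with a little arithmetic. This is only a sketch of the guesses made by commenters: nothing here is confirmed by the vendor, and `hours_until_overflow` is a hypothetical helper name.

```python
NS_PER_HOUR = 3600 * 10**9


def hours_until_overflow(bits: int, tick_ns: int = 1) -> float:
    """Hours until an unsigned counter of `bits` bits wraps, ticking every tick_ns."""
    return (2**bits) * tick_ns / NS_PER_HOUR


# Guess 1: a 57-bit nanosecond counter wraps after roughly 40,032 hours.
ns_guess = hours_until_overflow(57)

# Guess 2: a 12-bit counter incremented every 10 hours wraps at 40,960 hours.
ten_hour_guess = 2**12 * 10

print(round(ns_guess), ten_hour_guess)
print(round(ns_guess / 24 / 365.25, 2), "years for the nanosecond guess")
```

Of the two, the nanosecond guess lands closer to the 40,000-hour figure in the advisory title, which is presumably why it attracted more discussion.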
 
    | AaronFriel wrote:
    | If it were 53, I'd wonder "are they storing the time in the
    | integer part of a double precision float?" That wouldn't go
    | negative, it'd just start absorbing increments without
    | changing the value.
    | 
    | Though that might cause a divide by zero?
    | 
    | What could cause unexpected behavior at 57 bits?
    | 
    | Perhaps storing fractions of an hour, like incrementing it
    | every 1/16th of an hour and calculating a relative rate of
    | change, causing a divide by zero?
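
The "absorbing increments" behaviour described above is easy to reproduce with ordinary IEEE 754 doubles. This is a sketch of the hypothetical 53-bit scenario only; the comment itself notes that it does not match the 57-bit figure, and `tick` is an illustrative name.

```python
def tick(counter: float) -> float:
    """Advance a counter stored as a double by one unit."""
    return counter + 1.0


# Below 2**53 every integer is exactly representable, so increments work.
below = float(2**53 - 1)
assert tick(below) == float(2**53)

# At 2**53 the spacing between adjacent doubles becomes 2, so adding 1
# rounds back to the same value: the increment is silently absorbed.
limit = float(2**53)
stuck = tick(limit)
elapsed = stuck - limit

# Code that divides by the elapsed time between two readings now fails.
try:
    rate = 1.0 / elapsed
except ZeroDivisionError:
    rate = None

print(stuck == limit, elapsed, rate)
```

For scale, 2**53 nanoseconds is only about 2,500 hours, so a double-based nanosecond counter would have misbehaved far earlier than 40,000 hours.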
 
      | mkl wrote:
      | Do embedded CPUs like the one in an SSD have floating point
      | units? It seems more likely to me that the upper bits in a
      | 64 bit integer counter were used for something else.
 
      | danielheath wrote:
      | Packing a type flag into the upper bits of a 64 bit value
      | is a reasonably common optimisation in dynamic language
      | implementations (because it lets you use unboxed number
      | arithmetic).
 
      | jonas21 wrote:
      | My overactive imagination thinks it went something like
      | this:
      | 
      | Engineer A: Gee, I need to store a few flags with each
      | block, but there's nowhere to put them. Ah! We're storing
      | timestamps as 64-bit _microseconds_. I can use a few of
      | those bits and there'll still be enough to go thousands of
      | years without overflowing.
      | 
      | Engineer B: Gee, our SSDs are getting so fast, soon we'll
      | be able to hit 1M writes/sec. But we're storing timestamps
      | as microseconds. How can we generate unique timestamps for
      | each write? Ah! I'll switch to nanoseconds. It's a good
      | thing we have plenty of space in this 64-bit int.
      | 
      | BOOM!
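
The imagined failure can be sketched in a few lines. Everything here is hypothetical: the 7-bit flag width is chosen only because it leaves 57 bits for the timestamp, matching the guess earlier in the thread, and `pack`/`unpack` are illustrative names rather than anything from real SSD firmware.

```python
FLAG_BITS = 7                      # assumed width, leaves 57 bits for time
TIME_BITS = 64 - FLAG_BITS
TIME_MASK = (1 << TIME_BITS) - 1
WORD_MASK = (1 << 64) - 1


def pack(flags: int, timestamp: int) -> int:
    """Naively pack flags into the top bits; no check that timestamp fits."""
    return ((flags << TIME_BITS) | timestamp) & WORD_MASK


def unpack(word: int) -> tuple:
    """Split a packed word back into (flags, timestamp)."""
    return word >> TIME_BITS, word & TIME_MASK


# The largest timestamp that fits round-trips cleanly.
ok = unpack(pack(0b0000010, 2**57 - 1))

# One tick later the timestamp spills into the flag bits: the flags are
# corrupted and the stored time wraps to zero.
broken = unpack(pack(0b0000010, 2**57))

print(ok, broken)
```

With microsecond ticks, 2**57 units last roughly 4,500 years, consistent with Engineer A's "thousands of years"; with nanosecond ticks the same field overflows after about 40,032 hours.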
 
___________________________________________________________________
(page generated 2022-07-10 23:00 UTC)