[HN Gopher] SSD will fail at 40k power-on hours (2021)
___________________________________________________________________
Author : dredmorbius
Score  : 143 points
Date   : 2022-07-10 19:36 UTC (3 hours ago)
	web link (www.cisco.com)

| lucb1e wrote:
| Check your power-on hours:
|
|     $ sudo smartctl -a /dev/sda | grep -e Power_On_Hours -e ^ID
|     ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
|       9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9743
|
| Just looking at the raw value, it seems to be 9'743 hours in my
| case

| borplk wrote:
| Mine is above 53,000 hours ... time to check my backups!

| lucb1e wrote:
| Sounds like you're in the clear for this particular bug...
|
| ...but always check your backups regularly for data that is
| dear to you!
|
| Protip of the day: that includes things on someone else's
| server. I remember when Grooveshark went offline from one day
| to the next and I lost nearly my whole library because I
| remembered only some artists and had to go through thousands
| of songs to find which ones I actually liked from them. My
| browser's localStorage object contained a few playlists but I
| didn't use those much. Or when 000webhost cancelled my
| account because I was using the 100MB(?) to back up some
| files that were most important to me, rather than for actual
| webhosting (in my defense, I was 15 at the time), and so when
| I returned from a holiday with my parents with an actual
| crashed hard drive, that turned double sour. Backing up
| things from what they now call the "cloud" is something I
| learned early, as I have virtually no code I wrote before
| that summer, only some of the music, only essays with WordArt
| if they were printed, etc.

| dredmorbius wrote:
| Possibly related to recent HN issues, see:
| https://news.ycombinator.com/item?id=32031243

| solardev wrote:
| Wow, thanks for sharing. I didn't realize how closely related
| they were.
|
| (TLDR For anyone wondering, "recent HN issues" means HN very
| likely went down yesterday because of this same bug, when two
| (edit: two pairs, four total) enterprise SSDs with old firmware
| died after 40,000 hours close together.
| An admin of HN and its host both like this theory. See details
| in that thread.)
|
| Edit: If you want to discuss that theory, it's probably better
| to do it in that other thread directly instead... dang and a
| person from M5 Hosting (HN's previous host) are both
| participating there.

| mkl wrote:
| Not two SSDs, _four_: two in the main server, and two in the
| backup server.

| solardev wrote:
| Thanks for the correction!

| [deleted]

| kazen44 wrote:
| The chance of two SSDs failing at the same time under normal
| circumstances is extremely slim. So this might actually be a
| likely cause of this incident.

| MBCook wrote:
| Especially since one pair was a nearly unused backup server
| that had a totally different use profile.

| DoneWithAllThat wrote:
| What the hell is that ridiculous "bias-free language" claptrap at
| the beginning? Man DEI is seriously out of control.

| UkrainianJew wrote:
| It's a bug in the human firmware. People seem to have an
| instinctive and subconscious need for property - out of all
| things occupying our attention span, being able to arbitrarily
| change some on a whim.
|
| I think this instinct is responsible for humans figuring out
| farming (as in developing the land near you to your liking) and
| many cultural achievements.
|
| Except, with the information society, our attention is being
| constantly overwhelmed by the stream of information produced by
| other people, so this instinct kicks in and makes some people
| want to control what language others use. I don't think we will
| see any studies of this soon, but my hunch is that there is an
| inverse correlation between the amount of one's physical
| property and one's sensitivity to the language and content of
| others' speech.
|
| Corporations happily abused it, since letting your employees
| "own" pronouns and acknowledgements is cheaper than paying them
| enough to own their houses (let alone start competing
| companies).
| Now it has spun into a de-facto religion where many
| people's weight in the society depends on perpetuating (and
| intensifying) the dogmas. Kinda similar to late USSR where most
| people didn't believe in communism anymore, but not having a
| Lenin's room in your office would get you labeled as an
| American spy.
|
| From what we can learn from the history, it will intensify
| until the movement splinters into competing factions, that will
| heavily oppose each other, and will eventually settle on some
| common ground to avoid continuous mutual damage.

| ParetoOptimal wrote:
| It's a device to identify certain kinds of people who would
| have a problem with less loaded language without any loss in
| clarity.

| rajamaka wrote:
| I would love to see some examples of Cisco documentation that
| ever offended anyone.

| mlyle wrote:
| I miss old Cisco documentation, with IP addresses and
| router names like SanJose3 and 408 phone numbers on PRIs
| etc.

| bombcar wrote:
| It's a warning that the documentation may refer to
| master/slave or something like that because Cisco cares
| enough about DEI to update documentation but not enough to
| actually update out-of-support firmware.

| 13of40 wrote:
| OK, here's one they need to update:
|
| https://blogs.cisco.com/news/digital-transformation-requires...
|
| The offending text:
|
| Act now by adding "equality, inclusion, and diversity" as
| an agenda item for your next staff meeting, brownbag, or
| employee gathering.
|
| What, why?
|
| https://www.upi.com/Odd_News/2013/08/02/In-Seattle-the-terms...
|
| "Brownbag" is offensive in one context, and that means it's
| offensive in all contexts.

| mancerayder wrote:
| Other than DEI administrators, trainers and people in
| positions with DEI in them, who is actually getting offended?

| powerhour wrote:
| People that have to move their mouse a bit to hit the x
| button, apparently.

| deigestapo wrote:
| It's so they can track it, compute metrics, report, etc.
|
| This gets fed into indicators that can be used to boost the ESG
| (communism) score of publicly-traded companies.

| GuB-42 wrote:
| Goes well with the legal disclaimer that follows.
|
| The legal or whatever-not-technical department wanted to leave
| their mark.

| TaylorAlexander wrote:
| Having a statement like that shows people that they are open to
| suggestions on improvements. Since a lot of people are not so
| open to suggestions, it makes sense to me to include this
| language. They added a little X button so you can close it
| easily.

| [deleted]

| hn_throwaway_99 wrote:
| My reaction was "If you want to write some documentation with
| bias-free language, just write the documentation with bias-free
| language." Why the need for a long paragraph explaining "Look
| how great and sensitive we are!"
|
| I understand, and agree with, the desire to use inclusive
| language, but so much of this has just devolved into
| performative nonsense.

| mlyle wrote:
| Else you get questions, like, "why don't you say master/slave
| like everyone else?!@!!"

| kwhitefoot wrote:
| At this stage I think such questions can just be ignored.

| MarcoZavala wrote:

| alpb wrote:
| Saying that and usernames like "DoneWithAllThat" and
| "hn_throwaway", yeah, it checks out.

| hn_throwaway_99 wrote:
| Not sure exactly what point you're trying to make, but if
| it's "the risk of saying anything even _remotely_ critical
| of DEI tactics is a huge, gargantuan, giant career risk
| these days", then I wholeheartedly agree.

| 0xbadcafebee wrote:
| The docs may include "master/slave", and they don't want to get
| sued or bad PR, so this generic notice says "we don't like bad
| words but sometimes the industry uses bad words and that's
| unfortunate". If you click the _Learn More_ link in the
| paragraph, you'll learn more.

| redeeman wrote:

| deigestapo wrote:

| zorpner wrote:
| There is -- it's using words other than those, which is
| both easy and considerate.

| civilized wrote:
| It's been over two years since this was first identified... since
| this apparently affected many makes and models of SSDs, it would
| be nice to know if my laptop could be affected and if there's
| anything I could do about it.

| pmoriarty wrote:
| One thing everyone could and should be doing is backups.

| m0llusk wrote:
| Two things: Test restores or you don't actually have backups.
| Just saying.

| chrischen wrote:
| I got bit by this with iPhone backups. I did a phone trade
| in and followed the backup before trading in instructions.
| Problem is after the trade in the backup failed to restore
| due to an unknown error. The whole manual syncing and
| backing up with a cable workflow with Apple is super fickle
| and riddled with bugs.
|
| Luckily I had Time Machine backups of my iOS backups and I
| managed to avoid losing too much data.
|
| As a sidenote it seems like Apple has pretty much neglected
| their offline backup and syncing workflow to drive more
| people to just pay for iCloud storage. Half the time my
| iPhone takes hours just to get detected by the mac when
| _plugged in._

| opencl wrote:
| This will not affect your laptop, all of the models affected by
| this are enterprise SAS SSDs.
|
| Of course your SSD might have some _other_ firmware bug that
| would eat your data, all you can do is search for the model
| number and see if the manufacturer has issued any
| notices/firmware updates.

| robocat wrote:
| > This will not affect your laptop
|
| That's just your presumptive opinion, right?

| Sakos wrote:
| How likely is it that they're using an enterprise SAS SSD
| in their laptop?

| yomkippur wrote:
| crap so it's certainly HP laptops. so which laptops are safe from
| this?

| mrkramer wrote:
| My HP laptop has Toshiba SSD.
| I'm not sure about other models. But I think only enterprise
| SSDs are affected.

| mistrial9 wrote:
| related topic - leaving SMART control tests ON for a (non-SSD)
| drive, apparently interferes with sleep; the drive will wake up
| to test itself. For some drives, I would prefer that not to
| happen and just stay quiet. Yet, testing for this behavior seems
| elusive -- querying the disk wakes it, and most linux disk tools
| seem unaware of sleep state. I just listen for the disk spinning,
| or notice a long pause before an operation.

| onion2k wrote:
| Backblaze have a great blog about things they learn about hard
| drives. It's been going for years, less about firmware issues and
| more about general usage.
| https://www.backblaze.com/blog/backblaze-drive-stats-for-q1-...

| usr1106 wrote:
| Cisco is not an SSD manufacturer. They write "industry-wide bug".
| Does that mean that more than one SSD manufacturer is affected
| (because they use partially the same firmware)? Further down they
| mention only Sandisk. Or is the industry-wide just their newspeak
| for saying any Sandisk of affected model, regardless whether
| installed in a Cisco box or somewhere else?

| dr_zoidberg wrote:
| I'm interested here too. I've got a Crucial SSD from 2015
| that's been on about:
|
| * 100% of 2015-2017, let's add 2 years here
|
| * Aboutish 50% of days since 2018 to 2020
|
| * On and off again (5%?) since then until now.
|
| So it's about 3 years of full use? I'm eyeballing the use here.
| So it may be close to the numbers that were given, but I'm not
| sure. Guess I could check the SMART stats to get a precise
| number and from there decide what to do about it.
|
| Searching a bit it seems it's a well-known bug in "enterprise
| SSDs"[0, 1] (which my drive certainly isn't) but there aren't
| any real details about what causes it, other than "a firmware
| bug".
|
| [0] https://www.servethehome.com/hpe-issues-hpd7-fix-for-ssds-th...
|
| [1] https://www.anandtech.com/show/15673/dell-hpe-updates-for-40...

| dredmorbius wrote:
| The problem seems to be widely experienced.
|
| The Cisco report turned up in response to a post I'd made of
| the HN issue on the Fediverse:
|
| https://mastodon.infra.de/@galaxis/108622795822100862

| userbinator wrote:
| 40000 (or even 40960) seems an odd number to fail at. 64k or 32k
| would make the cause pretty obvious, but 40000 doesn't seem all
| that round in binary. Perhaps a 12-bit counter incrementing every
| 10h? This is puzzling.
|
| Of course, I am also entertaining the possibility that no one
| thought they would be in use for this long, which would certainly
| be evidence of planned obsolescence.

| twawaaay wrote:
| Very strange understanding of the word "evidence".
|
| No sane SSD manufacturer would do such a thing on purpose. You do
| it and you lose business, that's it.
|
| The simplest explanation is that somebody made an honest
| engineering mistake.

| bayindirh wrote:
| When you purchase a server (fleet), you get a long warranty
| with it. Generally 3 to 5 years. So you expect this fleet to
| stay in service for <=5 years mostly.
|
| Unless you burn through your SSDs, you're very unlikely to
| hit this event.
|
| When these servers continue to be used and disks all start
| to fail at the same time, this will obviously stink.
|
| The bathtub curve is not like this. You can _feel_ that.

| fartcannon wrote:
| Given the power dynamic between a single customer and large
| corporations, the smart thing to do is to assume malice until
| proven otherwise. This puts the onus on the corporations and,
| if we're lucky, creates an environment where they compete
| with each other to be seen as the most honest. The worst
| thing that happens is the single customer has to buy an SSD
| from someone they don't trust.
|
| If we do the opposite, as you say, and assume everything is
| an honest mistake, that puts pressure on the single customer
| to prove that the organization with a huge marketing budget
| is doing something wrong. In this situation, the worst thing
| that happens is we all get taken advantage of.
|
| Our collective distrust is the only power we have against
| massive marketing/PR budgets. It doesn't have to be angry, or
| sour, or cranky, we just collectively need to not take their
| word until we have a reason to do so.

| charcircuit wrote:
| Are you seriously saying that by default we should believe
| they intentionally planned to cause their customers to lose
| all of their data?

| [deleted]

| alliao wrote:
| planned obsolescence is quite a thing...?

| dtjb wrote:
| In some cases, but a product must fulfill its core
| purpose. If an SSD intentionally dumped data and self
| destructed at a set time, that would be disastrous for
| the brand. Same way a car doesn't adopt planned
| obsolescence by blowing up after 200k miles.

| bayindirh wrote:
| If a spinning rust can run for ~8 years without any
| problems, a consumer SSD can hit beyond 40K hours
| reliably, and everything is checked and tested tens of
| times because of the complexity of flash storage, I'd get
| suspicious too.
|
| Also, enterprise drives get firmware updates (regardless
| of spinning or not), and this firmware is automatically
| applied via RAID controller, so it could be remedied
| easily before it got this big if it's an actual error.

| justinsaccount wrote:
| Someone pointed out on the other thread that it could be 2^57
| nanoseconds:
|
|     >>> 2**57/10**9/3600
|     40031.996687737745

| AaronFriel wrote:
| If it were 53, I'd wonder "are they storing the time in the
| integer part of a double precision float?" That wouldn't go
| negative, it'd just start absorbing increments without
| changing the value.
|
| Though that might cause a divide by zero?
|
| What could cause unexpected behavior at 57 bits?
|
| Perhaps storing fractions of an hour, like incrementing it
| every 1/16th of an hour and calculating a relative rate of
| change, causing a divide by zero?

| mkl wrote:
| Do embedded CPUs like the one in an SSD have floating point
| units? It seems more likely to me that the upper bits in a
| 64 bit integer counter were used for something else.

| danielheath wrote:
| Packing a type flag into the upper bits of a 64 bit value
| is a reasonably common optimisation in dynamic language
| implementations (because it lets you use unboxed number
| arithmetic).

| jonas21 wrote:
| My overactive imagination thinks it went something like
| this:
|
| Engineer A: Gee, I need to store a few flags with each
| block, but there's nowhere to put them. Ah! We're storing
| timestamps as 64-bit _microseconds_. I can use a few of
| those bits and there'll still be enough to go thousands of
| years without overflowing.
|
| Engineer B: Gee, our SSDs are getting so fast, soon we'll
| be able to hit 1M writes/sec. But we're storing timestamps
| as microseconds. How can we generate unique timestamps for
| each write? Ah! I'll switch to nanoseconds. It's a good
| thing we have plenty of space in this 64-bit int.
|
| BOOM!
___________________________________________________________________
(page generated 2022-07-10 23:00 UTC)
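
Editor's sketch: the arithmetic behind the 2^57-nanosecond theory and the microseconds-to-nanoseconds scenario in the thread can be checked in a few lines of Python. This is only an illustration of the commenters' speculation, not a description of the actual firmware. The figure of 7 borrowed flag bits and all names below are assumptions made for the example; the thread only mentions "a few" bits and the 2^57 value.

```python
# Sketch of the thread's speculation: a 64-bit timestamp with some upper
# bits borrowed for flags. FLAG_BITS = 7 is a hypothetical value chosen
# so that 57 bits remain; the thread does not give a number.
TOTAL_BITS = 64
FLAG_BITS = 7  # assumption for illustration
VALUE_BITS = TOTAL_BITS - FLAG_BITS  # 57

NS_PER_SECOND = 10**9
US_PER_SECOND = 10**6
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365  # rough, ignores leap years


def hours_until_wrap(value_bits: int, ticks_per_second: int) -> float:
    """Hours until a counter with `value_bits` usable bits overflows."""
    return 2**value_bits / ticks_per_second / SECONDS_PER_HOUR


hours_ns = hours_until_wrap(VALUE_BITS, NS_PER_SECOND)
hours_us = hours_until_wrap(VALUE_BITS, US_PER_SECOND)

print(f"{VALUE_BITS} bits of nanoseconds:  about {hours_ns:,.0f} hours")
print(f"{VALUE_BITS} bits of microseconds: about {hours_us / HOURS_PER_YEAR:,.0f} years")
```

Under these assumptions the microsecond counter lasts thousands of years, while the same number of bits counted in nanoseconds runs out just past 40,000 hours, which is why the theory is attractive even though nothing in the thread confirms it.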