|
| facorreia wrote:
| 2017.
| [deleted]
| lordnacho wrote:
| The problem with bugs deep in the stack is that it is really time
| consuming to establish that they are in fact as deep as they are.
|
| I wrote a Swift iOS app once, and came across an issue with one
| of the collection classes.
|
| Of course, nobody thinks that the Swift libs will be wrong as a
| first guess. So I worked through a number of hypotheses about my
| own code, slowly stripping out pieces that I thought might
| contain an error. And then combinations. I also tried reducing
| the number of entries just to simplify the logs. This worked, but
| of course you are not going to think that there's a library bug
| affecting collections with size > 16, and it wasn't actually a
| theory until I randomly decided to reduce the n. I also
| discovered that it worked just fine in release but not debug, so
| I thought maybe I had some race condition.
|
| I kept stripping more and more away, until I eventually gave up
| on my own project and started a fresh one just to exercise the
| collection class. I did it for the sake of being thorough,
| rather than actually thinking the lib had a bug in its debug
| implementation. But lo and behold, when I managed to make it
| reproducible and put it on SO, someone from Apple acknowledged
| that they could also see it, and they fixed it.
|
| Naturally, if I'd gone straight to testing the lib I'd have
| saved a huge amount of time, but I guess that's the tradeoff of
| the most sensible heuristic: test your own code first, the bug
| is there.
| gh123man wrote:
| > nobody thinks that the Swift libs will be wrong as a first
| guess
|
| This is highly dependent on which version of Swift you started
| with! When Swift introduced the new substring API I hit a bug
| where certain UTF-8 character sequences caused an index out of
| bounds error internally. Unfortunately we learned this in
| production when an entire organization couldn't launch our app
| due to a string they were feeding through it.
|
| That is how your trust in the standard libs is forever broken.
| Library and compiler bugs were quite common in the Swift 1-3
| days.
| jcelerier wrote:
| Yeah, over the course of my admittedly short career (I'm 29)
| I've reported dozens of bugs against GCC, Clang, MSVC,
| binutils, Qt, SDL, glibc, PortAudio, macOS and other
| foundational stuff... I'm not saying I automatically assume
| "toolchain bug", but my cutoff for seriously pondering "is it
| a bug in $underlying_stuff" is around 30 minutes of "I really
| can't see what my code is doing wrong", and so far this
| heuristic has consistently held...
| cesarb wrote:
| > but I guess that's the tradeoff of the most sensible
| heuristic: test your own code first, the bug is there.
|
| Also known as "select is not broken" (see for instance
| https://blog.codinghorror.com/the-first-rule-of-programming-...).
| yjftsjthsd-h wrote:
| Reminds me of: "It Is Never a Compiler Bug Until It Is"
| (https://r6.ca/blog/20200929T023701Z.html,
| https://news.ycombinator.com/item?id=24636326). The bottom of
| the modern stack is _really_ reliable, until it isn't ;)
| Smoosh wrote:
| Not just "the modern stack". I work on mainframes and always
| felt the IBM-supplied environment (compilers, transaction
| processing systems, databases) was rock solid.
|
| Then one day I discovered APARs were a thing.
|
| https://www.ibm.com/support/pages/open-apars-ibm-products-av...
| twic wrote:
| Similar story with a bug in the IBM JDK's implementation of
| BigDecimal. Surely if anyone is going to get decimals right
| it's IBM! Took us a long time to stop looking at our code.
|
| (turns out that IBM do get decimals right if you're running on
| z/Architecture, where the code diverts to some hardware-
| accelerated fast path; just not on x86-64 machines used by
| paupers like my project)
| CalChris wrote:
| Debian announcement
|
| https://lists.debian.org/debian-devel/2017/06/msg00308.html
|
| Ahrefs writeup
|
| https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2...
|
| The Intel spec update still labels SKL150 as _No Fix_ but there
| is a microcode update available. Dunno exactly what to make of
| that distinction.
|
| https://www.intel.com/content/www/us/en/processors/core/desk...
|
| Can an x86 program detect whether this update has been applied?
| Can a Linux process set a DONT_HYPERTHREAD_ME_BRO bit?
| BeeOnRope wrote:
| It was "fixed" in a microcode update by disabling the _loop
| stream buffer_ (LSD) which is a special mode of operation for
| very small loops where the instruction decoders and uop cache
| in the CPU are shut down and the loop runs directly out of a
| small cache*. Since the problem arose only when the LSD was
| being used, in combination with hyperthreading and high byte
| register use, this effectively avoids the problem.
|
| Of course, disabling the LSD has some costs: CPUs use more
| power and some loops are slower (though some are faster). These
| updates are usually applied silently without user consent, so
| you might be quite surprised to find out that after a reboot
| your computation kernel suddenly draws more power, or has
| slowed down or sped up.
|
| > Can an x86 program detect whether this update has been
| applied? Can a Linux process set a DONT_HYPERTHREAD_ME_BRO bit?
|
| Yes. One way would be to check the microcode version (available
| in /proc/cpuinfo on Linux, among other places), since the
| version that introduced this fix is known.
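|
| As a rough sketch, assuming Linux's /proc/cpuinfo exposes a
| "microcode" field (the known-fixed revision differs per CPU
| model, so it is left as a lookup rather than hardcoded here):
|
|     #include <stdio.h>
|
|     int main(void) {
|         FILE *f = fopen("/proc/cpuinfo", "r");
|         if (!f) { perror("fopen"); return 1; }
|         char line[256];
|         unsigned rev = 0;
|         while (fgets(line, sizeof line, f)) {
|             /* lines look like "microcode : 0xd6" */
|             if (sscanf(line, "microcode : %x", &rev) == 1)
|                 break;
|         }
|         fclose(f);
|         /* compare against the revision known to carry the
|            SKL150 fix for this particular model */
|         printf("microcode revision: 0x%x\n", rev);
|         return 0;
|     }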
|
| Another way would be to run a small loop known to fit in the
| LSD and then check a performance counter event which counts
| uops delivered from the LSD, like lsd.uops. This counter is
| always zero when the LSD is disabled (or realistically you
| could just run _any_ substantial code and check the counter
| since you always have some non-negligible portion of the uops
| coming from the LSD). This is how I check it from the command
| line in practice.
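|
| (Presumably something like "perf stat -e lsd.uops" does it from
| the shell.) Programmatically, a minimal sketch using
| perf_event_open; the raw code 0x01a8 for LSD.UOPS is a
| Skylake-family assumption, so check your model's event tables:
|
|     #include <stdio.h>
|     #include <string.h>
|     #include <unistd.h>
|     #include <sys/ioctl.h>
|     #include <sys/syscall.h>
|     #include <linux/perf_event.h>
|
|     int main(void) {
|         struct perf_event_attr attr;
|         memset(&attr, 0, sizeof attr);
|         attr.type = PERF_TYPE_RAW;
|         attr.size = sizeof attr;
|         attr.config = 0x01a8;  /* LSD.UOPS: umask 1, event 0xa8 */
|         attr.disabled = 1;
|         attr.exclude_kernel = 1;
|         attr.exclude_hv = 1;
|
|         int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
|         if (fd < 0) { perror("perf_event_open"); return 1; }
|
|         ioctl(fd, PERF_EVENT_IOC_RESET, 0);
|         ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
|
|         /* a loop small enough to fit in the LSD */
|         volatile unsigned x = 0;
|         for (unsigned i = 0; i < 100000000u; i++)
|             x += i;
|
|         ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
|         long long n = 0;
|         read(fd, &n, sizeof n);
|         printf("lsd.uops = %lld%s\n", n,
|                n == 0 ? " (LSD looks disabled)" : "");
|         close(fd);
|         return 0;
|     }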
|
| Finally, if you don't have easy access to the counters, you
| could create a loop that has a significant performance
| difference depending on whether it is coming from the LSD or
| not. For example, a loop that crosses a 32-byte boundary will
| take 2 or more cycles per iteration when fed from the decoders
| or the uop cache, but could run at 1 cycle per iteration from
| the LSD. Timing such a loop would give you a strong indication
| of whether the LSD is enabled.
|
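| A rough sketch of that timing approach. Caveats: rdtsc ticks at
| the TSC frequency, not the core frequency, so treat the number
| as indicative only, and the padding assumes the dec/jnz body
| assembles to exactly 4 bytes:
|
|     #include <stdio.h>
|     #include <stdint.h>
|     #include <x86intrin.h>
|
|     int main(void) {
|         const uint32_t iters = 200000000u;
|         uint32_t n = iters;
|         uint64_t t0 = __rdtsc();
|         asm volatile(
|             ".p2align 5\n\t"     /* start at a 32-byte boundary */
|             ".skip 30, 0x90\n\t" /* pad so the body straddles   */
|             "1:\n\t"
|             "dec %0\n\t"         /* 2 bytes                     */
|             "jnz 1b\n\t"         /* 2 bytes: crosses at byte 32 */
|             : "+r"(n) : : "cc");
|         uint64_t t1 = __rdtsc();
|         printf("%.2f TSC ticks/iter\n",
|                (double)(t1 - t0) / iters);
|         return 0;
|     }
|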
| ---
|
| * Specifically, the cache used is not a dedicated one, but
| rather the IDQ (decoded instruction queue) is reused. This
| queue holds uops and is normally fed by the decoders or the uop
| cache on one end, and feeds the allocation/rename engine
| on the other. In LSD mode, this queue stops being a queue and
| is instead used as a kind of cache with the loop operations
| "locked down" in the queue and just repeatedly replayed.
| kaladin-jasnah wrote:
| Dumb question, but why is it abbreviated as LS_D_ when it's
| spelled loop stream _b_uffer?
| CalChris wrote:
| It's actually spelled _Loop Stream Detector_ and it dates
| to the _Core 2_ processor family, circa 2006. The
| LSD is described in section 3.4.2.4 of the Intel
| Optimization Manual, _Optimizing the Loop Stream Detector
| (LSD)._ AnandTech describes how it works.
|
| https://www.anandtech.com/show/2594/4
| BeeOnRope wrote:
| Yeah that's right. Not sure where I picked up the term
| "... buffer" but a search shows I've been using it for a
| while.
| 13of40 wrote:
| > More experienced programmers know very well that the bug is
| generally in their code: occasionally in third-party libraries;
| very rarely in system libraries
|
| This was the bane of my existence when I worked on testing
| Windows years ago. New SDETs almost invariably fell into the trap
| of assuming any automation error was a "test bug" instead of a
| bug in OS code, even if the OS code in question was written last
| week.
| 1432132143 wrote:
| Really, guys? GFY. You know what OEMs do: they disable features
| every time some new bug turns up. E.g. undervolting: the fan on
| my ThinkBook laptop is now always on (30°, fan on; 29°, fan on)
| and I can't even undervolt my CPU anymore. Really, thanks.
| wging wrote:
| Previous submission:
| https://news.ycombinator.com/item?id=14686277
|
| (This is not a complaint; I found the post interesting.)
| dang wrote:
| Thanks! Macroexpanded:
|
| _I found a bug in Intel Skylake processors_ -
| https://news.ycombinator.com/item?id=14686277 - July 2017 (99
| comments)
| [deleted]
| bjarneh wrote:
| > Binary search always fails? "The Java compiler is acting funny
| today!"
|
| :-)
| Decabytes wrote:
| I'm glad I'm just a pleb programmer who has never done anything
| complicated enough to expose processor errata.
|
| And even if I had, I wouldn't have the expertise to figure it
| out.
| brokenmachine wrote:
| Welcome to the 99.999999999%.
| dfox wrote:
| The issue there is that the hardware is full of totally absurd
| bugs. If you target PC-like userspace or one of the two major
| mobile platforms, it is somebody else's job to shield you from
| that. In general, CPU-level bugs are somewhat rare, but every
| single platform vendor has shipped some kind of silicon that
| contains peripherals that do not work as documented and only by
| chance work with the reference driver implementation.
| SavantIdiot wrote:
| This is a scary place to be: being the top-level debug resource
| for a major project. It took almost two years to resolve, even
| though the bug was already documented as SKL150. Without
| knowledge of SKL150, debugging from the clang vs. gcc assembly
| would be all but impossible. GCC -O1 vs -O2 is a clue, but even
| with the asm diffs, wth? Again, scary.
| tinus_hn wrote:
| The world is a scary place; this is basically the same as
| rowhammer, which is an issue in computers shipped today.
| woodruffw wrote:
| Unless I'm misunderstanding what you mean, this isn't really
| like rowhammer at all -- it's a uarch/ucode bug, which is
| effectively a programming error within the CPU. Rowhammer is
| a physical flaw in how memory cells in DRAM are laid out, one
| that can be triggered by memory access patterns independent
| of CPU architecture and microarchitecture.
|
| (There are also hundreds of errata like this one in every CPU
| generation. They're _usually_ not easy to exploit, since they
| cause system instability rather than disclosing secret
| material or allowing unintended code execution.)
| zsmi wrote:
| > Rowhammer is a physical flaw in how memory cells in DRAM
| are laid out
|
| It's not really a flaw, more like a consequence of how
| memory cells are laid out. I mean most people want lots of
| bits in their DRAM. Maximizing this parameter necessitates
| that some will be in close proximity.
| woodruffw wrote:
| To my (non-EE) mind, the flaw is the electrical leakage
| between the cells. Tight packing is a consequence of
| economic forces, but I assume there are also technical
| solutions that allow for tight packing (but that offset either
| the performance or the cost gains). Is that assumption wrong?
| (Genuinely asking!)
| tlb wrote:
| DRAM cells also decay over time (~ 60 milliseconds), but
| memory controllers have some logic to refresh every row
| on a regular schedule so it's not an issue.
|
| They should also have logic to refresh adjacent rows if
| some number of consecutive accesses to a small group of
| rows is detected. This is rare in normal workloads,
| because those accesses normally come from cache. It's
| lame of chipmakers not to fix this. The fix would
| require the DRAM controller (integrated into modern
| CPUs) to know more about the internals of DRAMs than they
| currently do.
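|
| A toy sketch of that logic; refresh_row, the threshold, and the
| per-row counter array are all hypothetical, and real "target
| row refresh" hardware gets by with far fewer counters:
|
|     /* hypothetical hardware hook: refresh one row */
|     extern void refresh_row(int row);
|
|     /* hypothetical activation budget per refresh window */
|     #define ACT_THRESHOLD 50000
|
|     /* called on every row activation; counts[] has one slot
|        per row and is reset each refresh window */
|     void on_activate(int row, int nrows, int counts[]) {
|         if (++counts[row] < ACT_THRESHOLD)
|             return;
|         counts[row] = 0;
|         if (row > 0)         refresh_row(row - 1);
|         if (row + 1 < nrows) refresh_row(row + 1);
|     }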
| zsmi wrote:
| In theory DDR5/LPDDR5 added a controller command for
| RowHammer mitigation but I haven't had time to research
| it yet.
|
| See: https://arxiv.org/pdf/2108.06703.pdf
| zsmi wrote:
| There was a good paper on it in 2014. [1] They describe
| the RowHammer attack as: opening and closing (activation
| and precharge) a DRAM row (aggressor row) at a high
| enough rate (hammering) such that it can cause bit-flips
| in physically nearby rows (victim row).
|
| Colloquially, it's basically that a change in voltage in one
| place can indirectly cause a change in voltage in another
| place via capacitive coupling. Capacitance increases in
| proportion to the inverse of the separating distance, so
| only in recent years have things shrunk to the size that
| makes it an issue.
|
| Since having fewer bits in DRAM is basically not an option,
| most mitigation techniques that I know of remove the
| possibility of hammering; candidates include changes to the
| OS, the memory-system controller, or the DRAM controller.
|
| [1] https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf
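|
| The core access pattern is small enough to sketch in C.
| Illustrative only: addr_a and addr_b are assumed to map to
| different rows of the same DRAM bank, and finding such a
| physical-address pair is the hard part, omitted here:
|
|     #include <emmintrin.h>  /* _mm_clflush */
|
|     static void hammer(volatile char *addr_a,
|                        volatile char *addr_b, long reps) {
|         for (long i = 0; i < reps; i++) {
|             (void)*addr_a;  /* activate aggressor row A */
|             (void)*addr_b;  /* activate aggressor row B */
|             /* evict both lines so the next reads go to DRAM */
|             _mm_clflush((const void *)addr_a);
|             _mm_clflush((const void *)addr_b);
|         }
|     }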
| woodruffw wrote:
| Much appreciated, thank you.
| [deleted]
| dimitrios1 wrote:
| Apologies if this is off topic -- but I am constantly impressed
| by some of the things I find that come from inria.fr. I first
| came across them when learning OCaml. Seems to be a top-notch
| university.
| woodruffw wrote:
| Inria is a research institute, not a university. But they do
| indeed do excellent work!
| bruce343434 wrote:
| The link called "6th Generation Intel(r) Processor Family -
| Specification Update" 404s.
| userbinator wrote:
| "gcc/clang/icc/msvc won't usually issue the affected opcode
| pattern and it ends up being rare. SKL150 - Short loops using
| both the AH/BH/CH/DH registers and the corresponding wide
| register _may_ result in unpredictable system behavior."
|
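| For concreteness, a hypothetical fragment of the shape the
| erratum describes -- my own illustration, not code from the
| article; whether a real sequence misbehaves also depends on
| alignment, hyperthreading, and microcode level:
|
|     #include <stdio.h>
|
|     /* a short loop mixing AH with the corresponding wide
|        register RAX -- the combination SKL150 warns about */
|     static unsigned long skl150_shape(unsigned long x,
|                                       unsigned n) {
|         asm volatile(
|             "1:\n\t"
|             "movb   %%al, %%ah\n\t"   /* write high byte AH */
|             "movzbl %%ah, %%edx\n\t"  /* read it back       */
|             "addq   %%rdx, %%rax\n\t" /* use wide reg RAX   */
|             "decl   %%ecx\n\t"
|             "jnz    1b\n\t"
|             : "+a"(x), "+c"(n) : : "rdx", "cc");
|         return x;
|     }
|
|     int main(void) {
|         printf("%lu\n", skl150_shape(1, 1000000));
|         return 0;
|     }
|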
| I think Intel should regression-test its CPUs using the decades
| of demoscene productions out there, especially those in the
| extreme-size-optimisation categories; testing with almost
| exclusively "mainstream" compiler output is IMHO a bad idea and a
| step down the path to "warranty void if VLC is used"
| (https://news.ycombinator.com/item?id=7205759)