[HN Gopher] How I found a bug in Intel Skylake processors (2017)
___________________________________________________________________
 
How I found a bug in Intel Skylake processors (2017)
 
Author : vinnyglennon
Score  : 228 points
Date   : 2021-11-08 16:12 UTC (6 hours ago)
 
web link (gallium.inria.fr)
w3m dump (gallium.inria.fr)
 
| facorreia wrote:
| 2017.
 
| [deleted]
 
| lordnacho wrote:
| The problem with bugs deep in the stack is that it is really time
| consuming to establish that they are in fact as deep as they are.
| 
| I wrote a Swift iOS app once, and came across an issue with one
| of the collection classes.
| 
| Of course, nobody thinks that the Swift libs will be wrong as a
| first guess. So I worked through a number of hypotheses about my
| own code, slowly stripping out pieces that I thought might
| contain an error. And then combinations. I also tried reducing
| the number of entries just to simplify the logs. This worked, but
| of course you are not going to think that there's a library bug
| affecting collections with size > 16, and it wasn't actually a
| theory until I randomly decided to reduce the n. I also
| discovered that it worked just fine on release but not debug, so
| I thought maybe I have some race condition.
| 
| More and more stripping down occurred, until I eventually gave up
| using my own project and just started a new one just to see about
| the collection class. I did it for the sake of being thorough,
| rather than actually thinking the lib had a bug in its debug
| implementation. But lo and behold, when I managed to make it
| reproducible and put it on SO, someone from Apple acknowledged
| that they could also see it, and they fixed it.
| 
| Naturally if I'd gone direct to testing the lib I'd have saved a
| huge amount of time, but I guess that's the tradeoff from the
| most sensible heuristic: test your own code first, the bug is
| there.
 
  | gh123man wrote:
  | > nobody thinks that the Swift libs will be wrong as a first
  | guess
  | 
  | This is highly dependent on which version of Swift you started
  | with! When Swift introduced the new substring API I hit a bug
  | where certain UTF-8 character sequences caused an index out of
  | bounds error internally. Unfortunately we learned this in
  | production when an entire organization couldn't launch our app
  | due to a string they were feeding through it.
  | 
  | That is how your trust in the standard libs is forever broken.
  | Library and compiler bugs were quite common in the Swift 1-3
  | days.
 
    | jcelerier wrote:
    | Yeah, over the course of my allegedly short career (I'm 29)
    | I've reported dozens of bugs against GCC, Clang, MSVC,
    | binutils, Qt, SDL, glibc, PortAudio, macOS and other
    | foundational stuff... I'm not saying I automatically assume
    | "toolchain bug", but my cutoff for seriously pondering "is it
    | a bug in $underlying_stuff" is around 30 minutes of "I really
    | can't see where in my code things were done wrong" and so far
    | this heuristic has consistently held...
 
  | cesarb wrote:
  | > but I guess that's the tradeoff from the most sensible
  | heuristic: test your own code first, the bug is there.
  | 
  | Also known as "select is not broken" (see for instance
  | https://blog.codinghorror.com/the-first-rule-of-
  | programming-...).
 
  | yjftsjthsd-h wrote:
  | Reminds me of: "It Is Never a Compiler Bug Until It Is"
  | (https://r6.ca/blog/20200929T023701Z.html ,
  | https://news.ycombinator.com/item?id=24636326). The bottom of
  | the modern stack is _really_ reliable, until it isn't ;)
 
    | Smoosh wrote:
    | Not just "the modern stack". I work mainframes and always
    | felt the IBM-supplied environment (compilers, transaction
    | processing systems, databases) was rock solid.
    | 
    | Then one day I discovered APARs were a thing.
    | 
    | https://www.ibm.com/support/pages/open-apars-ibm-products-
    | av...
 
  | twic wrote:
  | Similar story with a bug in the IBM JDK's implementation of
  | BigDecimal. Surely if anyone is going to get decimals right
  | it's IBM! Took us a long time to stop looking at our code.
  | 
  | (turns out that IBM do get decimals right if you're running on
  | z/Architecture, where the code diverts to some hardware-
  | accelerated fast path; just not on x86-64 machines used by
  | paupers like my project)
 
| CalChris wrote:
| Debian announcement
| 
| https://lists.debian.org/debian-devel/2017/06/msg00308.html
| 
| Ahrefs writeup
| 
| https://tech.ahrefs.com/skylake-bug-a-detective-story-ab1ad2...
| 
| The Intel spec update still labels SKL150 as _No Fix_ but there
| is a microcode update available. Dunno exactly what to make of
| that distinction.
| 
| https://www.intel.com/content/www/us/en/processors/core/desk...
| 
| Can an x86 program detect whether this update has been applied?
| Can a Linux process set a DONT_HYPERTHREAD_ME_BRO bit?
 
  | BeeOnRope wrote:
  | It was "fixed" in a microcode update by disabling the _loop
  | stream buffer_ (LSD) which is a special mode of operation for
  | very small loops where the instruction decoders and uop cache
  | in the CPU are shut down and the loop runs directly out of a
  | small cache*. Since the problem arose only when the LSD was
  | being used, in combination with hyperthreading and high byte
  | register use, this effectively avoids the problem.
  | 
  | Of course, disabling the LSD has some costs: CPUs use more
  | power and some loops are slower (though some are faster). These
  | updates are usually applied silently without user consent, so
  | you might be quite surprised to find out that after a reboot your
  | computation kernel suddenly draws more power or has slowed down
  | or sped up.
  | 
  | > Can an x86 program detect whether this update has been
  | applied? Can a Linux process set a DONT_HYPERTHREAD_ME_BRO bit?
  | 
  | Yes. One way would be to check the microcode version (available
  | in /proc/cpuinfo on Linux, among other places), since the
  | version that introduced this fix is known.
  | 
  | Another way would be to run a small loop known to fit in the
  | LSD and then check a performance counter event which counts
  | uops delivered from the LSD, like lsd.uops. This counter is
  | always zero when the LSD is disabled (or realistically you
  | could just run _any_ substantial code and check the counter
  | since you always have some non-negligible portion of the uops
  | coming from the LSD). This is how I check it from the command
  | line in practice.
  | 
  | Finally, if you don't have easy access to the counters, you
  | could create a loop that has a significant performance
  | difference depending on whether it is coming from the LSD or
  | not. For example, a loop that crosses a 32-byte boundary will
  | run in 2 or more cycles when using the decoder or uop cache, but
  | could run in 1 cycle in the LSD. Timing such a loop would give
  | you a strong indication about whether the LSD is enabled.
  | 
  | ---
  | 
  | * Specifically, the cache used is not a dedicated one, but
  | rather the IDQ (decoded instruction queue) is reused. This
  | queue holds uops and is normally fed by the decoders or the uop
  | cache on one end, and feeds the allocation/rename engine
  | on the other. In LSD mode, this queue stops being a queue and
  | is instead used as a kind of cache with the loop operations
  | "locked down" in the queue and just repeatedly replayed.
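
The microcode-version check described above can be sketched in
Python. This is a minimal sketch, assuming an x86 Linux system whose
/proc/cpuinfo has one "microcode : 0x..." line per logical CPU. The
threshold EXAMPLE_MIN_REVISION is a placeholder chosen for
illustration, not a value from this thread; the real revision that
carries the fix depends on the CPU model and stepping and has to be
looked up separately.

```python
import re

# Placeholder threshold for illustration only (an assumption, not from the
# thread): look up the real fixed revision for your CPU model and stepping.
EXAMPLE_MIN_REVISION = 0xBA


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        # x86 Linux prints one line per logical CPU, e.g. "microcode\t: 0xba".
        match = re.match(r"microcode\s*:\s*(0x[0-9a-fA-F]+)", line)
        if match:
            revisions.add(int(match.group(1), 16))
    return revisions


def has_at_least(cpuinfo_text, min_revision):
    """True if every reported revision is >= min_revision (False if none found)."""
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and min(revisions) >= min_revision


# Small made-up sample so the functions can be tried without /proc.
SAMPLE = "processor\t: 0\nmodel name\t: Example CPU\nmicrocode\t: 0xba\n"

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on systems without /proc
    print(sorted(hex(rev) for rev in microcode_revisions(text)))
    print(has_at_least(text, EXAMPLE_MIN_REVISION))
```

A version check only says which microcode is loaded; the performance
counter and timing approaches described above observe the LSD
behaviour directly.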
 
    | kaladin-jasnah wrote:
    | Dumb question, but why is it abbreviated as LS_D_ when it's
    | spelled loop stream _b_uffer?
 
      | CalChris wrote:
      | It's actually spelled _Loop Stream Detector_ and it dates
      | to the _Core 2_ processor family which is circa 2006. The
      | LSD is described in section 3.4.2.4 of the Intel
      | Optimization Manual, _Optimizing the Loop Stream Detector
      | (LSD)._ AnandTech describes how it works.
      | 
      | https://www.anandtech.com/show/2594/4
 
        | BeeOnRope wrote:
        | Yeah that's right. Not sure where I picked up the term
        | "... buffer" but a search shows I've been using it for a
        | while.
 
| 13of40 wrote:
| > More experienced programmers know very well that the bug is
| generally in their code: occasionally in third-party libraries;
| very rarely in system libraries
| 
| This was the bane of my existence when I worked on testing
| Windows years ago. New SDETs almost invariably fell into the trap
| of assuming any automation error was a "test bug" instead of a
| bug in OS code, even if the OS code in question was written last
| week.
 
| 1432132143 wrote:
| Really guys, GFY. You know what OEMs do: they disable many features
| every time they get some new bug, e.g. undervolting. Now the fan on
| my ThinkBook laptop is always on: at 30* the fan is on, at 29* the
| fan is on, and I can't even undervolt my CPU now. Really, thx.
 
| wging wrote:
| Previous submission:
| https://news.ycombinator.com/item?id=14686277
| 
| (This is not a complaint; I found the post interesting.)
 
  | dang wrote:
  | Thanks! Macroexpanded:
  | 
  |  _I found a bug in Intel Skylake processors_ -
  | https://news.ycombinator.com/item?id=14686277 - July 2017 (99
  | comments)
 
| [deleted]
 
| bjarneh wrote:
| > Binary search always fails? "The Java compiler is acting funny
| today!"
| 
| :-)
 
| Decabytes wrote:
| I'm glad I'm just a pleb programmer, who never has done anything
| so complicated that it would expose processor errata.
| 
| And even if I did, I wouldn't have the expertise to even figure
| it out.
 
  | brokenmachine wrote:
  | Welcome to the 99.999999999%.
 
  | dfox wrote:
  | The issue there is that the hardware is full of totally absurd
  | bugs. If you target PC-like userspace or one of the two major
  | mobile platforms it is somebody else's job to shield you from
  | that. In general CPU level bugs are somewhat rare, but every
  | single platform vendor has shipped some kind of silicon that
  | contains peripherals that do not work as documented and only by
  | chance work with the reference driver implementation.
 
| SavantIdiot wrote:
| This is a scary place to be: the top-level debug resource for a
| major project. It took almost two years to resolve, but was
| already known as SKL150. Looking at the clang vs. gcc assembly
| without knowledge of SKL150 would be literally impossible to
| debug. GCC -O1 vs -O2 is a clue, but even with the asm diffs,
| wth? Again, scary.
 
  | tinus_hn wrote:
  | The world is a scary place; this is basically the same as
  | rowhammer which is an issue in computers shipped today.
 
    | woodruffw wrote:
    | Unless I'm misunderstanding what you mean, this isn't really
    | like rowhammer at all -- it's a uarch/ucode bug, which is
    | effectively a programming error within the CPU. Rowhammer is
    | a physical flaw in how memory cells in DRAM are laid out, one
    | that can be triggered by memory access patterns independent
    | of CPU architecture and microarchitecture.
    | 
    | (There are also hundreds of errata like this one in every CPU
    | generation. They're _usually_ not easy to exploit, since they
    | cause system instability rather than disclosing secret
    | material or allowing unintended code execution.)
 
      | zsmi wrote:
      | > Rowhammer is a physical flaw in how memory cells in DRAM
      | are laid out
      | 
      | It's not really a flaw, more like a consequence of how
      | memory cells are laid out. I mean most people want lots of
      | bits in their DRAM. Maximizing this parameter necessitates
      | that some will be in close proximity.
 
        | woodruffw wrote:
        | To my (non-EE) mind, the flaw is the electrical leakage
        | between the cells. Tight packing is a consequence of
        | economic forces, but I assume there are also technical
        | solutions that allow for tight packing (but either offset
        | the performance or cost gains). Is that assumption wrong?
        | (Genuinely asking!)
 
        | tlb wrote:
        | DRAM cells also decay over time (~ 60 milliseconds), but
        | memory controllers have some logic to refresh every row
        | on a regular schedule so it's not an issue.
        | 
        | They should also have logic to refresh adjacent rows if
        | some number of consecutive accesses to a small group of
        | rows is detected. This is rare in normal workloads,
        | because those accesses normally come from cache. It's
        | lame of chipmakers to not fix this. The fix would
        | require the DRAM controller (integrated into modern
        | CPUs) to know more about the internals of DRAMs than they
        | currently do.
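
The "refresh adjacent rows when a row is hit too often" idea can be
sketched as a toy model. This is only an illustration under stated
assumptions: the class name ToyRowRefresher, the per-row counter, the
threshold, and the notion that rows r-1 and r+1 are the physical
neighbours of row r are all hypothetical simplifications, not how any
real memory controller is documented to work. As the comment above
notes, knowing the true physical adjacency is exactly the hard part.

```python
from collections import Counter


class ToyRowRefresher:
    """Toy model: count row activations and refresh neighbours of hot rows."""

    def __init__(self, num_rows, threshold):
        self.num_rows = num_rows
        self.threshold = threshold      # activations tolerated per window
        self.activations = Counter()    # per-row count in the current window
        self.extra_refreshes = []       # log of neighbour rows refreshed early

    def activate(self, row):
        """Record one activation of `row`; refresh its neighbours if it is hot."""
        self.activations[row] += 1
        if self.activations[row] >= self.threshold:
            # Assumption: physical neighbours are simply row - 1 and row + 1.
            for neighbor in (row - 1, row + 1):
                if 0 <= neighbor < self.num_rows:
                    self.extra_refreshes.append(neighbor)
            self.activations[row] = 0   # neighbours are fresh again

    def periodic_refresh(self):
        """The regular refresh restores every row, so all counters restart."""
        self.activations.clear()


if __name__ == "__main__":
    controller = ToyRowRefresher(num_rows=8, threshold=3)
    for _ in range(3):
        controller.activate(4)
    print(controller.extra_refreshes)
```

Keeping a counter per row is itself a simplification; it is used here
only to make the detection step concrete.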
 
        | zsmi wrote:
        | In theory DDR5/LPDDR5 added a controller command for
        | RowHammer mitigation but I haven't had time to research
        | it yet.
        | 
        | See: https://arxiv.org/pdf/2108.06703.pdf
 
        | zsmi wrote:
        | There was a good paper on it in 2014. [1] They describe
        | the RowHammer attack as: opening and closing (activation
        | and precharge) a DRAM row (aggressor row) at a high
        | enough rate (hammering) such that it can cause bit-flips
        | in physically nearby rows (victim row).
        | 
        | Colloquially, a change in voltage in one place can
        | indirectly cause a change in voltage in another place via
        | capacitive coupling. Capacitance is inversely proportional
        | to the separating distance, so only in recent years have
        | things shrunk to the size that makes it an issue.
        | 
        | Since having fewer bits in DRAM is basically not an option,
        | most mitigation techniques that I know of remove the
        | possibility of hammering; candidates include changes to the
        | OS, the memory system controller, or the DRAM controller.
        | 
        | [1] https://users.ece.cmu.edu/~yoonguk/papers/kim-
        | isca14.pdf
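
The inverse-distance relationship mentioned above can be made
concrete with the textbook parallel-plate idealization. This is an
assumption for illustration, not a formula from the cited paper, and
real coupling between DRAM structures has a more complicated
geometry. Here \varepsilon is the permittivity of the material
between the conductors, A their overlapping area, and d their
separation.

```latex
C = \frac{\varepsilon A}{d}
\quad\Longrightarrow\quad
\frac{C(d/2)}{C(d)} = \frac{\varepsilon A / (d/2)}{\varepsilon A / d} = 2
```

Under this idealization, halving the separation doubles the coupling
capacitance, which is the sense in which shrinking geometries make the
effect matter.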
 
        | woodruffw wrote:
        | Much appreciated, thank you.
 
    | [deleted]
 
| dimitrios1 wrote:
| Apologies if this is off topic -- but I am constantly impressed
| at some of the things I find that come from inria.fr. I first
| came across them when learning OCaml. Seems to be a top notch
| university.
 
  | woodruffw wrote:
  | Inria is a research institute, not a university. But they do
  | indeed do excellent work!
 
| bruce343434 wrote:
| The link called "6th Generation Intel(r) Processor Family -
| Specification Update." 404's
 
| userbinator wrote:
| "gcc/clang/icc/msvc won't usually issue the affected opcode
| pattern and it ends up being rare. SKL150 - Short loops using
| both the AH/BH/CH/DH registers and the corresponding wide
| register _may_ result in unpredictable system behavior."
| 
| I think Intel should regression-test its CPUs using the decades
| of demoscene productions out there, especially those in the
| extreme-size-optimisation categories; testing with almost
| exclusively "mainstream" compiler output is IMHO a bad idea and a
| step down the path to "warranty void if VLC is used"
| (https://news.ycombinator.com/item?id=7205759 )
 
___________________________________________________________________
(page generated 2021-11-08 23:00 UTC)