[HN Gopher] Understanding DeepMind's sorting algorithm
___________________________________________________________________
 
Understanding DeepMind's sorting algorithm
 
Author : jart
Score  : 212 points
Date   : 2023-06-12 17:30 UTC (5 hours ago)
 
web link (justine.lol)
w3m dump (justine.lol)
 
| mlochbaum wrote:
| Hang on, you can't just quote MB/s numbers for an O(n log(n))
| sort. What length were these tests run at?
| 
| The code size might not end up quite as good (it also requires
| malloc), but a branchless merge sort is a contender for a fast
| and lightweight sort. The just-published tiny-sort-rs[0] cites
| 632 bytes and looks like ~350MB/s at 1e4 elements on Zen 3. In
| my tests, my own pisort[1] benches a little over twice as fast
| as LongSort, but it uses sorting networks as the base case, so
| it's around 5KB. It's roughly based on piposort[2], which has
| more complicated recursion but a simpler base case.
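| 
| Roughly, "branchless" means the inner merge step turns the
| comparison into a select, so the compiler can emit cmov instead
| of a hard-to-predict branch. A minimal C sketch of that step
| (not the actual code from [0] or [1]):
| 
|     #include <stddef.h>
| 
|     static void merge(int *dst, const int *a, size_t na,
|                       const int *b, size_t nb) {
|         while (na && nb) {
|             /* 0 or 1; drives the select and both advances */
|             size_t take_b = (size_t)(*b < *a);
|             *dst++ = take_b ? *b : *a;   /* becomes a cmov */
|             a  += 1 - take_b;  na -= 1 - take_b;
|             b  += take_b;      nb -= take_b;
|         }
|         while (na--) *dst++ = *a++;      /* drain leftovers */
|         while (nb--) *dst++ = *b++;
|     }
| 
| Ties take from a, which is what keeps the merge stable.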
| 
| 400 MB/s seems a bit slow for a radix sort on that hardware: my
| own radix sort hits those numbers on my i5-6200U, which has
| less than half the clock rate. I'd recommend checking
| ska_sort_copy from [3], as it has about the same performance.
| 
| [0] https://github.com/Voultapher/tiny-sort-rs
| 
| [1]
| https://github.com/mlochbaum/SingeliSort/blob/master/src/mer...
| 
| [2] https://github.com/scandum/piposort
| 
| [3] https://github.com/skarupke/ska_sort
 
  | mlochbaum wrote:
  | Just realized that obviously you don't need stability if you're
  | using in-place quicksort, so the tiny-sort heapsort is a better
  | recommendation. 304 bytes, although the scaling to large arrays
  | is much worse because of the awful access patterns.
 
  | jstanley wrote:
  | > What length were these tests run at?
  | 
  | The first example is "assembly code they published for sorting
  | an array with three items" - this isn't an entire general-
  | purpose sorting algorithm, it's just the innermost part.
 
    | mlochbaum wrote:
    | Second part of the article, starting at "I thought it'd be
    | useful to share something that's actually portable and
    | executable".
 
    | srcreigh wrote:
    | The AlphaDev post claims a 1.7% improvement on large
    | sequences (250k+ elements).
 
      | refulgentis wrote:
      | Yes, and as both posts say, that's because large sequences
      | are implemented by building up from small sequences :)
 
| Semisweet6708 wrote:
| [dead]
 
| tabtab wrote:
| Several years ago I read about using genetic algorithms to
| "evolve" better mini-sorts. I wonder how the two compare.
 
| cammil wrote:
| Why is sorting not done by specialised hardware? Or is it
| already?
 
  | bmc7505 wrote:
  | It can be done for fixed-length lists. Optimal sorting networks
  | [1] are an active research topic with many interesting
  | connections to differentiable sorting and ranking [2].
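  | 
  | A sorting network is a fixed sequence of compare-exchange
  | steps, the same for every input, which is what makes it easy
  | to lay out in hardware. A 3-input example in C (illustrative,
  | not taken from [1] or [2]):
  | 
  |     static void cswap(int *x, int *y) {
  |         /* compare-exchange: afterwards *x <= *y */
  |         int lo = *x < *y ? *x : *y;
  |         int hi = *x < *y ? *y : *x;
  |         *x = lo;
  |         *y = hi;
  |     }
  | 
  |     static void sort3(int *v) {
  |         /* the comparator sequence never depends on the data */
  |         cswap(&v[0], &v[1]);
  |         cswap(&v[1], &v[2]);
  |         cswap(&v[0], &v[1]);
  |     }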
  | 
  | [1]:
  | https://en.wikipedia.org/wiki/Sorting_network#Optimal_sortin...
  | 
  | [2]: https://arxiv.org/abs/2105.04019
 
| fred256 wrote:
| This reminds me of challenge 28 from the game Human Resource
| Machine.
 
| smodad wrote:
| I just realized that Justine was the person responsible for the
| massive reduction in the memory footprint of the Llama models
| back in March.[1] Super impressive! These are my favorite kinds
| of blog posts.
| 
| [1] https://github.com/ggerganov/llama.cpp/pull/613
 
  | gajnadsgjoas wrote:
  | You mean the one who was banned by the author because of all
  | the drama that followed?
 
    | jimsimmons wrote:
    | What drama? Ooc
 
      | pcj-github wrote:
      | https://github.com/github-drama/github-drama/pull/46
 
  | m00x wrote:
  | It's just using mmap, nothing too impressive. It's a nice
  | contribution nonetheless.
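  | 
  | The apparent win comes from mapping the weights file
  | read-only: the kernel pages it in lazily and keeps it in the
  | shared page cache instead of copying it into private heap
  | memory. A minimal sketch of the idea (hypothetical helper,
  | not the llama.cpp code):
  | 
  |     #include <fcntl.h>
  |     #include <stddef.h>
  |     #include <sys/mman.h>
  |     #include <sys/stat.h>
  |     #include <unistd.h>
  | 
  |     /* Map a model file read-only; pages load on demand and
  |        live in the page cache rather than private memory. */
  |     static void *map_weights(const char *path, size_t *size) {
  |         int fd = open(path, O_RDONLY);
  |         if (fd < 0) return NULL;
  |         struct stat st;
  |         if (fstat(fd, &st) != 0) { close(fd); return NULL; }
  |         void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
  |                        MAP_SHARED, fd, 0);
  |         close(fd);   /* the mapping survives the close */
  |         if (p == MAP_FAILED) return NULL;
  |         *size = (size_t)st.st_size;
  |         return p;
  |     }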
 
| srcreigh wrote:
| Is it strange that it's slower in jart's testing but claimed to
| be faster in the AlphaDev blog post?
| 
| jart doesn't provide detail about the length of the sequences
| used in testing, and AlphaDev basically says that between 6 and
| 249,999 elements the optimizations are slower (they only claim
| an improvement for very small and 250k+ element sequences).
| 
| The AlphaDev numbers are curious as well. AFAICT there's extra
| branching when you splice in the tiny-sequence-optimized
| versions (slower), but better sorting of the tiny sequences
| themselves (faster).
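| 
| To make "splice" concrete, the shape is roughly this: the
| length test is the extra branch, and the fixed-size kernel is
| the faster leaf (a toy C sketch, not the actual libcxx code).
| 
|     #include <stddef.h>
| 
|     static void swap_int(int *a, int *b) {
|         int t = *a; *a = *b; *b = t;
|     }
| 
|     /* fixed-size kernel for the leaves */
|     static void sort3(int *v) {
|         if (v[0] > v[1]) swap_int(&v[0], &v[1]);
|         if (v[1] > v[2]) swap_int(&v[1], &v[2]);
|         if (v[0] > v[1]) swap_int(&v[0], &v[1]);
|     }
| 
|     void toy_sort(int *v, size_t n) {
|         if (n <= 3) {            /* spliced-in special cases */
|             if (n == 3) sort3(v);
|             else if (n == 2 && v[0] > v[1])
|                 swap_int(&v[0], &v[1]);
|             return;
|         }
|         /* Lomuto partition around the last element */
|         size_t store = 0;
|         for (size_t i = 0; i + 1 < n; i++)
|             if (v[i] < v[n - 1]) swap_int(&v[i], &v[store++]);
|         swap_int(&v[store], &v[n - 1]);
|         toy_sort(v, store);      /* DFS down the two halves */
|         toy_sort(v + store + 1, n - store - 1);
|     }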
| 
| Is it that branch prediction gets an edge when the leaf nodes
| of the recursion are all sorting tiny sequences? In jart's code
| it's DFS, which I can only guess would trample a bit on branch
| prediction. I wonder if a BFS traversal could be better.
| 
| No idea what would cause this, though; I'm curious if anyone
| has other ideas.
 
| CalChris wrote:
| mov %rdx,%rcx
| 
| Wouldn't this _mov_ instruction be handled by the register
| renamer (Allocate/Rename/MoveElimination/ZeroIdiom) at
| essentially 'zero' cost? Yet clearly they're measuring a
| difference. I'll be curious what Agner Fog and Peter Cordes
| think.
| 
| Answer: renaming can fail if the operands aren't ready, and it
| isn't zero cost, just less.
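| 
| For reference, the quoted mov is most likely just the setup for
| a cmov: each branchless select typically compiles to "mov dst,
| src1; cmp; cmovcc dst, src2". Illustrative C, not the published
| kernel:
| 
|     /* A branchless compare-exchange; the plain reg-reg movs
|        that feed the cmovs are the kind of instruction being
|        discussed above. */
|     static void cswap(long *x, long *y) {
|         long a = *x, b = *y;
|         long lo = a < b ? a : b;   /* mov + cmp + cmov */
|         long hi = a < b ? b : a;   /* mov + cmov */
|         *x = lo;
|         *y = hi;
|     }
| 
| Even when the renamer eliminates such a mov, it still costs a
| fetch/decode/rename slot, which is why removing it outright can
| still help a little, as the answer above says.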
 
___________________________________________________________________
(page generated 2023-06-12 23:00 UTC)