proxy70

	[HN Gopher] Understanding DeepMind's sorting algorithm ___________________________________________________________________ Understanding DeepMind's sorting algorithm Author : jart Score : 212 points Date : 2023-06-12 17:30 UTC (5 hours ago)
	web link (justine.lol)
	w3m dump (justine.lol)
	\| mlochbaum wrote: \| Hang on, you can't just quote MB/s numbers for an O(n log(n)) \| sort. What length were these tests run at? \| \| The code size might not end up quite as good (also requires \| malloc), but a branchless merge sort is a contender for a fast \| and lightweight sort. Just published, tiny-sort-rs[0] cites 632 \| bytes and looks like ~350MB/s at 1e4 elements on Zen 3. In my \| tests, my own pisort[1] benches a little over twice as fast as \| LongSort, but it uses sorting networks as the base case so it's \| like 5KB. It's roughly based on piposort[2] which has more \| complicated recursion but a simpler base case. \| \| 400 MB/s seems a bit slow for a radix sort on that hardware: I'm \| hitting those numbers on my i5-6200U, which has less than half \| the clock rate, with my own radix sort. Recommend checking \| ska_sort_copy from [3] as it has about the same performance. \| \| [0] https://github.com/Voultapher/tiny-sort-rs \| \| [1] \| https://github.com/mlochbaum/SingeliSort/blob/master/src/mer... \| \| [2] https://github.com/scandum/piposort \| \| [3] https://github.com/skarupke/ska_sort \| mlochbaum wrote: \| Just realized that obviously you don't need stability if you're \| using in-place quicksort, so the tiny-sort heapsort is a better \| recommendation. 304 bytes, although the scaling to large arrays \| is much worse because of the awful access patterns. \| jstanley wrote: \| > What length were these tests run at? \| \| The first example is "assembly code they published for sorting \| an array with three items" - this isn't an entire general- \| purpose sorting algorithm, it's just the innermost part. \| mlochbaum wrote: \| Second part of the article, starting at "I thought it'd be \| useful to share something that's actually portable and \| executable". \| srcreigh wrote: \| The alpha dev post claims 1.7% improvement on large sequences \| (250k+) \| refulgentis wrote: \| Yes, and as both posts say, that's because large sequences \| are implemented by building up from small sequences :) \| Semisweet6708 wrote: \| [dead] \| tabtab wrote: \| Several years ago I read about using genetic algorithms to \| "evolve" better mini-sorts. I wonder how the two compare. \| cammil wrote: \| Why is sorting not done by specialised hardware? Or is it \| already? \| bmc7505 wrote: \| It can be done for fixed-length lists. Optimal sorting networks \| [1] are an active research topic with many interesting \| connections to differentiable sorting and ranking [2]. \| \| [1]: \| https://en.wikipedia.org/wiki/Sorting_network#Optimal_sortin... \| \| [2]: https://arxiv.org/abs/2105.04019 \| fred256 wrote: \| This reminds me of challenge 28 from the game Human Resource \| Machine. \| smodad wrote: \| I just realized that Justine was the person responsible for the \| massive reduction in the memory footprint of the Llama models \| back in March.[1] Super impressive! These are my favorite kinds \| of blog posts. \| \| [1] https://github.com/ggerganov/llama.cpp/pull/613 \| gajnadsgjoas wrote: \| You wanted to say the one was banned by the author because of \| all the drama that followed \| jimsimmons wrote: \| What drama. Ooc \| pcj-github wrote: \| https://github.com/github-drama/github-drama/pull/46 \| m00x wrote: \| It's just using mmap, nothing too impressive. It's a nice \| contribution nonetheless. \| srcreigh wrote: \| Is it strange that it's slower in jart's testing but claimed to \| be faster in the AlphaDev blog post? \| \| jart doesn't provide detail about length of sequences used in \| testing, and AlphaDev basically says that between 6 and 249,999 \| elements the optimizations are slower (they only claim \| improvement for very small and 250k+ element sequences). \| \| The AlphaDev numbers are so curious as well. AFAICT there's extra \| branching when you splice the tiny-sequence optimized versions \| (slower), but better sorting for the tiny sequences (faster). \| \| Is it, like, branch prediction gets an edge when the leaf nodes \| of the recursion are all sorting tiny sequences? In jart's code, \| it's DFS, which I can only guess would trample a bit on branch \| prediction. I wonder if a BFS search could be better \| \| No idea what would cause this though, curious if anyone has other \| ideas, I really don't know. \| CalChris wrote: \| mov %rdx,%rcx \| \| Wouldn't this _mov_ instruction be handled by the register \| renamer (Allocate /Rename/MoveElimination/ZeroIdiom) at \| essentially 'zero' cost? Yet clearly they're measuring a \| difference. I'll be curious what Agner Fog and Peter Cordes \| think. \| \| Answer: renaming can fail if the operands aren't ready and it \| isn't zero cost, just less. ___________________________________________________________________ (page generated 2023-06-12 23:00 UTC)