|
| mlochbaum wrote:
| Hang on, you can't just quote MB/s numbers for an O(n log(n))
| sort. What length were these tests run at?
|
| The code size might not end up quite as good (also requires
| malloc), but a branchless merge sort is a contender for a fast
| and lightweight sort. Just published, tiny-sort-rs[0] cites 632
| bytes and looks like ~350MB/s at 1e4 elements on Zen 3. In my
| tests, my own pisort[1] benches a little over twice as fast as
| LongSort, but it uses sorting networks as the base case so it's
| like 5KB. It's roughly based on piposort[2] which has more
| complicated recursion but a simpler base case.
|
| 400 MB/s seems a bit slow for a radix sort on that hardware: I'm
| hitting those numbers on my i5-6200U, which has less than half
| the clock rate, with my own radix sort. Recommend checking
| ska_sort_copy from [3] as it has about the same performance.
|
| [0] https://github.com/Voultapher/tiny-sort-rs
|
| [1]
| https://github.com/mlochbaum/SingeliSort/blob/master/src/mer...
|
| [2] https://github.com/scandum/piposort
|
| [3] https://github.com/skarupke/ska_sort
| mlochbaum wrote:
| Just realized that obviously you don't need stability if you're
| using in-place quicksort, so the tiny-sort heapsort is a better
| recommendation. 304 bytes, although the scaling to large arrays
| is much worse because of the awful access patterns.
| jstanley wrote:
| > What length were these tests run at?
|
| The first example is "assembly code they published for sorting
| an array with three items" - this isn't an entire general-
| purpose sorting algorithm, it's just the innermost part.
| mlochbaum wrote:
| Second part of the article, starting at "I thought it'd be
| useful to share something that's actually portable and
| executable".
| srcreigh wrote:
| The alpha dev post claims 1.7% improvement on large sequences
| (250k+)
| refulgentis wrote:
| Yes, and as both posts say, that's because large sequences
| are implemented by building up from small sequences :)
| Semisweet6708 wrote:
| [dead]
| tabtab wrote:
| Several years ago I read about using genetic algorithms to
| "evolve" better mini-sorts. I wonder how the two compare.
| cammil wrote:
| Why is sorting not done by specialised hardware? Or is it
| already?
| bmc7505 wrote:
| It can be done for fixed-length lists. Optimal sorting networks
| [1] are an active research topic with many interesting
| connections to differentiable sorting and ranking [2].
|
| [1]:
| https://en.wikipedia.org/wiki/Sorting_network#Optimal_sortin...
|
| [2]: https://arxiv.org/abs/2105.04019
| fred256 wrote:
| This reminds me of challenge 28 from the game Human Resource
| Machine.
| smodad wrote:
| I just realized that Justine was the person responsible for the
| massive reduction in the memory footprint of the Llama models
| back in March.[1] Super impressive! These are my favorite kinds
| of blog posts.
|
| [1] https://github.com/ggerganov/llama.cpp/pull/613
| gajnadsgjoas wrote:
| You wanted to say the one was banned by the author because of
| all the drama that followed
| jimsimmons wrote:
| What drama. Ooc
| pcj-github wrote:
| https://github.com/github-drama/github-drama/pull/46
| m00x wrote:
| It's just using mmap, nothing too impressive. It's a nice
| contribution nonetheless.
| srcreigh wrote:
| Is it strange that it's slower in jart's testing but claimed to
| be faster in the AlphaDev blog post?
|
| jart doesn't provide detail about length of sequences used in
| testing, and AlphaDev basically says that between 6 and 249,999
| elements the optimizations are slower (they only claim
| improvement for very small and 250k+ element sequences).
|
| The AlphaDev numbers are so curious as well. AFAICT there's extra
| branching when you splice the tiny-sequence optimized versions
| (slower), but better sorting for the tiny sequences (faster).
|
| Is it, like, branch prediction gets an edge when the leaf nodes
| of the recursion are all sorting tiny sequences? In jart's code,
| it's DFS, which I can only guess would trample a bit on branch
| prediction. I wonder if a BFS search could be better
|
| No idea what would cause this though, curious if anyone has other
| ideas, I really don't know.
| CalChris wrote:
| mov %rdx,%rcx
|
| Wouldn't this _mov_ instruction be handled by the register
| renamer (Allocate /Rename/MoveElimination/ZeroIdiom) at
| essentially 'zero' cost? Yet clearly they're measuring a
| difference. I'll be curious what Agner Fog and Peter Cordes
| think.
|
| Answer: renaming can fail if the operands aren't ready and it
| isn't zero cost, just less.
___________________________________________________________________
(page generated 2023-06-12 23:00 UTC) |