[HN Gopher] All About Libpas, Phil's Super Fast Malloc
___________________________________________________________________
 
All About Libpas, Phil's Super Fast Malloc
 
Author : Jarred
Score  : 238 points
Date   : 2022-06-01 13:45 UTC (9 hours ago)
 
web link (github.com)
w3m dump (github.com)
 
| carlsborg wrote:
| Trying to understand the license for bmalloc, and webkit in
| general. The WebCore folder has an Apple license and two LGPL
| licenses. The root folder has no license file. Can anyone
| comment?
 
  | pizlonator wrote:
  | libpas is BSD 2-clause. so is bmalloc. libpas is in the bmalloc
  | directory in WebKit, but is really a separate piece of code.
  | 
  | each file has its own license header.
 
| SemanticStrengh wrote:
 
| ridiculous_fish wrote:
| Hey Phil, thanks for the write up! Curious why this is only used
| for JSC's JIT, and not to back its heap? Did the design of the
| JSC GC influence libpas at all?
 
  | pizlonator wrote:
  | It's used for JSC's JIT and for all heap allocations in
  | JavaScriptCore and WebKit that aren't GC allocations.
  | 
  | Riptide (the JSC GC) has a similar design and some of the key
  | ideas were first pioneered there:
  | 
  | - The use of fast bitvector searches to find an eligible page
  | (Riptide calls it first eligible block, I think).
  | 
  | - The use of bitvector simd to change object states in bulk.
  | 
  | - Riptide uses a mark bitvector that is indexed by minalign
  | atom (16 bytes) and a block size of 16KB. Libpas's default
  | "small segregated" configuration is 16 byte minalign, 16KB
  | block, and an alloc bitvector indexed by minalign.
  | 
  | I have since thought about how Riptide could be wired up to use
  | libpas. It's possible to do it. Already today you could create
  | a libpas page_config that leaves room for mark bits in the page
  | header, and makes them use the same indexing as the alloc bits.
  | Riptide has a "newly allocated" bitvector that serves almost
  | the same purpose as the alloc bits (kinda, if you squint really
  | hard) - so you'd go from Riptides newly_allocated and mark
  | bitvectors, to libpas's alloc bitvector and a custom mark
  | bitvector stuffed into page headers using an exotic
  | page_config. If JSC did this, it would buy:
  | 
  | - Better decommit policies than Riptide. I did a lot of tuning
  | in libpas to make decommit Just Right (TM).
  | 
  | - Faster allocation of medium-sized objects.
  | 
  | - Possibly better memory efficiency.
  | 
  | But, Riptide is already very nicely tuned, and it gets some
  | benefit from libpas's decommit rules because Riptide calls
  | fastMalloc to get memory and sometimes uses fastFree to return
  | it. fastFree calls bmalloc::api::free, which calls libpas's
  | bmalloc_deallocate, which then goes to pas_deallocate with the
  | BMALLOC_HEAP_CONFIG. So, Riptide already gets some of the
  | goodness of libpas by sitting on top of it.
 
  | [deleted]
 
| dragontamer wrote:
| General-purpose memory allocation is a lie. Every allocator has
| use cases where it's faster than the others.
| 
| I think that's why we keep seeing newer malloc schemes pop up,
| because the performance of the heap 100% depends on the use-case,
| and different people have different use cases.
| 
| Still, studying everyone else's heaps (and garbage collectors, a
| closely related discussion) is probably good for high-performance
| programmers.
 
  | pizlonator wrote:
  | Yeah, you're totally right.
  | 
  | Libpas beats other mallocs in WebKit. I also had benchmarks
  | involving non-WebKit workloads and the results were all over
  | the place (sometimes slower than other mallocs by a lot,
  | sometimes faster by a lot - same thing with memory, sometimes
  | more efficient, sometimes less). This didn't surprise me; I've
  | seen this before when writing memory management code.
  | 
  | I think it makes sense for large software projects that use
  | malloc a lot and care about perf to eventually either write
  | their own malloc or to take an existing one and then tune it a
  | lot.
 
  | latenightcoding wrote:
  | I've seen a few opensource projects archive their custom
  | allocators because they were not beating the system's one or
  | jemalloc. So if you have the skill and time then yeah go for
  | it, else stick with general purpose ones.
 
    | mananaysiempre wrote:
    | There are a couple of approaches that might legitimately be
    | simpler than using a general-purpose allocator: in a single-
    | pass batch process, just throw things on the floor and let
    | the OS sort it out on exit[1] (I think the D compiler does
    | this); in a multiple-pass batch process, make each pass
    | litter its own room (arena, obstack, etc.) then demolish said
    | room when done (GCC does this or at least did in the past).
    | 
    | On the other hand, these may require rearranging the logic
    | somewhat to fit them, so the question of how much
    | rearrangement to tolerate still remains. And, well[4],
    |     /*
    |      * We divy out chunks of memory rather than call malloc
    |      * each time so we don't have to worry about leaking
    |      * memory.  It's probably not a big deal if all this
    |      * memory was wasted but if this ever goes into a library
    |      * that would probably not be a good idea.
    |      *
    |      * XXX - this *is* in a library....
    |      */
    | 
    | [1] I am rather dismayed by how Raymond Chen advocates for
    | this for memory[2] but insists it could not possibly be a
    | good idea for Windows GDI handles, no, you stupid sloppy
    | programmer[3]. Maybe because he was involved in GDI from the
    | other side? (Of course, for file descriptors on Unix it's
    | still the standard practice.)
    | 
    | [2] http://bytepointer.com/resources/old_new_thing/20120105_0
    | 06_...
    | 
    | [3] http://bytepointer.com/resources/old_new_thing/20051014_3
    | 05_...
    | 
    | [4] https://github.com/the-tcpdump-
    | group/libpcap/blob/4c1e516dd2...
 
      | jstimpfle wrote:
      | This "[2] vs [3]" is a good find. I have several possible
      | explanations. It could be the difference of 7 years between
      | those posts. It could also be that Windows Handles can't be
      | cleaned up in userspace. Note how malloc chunks up the
      | "handles" (i.e. pages / regions) that it got from Windows.
      | This is, AFAIK, different from HANDLEs, which have to be
      | requested from the system one-by-one and can't be chunked
      | up. (Or am I wrong? I actually don't know a lot about how
      | this works under the hood.)
 
    | dragontamer wrote:
    | It's very easy to beat the general-purpose ones if you know
    | your exact use case.
    | 
    | Ex: a 16-bit pointer gives a 65536-element heap. Assume 8
    | bytes per element; that's 512KB of space. A bit small, but
    | large enough to do a lot of things.
    | 
    | 65536 elements can be represented as a bitmask. The bitmask
    | only takes up 8192-bytes (8KB), which fits inside of 16
    | AVX512 registers (Intel offers 32x AVX512/ZMM registers btw).
    | Or it fits inside of GPU __shared__ memory.
    | 
    | If you need multithreaded, you perform atomic AND and atomic
    | OR to clear, and set, the bitmask as appropriate.
    | 
    | -------
    | 
    | Do you know the exact size of your heap? The size of the
    | pointer? The amount of parallelism involved? What about the
    | size of the elements? Access pattern? (Bump-alloc'd Linked
    | Lists are a sequential traversal over RAM btw, so that's very
    | efficient), etc. etc.
    | 
    | The 64-bit pointer is overkill for most people's purposes.
    | 32-bits represents 4-billion objects, and if each object is
    | 16-bytes long that's 64GBs of RAM. Right here right now, a
    | 32-bit pointer / custom allocator already offers many
    | benefits over the 64-bit pointer. (half-sized pointers, more
    | data in cache, etc. etc.)
 
      | thesz wrote:
      | 512 in AVX512 is the number of bits per register. You are
      | off by factor of 8.
 
  | cmrdporcupine wrote:
  | 100%, and also I think there's a tendency among people who
  | compare 'higher up the stack' in GC'd languages to manual heap
  | allocation in C/C++ etc. to imply that the latter is almost
  | "free" from a performance POV, when in fact, under the covers,
  | these allocators still do a lot of the same kind of walking of
  | data structures that also happens inside a tracing GC.
  | 
  | You can get unpredictable pauses from malloc/free, too.
 
    | asveikau wrote:
    | I think in either case, GC or not, you write something
    | intuitively and then when it becomes an actual problem, you
    | study the patterns and improve them, but most of the time you
    | can leave it alone and the general purpose thing is good
    | enough.
    | 
    | Or if you are in a niche like audio/video or something, you
    | avoid allocations all together during the bulk of the code.
 
      | cmrdporcupine wrote:
      | Oh for sure. Premature optimizing is usually bad form.
      | 
      | What is tricky about performance issues in allocation is
      | that they can be hard to profile. There are tools for
      | analyzing GC performance in GC'd languages, but sometimes
      | malloc/free can just be a big black box.
 
        | djmips wrote:
        | Premature optimization is not bad form when it's re-
        | framed as good architecture. It's not 'usually bad form'
        | to architect something from the outset, using your
        | experience, and that's something everyone understands.
        | This pervasive disdain for premature optimization leads
        | to bad architecture, which often leads to expensive
        | rewrites. Because the word 'optimization' is so
        | overloaded and treated with disdain, it feels like we
        | need new language to talk about what's really meant by
        | 'premature optimization' (the bad kind).
 
        | asveikau wrote:
        | That's one reason I point out a few comments above that
        | certain niches will need to take it into account ahead of
        | time. Eg. An a/v application will typically allocate all
        | buffers up front and re-use them frequently rather than
        | return them to the allocator. A lot of server
        | applications will want to keep per-client memory usage
        | low. For general purposes, there's the general purpose
        | allocator.
 
  | tptacek wrote:
  | This is a pretty well-known CS truism, isn't it? One of the
  | first papers I ever fell in love with covers it in detail:
  | https://users.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pd...
 
| lidavidm wrote:
| I wonder how this compares to jemalloc, mimalloc, snmalloc?
 
  | pizlonator wrote:
  | I never got a chance to compare it to those, since I was most
  | interested in beating bmalloc. And I mainly wanted to beat it
  | on Safari workloads.
  | 
  | I believe bmalloc was previously compared against jemalloc and
  | tcmalloc, also using Safari workloads, and bmalloc was
  | significantly faster at the time.
 
    | IshKebab wrote:
    | For other people who have never heard of bmalloc - it's a
    | custom allocator used only by WebKit. I guess it's not
    | surprising they added one since the Mac system allocator is
    | extremely slow. A custom allocator is pretty much a free 20%
    | speedup on Mac (depending on your workload), but I found it
    | made no difference on Linux. Haven't tried on Windows.
 
      | pizlonator wrote:
      | Just some credit where credit is due.
      | 
      | I measured the system malloc crushing all other mallocs on
      | some workloads and they were the kind of workloads that
      | some folks run every day.
      | 
      | I measured the system malloc crushing most other mallocs on
      | memory efficiency on most workloads. System malloc is
      | really good at reusing memory and has very mature decommit
      | policies that easily rival what I came up with.
      | 
      | So there's that.
 
        | pcwalton wrote:
        | macOS malloc was _incredibly_ slow for Pathfinder, far
        | behind every other OS. Everything became bottlenecked on
        | it. It was a free 2x speedup if not more to switch to
        | jemalloc.
        | 
        | I suspect this is because Pathfinder's CPU portion is a
        | multicore workload and macOS allocator performs poorly
        | when under heavy multithreaded contention. It probably
        | just isn't the kind of workload that macOS allocator was
        | tuned for.
 
        | pizlonator wrote:
        | My understanding is that the system malloc is excellent
        | under contention, but has a high baseline cost for every
        | malloc/free call.
        | 
        | I don't remember exactly what workloads it performed
        | really great at (and if I did I dunno if I could say),
        | but I do remember they were parallel, and the speed-ups
        | got bigger the more cores you added.
        | 
        | Everything else about your experience matches mine.
        | Libpas is much faster than system malloc in WebKit and
        | JSC. The difference isn't 2x on my preferred benchmarks
        | (which are large and do lots of things that don't rely on
        | malloc), but it is easily more than 2x on smaller
        | benchmarks. So your 2x result sounds about right.
 
        | meisel wrote:
        | What sort of workloads are those? I have yet to see a
        | workload where macOS's system allocator is not
        | substantially slower than alternatives like jemalloc.
 
      | astrange wrote:
      | Alas, many people think performance engineering is only
      | about the wall clock time of their program and not what
      | impact it has on anything else.
      | 
      | malloc performance highly depends on how large the
      | allocations are though.
 
      | KerrAvon wrote:
      | What's the workload where you found that to be the case?
      | The system allocator is heavily tuned for general use, and
      | should be very difficult to beat -- generally.
 
        | IshKebab wrote:
        | It was a non-traditional compiler written in C++.
        | 
        | > The system allocator is heavily tuned
        | 
        | Perhaps but it probably has more constraints than a
        | custom allocator too, e.g. backwards compatibility with
        | software that unwittingly depends on its exact behaviour.
        | 
        | I found a discussion of this where people reported
        | various speedups (10%, 100%, 300%) on Mac by switching to
        | a custom allocator:
        | https://news.ycombinator.com/item?id=29068828
        | 
        | I'm sure there are better benchmarks if you Google it.
 
| jeffbee wrote:
| Well, the other browser's design doc is nowhere near as detailed,
| but if you want to compare and contrast the overall design, here
| is Chromium's PartitionAlloc:
| 
| https://chromium.googlesource.com/chromium/src/+/master/base...
 
| rq1 wrote:
| > Consequently, passing a function pointer (or struct of function
| pointers), where the pointer points to an always_inline function
| and the callee is always_inline results in specialization akin to
| template monomorphization. This works to any depth; the compiler
| won't be satisfied until there are no more always_inline function
| calls. This fortuitous development in compilers allowed me to
| write very nice template code in C. Libpas achieves templates in
| C using config structs that contain function pointers --
| sometimes to always_inline functions (when we want specialization
| and inlining) and sometimes to out-of-line functions (when we
| want specialization but not inlining). Additionally, the C
| template style allows us to have true polymorphic functions. Lots
| of libpas slow paths are huge and not at all hot. We don't want
| that code specialized for every config. Luckily, this works just
| fine in C templates -- those polymorphic functions just pass
| around a pointer to the config they are using, and dynamically
| load and call things in that config, almost exactly the same way
| that the specialized code would do. This saves a lot of code size
| versus C++ templates.
| 
| This document is incredible!
 
  | photochemsyn wrote:
  | Seems like a real tour de force of low-level memory management.
  | Prior to seeing this my go-to for understanding custom low-
  | level (for embedded) memory managers was this (explains the
  | basics of stack, block, bitmap and thread-local allocation) but
  | now I have something far more complicated to get confused
  | about.
  | 
  | https://www.embedded-software-engineering.de/dynamic-memory-...
 
    | pizlonator wrote:
    | Thank you! :-)
    | 
    | I always wanted to write a malloc, and this is what I came up
    | with after 3.5 years or so.
 
  | [deleted]
 
| adamnemecek wrote:
| I'm always suspicious of things that are named after the maker.
 
  | astrange wrote:
  | A better rule is to be suspicious of anything with "fast" or
  | synonyms in the name. It either won't be fast or it'll be
  | insecure.
 
  | fmajid wrote:
  | Did you know the entire US Air Traffic Control system runs on
  | long-discontinued computers and is written in JOVIAL, "Jules'
  | Own Version of the International Algebraic Language" (i.e.
  | ALGOL), named after Jules Schwartz?
  | 
  | https://en.wikipedia.org/wiki/JOVIAL
  | 
  | The FAA's been trying to replace it for ages, but AFAIK it's
  | still ongoing.
 
  | thrtythreeforty wrote:
  | Like Linux, or gcc.godbolt.org?
  | 
  | I kid, but I don't see it as a downside for software. Naming a
  | scientific phenomenon after yourself is a red flag though, see
  | also the Crackpot Index [1].
  | 
  | [1]: https://math.ucr.edu/home/baez/crackpot.html
 
    | tialaramex wrote:
    | Godbolt _says_ Compiler Explorer right there, the fact people
    | call it  "Godbolt" is because that's the memorable name
    | rather than because Matt Godbolt tried to name it after
    | himself. He calls it Compiler Explorer.
    | 
    | You can't control what people call things. My favourite
    | example is Blackfriars Bridge in London. When that bridge was
    | built, it was formally named after William Pitt. But like,
    | who cares? In what way is this bridge, Pitt's bridge? People
    | knew where it was, the Blackfriars area of London, so too bad
    | William Pitt, everybody called it Blackfriars Bridge. And so,
    | that's what its name is.
    | 
    | The priory that gave "Blackfriars" its name hadn't existed
    | for two hundred years by the time the bridge was proposed.
    | Nobody living when the bridge was finished would
    | have ever known anybody who'd met any of the friars during
    | their lifetime. Nevertheless that part of London had "always"
    | been called Blackfriars, and so a bridge and a railway
    | station built many years later are named that too.
 
      | bool3max wrote:
      | People call it Godbolt because you access it via
      | godbolt.org.
 
  | throwaway744678 wrote:
  | From the top of my head: Linux, Bourne shell...
 
    | adamnemecek wrote:
 
      | kbenson wrote:
      | Sometimes things are named after the maker ex post facto
      | because they need a way to distinguish it. The Bourne
      | shell, mentioned here, is likely called that for the same
      | reason the very first shell was called the Thompson shell,
      | in that they were written by a single person originally and
      | not really given a name (the original being from Ken
      | Thompson), since they were just different (or the original)
      | implementations of the system shell. Eventually you're
      | going to need a way to distinguish one from the other
      | though, right? Noting who wrote it is as good (probably
      | better) than any other option.
      | 
      | Linux was probably named differently, but also probably
      | doesn't really match whatever criteria you're using to be
      | suspicious. It started as a hobby project to make a free
      | and open source UNIX system, of which there were few
      | (none?), written by one guy. There was no expectation that
      | other people would use it for anything other than something
      | interesting to play around with, and using a portion of
      | your own name for your own little hobby project isn't
      | something I would look askance at, personally, especially
      | in an age where there was no expectation that it might grow
      | to anything, because that was definitely not the norm at
      | the time.
 
        | ncmncm wrote:
        | "Linux" was forced on him.
 
  | pizlonator wrote:
  | It's called libpas, and pas stands for whatever you like.
 
| mattgreenrocks wrote:
| I admit I'm a sucker for metacircularity, but have to ask:
| 
| > The bootstrap heap has hacks to allow itself to allocate that
| array out of itself.
| 
| Not quite following that, can someone elaborate?
 
  | convolvatron wrote:
  | I suspect it's this: generally, when you write a heap, it
  | needs to keep some metadata. If you have a multi-heap system
  | (really handy), then it's generally easier to host that
  | metadata on another heap. Sometimes you really want that,
  | because the target heap might be specialized for different
  | kinds of data. So I assume this is carefully constructing the
  | heap state so that it can be used for its own metadata as it's
  | being constructed.
 
  | pizlonator wrote:
  | The bootstrap heap has a freelist that is just an array of
  | entries that tell you where the free memory is.
  | 
  | That array has to be allocated somewhere.
  | 
  | All pas_simple_large_free_heaps allocate that array from the
  | bootstrap heap. Including the bootstrap heap. So, the
  | pas_simple_large_free_heap has hacks that go something like:
  | "if I'm allocating or freeing something and I need to allocate
  | or free the free list, then _very carefully_ also perform that
  | allocation /deallocation in tandem with the one I'm doing". The
  | main hack there is just that there's some statically allocated
  | "slop" for the bootstrap free list, so if you want to add to it
  | in the middle of an allocation, you add it to the slop array,
  | and then after you're done you reallocate the free list and
  | copy stuff from slop into it.
 
| sydthrowaway wrote:
| Did the author leak this IP when he left Apple?
 
  | pizlonator wrote:
  | lol
  | 
  | WebKit is open source. Libpas was always designed with WebKit
  | in mind. It landed in WebKit half a year before I left Apple.
 
| RcouF1uZ4gsC wrote:
| > Consequently, passing a function pointer (or struct of function
| pointers), where the pointer points to an always_inline function
| and the callee is always_inline results in specialization akin to
| template monomorphization. This works to any depth; the compiler
| won't be satisfied until there are no more always_inline function
| calls. This fortuitous development in compilers allowed me to
| write very nice template code in C. Libpas achieves templates in
| C using config structs that contain function pointers --
| sometimes to always_inline functions (when we want specialization
| and inlining) and sometimes to out-of-line functions (when we
| want specialization but not inlining). Additionally, the C
| template style allows us to have true polymorphic functions. Lots
| of libpas slow paths are huge and not at all hot. We don't want
| that code specialized for every config. Luckily, this works just
| fine in C templates -- those polymorphic functions just pass
| around a pointer to the config they are using, and dynamically
| load and call things in that config, almost exactly the same way
| that the specialized code would do. This saves a lot of code size
| versus C++ templates.
| 
| I thought this was super interesting. Seems like a very nice
| technique for when you use function pointers in C for stuff like
| sort comparison. There was an earlier thread about PostgreSQL
| that discussed how they created multiple sort functions to avoid
| the overhead of an indirect function pointer call when sorting
| things like integers, where the function call dwarfs the
| comparison cost. This seems like a technique that could have been
| used to avoid duplicating the sorting functions while still
| getting the benefits.
 
  | thrtythreeforty wrote:
  | I still don't quite understand how the technique saves code
  | size versus C++ templates, although I am quite eager to
  | understand it since template instantiation is a thorn in my
  | side when writing code for a microcontroller with 32KB of code
  | space.
  | 
  | Is the idea that the struct contains pointers to always_inline
  | functions, and that the compiler has visibility into this at
  | the top of the always_inline call stack, so the always_inline
  | gets monomorphized even through multiple layers of pointer
  | indirection? If I'm understanding it right, that's a pretty
  | nice technique. Would this allow you to mark a "class" (or
  | individual members of that class) as "hot, please monomorphize"
  | by making it always_inline, or "cold, please use regular
  | pointers" by omitting always_inline?
  | 
  | C++ templates would force you to monomorphize every variant, or
  | use vtables (function pointers) for every variant, without the
  | ability to choose the approach for each subclass. The compiler
  | is also spectacularly bad at recognizing deduplication
  | opportunities between template specializations.
 
    | verdagon wrote:
    | I think that's right.
    | 
    | C++ binaries tend to be pretty large because they use
    | templates/monomorphization a lot, even when there aren't
    | really meaningful performance gains. For example, it often
    | doesn't make sense to have 20 instantiations of std::vector,
    | when only two of them are used in the critical path. Rust can
    | be even further in this unfortunate direction because of the
    | borrow checker's restrictions around polymorphism and
    | abstraction.
    | 
    | In comparison, C, Java, and Javascript tend to rely on
    | dynamic dispatch more often (not that C has any built-in
    | dynamic dispatch features, but in practice, C programs tend
    | to use function pointers manually).
    | 
    | A few languages like C# have a really nice balance, and
    | monomorphize in the JIT at run-time.
    | 
    | With this technique they mention, C is the only mainstream
    | language I know of that lets you control it in a decoupled
    | way, which is really cool.
 
    | pizlonator wrote:
    | Say you have:
    | 
    |     template<typename T>
    |     void foo(T& value) { value += 42; }
    | 
    | There's no easy way to tell the C++ compiler, "please make
    | only one version of foo and use callbacks to figure out what
    | it means to += on T".
    | 
    | In libpas, this would be done by first creating:
    | 
    |     always_inline void foo(void* value,
    |                            void (*add_int)(void* value, int addend))
    |     {
    |         add_int(value, 42);
    |     }
    | 
    | And then you could also create:
    | 
    |     never_inline void foo_outline(void* value,
    |                                   void (*add_int)(void* value, int addend))
    |     {
    |         foo(value, add_int);
    |     }
    | 
    | Now, if you want monomorphization like C++, call foo() and
    | pass a literal for add_int. But if you don't, then call
    | foo_outline.
    | 
    | Say you have 1000 callsites to foo(), they all use different
    | types, and 5 of those callsites are hot while the other ones
    | are hardly ever called (maybe they only run in unusual-but-
    | necessary situations). So, the 5 hot callsites can call
    | foo(), and the remaining 995 cold ones can call
    | foo_outline().
 
      | thrtythreeforty wrote:
      | > There's no easy way to tell the C++ compiler, "please
      | make only one version of foo and use callbacks to figure
      | out what it means to += on T".
      | 
      | Partial counterpoint: you could use class polymorphism, if
      | T is always a type you control. But you're right in
      | general; C++ doesn't have typeclasses or some other way to
      | create an ad-hoc vtable for, say, int.
      | 
      | > Now, if you want monomorphization like C++, call foo()
      | and pass a literal for add_int. But if you don't, then call
      | foo_outline.
      | 
      | Does this work through structs of function pointers? Is
      | that the reason it's so powerful?
      | 
      | For example, a my_class constructor create_my_class makes
      | the class point to foo_outline, unless it's known to be a
      | hot type, then it points to foo. When you call
      | create_my_class, would the compiler see the values of the
      | pointers, and start inlining foo calls, including if you
      | pass the my_class struct into foo as "this", continuing the
      | inlining?
 
        | pizlonator wrote:
        | Yes, it works with structs of function pointers. And it
        | works recursively.
        | 
        | This works (this is slightly shorthand C, fill in the
        | blanks yourself):
        | 
        |     struct config {
        |         void (*foo)(things bar);
        |         stuff (*bar)(int baz);
        |     };
        |     always_inline stuff doit(config c)
        |     {
        |         c.foo(whatever);
        |         return c.bar(42);
        |     }
        |     always_inline void my_foo(...) { ... }
        |     always_inline stuff my_bar(...) { ... }
        |     stuff dostuff(void)
        |     {
        |         config c;
        |         c.foo = my_foo;
        |         c.bar = my_bar;
        |         return doit(c);
        |     }
        | 
        | This'll all get inlined and specialized to death.
 
        | thrtythreeforty wrote:
        | Got it. That is definitely a cool insight. Thanks for
        | sharing!
 
    | RcouF1uZ4gsC wrote:
    | > Would this allow you to mark a "class" (or individual
    | members of that class) as "hot, please monomorphize" by
    | making it always_inline, or "cold, please use regular
    | pointers" by omitting always_inline?
    | 
    | From my understanding of the technique, this is correct.
 
  | verdagon wrote:
  | IIRC, the branch predictor can work with function pointers as
  | well. It can sometimes mitigate problems like this, though not
  | as completely as just monomorphizing away the run-time
  | branching/calling like they did.
 
| exikyut wrote:
| A weekend project idea I just thought of, but wouldn't be able to
| commentate on as well as someone more experienced in memory
| management:
| 
| Run a bunch of memory allocator demo programs (trivial and
| complex) under _blinkenlights_
| (https://justine.lol/blinkenlights/), a program execution
| visualizer that shows how programs interact with the stack and
| heap as they run.
| 
| For bonus points, proceed to explain _why_ their visualized
| performance is different. :)
 
| meisel wrote:
| Will this have support for macOS by inserting itself, as jemalloc
| does, as the default malloc zone? If so, does that mean it will
| be very fast at determining if a given pointer was allocated by
| libpas?
 
  | pizlonator wrote:
  | libpas can do that, though some of the code to do it is not in
  | the WebKit repo (but the enumerator support - a complex
  | component that is necessary to fully replace system malloc - is
  | in the repo).
  | 
  | libpas can already determine if it owns a pointer in O(1) time
  | and no locking. It uses megapages for this purpose. Though,
  | right now, it's not configured to fully leverage this (if you
  | pass an unknown ptr to free then as a last step it'll lock the
  | heap_lock even though it would be a small change to just use
  | megapages there).
 
| donfuzius wrote:
| This looks awesome and I really like the tone of the
| documentation. Great work! At the same time I'm sort of worried
| that such a deep change in the memory management of my day-to-day
| browser opens up a significant security risk for a considerable
| time, until new bugs are found and closed. I'm sure this got more
| than one pair of eyes at Apple, but I'd really welcome a specific
| bug bounty on such a change...
 
  | pizlonator wrote:
  | That's a fair point. Code churn creates bugs. That's a good
  | argument for slowing down the pace of browser development
  | overall, except maybe the kind of development that focuses on
  | security hardening.
  | 
  | Libpas has some security features that bmalloc didn't have. It
  | goes further than bmalloc in avoiding inline metadata (bmalloc
  | had an IsoHeap<> thing that used scrambled inline freelists).
  | Also, libpas opens the door to isoheaping _all_ allocations in
  | WebKit (see https://bugs.webkit.org/show_bug.cgi?id=231938),
  | which would be a big security win.
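  | 
  | In general terms, isoheaping gives each type its own heap, so a
  | freed slot is only ever reused for an object of the same type; a
  | dangling pointer then still points at "some object of type T",
  | which defeats many type-confusion exploits. A toy sketch (not
  | WebKit's actual IsoHeap<> API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy isolated heap: each type gets its own iso_heap, and a freed slot
 * is reused only within that heap. For brevity this sketch threads a
 * freelist through the freed objects themselves; libpas goes further
 * and keeps such metadata out of line, which is part of the hardening
 * discussed above. Objects must be at least pointer-sized. */
typedef struct iso_heap {
    size_t object_size;
    void* free_list;
} iso_heap;

static void* iso_alloc(iso_heap* heap) {
    if (heap->free_list) {
        void* p = heap->free_list;
        heap->free_list = *(void**)p;  /* pop a previously freed slot */
        return p;
    }
    return malloc(heap->object_size);
}

static void iso_free(iso_heap* heap, void* p) {
    *(void**)p = heap->free_list;      /* push; reuse stays per-type */
    heap->free_list = p;
}
```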
  | 
  | Also, libpas has been fuzzed to death. In test_pas, there's a
  | bunch of fuzz tests I call "chaos" and I've run those for
  | extended periods of time (days, weeks) after major changes. Of
  | course it still has bugs, but considering how insane the heap
  | organization is, libpas also makes attackers' lives much harder
  | in the short term.
  | 
  | The combination of libpas's enhanced security features and the
  | fact that it substantially changes heap layout might be enough
  | to cancel out the fact that it has increased bug tail. And...
  | if you're worried about bug tail from major changes introducing
  | security vulns then there are probably much worse major changes
  | that someone could do.
 
| ryanianian wrote:
| Are there any projects that surgically augment the memory APIs to
| be more cooperative? Sure `malloc` implementations will make
| various tradeoffs, but what if the malloc/free APIs were expanded
| to expose more programmer-intent or cooperative defrag, etc?
| 
| - Perhaps `malloc` could ask for an intended lifecycle (think GC
| generation) that could swap arenas or other algo internals. This
| opens us up to meta-allocators.
| 
| - Perhaps program lifecycle hooks that give allocators more
| context into how already-allocated memory intends to be accessed.
| ("Done bootstrapping the server, so now new memory will be
| accessed by single threads and only live for one request", etc)
| 
| - Perhaps the programmer could periodically call `remalloc` to
| allow objects to be moved or promoted between arenas/generations.
| 
| - Perhaps mallocs/frees could be more intelligently batched
| (where explicit structs/arrays don't normally make sense) for
| surprise-free defrag or OS allocation.
| 
| - Intelligent pairing of malloc/free with explicit notions of
| intended re-use; some indication of "I will free this and ask for
| the same memory again in the same call" or auto-struct-ing
| mallocs which know that we tend to stack
| malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual
| OS/memory involvement whatsoever.
| 
| - Perhaps `identmalloc` could handle common flyweight patterns of
| instantiate-unless-exists; e.g. intelligently turn shared state
| into offsets into a contiguous arena. This may be more useful in
| C++ with existing conventions around copy-by-value constructs.
| 
| Some of these are indeed components of existing
| mallocs/allocators (e.g. libpas is type-aware), but my point is
| that allocators have to jump through hoops to derive heuristics,
| and these heuristics often amount to "the programmer often
| intends to use memory like X." Performance engineers then
| pattern-match for X rather than cooperating with the compiler to
| expose X intrinsically from the beginning. The above ideas impose
| _slightly_ different programming paradigms, so there's still the
| white whale of a perfect drop-in heuristic, but it seems less
| subject to temporary or local maxima.
| 
| So: Are there compelling (albeit slightly less general-purpose)
| alternatives where the allocator and programmer communicate
| additional hints or other API contracts that allow the allocator
| to move away from heuristics and into proven patterns? Or is this
| basically akin to "use c++/smart pointers or just get a gc"?
 
  | pizlonator wrote:
  | I've thought about this a lot. I think that overall,
  | malloc/free/new/delete are already so hard to use that if you
  | added more stuff, it would create too much cognitive load for
  | the programmer. The result would be that the hints the
  | programmer gave you would be more wrong than the malloc's best
  | guess.
  | 
  | Let me go through these point by point and offer some thoughts.
  | 
  | - Perhaps `malloc` could ask for an intended lifecycle (think
  | GC generation) that could swap arenas or other algo internals.
  | This opens us up to meta-allocators.
  | 
  | It could, and if that information was accurate, then it would
  | be profitable. But if that information was inaccurate, then
  | you'd get worse perf and worse memory usage than if you let
  | malloc just do whatever it wants. Also, anything like this
  | creates more fragmentation since it creates a new opportunity
  | for the programmer to force the allocator to pick something
  | other than the first fit.
  | 
  | - Perhaps program lifecycle hooks that give allocators more
  | context into how already-allocated memory intends to be
  | accessed. ("Done bootstrapping the server, so now new memory
  | will be accessed by single threads and only live for one
  | request", etc)
  | 
  | If you convey this information after you've already allocated
  | the object then there isn't much the malloc could do. If the
  | malloc uses this information to decide where the next malloc
  | call puts memory then you've created a fragmentation risk.
  | 
  | - Perhaps the programmer could periodically call `remalloc` to
  | allow objects to be moved or promoted between
  | arenas/generations.
  | 
  | Yes, they could, and this is a great idea. I believe that
  | calling realloc (or just mallocing a new object, memcpying or
  | copy-constructing, and freeing the old one) relieves
  | fragmentation because it gives the malloc an opportunity to put
  | the object in a more profitable location.
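  | 
  | The pattern can be as small as this (a generic sketch, not a real
  | libpas API; the `remalloc` from the parent comment doesn't exist):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy a long-lived object into a fresh allocation and free the old
 * one. This gives the allocator a chance to place the object somewhere
 * denser and to reclaim the page the old copy was pinning. */
static void* relocate(void* old, size_t size) {
    void* fresh = malloc(size);
    if (!fresh)
        return old;        /* on failure, keep the old copy */
    memcpy(fresh, old, size);
    free(old);
    return fresh;
}
```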
  | 
  | - Perhaps mallocs/frees could be more intelligently batched
  | (where explicit structs/arrays don't normally make sense) for
  | surprise-free defrag or OS allocation.
  | 
  | libpas batches allocation and deallocation. These strategies
  | help performance. They have only a small effect on memory usage
  | (and that effect is that you use a bit more memory by
  | batching).
  | 
  | - Intelligent pairing of malloc/free with explicit notions of
  | intended re-use; some indication of "I will free this and ask
  | for the same memory again in the same call" or auto-struct-ing
  | mallocs which know that we tend to stack
  | malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual
  | OS/memory involvement whatsoever.
  | 
  | Allocators like libpas (and mimalloc and many others) have very
  | cheap malloc/free calls that perform great in these situations.
  | They don't do any locking or call into the OS in the common
  | case when you do this.
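  | 
  | The common-case cheapness typically comes from per-thread caching,
  | which can be sketched like this (one size class only, for
  | illustration; real allocators key the cache by size class):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Per-thread cache of recently freed blocks: a tight
 * malloc(foo); free(foo) pair becomes a pointer push and pop, with no
 * lock and no OS call. This sketch handles a single size class, so all
 * requests through it must be the same size. */
#define CACHE_DEPTH 8

static _Thread_local void* cache[CACHE_DEPTH];
static _Thread_local int cache_top;

static void* fast_alloc(size_t size) {
    if (cache_top > 0)
        return cache[--cache_top];  /* hit: no lock, no OS call */
    return malloc(size);            /* miss: fall back to the allocator */
}

static void fast_free(void* p) {
    if (cache_top < CACHE_DEPTH) {
        cache[cache_top++] = p;     /* hit: just remember the block */
        return;
    }
    free(p);                        /* cache full: really free it */
}
```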
  | 
  | - Perhaps `identmalloc` could handle common flyweight patterns
  | of instantiate-unless-exists; e.g. intelligently turn shared
  | state into offsets into a contiguous arena. This may be more
  | useful in C++ with existing conventions around copy-by-value
  | constructs.
  | 
  | I think that most fast mallocs like libpas (and others) make
  | this situation cheap enough that you don't need anything else.
  | 
  | - Some of these are indeed components of existing
  | mallocs/allocators (e.g. libpas is type-aware), but my point is
  | that allocators have to jump through hoops to derive
  | heuristics, and these heuristics often amount to "the
  | programmer often intends to use memory like X." Performance
  | engineers then pattern-match for X rather than cooperating with
  | the compiler to expose X intrinsically from the beginning. The
  | above ideas impose slightly different programming paradigms, so
  | there's still the white whale of a perfect drop-in heuristic,
  | but it seems less subject to temporary or local maxima.
  | 
  | If the program is simple enough that you understand the object
  | lifetimes completely, or you have enough resources to throw at
  | the problem that you can completely understand lifetimes even
  | for a complex program, then the best path forward is to just
  | not call malloc. Lay out your memory a priori or write a highly
  | customized "allocator" (that may not even have an API that
  | looks anything like malloc).
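  | 
  | For example, when all objects die together (end of a frame, end of
  | a request), the whole "allocator" can be a bump pointer over a
  | preallocated region:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bump allocator: allocation is one pointer bump, and "free" is
 * resetting the whole arena at once. No per-object bookkeeping. */
typedef struct arena {
    uint8_t* base;
    size_t capacity;
    size_t used;
} arena;

static void* arena_alloc(arena* a, size_t size) {
    size = (size + 15) & ~(size_t)15;   /* keep 16-byte alignment */
    if (a->used + size > a->capacity)
        return NULL;                    /* arena exhausted */
    void* p = a->base + a->used;
    a->used += size;
    return p;
}

static void arena_reset(arena* a) {
    a->used = 0;                        /* "frees" every object at once */
}
```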
  | 
  | If the program is too complex for such reasoning, then I
  | believe there's no way for the programmer to tell the malloc
  | anything it can't deduce with heuristics. The programmer can
  | say some stuff, but when you have a large sophisticated heap,
  | then the programmer will not be able to predict what it is
  | about the heap that creates pain for the malloc. In most
  | experiments where I've tried to do this kind of tuning, I end
  | up with something that runs slower or uses more memory, and the
  | investigation usually leads me to learn that my hypothesis
  | about the malloc's pain point was totally wrong.
  | 
  | - So: Are there compelling (albeit slightly less general-
  | purpose) alternatives where the allocator and programmer
  | communicate additional hints or other API contracts that allow
  | the allocator to move away from heuristics and into proven
  | patterns? Or is this basically akin to "use c++/smart pointers
  | or just get a gc"?
  | 
  | Maybe! I'm not proving a negative here. But everything I've
  | tried in this area has been a negative result for me.
 
    | twoodfin wrote:
    | _I've thought about this a lot. I think that overall,
    | malloc/free/new/delete are already so hard to use that if you
    | added more stuff, it would create too much cognitive load for
    | the programmer. The result would be that the hints the
    | programmer gave you would be more wrong than the malloc's
    | best guess._
    | 
    | I think this is a generally applicable lesson. Programmers
    | overestimate their ability to statically optimize the dynamic
    | behavior of their code, _especially_ relative to a runtime
    | that _can_ access and act on dynamic state.
    | 
    | This comes up in the database world as application developers
    | desiring to "pin" certain tables in memory. Experience has
    | shown this to be an anti-feature: As the schema and workload
    | of a database system evolves, these pinning decisions can
    | become obsolete, starving more relevant tables of buffer
    | space. Dynamic decisions, even made by simple heuristics like
    | LRU, are an empirically winning strategy.
    | 
    | The failure of Itanium/EPIC and the "sufficiently smart
    | compiler" to beat traditional OoO execution is another
    | example from a different computing domain.
 
| akrymski wrote:
| Now if only I could _limit_ the amount of memory a browser would
| consume! The fact that desktop browsers grind a 16GB machine to a
| halt, while on mobile devices having 1000s of tabs is a non-issue
| is plain wrong. Since I can't limit a process' memory on Mac OS
| (any OS?), why not give me a setting in preferences to do so?
| After all the browser is really just a VM at this stage. I'm
| quite certain I don't need 100 tabs in active memory at once.
| 
| While at it, why can't we hibernate tabs? Write the JSC heap and
| DOM state to disk, the way VMs do it?
| 
| What's taking up the majority of my memory anyway - the graphics
| / gpu side or the JS VM?
| 
| There's a whole spectrum of startups trying to solve the webkit
| memory consumption issues crudely, such as
| https://www.mightyapp.com
 
  | pizlonator wrote:
  | I think that desktop browser users expect that their tabs
  | generally won't get reloaded. On mobile, users have generally
  | accepted that their tabs will get reloaded. Maybe browsers
  | should just reload tabs if anything fishy happens or you're not
  | using them.
  | 
  | Hibernating tabs to disk is hard because the JS heap (and the
  | DOM heap) will contain lots of stuff that is tied to handles
  | you got from the OS. It's not impossible. But it would require
  | a lot of work to build specialized serialization functions for
  | stuff that has native state attached to it that the browser
  | doesn't control or have good visibility into.
  | 
  | I think that some websites use a lot of graphics memory while
  | other websites use a lot of JS memory. And there are other
  | categories of memory usage, too.
  | 
  | Maybe if they put the cloud browsers on the blockchain using
  | machine learning, and then write them in Rust and compile them
  | to wasm, then it will be even better.
 
    | akrymski wrote:
    | > I think that desktop browser users expect that their tabs
    | generally won't get reloaded
    | 
    | But they do get reloaded. There's absolutely no guarantee
    | provided, and browser tabs will get swapped out of memory as
    | necessary (eg opening lots of new tabs or apps).
    | 
    | This is easily solvable though: let me pin tabs that I really
    | need to persist.
 
      | pizlonator wrote:
      | Sure they do, but much less aggressively than mobile
      | browsers. I think that desktop Safari will try to keep your
      | tab live so long as nothing fishy happens. Based on my
      | experiences using Chrome, it seems to do the same thing.
 
  | tptacek wrote:
  | This is a thread about an allocator design, not really about
  | Webkit.
 
    | akrymski wrote:
    | It's also a thread about memory allocation in Webkit tho.
 
| jibcage wrote:
| Fun fact: I believe PAS stood for "Phil's awesome shit." Whether
| that's subject to NDA is anyone's guess :D
 
  | pizlonator wrote:
  | I can neither confirm nor deny this awesome conjecture.
 
___________________________________________________________________
(page generated 2022-06-01 23:00 UTC)