[HN Gopher] All About Libpas, Phil's Super Fast Malloc
___________________________________________________________________
 
All About Libpas, Phil's Super Fast Malloc
 
Author : Jarred
Score  : 238 points
Date   : 2022-06-01 13:45 UTC (9 hours ago)
 
web link (github.com)
w3m dump (github.com)
 
| carlsborg wrote:
| Trying to understand the license for bmalloc, and webkit in
| general. The WebCore folder has an Apple license and two LGPL
| licenses. The root folder has no license file. Can anyone
| comment?
 
  | pizlonator wrote:
  | libpas is BSD 2-clause. so is bmalloc. libpas is in the bmalloc
  | directory in WebKit, but is really a separate piece of code.
  | 
  | each file has its own license header.
 
| SemanticStrengh wrote:
 
| ridiculous_fish wrote:
| Hey Phil, thanks for the write up! Curious why this is only used
| for JSC's JIT, and not to back its heap? Did the design of the
| JSC GC influence libpas at all?
 
  | pizlonator wrote:
  | It's used for JSC's JIT and for all heap allocations in
  | JavaScriptCore and WebKit that aren't GC allocations.
  | 
  | Riptide (the JSC GC) has a similar design and some of the key
  | ideas were first pioneered there:
  | 
  | - The use of fast bitvector searches to find an eligible page
  | (Riptide calls it first eligible block, I think).
  | 
  | - The use of bitvector simd to change object states in bulk.
  | 
  | - Riptide uses a mark bitvector that is indexed by minalign
  | atom (16 bytes) and a block size of 16KB. Libpas's default
  | "small segregated" configuration is 16 byte minalign, 16KB
  | block, and an alloc bitvector indexed by minalign.
  | 
  | I have since thought about how Riptide could be wired up to use
  | libpas. It's possible to do it. Already today you could create
  | a libpas page_config that leaves room for mark bits in the page
  | header, and makes them use the same indexing as the alloc bits.
  | Riptide has a "newly allocated" bitvector that serves almost
  | the same purpose as the alloc bits (kinda, if you squint really
  | hard) - so you'd go from Riptides newly_allocated and mark
  | bitvectors, to libpas's alloc bitvector and a custom mark
  | bitvector stuffed into page headers using an exotic
  | page_config. If JSC did this, it would buy:
  | 
  | - Better decommit policies than Riptide. I did a lot of tuning
  | in libpas to make decommit Just Right (TM).
  | 
  | - Faster allocation of medium-sized objects.
  | 
  | - Possibly better memory efficiency.
  | 
  | But, Riptide is already very nicely tuned, and it gets some
  | benefit from libpas's decommit rules because Riptide calls
  | fastMalloc to get memory and sometimes uses fastFree to return
  | it. fastFree calls bmalloc::api::free, which calls libpas's
  | bmalloc_deallocate, which then goes to pas_deallocate with the
  | BMALLOC_HEAP_CONFIG. So, Riptide already gets some of the
  | goodness of libpas by sitting on top of it.
 
  | [deleted]
 
| dragontamer wrote:
| General-purpose memory allocation is a lie. Every allocator has
| use cases where it's faster than the others.
| 
| I think that's why we keep seeing newer malloc schemes pop up,
| because the performance of the heap 100% depends on the use-case,
| and different people have different use cases.
| 
| Still, studying everyone else's heaps (and garbage collectors, a
| closely related discussion) is probably good for high-performance
| programmers.
 
  | pizlonator wrote:
  | Yeah, you're totally right.
  | 
  | Libpas beats other mallocs in WebKit. I also had benchmarks
  | involving non-WebKit workloads and the results were all over
  | the place (sometimes slower than other mallocs by a lot,
  | sometimes faster by a lot - same thing with memory, sometimes
  | more efficient, sometimes less). This didn't surprise me; I've
  | seen this before when writing memory management code.
  | 
  | I think it makes sense for large software projects that use
  | malloc a lot and care about perf to eventually either write
  | their own malloc or to take an existing one and then tune it a
  | lot.
 
  | latenightcoding wrote:
  | I've seen a few opensource projects archive their custom
  | allocators because they were not beating the system's one or
  | jemalloc. So if you have the skill and time then yeah go for
  | it, else stick with general purpose ones.
 
    | mananaysiempre wrote:
    | There are a couple of approaches that might legitimately be
    | simpler than using a general-purpose allocator: in a single-
    | pass batch process, just throw things on the floor and let
    | the OS sort it out on exit[1] (I think the D compiler does
    | this); in a multiple-pass batch process, make each pass
    | litter its own room (arena, obstack, etc.) then demolish said
    | room when done (GCC does this or at least did in the past).
    | 
    | On the other hand, these may require rearranging the logic
    | somewhat to fit them, so the question of how much
    | rearrangement to tolerate still remains. And, well[4],
    |     /*
    |      * We divy out chunks of memory rather than call malloc
    |      * each time so we don't have to worry about leaking
    |      * memory.  It's probably not a big deal if all this
    |      * memory was wasted but if this ever goes into a library
    |      * that would probably not be a good idea.
    |      *
    |      * XXX - this *is* in a library....
    |      */
    | 
    | [1] I am rather dismayed by how Raymond Chen advocates for
    | this for memory[2] but insists it could not possibly be a
    | good idea for Windows GDI handles, no, you stupid sloppy
    | programmer[3]. Maybe because he was involved in GDI from the
    | other side? (Of course, for file descriptors on Unix it's
    | still the standard practice.)
    | 
    | [2] http://bytepointer.com/resources/old_new_thing/20120105_0
    | 06_...
    | 
    | [3] http://bytepointer.com/resources/old_new_thing/20051014_3
    | 05_...
    | 
    | [4] https://github.com/the-tcpdump-
    | group/libpcap/blob/4c1e516dd2...
 
      | jstimpfle wrote:
      | This "[2] vs [3]" is a good find. I have several possible
      | explanations. It could be the difference of 7 years between
      | those posts. It could also be that Windows Handles can't be
      | cleaned up in userspace. Note how malloc chunks up the
      | "handles" (i.e. pages / regions) that it got from Windows.
      | This is, AFAIK, different from HANDLEs, which have to be
      | requested from the system one-by-one and can't be chunked
      | up. (Or am I wrong? I actually don't know a lot about how
      | this works under the hood.)
 
    | dragontamer wrote:
    | It's very easy to beat the general-purpose ones if you know
    | your exact use case.
    | 
    | Ex: a 16-bit pointer gives a 65536-element heap. Assume 8
    | bytes per element; that's 512KB of space. A bit small, but
    | large enough to do a lot of things.
    | 
    | 65536 elements can be represented as a bitmask. The bitmask
    | only takes up 8192-bytes (8KB), which fits inside of 16
    | AVX512 registers (Intel offers 32x AVX512/ZMM registers btw).
    | Or it fits inside of GPU __shared__ memory.
    | 
    | If you need multithreaded, you perform atomic AND and atomic
    | OR to clear, and set, the bitmask as appropriate.
    | 
    | -------
    | 
    | Do you know the exact size of your heap? The size of the
    | pointer? The amount of parallelism involved? What about the
    | size of the elements? Access pattern? (Bump-alloc'd Linked
    | Lists are a sequential traversal over RAM btw, so that's very
    | efficient), etc. etc.
    | 
    | The 64-bit pointer is overkill for most people's purposes.
    | 32-bits represents 4-billion objects, and if each object is
    | 16-bytes long that's 64GBs of RAM. Right here right now, a
    | 32-bit pointer / custom allocator already offers many
    | benefits over the 64-bit pointer. (half-sized pointers, more
    | data in cache, etc. etc.)
 
      | thesz wrote:
      | 512 in AVX512 is the number of bits per register. You are
      | off by factor of 8.
 
  | cmrdporcupine wrote:
  | 100%, and also I think there's a tendency among people who
  | compare 'higher up the stack' in GC'd languages to manual heap
  | allocation in C/C++ etc. to imply that the latter is almost
  | "free" from a performance POV, when in fact, under the covers,
  | these allocators still do a lot of the same kind of walking of
  | data structures that also happens inside a tracing GC.
  | 
  | You can get unpredictable pauses from malloc/free, too.
 
    | asveikau wrote:
    | I think in either case, GC or not, you write something
    | intuitively and then when it becomes an actual problem, you
    | study the patterns and improve them, but most of the time you
    | can leave it alone and the general purpose thing is good
    | enough.
    | 
    | Or if you are in a niche like audio/video or something, you
    | avoid allocations all together during the bulk of the code.
 
      | cmrdporcupine wrote:
      | Oh for sure. Premature optimizing is usually bad form.
      | 
      | What is tricky about performance issues in allocation is
      | that they can be hard to profile. There are tools for
      | analyzing GC performance in GC'd languages, but sometimes
      | malloc/free can just be a big black box.
 
        | djmips wrote:
        | Premature optimization is not bad form when it's re-
        | framed as good architecture. It's not 'usually bad form'
        | to architect something from the outset, using your
        | experience, and that's something everyone understands.
        | This pervasive disdain for premature optimization leads
        | to bad architecture, which often leads to expensive
        | rewrites. Because the word 'optimization' is so
        | overloaded and treated with disdain, it feels like we
        | need new language to talk about what's really meant by
        | 'premature optimization' (the bad kind).
 
        | asveikau wrote:
        | That's one reason I point out a few comments above that
        | certain niches will need to take it into account ahead of
        | time. Eg. An a/v application will typically allocate all
        | buffers up front and re-use them frequently rather than
        | return them to the allocator. A lot of server
        | applications will want to keep per-client memory usage
        | low. For general purposes, there's the general purpose
        | allocator.
 
  | tptacek wrote:
  | This is a pretty well-known CS truism, isn't it? One of the
  | first papers I ever fell in love with covers it in detail:
  | https://users.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pd...
 
| lidavidm wrote:
| I wonder how this compares to jemalloc, mimalloc, snmalloc?
 
  | pizlonator wrote:
  | I never got a chance to compare it to those, since I was most
  | interested in beating bmalloc. And I mainly wanted to beat it
  | on Safari workloads.
  | 
  | I believe bmalloc was previously compared against jemalloc and
  | tcmalloc, also using Safari workloads, and bmalloc was
  | significantly faster at the time.
 
    | IshKebab wrote:
    | For other people who have never heard of bmalloc - it's a
    | custom allocator used only by WebKit. I guess it's not
    | surprising they added one since the Mac system allocator is
    | extremely slow. A custom allocator is pretty much a free 20%
    | speedup on Mac (depending on your workload), but I found it
    | made no difference on Linux. Haven't tried on Windows.
 
      | pizlonator wrote:
      | Just some credit where credit is due.
      | 
      | I measured the system malloc crushing all other mallocs on
      | some workloads and they were the kind of workloads that
      | some folks run every day.
      | 
      | I measured the system malloc crushing most other mallocs on
      | memory efficiency on most workloads. System malloc is
      | really good at reusing memory and has very mature decommit
      | policies that easily rival what I came up with.
      | 
      | So there's that.
 
        | pcwalton wrote:
        | macOS malloc was _incredibly_ slow for Pathfinder, far
        | behind every other OS. Everything became bottlenecked on
        | it. It was a free 2x speedup if not more to switch to
        | jemalloc.
        | 
        | I suspect this is because Pathfinder's CPU portion is a
        | multicore workload and macOS allocator performs poorly
        | when under heavy multithreaded contention. It probably
        | just isn't the kind of workload that macOS allocator was
        | tuned for.
 
        | pizlonator wrote:
        | My understanding is that the system malloc is excellent
        | under contention, but has a high baseline cost for every
        | malloc/free call.
        | 
        | I don't remember exactly what workloads it performed
        | really great at (and if I did I dunno if I could say),
        | but I do remember they were parallel, and the speed-ups
        | got bigger the more cores you added.
        | 
        | Everything else about your experience matches mine.
        | Libpas is much faster than system malloc in WebKit and
        | JSC. The difference isn't 2x on my preferred benchmarks
        | (which are large and do lots of things that don't rely on
        | malloc), but it is easily more than 2x on smaller
        | benchmarks. So your 2x result sounds about right.
 
        | meisel wrote:
        | What sort of workloads are those? I have yet to see a
        | workload where macOS's system allocator is not
        | substantially slower than alternatives like jemalloc.
 
      | astrange wrote:
      | Alas, many people think performance engineering is only
      | about the wall clock time of their program and not what
      | impact it has on anything else.
      | 
      | malloc performance highly depends on how large the
      | allocations are though.
 
      | KerrAvon wrote:
      | What's the workload where you found that to be the case?
      | The system allocator is heavily tuned for general use, and
      | should be very difficult to beat -- generally.
 
        | IshKebab wrote:
        | It was a non-traditional compiler written in C++.
        | 
        | > The system allocator is heavily tuned
        | 
        | Perhaps but it probably has more constraints than a
        | custom allocator too, e.g. backwards compatibility with
        | software that unwittingly depends on its exact behaviour.
        | 
        | I found a discussion of this where people reported
        | various speedups (10%, 100%, 300%) on Mac by switching to
        | a custom allocator:
        | https://news.ycombinator.com/item?id=29068828
        | 
        | I'm sure there are better benchmarks if you Google it.
 
| jeffbee wrote:
| Well, the other browser's design doc is nowhere near as detailed,
| but if you want to compare and contrast the overall design, here
| is Chromium's PartitionAlloc:
| 
| https://chromium.googlesource.com/chromium/src/+/master/base...
 
| rq1 wrote:
| > Consequently, passing a function pointer (or struct of function
| pointers), where the pointer points to an always_inline function
| and the callee is always_inline results in specialization akin to
| template monomorphization. This works to any depth; the compiler
| won't be satisfied until there are no more always_inline function
| calls. This fortuitous development in compilers allowed me to
| write very nice template code in C. Libpas achieves templates in
| C using config structs that contain function pointers --
| sometimes to always_inline functions (when we want specialization
| and inlining) and sometimes to out-of-line functions (when we
| want specialization but not inlining). Additionally, the C
| template style allows us to have true polymorphic functions. Lots
| of libpas slow paths are huge and not at all hot. We don't want
| that code specialized for every config. Luckily, this works just
| fine in C templates -- those polymorphic functions just pass
| around a pointer to the config they are using, and dynamically
| load and call things in that config, almost exactly the same way
| that the specialized code would do. This saves a lot of code size
| versus C++ templates.
| 
| This document is incredible!
 
  | photochemsyn wrote:
  | Seems like a real tour de force of low-level memory management.
  | Prior to seeing this my go-to for understanding custom low-
  | level (for embedded) memory managers was this (explains the
  | basics of stack, block, bitmap and thread-local allocation) but
  | now I have something far more complicated to get confused
  | about.
  | 
  | https://www.embedded-software-engineering.de/dynamic-memory-...
 
    | pizlonator wrote:
    | Thank you! :-)
    | 
    | I always wanted to write a malloc, and this is what I came up
    | with after 3.5 years or so.
 
  | [deleted]
 
| adamnemecek wrote:
| I'm always suspicious of things that are named after the maker.
 
  | astrange wrote:
  | A better rule is to be suspicious of anything with "fast" or
  | synonyms in the name. It either won't be fast or it'll be
  | insecure.
 
  | fmajid wrote:
  | Did you know the entire US Air Traffic Control system runs on
  | long-discontinued computers and is written in JOVIAL, "Jules'
  | Own Version of the International Algebraic Language" (i.e.
  | ALGOL), named after Jules Schwartz?
  | 
  | https://en.wikipedia.org/wiki/JOVIAL
  | 
  | The FAA's been trying to replace it for ages, but AFAIK it's
  | still ongoing.
 
  | thrtythreeforty wrote:
  | Like Linux, or gcc.godbolt.org?
  | 
  | I kid, but I don't see it as a downside for software. Naming a
  | scientific phenomenon after yourself is a red flag though, see
  | also the Crackpot Index [1].
  | 
  | [1]: https://math.ucr.edu/home/baez/crackpot.html
 
    | tialaramex wrote:
    | Godbolt _says_ Compiler Explorer right there, the fact people
    | call it  "Godbolt" is because that's the memorable name
    | rather than because Matt Godbolt tried to name it after
    | himself. He calls it Compiler Explorer.
    | 
    | You can't control what people call things. My favourite
    | example is Blackfriars Bridge in London. When that bridge was
    | built, it was formally named after William Pitt. But like,
    | who cares? In what way is this bridge, Pitt's bridge? People
    | knew where it was, the Blackfriars area of London, so too bad
    | William Pitt, everybody called it Blackfriars Bridge. And so,
    | that's what its name is.
    | 
    | The priory that gave "Blackfriars" its name hadn't existed
    | for two hundred years by the time the bridge was proposed.
    | Nobody living when the bridge was finished would
    | have ever known anybody who'd met any of the friars during
    | their lifetime. Nevertheless that part of London had "always"
    | been called Blackfriars, and so a bridge and a railway
    | station built many years later are named that too.
 
      | bool3max wrote:
      | People call it Godbolt because you access it via
      | godbolt.org.
 
  | throwaway744678 wrote:
  | From the top of my head: Linux, Bourne shell...
 
    | adamnemecek wrote:
 
      | kbenson wrote:
      | Sometimes things are named after the maker ex post facto
      | because they need a way to distinguish it. The Bourne
      | shell, mentioned here, is likely called that for the same
      | reason the very first shell was called the Thompson shell,
      | in that they were written by a single person originally and
      | not really given a name (the original being from Ken
      | Thompson), since they were just different (or the original)
      | implementations of the system shell. Eventually you're
      | going to need a way to distinguish one from the other
      | though, right? Noting who wrote it is as good (probably
      | better) than any other option.
      | 
      | Linux was probably named differently, but also probably
      | doesn't really match whatever criteria you're using to be
      | suspicious. It started as a hobby project to make a free
      | and open source UNIX system, of which there were few
      | (none?), written by one guy. There was no expectation that
      | other people would use it for anything other than something
      | interesting to play around with, and using a portion of
      | your own name for your own little hobby project isn't
      | something I would look askance at, personally, especially
      | in an age where there was no expectation that it might grow
      | to anything, because that was definitely not the norm at
      | the time.
 
        | ncmncm wrote:
        | "Linux" was forced on him.
 
  | pizlonator wrote:
  | It's called libpas, and pas stands for whatever you like.
 
| mattgreenrocks wrote:
| I admit I'm a sucker for metacircularity, but have to ask:
| 
| > The bootstrap heap has hacks to allow itself to allocate that
| array out of itself.
| 
| Not quite following that, can someone elaborate?
 
  | convolvatron wrote:
  | I suspect it's this: generally, when you write a heap, it
  | needs to keep some metadata. If you have a multi-heap system
  | (really handy), then it's generally easier to host that
  | metadata on another heap. Sometimes you really want that,
  | because the target heap might be specialized for different
  | kinds of data. So I assume this is carefully constructing the
  | heap state so that it can be used for its own metadata as it's
  | being constructed.
 
  | pizlonator wrote:
  | The bootstrap heap has a freelist that is just an array of
  | entries that tell you where the free memory is.
  | 
  | That array has to be allocated somewhere.
  | 
  | All pas_simple_large_free_heaps allocate that array from the
  | bootstrap heap. Including the bootstrap heap. So, the
  | pas_simple_large_free_heap has hacks that go something like:
  | "if I'm allocating or freeing something and I need to allocate
  | or free the free list, then _very carefully_ also perform that
  | allocation /deallocation in tandem with the one I'm doing". The
  | main hack there is just that there's some statically allocated
  | "slop" for the bootstrap free list, so if you want to add to it
  | in the middle of an allocation, you add it to the slop array,
  | and then after you're done you reallocate the free list and
  | copy stuff from slop into it.
 
| sydthrowaway wrote:
| Did the author leak this IP when he left Apple?
 
  | pizlonator wrote:
  | lol
  | 
  | WebKit is open source. Libpas was always designed with WebKit
  | in mind. It landed in WebKit half a year before I left Apple.
 
| RcouF1uZ4gsC wrote:
| > Consequently, passing a function pointer (or struct of function
| pointers), where the pointer points to an always_inline function
| and the callee is always_inline results in specialization akin to
| template monomorphization. This works to any depth; the compiler
| won't be satisfied until there are no more always_inline function
| calls. This fortuitous development in compilers allowed me to
| write very nice template code in C. Libpas achieves templates in
| C using config structs that contain function pointers --
| sometimes to always_inline functions (when we want specialization
| and inlining) and sometimes to out-of-line functions (when we
| want specialization but not inlining). Additionally, the C
| template style allows us to have true polymorphic functions. Lots
| of libpas slow paths are huge and not at all hot. We don't want
| that code specialized for every config. Luckily, this works just
| fine in C templates -- those polymorphic functions just pass
| around a pointer to the config they are using, and dynamically
| load and call things in that config, almost exactly the same way
| that the specialized code would do. This saves a lot of code size
| versus C++ templates.
| 
| I thought this was super interesting. Seems like a very nice
| technique for when you use function pointers in C for stuff like
| sort comparison. There was an earlier thread about PostgreSQL
| that discussed how they created multiple sort functions to avoid
| the overhead of an indirect function pointer call when sorting
| things like integers, where the function call dwarfs the
| comparison cost. This seems like a technique that could have been
| used to avoid duplicating the sorting functions while still
| getting the benefits.
 
  | thrtythreeforty wrote:
  | I still don't quite understand how the technique saves code
  | size versus C++ templates, although I am quite eager to
  | understand it since template instantiation is a thorn in my
  | side when writing code for a microcontroller with 32KB of code
  | space.
  | 
  | Is the idea that the struct contains pointers to always_inline
  | functions, and that the compiler has visibility into this at
  | the top of the always_inline call stack, so the always_inline
  | gets monomorphized even through multiple layers of pointer
  | indirection? If I'm understanding it right, that's a pretty
  | nice technique. Would this allow you to mark a "class" (or
  | individual members of that class) as "hot, please monomorphize"
  | by making it always_inline, or "cold, please use regular
  | pointers" by omitting always_inline?
  | 
  | C++ templates would force you to monomorphize every variant, or
  | use vtables (function pointers) for every variant, without the
  | ability to choose the approach for each subclass. The compiler
  | is also spectacularly bad at recognizing deduplication
  | opportunities between template specializations.
 
    | verdagon wrote:
    | I think that's right.
    | 
    | C++ binaries tend to be pretty large because they use
    | templates/monomorphization a lot, even when there aren't
    | really meaningful performance gains. For example, it often
    | doesn't make sense to have 20 instantiations of std::vector,
    | when only two of them are used in the critical path. Rust can
    | be even further in this unfortunate direction because of the
    | borrow checker's restrictions around polymorphism and
    | abstraction.
    | 
    | In comparison, C, Java, and Javascript tend to rely on
    | dynamic dispatch more often (not that C has any built-in
    | dynamic dispatch features, but in practice, C programs tend
    | to use function pointers manually).
    | 
    | A few languages like C# have a really nice balance, and
    | monomorphize in the JIT at run-time.
    | 
    | With this technique they mention, C is the only mainstream
    | language I know of that lets you control it in a decoupled
    | way, which is really cool.
 
    | pizlonator wrote:
    | Say you have:
    | 
    |     template<typename T>
    |     void foo(T& value) { value += 42; }
    | 
    | There's no easy way to tell the C++ compiler, "please make
    | only one version of foo and use callbacks to figure out what
    | it means to += on T".
    | 
    | In libpas, this would be done by first creating:
    | 
    |     always_inline void foo(void* value,
    |                            void (*add_int)(void* value, int addend))
    |     {
    |         add_int(value, 42);
    |     }
    | 
    | And then you could also create:
    | 
    |     never_inline void foo_outline(void* value,
    |                                   void (*add_int)(void* value, int addend))
    |     {
    |         foo(value, add_int);
    |     }
    | 
    | Now, if you want monomorphization like C++, call foo() and
    | pass a literal for add_int. But if you don't, then call
    | foo_outline.
    | 
    | Say you have 1000 callsites to foo(), they all use different
    | types, and 5 of those callsites are hot while the other ones
    | are hardly ever called (maybe they only run in unusual-but-
    | necessary situations). So, the 5 hot callsites can call
    | foo(), and the remaining 995 cold ones can call
    | foo_outline().
 
      | thrtythreeforty wrote:
      | > There's no easy way to tell the C++ compiler, "please
      | make only one version of foo and use callbacks to figure
      | out what it means to += on T".
      | 
      | Partial counterpoint: you could use class polymorphism, if
      | T is always a type you control. But you're right in
      | general; C++ doesn't have typeclasses or some other way to
      | create an ad-hoc vtable for, say, int.
      | 
      | > Now, if you want monomorphization like C++, call foo()
      | and pass a literal for add_int. But if you don't, then call
      | foo_outline.
      | 
      | Does this work through structs of function pointers? Is
      | that the reason it's so powerful?
      | 
      | For example, a my_class constructor create_my_class makes
      | the class point to foo_outline, unless it's known to be a
      | hot type, then it points to foo. When you call
      | create_my_class, would the compiler see the values of the
      | pointers, and start inlining foo calls, including if you
      | pass the my_class struct into foo as "this", continuing the
      | inlining?
 
        | pizlonator wrote:
        | Yes, it works with structs of function pointers. And it
        | works recursively.
        | 
        | This works (this is slightly shorthand C, fill in the
        | blanks yourself):
        | 
        |     struct config {
        |         void (*foo)(things bar);
        |         stuff (*bar)(int baz);
        |     };
        |     always_inline stuff doit(config c)
        |     {
        |         c.foo(whatever);
        |         return c.bar(42);
        |     }
        |     always_inline void my_foo(...) { ... }
        |     always_inline stuff my_bar(...) { ... }
        |     stuff dostuff(void)
        |     {
        |         config c;
        |         c.foo = my_foo;
        |         c.bar = my_bar;
        |         return doit(c);
        |     }
        | 
        | This'll all get inlined and specialized to death.
 
        | thrtythreeforty wrote:
        | Got it. That is definitely a cool insight. Thanks for
        | sharing!
 
    | RcouF1uZ4gsC wrote:
    | > Would this allow you to mark a "class" (or individual
    | members of that class) as "hot, please monomorphize" by
    | making it always_inline, or "cold, please use regular
    | pointers" by omitting always_inline?
    | 
    | From my understanding of the technique, this is correct.
 
  | verdagon wrote:
  | IIRC, the branch predictor can work with function pointers as
  | well. It can sometimes mitigate problems like this, though not
  | as completely as just monomorphizing away the run-time
  | branching/calling like they did.
 
| exikyut wrote:
| A weekend project idea I just thought of, but wouldn't be able to
| commentate on as well as someone more experienced in memory
| management:
| 
| Run a bunch of memory allocator demo programs (trivial and
| complex) under _blinkenlights_
| (https://justine.lol/blinkenlights/), a program execution
| visualizer that shows how programs interact with the stack and
| heap as they run.
| 
| For bonus points, proceed to explain _why_ their visualized
| performance is different. :)
 
| meisel wrote:
| Will this have support for macOS by inserting itself, as jemalloc
| does, as the default malloc zone? If so, does that mean it will
| be very fast at determining if a given pointer was allocated by
| libpas?
 
  | pizlonator wrote:
  | libpas can do that, though some of the code to do it is not in
  | the WebKit repo (but the enumerator support - a complex
  | component that is necessary to fully replace system malloc - is
  | in the repo).
  | 
  | libpas can already determine if it owns a pointer in O(1) time
  | and no locking. It uses megapages for this purpose. Though,
  | right now, it's not configured to fully leverage this (if you
  | pass an unknown ptr to free then as a last step it'll lock the
  | heap_lock even though it would be a small change to just use
  | megapages there).
 
| donfuzius wrote:
| This looks awesome and I really like the tone of the
| documentation. Great work! At the same time I'm sort of worried
| that such a deep change in the memory management of my day-to-day
| browser opens up a significant security risk for a considerable
| time, until new bugs are found and closed. I'm sure this got more
| than one pair of eyes at Apple, but I'd really welcome a specific
| bug bounty on such a change...
 
  | pizlonator wrote:
  | That's a fair point. Code churn creates bugs. That's a good
  | argument for slowing down the pace of browser development
  | overall, except maybe the kind of development that focuses on
  | security hardening.
  | 
  | Libpas has some security features that bmalloc didn't have. It
  | goes further than bmalloc in avoiding inline metadata (bmalloc
  | had an IsoHeap<> thing that used scrambled inline freelists).
  | Also, libpas opens the door to isoheaping _all_ allocations in
  | WebKit (see https://bugs.webkit.org/show_bug.cgi?id=231938),
  | which would be a big security win.
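  | 
  | In general terms, isoheaping gives each type its own heap, so a
  | freed slot is only ever reused for an object of the same type; a
  | dangling pointer then still points at "some object of type T",
  | which defeats many type-confusion exploits. A toy sketch (not
  | WebKit's actual IsoHeap<> API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy isolated heap: each type gets its own iso_heap, and a freed slot
 * is reused only within that heap. For brevity this sketch threads a
 * freelist through the freed objects themselves; libpas goes further
 * and keeps such metadata out of line, which is part of the hardening
 * discussed above. Objects must be at least pointer-sized. */
typedef struct iso_heap {
    size_t object_size;
    void* free_list;
} iso_heap;

static void* iso_alloc(iso_heap* heap) {
    if (heap->free_list) {
        void* p = heap->free_list;
        heap->free_list = *(void**)p;  /* pop a previously freed slot */
        return p;
    }
    return malloc(heap->object_size);
}

static void iso_free(iso_heap* heap, void* p) {
    *(void**)p = heap->free_list;      /* push; reuse stays per-type */
    heap->free_list = p;
}
```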
  | 
  | Also, libpas has been fuzzed to death. In test_pas, there's a
  | bunch of fuzz tests I call "chaos" and I've run those for
  | extended periods of time (days, weeks) after major changes. Of
  | course it still has bugs, but considering how insane the heap
  | organization is, libpas also makes attackers' lives much harder
  | in the short term.
  | 
  | The combination of libpas's enhanced security features and the
  | fact that it substantially changes heap layout might be enough
  | to cancel out the fact that it has increased bug tail. And...
  | if you're worried about bug tail from major changes introducing
  | security vulns then there are probably much worse major changes
  | that someone could do.
 
| ryanianian wrote:
| Are there any projects that surgically augment the memory APIs to
| be more cooperative? Sure `malloc` implementations will make
| various tradeoffs, but what if the malloc/free APIs were expanded
| to expose more programmer-intent or cooperative defrag, etc?
| 
| - Perhaps `malloc` could ask for an intended lifecycle (think GC
| generation) that could swap arenas or other algo internals. This
| opens us up to meta-allocators.
| 
| - Perhaps program lifecycle hooks that give allocators more
| context into how already-allocated memory intends to be accessed.
| ("Done bootstrapping the server, so now new memory will be
| accessed by single threads and only live for one request", etc)
| 
| - Perhaps the programmer could periodically call `remalloc` to
| allow objects to be moved or promoted between arenas/generations.
| 
| - Perhaps mallocs/frees could be more intelligently batched
| (where explicit structs/arrays don't normally make sense) for
| surprise-free defrag or OS allocation.
| 
| - Intelligent pairing of malloc/free with explicit notions of
| intended re-use; some indication of "I will free this and ask for
| the same memory again in the same call" or auto-struct-ing
| mallocs which know that we tend to stack
| malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual
| OS/memory involvement whatsoever.
| 
| - Perhaps `identmalloc` could handle common flyweight patterns of
| instantiate-unless-exists; e.g. intelligently turn shared state
| into offsets into a contiguous arena. This may be more useful in
| C++ with existing conventions around copy-by-value constructs.
| 
| Some of these are indeed components of existing
| mallocs/allocators (e.g. libpas is type-aware), but my point is
| that allocators have to jump through hoops to derive heuristics,
| and these heuristics often amount to "the programmer often
| intends to use memory like X." Performance engineers then
| pattern-match for X rather than cooperating with the compiler to
| expose X intrinsically from the beginning. The above ideas impose
| _slightly_ different programming paradigms, so there's still the
| white whale of a perfect drop-in heuristic, but it seems less
| subject to temporary or local maxima.
| 
| So: Are there compelling (albeit slightly less general-purpose)
| alternatives where the allocator and programmer communicate
| additional hints or other API contracts that allow the allocator
| to move away from heuristics and into proven patterns? Or is this
| basically akin to "use c++/smart pointers or just get a gc"?
 
  | pizlonator wrote:
  | I've thought about this a lot. I think that overall,
  | malloc/free/new/delete are already so hard to use that if you
  | added more stuff, it would create too much cognitive load for
  | the programmer. The result would be that the hints the
  | programmer gave you would be more wrong than the malloc's best
  | guess.
  | 
  | Let me go through these point by point and offer some thoughts.
  | 
  | - Perhaps `malloc` could ask for an intended lifecycle (think
  | GC generation) that could swap arenas or other algo internals.
  | This opens us up to meta-allocators.
  | 
  | It could, and if that information was accurate, then it would
  | be profitable. But if that information was inaccurate, then
  | you'd get worse perf and worse memory usage than if you let
  | malloc just do whatever it wants. Also, anything like this
  | creates more fragmentation since it creates a new opportunity
  | for the programmer to force the allocator to pick something
  | other than the first fit.
  | 
  | - Perhaps program lifecycle hooks that give allocators more
  | context into how already-allocated memory intends to be
  | accessed. ("Done bootstrapping the server, so now new memory
  | will be accessed by single threads and only live for one
  | request", etc)
  | 
  | If you convey this information after you've already allocated
  | the object then there isn't much the malloc could do. If the
  | malloc uses this information to decide where the next malloc
  | call puts memory then you've created a fragmentation risk.
  | 
  | - Perhaps the programmer could periodically call `remalloc` to
  | allow objects to be moved or promoted between
  | arenas/generations.
  | 
  | Yes, they could, and this is a great idea. I believe that
  | calling realloc (or just mallocing a new object, memcpying or
  | copy-constructing, and freeing the old one) relieves
  | fragmentation because it gives the malloc an opportunity to put
  | the object in a more profitable location.
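  | 
  | The pattern can be as small as this (a generic sketch, not a real
  | libpas API; the `remalloc` from the parent comment doesn't exist):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy a long-lived object into a fresh allocation and free the old
 * one. This gives the allocator a chance to place the object somewhere
 * denser and to reclaim the page the old copy was pinning. */
static void* relocate(void* old, size_t size) {
    void* fresh = malloc(size);
    if (!fresh)
        return old;        /* on failure, keep the old copy */
    memcpy(fresh, old, size);
    free(old);
    return fresh;
}
```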
  | 
  | - Perhaps mallocs/frees could be more intelligently batched
  | (where explicit structs/arrays don't normally make sense) for
  | surprise-free defrag or OS allocation.
  | 
  | libpas batches allocation and deallocation. These strategies
  | help performance. They have only a small effect on memory usage
  | (and that effect is that you use a bit more memory by
  | batching).
  | 
  | - Intelligent pairing of malloc/free with explicit notions of
  | intended re-use; some indication of "I will free this and ask
  | for the same memory again in the same call" or auto-struct-ing
  | mallocs which know that we tend to stack
  | malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual
  | OS/memory involvement whatsoever.
  | 
  | Allocators like libpas (and mimalloc and many others) have very
  | cheap malloc/free calls that perform great in these situations.
  | They don't do any locking or call into the OS in the common
  | case when you do this.
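  | 
  | The common-case cheapness typically comes from per-thread caching,
  | which can be sketched like this (one size class only, for
  | illustration; real allocators key the cache by size class):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Per-thread cache of recently freed blocks: a tight
 * malloc(foo); free(foo) pair becomes a pointer push and pop, with no
 * lock and no OS call. This sketch handles a single size class, so all
 * requests through it must be the same size. */
#define CACHE_DEPTH 8

static _Thread_local void* cache[CACHE_DEPTH];
static _Thread_local int cache_top;

static void* fast_alloc(size_t size) {
    if (cache_top > 0)
        return cache[--cache_top];  /* hit: no lock, no OS call */
    return malloc(size);            /* miss: fall back to the allocator */
}

static void fast_free(void* p) {
    if (cache_top < CACHE_DEPTH) {
        cache[cache_top++] = p;     /* hit: just remember the block */
        return;
    }
    free(p);                        /* cache full: really free it */
}
```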
  | 
  | - Perhaps `identmalloc` could handle common flyweight patterns
  | of instantiate-unless-exists; e.g. intelligently turn shared
  | state into offsets into a contiguous arena. This may be more
  | useful in C++ with existing conventions around copy-by-value
  | constructs.
  | 
  | I think that most fast mallocs like libpas (and others) make
  | this situation cheap enough that you don't need anything else.
  | 
  | - Some of these are indeed components of existing
  | mallocs/allocators (e.g. libpas is type-aware), but my point is
  | that allocators have to jump through hoops to derive
  | heuristics, and these heuristics often amount to "the
  | programmer often intends to use memory like X." Performance
  | engineers then pattern-match for X rather than cooperating with
  | the compiler to expose X intrinsically from the beginning. The
  | above ideas impose slightly different programming paradigms, so
  | there's still the white whale of a perfect drop-in heuristic,
  | but it seems less subject to temporary or local maxima.
  | 
  | If the program is simple enough that you understand the object
  | lifetimes completely, or you have enough resources to throw at
  | the problem that you can completely understand lifetimes even
  | for a complex program, then the best path forward is to just
  | not call malloc. Lay out your memory a priori or write a highly
  | customized "allocator" (that may not even have an API that
  | looks anything like malloc).
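  | 
  | For example, when all objects die together (end of a frame, end of
  | a request), the whole "allocator" can be a bump pointer over a
  | preallocated region:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bump allocator: allocation is one pointer bump, and "free" is
 * resetting the whole arena at once. No per-object bookkeeping. */
typedef struct arena {
    uint8_t* base;
    size_t capacity;
    size_t used;
} arena;

static void* arena_alloc(arena* a, size_t size) {
    size = (size + 15) & ~(size_t)15;   /* keep 16-byte alignment */
    if (a->used + size > a->capacity)
        return NULL;                    /* arena exhausted */
    void* p = a->base + a->used;
    a->used += size;
    return p;
}

static void arena_reset(arena* a) {
    a->used = 0;                        /* "frees" every object at once */
}
```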
  | 
  | If the program is too complex for such reasoning, then I
  | believe there's no way for the programmer to tell the malloc
  | anything it can't deduce with heuristics. The programmer can
  | say some stuff, but when you have a large sophisticated heap,
  | then the programmer will not be able to predict what it is
  | about the heap that creates pain for the malloc. In most
  | experiments where I've tried to do this kind of tuning, I end
  | up with something that runs slower or uses more memory, and the
  | investigation usually leads me to learn that my hypothesis
  | about the malloc's pain point was totally wrong.
  | 
  | - So: Are there compelling (albeit slightly less general-
  | purpose) alternatives where the allocator and programmer
  | communicate additional hints or other API contracts that allow
  | the allocator to move away from heuristics and into proven
  | patterns? Or is this basically akin to "use c++/smart pointers
  | or just get a gc"?
  | 
  | Maybe! I'm not proving a negative here. But everything I've
  | tried in this area has been a negative result for me.
 
    | twoodfin wrote:
    | _I've thought about this a lot. I think that overall,
    | malloc/free/new/delete are already so hard to use that if you
    | added more stuff, it would create too much cognitive load for
    | the programmer. The result would be that the hints the
    | programmer gave you would be more wrong than the malloc's
    | best guess._
    | 
    | I think this is a generally applicable lesson. Programmers
    | overestimate their ability to statically optimize the dynamic
    | behavior of their code, _especially_ relative to a runtime
    | that _can_ access and act on dynamic state.
    | 
    | This comes up in the database world as application developers
    | desiring to "pin" certain tables in memory. Experience has
    | shown this to be an anti-feature: As the schema and workload
    | of a database system evolves, these pinning decisions can
    | become obsolete, starving more relevant tables of buffer
    | space. Dynamic decisions, even made by simple heuristics like
    | LRU, are an empirically winning strategy.
    | 
    | The failure of Itanium/EPIC and the "sufficiently smart
    | compiler" to beat traditional OoO execution is another
    | example from a different computing domain.
 
| akrymski wrote:
| Now if only I could _limit_ the amount of memory a browser would
| consume! The fact that desktop browsers grind a 16GB machine to a
| halt, while on mobile devices having 1000s of tabs is a non-issue
| is plain wrong. Since I can't limit a process' memory on Mac OS
| (any OS?), why not give me a setting in preferences to do so?
| After all the browser is really just a VM at this stage. I'm
| quite certain I don't need 100 tabs in active memory at once.
| 
| While at it, why can't we hibernate tabs? Write the JSC heap and
| DOM state to disk, the way VMs do it?
| 
| What's taking up the majority of my memory anyway - the graphics
| / gpu side or the JS VM?
| 
| There's a whole spectrum of startups trying to solve the webkit
| memory consumption issues crudely, such as
| https://www.mightyapp.com
 
  | pizlonator wrote:
  | I think that desktop browser users expect that their tabs
  | generally won't get reloaded. On mobile, users have generally
  | accepted that their tabs will get reloaded. Maybe browsers
  | should just reload tabs if anything fishy happens or you're not
  | using them.
  | 
  | Hibernating tabs to disk is hard because the JS heap (and the
  | DOM heap) will contain lots of stuff that is tied to handles
  | you got from the OS. It's not impossible. But it would require
  | a lot of work to build specialized serialization functions for
  | stuff that has native state attached to it that the browser
  | doesn't control or have good visibility into.
  | 
  | I think that some websites use a lot of graphics memory while
  | other websites use a lot of JS memory. And there are other
  | categories of memory usage, too.
  | 
  | Maybe if they put the cloud browsers on the blockchain using
  | machine learning, and then write them in Rust and compile them
  | to wasm, then it will be even better.
 
    | akrymski wrote:
    | > I think that desktop browser users expect that their tabs
    | generally won't get reloaded
    | 
    | But they do get reloaded. There's absolutely no guarantee
    | provided, and browser tabs will get swapped out of memory as
    | necessary (eg opening lots of new tabs or apps).
    | 
    | This is easily solvable though: let me pin tabs that I really
    | need to persist.
 
      | pizlonator wrote:
      | Sure they do, but much less aggressively than mobile
      | browsers. I think that desktop Safari will try to keep your
      | tab live so long as nothing fishy happens. Based on my
      | experiences using Chrome, it seems to do the same thing.
 
  | tptacek wrote:
  | This is a thread about an allocator design, not really about
  | Webkit.
 
    | akrymski wrote:
    | It's also a thread about memory allocation in Webkit tho.
 
| jibcage wrote:
| Fun fact: I believe PAS stood for "Phil's awesome shit." Whether
| that's subject to NDA is anyone's guess :D
 
  | pizlonator wrote:
  | I can neither confirm nor deny this awesome conjecture.
 
___________________________________________________________________
(page generated 2022-06-01 23:00 UTC)