|
| carlsborg wrote:
| Trying to understand the license for bmalloc, and webkit in
| general. The WebCore folder has an Apple license and two LGPL
| licenses. The root folder has no license file. Can anyone
| comment?
| pizlonator wrote:
| libpas is BSD 2-clause. so is bmalloc. libpas is in the bmalloc
| directory in WebKit, but is really a separate piece of code.
|
| each file has its own license header.
| SemanticStrengh wrote:
| ridiculous_fish wrote:
| Hey Phil, thanks for the write up! Curious why this is only used
| for JSC's JIT, and not to back its heap? Did the design of the
| JSC GC influence libpas at all?
| pizlonator wrote:
| It's used for JSC's JIT and for all heap allocations in
| JavaScriptCore and WebKit that aren't GC allocations.
|
| Riptide (the JSC GC) has a similar design and some of the key
| ideas were first pioneered there:
|
| - The use of fast bitvector searches to find an eligible page
| (Riptide calls it first eligible block, I think).
|
| - The use of bitvector simd to change object states in bulk.
|
| - Riptide uses a mark bitvector that is indexed by minalign
| atom (16 bytes) and a block size of 16KB. Libpas's default
| "small segregated" configuration is 16 byte minalign, 16KB
| block, and an alloc bitvector indexed by minalign.
|
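The fast bitvector search can be sketched in a few lines of C. This is an illustrative sketch, not libpas's or Riptide's actual code: scan the vector one 64-bit word at a time and use a count-trailing-zeros builtin (GCC/Clang's __builtin_ctzll) to locate the first set bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: find the index of the first set bit in a bitvector of
 * "eligible" flags, or SIZE_MAX if no bit is set. One branch per 64
 * entries, then a single ctz instruction to pinpoint the bit. */
static size_t find_first_eligible(const uint64_t *bits, size_t num_words)
{
    for (size_t i = 0; i < num_words; i++) {
        if (bits[i])
            return i * 64 + (size_t)__builtin_ctzll(bits[i]);
    }
    return SIZE_MAX;
}
```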
| I have since thought about how Riptide could be wired up to use
| libpas. It's possible to do it. Already today you could create
| a libpas page_config that leaves room for mark bits in the page
| header, and makes them use the same indexing as the alloc bits.
| Riptide has a "newly allocated" bitvector that serves almost
| the same purpose as the alloc bits (kinda, if you squint really
hard) - so you'd go from Riptide's newly_allocated and mark
| bitvectors, to libpas's alloc bitvector and a custom mark
| bitvector stuffed into page headers using an exotic
| page_config. If JSC did this, it would buy:
|
| - Better decommit policies than Riptide. I did a lot of tuning
| in libpas to make decommit Just Right (TM).
|
| - Faster allocation of medium-sized objects.
|
| - Possibly better memory efficiency.
|
| But, Riptide is already very nicely tuned, and it gets some
| benefit from libpas's decommit rules because Riptide calls
| fastMalloc to get memory and sometimes uses fastFree to return
| it. fastFree calls bmalloc::api::free, which calls libpas's
| bmalloc_deallocate, which then goes to pas_deallocate with the
| BMALLOC_HEAP_CONFIG. So, Riptide already gets some of the
| goodness of libpas by sitting on top of it.
| [deleted]
| dragontamer wrote:
General purpose memory allocation is a lie. Every allocator has use
cases where it's faster than the other libraries.
|
| I think that's why we keep seeing newer malloc schemes pop up,
| because the performance of the heap 100% depends on the use-case,
| and different people have different use cases.
|
| Still, studying everyone else's heaps (and garbage collectors, a
| closely related discussion) is probably good for high-performance
| programmers.
| pizlonator wrote:
| Yeah, you're totally right.
|
| Libpas beats other mallocs in WebKit. I also had benchmarks
| involving non-WebKit workloads and the results were all over
| the place (sometimes slower than other mallocs by a lot,
| sometimes faster by a lot - same thing with memory, sometimes
| more efficient, sometimes less). This didn't surprise me; I've
| seen this before when writing memory management code.
|
| I think it makes sense for large software projects that use
| malloc a lot and care about perf to eventually either write
| their own malloc or to take an existing one and then tune it a
| lot.
| latenightcoding wrote:
| I've seen a few opensource projects archive their custom
| allocators because they were not beating the system's one or
| jemalloc. So if you have the skill and time then yeah go for
| it, else stick with general purpose ones.
| mananaysiempre wrote:
| There are a couple of approaches that might legitimately be
| simpler than using a general-purpose allocator: in a single-
| pass batch process, just throw things on the floor and let
| the OS sort it out on exit[1] (I think the D compiler does
| this); in a multiple-pass batch process, make each pass
| litter its own room (arena, obstack, etc.) then demolish said
| room when done (GCC does this or at least did in the past).
|
| On the other hand, these may require rearranging the logic
| somewhat to fit them, so the question of how much
| rearrangement to tolerate still remains. And, well[4],
    /*
     * We divy out chunks of memory rather than call malloc each time so
     * we don't have to worry about leaking memory. It's probably
     * not a big deal if all this memory was wasted but if this ever
     * goes into a library that would probably not be a good idea.
     *
     * XXX - this *is* in a library....
     */
|
| [1] I am rather dismayed by how Raymond Chen advocates for
| this for memory[2] but insists it could not possibly be a
| good idea for Windows GDI handles, no, you stupid sloppy
| programmer[3]. Maybe because he was involved in GDI from the
| other side? (Of course, for file descriptors on Unix it's
| still the standard practice.)
|
[2] http://bytepointer.com/resources/old_new_thing/20120105_006_...
|
[3] http://bytepointer.com/resources/old_new_thing/20051014_305_...
|
[4] https://github.com/the-tcpdump-group/libpcap/blob/4c1e516dd2...
| jstimpfle wrote:
This "[2] vs [3]" is a good find. I have several possible
explanations. It could be the difference of 7 years between
those posts. It could also be that Windows Handles can't be
cleaned up in userspace. Note how malloc chunks up the
"handles" (i.e. pages / regions) that it got from Windows.
This is, AFAIK, different from HANDLEs, which have to be
requested from the system one-by-one and can't be chunked
up. (Or am I wrong? I actually don't know a lot about how
this works under the hood.)
| dragontamer wrote:
It's very easy to beat the general purpose ones if you know
your exact use case.

Ex: 16-bit pointers give a 65536-element heap. Assume 8 bytes
per element and that's 512KB of space. A bit small, but large
enough to do a lot of things.
|
| 65536 elements can be represented as a bitmask. The bitmask
| only takes up 8192-bytes (8KB), which fits inside of 16
| AVX512 registers (Intel offers 32x AVX512/ZMM registers btw).
| Or it fits inside of GPU __shared__ memory.
|
| If you need multithreaded, you perform atomic AND and atomic
| OR to clear, and set, the bitmask as appropriate.
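The bitmask scheme described above can be sketched for a single 64-slot pool (the 65536-element version is the same idea repeated across more words). This is an illustrative toy, not code from any real allocator:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t used;   /* bit i set => slot i is allocated */
static uint64_t pool[64];       /* 8 bytes per element, 512B total */

/* Claim the lowest free slot, or return -1 if the pool is full. */
static int slot_alloc(void)
{
    for (;;) {
        uint64_t cur = atomic_load(&used);
        if (cur == UINT64_MAX)
            return -1;                        /* pool exhausted */
        int slot = __builtin_ctzll(~cur);     /* lowest clear bit */
        /* atomic OR sets the bit; recheck in case another thread won */
        uint64_t prev = atomic_fetch_or(&used, 1ull << slot);
        if (!(prev & (1ull << slot)))
            return slot;
    }
}

/* atomic AND clears the bit, releasing the slot */
static void slot_free(int slot)
{
    atomic_fetch_and(&used, ~(1ull << slot));
}
```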
|
| -------
|
| Do you know the exact size of your heap? The size of the
| pointer? The amount of parallelism involved? What about the
| size of the elements? Access pattern? (Bump-alloc'd Linked
| Lists are a sequential traversal over RAM btw, so that's very
| efficient), etc. etc.
|
| The 64-bit pointer is overkill for most people's purposes.
| 32-bits represents 4-billion objects, and if each object is
| 16-bytes long that's 64GBs of RAM. Right here right now, a
| 32-bit pointer / custom allocator already offers many
| benefits over the 64-bit pointer. (half-sized pointers, more
| data in cache, etc. etc.)
| thesz wrote:
| 512 in AVX512 is the number of bits per register. You are
| off by factor of 8.
| cmrdporcupine wrote:
100%. Also, I think people who compare GC'd languages 'higher
up the stack' to manual heap allocation in C/C++ etc. tend to
imply that the latter is almost "free" from a performance POV,
when in fact, under the covers, these allocators do a lot of
the same kind of walking of data structures that happens
inside a tracing GC.
|
| You can get unpredictable pauses from malloc/free, too.
| asveikau wrote:
| I think in either case, GC or not, you write something
| intuitively and then when it becomes an actual problem, you
| study the patterns and improve them, but most of the time you
| can leave it alone and the general purpose thing is good
| enough.
|
| Or if you are in a niche like audio/video or something, you
| avoid allocations all together during the bulk of the code.
| cmrdporcupine wrote:
| Oh for sure. Premature optimizing is usually bad form.
|
| What is tricky about performance issues in allocation is
| that they can be hard to profile. There are tools for
| analyzing GC performance in GC'd languages, but sometimes
| malloc/free can just be a big black box.
| djmips wrote:
Premature optimization is not bad form when it's re-
framed as good architecture. So it's not 'usually bad
form' to architect something from the outset using your
experience, and that's something everyone understands.
This pervasive disdain for premature optimization leads
to bad architecture, which often leads to expensive
rewrites. Because the word "optimization" is so
overloaded and treated with disdain, it feels like we need
new language to talk about what's really meant by
'premature optimization' (the bad kind).
| asveikau wrote:
| That's one reason I point out a few comments above that
| certain niches will need to take it into account ahead of
time. E.g. an A/V application will typically allocate all
| buffers up front and re-use them frequently rather than
| return them to the allocator. A lot of server
| applications will want to keep per-client memory usage
| low. For general purposes, there's the general purpose
| allocator.
| tptacek wrote:
| This is a pretty well-known CS truism, isn't it? One of the
| first papers I ever fell in love with covers it in detail:
| https://users.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pd...
| lidavidm wrote:
| I wonder how this compares to jemalloc, mimalloc, snmalloc?
| pizlonator wrote:
| I never got a chance to compare it to those, since I was most
| interested in beating bmalloc. And I mainly wanted to beat it
| on Safari workloads.
|
| I believe bmalloc was previously compared against jemalloc and
| tcmalloc, also using Safari workloads, and bmalloc was
| significantly faster at the time.
| IshKebab wrote:
| For other people who have never heard of bmalloc - it's a
| custom allocator used only by WebKit. I guess it's not
| surprising they added one since the Mac system allocator is
| extremely slow. A custom allocator is pretty much a free 20%
| speed up on Mac (depending on your workload) but I found they
| made no difference on Linux. Haven't tried on Windows.
| pizlonator wrote:
| Just some credit where credit is due.
|
| I measured the system malloc crushing all other mallocs on
| some workloads and they were the kind of workloads that
| some folks run every day.
|
| I measured the system malloc crushing most other mallocs on
| memory efficiency on most workloads. System malloc is
| really good at reusing memory and has very mature decommit
| policies that easily rival what I came up with.
|
| So there's that.
| pcwalton wrote:
| macOS malloc was _incredibly_ slow for Pathfinder, far
| behind every other OS. Everything became bottlenecked on
| it. It was a free 2x speedup if not more to switch to
| jemalloc.
|
| I suspect this is because Pathfinder's CPU portion is a
| multicore workload and macOS allocator performs poorly
| when under heavy multithreaded contention. It probably
| just isn't the kind of workload that macOS allocator was
| tuned for.
| pizlonator wrote:
| My understanding is that the system malloc is excellent
| under contention, but has a high baseline cost for every
| malloc/free call.
|
| I don't remember exactly what workloads it performed
| really great at (and if I did I dunno if I could say),
| but I do remember they were parallel, and the speed-ups
| got bigger the more cores you added.
|
| Everything else about your experience matches mine.
| Libpas is much faster than system malloc in WebKit and
| JSC. The difference isn't 2x on my preferred benchmarks
| (which are large and do lots of things that don't rely on
| malloc), but it is easily more than 2x on smaller
| benchmarks. So your 2x result sounds about right.
| meisel wrote:
| What sort of workloads are those? I have yet to see a
| workload where macOS's system allocator is not
| substantially slower than alternatives like jemalloc.
| astrange wrote:
| Alas, many people think performance engineering is only
| about the wall clock time of their program and not what
| impact it has on anything else.
|
| malloc performance highly depends on how large the
| allocations are though.
| KerrAvon wrote:
| What's the workload where you found that to be the case?
| The system allocator is heavily tuned for general use, and
| should be very difficult to beat -- generally.
| IshKebab wrote:
| It was a non-traditional compiler written in C++.
|
| > The system allocator is heavily tuned
|
| Perhaps but it probably has more constraints than a
| custom allocator too, e.g. backwards compatibility with
| software that unwittingly depends on its exact behaviour.
|
| I found a discussion of this where people reported
| various speedups (10%, 100%, 300%) on Mac by switching to
| a custom allocator:
| https://news.ycombinator.com/item?id=29068828
|
| I'm sure there are better benchmarks if you Google it.
| jeffbee wrote:
Well, the other browser's design doc is not anywhere near this
| detailed, but if you want to compare and contrast the overall
| design here is Chromium's PartitionAlloc.
|
| https://chromium.googlesource.com/chromium/src/+/master/base...
| rq1 wrote:
| > Consequently, passing a function pointer (or struct of function
| pointers), where the pointer points to an always_inline function
| and the callee is always_inline results in specialization akin to
| template monomorphization. This works to any depth; the compiler
| won't be satisfied until there are no more always_inline function
| calls. This fortuitous development in compilers allowed me to
| write very nice template code in C. Libpas achieves templates in
| C using config structs that contain function pointers --
| sometimes to always_inline functions (when we want specialization
| and inlining) and sometimes to out-of-line functions (when we
| want specialization but not inlining). Additionally, the C
| template style allows us to have true polymorphic functions. Lots
| of libpas slow paths are huge and not at all hot. We don't want
| that code specialized for every config. Luckily, this works just
| fine in C templates -- those polymorphic functions just pass
| around a pointer to the config they are using, and dynamically
| load and call things in that config, almost exactly the same way
| that the specialized code would do. This saves a lot of code size
| versus C++ templates.
|
| This document is incredible!
| photochemsyn wrote:
| Seems like a real tour de force of low-level memory management.
| Prior to seeing this my go-to for understanding custom low-
| level (for embedded) memory managers was this (explains the
| basics of stack, block, bitmap and thread-local allocation) but
| now I have something far more complicated to get confused
| about.
|
| https://www.embedded-software-engineering.de/dynamic-memory-...
| pizlonator wrote:
| Thank you! :-)
|
| I always wanted to write a malloc, and this is what I came up
| with after 3.5 years or so.
| [deleted]
| adamnemecek wrote:
| I'm always suspicious of things that are named after the maker.
| astrange wrote:
| A better rule is to be suspicious of anything with "fast" or
| synonyms in the name. It either won't be fast or it'll be
| insecure.
| fmajid wrote:
Did you know the entire US Air Traffic Control system runs on
long-discontinued computers and is written in JOVIAL, "Jules'
Own Version of the International Algebraic Language" (i.e.
ALGOL), named after Jules Schwartz?
|
| https://en.wikipedia.org/wiki/JOVIAL
|
The FAA's been trying to replace it for ages, but AFAIK it's
still ongoing.
| thrtythreeforty wrote:
| Like Linux, or gcc.godbolt.org?
|
| I kid, but I don't see it as a downside for software. Naming a
| scientific phenomenon after yourself is a red flag though, see
| also the Crackpot Index [1].
|
| [1]: https://math.ucr.edu/home/baez/crackpot.html
| tialaramex wrote:
| Godbolt _says_ Compiler Explorer right there, the fact people
| call it "Godbolt" is because that's the memorable name
| rather than because Matt Godbolt tried to name it after
| himself. He calls it Compiler Explorer.
|
| You can't control what people call things. My favourite
| example is Blackfriars Bridge in London. When that bridge was
| built, it was formally named after William Pitt. But like,
| who cares? In what way is this bridge, Pitt's bridge? People
| knew where it was, the Blackfriars area of London, so too bad
| William Pitt, everybody called it Blackfriars Bridge. And so,
| that's what its name is.
|
The priory from which "Blackfriars" got its name hadn't
existed for two hundred years by the time the bridge was
proposed. Nobody living when the bridge was finished would
have ever known anybody who'd met any of the friars during
their lifetime. Nevertheless, that part of London had "always"
been called Blackfriars, and so a bridge and a railway
station built many years later are named that too.
| bool3max wrote:
People call it Godbolt because you access it via
godbolt.org.
| throwaway744678 wrote:
Off the top of my head: Linux, Bourne shell...
| adamnemecek wrote:
| kbenson wrote:
| Sometimes things are named after the maker ex post facto
| because they need a way to distinguish it. The Bourne
| shell, mentioned here, is likely called that for the same
| reason the very first shell was called the Thompson shell,
| in that they were written by a single person originally and
| not really given a name (the original being from Ken
| Thompson), since they were just different (or the original)
| implementations of the system shell. Eventually you're
| going to need a way to distinguish one from the other
though, right? Noting who wrote it is as good as (probably
better than) any other option.
|
| Linux was probably named differently, but also probably
| doesn't really match whatever criteria you're using to be
| suspicious. It started as a hobby project to make a free
| and open source UNIX system, of which there were few
| (none?), written by one guy. There was no expectation that
| other people would use it for anything other than something
| interesting to play around with, and using a portion of
| your own name for your own little hobby project isn't
| something I would look askance at, personally, especially
| in an age where there was no expectation that it might grow
| to anything, because that was definitely not the norm at
| the time.
| ncmncm wrote:
| "Linux" was forced on him.
| pizlonator wrote:
| It's called libpas, and pas stands for whatever you like.
| mattgreenrocks wrote:
| I admit I'm a sucker for metacircularity, but have to ask:
|
| > The bootstrap heap has hacks to allow itself to allocate that
| array out of itself.
|
| Not quite following that, can someone elaborate?
| convolvatron wrote:
I suspect it's that - generally, when you write a heap it needs
to keep some metadata. If you have a multi-heap system (really
handy), then it's generally easier to host that metadata on
another heap. Sometimes you really want that, because the
target heap might be specialized for different kinds of data.
So I assume this is carefully constructing the heap state so
that it can be used for its own metadata as it's being
constructed.
| pizlonator wrote:
| The bootstrap heap has a freelist that is just an array of
| entries that tell you where the free memory is.
|
| That array has to be allocated somewhere.
|
| All pas_simple_large_free_heaps allocate that array from the
| bootstrap heap. Including the bootstrap heap. So, the
| pas_simple_large_free_heap has hacks that go something like:
| "if I'm allocating or freeing something and I need to allocate
| or free the free list, then _very carefully_ also perform that
allocation/deallocation in tandem with the one I'm doing". The
main hack there is just that there's some statically allocated
"slop" for the bootstrap free list, so if you want to add to it
in the middle of an allocation, you add it to the slop array,
and then after you're done you reallocate the free list and
copy stuff from slop into it.
| sydthrowaway wrote:
| Did the author leak this IP when he left Apple?
| pizlonator wrote:
| lol
|
| WebKit is open source. Libpas was always designed with WebKit
| in mind. It landed in WebKit half a year before I left Apple.
| RcouF1uZ4gsC wrote:
| > Consequently, passing a function pointer (or struct of function
| pointers), where the pointer points to an always_inline function
| and the callee is always_inline results in specialization akin to
| template monomorphization. This works to any depth; the compiler
| won't be satisfied until there are no more always_inline function
| calls. This fortuitous development in compilers allowed me to
| write very nice template code in C. Libpas achieves templates in
| C using config structs that contain function pointers --
| sometimes to always_inline functions (when we want specialization
| and inlining) and sometimes to out-of-line functions (when we
| want specialization but not inlining). Additionally, the C
| template style allows us to have true polymorphic functions. Lots
| of libpas slow paths are huge and not at all hot. We don't want
| that code specialized for every config. Luckily, this works just
| fine in C templates -- those polymorphic functions just pass
| around a pointer to the config they are using, and dynamically
| load and call things in that config, almost exactly the same way
| that the specialized code would do. This saves a lot of code size
| versus C++ templates.
|
| I thought this was super interesting. Seems like a very nice
| technique for when you use function pointers in C for stuff like
| sort comparison. There was an earlier thread about PostgreSQL
| that discussed how they created multiple sort functions to avoid
| the overhead of a indirect function pointer call when sorting
| things like integers where the function call dwarfs the
| comparison cost. This seems like a technique that could have been
| used to avoid duplicating the sorting functions while still
| getting the benefits.
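As a hedged illustration of how the quoted technique could apply to the sorting case (these names are invented for this example, not taken from PostgreSQL or libpas): write one generic sort that takes the comparator as a function pointer, and let always_inline specialization erase the indirect call on the hot path.

```c
#include <stddef.h>

#define ALWAYS_INLINE static inline __attribute__((always_inline))

/* One generic insertion sort. When the comparator passed in is a
 * visible always_inline function, the compiler specializes the whole
 * sort and the indirect call disappears; a cold caller could instead
 * go through a never_inline out-of-line wrapper, as in the quote. */
ALWAYS_INLINE void sort_generic(int *a, size_t n,
                                int (*less)(int, int))
{
    for (size_t i = 1; i < n; i++) {
        int v = a[i];
        size_t j = i;
        while (j > 0 && less(v, a[j - 1])) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = v;
    }
}

ALWAYS_INLINE int int_less(int x, int y) { return x < y; }

/* Hot path: comparator is a literal, so this compiles down to a
 * direct, fully specialized integer sort. */
void sort_ints(int *a, size_t n) { sort_generic(a, n, int_less); }
```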
| thrtythreeforty wrote:
| I still don't quite understand how the technique saves code
| size versus C++ templates, although I am quite eager to
| understand it since template instantiation is a thorn in my
| side when writing code for a microcontroller with 32KB of code
| space.
|
| Is the idea that the struct contains pointers to always_inline
| functions, and that the compiler has visibility into this at
| the top of the always_inline call stack, so the always_inline
| gets monomorphized even through multiple layers of pointer
| indirection? If I'm understanding it right, that's a pretty
| nice technique. Would this allow you to mark a "class" (or
| individual members of that class) as "hot, please monomorphize"
| by making it always_inline, or "cold, please use regular
| pointers" by omitting always_inline?
|
| C++ templates would force you to monomorphize every variant, or
| use vtables (function pointers) for every variant, without the
| ability to choose the approach for each subclass. The compiler
| is also spectacularly bad at recognizing deduplication
| opportunities between template specializations.
| verdagon wrote:
| I think that's right.
|
| C++ binaries tend to be pretty large because they use
| templates/monomorphization a lot, even when there aren't
| really meaningful performance gains. For example, it often
| doesn't make sense to have 20 instantiations of std::vector,
| when only two of them are used in the critical path. Rust can
| be even further in this unfortunate direction because of the
| borrow checker's restrictions around polymorphism and
| abstraction.
|
| In comparison, C, Java, and Javascript tend to rely on
| dynamic dispatch more often (not that C has any built-in
| dynamic dispatch features, but in practice, C programs tend
| to use function pointers manually).
|
| A few languages like C# have a really nice balance, and
| monomorphize in the JIT at run-time.
|
| With this technique they mention, C is the only mainstream
| language I know of that lets you control it in a decoupled
| way, which is really cool.
| pizlonator wrote:
Say you have:

    template<typename T>
    void foo(T& value) { value += 42; }
|
| There's no easy way to tell the C++ compiler, "please make
| only one version of foo and use callbacks to figure out what
| it means to += on T".
|
In libpas, this would be done by first creating:

    always_inline void foo(void* value,
                           void (*add_int)(void* value, int addend))
    {
        add_int(value, 42);
    }
|
And then you could also create:

    never_inline void foo_outline(void* value,
                                  void (*add_int)(void* value, int addend))
    {
        foo(value, add_int);
    }
|
| Now, if you want monomorphization like C++, call foo() and
| pass a literal for add_int. But if you don't, then call
| foo_outline.
|
| Say you have 1000 callsites to foo(), they all use different
| types, and 5 of those callsites are hot while the other ones
| are hardly ever called (maybe they only run in unusual-but-
| necessary situations). So, the 5 hot callsites can call
foo(), and the remaining 995 cold ones can call
foo_outline().
| thrtythreeforty wrote:
| > There's no easy way to tell the C++ compiler, "please
| make only one version of foo and use callbacks to figure
| out what it means to += on T".
|
| Partial counterpoint: you could use class polymorphism, if
| T is always a type you control. But you're right in
| general; C++ doesn't have typeclasses or some other way to
| create an ad-hoc vtable for, say, int.
|
| > Now, if you want monomorphization like C++, call foo()
| and pass a literal for add_int. But if you don't, then call
| foo_outline.
|
| Does this work through structs of function pointers? Is
| that the reason it's so powerful?
|
| For example, a my_class constructor create_my_class makes
| the class point to foo_outline, unless it's known to be a
| hot type, then it points to foo. When you call
| create_my_class, would the compiler see the values of the
| pointers, and start inlining foo calls, including if you
| pass the my_class struct into foo as "this", continuing the
| inlining?
| pizlonator wrote:
| Yes, it works with structs of function pointers. And it
| works recursively.
|
This works (this is slightly shorthand C, fill in the
blanks yourself):

    struct config {
        void (*foo)(things bar);
        stuff (*bar)(int baz);
    };

    always_inline stuff doit(config c)
    {
        c.foo(whatever);
        return c.bar(42);
    }

    always_inline void my_foo(...) { ... }
    always_inline stuff my_bar(...) { ... }

    stuff dostuff(void)
    {
        config c;
        c.foo = my_foo;
        c.bar = my_bar;
        return doit(c);
    }
|
| This'll all get inlined and specialized to death.
| thrtythreeforty wrote:
| Got it. That is definitely a cool insight. Thanks for
| sharing!
| RcouF1uZ4gsC wrote:
| > Would this allow you to mark a "class" (or individual
| members of that class) as "hot, please monomorphize" by
| making it always_inline, or "cold, please use regular
| pointers" by omitting always_inline?
|
| From my understanding of the technique, this is correct.
| verdagon wrote:
| IIRC, the branch predictor can work with function pointers as
| well. It can sometimes mitigate problems like this, though not
| as completely as just monomorphizing away the run-time
| branching/calling like they did.
| exikyut wrote:
| A weekend project idea I just thought of, but wouldn't be able to
| commentate on as well as someone more experienced in memory
| management:
|
| Run a bunch of memory allocator demo programs (trivial and
| complex) under _blinkenlights_
| (https://justine.lol/blinkenlights/), a program execution
| visualizer that shows how programs interact with the stack and
| heap as they run.
|
| For bonus points, proceed to explain _why_ their visualized
| performance is different. :)
| meisel wrote:
| Will this have support for macOS by inserting itself, as jemalloc
| does, as the default malloc zone? If so, does that mean it will
| be very fast at determining if a given pointer was allocated by
| libpas?
| pizlonator wrote:
| libpas can do that, though some of the code to do it is not in
| the WebKit repo (but the enumerator support - a complex
| component that is necessary to fully replace system malloc - is
| in the repo).
|
| libpas can already determine if it owns a pointer in O(1) time
| and no locking. It uses megapages for this purpose. Though,
| right now, it's not configured to fully leverage this (if you
| pass an unknown ptr to free then as a last step it'll lock the
| heap_lock even though it would be a small change to just use
| megapages there).
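A minimal sketch of how a megapage-style O(1), lock-free ownership check can work; the constants and table layout here are assumptions for illustration, not libpas's actual configuration:

```c
#include <stdbool.h>
#include <stdint.h>

/* The allocator marks each (hypothetically) 16MB-aligned "megapage"
 * it owns in a flat byte table; free() can then test any pointer with
 * one bounds check and one load, no locks. */
#define MEGAPAGE_SHIFT 24                            /* 16MB megapages */
#define TABLE_SIZE     (1u << (40 - MEGAPAGE_SHIFT)) /* covers 1TB of VA */

static uint8_t megapage_owned[TABLE_SIZE];

/* Called when the allocator acquires a megapage from the OS. */
static void note_megapage(uintptr_t base)
{
    megapage_owned[base >> MEGAPAGE_SHIFT] = 1;
}

/* O(1) test: does this pointer fall inside memory we handed out? */
static bool owns_pointer(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return (a >> MEGAPAGE_SHIFT) < TABLE_SIZE
        && megapage_owned[a >> MEGAPAGE_SHIFT];
}
```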
| donfuzius wrote:
This looks awesome and I really like the tone of the
documentation. Great work! At the same time I'm sort of worried
that such a deep change in the memory management of my day-to-day
browser opens up a significant security risk for a considerable
time, until new bugs are found and closed. I'm sure this got more
than one pair of eyes at Apple, but I'd really welcome a specific
bug bounty on such a change...
| pizlonator wrote:
| That's a fair point. Code churn creates bugs. That's a good
| argument for slowing down the pace of browser development
| overall, except maybe the kind of development that focuses on
| security hardening.
|
| Libpas has some security features that bmalloc didn't have. It
| goes further than bmalloc in avoiding inline metadata (bmalloc
| had an IsoHeap<> thing that used scrambled inline freelists).
| Also, libpas opens the door to isoheaping _all_ allocations in
| WebKit (see https://bugs.webkit.org/show_bug.cgi?id=231938),
| which would be a big security win.
|
| Also, libpas has been fuzzed to death. In test_pas, there's a
| bunch of fuzz tests I call "chaos" and I've run those for
| extended periods of time (days, weeks) after major changes. Of
| course it still has bugs, but considering how insane the heap
| organization is, libpas also makes attackers' lives much harder
| in the short term.
|
| The combination of libpas's enhanced security features and the
| fact that it substantially changes heap layout might be enough
| to cancel out the fact that it has increased bug tail. And...
| if you're worried about bug tail from major changes introducing
| security vulns then there are probably much worse major changes
| that someone could do.
| ryanianian wrote:
| Are there any projects that surgically augment the memory APIs to
| be more cooperative? Sure `malloc` implementations will make
| various tradeoffs, but what if the malloc/free APIs were expanded
| to expose more programmer-intent or cooperative defrag, etc?
|
| - Perhaps `malloc` could ask for an intended lifecycle (think GC
| generation) that could swap arenas or other algo internals. This
| opens us up to meta-allocators.
|
| - Perhaps program lifecycle hooks that give allocators more
| context into how already-allocated memory intends to be accessed.
| ("Done bootstrapping the server, so now new memory will be
| accessed by single threads and only live for one request", etc)
|
| - Perhaps the programmer could periodically call `remalloc` to
| allow objects to be moved or promoted between arenas/generations.
|
| - Perhaps mallocs/frees could be more intelligently batched
| (where explicit structs/arrays don't normally make sense) for
| surprise-free defrag or OS allocation.
|
| - Intelligent pairing of malloc/free with explicit notions of
| intended re-use; some indication of "I will free this and ask for
| the same memory again in the same call" or auto-struct-ing
mallocs which know that we tend to stack
malloc(foo); malloc(bar); free(bar); free(foo). Avoid any actual
| OS/memory involvement whatsoever.
|
| - Perhaps `identmalloc` could handle common flyweight patterns of
| instantiate-unless-exists; e.g. intelligently turn shared state
| into offsets into a contiguous arena. This may be more useful in
| C++ with existing conventions around copy-by-value constructs.
|
| Some of these are indeed components of existing
| mallocs/allocators (e.g. libpas is type-aware), but my point is
| that allocators have to jump through hoops to derive heuristics,
| and these heuristics often amount to "the programmer often
| intends to use memory like X." Performance engineers then
| pattern-match for X rather than cooperating with the compiler to
| expose X intrinsically from the beginning. The above ideas impose
| _slightly_ different programming paradigms, so there's still the
| white whale of a perfect drop-in heuristic, but it seems less
| subject to temporary or local maxima.
|
| So: Are there compelling (albeit slightly less general-purpose)
| alternatives where the allocator and programmer communicate
| additional hints or other API contracts that allow the allocator
| to move away from heuristics and into proven patterns? Or is this
| basically akin to "use c++/smart pointers or just get a gc"?
| pizlonator wrote:
| I've thought about this a lot. I think that overall,
| malloc/free/new/delete are already so hard to use that if you
| added more stuff, it would create too much cognitive load for
| the programmer. The result would be that the hints the
| programmer gave you would be more wrong than the malloc's best
| guess.
|
| Let me go through these point by point and offer some thoughts.
|
| - Perhaps `malloc` could ask for an intended lifecycle (think
| GC generation) that could swap arenas or other algo internals.
| This opens us up to meta-allocators.
|
| It could, and if that information was accurate, then it would
| be profitable. But if that information was inaccurate, then
| you'd get worse perf and worse memory usage than if you let
| malloc just do whatever it wants. Also, anything like this
| creates more fragmentation since it creates a new opportunity
| for the programmer to force the allocator to pick something
| other than the first fit.
|
| - Perhaps program lifecycle hooks that give allocators more
| context into how already-allocated memory intends to be
| accessed. ("Done bootstrapping the server, so now new memory
| will be accessed by single threads and only live for one
| request", etc)
|
| If you convey this information after you've already allocated
| the object then there isn't much the malloc could do. If the
| malloc uses this information to decide where the next malloc
| call puts memory then you've created a fragmentation risk.
|
| - Perhaps the programmer could periodically call `remalloc` to
| allow objects to be moved or promoted between
| arenas/generations.
|
| Yes, they could, and this is a great idea. I believe that
| calling realloc (or just mallocing a new object, memcpying or
| copy-constructing, and freeing the old one) relieves
| fragmentation because it gives the malloc an opportunity to put
| the object in a more profitable location.
|
| - Perhaps mallocs/frees could be more intelligently batched
| (where explicit structs/arrays don't normally make sense) for
| surprise-free defrag or OS allocation.
|
| libpas batches allocation and deallocation. These strategies
| help performance. They have only a small effect on memory usage
| (and that effect is that you use a bit more memory by
| batching).
|
| - Intelligent pairing of malloc/free with explicit notions of
| intended re-use; some indication of "I will free this and ask
| for the same memory again in the same call" or auto-struct-ing
| mallocs which know that we tend to stack
| malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual
| OS/memory involvement whatsoever.
|
| Allocators like libpas (and mimalloc and many others) have very
| cheap malloc/free calls that perform great in these situations.
| They don't do any locking or call into the OS in the common
| case when you do this.
|
| - Perhaps `identmalloc` could handle common flyweight patterns
| of instantiate-unless-exists; e.g. intelligently turn shared
| state into offsets into a contiguous arena. This may be more
| useful in C++ with existing conventions around copy-by-value
| constructs.
|
| I think that most fast mallocs like libpas (and others) make
| this situation cheap enough that you don't need anything else.
|
| - Some of these are indeed components of existing
| mallocs/allocators (e.g. libpas is type-aware), but my point is
| that allocators have to jump through hoops to derive
| heuristics, and these heuristics often amount to "the
| programmer often intends to use memory like X." Performance
| engineers then pattern-match for X rather than cooperating with
| the compiler to expose X intrinsically from the beginning. The
| above ideas impose slightly different programming paradigms, so
| there's still the white whale of a perfect drop-in heuristic,
| but it seems less subject to temporary or local maxima.
|
| If the program is simple enough that you understand the object
| lifetimes completely, or you have enough resources to throw at
| the problem that you can completely understand lifetimes even
| for a complex program, then the best path forward is to just
| not call malloc. Lay out your memory a priori or write a highly
| customized "allocator" (that may not even have an API that
| looks anything like malloc).
|
| If the program is too complex for such reasoning, then I
| believe there's no way for the programmer to tell the malloc
| anything it can't deduce with heuristics. The programmer can
| say some stuff, but when you have a large sophisticated heap,
| then the programmer will not be able to predict what it is
| about the heap that creates pain for the malloc. In most
| experiments where I've tried to do this kind of tuning, I end
| up with something that runs slower or uses more memory, and the
| investigation usually leads me to learn that my hypothesis
| about the malloc's pain point was totally wrong.
|
| - So: Are there compelling (albeit slightly less general-
| purpose) alternatives where the allocator and programmer
| communicate additional hints or other API contracts that allow
| the allocator to move away from heuristics and into proven
| patterns? Or is this basically akin to "use c++/smart pointers
| or just get a gc"?
|
| Maybe! I'm not proving a negative here. But everything I've
| tried in this area has been a negative result for me.
| twoodfin wrote:
| _I've thought about this a lot. I think that overall,
| malloc/free/new/delete are already so hard to use that if you
| added more stuff, it would create too much cognitive load for
| the programmer. The result would be that the hints the
| programmer gave you would be more wrong than the malloc's
| best guess._
|
| I think this is a generally applicable lesson. Programmers
| overestimate their ability to statically optimize the dynamic
| behavior of their code, _especially_ relative to a runtime
| that _can_ access and act on dynamic state.
|
| This comes up in the database world as application developers
| desiring to "pin" certain tables in memory. Experience has
| shown this to be an anti-feature: As the schema and workload
| of a database system evolves, these pinning decisions can
| become obsolete, starving more relevant tables of buffer
| space. Dynamic decisions, even made by simple heuristics like
| LRU, are an empirically winning strategy.
|
| The failure of Itanium/EPIC and the "sufficiently smart
| compiler" to beat traditional OoO execution is another
| example from a different computing domain.
| akrymski wrote:
| Now if only I could _limit_ the amount of memory a browser would
| consume! It's plain wrong that desktop browsers grind a 16GB
| machine to a halt while on mobile devices having 1000s of tabs is
| a non-issue. Since I can't limit a process's memory on Mac OS
| (any OS?), why not give me a setting in preferences to do so?
| After all the browser is really just a VM at this stage. I'm
| quite certain I don't need 100 tabs in active memory at once.
|
| While at it, why can't we hibernate tabs? Write the JSC heap and
| DOM state to disk, the way VMs do it?
|
| What's taking up the majority of my memory anyway - the graphics
| / gpu side or the JS VM?
|
| There's a whole spectrum of startups trying to solve the webkit
| memory consumption issues crudely, such as
| https://www.mightyapp.com
| pizlonator wrote:
| I think that desktop browser users expect that their tabs
| generally won't get reloaded. On mobile, users have generally
| accepted that their tabs will get reloaded. Maybe browsers
| should just reload tabs if anything fishy happens or you're not
| using them.
|
| Hibernating tabs to disk is hard because the JS heap (and the
| DOM heap) will contain lots of stuff that is tied to handles
| you got from the OS. It's not impossible. But it would require
| a lot of work to build specialized serialization functions for
| stuff that has native state attached to it that the browser
| doesn't control or have good visibility into.
|
| I think that some websites use a lot of graphics memory while
| other websites use a lot of JS memory. And there are other
| categories of memory usage, too.
|
| Maybe if they put the cloud browsers on the blockchain using
| machine learning, and then write them in Rust and compile them
| to wasm, then it will be even better.
| akrymski wrote:
| > I think that desktop browser users expect that their tabs
| generally won't get reloaded
|
| But they do get reloaded. There's absolutely no guarantee
| provided, and browser tabs will get swapped out of memory as
| necessary (eg opening lots of new tabs or apps).
|
| This is easily solvable though: let me pin tabs that I really
| need to persist.
| pizlonator wrote:
| Sure they do, but much less aggressively than mobile
| browsers. I think that desktop Safari will try to keep your
| tab live so long as nothing fishy happens. Based on my
| experiences using Chrome, it seems to do the same thing.
| tptacek wrote:
| This is a thread about an allocator design, not really about
| Webkit.
| akrymski wrote:
| It's also a thread about memory allocation in Webkit tho.
| jibcage wrote:
| Fun fact: I believe PAS stood for "Phil's awesome shit." Whether
| that's subject to NDA is anyone's guess :D
| pizlonator wrote:
| I can neither confirm nor deny this awesome conjecture.
___________________________________________________________________
(page generated 2022-06-01 23:00 UTC) |