[HN Gopher] A single line of code made a 24-core server slower t...
___________________________________________________________________
 
A single line of code made a 24-core server slower than a laptop
 
Author : Ygg2
Score  : 456 points
Date   : 2021-12-31 13:40 UTC (9 hours ago)
 
web link (pkolaczk.github.io)
w3m dump (pkolaczk.github.io)
 
| jv22222 wrote:
| Yes! Finally, HN removed the rule to strip the word "How" from
| the title.
 
  | hinkley wrote:
  | A few hours later...
 
| ncmncm wrote:
| The notion of "lock-free" is fundamentally misleading.
| 
| It trips up everybody who hopes "lock-free" will be a magic
| bullet freeing them from resource contention bottlenecks.
| 
| When you have explicit locks, aka mutexes (condition variables,
| semaphores, what-have-you), the interaction between your threads
| is visible on the surface. Replacing that with "lock-free"
| interaction, you have essentially the same set of operations as
| when taking a mutex. On the up side, overhead cost may be lower.
| On the down side, mistakes show up as subtly wrong results
| instead of deadlocks. And, you get less control, so when you have
| a problem, it is at best harder to see why, and harder to fix.
| 
| Because the lock-free operations involve similar hardware bus
| interactions under the covers, they cannot fix contention
| problems. You have no choice but to fix contention problems at a
| higher, architectural design level, by reducing actual
| contention. Having solved the problems there, the extra cost of
| explicit mutex operations often does not matter, and the extra
| control you get may be worth any extra cost.
| 
| What is lock-free good for, then? Lock-free components reduce
| overhead when there is no contention. Given any actual
| contention, performance is abandoned in favor of correctness. So,
| if you have mutexes and performance is good, moving to lock-free
| operations _might_ make performance a little better. If
| performance is bad, mutexes and lock-free operations will be
| about equally bad.
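| 
| A minimal Rust sketch of that equivalence (not the article's code):
| both counters below do the same logical work, and under contention
| both serialize on exclusive ownership of a cache line; the atomic
| just hides the hand-off inside the instruction instead of behind a
| visible lock.
| 
|       use std::sync::atomic::{AtomicU64, Ordering};
|       use std::sync::{Arc, Mutex};
|       use std::thread;
| 
|       fn main() {
|           // "Lock-free" counter: no mutex, but every fetch_add still
|           // needs exclusive ownership of the counter's cache line.
|           let atomic = Arc::new(AtomicU64::new(0));
|           // Mutex-guarded counter: the same serialization, made explicit.
|           let locked = Arc::new(Mutex::new(0u64));
| 
|           let mut handles = Vec::new();
|           for _ in 0..8 {
|               let a = Arc::clone(&atomic);
|               let m = Arc::clone(&locked);
|               handles.push(thread::spawn(move || {
|                   for _ in 0..1_000_000 {
|                       a.fetch_add(1, Ordering::Relaxed);
|                       *m.lock().unwrap() += 1;
|                   }
|               }));
|           }
|           for h in handles {
|               h.join().unwrap();
|           }
|           println!("atomic = {}", atomic.load(Ordering::Relaxed));
|           println!("locked = {}", *locked.lock().unwrap());
|       }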
 
  | Kranar wrote:
  | I don't see anything more misleading about lock-free than about
  | anything else involving parallelism. Lock freedom has to do
  | with guarantees about scheduling: you can be lock-free and
  | still have contention, but a lock-free algorithm is guaranteed
  | to make progress within a bounded amount of time. A blocking
  | algorithm is not guaranteed to make any kind of progress within
  | a bounded amount of time.
  | 
  | If you're working on a real-time system, and I don't mean the
  | colloquial meaning of that term, but rather a system that must
  | guarantee that a side effect is produced no later than a given
  | deadline, then you must use a lock-free algorithm; there
  | is no alternative. Avionics, DSP, high-frequency trading, these
  | are all domains where lock-free algorithms are necessary.
  | Making a fast lock-free algorithm is great, and generally the
  | goal of any kind of parallelism is performance, but "fast" is
  | an ambiguous term: it can mean low latency or high bandwidth,
  | and it's very hard to write an algorithm that is both.
  | 
  | Lock-free algorithms are great when you need low latency and
  | guaranteed progress. If you don't need that, then it's
  | possible to trade those properties for much higher bandwidth
  | and use blocking algorithms.
 
    | ncmncm wrote:
    | If you are counting on lock-free operations to guarantee you
    | will meet your real-time scheduling requirements, you have
    | chosen to fail, right out of the gate. Guaranteeing that
    | deadlines are met takes engineering. Using lock-free
    | operations is not a substitute for engineering. There is no
    | such substitute.
    | 
    | Also: All this has nothing to do with parallelism.
    | Parallelism is about things happening at the same time.
    | Locking or not locking, and contention, are about
    | concurrency, which is about synchronization and
    | serialization, a different topic.
 
      | Kranar wrote:
      | If you are using a mutex to guarantee real-time scheduling
      | requirements, you have failed.
      | 
      | Furthermore if you think a discussion about benchmarking an
      | algorithm on a 24-core SMP system involving atomic reads,
      | writes and shared caches has nothing to do with parallelism
      | then let's just agree to disagree with one another and I
      | wish you all the best in your programming endeavors.
      | 
      | People can decide for themselves what credibility they wish
      | to lend to your argument given your perspective on this
      | matter.
 
  | duped wrote:
  | Unfortunately I think all this comment does is muddy the waters
  | even more. All "lock free" means is that one process(+) may not
  | cause any other process to block during execution.
  | 
  | "Contention" is not very meaningful to a lock-free algorithm,
  | since lock-free algorithms typically do not permit shared data
  | access semantics. Rather they describe how data is _sent
  | between_ processes, which is why a lock-free solution looks
  | very different from something behind mutexes.
  | 
  | But to the heart of your point, I do wish people spent more
  | time understanding that lock-free vs lock-ful is a distinction
  | in semantics and doesn't imply anything about performance.
  | There are subtle trade offs.
  | 
  | (+) I'm using "process" with a lot of hand waving, assume it
  | means whatever your atom of parallelism is
 
    | ncmncm wrote:
    | You are confusing two fundamentally different topics.
    | Anywhere there is no possibility of contention for access to
    | a resource, "lock-free" is trivially meaningless.
 
| twoodfin wrote:
| In situations like this, it's valuable to know about all the
| event types perf can sample beyond CPU cycles, including cache
| misses at various levels and branch mis-predictions. These can go
| right into flamegraphs as well.
 
| Thiez wrote:
| Is this the kind of situation where assigning a core affinity to
| the various threads would have made a difference (to make sure
| they all run on the same physical processor), or is the mere
| presence of another processor enough to trigger the behavior
| shown in the article?
 
  | loeg wrote:
  | No, that would not make a significant difference here. The
  | problem is that the same line has to bounce around all of the
  | threads that share it; pinning doesn't reduce that very much on
  | this time-scale. Thread migrations will be very infrequent
  | relative to atomic operations, especially when the number of
  | runnable threads is lower than the number of physical cores.
 
  | dagmx wrote:
  | Core affinity would only help if he didn't have more threads
  | than clusters. Otherwise he'd either hit the wall his laptop
  | did, or cross clusters and hit the cache issues.
 
  | [deleted]
 
  | hermitdev wrote:
  | I don't think core affinity would have helped in this
  | situation. They had one instance that was being shared across
  | all cores, necessitating that the reference count be sync'd
  | between sockets at the L3 cache level. They may have seen some
  | benefit if they had, say, 2 instances, one per CPU socket, and
  | dedicated an instance per socket to avoid L3 cache
  | interactions. But, in this case, it is far simpler, cleaner and
  | better performing to not share the instance and avoid the
  | reference counting entirely.
 
| [deleted]
 
| w_t_payne wrote:
| Shared-nothing dataflow models are an interesting way to engineer
| (some types of) parallelism.
 
| bob1029 wrote:
| > The fact that our reference counters are atomic is an
| additional problem that makes things even more complex for the
| processor. Although using atomic instructions is often referred
| to as "lockless programming", this is slightly misleading - in
| fact, atomic operations require some locking to happen at the
| hardware level. This locking is very fine-grained and cheap as
| long as there is no congestion, but as usual with locking, you
| may expect very poor performance if many things try to fight for
| the same lock at the same time. And it is of course much worse if
| those "things" are whole CPUs and not just single cores that are
| close to each other.
| 
| Contention is the devil and you can _never_ hide from it if you
| bring it into your life. The fastest "contention avoidance hack"
| would be the CAS operation mentioned above, which still performs
| approximately 2 orders of magnitude slower than a single-threaded
| operation would in the most adverse scenarios.
| 
| If you aren't sure of which problems actually "fit" on a modern
| x86 thread these days, rest assured that many developers in
| fintech have been able to squeeze entire financial exchanges onto
| one:
| 
| https://lmax-exchange.github.io/disruptor/disruptor.html
| 
| For most scenarios, you really don't need all those damn cores.
| They are almost certainly getting in the way of you solving your
| problems.
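| 
| For reference, a CAS-based increment in Rust looks roughly like
| this (a generic sketch, not the article's code); every failed
| compare_exchange_weak means another core won the cache line
| first, and the retry has to drag it back:
| 
|       use std::sync::atomic::{AtomicU64, Ordering};
| 
|       fn cas_increment(counter: &AtomicU64) -> u64 {
|           let mut current = counter.load(Ordering::Relaxed);
|           loop {
|               match counter.compare_exchange_weak(
|                   current,
|                   current + 1,
|                   Ordering::AcqRel,
|                   Ordering::Relaxed,
|               ) {
|                   Ok(prev) => return prev + 1,
|                   // Lost the race: reload the fresh value and retry.
|                   Err(actual) => current = actual,
|               }
|           }
|       }
| 
|       fn main() {
|           let c = AtomicU64::new(0);
|           assert_eq!(cas_increment(&c), 1);
|       }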
 
  | hinkley wrote:
  | I believe Rust is pulling on the right thread here but I have
  | no idea how this will look in fifteen years. At some point we
  | will need to be able to tell that two threads don't contend and
  | thus can be "farther away" from each other, or that two threads
  | are constantly communicating and need to use the same core
  | complex and/or memory banks.
 
    | dagmx wrote:
    | I think it could be something built upon lifetime analysis. I
    | think a sufficiently complex compiler could use the data
    | available already today to infer thread affinity and other
    | rules.
 
      | hinkley wrote:
      | Basically what I was thinking. In the near term it doesn't
      | even have to be that precise to have a benefit (Rust builds
      | on escape analysis that helped with distributed garbage
      | collection, even when it wasn't that exhaustive). Even
      | ranking threads by suspected degree of interaction is a lot
      | of data.
      | 
      | These definitely don't talk. Offload. These definitely
      | talk. Colocate. These probably talk. Bin pack.
 
  | cjfd wrote:
  | "For most scenarios, you really don't need all those damn
  | cores. They are almost certainly getting in the way of you
  | solving your problems." This rings true. I still think the most
  | promising multi-threading strategy is to keep everything on one
  | core by default and, if you find some expensive and more or
  | less independent workload, to put that in another thread.
  | This is the old-fashioned way, though. All of the cool kids
  | push for thread pools where every workload jumps from core
  | to core and back again.... This can potentially require a lot
  | of locking, though, which also is not good for performance.
 
    | zmmmmm wrote:
    | It's an interesting reflection then on the Actor paradigm
    | where you are encouraged to create large numbers of fine-
    | grained tasks and toss them into a giant thread pool blender
    | because lightweight threads are "free" now. It would be
    | quite interesting to see the consequences of thread affinity
    | loss in that setting.
 
  | cecilpl2 wrote:
  | Games are a prime example of systems which must take advantage
  | of all possible hardware resources.
 
    | bob1029 wrote:
    | The fact that we believe this is probably part of the problem
    | with game development today.
 
    | dagmx wrote:
    | Games similarly struggle with multi core/ccx designed chips.
    | 
    | Battlefield 4 was famously terrible on Ryzen when AMD first
    | released it.
 
      | littlecranky67 wrote:
      | To be fair, Ryzen mostly underperformed when it was
      | released, which was not AMD's fault. When engineers spend
      | years using Intel as their benchmarking and testing
      | machines and tweaking all sorts of parameters measured
      | against Intel only, you inevitably end up with an optimized
      | implementation that mirrors the hardware details of that
      | chip in some way. Any new architecture would perform
      | badly, unless it is a perfect clone of the original
      | testing hardware.
 
        | dagmx wrote:
        | I wasn't implying it was AMD's fault, but the idea that
        | the games were optimized for Intel doesn't hold true IMHO
        | because they were also built for consoles that were AMD.
        | 
        | The majority of the performance loss was down to not
        | accounting for CCX scheduling.
 
  | PaulDavisThe1st wrote:
  | Another interesting example is audio synthesis. It is
  | interesting because there are at least two models for audio
  | synthesis: single sample processing (as used in, for example,
  | VCV Rack) and block structured processing (as used by every
  | plugin API and thus every DAW-that-runs-plugins). But it gets
  | even more interesting because the degree of potential
  | parallelism is entirely controlled by the signal routing
  | pathways chosen by the user. If the software is processing N
  | independent signal pathways that do not meet until a final
  | output stage, you have a high (well, "N") level of potential
  | parallelism, and if N can expand to a high value then you will
  | always be able to benefit from more cores.
  | 
  | However, if the signal pathway is more convoluted (no pun
  | intended for my audio nerd crew), then not only does the degree
  | of parallelism decrease but the issue of single-sample vs.
  | block structured can become much more important. In the single
  | sample case, the cost of computing a single sample has
  | (relatively) enormous fixed overhead if the code has any
  | serious level of modularity, but is still very cheap compared
  | to inter-thread and inter-core synchronization. Spreading the
  | computation across cores will frequently, but not always, slow
  | things down quite dramatically. By contrast, block structured
  | processing has a much lower per-sample cost, because the
  | function call tree is invoked only once for an entire block of
  | samples, and so the relative cost of synchronization decreases.
  | 
  | This sounds like a slam dunk for block structured processing.
  | 
  | Problem is, there are things you cannot do correctly with block
  | structured processing, and so there is always a place for
  | single sample processing. However, the potential for
  | parallelization and the level at which it should take place
  | differs significantly between the two processing models, which
  | opens the door to some highly questionable design.
  | 
  | The short version of this is that audio synthesis in the
  | abstract can always expand to use any number of cores,
  | particularly for cpu-intensive synthesis like physical
  | modelling, but that in the presence of single-sample
  | processing, the cost of synchronization may completely negate
  | the benefits of parallelism.
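  | 
  | A toy Rust sketch of the two models (made-up names, not any real
  | plugin API): the per-sample path pays the call overhead on every
  | sample, while the block path pays it once per buffer, which is
  | what shifts the relative cost of any cross-thread
  | synchronization.
  | 
  |       // Per-sample vs. block-structured processing, shown with a
  |       // trivial gain node.
  |       trait Node {
  |           fn process_sample(&self, x: f32) -> f32;
  | 
  |           // Block path: one call into the node, amortized over the
  |           // whole buffer.
  |           fn process_block(&self, buf: &mut [f32]) {
  |               for x in buf.iter_mut() {
  |                   *x = self.process_sample(*x);
  |               }
  |           }
  |       }
  | 
  |       struct Gain(f32);
  | 
  |       impl Node for Gain {
  |           fn process_sample(&self, x: f32) -> f32 {
  |               x * self.0
  |           }
  |       }
  | 
  |       fn main() {
  |           let gain = Gain(0.5);
  |           let mut buf = vec![1.0f32; 64];
  | 
  |           // Single-sample model: 64 separate trips through the call tree.
  |           for x in buf.iter_mut() {
  |               *x = gain.process_sample(*x);
  |           }
  | 
  |           // Block model: one trip for the entire buffer.
  |           gain.process_block(&mut buf);
  |       }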
 
    | spacechild1 wrote:
    | Great write up!
    | 
    | What I like about Pd is that you can freely reblock and
    | resample any subpatch. Want some section with single-sample-
    | feedback? Just put a [block~ 1]. You can also increase the
    | blocksize. Usually, this is done for upsampling and FFT
    | processing. Finally, reblocking can be nested, meaning that
    | you can reblock to 1024 samples and inside have another
    | subpatch running at 1 sample blocksize.
    | 
    | SuperCollider, on the other hand, has a fixed global
    | blocksize and samplerate, which I think is one of its biggest
    | limitations. (Needless to say, there are many things it does
    | better than Pd!)
    | 
    | ---
    | 
    | In the last few days I have been experimenting with adding
    | multi-threading support to Pd
    | (https://github.com/Spacechild1/pure-data/tree/multi-
    | threadin...). With the usual blocksize of 64 samples, you can
    | definitely observe the scheduling overhead in the CPU meter.
    | If you have a few (heavy-weight) subpatches running in
    | parallel, the overhead is negligible. But for [clone] with a
    | high number of (light-weight) copies, the overhead becomes
    | rather noticeable. In my quick tests, reblocking to 256
    | samples already reduces the overhead significantly, at the
    | cost of increased latency, of course.
    | 
    | ---
    | 
    | Also, in my plugin host for Pd/Supercollider
    | (https://git.iem.at/pd/vstplugin/) I have a multi-threading
    | and bridging/sandboxing option. If the plugin itself is
    | rather lightweight and the blocksize is small, the scheduling
    | overhead becomes quite noticeable. In Pd you can just put
    | [vstplugin~] in a subpatch + [block~]. For the SuperCollider
    | version I have added a "reblock" argument to process the
    | plugin at a higher blocksize, at the cost of increased
    | latency.
 
    | michaelrmmiller wrote:
    | To add onto this very good explanation from Paul, in
    | practice, audio processing graphs always rapidly decay into
    | serial processing after starting out massively parallel. Most
    | of the time, we are writing to a single output (i.e. speakers
    | via your audio interface or a file). On top of that, some of
    | the heaviest weight DSP happens on this final output stage
    | (mastering chains with multiband compressors, linear phase
    | equalizers, limiters etc.) So every 5.3ms (256 sample buffer
    | @ 48kHz sample rate) we start massively parallel processing
    | all the leaf nodes (audio tracks with sound files, virtual
    | instruments and synths) and end up bottlenecking as the tree
    | collapses into a line. Then we are stuck doing some of the
    | most CPU intensive work on a single core since we can't
    | process the limiter DSP plug-in until the EQ finishes its
    | work, for example.
    | 
    | We need to meet our real-time deadline or risk dropping
    | buffers and making nasty pops and clicks. That mastering
    | stage can pretty easily be the limiting (hah) step that
    | causes us to miss the deadline, even if we processed hundreds
    | of tracks in parallel moments before in less time.
    | 
    | The plug-in APIs (AudioUnits, VSTs, AAX) which are
    | responsible for all the DSP and virtual instruments are also
    | designed to process synchronously. Some plug-ins implement
    | their own threading under the hood but this can often get in
    | the way of the host application's real-time processing. On
    | top of that, because the API isn't designed to be
    | asynchronous, the host's processing thread is tied up waiting
    | for the completed result from the plug-in before it can move
    | on.
    | 
    | Add on that many DSP algorithms are time-dependent. You can't
    | chop up the sample buffer into N different parts and process
    | them independently. The result for sample i+1 depends on
    | processing sample i first.
 
| cletus wrote:
| The mantra I live by is that multi-threaded programming is so
| potentially complex and so easy to screw up (eg erasing all
| your supposed performance gains, as was the case in this article,
| or introducing subtle bugs or (worse) security flaws) that you
| should do absolutely everything you can to avoid it.
| 
| IME C++ programmers seem to put an unreasonable amount of faith
| in RAII to "solve" these problems, eg:
|       std::mutex m;
|       ...
|       {
|           std::lock_guard lg(m);
|           // do stuff with a mutex lock
|       } // lock_guard destructor releases mutex
| 
| If you're just updating a (private) map or something, this is
| fine. But as soon as this gets complex (eg by calling a function
| and you lose control over what thread is executing because some
| deep code creates a new thread) then this all turns to crap.
| 
| So whenever anyone tells you "just start a thread" without any
| hesitation, it's reasonably safe to assume they have no idea what
| they're talking about.
| 
| This is one reason I love the async code model used in a number
| of languages. I have the most experience with Hack [1] but it's
| not unique to that. What I tend to think of as "product" code
| (which is a real Facebook-ism) has no need to ever create a
| thread or deal with any concurrency issues.
| 
| Now if you're programming an HTTP server or an RDBMS or other
| infrastructure then sure, you have a valid need. But if you're
| just serving HTTP requests, just avoid it entirely if you can
| because any performance gain is probably imaginary but the
| complexity increase isn't.
| 
| [1]: https://docs.hhvm.com/hack/asynchronous-
| operations/introduct...
 
  | otabdeveloper4 wrote:
  | Don't call functions (that aren't part of the C++ standard
  | library) inside the lock guard. It's not a hard rule, but
  | following it will get you 99% of the way there.
 
  | Thiez wrote:
  | Isn't that mantra a bit too pessimistic? After fixing the
  | performance issue, a speedup that was almost linear in the
  | number of cores was achieved using multithreading.
  | 
  | In addition, async code can bring many of the same challenges
  | as multithreading. For example, in C# tasks may run on the
  | current thread, but they may also run on a thread-pool, and
  | this is mostly transparent to the programmer. When you don't
  | immediately await all asynchronous operations your tasks may
  | end up running at the same time.
 
    | cletus wrote:
    | Like anything you weigh up the upside against the downside.
    | 
    | The downside is more complex code, code that's harder to
    | reason about, the possibility of concurrency bugs, slower
    | development times, the possibility of security flaws and you
    | may just be shifting your performance bottleneck (eg creating
    | a hot mutex).
    | 
    | The upside is lower latency, more throughput and higher QPS.
    | 
    | My argument is quite simply YAGNI. A lot of these benefits
    | are for many people purely imaginary. They're the very
    | definition of premature optimization. Code that is less
    | complex, faster to develop in and likely "good enough" is
    | probably going to help you way more than that.
    | 
    | If you get to the scale where heavy multithreading actually
    | matters, that's a good problem to have. My argument is that
    | many people aren't there and I would hazard a guess that a
    | contributing factor to not getting there is worrying about
    | this level of performance before they need to.
 
  | thewarrior wrote:
  | This model is good for I/O heavy code that is typical for most
  | web services. However if your load is a mix of I/O and compute
  | something like Go should do a lot better.
 
    | cletus wrote:
    | Not sure why you're being downvoted because you raise a valid
    | point: Go has a concurrency model aimed at avoiding many of
    | these problems. This is of course the channel abstraction.
    | 
    | Go has other issues. My first piece of advice to any new Go
    | programmer is make every single one of your channels
    | unbuffered. This way it operates a lot like the async model I
    | mentioned and it tends to be easy to reason about.
    | 
    | But people fall into a trap of thinking adding a buffer will
    | improve concurrency and throughput. This might be true but
    | most of the time it's a premature optimization. Buffered
    | channels also often lead to obscure bugs and a system that's
    | simply harder to reason about.
    | 
    | It's really better to think of Go as a tool for organizing
    | concurrent code, not necessarily a direct solution to the
    | problems involved. But organization matters.
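    | 
    | Not Go, but the same discipline translates: Rust's standard
    | library has a rendezvous channel that behaves roughly like an
    | unbuffered Go channel. A minimal sketch:
    | 
    |       use std::sync::mpsc::sync_channel;
    |       use std::thread;
    | 
    |       fn main() {
    |           // A bound of 0 gives a rendezvous channel: every send
    |           // blocks until the receiver is ready to take the value.
    |           let (tx, rx) = sync_channel::<u64>(0);
    | 
    |           let producer = thread::spawn(move || {
    |               for i in 0..5 {
    |                   tx.send(i).unwrap(); // blocks until received
    |               }
    |           });
    | 
    |           for v in rx {
    |               println!("got {v}");
    |           }
    |           producer.join().unwrap();
    |       }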
 
      | thewarrior wrote:
      | All concurrency tools are foot guns but Go seems to be the
      | safest foot gun :)
 
  | FpUser wrote:
  | >"But if you're just serving HTTP requests, just avoid it
  | entirely if you can because any performance gain is probably
  | imaginary but the complexity increase isn't"
  | 
  | I serve HTTP requests offering JSON-based RPC in my C++ servers
  | that do all kinds of calculations (often CPU-consuming) in
  | multiple threads. Works just fine and scales well. Yes, I have
  | to be careful about how I work with shared data, but it is
  | centralized, encapsulated, and done in a single place I call
  | the request router / controller. I reuse it among projects.
  | Nothing too exciting about it and no reason to be scared to
  | death ;)
 
| amichal wrote:
| Would this tool have scaled as well as 24 separate OS
| processes?
| 
| I remember hitting this same issue many years ago. Some primitive
| in the library (C++) didn't, on the surface, appear to involve
| shared state and cache write locking, but of course it did if you
| thought about it. Since I switched over to more application-level
| coding I don't work at that low level very much and tend to scale
| with processes now (typically not needing that much parallelism
| anyway), so I don't often have to think these things through
| anymore.
 
  | 10000truths wrote:
  | The problem outlined in the article is not with process vs
  | thread, the problem is with multiple parallel execution
  | contexts reading and writing the same cache line. This issue
  | would have manifested in the same way if your separate
  | processes touched the same offset in a shared memory mapping.
  | 
  | Of course, designing with multiple processes in mind helps to
  | design a shared-nothing architecture, as the costs of shared
  | state between processes are much more explicit. But _any_ kind
  | of state that needs to be shared will run into the same
  | bottleneck, because all forms of shared IPC fundamentally have
  | to use some form of locking somewhere in the stack to serialize
  | incoming inputs.
 
    | lmeyerov wrote:
    | I think the intuition on processes vs threads was right:
    | 
    | 1. The Arc use is a case (I believe! Am not a Rust'er!) of
    | true sharing of references. The cache line will be contended
    | across cores. Using processes would have eliminated true
    | sharing. Tracking logical core affinity for references via a
    | modal type system could have flagged this at compile time,
    | but I suspect that's not a thing in Rust land.
    | 
    | 2. Even if it was false sharing at the cache-line level,
    | process abstractions in shared-memory systems are generally
    | aligned to avoid false sharing at cache-line granularity (and
    | misaligned ops at smaller granularities). Of course, in the
    | extreme, a 10X hero programmer may have tried to do amazing
    | levels of bit packing and syscall hackery and thus break all
    | those niceties, but that's the cleanup everyone else on the
    | team has to do to make the hero a 10Xer.
 
  | loeg wrote:
  | Probably yes, unless the author went to great lengths to
  | contend on shared memory segments.
 
| wabain wrote:
| This is an excellent writeup, but I'm curious about how dependent
| the microbenchmark is on the scripting engine doing no real work.
| My off-the-cuff guess is that the unshared version might still
| perform better with other workloads (it would have higher
| throughput and avoid variance caused by fluctuating degrees of
| cache contention) but it might be that the observable overhead
| from hammering a single cache line would disappear if each thread
| had more varied work to do. And of course if the scripting engine
| benefits substantially from caching internal state across
| invocations (things like incrementally optimized JIT code) then
| having 24 instances can incur substantial extra work.
 
| dathinab wrote:
| Does anyone have any idea where the `` format strings come from??
| 
| As far as I know rust doesn't have them.
| 
| And as far as I know, if Rust gets format strings it will most
| likely get them in the form of an `f` prefix (like raw strings
| have an `r` prefix and binary strings have a `b` prefix).
| 
| Is it just a thing to make the code more readable/shorter on the
| web?
 
  | metabagel wrote:
  | Markdown uses backticks around code.
  | 
  | https://docs.github.com/en/github/writing-on-github/getting-...
 
  | cdirkx wrote:
  | The code shown in the article isn't Rust, but Rune [0], which
  | has similar syntax yet is a dynamic scripting language. It has
  | template literals [1].
  | 
  | [0] https://rune-rs.github.io/
  | 
  | [1] https://rune-rs.github.io/book/template_literals.html
 
    | dathinab wrote:
    | Oh, that makes sense.
 
  | michalhosna wrote:
  | If you are talking about "Here is an example of a complete
  | workload that measures performance of selecting rows by random
  | keys", then you are looking at Rune code rather than Rust code.
  | 
  | See https://rune-rs.github.io/book/template_literals.html
 
| johnisgood wrote:
| Can anyone TL;DR it? A single line, in Rust, can make a 24-core
| server slower than a laptop? What line is that?
 
  | jkelleyrtp wrote:
  | In a multithreaded context, data was shared across threads via
  | atomic reference counting. The atomic counter acted as a single
  | bottleneck for every operation. The solution was to derive deep
  | cloning for the shared data (a one-line fix) to reduce the
  | atomic contention.
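  | 
  | Schematically (made-up types, not the article's actual code),
  | the shape of the fix was to go from sharing one Arc across all
  | worker threads to handing each thread its own deep copy:
  | 
  |       use std::sync::Arc;
  |       use std::thread;
  | 
  |       // Hypothetical stand-in for the read-only dataset that was
  |       // shared behind an Arc in the article.
  |       #[derive(Clone)]
  |       struct Dataset {
  |           rows: Vec<u64>,
  |       }
  | 
  |       fn main() {
  |           let shared = Arc::new(Dataset { rows: (0..1_000).collect() });
  | 
  |           let handles: Vec<_> = (0..24)
  |               .map(|_| {
  |                   // Before: Arc::clone(&shared) per thread/call -- cheap,
  |                   // but every clone and drop bumps one shared atomic
  |                   // refcount, and that cache line ping-pongs between
  |                   // cores (and sockets).
  |                   // After: deep-clone once per thread, so each thread
  |                   // owns an independent copy and no shared counter is
  |                   // touched afterwards.
  |                   let local: Dataset = (*shared).clone();
  |                   thread::spawn(move || local.rows.iter().sum::<u64>())
  |               })
  |               .collect();
  | 
  |           let total: u64 = handles
  |               .into_iter()
  |               .map(|h| h.join().unwrap())
  |               .sum();
  |           println!("{total}");
  |       }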
 
| PhileinSophia wrote:
 
| vlovich123 wrote:
| Is there any tooling that can demonstrate cache coherency
| hotspots like cache line bouncing or inter-core communication?
 
  | mhh__ wrote:
  | All of the "serious" profilers support this. Intel have a bunch
  | of tooling for diagnosing threading issues, AMD also have some
  | less polished but still functional metrics you can use.
 
    | vlovich123 wrote:
    | Do they track it down to the actual problem point in the
      | code? Or do you just see that maybe there's a problem and
      | then experiment with changes to figure out what needs to
    | be fixed?
 
      | mhh__ wrote:
      | Short answer: yes. Much better than perf does.
 
  | maxwell86 wrote:
  | Yes. Linux perf has a c2c "mode"/"option" that tracks HITM (hit
  | modified) events and will report precisely this issue.
 
    | loeg wrote:
    | What is c2c? Thanks.
 
      | tkhattra wrote:
      | c2c stands for cache-2-cache.
      | 
      | perf-c2c allows you to measure cacheline contention - see
      | https://man7.org/linux/man-pages/man1/perf-c2c.1.html and
      | https://joemario.github.io/blog/2016/09/01/c2c-blog/.
 
      | eptcyka wrote:
      | Cache to cents? Kernels 2 cents on how your code is
      | utilizing the cache? I have 0 clue too.
 
        | masklinn wrote:
        | Cache to Cache.
        | 
        | It tracks a bunch of cache-related counters, and HITM
        | (hit modified) tracks loads of data modified by either
        | another core of the same node ("local hitm") or by another
        | node entirely ("remote hitm").
        | 
        | Here, the contention would have shown up as extreme
        | amounts of both, as each core would almost certainly try
        | atomically updating a value which had just been updated
        | by another core, with higher-than-even chances the value
        | had been updated by a core of the other socket (= node).
 
  | gjulianm wrote:
  | I've used the Intel VTune Profiler; I think it's free. It's a
  | pretty good tool for profiling code and gives you quite a lot
  | of information on how you're using the core pipelines, but you
  | need an Intel processor to use it. I think it would have
  | quickly flagged this kind of problem and told you that the
  | processor was stalled waiting on the cache.
 
    | mhh__ wrote:
    | VTune is free now. It's by far the best profiler I've
    | ever come across; AMD's equivalent is a bit rough and ready
    | by comparison.
 
| game_the0ry wrote:
| Relevant video - Scott Meyers: Cpu Caches and Why You Care
| 
| https://www.youtube.com/watch?v=WDIkqP4JbkE
| 
| Remember, boys and girls, small mistakes can have big
| consequences.
| 
| Keep your data in immutable data structures when multi-
| threading. The Actor model also helps.
 
| teen wrote:
| That code? Thread.sleep(5000)
 
| errcorrectcode wrote:
| My laptop (i7-8650U) has slightly faster single-core performance
| than my 96-thread dual 7402 EPYC Milan home server. I'm fine with
| that. I'm not fine with undue inefficiencies, i.e., excessive
| cache misses, branch mis-predictions, and lost parallelism due
| to sloppy or unfortunate code.
 
| kentonv wrote:
| An aside:
| 
| > L3 cache is shared by all cores of a CPU.
| 
| I recently learned this is no longer true! AMD EPYC processors
| (and maybe other recent AMDs?) divide cores into groups called
| "core complexes" (CCX), each of which has a separate L3 cache. My
| particular processor is 32-core where each set of 4 cores is a
| CCX. I discovered this when trying to figure out why a benchmark
| was performing wildly differently from one run to the next with a
| bimodal distribution -- it turned out to depend on whether Linux
| had scheduled the two processes in my benchmark to run on the
| same CCX vs. different CCXs.
| 
| https://en.wikipedia.org/wiki/Epyc shows the "core config" of
| each model, which is (number of CCX) x (cores per CCX).
 
  | [deleted]
 
  | Ataraxic wrote:
  | Just as information for some later readers: this is applicable
  | not just to Epyc but also to the consumer version,
  | Threadripper. My understanding is that this is an example of
  | Non-Uniform Memory Access (NUMA), which has long been used to
  | link multiple CPUs in different sockets together, but has now
  | been integrated into a CPU that fits in one socket.
  | 
  | This actually matters if you are running a VM on such a system,
  | since you'll run into things like the actual RAM (not the L3
  | cache) being directly linked to a particular NUMA node. For
  | example, accessing memory in the first RAM stick vs the second
  | will give different latencies as it goes from ccx1 => ccx2 =>
  | stick2 versus ccx1 => stick1. This is applicable to, I think,
  | the 2XXX versions and earlier for Threadripper. My
  | understanding is that they solved this in later versions using
  | the Infinity Fabric (IO die), so now all CCXs go through the IO
  | die.
  | 
  | I ran into all of this trying to run an Ubuntu machine that ran
  | Windows using KVM while passing through my Nvidia graphics
  | card.
 
    | reilly3000 wrote:
    | This sheds some light on why I had so much trouble running
    | GPU passthrough with Unraid (KVM) on my 1950X w/ Nvidia. Those
    | were dark days.
 
      | rob_c wrote:
      | Frankly that was probably 200% unraid ...
 
    | masklinn wrote:
    | > Just as information for some later readers, but this is
    | applicable not just to Epyc but also the consumer version
    | Threadripper.
    | 
    | It's applicable to any Zen design with more than one CCX,
    | which is... any Zen 3 CPU of more than 8 cores (in Zen 2 it
    | was 4).
    | 
    | The wiki has the explanation under the "Core config" entry of
    | the Zen CPUs, but for the 5000s it's all the 59xx (12 and 16
    | cores).
    | 
    | Zen 3 APUs are all single-CCX, though there are Zen 2 parts in
    | the 5000 range which are multi-CCX (because why not confuse
    | people): the 6-core 5500U is a 2x3 and the 8-core 5700U is a
    | 2x4.
    | 
    | The rest is either low-core enough to be single-CCX Zen 2
    | (5300U) or tops out at a single 8-core CCX (everything
    | else).
 
    | dagmx wrote:
    | This applies to Ryzen as well. The first-gen Ryzen was
    | plagued by software and schedulers that couldn't handle the
    | CCX affinity.
 
  | 123pie123 wrote:
  | This is important when sizing and tuning VMs/hypervisors.
 
  | masklinn wrote:
  | > I recently learned this is no longer true! AMD EPYC
  | processors (and maybe other recent AMDs?) divide cores into
  | groups called "core complexes" (CCX), each of which has a
  | separate L3 cache.
  | 
  | It's still kinda true, just that "a CPU" in that context is a
  | CCX. Cross-CCX communication has shown up a fair bit in
  | reviews and benches, and really in all chips at that scale
  | (e.g. Ampere's Altras and Intel's Xeons):
  | https://www.anandtech.com/show/16529/amd-epyc-milan-review/4
  | and one of the "improvements" in Zen 3 is the CCX are much
  | larger (there's one CCX per CCD, rather than 2) so there's less
  | crosstalk.
  | 
  | And it could already be untrue previously, e.g. the Pentium Ds
  | were rushed out by sticking two P4 dies on the same PCB; I
  | think they even had to go through the northbridge to
  | communicate, so they were dual-socket in all but physical
  | conformation (hence being absolute turds). I don't think they
  | had L3 at all though, so that wasn't really a factor, but
  | still...
 
    | tentacleuno wrote:
    | > And it could already be untrue previously e.g. the Pentium
    | D were rushed out by sticking two P4 dies on the same PCB, I
    | think they even had to go through the northbridge to
    | communicate
    | 
    | Part of me hopes you're wrong here. That is absolutely
    | absurd.
 
      | masklinn wrote:
      | Yeah you'd hope so wouldn't you?
      | 
      | But Intel got caught completely unaware by the switch to
      | multi-core, just as it had been by the 64-bit switch.
      | 
      | The eventual Core 2 was not ready yet (Intel even had to
      | bridge to it with the intermediate Core, which really had
      | more to do with the Pentium M than with the Core 2 though
      | it did feature a proper dual-core design... so much so that
      | the Core solo was actually a binned Duo with one core
      | disabled).
      | 
      | So anyway, Intel was caught with its pants around its
      | ankles for the second time, and they couldn't let that
      | happen. And they actually beat AMD to market, having turned
      | out a working dual-core design in the time between AMD's
      | announcement of the dual-core Opteron (and strong hints of
      | the X2) and the actual release, about 8 months.
      | 
      | To manage that, Intel could not rearchitect their chip
      | (and probably didn't want to as it'd become clear Netburst
      | was a dead-end), so they stapled two Prescott cores
      | together, FSB included, and connected both to the
      | northbridge.
      | 
      | It probably took more time to validate that solution for
      | the server market, which is why, while AMD released the
      | dual-core Opterons in April and Athlons in May, it took until
      | October for the first dual-core Xeon to be available.
 
      | danudey wrote:
      | Here's the wikipedia article:
      | 
      | https://en.wikipedia.org/wiki/Pentium_D
      | 
      | "The Pentium D brand refers to two series of desktop dual-
      | core 64-bit x86-64 microprocessors with the NetBurst
      | microarchitecture, which is the dual-core variant of
      | Pentium 4 "Prescott" manufactured by Intel. Each CPU
      | comprised two dies, each containing a single core, residing
      | next to each other on a multi-chip module package."
 
    | IronWolve wrote:
    | Yup, and that's why overclocking on AMD tests each of your
    | CCXs/cores, marking golden cores that can reach 5 GHz+ and
    | slower cores that don't overclock. It's also why AMD tests
    | each CCX and moves worse ones to lower-end products, and
    | better ones to the 5950X.
    | 
    | Then you can OC-tweak each core and try to milk out
    | performance; PBO tends to be pretty good at detecting which
    | cores can take a slight performance boost. With a few BIOS
    | options, all that adds up without much effort.
    | 
    | And then finally get a better (advanced) scheduler in
    | Windows/Linux that can move workloads around per core
    | depending on the workload. Windows released multiple fixes
    | for the scheduler on AMD starting with Win10 1903.
    | 
    | I find scheduler mods and modest BIOS tweaks increase
    | performance without much effort, and the gains are very
    | detectable.
    | 
    | On Linux I use xanmod (and pf-kernel before that).
    | 
    | On Windows I use Process Lasso and AMD PBO.
    | 
    | Slow/fast cores and scheduler tweaks are how Win11, ARM
    | Macs, and Android make things appear faster too.
    | 
    | Amusingly, scheduler/core tweaks have been around on Linux
    | for a decade+, making the desktop super smooth, but they're
    | now mainstream in Win11 and ARM macOS.
 
      | Andys wrote:
      | Does xanmod work as a drop-in replacement in Ubuntu with an
      | NVIDIA card?
 
    | sroussey wrote:
    | There are core complexes in Apple's M1 variations as well.
    | 
    | And it's not just the L3 that can be shared by various chips.
    | 
    | Each complex also has its own memory bandwidth, so running on
    | two cores in two complexes will get around twice the memory
    | bandwidth of two cores in the same complex.
 
  | lend000 wrote:
  | I've always wondered how much this affects shared cloud
  | instances (for example, a single c5.small instance). Can your
  | neighbor, who shares your L3 cache, be using memory so much
  | more aggressively than you that it causes you to evict your
  | L2/L1 cache? Or is cache coherency maintained if the core that
  | evicted your L3 cache is known to be living in a different
  | memory space?
 
    | aspaceman wrote:
    | Coherency will be maintained (since the protocols support
    | that case). But yes, a separate process would evict that
    | memory. From the processors point of view, they're just
    | addresses tagged with data. Caching behavior doesnt depend on
    | virtual memory addresses because I believe that info is
    | stripped away at that point.
    | 
    | So if someone is thrashing cache on the same core you're on,
    | you will notice it if the processes aren't being shared
    | effectively.
    | 
    | The contents of the cache aren't stored as part of a paused
    | process or context switch. But I'd appreciate a correction
    | here if I'm wrong.
    | 
    | For an example, consider two processes A and B running on a
    | set of cores. If A makes many more memory accesses than B, A
    | can effectively starve B of "cached" memory accesses because
    | A accesses memory more frequently.
    | 
    | If B were run alone, then its working set would fit in
    | cache, effectively making the algorithm operate from cache
    | instead of RAM.
    | 
    | BUT. You really have to be hitting the caches hard. Doesn't
    | happen too often in casual applications. I only encountered
    | this on GPUs (where each core has a separate L1 but a shared
    | L2). Even then it's only a problem if every core is hitting
    | different cache lines.
 
      | aspaceman wrote:
      | Found some detail - apparently the CPU manuals are the
      | place to check https://cs.stackexchange.com/a/1090
 
  | pdpi wrote:
  | Out of curiosity -- was the "bad" mode a case of two separate
  | workloads competing for limited cache within one CCX, or was it
  | really bad cache behaviour across CCXs causing problems because
  | the two processes shared memory?
 
    | kentonv wrote:
    | The two processes were a client and a server, the client just
    | sends HTTP requests to the server over a unix socket, but
    | otherwise they don't share resources. TBH I'm surprised there
    | was so much impact just from that. There might be something
    | more to the story -- I haven't investigated much yet. But
    | basically when I used `taskset` to limit the benchmark to two
    | cores in different CCX's, it ran about half the speed as when
    | limited to two cores in the same CCX. I guess another
    | possibility is that the kernel is allowing both processes to
    | swap between cores regularly, which would much better explain
    | the disastrous performance penalty. But, in that case I don't
    | understand the bimodal distribution I see when running the
    | benchmark with no core affinity...
 
  | hinkley wrote:
  | This kind of [stuff] drives me nuts every time I'm trying to
  | read the performance telemetry tea leaves at work.
  | 
  | Do we have requests that take longer, or did Linux do something
  | dumb with thread affinity?
 
    | dahfizz wrote:
    | Pro tip: if you care about performance, and especially if you
    | care about reliable performance measurements, you should pin
    | and accelerate your threads. Using isolated CPUs (isolcpus)
    | is the next step.
    | 
    | In other words: if your application is at the mercy of Linux
    | making bad decisions about what threads run where, that is a
    | performance bug in your app.
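    | 
    | For example, on Linux the pinning part can be done from Rust
    | via the libc crate (a minimal sketch; the libc dependency and
    | the chosen CPU number are assumptions, not anything from the
    | article):
    | 
    |       use std::mem;
    | 
    |       fn pin_current_thread_to(cpu: usize) -> std::io::Result<()> {
    |           unsafe {
    |               let mut set: libc::cpu_set_t = mem::zeroed();
    |               libc::CPU_ZERO(&mut set);
    |               libc::CPU_SET(cpu, &mut set);
    |               // pid 0 means "the calling thread".
    |               let rc = libc::sched_setaffinity(
    |                   0,
    |                   mem::size_of::<libc::cpu_set_t>(),
    |                   &set,
    |               );
    |               if rc != 0 {
    |                   return Err(std::io::Error::last_os_error());
    |               }
    |           }
    |           Ok(())
    |       }
    | 
    |       fn main() {
    |           pin_current_thread_to(2).expect("failed to set affinity");
    |           // ... latency-sensitive work now stays on CPU 2 ...
    |       }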
 
      | spockz wrote:
      | How would you do this when running in Kubernetes? AFAIK it
      | just guarantees that you get scheduled on some CPU for the
      | "right" amount of time, not which CPU that is.
      | 
      | I only know that there is one scheduler that gives you
      | dedicated cores if you set the request and limit both to
      | equal multiples of 1000.
 
        | spiddy wrote:
        | In post-Docker, pre-Kubernetes times I used `--cpuset-cpus`
        | in the Docker args to dedicate specific CPUs to Redis
        | instances, using CoreOS and fleet for cluster
        | orchestration.
 
        | ncmncm wrote:
        | Another reason to put off moving to kubernetes.
 
    | masklinn wrote:
    | > Do we have requests that take longer, or did Linux do
    | something dumb with thread affinity?
    | 
    | Yes.
    | 
    | Cross-socket communications do take longer, but a properly
    | configured NUMA-aware OS should probably have segregated
    | threads of the same process to the same socket, so the
    | performance should have increased linearly from 1 to 12
    | threads, then fallen off a cliff as the cross-socket effect
    | started blowing up performance.
 
  | klysm wrote:
  | Do you use NUMA nodes to tell the kernel about this or
  | something else?
 
    | wmf wrote:
    | You can enable node-per-CCX mode in UEFI to do that.
    | Otherwise you have to use tools like hwloc to discover the
    | cache topology and pin workloads accordingly.
 
      | rob_c wrote:
      | Underrated and useful reply. Thank you!
 
  | buildbot wrote:
  | L3 being fully shared is only common on smaller chips, really;
  | on any large Intel design the L3 isn't all in one place, it's
  | broken up into slices across the cores.
 
    | wmf wrote:
    | That's still a single logical L3 (e.g. a single core could
    | use the entire L3 capacity).
 
| cheschire wrote:
| There are only two hard things in Computer Science: cache
| invalidation and naming things. - Phil Karlton
 
  | airstrike wrote:
  | There are actually only two hard problems in computer science:
  | 
  | 0) Cache invalidation
  | 
  | 1) Naming things
  | 
  | 5) Asynchronous callbacks
  | 
  | 2) Off-by-one errors
  | 
  | 3) Scope creep
  | 
  | 6) Bounds checking
 
    | c0balt wrote:
    | You forgot text encodings, endianess, ...
 
    | hinkley wrote:
    | Is 4 uninitialized variables?
 
    | Scarblac wrote:
    | There is only one hard problem in computer science: _people_.
 
    | [deleted]
 
    | nicoburns wrote:
    | Most languages will do bounds checking for you. I reckon that
    | one's mostly solved at this point.
 
      | chris_st wrote:
      | But... but... but... we HAVE to use C, it's the only
      | language that's fast enough!
      | 
      | /s
 
      | andi999 wrote:
      | Too slow.
 
  | hnaccount_rng wrote:
  | You missed the off-by-one errors
 
    | cheschire wrote:
    | So you mean I was off by one?
 
  | Pet_Ant wrote:
  | Better version:
  | 
  | There are only two hard things in Computer Science: cache
  | invalidation, naming things, and off-by-one errors.
 
    | MR4D wrote:
    | Better better version:
    | 
    | There are only two hard things in Computer Science: cache
    | invalidation, and...what did you ask me again?
 
    | jahnu wrote:
    | There are only two hard things in Computer Science: cache
    | invalidation, naming things, and off-by-oly two hard things
    | in Computer Scivetem dy gjera te veshtira ne Shkencen
    | Kompjute
 
      | hermitdev wrote:
      | Now, just toss in a byte order mark, and I think we've a
      | winner!
 
      | jacquesm wrote:
      | Hehe. Ok. That is very funny :)
 
        | gavinray wrote:
        | I don't get it -- am I slow?
 
        | [deleted]
 
        | jacquesm wrote:
        | Buffer overflow :)
 
        | vaidhy wrote:
        | This used to happen in C for me. Basically, you are
        | trying to print a string which has overflowed the
        | allocated length and you get junk printed on the string.
        | C-Strings are null terminated and the system will keep
        | printing till it encounters a null. So, bounds checking
        | and buffer overflows are also hard problems.
 
      | colejohnson66 wrote:
      | Gotta watch out for those nulls :)
 
        | silon42 wrote:
        | Nulls are easy compared to invalid non-null pointers.
 
        | ludamad wrote:
        | People say null in Java was a mistake, but I've never had
        | any issues in my null-only dialect of Java /s
 
    | kibwen wrote:
    | There are only two hard problems in Computer Science: cache
    | invalidation, naming things, off-by-one errors, and
    | remembering to update the documentation.
 
| nikkinana wrote:
 
| [deleted]
 
| physicsgraph wrote:
| This article is a well-written deep dive on parallel performance
| debugging that ends up identifying the difference as being due to
| how caching is handled on multi-CPU vs single-CPU systems. Thanks
| to the author for taking the time to write up their
| troubleshooting process.
 
| pengaru wrote:
| TL;DR: serializing your parallel work dispatch on an atomic
| global counter is bad for scaling across many cores.
| 
| Or, per-cpu counters exist for a reason.
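| 
| A rough Rust sketch of the per-CPU / sharded counter idea (names
| made up, not tied to the article): each thread bumps only its own
| cache-line-aligned slot, and readers sum the shards.
| 
|       use std::sync::atomic::{AtomicU64, Ordering};
| 
|       // Each shard sits on its own cache line so the slots don't
|       // falsely share.
|       #[repr(align(64))]
|       struct PaddedCounter(AtomicU64);
| 
|       struct ShardedCounter {
|           shards: Vec<PaddedCounter>,
|       }
| 
|       impl ShardedCounter {
|           fn new(n: usize) -> Self {
|               let shards = (0..n)
|                   .map(|_| PaddedCounter(AtomicU64::new(0)))
|                   .collect();
|               Self { shards }
|           }
|           fn add(&self, shard: usize, n: u64) {
|               let i = shard % self.shards.len();
|               self.shards[i].0.fetch_add(n, Ordering::Relaxed);
|           }
|           fn sum(&self) -> u64 {
|               self.shards.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
|           }
|       }
| 
|       fn main() {
|           let c = ShardedCounter::new(24);
|           std::thread::scope(|s| {
|               for id in 0..24 {
|                   let c = &c;
|                   s.spawn(move || {
|                       for _ in 0..100_000 {
|                           c.add(id, 1); // only touches this thread's line
|                       }
|                   });
|               }
|           });
|           assert_eq!(c.sum(), 24 * 100_000);
|       }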
 
| dan-robertson wrote:
| It seems to me that one solution would be to be able to shard an
| Arc into separate counts for each thread, so that the
| synchronisation wouldn't be required (?) but I don't know how it
| would be done without making most Arcs unnecessarily expensive.
| 
| Perhaps another problem is that libraries can't be (or just
| aren't) generic about the kind of smart-pointers that are used
| but the kind of smart-pointers used can matter a lot (e.g.
| libraries often have to use Arc for some of their clients even
| though Rc would often suffice).
| 
| But maybe I'm just totally wrong about this?
 
  | Kranar wrote:
  | It's a great idea and in fact a variation of it is being
  | explored by Python to eliminate the use of the GIL without
  | introducing performance regressions.
  | 
  | The variation is instead of one count per thread, have two
  | reference counts where one is atomic and the other is not
  | atomic. The heuristic is that at any given time, the majority
  | of writes are performed within a single thread and so that
  | thread has exclusive read/write access to the non-atomic
  | reference count, all other threads share the atomic reference
  | count. There is some work needed to coordinate the atomic and
  | non-atomic reference count so as to test when an object can be
  | deleted, but the overhead of this work is less than the cost of
  | always using an atomic.
  | 
  | http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf
 
    | dan-robertson wrote:
    | In the case of the original article, the reference count was
    | being updated by all the threads so having an Arc-with-Rc-
    | for-one-thread wouldn't actually change things, I think. But
    | maybe you just mean for 'cheaper smart-pointers in general'?
    | 
    | The python thing is interesting. I thought there were other
    | reasons for the GIL than ref counts but I guess everything
    | must be addressed.
 
      | Kranar wrote:
      | Yep in this case it might not have helped, but all this to
      | say that some exploration of having multiple reference
      | counts is a great idea. The balance becomes the overhead of
      | coordinating multiple reference counts versus the gains
      | from reduced atomic operations.
 
| Pxtl wrote:
| That actually makes me appreciate that the single-threadedness of
| Python might be a good thing, since Python uses refcounting.
 
| reilly3000 wrote:
| FWIW this is the kind of content I originally came to HN for and
| what keeps me coming back.
 
| chrisaycock wrote:
| The culprit was a shared _atomic reference counter_ (Arc). The
| cache invalidation across cores eliminated any kind of scaling
| via multiple threads.
| 
| The solution is to stop sharing. The author had to make changes
| specific to this codebase for each thread to have its own copy of
| the offending dataset.
 
  | denysvitali wrote:
  | Awesome TL;DR! I read the full article but this summarizes it
  | well. The article is worth a full read anyway.
 
  | divs1210 wrote:
  | Immutable collections save the day!
 
    | Thiez wrote:
    | Immutable collections wouldn't really help here since Rust
    | does not have a garbage collector, so they would still be
    | wrapped in an `Arc`, or use one internally. Unless you're
    | giving each thread their own copy of the immutable
    | collection, in which case the immutability doesn't really add
    | much.
 
      | ericbarrett wrote:
      | You're right if they're wrapped in an Arc. But it's a
      | lot easier to confidently clone an immutable data structure
      | to threads/async functions because you know they won't be
      | mutated by other cores. If you're cloning mutable data,
      | you'll have to collect each "shard" at the end of whatever
      | runtime loop your program has, and merge them back into the
      | authoritative model. (Or drop the changes, but then why
      | make the data mutable?) Gets hairy.
 
  | amelius wrote:
  | Or use a language with a good concurrent garbage collector.
 
    | dagmx wrote:
    | What language wouldn't hit this issue? The second you have an
    | atomic write across multiple NUMA clusters you'll hit this.
    | 
    | Edit: it appears they're talking specifically about the ref
    | counting whereas I was considering the entirety of a shared
    | context. That clarification helps me understand where their
    | statement was coming from
 
      | amelius wrote:
      | The problem was caused by a memory management primitive.
      | This wouldn't be an issue in a runtime with a GC designed
      | for concurrent loads.
 
        | lowbloodsugar wrote:
        | Wood for trees. The problem was caused by an ill-thought-
        | out design. We can do similarly performance degrading
        | things in GC languages, just the details will be
        | different. However, at some extreme, which, to be fair,
        | most systems won't hit, GC languages perform vastly worse
        | than non GC languages. In one service I own, Java's G1GC
        | uses 20x more CPU than Rust for an application specific
        | benchmark. Most of this time is spent in the concurrent
        | phase so Shenandoah and GenShen aren't going to make a
        | dent (and we can't afford the RAM for Shendandoah). 20x
        | CPU and 2x wall clock. The question we are looking at is,
        | "Should we just continue to spend 20x $ on operating
        | costs for the Java version to avoid writing it in Rust?"
 
        | dagmx wrote:
        | How would the GC avoid the atomic lock and cache
        | invalidation across NUMA boundaries? What language has a
        | sufficiently lock-less read/write-capable GC?
        | 
        | Edit: to clarify I was thinking about a mutable context
        | share as shown in the code example, not solely about the
        | ref counting.
 
        | yencabulator wrote:
        | If the only purpose of the Arc was to be a reference
        | count, a language with a (non-refcount) GC wouldn't need
        | that atomic in the first place.
 
        | dagmx wrote:
        | That's fair. It read to me, from his post, that Rune was
        | using it for context sharing as well between threads
        | (since that's the source of the highest level Arc in his
        | code example) . If it's only for ref counts then it makes
        | sense that a concurrent GC could avoid the issue
 
        | Const-me wrote:
        | > How would the GC avoid the atomic lock and cache
        | invalidation across numa boundaries?
        | 
        | By not using reference counting. State of the art GCs
        | don't count references. They usually do mark and
        | sweep, implement multiple generations, and/or do a
        | few other things.
        | 
        | Most of that overhead only happens while collecting.
        | Merely referencing an object from another thread doesn't
        | modify any shared cache lines.
        | 
        | > What language has a sufficiently lock less rw capable
        | GC?
        | 
        | Java, C#, F#, Golang.
 
        | dagmx wrote:
        | Yeah I think my confusion was that I was thinking about a
        | fully mutable context shared across threads based on the
        | code example in the post.
        | 
        | But if it's just for the ref counting part of the Arc
        | then I can see how a GC would solve it by not needing the
        | RC
 
        | Const-me wrote:
        | > I was thinking about a fully mutable context shared
        | across threads
        | 
        | A quote from the article: "No locks, no mutexes, no
        | syscalls, no shared mutable data here. There are some
        | read-only structures context and unit shared behind an
        | Arc, but read-only sharing shouldn't be a problem." As
        | you see, the data shared across threads was immutable.
        | 
        | However, the library they have picked was designed around
        | Rust's ref-counting Arc<> smart pointers. Apparently for
        | some other use cases, not needed by the OP, that library
        | needs to modify these objects.
        | 
        | > I can see how a GC would solve it by not needing the RC
        | 
        | Interestingly enough, C++ would also solve that. The
        | language does not stop programmers from changing things
        | from multiple threads concurrently. For this reason, very
        | few libraries have their public APIs designed around
        | std::shared_ptr<> (C++ equivalent of Rust's Arc<>).
        | Instead, what usually happens, library authors write in
        | the documentation things like "the object you pass to
        | this API must be thread safe" and "it's your
        | responsibility to make sure the pointer you pass stays
        | alive for as long as you using the API", and call it a
        | day.
 
        | amelius wrote:
        | GoLang. It was designed for highly concurrent loads. For
        | an overview of history and technology involved, read
        | e.g.:
        | 
        | https://blog.twitch.tv/en/2016/07/05/gos-march-to-low-
        | latenc...
 
        | dagmx wrote:
        | Edit: another comment explains that you're likely talking
        | about just the ref counting aspect rather than the entire
        | context sharing used by the rune code shown, in which
        | case, yes I see why a concurrent GC would avoid the issue
        | in that scenario.
        | 
        | ----------
        | 
        | I'm familiar with Go's GC. Your linked post doesn't
        | explain how it would avoid the hit mentioned by the cache
        | invalidation across multiple clusters.
        | 
        | It'll either try and put multiple go routines on a single
        | cluster (as listed in the link) or it'll need to copy the
        | necessary stack per thread. Which is effectively what the
        | original article ends up doing.
        | 
        | But if you encounter anything that needs to run
        | concurrently across threads while using a single r/w
        | object, you'll hit the same cliff surely?
 
        | YogurtFiend wrote:
        | While this isn't true with Go's GC, a GC that's stop-the-
        | world can avoid these cache coherency issues altogether.
        | If you pause every thread before marking and sweeping,
        | then you won't run into problems--only the GC is running.
        | While this may sound silly, stopping the world will
        | oftentimes lead to higher throughput than concurrent
        | collectors, at the cost of higher latency (which is why
        | it's not suitable for Go, which is often used for
        | building servers where latency is a priority).
 
        | jhgb wrote:
        | For one, I'd be surprised if Azul's C4 (Zing) didn't have
        | these issues already solved (assuming anyone has solved
        | that, _they_ probably did).
 
    | YogurtFiend wrote:
    | I don't know why you're being downvoted. Memory reclamation
    | is hard in the face of parallelism. It's much easier to
    | design lock-free algorithms in languages with a garbage
    | collector, because memory reclamation issues are taken care
    | of for you (and often times in a batched manner, which won't
    | affect the hot path of your code). There _are_ techniques to
    | avoid using a garbage collector: atomic reference counting,
    | epoch-based reclamation, hazard pointers, and more. But they
    | can be quite difficult to implement and difficult to use.
 
      | quotemstr wrote:
      | Right. Plus GCs can compact memory, which we can't do with
      | manual memory management. I don't understand why people
      | fetishize manual memory management --- do you want to spend
      | your short time on Earth twiddling references?
 
        | titzer wrote:
        | After 28 years of programming in C and 25 years
        | programming in C++ (on and off), I am sick to death of
        | managing memory. Programmers can't do it right or well.
        | Let the super-intelligent idiot savant with billions of
        | cycles of computation per second and trillions of bits of
        | storage do the thinking for me.
        | 
        | Personally I'm glad the world is moving away from unsafe
        | manual memory management in C++, be that with better RAII
        | and managed pointers, Rust's type system, etc. But those
        | things are still breakable and IMHO, ultimately a big
        | waste of people's time. If Rust's lifetime annotations
        | could be inferred by a compiler, then by all means. But
        | if they can't, just GC already. All that extra thinking
        | pushes people to think about the wrong things and designs
        | end up ossified, difficult to refactor because of the
        | extra overhead of radically changing ownership. Forget it
        | already! Let computers figure out the whole memory
        | management problem.
        | 
        | Want your program to go fast? Don't allocate a bunch of
        | garbage. Redesign it to reuse data structures carefully.
        | Don't burden yourself forever twiddling with managing the
        | large and small alike, obsessing over every dang byte.
 
      | mcguire wrote:
      | Rust's Arc _is_ an atomic reference counter. But, as the
      | article describes, those aren't free.
 
      | duped wrote:
      | One could argue that using a GC for reclamation isn't lock-
      | free if it's not a concurrent GC.
 
        | hayley-patton wrote:
        | There are very few GCs that are actually lock-free. While
        | uncooperative mutators cannot block other mutators in an
        | on-the-fly collector, for example, the collector is
        | blocked until all mutators cooperate. Nonetheless I'd
        | guess that a sufficiently concurrent collector is good
        | enough.
 
___________________________________________________________________
(page generated 2021-12-31 23:00 UTC)