[HN Gopher] A single line of code made a 24-core server slower t...
___________________________________________________________________
 
A single line of code made a 24-core server slower than a laptop
 
Author : Ygg2
Score  : 456 points
Date   : 2021-12-31 13:40 UTC (9 hours ago)
 
web link (pkolaczk.github.io)
w3m dump (pkolaczk.github.io)
 
| jv22222 wrote:
| Yes! Finally, HN removed the rule to strip the word "How" from
| the title.
 
  | hinkley wrote:
  | A few hours later...
 
| ncmncm wrote:
| The notion of "lock-free" is fundamentally misleading.
| 
| It trips up everybody who hopes "lock-free" will be a magic
| bullet freeing them from resource contention bottlenecks.
| 
| When you have explicit locks, aka mutexes (condition variables,
| semaphores, what-have-you), the interaction between your threads
| is visible on the surface. Replacing that with "lock-free"
| interaction, you have essentially the same set of operations as
| when taking a mutex. On the up side, overhead cost may be lower.
| On the down side, mistakes show up as subtly wrong results
| instead of deadlocks. And, you get less control, so when you have
| a problem, it is at best harder to see why, and harder to fix.
| 
| Because the lock-free operations involve similar hardware bus
| interactions under the covers, they cannot fix contention
| problems. You have no choice but to fix contention problems at a
| higher, architectural design level, by reducing actual
| contention. Having solved the problems there, the extra cost of
| explicit mutex operations often does not matter, and the extra
| control you get may be worth any extra cost.
| 
| What is lock-free good for, then? Lock-free components reduce
| overhead when there is no contention. Given any actual
| contention, performance is abandoned in favor of correctness. So,
| if you have mutexes and performance is good, moving to lock-free
| operations _might_ make performance a little better. If
| performance is bad, mutexes and lock-free operations will be
| about equally bad.
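| 
| A minimal Rust sketch of that equivalence (not the article's code):
| both counters below do the same logical work, and under contention
| both serialize on exclusive ownership of a cache line; the atomic
| just hides the hand-off inside the instruction instead of behind a
| visible lock.
| 
|       use std::sync::atomic::{AtomicU64, Ordering};
|       use std::sync::{Arc, Mutex};
|       use std::thread;
| 
|       fn main() {
|           // "Lock-free" counter: no mutex, but every fetch_add still
|           // needs exclusive ownership of the counter's cache line.
|           let atomic = Arc::new(AtomicU64::new(0));
|           // Mutex-guarded counter: the same serialization, made explicit.
|           let locked = Arc::new(Mutex::new(0u64));
| 
|           let mut handles = Vec::new();
|           for _ in 0..8 {
|               let a = Arc::clone(&atomic);
|               let m = Arc::clone(&locked);
|               handles.push(thread::spawn(move || {
|                   for _ in 0..1_000_000 {
|                       a.fetch_add(1, Ordering::Relaxed);
|                       *m.lock().unwrap() += 1;
|                   }
|               }));
|           }
|           for h in handles {
|               h.join().unwrap();
|           }
|           println!("atomic = {}", atomic.load(Ordering::Relaxed));
|           println!("locked = {}", *locked.lock().unwrap());
|       }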
 
  | Kranar wrote:
  | I don't see anything more misleading about lock-free than about
  | anything else involving parallelism. Lock freedom has to do
  | with guarantees about scheduling: you can be lock-free and
  | still have contention, but a lock-free algorithm is guaranteed
  | to make progress within a bounded amount of time. A blocking
  | algorithm is not guaranteed to make any kind of progress within
  | a bounded amount of time.
  | 
  | If you're working on a real-time system, and I don't mean the
  | colloquial meaning of that term, but rather a system that must
  | guarantee that a side effect is produced no later than a given
  | deadline, then you must use a lock-free algorithm; there
  | is no alternative. Avionics, DSP, high-frequency trading, these
  | are all domains where lock-free algorithms are necessary.
  | Making a fast lock-free algorithm is great, and generally the
  | goal of any kind of parallelism is performance, but "fast" is
  | an ambiguous term: it can mean low latency or high bandwidth,
  | and it's very hard to write an algorithm that is both.
  | 
  | Lock-free algorithms are great when you need low latency and
  | guaranteed progress. If you don't need that, then it's
  | possible to trade those properties for much higher bandwidth
  | and use blocking algorithms.
 
    | ncmncm wrote:
    | If you are counting on lock-free operations to guarantee you
    | will meet your real-time scheduling requirements, you have
    | chosen to fail, right out of the gate. Guaranteeing that
    | deadlines are met takes engineering. Using lock-free
    | operations is not a substitute for engineering. There is no
    | such substitute.
    | 
    | Also: All this has nothing to do with parallelism.
    | Parallelism is about things happening at the same time.
    | Locking or not locking, and contention, are about
    | concurrency, which is about synchronization and
    | serialization, a different topic.
 
      | Kranar wrote:
      | If you are using a mutex to guarantee real-time scheduling
      | requirements, you have failed.
      | 
      | Furthermore if you think a discussion about benchmarking an
      | algorithm on a 24-core SMP system involving atomic reads,
      | writes and shared caches has nothing to do with parallelism
      | then let's just agree to disagree with one another and I
      | wish you all the best in your programming endeavors.
      | 
      | People can decide for themselves what credibility they wish
      | to lend to your argument given your perspective on this
      | matter.
 
  | duped wrote:
  | Unfortunately I think all this comment does is muddy the waters
  | even more. All "lock free" means is that one process(+) may not
  | cause any other process to block during execution.
  | 
  | "Contention" is not very meaningful to a lock-free algorithm,
  | since lock-free algorithms typically do not permit shared data
  | access semantics. Rather they describe how data is _sent
  | between_ processes, which is why a lock-free solution looks
  | very different from something behind mutexes.
  | 
  | But to the heart of your point, I do wish people spent more
  | time understanding that lock-free vs lock-ful is a distinction
  | in semantics and doesn't imply anything about performance.
  | There are subtle trade offs.
  | 
  | (+) I'm using "process" with a lot of hand waving, assume it
  | means whatever your atom of parallelism is
 
    | ncmncm wrote:
    | You are confusing two fundamentally different topics.
    | Anywhere there is no possibility of contention for access to
    | a resource, "lock-free" is trivially meaningless.
 
| twoodfin wrote:
| In situations like this, it's valuable to know about all the
| event types perf can sample beyond CPU cycles, including cache
| misses at various levels and branch mis-predictions. These can go
| right into flamegraphs as well.
 
| Thiez wrote:
| Is this the kind of situation where assigning a core affinity to
| the various threads would have made a difference (to make sure
| they all run on the same physical processor), or is the mere
| presence of another processor enough to trigger the behavior
| shown in the article?
 
  | loeg wrote:
  | No, that would not make a significant difference here. The
  | problem is that the same line has to bounce around all of the
  | threads that share it; pinning doesn't reduce that very much on
  | this time-scale. Thread migrations will be very infrequent
  | relative to atomic operations, especially when the number of
  | runnable threads is lower than the number of physical cores.
 
  | dagmx wrote:
  | Core affinity would only help if he didn't have more threads
  | than clusters. Otherwise he'd either hit the wall his laptop
  | did, or cross clusters and hit the cache issues.
 
  | [deleted]
 
  | hermitdev wrote:
  | I don't think core affinity would have helped in this
  | situation. They had one instance that was being shared across
  | all cores, necessitating that the reference count be sync'd
  | between sockets at the L3 cache level. They may have seen some
  | benefit if they had, say, 2 instances, one per CPU socket, and
  | dedicated an instance per socket to avoid L3 cache
  | interactions. But, in this case, it is far simpler, cleaner and
  | better performing to not share the instance and avoid the
  | reference counting entirely.
 
| [deleted]
 
| w_t_payne wrote:
| Shared-nothing dataflow models are an interesting way to engineer
| (some types of) parallelism.
 
| bob1029 wrote:
| > The fact that our reference counters are atomic is an
| additional problem that makes things even more complex for the
| processor. Although using atomic instructions is often referred
| to as "lockless programming", this is slightly misleading - in
| fact, atomic operations require some locking to happen at the
| hardware level. This locking is very fine-grained and cheap as
| long as there is no congestion, but as usual with locking, you
| may expect very poor performance if many things try to fight for
| the same lock at the same time. And it is of course much worse if
| those "things" are whole CPUs and not just single cores that are
| close to each other.
| 
| Contention is the devil and you can _never_ hide from it if you
| bring it into your life. The fastest "contention avoidance hack"
| would be the CAS operation mentioned above, which still performs
| approximately 2 orders of magnitude slower than a single-threaded
| operation would in the most adverse scenarios.
| 
| If you aren't sure of which problems actually "fit" on a modern
| x86 thread these days, rest assured that many developers in
| fintech have been able to squeeze entire financial exchanges onto
| one:
| 
| https://lmax-exchange.github.io/disruptor/disruptor.html
| 
| For most scenarios, you really don't need all those damn cores.
| They are almost certainly getting in the way of you solving your
| problems.
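| 
| For reference, a CAS-based increment in Rust looks roughly like
| this (a generic sketch, not the article's code); every failed
| compare_exchange_weak means another core won the cache line
| first, and the retry has to drag it back:
| 
|       use std::sync::atomic::{AtomicU64, Ordering};
| 
|       fn cas_increment(counter: &AtomicU64) -> u64 {
|           let mut current = counter.load(Ordering::Relaxed);
|           loop {
|               match counter.compare_exchange_weak(
|                   current,
|                   current + 1,
|                   Ordering::AcqRel,
|                   Ordering::Relaxed,
|               ) {
|                   Ok(prev) => return prev + 1,
|                   // Lost the race: reload the fresh value and retry.
|                   Err(actual) => current = actual,
|               }
|           }
|       }
| 
|       fn main() {
|           let c = AtomicU64::new(0);
|           assert_eq!(cas_increment(&c), 1);
|       }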
 
  | hinkley wrote:
  | I believe Rust is pulling on the right thread here but I have
  | no idea how this will look in fifteen years. At some point we
  | will need to be able to tell that two threads don't contend and
  | thus can be "farther away" from each other, or that two threads
  | are constantly communicating and need to use the same core
  | complex and/or memory banks.
 
    | dagmx wrote:
    | I think it could be something built upon lifetime analysis. I
    | think a sufficiently complex compiler could use the data
    | available already today to infer thread affinity and other
    | rules.
 
      | hinkley wrote:
      | Basically what I was thinking. In the near term it doesn't
      | even have to be that precise to have a benefit (Rust builds
      | on escape analysis that helped with distributed garbage
      | collection, even when it wasn't that exhaustive). Even
      | ranking threads by suspected degree of interaction is a lot
      | of data.
      | 
      | These definitely don't talk. Offload. These definitely
      | talk. Colocate. These probably talk. Bin pack.
 
  | cjfd wrote:
  | "For most scenarios, you really don't need all those damn
  | cores. They are almost certainly getting in the way of you
  | solving your problems." This rings true. I still think the most
  | promising multi-threading strategy is to keep everything on one
  | core by default and, if you find some expensive and more or
  | less independent workload, to put that in another thread.
  | This is the old-fashioned way, though. All of the cool kids
  | push for thread pools where every workload jumps from core
  | to core and back again.... This can potentially require a lot
  | of locking, though, which also is not good for performance.
 
    | zmmmmm wrote:
    | It's an interesting reflection then on the Actor paradigm
    | where you are encouraged to create large numbers of fine-
    | grained tasks and toss them into a giant thread pool blender
    | because lightweight threads are "free" now. It would be
    | quite interesting to see the consequences of thread affinity
    | loss in that setting.
 
  | cecilpl2 wrote:
  | Games are a prime example of systems which must take advantage
  | of all possible hardware resources.
 
    | bob1029 wrote:
    | The fact that we believe this is probably part of the problem
    | with game development today.
 
    | dagmx wrote:
    | Games similarly struggle with multi core/ccx designed chips.
    | 
    | Battlefield 4 was famously terrible on Ryzen when AMD first
    | released it.
 
      | littlecranky67 wrote:
      | To be fair, Ryzen mostly underperformed when it was
      | released, which was not AMD's fault. When engineers spend
      | years using Intel as their benchmarking and testing
      | machines and tweaking all sorts of parameters measured
      | against Intel only, you inevitably end up with an optimized
      | implementation that mirrors the hardware details of that
      | chip in some way. Any new architecture would perform
      | badly, unless it is a perfect clone of the original
      | testing hardware.
 
        | dagmx wrote:
        | I wasn't implying it was AMD's fault, but the idea that
        | the games were optimized for Intel doesn't hold true IMHO
        | because they were also built for consoles that were AMD.
        | 
        | The majority of the performance loss was down to not
        | accounting for CCX scheduling.
 
  | PaulDavisThe1st wrote:
  | Another interesting example is audio synthesis. It is
  | interesting because there are at least two models for audio
  | synthesis: single sample processing (as used in, for example,
  | VCV Rack) and block structured processing (as used by every
  | plugin API and thus every DAW-that-runs-plugins). But it gets
  | even more interesting because the degree of potential
  | parallelism is entirely controlled by the signal routing
  | pathways chosen by the user. If the software is processing N
  | independent signal pathways that do not meet until a final
  | output stage, you have a high (well, "N") level of potential
  | parallelism, and if N can expand to a high value then you will
  | always be able to benefit from more cores.
  | 
  | However, if the signal pathway is more convoluted (no pun
  | intended for my audio nerd crew), then not only does the degree
  | of parallelism decrease but the issue of single-sample vs.
  | block structured can become much more important. In the single
  | sample case, the cost of computing a single sample has
  | (relatively) enormous fixed overhead if the code has any
  | serious level of modularity, but is still very cheap compared
  | to inter-thread and inter-core synchronization. Spreading the
  | computation across cores will frequently, but not always, slow
  | things down quite dramatically. By contrast, block structured
  | processing has a much lower per-sample cost, because the
  | function call tree is invoked only once for an entire block of
  | samples, and so the relative cost of synchronization decreases.
  | 
  | This sounds like a slam dunk for block structured processing.
  | 
  | Problem is, there are things you cannot do correctly with block
  | structured processing, and so there is always a place for
  | single sample processing. However, the potential for
  | parallelization and the level at which it should take place
  | differs significantly between the two processing models, which
  | opens the door to some highly questionable design.
  | 
  | The short version of this is that audio synthesis in the
  | abstract can always expand to use any number of cores,
  | particularly for cpu-intensive synthesis like physical
  | modelling, but that in the presence of single-sample
  | processing, the cost of synchronization may completely negate
  | the benefits of parallelism.
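  | 
  | A toy Rust sketch of the two models (made-up names, not any real
  | plugin API): the per-sample path pays the call overhead on every
  | sample, while the block path pays it once per buffer, which is
  | what shifts the relative cost of any cross-thread
  | synchronization.
  | 
  |       // Per-sample vs. block-structured processing, shown with a
  |       // trivial gain node.
  |       trait Node {
  |           fn process_sample(&self, x: f32) -> f32;
  | 
  |           // Block path: one call into the node, amortized over the
  |           // whole buffer.
  |           fn process_block(&self, buf: &mut [f32]) {
  |               for x in buf.iter_mut() {
  |                   *x = self.process_sample(*x);
  |               }
  |           }
  |       }
  | 
  |       struct Gain(f32);
  | 
  |       impl Node for Gain {
  |           fn process_sample(&self, x: f32) -> f32 {
  |               x * self.0
  |           }
  |       }
  | 
  |       fn main() {
  |           let gain = Gain(0.5);
  |           let mut buf = vec![1.0f32; 64];
  | 
  |           // Single-sample model: 64 separate trips through the call tree.
  |           for x in buf.iter_mut() {
  |               *x = gain.process_sample(*x);
  |           }
  | 
  |           // Block model: one trip for the entire buffer.
  |           gain.process_block(&mut buf);
  |       }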
 
    | spacechild1 wrote:
    | Great write up!
    | 
    | What I like about Pd is that you can freely reblock and
    | resample any subpatch. Want some section with single-sample-
    | feedback? Just put a [block~ 1]. You can also increase the
    | blocksize. Usually, this is done for upsampling and FFT
    | processing. Finally, reblocking can be nested, meaning that
    | you can reblock to 1024 samples and inside have another
    | subpatch running at 1 sample blocksize.
    | 
    | SuperCollider, on the other hand, has a fixed global
    | blocksize and samplerate, which I think is one of its biggest
    | limitations. (Needless to say, there are many things it does
    | better than Pd!)
    | 
    | ---
    | 
    | In the last few days I have been experimenting with adding
    | multi-threading support to Pd
    | (https://github.com/Spacechild1/pure-data/tree/multi-
    | threadin...). With the usual blocksize of 64 samples, you can
    | definitely observe the scheduling overhead in the CPU meter.
    | If you have a few (heavy-weight) subpatches running in
    | parallel, the overhead is negligible. But for [clone] with a
    | high number of (light-weight) copies, the overhead becomes
    | rather noticeable. In my quick tests, reblocking to 256
    | samples already reduces the overhead significantly, at the
    | cost of increased latency, of course.
    | 
    | ---
    | 
    | Also, in my plugin host for Pd/Supercollider
    | (https://git.iem.at/pd/vstplugin/) I have a multi-threading
    | and bridging/sandboxing option. If the plugin itself is
    | rather lightweight and the blocksize is small, the scheduling
    | overhead becomes quite noticeable. In Pd you can just put
    | [vstplugin~] in a subpatch + [block~]. For the SuperCollider
    | version I have added a "reblock" argument to process the
    | plugin at a higher blocksize, at the cost of increased
    | latency.
 
    | michaelrmmiller wrote:
    | To add onto this very good explanation from Paul, in
    | practice, audio processing graphs always rapidly decay into
    | serial processing after starting out massively parallel. Most
    | of the time, we are writing to a single output (i.e. speakers
    | via your audio interface or a file). On top of that, some of
    | the heaviest weight DSP happens on this final output stage
    | (mastering chains with multiband compressors, linear phase
    | equalizers, limiters etc.) So every 5.3ms (256 sample buffer
    | @ 48kHz sample rate) we start massively parallel processing
    | all the leaf nodes (audio tracks with sound files, virtual
    | instruments and synths) and end up bottlenecking as the tree
    | collapses into a line. Then we are stuck doing some of the
    | most CPU intensive work on a single core since we can't
    | process the limiter DSP plug-in until the EQ finishes its
    | work, for example.
    | 
    | We need to meet our real-time deadline or risk dropping
    | buffers and making nasty pops and clicks. That mastering
    | stage can pretty easily be the limiting (hah) step that
    | causes us to miss the deadline, even if we processed hundreds
    | of tracks in parallel moments before in less time.
    | 
    | The plug-in APIs (AudioUnits, VSTs, AAX) which are
    | responsible for all the DSP and virtual instruments are also
    | designed to process synchronously. Some plug-ins implement
    | their own threading under the hood but this can often get in
    | the way of the host application's real-time processing. On
    | top of that, because the API isn't designed to be
    | asynchronous, the host's processing thread is tied up waiting
    | for the completed result from the plug-in before it can move
    | on.
    | 
    | Add on that many DSP algorithms are time-dependent. You can't
    | chop up the sample buffer into N different parts and process
    | them independently. The result for sample i+1 depends on
    | processing sample i first.
 
| cletus wrote:
| The mantra I live by is that multi-threaded programming is so
| potentially complex and so easy to screw up (eg erasing all
| your supposed performance gains, as was the case in this article,
| or introducing subtle bugs or (worse) security flaws) that you
| should do absolutely everything you can to avoid it.
| 
| IME C++ programmers seem to put an unreasonable amount of faith
| in RAII to "solve" these problems, eg:
|       std::mutex m;
|       ...
|       {
|           std::lock_guard lg(m);
|           // do stuff with a mutex lock
|       } // lock_guard destructor releases mutex
| 
| If you're just updating a (private) map or something, this is
| fine. But as soon as this gets complex (eg by calling a function
| and you lose control over what thread is executing because some
| deep code creates a new thread) then this all turns to crap.
| 
| So whenever anyone tells you "just start a thread" without any
| hesitation, it's reasonably safe to assume they have no idea what
| they're talking about.
| 
| This is one reason I love the async code model used in a number
| of languages. I have the most experience with Hack [1] but it's
| not unique to that. What I tend to think of as "product" code
| (which is a real Facebook-ism) has no need to ever create a
| thread or deal with any concurrency issues.
| 
| Now if you're programming an HTTP server or an RDBMS or other
| infrastructure then sure, you have a valid need. But if you're
| just serving HTTP requests, just avoid it entirely if you can
| because any performance gain is probably imaginary but the
| complexity increase isn't.
| 
| [1]: https://docs.hhvm.com/hack/asynchronous-
| operations/introduct...
 
  | otabdeveloper4 wrote:
  | Don't call functions (that aren't part of the C++ standard
  | library) inside the lock guard. It's not a hard rule, but
  | following it will get you 99% of the way there.
 
  | Thiez wrote:
  | Isn't that mantra a bit too pessimistic? After fixing the
  | performance issue, a speedup that was almost linear in the
  | number of cores was achieved using multithreading.
  | 
  | In addition, async code can bring many of the same challenges
  | as multithreading. For example, in C# tasks may run on the
  | current thread, but they may also run on a thread-pool, and
  | this is mostly transparent to the programmer. When you don't
  | immediately await all asynchronous operations your tasks may
  | end up running at the same time.
 
    | cletus wrote:
    | Like anything you weigh up the upside against the downside.
    | 
    | The downside is more complex code, code that's harder to
    | reason about, the possibility of concurrency bugs, slower
    | development times, the possibility of security flaws and you
    | may just be shifting your performance bottleneck (eg creating
    | a hot mutex).
    | 
    | The upside is lower latency, more throughput and higher QPS.
    | 
    | My argument is quite simply YAGNI. A lot of these benefits
    | are for many people purely imaginary. They're the very
    | definition of premature optimization. Code that is less
    | complex, faster to develop in and likely "good enough" is
    | probably going to help you way more than that.
    | 
    | If you get to the scale where heavy multithreading actually
    | matters, that's a good problem to have. My argument is that
    | many people aren't there and I would hazard a guess that a
    | contributing factor to not getting there is worrying about
    | this level of performance before they need to.
 
  | thewarrior wrote:
  | This model is good for I/O heavy code that is typical for most
  | web services. However if your load is a mix of I/O and compute
  | something like Go should do a lot better.
 
    | cletus wrote:
    | Not sure why you're being downvoted because you raise a valid
    | point: Go has a concurrency model aimed at avoiding many of
    | these problems. This is of course the channel abstraction.
    | 
    | Go has other issues. My first piece of advice to any new Go
    | programmer is make every single one of your channels
    | unbuffered. This way it operates a lot like the async model I
    | mentioned and it tends to be easy to reason about.
    | 
    | But people fall into a trap of thinking adding a buffer will
    | improve concurrency and throughput. This might be true but
    | most of the time it's a premature optimization. Buffered
    | channels also often lead to obscure bugs and a system that's
    | simply harder to reason about.
    | 
    | It's really better to think of Go as a tool for organizing
    | concurrent code, not necessarily a direct solution to the
    | problems involved. But organization matters.
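    | 
    | Not Go, but the same discipline translates: Rust's standard
    | library has a rendezvous channel that behaves roughly like an
    | unbuffered Go channel. A minimal sketch:
    | 
    |       use std::sync::mpsc::sync_channel;
    |       use std::thread;
    | 
    |       fn main() {
    |           // A bound of 0 gives a rendezvous channel: every send
    |           // blocks until the receiver is ready to take the value.
    |           let (tx, rx) = sync_channel::<u64>(0);
    | 
    |           let producer = thread::spawn(move || {
    |               for i in 0..5 {
    |                   tx.send(i).unwrap(); // blocks until received
    |               }
    |           });
    | 
    |           for v in rx {
    |               println!("got {v}");
    |           }
    |           producer.join().unwrap();
    |       }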
 
      | thewarrior wrote:
      | All concurrency tools are foot guns but Go seems to be the
      | safest foot gun :)
 
  | FpUser wrote:
  | >"But if you're just serving HTTP requests, just avoid it
  | entirely if you can because any performance gain is probably
  | imaginary but the complexity increase isn't"
  | 
  | I serve HTTP requests offering JSON-based RPC in my C++ servers
  | that do all kinds of calculations (often CPU-consuming) in
  | multiple threads. Works just fine and scales well. Yes, I have
  | to be careful about how I work with shared data, but it is
  | centralized, encapsulated, and done in a single place I call
  | the request router / controller. I reuse it among projects.
  | Nothing too exciting about it and no reason to be scared to
  | death ;)
 
| amichal wrote:
| Would this tool have scaled as well as 24 separate OS
| processes?
| 
| I remember hitting this same issue many years ago. Some primitive
| in the library (C++) didn't, on the surface, appear to involve
| shared state and cache write locking, but of course it did if you
| thought about it. Since I switched over to more application-level
| coding I don't work at that low level very much and tend to scale
| with processes now (typically not needing that much parallelism
| anyway), so I don't often have to think these things through
| anymore.
 
  | 10000truths wrote:
  | The problem outlined in the article is not with process vs
  | thread, the problem is with multiple parallel execution
  | contexts reading and writing the same cache line. This issue
  | would have manifested in the same way if your separate
  | processes touched the same offset in a shared memory mapping.
  | 
  | Of course, designing with multiple processes in mind helps to
  | design a shared-nothing architecture, as the costs of shared
  | state between processes are much more explicit. But _any_ kind
  | of state that needs to be shared will run into the same
  | bottleneck, because all forms of shared IPC fundamentally have
  | to use some form of locking somewhere in the stack to serialize
  | incoming inputs.
 
    | lmeyerov wrote:
    | I think the intuition on processes vs threads was right:
    | 
    | 1. The Arc use is a case (I believe! Am not a Rust'er!) of
    | true sharing of references. The cache line will be contended
    | across cores. Using processes would have eliminated true
    | sharing. Tracking logical core affinity for references via a
    | modal type system could have flagged this at compile time,
    | but I suspect that's not a thing in Rust land.
    | 
    | 2. Even if it was false sharing at the cache-line level,
    | process abstractions in shared-memory systems are generally
    | aligned to avoid false sharing at cache-line granularity (and
    | misaligned ops at smaller granularities). Of course, in the
    | extreme, a 10X hero programmer may have tried to do amazing
    | levels of bit packing and syscall hackery and thus break all
    | those niceties, but that's the cleanup everyone else on the
    | team has to do to make the hero a 10Xer.
 
  | loeg wrote:
  | Probably yes, unless the author went to great lengths to
  | contend on shared memory segments.
 
| wabain wrote:
| This is an excellent writeup, but I'm curious about how dependent
| the microbenchmark is on the scripting engine doing no real work.
| My off-the-cuff guess is that the unshared version might still
| perform better with other workloads (it would have higher
| throughput and avoid variance caused by fluctuating degrees of
| cache contention) but it might be that the observable overhead
| from hammering a single cache line would disappear if each thread
| had more varied work to do. And of course if the scripting engine
| benefits substantially from caching internal state across
| invocations (things like incrementally optimized JIT code) then
| having 24 instances can incur substantial extra work.
 
| dathinab wrote:
| Does anyone have any idea where the `` format strings come from??
| 
| As far as I know rust doesn't have them.
| 
| And as far as I know, if Rust gets format strings it will most
| likely get them in the form of an `f` prefix (like raw strings
| have an `r` prefix and binary strings have a `b` prefix).
| 
| Is it just a thing to make the code more readable/shorter on the
| web?
 
  | metabagel wrote:
  | Markdown uses backticks around code.
  | 
  | https://docs.github.com/en/github/writing-on-github/getting-...
 
  | cdirkx wrote:
  | The code shown in the article isn't Rust, but Rune [0], which
  | has similar syntax yet is a dynamic scripting language. It has
  | template literals [1].
  | 
  | [0] https://rune-rs.github.io/
  | 
  | [1] https://rune-rs.github.io/book/template_literals.html
 
    | dathinab wrote:
    | Oh, that makes sense.
 
  | michalhosna wrote:
  | If you are talking about "Here is an example of a complete
  | workload that measures performance of selecting rows by random
  | keys", then you are looking at Rune code rather than Rust code.
  | 
  | See https://rune-rs.github.io/book/template_literals.html
 
| johnisgood wrote:
| Can anyone TL;DR it? A single line, in Rust, can make a 24-core
| server slower than a laptop? What line is that?
 
  | jkelleyrtp wrote:
  | In a multithreaded context, data was shared across threads via
  | atomic reference counting. The atomic counter acted as a single
  | bottleneck for every operation. The solution was to derive deep
  | cloning for the shared data (a one-line fix) to reduce the
  | atomic contention.
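  | 
  | Schematically (made-up types, not the article's actual code),
  | the shape of the fix was to go from sharing one Arc across all
  | worker threads to handing each thread its own deep copy:
  | 
  |       use std::sync::Arc;
  |       use std::thread;
  | 
  |       // Hypothetical stand-in for the read-only dataset that was
  |       // shared behind an Arc in the article.
  |       #[derive(Clone)]
  |       struct Dataset {
  |           rows: Vec<u64>,
  |       }
  | 
  |       fn main() {
  |           let shared = Arc::new(Dataset { rows: (0..1_000).collect() });
  | 
  |           let handles: Vec<_> = (0..24)
  |               .map(|_| {
  |                   // Before: Arc::clone(&shared) per thread/call -- cheap,
  |                   // but every clone and drop bumps one shared atomic
  |                   // refcount, and that cache line ping-pongs between
  |                   // cores (and sockets).
  |                   // After: deep-clone once per thread, so each thread
  |                   // owns an independent copy and no shared counter is
  |                   // touched afterwards.
  |                   let local: Dataset = (*shared).clone();
  |                   thread::spawn(move || local.rows.iter().sum::<u64>())
  |               })
  |               .collect();
  | 
  |           let total: u64 = handles
  |               .into_iter()
  |               .map(|h| h.join().unwrap())
  |               .sum();
  |           println!("{total}");
  |       }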
 
| PhileinSophia wrote:
 
| vlovich123 wrote:
| Is there any tooling that can demonstrate cache coherency
| hotspots like cache line bouncing or inter-core communication?
 
  | mhh__ wrote:
  | All of the "serious" profilers support this. Intel have a bunch
  | of tooling for diagnosing threading issues, AMD also have some
  | less polished but still functional metrics you can use.
 
    | vlovich123 wrote:
    | Do they track it down to the actual problem point in the
      | code? Or do you just see that maybe there's a problem and
      | then experiment with changes to figure out what needs to
    | be fixed?
 
      | mhh__ wrote:
      | Short answer: yes. Much better than perf does.
 
  | maxwell86 wrote:
  | Yes. Linux perf has a c2c "mode"/"option" that tracks HITM (hit
  | modified) events and will report precisely this issue.
 
    | loeg wrote:
    | What is c2c? Thanks.
 
      | tkhattra wrote:
      | c2c stands for cache-2-cache.
      | 
      | perf-c2c allows you to measure cacheline contention - see
      | https://man7.org/linux/man-pages/man1/perf-c2c.1.html and
      | https://joemario.github.io/blog/2016/09/01/c2c-blog/.
 
      | eptcyka wrote:
      | Cache to cents? Kernels 2 cents on how your code is
      | utilizing the cache? I have 0 clue too.
 
        | masklinn wrote:
        | Cache to Cache.
        | 
        | It tracks a bunch of cache-related counters, and HITM
        | (hit modified) tracks loads of data modified by either
        | another core of the same node ("local hitm") or by another
        | node entirely ("remote hitm").
        | 
        | Here, the contention would have shown up as extreme
        | amounts of both, as each core would almost certainly try
        | atomically updating a value which had just been updated
        | by another core, with higher-than-even chances the value
        | had been updated by a core of the other socket (= node).
 
  | gjulianm wrote:
  | I've used the Intel VTune Profiler; I think it's free. It's a
  | pretty good tool for profiling code and gives you quite a lot
  | of information on how you're using the core pipelines, but you
  | need an Intel processor to use it. I think it would have
  | quickly flagged this kind of problem and told you that the
  | processor was stalled waiting on the cache.
 
    | mhh__ wrote:
    | VTune is free now. It's by far the best profiler I've
    | ever come across; AMD's equivalent is a bit rough and ready
    | by comparison.
 
| game_the0ry wrote:
| Relevant video - Scott Meyers: Cpu Caches and Why You Care
| 
| https://www.youtube.com/watch?v=WDIkqP4JbkE
| 
| Remember, boys and girls, small mistakes can have big
| consequences.
| 
| Keep your data in immutable data structures when multi-
| threading. The Actor model also helps.
 
| teen wrote:
| That code? Thread.sleep(5000)
 
| errcorrectcode wrote:
| My laptop (i7-8650U) has slightly faster single-core performance
| than my 96-thread dual 7402 EPYC Milan home server. I'm fine with
| that. I'm not fine with undue inefficiencies, i.e., excessive
| cache misses, branch mis-predictions, and lost parallelism due
| to sloppy or unfortunate code.
 
| kentonv wrote:
| An aside:
| 
| > L3 cache is shared by all cores of a CPU.
| 
| I recently learned this is no longer true! AMD EPYC processors
| (and maybe other recent AMDs?) divide cores into groups called
| "core complexes" (CCX), each of which has a separate L3 cache. My
| particular processor is 32-core where each set of 4 cores is a
| CCX. I discovered this when trying to figure out why a benchmark
| was performing wildly differently from one run to the next with a
| bimodal distribution -- it turned out to depend on whether Linux
| had scheduled the two processes in my benchmark to run on the
| same CCX vs. different CCXs.
| 
| https://en.wikipedia.org/wiki/Epyc shows the "core config" of
| each model, which is (number of CCX) x (cores per CCX).
 
  | [deleted]
 
  | Ataraxic wrote:
  | Just as information for some later readers: this is applicable
  | not just to Epyc but also to the consumer version,
  | Threadripper. My understanding is that this is an example of
  | Non-Uniform Memory Access (NUMA), which has long been used to
  | link multiple CPUs in different sockets together, but has now
  | been integrated into a CPU that fits in one socket.
  | 
  | This actually matters if you are running a VM on such a system,
  | since you'll run into things like the actual RAM (not the L3
  | cache) being directly linked to a particular NUMA node. For
  | example, accessing memory in the first RAM stick vs the second
  | will give different latencies as it goes from ccx1 => ccx2 =>
  | stick2 versus ccx1 => stick1. This is applicable to, I think,
  | the 2XXX versions and earlier for Threadripper. My
  | understanding is that they solved this in later versions using
  | the Infinity Fabric (IO die), so now all CCXs go through the IO
  | die.
  | 
  | I ran into all of this trying to run an Ubuntu machine that ran
  | Windows using KVM while passing through my Nvidia graphics
  | card.
 
    | reilly3000 wrote:
    | This sheds some light on why I had so much trouble running
    | GPU passthrough with Unraid (KVM) on my 1950X w/ Nvidia. Those
    | were dark days.
 
      | rob_c wrote:
      | Frankly that was probably 200% unraid ...
 
    | masklinn wrote:
    | > Just as information for some later readers, but this is
    | applicable not just to Epyc but also the consumer version
    | Threadripper.
    | 
    | It's applicable to any Zen design with more than one CCX,
    | which is... any Zen 3 CPU of more than 8 cores (in Zen 2 it
    | was 4).
    | 
    | The wiki has the explanation under the "Core config" entry of
    | the Zen CPUs, but for the 5000s it's all the 59xx (12 and 16
    | cores).
    | 
    | Zen 3 APUs are all single-CCX, though there are Zen 2 parts in
    | the 5000 range which are multi-CCX (because why not confuse
    | people): the 6-core 5500U is a 2x3 and the 8-core 5700U is a
    | 2x4.
    | 
    | The rest is either low-core enough to be single-CCX Zen 2
    | (5300U) or tops out at a single 8-core CCX (everything
    | else).
 
    | dagmx wrote:
    | This applies to Ryzen as well. The first-gen Ryzen was
    | plagued by software and schedulers that couldn't handle the
    | CCX affinity.
 
  | 123pie123 wrote:
  | This is important when sizing and tuning VMs/hypervisors.
 
  | masklinn wrote:
  | > I recently learned this is no longer true! AMD EPYC
  | processors (and maybe other recent AMDs?) divide cores into
  | groups called "core complexes" (CCX), each of which has a
  | separate L3 cache.
  | 
  | It's still kinda true, just that "a CPU" in that context is a
  | CCX. Cross-CCX communication has shown up a fair bit in
  | reviews and benches, and really in all chips at that scale
  | (e.g. Ampere's Altras and Intel's Xeons):
  | https://www.anandtech.com/show/16529/amd-epyc-milan-review/4
  | and one of the "improvements" in Zen 3 is the CCX are much
  | larger (there's one CCX per CCD, rather than 2) so there's less
  | crosstalk.
  | 
  | And it could already be untrue previously, e.g. the Pentium Ds
  | were rushed out by sticking two P4 dies on the same PCB; I
  | think they even had to go through the northbridge to
  | communicate, so they were dual-socket in all but physical
  | conformation (hence being absolute turds). I don't think they
  | had L3 at all though, so that wasn't really a factor, but
  | still...
 
    | tentacleuno wrote:
    | > And it could already be untrue previously e.g. the Pentium
    | D were rushed out by sticking two P4 dies on the same PCB, I
    | think they even had to go through the northbridge to
    | communicate
    | 
    | Part of me hopes you're wrong here. That is absolutely
    | absurd.
 
      | masklinn wrote:
      | Yeah you'd hope so wouldn't you?
      | 
      | But Intel got caught completely unaware by the switch to
      | multi-core, just as it had been by the 64-bit switch.
      | 
      | The eventual Core 2 was not ready yet (Intel even had to
      | bridge to it with the intermediate Core, which really had
      | more to do with the Pentium M than with the Core 2 though
      | it did feature a proper dual-core design... so much so that
      | the Core solo was actually a binned Duo with one core
      | disabled).
      | 
      | So anyway, Intel was caught with its pants around its
      | ankles for the second time, and they couldn't let that
      | happen. And they actually beat AMD to market, having turned
      | out a working dual-core design in the time between AMD's
      | announcement of the dual-core Opteron (and strong hints of
      | the X2) and the actual release, about 8 months.
      | 
      | To manage that, Intel could not rearchitect their chip
      | (and probably didn't want to as it'd become clear Netburst
      | was a dead-end), so they stapled two Prescott cores
      | together, FSB included, and connected both to the
      | northbridge.
      | 
      | It probably took more time to validate that solution for
      | the server market, which is why, while AMD released the
      | dual-core Opterons in April and Athlons in May, it took until
      | October for the first dual-core Xeon to be available.
 
      | danudey wrote:
      | Here's the wikipedia article:
      | 
      | https://en.wikipedia.org/wiki/Pentium_D
      | 
      | "The Pentium D brand refers to two series of desktop dual-
      | core 64-bit x86-64 microprocessors with the NetBurst
      | microarchitecture, which is the dual-core variant of
      | Pentium 4 "Prescott" manufactured by Intel. Each CPU
      | comprised two dies, each containing a single core, residing
      | next to each other on a multi-chip module package."
 
    | IronWolve wrote:
    | Yup, and that's why overclocking on AMD tests each of your
    | CCXs/cores, marking golden cores that can reach 5 GHz+ and
    | slower cores that don't overclock. It's also why AMD tests
    | each CCX and moves worse ones to lower-end products, and
    | better ones to the 5950X.
    | 
    | Then you can OC-tweak each core and try to milk out
    | performance; PBO tends to be pretty good at detecting which
    | cores can take a slight performance boost. With a few BIOS
    | options, all that adds up without much effort.
    | 
    | And then finally get a better (advanced) scheduler in
    | Windows/Linux that can move workloads around per core
    | depending on the workload. Windows released multiple fixes
    | for the scheduler on AMD starting with Win10 1903.
    | 
    | I find scheduler mods and modest BIOS tweaks increase
    | performance without much effort, and the gains are very
    | detectable.
    | 
    | On Linux I use xanmod (and pf-kernel before that).
    | 
    | On Windows I use Process Lasso and AMD PBO.
    | 
    | Slow/fast cores and scheduler tweaks are how Win11, ARM
    | Macs, and Android make things appear faster too.
    | 
    | Amusingly, scheduler/core tweaks have been around on Linux
    | for a decade+, making the desktop super smooth, but they're
    | now mainstream in Win11 and ARM macOS.
 
      | Andys wrote:
      | Does xanmod work as a drop-in replacement in Ubuntu with an
      | NVIDIA card?
 
    | sroussey wrote:
    | There are core complexes in Apple's M1 variations as well.
    | 
    | And it's not just the L3 that can be shared by various chips.
    | 
    | Each complex also has its own memory bandwidth, so running on
    | two cores in two complexes will get around twice the memory
    | bandwidth of two cores in the same complex.
 
  | lend000 wrote:
  | I've always wondered how much this affects shared cloud
  | instances (for example, a single c5.small instance). Can your
  | neighbor, who shares your L3 cache, be using memory so much
  | more aggressively than you that it causes you to evict your
  | L2/L1 cache? Or is cache coherency maintained if the core that
  | evicted your L3 cache is known to be living in a different
  | memory space?
 
    | aspaceman wrote:
    | Coherency will be maintained (since the protocols support
    | that case). But yes, a separate process would evict that
    | memory. From the processors point of view, they're just
    | addresses tagged with data. Caching behavior doesnt depend on
    | virtual memory addresses because I believe that info is
    | stripped away at that point.
    | 
    | So if someone is thrashing cache on the same core you're on,
    | you will notice it if the processes aren't being shared
    | effectively.
    | 
    | The contents of the cache aren't stored as part of a paused
    | process or context switch. But I'd appreciate a correction
    | here if I'm wrong.
    | 
    | For an example, consider two processes A and B running on a
    | set of cores. If A makes many more memory accesses than B, A
    | can effectively starve B of "cached" memory accesses because
    | A accesses memory more frequently.
    | 
    | If B were run alone, then its working set would fit in
    | cache, effectively making the algorithm operate from cache
    | instead of RAM.
    | 
    | BUT. You really have to be hitting the caches hard. Doesn't
    | happen too often in casual applications. I only encountered
    | this on GPUs (where each core has a separate L1 but a shared
    | L2). Even then it's only a problem if every core is hitting
    | different cache lines.
 
      | aspaceman wrote:
      | Found some detail - apparently the CPU manuals are the
      | place to check https://cs.stackexchange.com/a/1090
 
  | pdpi wrote:
  | Out of curiosity -- was the "bad" mode a case of two separate
  | workloads competing for limited cache within one CCX, or was it
  | really bad cache behaviour across CCXs causing problems because
  | the two processes shared memory?
 
    | kentonv wrote:
    | The two processes were a client and a server, the client just
    | sends HTTP requests to the server over a unix socket, but
    | otherwise they don't share resources. TBH I'm surprised there
    | was so much impact just from that. There might be something
    | more to the story -- I haven't investigated much yet. But
    | basically when I used `taskset` to limit the benchmark to two
    | cores in different CCX's, it ran about half the speed as when
    | limited to two cores in the same CCX. I guess another
    | possibility is that the kernel is allowing both processes to
    | swap between cores regularly, which would much better explain
    | the disastrous performance penalty. But, in that case I don't
    | understand the bimodal distribution I see when running the
    | benchmark with no core affinity...
 
  | hinkley wrote:
  | This kind of [stuff] drives me nuts every time I'm trying to
  | read the performance telemetry tea leaves at work.
  | 
  | Do we have requests that take longer, or did Linux do something
  | dumb with thread affinity?
 
    | dahfizz wrote:
    | Pro tip: if you care about performance, and especially if you
    | care about reliable performance measurements, you should pin
    | and accelerate your threads. Using isolated CPUs (isolcpus)
    | is the next step.
    | 
    | In other words: if your application is at the mercy of Linux
    | making bad decisions about what threads run where, that is a
    | performance bug in your app.
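    | 
    | For example, on Linux the pinning part can be done from Rust
    | via the libc crate (a minimal sketch; the libc dependency and
    | the chosen CPU number are assumptions, not anything from the
    | article):
    | 
    |       use std::mem;
    | 
    |       fn pin_current_thread_to(cpu: usize) -> std::io::Result<()> {
    |           unsafe {
    |               let mut set: libc::cpu_set_t = mem::zeroed();
    |               libc::CPU_ZERO(&mut set);
    |               libc::CPU_SET(cpu, &mut set);
    |               // pid 0 means "the calling thread".
    |               let rc = libc::sched_setaffinity(
    |                   0,
    |                   mem::size_of::<libc::cpu_set_t>(),
    |                   &set,
    |               );
    |               if rc != 0 {
    |                   return Err(std::io::Error::last_os_error());
    |               }
    |           }
    |           Ok(())
    |       }
    | 
    |       fn main() {
    |           pin_current_thread_to(2).expect("failed to set affinity");
    |           // ... latency-sensitive work now stays on CPU 2 ...
    |       }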
 
      | spockz wrote:
      | How would you do this when running in Kubernetes? AFAIK it
      | just guarantees that you get scheduled on some CPU for the
      | "right" amount of time, not which CPU that is.
      | 
      | I only know that there is one scheduler that gives you
      | dedicated cores if you set the request and limit both to
      | equal multiples of 1000.
 
        | spiddy wrote:
        | In post-Docker, pre-Kubernetes times I used `--cpuset-cpus`
        | in the Docker args to dedicate specific CPUs to Redis
        | instances, using CoreOS and fleet for cluster
        | orchestration.
 
        | ncmncm wrote:
        | Another reason to put off moving to kubernetes.
 
    | masklinn wrote:
    | > Do we have requests that take longer, or did Linux do
    | something dumb with thread affinity?
    | 
    | Yes.
    | 
    | Cross-socket communications do take longer, but a properly
    | configured NUMA-aware OS should probably have segregated
    | threads of the same process to the same socket, so the
    | performance should have increased linearly from 1 to 12
    | threads, then fallen off a cliff as the cross-socket effect
    | started blowing up performance.
 
  | klysm wrote:
  | Do you use NUMA nodes to tell the kernel about this or
  | something else?
 
    | wmf wrote:
    | You can enable node-per-CCX mode in UEFI to do that.
    | Otherwise you have to use tools like hwloc to discover the
    | cache topology and pin workloads accordingly.
 
      | rob_c wrote:
      | Underrated and useful reply. Thank you!
 
  | buildbot wrote:
  | L3 being fully shared is only common on smaller chips, really;
  | on any large Intel design the L3 isn't all in one place, it's
  | broken up into slices across the cores.
 
    | wmf wrote:
    | That's still a single logical L3 (e.g. a single core could
    | use the entire L3 capacity).
 
| cheschire wrote:
| There are only two hard things in Computer Science: cache
| invalidation and naming things. - Phil Karlton
 
  | airstrike wrote:
  | There are actually only two hard problems in computer science:
  | 
  | 0) Cache invalidation
  | 
  | 1) Naming things
  | 
  | 5) Asynchronous callbacks
  | 
  | 2) Off-by-one errors
  | 
  | 3) Scope creep
  | 
  | 6) Bounds checking
 
    | c0balt wrote:
    | You forgot text encodings, endianess, ...
 
    | hinkley wrote:
    | Is 4 uninitialized variables?
 
    | Scarblac wrote:
    | There is only one hard problem in computer science: _people_.
 
    | [deleted]
 
    | nicoburns wrote:
    | Most languages will do bounds checking for you. I reckon that
    | one's mostly solved at this point.
 
      | chris_st wrote:
      | But... but... but... we HAVE to use C, it's the only
      | language that's fast enough!
      | 
      | /s
 
      | andi999 wrote:
      | Too slow.
 
  | hnaccount_rng wrote:
  | You missed the off-by-one errors
 
    | cheschire wrote:
    | So you mean I was off by one?
 
  | Pet_Ant wrote:
  | Better version:
  | 
  | There are only two hard things in Computer Science: cache
  | invalidation, naming things, and off-by-one errors.
 
    | MR4D wrote:
    | Better better version:
    | 
    | There are only two hard things in Computer Science: cache
    | invalidation, and...what did you ask me again?
 
    | jahnu wrote:
    | There are only two hard things in Computer Science: cache
    | invalidation, naming things, and off-by-oly two hard things
    | in Computer Scivetem dy gjera te veshtira ne Shkencen
    | Kompjute
 
      | hermitdev wrote:
      | Now, just toss in a byte order mark, and I think we've a
      | winner!
 
      | jacquesm wrote:
      | Hehe. Ok. That is very funny :)
 
        | gavinray wrote:
        | I don't get it -- am I slow?
 
        | [deleted]
 
        | jacquesm wrote:
        | Buffer overflow :)
 
        | vaidhy wrote:
        | This used to happen in C for me. Basically, you are
        | trying to print a string which has overflowed the
        | allocated length and you get junk printed on the string.
        | C-Strings are null terminated and the system will keep
        | printing till it encounters a null. So, bounds checking
        | and buffer overflows are also hard problems.
 
      | colejohnson66 wrote:
      | Gotta watch out for those nulls :)
 
        | silon42 wrote:
        | Nulls are easy compared to invalid non-null pointers.
 
        | ludamad wrote:
        | People say null in Java was a mistake, but I've never had
        | any issues in my null-only dialect of Java /s
 
    | kibwen wrote:
    | There are only two hard problems in Computer Science: cache
    | invalidation, naming things, off-by-one errors, and
    | remembering to update the documentation.
 
| nikkinana wrote:
 
| [deleted]
 
| physicsgraph wrote:
| This article is a well-written deep dive on parallel performance
| debugging that ends up identifying the difference as being due to
| how caching is handled on multi-CPU vs single-CPU systems. Thanks
| to the author for taking the time to write up their
| troubleshooting process.
 
| pengaru wrote:
| TL;DR: serializing your parallel work dispatch on an atomic
| global counter is bad for scaling across many cores.
| 
| Or, per-cpu counters exist for a reason.
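| 
| A rough Rust sketch of the per-CPU / sharded counter idea (names
| made up, not tied to the article): each thread bumps only its own
| cache-line-aligned slot, and readers sum the shards.
| 
|       use std::sync::atomic::{AtomicU64, Ordering};
| 
|       // Each shard sits on its own cache line so the slots don't
|       // falsely share.
|       #[repr(align(64))]
|       struct PaddedCounter(AtomicU64);
| 
|       struct ShardedCounter {
|           shards: Vec<PaddedCounter>,
|       }
| 
|       impl ShardedCounter {
|           fn new(n: usize) -> Self {
|               let shards = (0..n)
|                   .map(|_| PaddedCounter(AtomicU64::new(0)))
|                   .collect();
|               Self { shards }
|           }
|           fn add(&self, shard: usize, n: u64) {
|               let i = shard % self.shards.len();
|               self.shards[i].0.fetch_add(n, Ordering::Relaxed);
|           }
|           fn sum(&self) -> u64 {
|               self.shards.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
|           }
|       }
| 
|       fn main() {
|           let c = ShardedCounter::new(24);
|           std::thread::scope(|s| {
|               for id in 0..24 {
|                   let c = &c;
|                   s.spawn(move || {
|                       for _ in 0..100_000 {
|                           c.add(id, 1); // only touches this thread's line
|                       }
|                   });
|               }
|           });
|           assert_eq!(c.sum(), 24 * 100_000);
|       }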
 
| dan-robertson wrote:
| It seems to me that one solution would be to be able to shard an
| Arc into separate counts for each thread, so that the
| synchronisation wouldn't be required (?) but I don't know how it
| would be done without making most Arcs unnecessarily expensive.
| 
| Perhaps another problem is that libraries can't be (or just
| aren't) generic about the kind of smart-pointers that are used
| but the kind of smart-pointers used can matter a lot (e.g.
| libraries often have to use Arc for some of their clients even
| though Rc would often suffice).
| 
| But maybe I'm just totally wrong about this?
 
  | Kranar wrote:
  | It's a great idea and in fact a variation of it is being
  | explored by Python to eliminate the use of the GIL without
  | introducing performance regressions.
  | 
  | The variation is instead of one count per thread, have two
  | reference counts where one is atomic and the other is not
  | atomic. The heuristic is that at any given time, the majority
  | of writes are performed within a single thread and so that
  | thread has exclusive read/write access to the non-atomic
  | reference count, all other threads share the atomic reference
  | count. There is some work needed to coordinate the atomic and
  | non-atomic reference count so as to test when an object can be
  | deleted, but the overhead of this work is less than the cost of
  | always using an atomic.
  | 
  | http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf
 
    | dan-robertson wrote:
    | In the case of the original article, the reference count was
    | being updated by all the threads so having an Arc-with-Rc-
    | for-one-thread wouldn't actually change things, I think. But
    | maybe you just mean for 'cheaper smart-pointers in general'?
    | 
    | The python thing is interesting. I thought there were other
    | reasons for the GIL than ref counts but I guess everything
    | must be addressed.
 
      | Kranar wrote:
      | Yep in this case it might not have helped, but all this to
      | say that some exploration of having multiple reference
      | counts is a great idea. The balance becomes the overhead of
      | coordinating multiple reference counts versus the gains
      | from reduced atomic operations.
 
| Pxtl wrote:
| That actually makes me appreciate that the single-threadedness of
| Python might be a good thing, since Python uses refcounting.
 
| reilly3000 wrote:
| FWIW this is the kind of content I originally came to HN for and
| what keeps me coming back.
 
| chrisaycock wrote:
| The culprit was a shared _atomic reference counter_ (Arc). The
| cache invalidation across cores eliminated any kind of scaling
| via multiple threads.
| 
| The solution is to stop sharing. The author had to make changes
| specific to this codebase for each thread to have its own copy of
| the offending dataset.
 
  | denysvitali wrote:
  | Awesome TL;DR! I read the full article but this summarizes it
  | well. The article is worth a full read anyway.
 
  | divs1210 wrote:
  | Immutable collections save the day!
 
    | Thiez wrote:
    | Immutable collections wouldn't really help here since Rust
    | does not have a garbage collector, so they would still be
    | wrapped in an `Arc`, or use one internally. Unless you're
    | giving each thread their own copy of the immutable
    | collection, in which case the immutability doesn't really add
    | much.
 
      | ericbarrett wrote:
      | You're right if they're wrapped in an Arc. But it's a
      | lot easier to confidently clone an immutable data structure
      | to threads/async functions because you know they won't be
      | mutated by other cores. If you're cloning mutable data,
      | you'll have to collect each "shard" at the end of whatever
      | runtime loop your program has, and merge them back into the
      | authoritative model. (Or drop the changes, but then why
      | make the data mutable?) Gets hairy.
 
  | amelius wrote:
  | Or use a language with a good concurrent garbage collector.
 
    | dagmx wrote:
    | What language wouldn't hit this issue? The second you have an
    | atomic write across multiple NUMA clusters you'll hit this.
    | 
    | Edit: it appears they're talking specifically about the ref
    | counting whereas I was considering the entirety of a shared
    | context. That clarification helps me understand where their
    | statement was coming from
 
      | amelius wrote:
      | The problem was caused by a memory management primitive.
      | This wouldn't be an issue in a runtime with a GC designed
      | for concurrent loads.
 
        | lowbloodsugar wrote:
        | Wood for trees. The problem was caused by an ill-thought-
        | out design. We can do similarly performance degrading
        | things in GC languages, just the details will be
        | different. However, at some extreme, which, to be fair,
        | most systems won't hit, GC languages perform vastly worse
        | than non GC languages. In one service I own, Java's G1GC
        | uses 20x more CPU than Rust for an application specific
        | benchmark. Most of this time is spent in the concurrent
        | phase so Shenandoah and GenShen aren't going to make a
        | dent (and we can't afford the RAM for Shendandoah). 20x
        | CPU and 2x wall clock. The question we are looking at is,
        | "Should we just continue to spend 20x $ on operating
        | costs for the Java version to avoid writing it in Rust?"
 
        | dagmx wrote:
        | How would the GC avoid the atomic lock and cache
        | invalidation across NUMA boundaries? What language has a
        | sufficiently lock-less read/write-capable GC?
        | 
        | Edit: to clarify I was thinking about a mutable context
        | share as shown in the code example, not solely about the
        | ref counting.
 
        | yencabulator wrote:
        | If the only purpose of the Arc was to be a reference
        | count, a language with a (non-refcount) GC wouldn't need
        | that atomic in the first place.
 
        | dagmx wrote:
        | That's fair. It read to me, from his post, that Rune was
        | using it for context sharing as well between threads
        | (since that's the source of the highest level Arc in his
        | code example) . If it's only for ref counts then it makes
        | sense that a concurrent GC could avoid the issue
 
        | Const-me wrote:
        | > How would the GC avoid the atomic lock and cache
        | invalidation across numa boundaries?
        | 
        | By not using reference counting. State of the art GCs
        | don't count references. They usually do mark and
        | sweep, implement multiple generations, and/or do a
        | few other things.
        | 
        | Most of that overhead only happens while collecting.
        | Merely referencing an object from another thread doesn't
        | modify any shared cache lines.
        | 
        | > What language has a sufficiently lock less rw capable
        | GC?
        | 
        | Java, C#, F#, Golang.
 
        | dagmx wrote:
        | Yeah I think my confusion was that I was thinking about a
        | fully mutable context shared across threads based on the
        | code example in the post.
        | 
        | But if it's just for the ref counting part of the Arc
        | then I can see how a GC would solve it by not needing the
        | RC
 
        | Const-me wrote:
        | > I was thinking about a fully mutable context shared
        | across threads
        | 
        | A quote from the article: "No locks, no mutexes, no
        | syscalls, no shared mutable data here. There are some
        | read-only structures context and unit shared behind an
        | Arc, but read-only sharing shouldn't be a problem." As
        | you see, the data shared across threads was immutable.
        | 
        | However, the library they have picked was designed around
        | Rust's ref-counting Arc<> smart pointers. Apparently for
        | some other use cases, not needed by the OP, that library
        | needs to modify these objects.
        | 
        | > I can see how a GC would solve it by not needing the RC
        | 
        | Interestingly enough, C++ would also solve that. The
        | language does not stop programmers from changing things
        | from multiple threads concurrently. For this reason, very
        | few libraries have their public APIs designed around
        | std::shared_ptr<> (C++ equivalent of Rust's Arc<>).
        | Instead, what usually happens, library authors write in
        | the documentation things like "the object you pass to
        | this API must be thread safe" and "it's your
        | responsibility to make sure the pointer you pass stays
        | alive for as long as you using the API", and call it a
        | day.
 
        | amelius wrote:
        | GoLang. It was designed for highly concurrent loads. For
        | an overview of history and technology involved, read
        | e.g.:
        | 
        | https://blog.twitch.tv/en/2016/07/05/gos-march-to-low-
        | latenc...
 
        | dagmx wrote:
        | Edit: another comment explains that you're likely talking
        | about just the ref counting aspect rather than the entire
        | context sharing used by the rune code shown, in which
        | case, yes I see why a concurrent GC would avoid the issue
        | in that scenario.
        | 
        | ----------
        | 
        | I'm familiar with Go's GC. Your linked post doesn't
        | explain how it would avoid the hit mentioned by the cache
        | invalidation across multiple clusters.
        | 
        | It'll either try and put multiple go routines on a single
        | cluster (as listed in the link) or it'll need to copy the
        | necessary stack per thread. Which is effectively what the
        | original article ends up doing.
        | 
        | But if you encounter anything that needs to run
        | concurrently across threads while using a single r/w
        | object, you'll hit the same cliff surely?
 
        | YogurtFiend wrote:
        | While this isn't true with Go's GC, a GC that's stop-the-
        | world can avoid these cache coherency issues altogether.
        | If you pause every thread before marking and sweeping,
        | then you won't run into problems--only the GC is running.
        | While this may sound silly, stopping the world will
        | oftentimes lead to higher throughput than concurrent
        | collectors, at the cost of higher latency (which is why
        | it's not suitable for Go, which is often used for
        | building servers where latency is a priority).
 
        | jhgb wrote:
        | For one, I'd be surprised if Azul's C4 (Zing) didn't have
        | these issues already solved (assuming anyone has solved
        | that, _they_ probably did).
 
    | YogurtFiend wrote:
    | I don't know why you're being downvoted. Memory reclamation
    | is hard in the face of parallelism. It's much easier to
    | design lock-free algorithms in languages with a garbage
    | collector, because memory reclamation issues are taken care
    | of for you (and often times in a batched manner, which won't
    | affect the hot path of your code). There _are_ techniques to
    | avoid using a garbage collector: atomic reference counting,
    | epoch-based reclamation, hazard pointers, and more. But they
    | can be quite difficult to implement and difficult to use.
 
      | quotemstr wrote:
      | Right. Plus GCs can compact memory, which we can't do with
      | manual memory management. I don't understand why people
      | fetishize manual memory management --- do you want to spend
      | your short time on Earth twiddling references?
 
        | titzer wrote:
        | After 28 years of programming in C and 25 years
        | programming in C++ (on and off), I am sick to death of
        | managing memory. Programmers can't do it right or well.
        | Let the super-intelligent idiot savant with billions of
        | cycles of computation per second and trillions of bits of
        | storage do the thinking for me.
        | 
        | Personally I'm glad the world is moving away from unsafe
        | manual memory management in C++, be that with better RAII
        | and managed pointers, Rust's type system, etc. But those
        | things are still breakable and IMHO, ultimately a big
        | waste of people's time. If Rust's lifetime annotations
        | could be inferred by a compiler, then by all means. But
        | if they can't, just GC already. All that extra thinking
        | pushes people to think about the wrong things and designs
        | end up ossified, difficult to refactor because of the
        | extra overhead of radically changing ownership. Forget it
        | already! Let computers figure out the whole memory
        | management problem.
        | 
        | Want your program to go fast? Don't allocate a bunch of
        | garbage. Redesign it to reuse data structures carefully.
        | Don't burden yourself forever twiddling with managing the
        | large and small alike, obsessing over every dang byte.
 
      | mcguire wrote:
      | Rust's Arc _is_ an atomic reference counter. But, as the
      | article describes, those aren't free.
 
      | duped wrote:
      | One could argue that using a GC for reclamation isn't lock-
      | free if it's not a concurrent GC.
 
        | hayley-patton wrote:
        | There are very few GCs that are actually lock-free. While
        | uncooperative mutators cannot block other mutators in an
        | on-the-fly collector, for example, the collector is
        | blocked until all mutators cooperate. Nonetheless I'd
        | guess that a sufficiently concurrent collector is good
        | enough.
 
___________________________________________________________________
(page generated 2021-12-31 23:00 UTC)