[HN Gopher] Extreme HTTP Performance Tuning
___________________________________________________________________
 
Extreme HTTP Performance Tuning
 
Author : talawahtech
Score  : 219 points
Date   : 2021-05-20 20:01 UTC (2 hours ago)
 
web link (talawah.io)
w3m dump (talawah.io)
 
| jeffbee wrote:
| Very nice round-up of techniques. I'd throw out a few that might
| or might not be worth trying: 1) I always disable C-states deeper
| than C1E. Waking from C6 takes upwards of 100 microseconds, way
| too much for a latency-sensitive service, and it doesn't save
| _you_ any money when you are running on EC2; 2) Try receive flow
| steering for a possible boost above and beyond what you get from
| RSS.
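| 
| (For completeness, and outside of EC2's documented C-state
| options: on hardware you control, one way to hold the CPUs out
| of deep C-states from userspace is the PM QoS interface. A
| minimal sketch; the kernel honors the request only while the fd
| stays open:)
| 
|     #include <fcntl.h>
|     #include <stdint.h>
|     #include <unistd.h>
| 
|     int main(void) {
|         /* Request 0us wakeup latency; deep C-states are avoided
|            for as long as this descriptor remains open. */
|         int32_t target_us = 0;
|         int fd = open("/dev/cpu_dma_latency", O_WRONLY);
|         if (fd < 0)
|             return 1;
|         if (write(fd, &target_us, sizeof(target_us)) < 0)
|             return 1;
|         pause();  /* hold the request until killed */
|         return 0;
|     }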
| 
| Would also be interesting to discuss the impacts of turning off
| the xmit queue discipline. fq is designed to reduce frame drops
| at the switch level. Transmitting as fast as possible can cause
| frame drops which will totally erase all your other tuning work.
 
  | talawahtech wrote:
  | Thanks!
  | 
  | > I always disable C-states deeper than C1E
  | 
  | AWS doesn't let you mess with c-states for instances smaller
  | than a c5.9xlarge[1]. I did actually test it out on a 9xlarge
  | just for kicks, but it didn't make a difference. Once this test
  | starts, all CPUs are 99+% Busy for the duration of the test. I
  | think it would factor in more if there were lots of CPUs, and
  | some were idle during the test.
  | 
  | > Try receive flow steering for a possible boost
  | 
  | I think the stuff I do in the "perfect locality" section[2]
  | (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive
  | flow steering would be trying to do, but more efficiently.
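  | 
  | For anyone curious, the CBPF part is tiny. A minimal sketch
  | (assumes each worker already has its own listener bound with
  | SO_REUSEPORT; attaching the filter to one socket applies it to
  | the whole reuseport group, and it steers each new connection to
  | the listener indexed by the CPU that received the packet):
  | 
  |     #include <linux/filter.h>
  |     #include <sys/socket.h>
  | 
  |     /* cBPF program: "return the current CPU number". The kernel
  |        uses that value to pick the SO_REUSEPORT listener, so each
  |        CPU keeps its own accept queue. */
  |     static void attach_cpu_steering(int listen_fd) {
  |       struct sock_filter code[] = {
  |         { BPF_LD | BPF_W | BPF_ABS, 0, 0,
  |           SKF_AD_OFF + SKF_AD_CPU },
  |         { BPF_RET | BPF_A, 0, 0, 0 },
  |       };
  |       struct sock_fprog prog = {
  |         .len = sizeof(code) / sizeof(code[0]),
  |         .filter = code,
  |       };
  |       setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
  |                  &prog, sizeof(prog));
  |     }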
  | 
  | > Would also be interesting to discuss the impacts of turning
  | off the xmit queue discipline
  | 
  | Yea, noqueue would definitely be a no-go on a constrained
  | network, but when running the (t)wrk benchmark in the cluster
  | placement group I didn't see any evidence of packet drops or
  | retransmits. Drops only happened with the iperf test.
  | 
  | 1. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
  | 
  | 2. https://talawah.io/blog/extreme-http-performance-tuning-one-...
 
  | duskwuff wrote:
  | Does C-state tuning even do anything on EC2? My intuition says
  | it probably doesn't pass through to the underlying hardware --
  | once the VM exits, it's up to the host OS what power state the
  | CPU goes into.
 
    | jeffbee wrote:
    | It definitely works and you can measure the effect. There's
    | official documentation on what it does and how to tune it:
    | 
    | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
 
  | xtacy wrote:
  | I suspect that the web server's CPU usage will be pretty high
  | (almost 100%), so C-state tuning may not matter as much?
  | 
  | EDIT: also, RSS happens on the NIC. RFS happens in the kernel,
  | so it might not be as effective. For a uniform request workload
  | like the one in the article, statically binding flows to a NIC
  | queue should be sufficient. :)
 
| fierro wrote:
| How can you be sure the estimated max server capability is not
| actually just a limitation in the _client_, i.e., the client
| maxes out at _sending_ 224k requests / second?
| 
| I see that this is clearly not the case here, but in general how
| can one be sure?
 
| alufers wrote:
| That is one hell of a comprehensive article. I wonder how much
| impact such extreme optimizations would have on a real-world
| application, one which for example does DB queries.
| 
| This experiment feels similar to people who buy old cars and
| remove everything from the inside except the engine, which they
| tune up so that the car runs faster :).
 
  | talawahtech wrote:
  | This comprehensive level of extreme tuning is not going to be
  | _directly_ useful to most people; but there are a few things in
  | there like SO_ATTACH_REUSEPORT_CBPF that I hope to see more
  | servers and frameworks adopt. Similarly I think it is good to
  | be aware of the adaptive interrupt capabilities of AWS
  | instances, and the impacts of speculative execution
  | mitigations, even if you stick to the defaults.
  | 
  | More importantly it is about the idea of using tools like
  | Flamegraphs (or other profiling tools) to identify and
  | eliminate _your_ bottlenecks. It is also just fun to experiment
  | and share the results (and the CloudFormation template). Plus
  | it establishes a high water mark for what is possible, which
  | also makes it useful for future experiments. At some point I
  | would like to do a modified version of this that includes DB
  | queries.
 
  | 101008 wrote:
  | Yes, my experience (not much) is that what makes YouTube or
  | Google or any of those products really impressive is the speed.
  | 
  | YouTube or Google Search suggestions are good, and I think they
  | could be replicated with that amount of data. What is insane is
  | the speed. I can't think how they do it. I am doing something
  | similar for the company I work for and it takes seconds (and the
  | amount of data isn't that much), so I can't wrap my head around
  | it.
  | 
  | The point is that doing only speed is not _that_ complicated,
  | and doing some algorithms alone is not _that_ complicated. What
  | is really hard is to do both.
 
    | ecnahc515 wrote:
    | A lot of this is just spending more money and resources to
    | make it possible to optimize for speed.
    | 
    | Sufficient caching and a lot of parallelism make this
    | possible. That costs money though. Caching means storing
    | data twice. Parallelism means more servers (since you'll
    | probably be aiming to saturate the network bandwidth for each
    | host).
    | 
    | Pre-aggregating data is another part of the strategy, as that
    | avoids using CPU cycles in the fast-path, but it means
    | storing even more copies of the data!
    | 
    | My personal anecdotal experience with this is with SQL on
    | object storage. Query engines that use object storage can
    | still perform well with the above techniques, even though
    | querying large amounts of data from object storage is slow.
    | You can bypass that slowness if you pre-cache recent data
    | somewhere else that's closer/faster. You
    | can have materialized views/tables for rollups of data over
    | longer periods of time, which reduces the data needed to be
    | fetched and cached. It also requires less CPU due to working
    | with a smaller amount of pre-calculated data.
    | 
    | Apply this to every layer, every system, etc., and you can get
    | good performance even with tons of data. It's why doing
    | machine-learning in real-time is way harder than pre-computing
    | models. Streaming platforms make this all much easier as you
    | can constantly be pre-computing as much as you can, and pre-
    | filling caches, etc.
    | 
    | Of course, having engineers work on 1% performance
    | improvements in the OS kernel, or memory allocators, etc. will
    | add up and help a lot too.
 
    | simcop2387 wrote:
    | I've had them take seconds for suggestions before when doing
    | more esoteric searches. I think there's an inordinate amount
    | of cached suggestions and they have an incredible way to look
    | them up efficiently.
 
| fabioyy wrote:
| did you try DPDK?
 
| strawberrysauce wrote:
| Your website is super snappy. I see that it has a perfect
| lighthouse score too. Can you explain the stack you used and how
| you set it up?
 
| miohtama wrote:
| How much headroom would there be if one were to use a unikernel
| and skip the application space altogether?
 
| 0xbadcafebee wrote:
| Very well written, bravo. TOC and reference links makes it even
| better.
 
| the8472 wrote:
| Since it's CPU-bound and spends a lot of time in the kernel,
| would compiling the kernel for the specific CPU make sense? Or are
| the CPU cycles wasted on things the compiler can't optimize?
 
| drenvuk wrote:
| I'm of two minds with regard to this: it's cool, but unless you
| have no authentication and no data to fetch remotely or from
| disk, this is really just telling you what the ceiling is for
| everything you could possibly run.
| 
| As for this article, there are so many knobs that you tweaked to
| get this to run faster that it's incredibly informative. Thank
| you for sharing.
 
  | joshka wrote:
  | > this is really just telling you what the ceiling is
  | 
  | That's a useful piece of info to know when performance tuning a
  | real world app with auth / data / etc.
 
| bigredhdl wrote:
| I really like the "Optimizations That Didn't Work" section. This
| type of information should be shared more often.
 
| Thaxll wrote:
| There was a similar article from Dropbox years ago:
| https://dropbox.tech/infrastructure/optimizing-web-servers-f...
| still very relevant
 
| 120bits wrote:
| Very well written.
| 
| - I have a nodejs server for the APIs and it's running on an
| m5.xlarge instance. I haven't done much research on what instance
| type I should go for. I looked it up and it seems like c5n.xlarge
| (mentioned in the article) is compute optimized. The cost
| difference between m5.xlarge and c5n.xlarge isn't much. So I'm
| assuming that switching to a c5 instance would be better, right?
| 
| - Is having nginx handle the requests a better option here, with
| a reverse proxy set up for NodeJS? I'm thinking of taking small
| steps on scaling an existing framework.
 
  | talawahtech wrote:
  | Thanks!
  | 
  | The c5 instance type is about 10-15% faster than the m5, but
  | the m5 has twice as much memory. So if memory is not a concern
  | then switching to c5 is both a little cheaper and a little
  | faster.
  | 
  | You shouldn't need the c5n, the regular c5 should be fine for
  | most use cases, and it is cheaper.
  | 
  | Nginx in front of nodejs sounds like a solid starting point,
  | but I can't claim to have a ton of experience with that combo.
 
  | danielheath wrote:
  | For high level languages like node, the graviton2 instances
  | offer vastly cheaper cpu time (as in, 40%). That's the m6g /
  | c6g series.
  | 
  | As in all things, check the results on your own workload!
 
  | [deleted]
 
  | nodesocket wrote:
  | m5 has more memory; if your application is memory bound, stick
  | with that instance type.
  | 
  | I'd recommend just using a standard AWS application load
  | balancer in front of your Node.js app. Terminate SSL at the ALB
  | as well using certificate manager (free). Will run you around
  | $18 a month more.
 
| secondcoming wrote:
| Fantastic article. Disabling spectre mitigations on all my team's
| GCE instances is something I'm going to check out.
| 
| Regarding core pinning, the usual advice is to pin to the CPU
| socket physically closest to the NIC. Is there any point doing
| this on cloud instances? Your actual cores could be anywhere. So
| just isolate one and hope for the best?
 
  | brobinson wrote:
  | There are a bunch more mitigations that can be disabled than he
  | disables in the article. I usually refer to
  | https://make-linux-fast-again.com/
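  | 
  | A quick way to see what is currently active before (and after)
  | turning things off is to read the files the kernel exposes under
  | /sys/devices/system/cpu/vulnerabilities. A minimal sketch:
  | 
  |     #include <dirent.h>
  |     #include <stdio.h>
  | 
  |     /* Print each vulnerability file and its status line, e.g.
  |        "spectre_v2: Mitigation: ..." or "... Vulnerable". */
  |     int main(void) {
  |       const char *dir = "/sys/devices/system/cpu/vulnerabilities";
  |       DIR *d = opendir(dir);
  |       struct dirent *e;
  |       char path[512], line[256];
  | 
  |       if (!d)
  |         return 1;
  |       while ((e = readdir(d)) != NULL) {
  |         if (e->d_name[0] == '.')
  |           continue;
  |         snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
  |         FILE *f = fopen(path, "r");
  |         if (f && fgets(line, sizeof(line), f))
  |           printf("%s: %s", e->d_name, line);
  |         if (f)
  |           fclose(f);
  |       }
  |       closedir(d);
  |       return 0;
  |     }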
 
  | halz wrote:
  | Pinning to the physically closest core is a bit misleading.
  | Take a look at output from something like `lstopo`
  | [https://www.open-mpi.org/projects/hwloc/], where you can
  | filter pids across the NUMA topology and trace which components
  | are routed into which nodes. Pin the network-based workloads
  | into the corresponding NUMA node and isolate processes from
  | hitting the IRQ that drives the NIC.
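  | 
  | The pinning half of that is just the standard affinity call; a
  | minimal sketch (picking _which_ core is the part lstopo/hwloc
  | informs, e.g. a core on the NUMA node the NIC hangs off of):
  | 
  |     #define _GNU_SOURCE
  |     #include <pthread.h>
  |     #include <sched.h>
  | 
  |     /* Pin the calling thread to a single core; the caller picks
  |        the core id based on the NUMA topology. */
  |     static int pin_to_cpu(int cpu) {
  |       cpu_set_t set;
  |       CPU_ZERO(&set);
  |       CPU_SET(cpu, &set);
  |       return pthread_setaffinity_np(pthread_self(),
  |                                     sizeof(set), &set);
  |     }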
 
| ArtWomb wrote:
| Wow. Such impressive bpftrace skill! Keeping this article under
| my pillow ;)
| 
| Wonder where the next optimization path leads? Using huge memory
| pages. io_uring, which was briefly mentioned. Or kernel bypass,
| which is supported on c5n instances as of late...
 
| diroussel wrote:
| Did you consider wrk2?
| 
| https://github.com/giltene/wrk2
| 
| Maybe you duplicated some of these fixes?
 
  | talawahtech wrote:
  | Yea, I looked at wrk2 but it was a no-go right out of the gate.
  | From what I recall the changes to handle coordinated omission
  | use a timer that has a 1ms resolution. So basically things
  | broke immediately because all requests were under 1ms.
 
| specialist wrote:
| What is the theoretical max req/s for a 4 vCPU c5n.xlarge
| instance?
 
  | talawahtech wrote:
  | There is no published limit, but based on my tests the network
  | device for the c5n.xlarge has a hard limit of 1.8M pps (which
  | translates directly to req/s for small requests without
  | pipelining).
  | 
  | There is also a quota system in place, so even though that is
  | the hard limit, you can only operate at those speeds for a
  | short time before you start getting rate-limited.
 
___________________________________________________________________
(page generated 2021-05-20 23:00 UTC)