|
| jeffbee wrote:
| Very nice round-up of techniques. I'd throw out a few that might
| or might not be worth trying: 1) I always disable C-states deeper
| than C1E. Waking from C6 takes upwards of 100 microseconds, way
| too much for a latency-sensitive service, and it doesn't save
| _you_ any money when you are running on EC2; 2) Try receive flow
| steering for a possible boost above and beyond what you get from
| RSS.
|
| Would also be interesting to discuss the impacts of turning off
| the xmit queue discipline. fq is designed to reduce frame drops
| at the switch level. Transmitting as fast as possible can cause
| frame drops which will totally erase all your other tuning work.
| talawahtech wrote:
| Thanks!
|
| > I always disable C-states deeper than C1E
|
| AWS doesn't let you mess with c-states for instances smaller
| than a c5.9xlarge[1]. I did actually test it out on a 9xlarge
| just for kicks, but it didn't make a difference. Once this test
| starts, all CPUs are 99+% Busy for the duration of the test. I
| think it would factor in more if there were lots of CPUs, and
| some were idle during the test.
|
| > Try receive flow steering for a possible boost
|
| I think the stuff I do in the "perfect locality" section[2]
| (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive
| flow steering would be trying to do, but more efficiently.
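|
| For anyone curious, the core of it is a two-instruction classic
| BPF program that just returns the current CPU index, so the
| kernel picks the reuseport socket whose group index matches the
| CPU handling the packet. A minimal sketch (error handling
| omitted; SO_ATTACH_REUSEPORT_CBPF needs Linux 4.5+):
|
|     #include <linux/filter.h>  /* sock_filter, SKF_AD_CPU */
|     #include <sys/socket.h>
|
|     /* Attach to one listener in the SO_REUSEPORT group. */
|     static int attach_cpu_steering(int fd) {
|         struct sock_filter code[] = {
|             /* A = index of the current CPU */
|             { BPF_LD | BPF_W | BPF_ABS, 0, 0,
|               SKF_AD_OFF + SKF_AD_CPU },
|             /* return A as the socket index */
|             { BPF_RET | BPF_A, 0, 0, 0 },
|         };
|         struct sock_fprog prog = { .len = 2, .filter = code };
|         return setsockopt(fd, SOL_SOCKET,
|                           SO_ATTACH_REUSEPORT_CBPF,
|                           &prog, sizeof(prog));
|     }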
|
| > Would also be interesting to discuss the impacts of turning
| off the xmit queue discipline
|
| Yea, noqueue would definitely be a no-go on a constrained
| network, but when running the (t)wrk benchmark in the cluster
| placement group I didn't see any evidence of packet drops or
| retransmits. Drops only happened with the iperf test.
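|
| (If anyone wants to sanity-check that on their own setup, the
| kernel's per-socket retransmit counter is readable via TCP_INFO;
| a small sketch:)
|
|     #include <netinet/in.h>    /* IPPROTO_TCP */
|     #include <netinet/tcp.h>   /* TCP_INFO, struct tcp_info */
|     #include <stdio.h>
|     #include <sys/socket.h>
|
|     /* Print retransmits for a connected TCP socket; a rising
|        tcpi_total_retrans would be evidence of drops. */
|     static void print_retransmits(int fd) {
|         struct tcp_info info;
|         socklen_t len = sizeof(info);
|         if (getsockopt(fd, IPPROTO_TCP, TCP_INFO,
|                        &info, &len) == 0)
|             printf("retransmits: %u\n",
|                    info.tcpi_total_retrans);
|     }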
|
| 1.
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
|
| 2. https://talawah.io/blog/extreme-http-performance-tuning-one-...
| duskwuff wrote:
| Does C-state tuning even do anything on EC2? My intuition says
| it probably doesn't pass through to the underlying hardware --
| once the VM exits, it's up to the host OS what power state the
| CPU goes into.
| jeffbee wrote:
| It definitely works and you can measure the effect. There's
| official documentation on what it does and how to tune it:
|
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
| xtacy wrote:
| I suspect that the web server's CPU usage will be pretty high
| (almost 100%), so C-state tuning may not matter as much?
|
| EDIT: also, RSS happens on the NIC. RFS happens in the kernel,
| so it might not be as effective. For a uniform request workload
| like the one in the article, statically binding flows to a NIC
| queue should be sufficient. :)
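|
| (For completeness: RFS is off by default, and turning it on
| means sizing the global and per-queue flow tables, roughly as in
| the sketch below. The device name and table sizes are purely
| illustrative, and it needs root.)
|
|     #include <stdio.h>
|
|     static int write_str(const char *path, const char *val) {
|         FILE *f = fopen(path, "w");
|         if (!f) return -1;
|         fputs(val, f);
|         return fclose(f);
|     }
|
|     int main(void) {
|         /* global flow table */
|         write_str("/proc/sys/net/core/rps_sock_flow_entries",
|                   "32768");
|         /* per-RX-queue table; repeat for each rx-N queue */
|         write_str("/sys/class/net/eth0/queues/rx-0/rps_flow_cnt",
|                   "32768");
|         return 0;
|     }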
| fierro wrote:
| How can you be sure the estimated max server capability is not
| actually just a limitation of the _client_, i.e., that the
| client maxes out at _sending_ 224k requests/second?
|
| I see that this is clearly not the case here, but in general how
| can one be sure?
| alufers wrote:
| That is one hell of a comprehensive article. I wonder how much
| impact such extreme optimizations would have on a real-world
| application, one which, for example, does DB queries.
|
| This experiment feels similar to people who buy old cars and
| remove everything from the inside except the engine, which they
| tune up so that the car runs faster :).
| talawahtech wrote:
| This comprehensive level of extreme tuning is not going to be
| _directly_ useful to most people; but there are a few things in
| there like SO_ATTACH_REUSEPORT_CBPF that I hope to see more
| servers and frameworks adopt. Similarly I think it is good to
| be aware of the adaptive interrupt capabilities of AWS
| instances, and the impacts of speculative execution
| mitigations, even if you stick to the defaults.
|
| More importantly it is about the idea of using tools like
| Flamegraphs (or other profiling tools) to identify and
| eliminate _your_ bottlenecks. It is also just fun to experiment
| and share the results (and the CloudFormation template). Plus
| it establishes a high water mark for what is possible, which
| also makes it useful for future experiments. At some point I
| would like to do a modified version of this that includes DB
| queries.
| 101008 wrote:
| Yes, my (admittedly limited) experience is that what makes
| YouTube or Google or any of those products really impressive
| is the speed.
|
| YouTube or Google Search suggestion is good, and I think it
| could be replicated with that amount of data. What is insane
| is the speed. I can't imagine how they do it. I am doing
| something similar for the company I work for and it takes
| seconds (and the amount of data isn't that much), so I can't
| wrap my head around it.
|
| The point is that doing speed alone is not _that_ complicated,
| and doing the algorithms alone is not _that_ complicated. What
| is really hard is to do both.
| ecnahc515 wrote:
| A lot of this is just spending more money and resources to
| make it possible to optimize for speed.
|
| Sufficient caching and a lot of parallelism make this
| possible. That costs money though. Caching means storing
| data twice. Parallelism means more servers (since you'll
| probably be aiming to saturate the network bandwidth for each
| host).
|
| Pre-aggregating data is another part of the strategy, as that
| avoids using CPU cycles in the fast-path, but it means
| storing even more copies of the data!
|
| My personal anecdotal experience with this is with SQL on
| object storage. Query engines that use object storage can
| still perform well with the above techniques, even though
| querying large amounts of data from object storage is slow.
| You can bypass that slowness if you pre-cache the
| data somewhere else that's closer/faster for recent data. You
| can have materialized views/tables for rollups of data over
| longer periods of time, which reduces the data needed to be
| fetched and cached. It also requires less CPU due to working
| with a smaller amount of pre-calculated data.
|
| Apply this to every layer, every system, etc, and you can get
| good performance even with tons of data. It's why doing
| machine learning in real time is way harder than pre-computing
| models. Streaming platforms make this all much easier as you
| can constantly be pre-computing as much as you can, and pre-
| filling caches, etc.
|
| Of course, having engineers work on 1% performance
| improvements in the OS kernel, memory allocators, etc. will
| add up and help a lot too.
| simcop2387 wrote:
| I've had them take seconds for suggestions before when doing
| more esoteric searches. I think there's an inordinate amount
| of cached suggestions and they have an incredible way to look
| them up efficiently.
| fabioyy wrote:
| did you try DPDK?
| strawberrysauce wrote:
| Your website is super snappy. I see that it has a perfect
| lighthouse score too. Can you explain the stack you used and how
| you set it up?
| miohtama wrote:
| How much headroom would there be if one were to use a
| unikernel and skip the application space altogether?
| 0xbadcafebee wrote:
| Very well written, bravo. TOC and reference links makes it even
| better.
| the8472 wrote:
| Since it's CPU-bound and spends a lot of time in the kernel,
| would compiling the kernel for the specific CPU used make
| sense? Or are
| the CPU cycles wasted on things the compiler can't optimize?
| drenvuk wrote:
| I'm of two minds about this: it's cool, but unless you have no
| authentication and no data to fetch remotely or from disk,
| this is really just telling you what the ceiling is for
| everything you could possibly run.
|
| As for this article, there are so many knobs that you tweaked
| to get this to run faster that it's incredibly informative.
| Thank you for
| sharing.
| joshka wrote:
| > this is really just telling you what the ceiling is
|
| That's a useful piece of info to know when performance tuning a
| real world app with auth / data / etc.
| bigredhdl wrote:
| I really like the "Optimizations That Didn't Work" section. This
| type of information should be shared more often.
| Thaxll wrote:
| There was a similar article from Dropbox years ago:
| https://dropbox.tech/infrastructure/optimizing-web-servers-f...
| It's still very relevant.
| 120bits wrote:
| Very well written.
|
| - I have a nodejs server for the APIs and it's running on an
| m5.xlarge instance. I haven't done much research on what
| instance type I should go for. I looked it up and it seems
| like the c5n.xlarge (mentioned in the article) is compute
| optimized. The cost difference between the m5.xlarge and the
| c5n.xlarge isn't much, so I'm assuming that switching to a c5
| instance would be better, right?
|
| - Is having nginx handle the requests a better option here,
| with a reverse proxy set up for NodeJS? I'm thinking of taking
| small steps toward scaling an existing framework.
| talawahtech wrote:
| Thanks!
|
| The c5 instance type is about 10-15% faster than the m5, but
| the m5 has twice as much memory. So if memory is not a concern
| then switching to c5 is both a little cheaper and a little
| faster.
|
| You shouldn't need the c5n; the regular c5 should be fine for
| most use cases, and it is cheaper.
|
| Nginx in front of nodejs sounds like a solid starting point,
| but I can't claim to have a ton of experience with that combo.
| danielheath wrote:
| For high level languages like node, the graviton2 instances
| offer vastly cheaper cpu time (as in, 40%). That's the m6g /
| c6g series.
|
| As in all things, check the results on your own workload!
| [deleted]
| nodesocket wrote:
| The m5 has more memory; if your application is memory-bound,
| stick with that instance type.
|
| I'd recommend just using a standard AWS application load
| balancer in front of your Node.js app. Terminate SSL at the ALB
| as well using certificate manager (free). Will run you around
| $18 a month more.
| secondcoming wrote:
| Fantastic article. Disabling spectre mitigations on all my team's
| GCE instances is something I'm going to check out.
|
| Regarding core pinning, the usual advice is to pin to the CPU
| socket physically closest to the NIC. Is there any point doing
| this on cloud instances? Your actual cores could be anywhere. So
| just isolate one and hope for the best?
| brobinson wrote:
| There are a bunch more mitigations that can be disabled than he
| disables in the article. I usually refer to
| https://make-linux-fast-again.com/
| halz wrote:
| Pinning to the physically closest core is a bit misleading.
| Take a look at output from something like `lstopo`
| [https://www.open-mpi.org/projects/hwloc/], where you can
| filter pids across the NUMA topology and trace which components
| are routed into which nodes. Pin the network based workloads
| into the corresponding NUMA node and isolate processes from
| hitting the IRQ that drives the NIC.
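|
| Once you know the right node, the pinning itself is simple; a
| minimal sketch using sched_setaffinity (the core index here is
| just an example; use whichever core lstopo puts on the NIC's
| NUMA node):
|
|     #define _GNU_SOURCE
|     #include <sched.h>
|     #include <stdio.h>
|
|     int main(void) {
|         cpu_set_t set;
|         CPU_ZERO(&set);
|         CPU_SET(1, &set); /* illustrative core index */
|         if (sched_setaffinity(0, sizeof(set), &set) != 0) {
|             perror("sched_setaffinity");
|             return 1;
|         }
|         /* ... run the latency-sensitive work here ... */
|         return 0;
|     }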
| ArtWomb wrote:
| Wow. Such impressive bpftrace skill! Keeping this article under
| my pillow ;)
|
| Wonder where the next optimization path leads? Using huge memory
| pages. io_uring, which was briefly mentioned. Or kernel bypass,
| which is supported on c5n instances as of late...
| diroussel wrote:
| Did you consider wrk2?
|
| https://github.com/giltene/wrk2
|
| Maybe you duplicated some of these fixes?
| talawahtech wrote:
| Yea, I looked at wrk2 but it was a no-go right out of the gate.
| From what I recall the changes to handle coordinated omission
| use a timer that has a 1ms resolution. So basically things
| broke immediately because all requests were under 1ms.
| specialist wrote:
| What is the theoretical max req/s for a 4 vCPU c5n.xlarge
| instance?
| talawahtech wrote:
| There is no published limit, but based on my tests the network
| device for the c5n.xlarge has a hard limit of 1.8M pps (which
| translates directly to req/s for small requests without
| pipelining).
|
| There is also a quota system in place, so even though that is
| the hard limit, you can only operate at those speeds for a
| short time before you start getting rate-limited.