|
| jeffbee wrote:
| Very nice round-up of techniques. I'd throw out a few that might
| or might not be worth trying: 1) I always disable C-states deeper
| than C1E. Waking from C6 takes upwards of 100 microseconds, way
| too much for a latency-sensitive service, and it doesn't save
| _you_ any money when you are running on EC2; 2) Try receive flow
| steering for a possible boost above and beyond what you get from
| RSS.
|
| Would also be interesting to discuss the impacts of turning off
| the xmit queue discipline. fq is designed to reduce frame drops
| at the switch level. Transmitting as fast as possible can cause
| frame drops which will totally erase all your other tuning work.
| talawahtech wrote:
| Thanks!
|
| > I always disable C-states deeper than C1E
|
| AWS doesn't let you mess with c-states for instances smaller
| than a c5.9xlarge[1]. I did actually test it out on a 9xlarge
| just for kicks, but it didn't make a difference. Once this test
| starts, all CPUs are 99+% Busy for the duration of the test. I
| think it would factor in more if there were lots of CPUs, and
| some were idle during the test.
|
| > Try receive flow steering for a possible boost
|
| I think the stuff I do in the "perfect locality" section[2]
| (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive
| flow steering would be trying to do, but more efficiently.
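|
| For anyone curious, the core of it is a two-instruction classic
| BPF program that just returns the current CPU index, so the
| kernel picks the reuseport socket whose group index matches the
| CPU handling the packet. A minimal sketch (error handling
| omitted; SO_ATTACH_REUSEPORT_CBPF needs Linux 4.5+):
|
|     #include <linux/filter.h>  /* sock_filter, SKF_AD_CPU */
|     #include <sys/socket.h>
|
|     /* Attach to one listener in the SO_REUSEPORT group. */
|     static int attach_cpu_steering(int fd) {
|         struct sock_filter code[] = {
|             /* A = index of the current CPU */
|             { BPF_LD | BPF_W | BPF_ABS, 0, 0,
|               SKF_AD_OFF + SKF_AD_CPU },
|             /* return A as the socket index */
|             { BPF_RET | BPF_A, 0, 0, 0 },
|         };
|         struct sock_fprog prog = { .len = 2, .filter = code };
|         return setsockopt(fd, SOL_SOCKET,
|                           SO_ATTACH_REUSEPORT_CBPF,
|                           &prog, sizeof(prog));
|     }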
|
| > Would also be interesting to discuss the impacts of turning
| off the xmit queue discipline
|
| Yea, noqueue would definitely be a no-go on a constrained
| network, but when running the (t)wrk benchmark in the cluster
| placement group I didn't see any evidence of packet drops or
| retransmits. Drops only happened with the iperf test.
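|
| (If anyone wants to sanity-check that on their own setup, the
| kernel's per-socket retransmit counter is readable via TCP_INFO;
| a small sketch:)
|
|     #include <netinet/in.h>    /* IPPROTO_TCP */
|     #include <netinet/tcp.h>   /* TCP_INFO, struct tcp_info */
|     #include <stdio.h>
|     #include <sys/socket.h>
|
|     /* Print retransmits for a connected TCP socket; a rising
|        tcpi_total_retrans would be evidence of drops. */
|     static void print_retransmits(int fd) {
|         struct tcp_info info;
|         socklen_t len = sizeof(info);
|         if (getsockopt(fd, IPPROTO_TCP, TCP_INFO,
|                        &info, &len) == 0)
|             printf("retransmits: %u\n",
|                    info.tcpi_total_retrans);
|     }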
|
| 1.
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
|
| 2. https://talawah.io/blog/extreme-http-performance-tuning-one-...
| duskwuff wrote:
| Does C-state tuning even do anything on EC2? My intuition says
| it probably doesn't pass through to the underlying hardware --
| once the VM exits, it's up to the host OS what power state the
| CPU goes into.
| jeffbee wrote:
| It definitely works and you can measure the effect. There's
| official documentation on what it does and how to tune it:
|
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
| xtacy wrote:
| I suspect that the web server's CPU usage will be pretty high
| (almost 100%), so C-state tuning may not matter as much?
|
| EDIT: also, RSS happens on the NIC. RFS happens in the kernel,
| so it might not be as effective. For a uniform request workload
| like the one in the article, statically binding flows to a NIC
| queue should be sufficient. :)
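|
| (For completeness: RFS is off by default, and turning it on
| means sizing the global and per-queue flow tables, roughly as in
| the sketch below. The device name and table sizes are purely
| illustrative, and it needs root.)
|
|     #include <stdio.h>
|
|     static int write_str(const char *path, const char *val) {
|         FILE *f = fopen(path, "w");
|         if (!f) return -1;
|         fputs(val, f);
|         return fclose(f);
|     }
|
|     int main(void) {
|         /* global flow table */
|         write_str("/proc/sys/net/core/rps_sock_flow_entries",
|                   "32768");
|         /* per-RX-queue table; repeat for each rx-N queue */
|         write_str("/sys/class/net/eth0/queues/rx-0/rps_flow_cnt",
|                   "32768");
|         return 0;
|     }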
| fierro wrote:
| How can you be sure the estimated max server capability is not
| actually just a limitation of the _client_, i.e., that the
| client maxes out at _sending_ 224k requests/second?
|
| I see that this is clearly not the case here, but in general how
| can one be sure?
| alufers wrote:
| That is one hell of a comprehensive article. I wonder how much
| impact such extreme optimizations would have on a real-world
| application, one which, for example, does DB queries.
|
| This experiment feels similar to people who buy old cars and
| remove everything from the inside except the engine, which they
| tune up so that the car runs faster :).
| talawahtech wrote:
| This comprehensive level of extreme tuning is not going to be
| _directly_ useful to most people; but there are a few things in
| there like SO_ATTACH_REUSEPORT_CBPF that I hope to see more
| servers and frameworks adopt. Similarly I think it is good to
| be aware of the adaptive interrupt capabilities of AWS
| instances, and the impacts of speculative execution
| mitigations, even if you stick to the defaults.
|
| More importantly it is about the idea of using tools like
| Flamegraphs (or other profiling tools) to identify and
| eliminate _your_ bottlenecks. It is also just fun to experiment
| and share the results (and the CloudFormation template). Plus
| it establishes a high water mark for what is possible, which
| also makes it useful for future experiments. At some point I
| would like to do a modified version of this that includes DB
| queries.
| 101008 wrote:
| Yes, my (admittedly limited) experience is that what makes
| YouTube or Google or any of those products really impressive
| is the speed.
|
| YouTube or Google Search suggestion is good, and I think it
| could be replicated with that amount of data. What is insane
| is the speed. I can't imagine how they do it. I am doing
| something similar for the company I work for and it takes
| seconds (and the amount of data isn't that much), so I can't
| wrap my head around it.
|
| The point is that doing speed alone is not _that_ complicated,
| and doing the algorithms alone is not _that_ complicated. What
| is really hard is to do both.
| ecnahc515 wrote:
| A lot of this is just spending more money and resources to
| make it possible to optimize for speed.
|
| Sufficient caching and a lot of parallelism make this
| possible. That costs money though. Caching means storing
| data twice. Parallelism means more servers (since you'll
| probably be aiming to saturate the network bandwidth for each
| host).
|
| Pre-aggregating data is another part of the strategy, as that
| avoids using CPU cycles in the fast-path, but it means
| storing even more copies of the data!
|
| My personal anecdotal experience with this is with SQL on
| object storage. Query engines that use object storage can
| still perform well with the above techniques, even though
| querying large amounts of data from object storage is slow.
| You can bypass that slowness if you pre-cache the
| data somewhere else that's closer/faster for recent data. You
| can have materialized views/tables for rollups of data over
| longer periods of time, which reduces the data needed to be
| fetched and cached. It also requires less CPU due to working
| with a smaller amount of pre-calculated data.
|
| Apply this to every layer, every system, etc, and you can get
| good performance even with tons of data. It's why doing
| machine learning in real time is way harder than pre-computing
| models. Streaming platforms make this all much easier as you
| can constantly be pre-computing as much as you can, and pre-
| filling caches, etc.
|
| Of course, having engineers work on 1% performance
| improvements in the OS kernel, memory allocators, etc. will
| add up and help a lot too.
| simcop2387 wrote:
| I've had them take seconds for suggestions before when doing
| more esoteric searches. I think there's an inordinate amount
| of cached suggestions and they have an incredible way to look
| them up efficiently.
| fabioyy wrote:
| did you try DPDK?
| strawberrysauce wrote:
| Your website is super snappy. I see that it has a perfect
| lighthouse score too. Can you explain the stack you used and how
| you set it up?
| miohtama wrote:
| How much headroom would there be if one were to use a
| unikernel and skip the application space altogether?
| 0xbadcafebee wrote:
| Very well written, bravo. TOC and reference links makes it even
| better.
| the8472 wrote:
| Since it's CPU-bound and spends a lot of time in the kernel,
| would compiling the kernel for the specific CPU used make
| sense? Or are
| the CPU cycles wasted on things the compiler can't optimize?
| drenvuk wrote:
| I'm of two minds about this: it's cool, but unless you have no
| authentication and no data to fetch remotely or from disk,
| this is really just telling you what the ceiling is for
| everything you could possibly run.
|
| As for this article, there are so many knobs that you tweaked
| to get this to run faster that it's incredibly informative.
| Thank you for
| sharing.
| joshka wrote:
| > this is really just telling you what the ceiling is
|
| That's a useful piece of info to know when performance tuning a
| real world app with auth / data / etc.
| bigredhdl wrote:
| I really like the "Optimizations That Didn't Work" section. This
| type of information should be shared more often.
| Thaxll wrote:
| There was a similar article from Dropbox years ago:
| https://dropbox.tech/infrastructure/optimizing-web-servers-f...
| It's still very relevant.
| 120bits wrote:
| Very well written.
|
| - I have a nodejs server for the APIs and it's running on an
| m5.xlarge instance. I haven't done much research on what
| instance type I should go for. I looked it up and it seems
| like the c5n.xlarge (mentioned in the article) is compute
| optimized. The cost difference between the m5.xlarge and the
| c5n.xlarge isn't much, so I'm assuming that switching to a c5
| instance would be better, right?
|
| - Is having nginx handle the requests a better option here,
| with a reverse proxy set up for NodeJS? I'm thinking of taking
| small steps toward scaling an existing framework.
| talawahtech wrote:
| Thanks!
|
| The c5 instance type is about 10-15% faster than the m5, but
| the m5 has twice as much memory. So if memory is not a concern
| then switching to c5 is both a little cheaper and a little
| faster.
|
| You shouldn't need the c5n; the regular c5 should be fine for
| most use cases, and it is cheaper.
|
| Nginx in front of nodejs sounds like a solid starting point,
| but I can't claim to have a ton of experience with that combo.
| danielheath wrote:
| For high level languages like node, the graviton2 instances
| offer vastly cheaper cpu time (as in, 40%). That's the m6g /
| c6g series.
|
| As in all things, check the results on your own workload!
| [deleted]
| nodesocket wrote:
| The m5 has more memory; if your application is memory-bound,
| stick with that instance type.
|
| I'd recommend just using a standard AWS application load
| balancer in front of your Node.js app. Terminate SSL at the ALB
| as well using certificate manager (free). Will run you around
| $18 a month more.
| secondcoming wrote:
| Fantastic article. Disabling spectre mitigations on all my team's
| GCE instances is something I'm going to check out.
|
| Regarding core pinning, the usual advice is to pin to the CPU
| socket physically closest to the NIC. Is there any point doing
| this on cloud instances? Your actual cores could be anywhere. So
| just isolate one and hope for the best?
| brobinson wrote:
| There are a bunch more mitigations that can be disabled than he
| disables in the article. I usually refer to
| https://make-linux-fast-again.com/
| halz wrote:
| Pinning to the physically closest core is a bit misleading.
| Take a look at output from something like `lstopo`
| [https://www.open-mpi.org/projects/hwloc/], where you can
| filter pids across the NUMA topology and trace which components
| are routed into which nodes. Pin the network based workloads
| into the corresponding NUMA node and isolate processes from
| hitting the IRQ that drives the NIC.
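|
| Once you know the right node, the pinning itself is simple; a
| minimal sketch using sched_setaffinity (the core index here is
| just an example; use whichever core lstopo puts on the NIC's
| NUMA node):
|
|     #define _GNU_SOURCE
|     #include <sched.h>
|     #include <stdio.h>
|
|     int main(void) {
|         cpu_set_t set;
|         CPU_ZERO(&set);
|         CPU_SET(1, &set); /* illustrative core index */
|         if (sched_setaffinity(0, sizeof(set), &set) != 0) {
|             perror("sched_setaffinity");
|             return 1;
|         }
|         /* ... run the latency-sensitive work here ... */
|         return 0;
|     }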
| ArtWomb wrote:
| Wow. Such impressive bpftrace skill! Keeping this article under
| my pillow ;)
|
| Wonder where the next optimization path leads? Using huge memory
| pages. io_uring, which was briefly mentioned. Or kernel bypass,
| which is supported on c5n instances as of late...
| diroussel wrote:
| Did you consider wrk2?
|
| https://github.com/giltene/wrk2
|
| Maybe you duplicated some of these fixes?
| talawahtech wrote:
| Yea, I looked at wrk2 but it was a no-go right out of the gate.
| From what I recall the changes to handle coordinated omission
| use a timer that has a 1ms resolution. So basically things
| broke immediately because all requests were under 1ms.
| specialist wrote:
| What is the theoretical max req/s for a 4 vCPU c5n.xlarge
| instance?
| talawahtech wrote:
| There is no published limit, but based on my tests the network
| device for the c5n.xlarge has a hard limit of 1.8M pps (which
| translates directly to req/s for small requests without
| pipelining).
|
| There is also a quota system in place, so even though that is
| the hard limit, you can only operate at those speeds for a
| short time before you start getting rate-limited.