[HN Gopher] Scaling Kubernetes to 7,500 Nodes
___________________________________________________________________
 
Scaling Kubernetes to 7,500 Nodes
 
Author : sillysaurusx
Score  : 109 points
Date   : 2021-01-25 19:09 UTC (3 hours ago)
 
web link (openai.com)
w3m dump (openai.com)
 
| sandGorgon wrote:
| > _Pods communicate directly with one another on their pod IP
| addresses with MPI via SSH, not service endpoints._
| 
| This is super interesting. They are using MPI on Kubernetes for
| their AI training, I suppose.
| 
| So they are not using anything like Kubeflow. Any idea which
| framework this is?
| 
| The state of AI training on Kubernetes is not so hot, and this
| would be a good case to learn from. There is Ray Distributed
| today, which claims better performance (as well as a better
| developer experience) than OpenMPI -
| https://www.usenix.org/system/files/osdi18-moritz.pdf
| 
| I wonder why these choices were made.
 
  | benchess wrote:
  | Hi, co-author here!
  | 
  | We use a pretty standard tech stack of PyTorch + NCCL + MPI.
  | We've used both OpenMPI and MPICH to varying degrees.
  | 
  | Kubeflow is interesting, but it solves a slightly different
  | problem of scheduling/coordinating ML workflows on top of Kube.
  | It doesn't get involved with how an ML job communicates within
  | itself cross-node.
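  | 
  | For a rough idea of the wiring (a minimal sketch, not our exact
  | code - the rendezvous address is a placeholder): the MPI
  | launcher starts one process per GPU, each process reads its
  | rank from the MPI environment, and NCCL does the actual GPU
  | collectives.
  | 
  |     import os
  |     import torch
  |     import torch.distributed as dist
  | 
  |     # Rank/size come from the MPI launcher (Open MPI env vars
  |     # shown; MPICH exposes PMI_RANK / PMI_SIZE instead).
  |     rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
  |     world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
  |     local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
  | 
  |     torch.cuda.set_device(local_rank)
  | 
  |     # NCCL handles the GPU-to-GPU collectives; MPI is only the
  |     # launcher / rendezvous mechanism here.
  |     dist.init_process_group(
  |         backend="nccl",
  |         init_method="tcp://10.0.0.1:29500",  # placeholder
  |         rank=rank,
  |         world_size=world_size,
  |     )
  | 
  |     t = torch.ones(1, device="cuda")
  |     dist.all_reduce(t)  # gradient-style all-reduce across pods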
 
    | mmq wrote:
    | OP was probably referring to the MPIOperator, TFOperator,
    | PyTorchOperator, ... They are under the Kubeflow org, but
    | can be deployed independently of Kubeflow itself. Several
    | other projects use those operators to provide abstractions
    | similar to the ones you mention in your blog post, e.g. gang
    | scheduling, cross-node communication, ...
    | 
    | One difference is that these operators use the Kubernetes
    | service interface for communication, generally exposing a
    | headless service for each replica.
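    | 
    | Concretely, "headless" just means a Service with clusterIP:
    | None, so the replica's DNS name resolves straight to its pod
    | IP. A minimal sketch with the Python client (names and labels
    | are made up):
    | 
    |     from kubernetes import client, config
    | 
    |     config.load_kube_config()
    | 
    |     svc = client.V1Service(
    |         metadata=client.V1ObjectMeta(name="trainer-worker-0"),
    |         spec=client.V1ServiceSpec(
    |             cluster_ip="None",  # headless: no virtual IP
    |             selector={"job-name": "trainer",
    |                       "replica-index": "0"},
    |             ports=[client.V1ServicePort(name="ssh", port=22)],
    |         ),
    |     )
    |     client.CoreV1Api().create_namespaced_service("default", svc)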
 
| trhway wrote:
| >Our biggest jobs run MPI, and all pods within the job are
| participating in a single MPI communicator. If any of the
| participating pods die, the entire job halts and needs to be
| restarted.
| 
| Sounds like a strange approach for ML jobs - you'd expect those
| parallel subtasks not to be tightly interconnected with one
| another midway through, so that a failed subtask could simply be
| restarted on its own. With thousands of subtasks running in
| parallel, some are bound to fail for whatever reason. Their
| choice of MPI over HTTP suggests, though, that they pay a
| premium for low latency, which in turn suggests the subtasks
| actively cross-communicate - a typical case for MPI.
 
  | jamesblonde wrote:
  | This is probably data-parallel training with collective all-
  | reduce (Horovod probably, as they are using MPI). Membership of
  | the ring in Horovod is static - you can't recover from a failed
  | worker. You would need to build a consistent hashing ring (like
  | a DHT), so that workers could identify and agree on failing
  | workers (heartbeats) and evict them. None of those goodies in
  | Horovod yet.
  | 
  | The workaround is to have a chief node do periodic
  | checkpointing of the model weights and epoch/iteration, so that
  | you can recover from the checkpoint if a worker fails.
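  | 
  | Something along these lines (a Horovod-style sketch; the model,
  | path and interval are made up):
  | 
  |     import torch
  |     import horovod.torch as hvd
  | 
  |     hvd.init()
  |     torch.cuda.set_device(hvd.local_rank())
  | 
  |     model = torch.nn.Linear(10, 10).cuda()  # stand-in model
  |     opt = torch.optim.SGD(model.parameters(),
  |                           lr=0.01 * hvd.size())
  |     opt = hvd.DistributedOptimizer(
  |         opt, named_parameters=model.named_parameters())
  |     hvd.broadcast_parameters(model.state_dict(), root_rank=0)
  | 
  |     for epoch in range(100):
  |         loss = model(torch.randn(32, 10).cuda()).pow(2).mean()
  |         opt.zero_grad()
  |         loss.backward()
  |         opt.step()  # gradients all-reduced across the ring
  |         if hvd.rank() == 0:
  |             # The chief checkpoints weights + progress; if any
  |             # worker dies, the job restarts from this file
  |             # rather than from scratch.
  |             torch.save({"epoch": epoch,
  |                         "model": model.state_dict()},
  |                        "/checkpoints/latest.pt")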
 
| andyxor wrote:
| Excellent engineering but I wish they also worked on AI
 
| jacques_chester wrote:
| > _That said, strain on the kube-scheduler is spiky. A new job
| may consist of many hundreds of pods all being created at once,
| then return to a relatively low rate of churn._
| 
| Last I checked, the default scheduler places Pods one at a time.
| It might be advantageous to use a gang/batch scheduler like kube-
| batch[0], Poseidon[1] or DCM[2].
| 
| Edit: looks like they're already investigating that approach --
| 
| > _We tried a few things needing a custom scheduler, but ran into
| edge cases that caused conflicts with how normal pods were
| scheduled. Kubernetes 1.18 introduced a plugin architecture for
| the core Kubernetes scheduler, making it much easier to add
| features like this natively. We recently landed on the
| Coscheduling plugin as a good way to solve this problem._
| 
| [0] https://github.com/kubernetes-sigs/kube-batch
| 
| [1] https://github.com/kubernetes-sigs/poseidon
| 
| [2] https://github.com/vmware/declarative-cluster-management
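| 
| For reference, the Coscheduling plugin keys off a PodGroup object
| with a minMember threshold, so none of the pods bind until the
| whole gang fits. A rough sketch (the API group/version and label
| key vary across scheduler-plugins releases, so treat them as
| placeholders):
| 
|     from kubernetes import client, config
| 
|     config.load_kube_config()
| 
|     pod_group = {
|         "apiVersion": "scheduling.sigs.k8s.io/v1alpha1",
|         "kind": "PodGroup",
|         "metadata": {"name": "train-job-123"},
|         "spec": {
|             "minMember": 128,  # hold all pods until 128 fit
|             "scheduleTimeoutSeconds": 600,
|         },
|     }
|     client.CustomObjectsApi().create_namespaced_custom_object(
|         group="scheduling.sigs.k8s.io", version="v1alpha1",
|         namespace="default", plural="podgroups", body=pod_group,
|     )
|     # Each pod in the job then carries a matching label, e.g.
|     # pod-group.scheduling.sigs.k8s.io: train-job-123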
 
| sillysaurusx wrote:
| Gotta hand it to OpenAI. My opinion about them is slowly
| reversing. I was nervous when they went full API, but CLIP is a
| fantastic model that they released for free.
 
| chubot wrote:
| _A large machine learning job spans many nodes and runs most
| efficiently when it has access to all of the hardware resources
| on each node ... So for many of our workloads, a single pod
| occupies the entire node_
| 
| Hm, why not just use the underlying nodes then, without
| Kubernetes?
| 
| Is the underlying cloud that bad at scheduling, and are they
| keeping the VMs warm all the time?
| 
| What are they gaining for this indirection? Is it to get a common
| interface across GCP and other clouds?
| 
|  _Bin-packing or fragmentation is not a common problem_
| 
|  _there's relatively low strain on the scheduler._
| 
|  _That said, strain on the kube-scheduler is spiky_
 
  | benchess wrote:
  | Hi! Co-author here. We do keep the nodes running 24/7, so
  | Kubernetes still provides the scheduling to decide which nodes
  | are free or not at any given time. Generally starting a
  | container on a pre-warmed node is still much much faster than
  | booting a VM. Also, some of our servers are bare-metal.
  | 
  | EDIT: Also don't discount the rest of the Kubernetes ecosystem.
  | It's more than just a scheduler. It provides configuration,
  | secrets management, healthchecks, self-healing, service
  | discovery, ACLs... there are absolutely other ways to solve
  | each of these things. But when starting from scratch there's a
  | wide field of additional questions to answer, problems to
  | solve.
 
    | xorcist wrote:
    | Isn't Kubernetes a pretty lousy scheduler when it doesn't
    | take this into consideration? There are a number of
    | schedulers used in high performance computing that should be
    | able to do a better job.
 
      | chubot wrote:
      | Yeah exactly... This seems closer to an HPC problem, not a
      | "cloud" problem.
      | 
      | Related comment from 6 months ago about Kubernetes use
      | cases: https://lobste.rs/s/kx1jj4/what_has_your_experience_
      | with_kub...
      | 
      | Summary: scale has at least 2 different meanings. Scaling
      | in resources doesn't really mean you need Kubernetes.
      | Scaling in terms of workload diversity is a better use case
      | for it.
      | 
      | Kubernetes is basically a knockoff of Borg, but Borg is
      | designed (or evolved) to run diverse services (search,
      | maps, gmail, etc.; batch and low latency). Ironically most
      | people who run their own Kube clusters don't seem to have
      | much workload diversity.
      | 
      | On the other hand, HPC is usually about scaling in terms of
      | resources: running a few huge jobs on many nodes. A single
      | job will occupy an entire node (and thousands of nodes),
      | which is what's happening here.
      | 
      | I've never used these HPC systems but it looks like they
      | are starting to run on the cloud. Kubernetes may still have
      | been a defensible choice for other reasons, but as someone
      | who used Borg for a long time, it's weird what it's turned
      | into. Sort of like protobufs now have a weird "reflection
      | service". Huh?
      | 
      | https://aws.amazon.com/blogs/publicsector/tag/htcondor/
      | 
      | https://aws.amazon.com/marketplace/pp/Center-for-High-
      | Throug...
 
        | [deleted]
 
        | jacobr1 wrote:
        | Exactly. We migrated to k8s not because we needed better
        | scaling (EC2 auto scaling groups were working reasonably
        | well for us), but because we kept inventing our own ways
        | to do rolling deploys or run scheduled jobs, and had a
        | variety of ways to store secrets. On top of that,
        | developers were increasingly running their own containers
        | with docker compose, test services talking to each
        | other. We migrated to k8s to A) have a way to standardize
        | how to run containerized builds and get the benefit of
        | "it works on my laptop" matching how it works in
        | production (at least functionally), and B) get a common
        | set of patterns for managing deployed software. Resource
        | scheduling only became of interest after we migrated,
        | when we realized that aggregating our workloads let us
        | use things like spot instances without jeopardizing
        | availability.
 
        | vergessenmir wrote:
        | It may be an HPC problem, but I'm not sure the available
        | solutions come close to k8s in terms of functionality,
        | and I'm not talking about scheduling.
        | 
        | I used to work in HPC/grid - it's been a while - but I do
        | remember Condor being clunky even though it had its uses.
        | 
        | And the commercial grid offerings couldn't scale to
        | anywhere near 10k nodes back then (I'm not sure about
        | now, or whether they even exist anymore).
 
        | toomuchtodo wrote:
        | Condor is clunky, but it's still in use in high energy
        | physics, for example (LHC CMS detector data processing).
        | 
        | For greenfield deployments, I would recommend Hashicorp's
        | Nomad ahead of Kubernetes or Condor if your per-server
        | container intent is ~1 (bare metal with a light
        | hypervisor for orchestration), but would still steer you
        | to Kubernetes for microservices and web-based cookie-
        | cutter apps (I know many finance shops using Nomad, and
        | Cloudflare uses it with Consul, so there are no hard and
        | fast rules).
        | 
        | Disclosure: Worked in HPC space managing a cluster for
        | high energy physics. I also use (free version) Nomad for
        | personal cluster workload scheduling.
 
        | jedbrown wrote:
        | Condor and the like are for independent jobs ("throughput
        | computing"), but the authors here are using MPI for
        | tightly coupled jobs. SLURM and Flux are actively
        | developed schedulers for these kinds of jobs.
        | 
        | https://slurm.schedmd.com/
        | 
        | https://flux-framework.readthedocs.io/en/latest/
 
      | AlphaSite wrote:
      | If all you care about is whether a node is in use or not,
      | I think it's fine. You don't need anything complex from
      | the scheduler.
 
    | stonogo wrote:
    | Are you starting from scratch? This architecture seems like a
    | pretty standard HPC deployment with unnecessary
    | containerization involved.
 
    | hamandcheese wrote:
    | Not to mention it's a well-known skillset that can more
    | easily be hired for, as opposed to "come work on our crazy
    | sauce job scheduler, you'll love it!"
 
    | dijit wrote:
    | I feel like we solved this problem over a decade ago (if
    | you're keeping machines warm anyway) with job brokers. Am I
    | somehow mistaken?
 
| dssdd wrote:
| >Pod network traffic shaping
| 
| Have you considered EDT-based rate limiting for Pods? This should
| scale well compared to TBF or HTB. Cilium developers have
| integrated this natively:
| https://cilium.io/blog/2020/11/10/cilium-19#bwmanager
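| 
| For anyone curious: with the bandwidth manager enabled, the limit
| is just the standard egress-bandwidth pod annotation. A small
| sketch with the Python client (pod name and value are arbitrary,
| and in practice you'd set this in the pod template rather than
| patch a live pod):
| 
|     from kubernetes import client, config
| 
|     config.load_kube_config()
| 
|     # Cilium's EDT-based bandwidth manager (like the bandwidth
|     # CNI plugin) reads this annotation to cap pod egress.
|     patch = {"metadata": {"annotations": {
|         "kubernetes.io/egress-bandwidth": "10M"}}}
|     client.CoreV1Api().patch_namespaced_pod(
|         name="trainer-worker-0", namespace="default", body=patch)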
 
  | benchess wrote:
  | Hi, co-author here. Yes we are excited about the potential of
  | this!
 
___________________________________________________________________
(page generated 2021-01-25 23:00 UTC)