|
| sandGorgon wrote:
| > _Pods communicate directly with one another on their pod IP
| addresses with MPI via SSH, not service endpoints._
|
| This is super interesting. They are using MPI on Kubernetes for
| their AI training, I suppose.
|
| So they are not using anything like Kubeflow. Any idea which
| framework this is?
|
| The state of AI training on Kubernetes is not so hot, and this
| would be a good learning opportunity. There is Ray today, which
| claims better performance (as well as a better developer
| experience) than OpenMPI -
| https://www.usenix.org/system/files/osdi18-moritz.pdf
|
| I wonder why these choices were made.
| benchess wrote:
| Hi, co-author here!
|
| We use a pretty standard tech stack of PyTorch + NCCL + MPI.
| We've used both OpenMPI and MPICH to varying degrees.
|
| Kubeflow is interesting, but it solves a slightly different
| problem of scheduling/coordinating ML workflows on top of Kube.
| It doesn't get involved in how an ML job communicates internally
| across nodes.
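|
| Roughly, the shape is something like this (a heavily simplified
| sketch, not our actual code - the rank-0 address is made up):
| MPI bootstraps rank/world size and launches the processes, and
| NCCL carries the actual gradient traffic between GPUs.
|
|   # launched across pods with e.g. `mpirun -n <N> python train.py`
|   import os
|   import torch
|   import torch.distributed as dist
|   from mpi4py import MPI
|
|   comm = MPI.COMM_WORLD  # rank and world size come from MPI
|   # rank-0 pod IP (hypothetical)
|   os.environ.setdefault("MASTER_ADDR", "10.2.3.4")
|   os.environ.setdefault("MASTER_PORT", "29500")
|   torch.cuda.set_device(comm.rank % torch.cuda.device_count())
|
|   # NCCL backend handles the cross-GPU collectives (all-reduce
|   # during the backward pass, etc.)
|   dist.init_process_group(backend="nccl", rank=comm.rank,
|                           world_size=comm.size)
|
|   model = torch.nn.Linear(1024, 1024).cuda()
|   model = torch.nn.parallel.DistributedDataParallel(model)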
| mmq wrote:
| Probably OP was referring to the MPIOperator, TFOperator,
| PyTorchOperator, ... They are under the Kubeflow org, but
| can be deployed independently of Kubeflow itself. Several
| other projects use those operators to provide abstractions
| similar to the ones you mentioned in your blog post, e.g.
| gang scheduling, cross-node communication, ...
|
| One difference is that these operators use the Kubernetes
| service interface for communication, generally exposing a
| headless service for each replica.
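|
| For illustration: a headless service is just clusterIP: None,
| so cluster DNS resolves the service name straight to the pod
| IP. A rough sketch with the official Python client (all names
| here are made up):
|
|   from kubernetes import client, config
|
|   config.load_kube_config()
|
|   # clusterIP "None" makes the service headless: DNS returns
|   # the pod IP directly, with no virtual IP or proxying.
|   svc = client.V1Service(
|       metadata=client.V1ObjectMeta(name="trainjob-worker-0"),
|       spec=client.V1ServiceSpec(
|           cluster_ip="None",
|           selector={"job": "trainjob", "replica": "0"},
|           ports=[client.V1ServicePort(port=22)],  # SSH for mpirun
|       ),
|   )
|   client.CoreV1Api().create_namespaced_service("default", svc)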
| trhway wrote:
| >Our biggest jobs run MPI, and all pods within the job are
| participating in a single MPI communicator. If any of the
| participating pods die, the entire job halts and needs to be
| restarted.
|
| Sounds like a strange approach for ML jobs - I mean, you'd
| expect that all those parallel subtasks aren't individually
| interconnected with each other midway, and that a failed
| subtask could easily be restarted on its own. With thousands
| of subtasks running in parallel, some are bound to fail for
| whatever reason. Their choice of MPI over HTTP suggests,
| though, that they pay a premium for latency, which in turn
| suggests the subtasks actively cross-communicate - a typical
| case for MPI.
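|
| For context, the "single communicator" coupling looks roughly
| like this (a toy mpi4py sketch of mine, not their code): every
| rank blocks in the same collective, so one dead pod stalls or
| aborts all of them.
|
|   # run as: mpirun -n 4 python allreduce_demo.py
|   import numpy as np
|   from mpi4py import MPI
|
|   comm = MPI.COMM_WORLD
|   grad = np.random.rand(1024)   # stand-in for a local gradient
|
|   # Collective over the WHOLE communicator: every rank must
|   # show up. If one pod dies here, the rest block (or the MPI
|   # runtime kills the job) - there is no partial restart.
|   total = np.empty_like(grad)
|   comm.Allreduce(grad, total, op=MPI.SUM)
|   avg = total / comm.size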
| jamesblonde wrote:
| This is probably data-parallel training with collective all-
| reduce (Horovod, probably, since they are using MPI).
| Membership of the ring in Horovod is static - you can't
| recover from a failed worker. You would need to build a
| consistent hashing ring (like a DHT), so that workers could
| identify and agree on failing workers (via heartbeats) and
| evict them. None of those goodies are in Horovod yet.
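|
| Very roughly, the kind of ring I mean (a toy sketch, nowhere
| near production code):
|
|   import bisect
|   import hashlib
|
|   class Ring:
|       """Toy consistent-hash ring: workers hash onto a circle,
|       so evicting one remaps only its own arc of shards."""
|
|       def __init__(self, workers):
|           self._points = sorted(
|               (self._hash(w), w) for w in workers
|           )
|
|       @staticmethod
|       def _hash(key):
|           return int(hashlib.md5(key.encode()).hexdigest(), 16)
|
|       def evict(self, worker):
|           # called when a worker's heartbeats stop
|           self._points = [p for p in self._points
|                           if p[1] != worker]
|
|       def owner(self, shard):
|           hashes = [h for h, _ in self._points]
|           i = bisect.bisect(hashes, self._hash(shard)) % len(hashes)
|           return self._points[i][1]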
|
| The workaround is to have a chief node do periodic
| checkpointing of the model weights and epoch/iteration, so
| that you can recover from the checkpoint if a worker fails.
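|
| In code the workaround is just a few lines on the chief
| (sketch assuming plain PyTorch; with Horovod, hvd.rank()
| plays the same role; the checkpoint path is made up):
|
|   import torch
|
|   def maybe_checkpoint(rank, model, optimizer, epoch, step):
|       # Only the chief (rank 0) writes; other workers train on.
|       if rank == 0 and step % 1000 == 0:
|           torch.save(
|               {"model": model.state_dict(),
|                "optim": optimizer.state_dict(),
|                "epoch": epoch, "step": step},
|               "/checkpoints/latest.pt",  # hypothetical shared path
|           )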
| andyxor wrote:
| Excellent engineering, but I wish they also worked on AI.
| jacques_chester wrote:
| > _That said, strain on the kube-scheduler is spiky. A new job
| may consist of many hundreds of pods all being created at once,
| then return to a relatively low rate of churn._
|
| Last I checked, the default scheduler places Pods one at a time.
| It might be advantageous to use a gang/batch scheduler like kube-
| batch[0], Poseidon[1] or DCM[2].
|
| Edit: looks like they're already investigating that approach --
|
| > _We tried a few things needing a custom scheduler, but ran into
| edge cases that caused conflicts with how normal pods were
| scheduled. Kubernetes 1.18 introduced a plugin architecture for
| the core Kubernetes scheduler, making it much easier to add
| features like this natively. We recently landed on the
| Coscheduling plugin as a good way to solve this problem._
|
| [0] https://github.com/kubernetes-sigs/kube-batch
|
| [1] https://github.com/kubernetes-sigs/poseidon
|
| [2] https://github.com/vmware/declarative-cluster-management
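|
| For anyone trying the Coscheduling plugin: as I understand it
| (double-check against the kubernetes-sigs/scheduler-plugins
| repo), you create a PodGroup and label each pod into it, so
| the scheduler won't start any pod until minMember of them fit.
| A hypothetical sketch with the Python client:
|
|   from kubernetes import client, config
|
|   config.load_kube_config()
|
|   # 1) PodGroup: "don't schedule any pod until all 128 fit"
|   client.CustomObjectsApi().create_namespaced_custom_object(
|       group="scheduling.x-k8s.io", version="v1alpha1",
|       namespace="default", plural="podgroups",
|       body={
|           "apiVersion": "scheduling.x-k8s.io/v1alpha1",
|           "kind": "PodGroup",
|           "metadata": {"name": "train-job"},
|           "spec": {"minMember": 128},
|       },
|   )
|
|   # 2) each worker pod carries the matching pod-group label
|   labels = {"scheduling.x-k8s.io/pod-group": "train-job"}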
| sillysaurusx wrote:
| Gotta hand it to OpenAI. My opinion about them is slowly
| reversing. I was nervous when they went full API, but CLIP is a
| fantastic model that they released for free.
| chubot wrote:
| _A large machine learning job spans many nodes and runs most
| efficiently when it has access to all of the hardware resources
| on each node ... So for many of our workloads, a single pod
| occupies the entire node_
|
| Hm, why not just use the underlying nodes then, without
| Kubernetes?
|
| Is the underlying cloud that bad at scheduling, and are they
| keeping the VMs warm all the time?
|
| What are they gaining from this indirection? Is it to get a
| common interface across GCP and other clouds?
|
| _Bin-packing or fragmentation is not a common problem_
|
| _there's relatively low strain on the scheduler._
|
| _That said, strain on the kube-scheduler is spiky_
| benchess wrote:
| Hi! Co-author here. We do keep the nodes running 24/7, so
| Kubernetes still provides the scheduling to decide which nodes
| are free or not at any given time. Generally starting a
| container on a pre-warmed node is still much much faster than
| booting a VM. Also, some of our servers are bare-metal.
|
| EDIT: Also don't discount the rest of the Kubernetes ecosystem.
| It's more than just a scheduler. It provides configuration,
| secrets management, healthchecks, self-healing, service
| discovery, ACLs... there are absolutely other ways to solve
| each of these things. But when starting from scratch there's a
| wide field of additional questions to answer, problems to
| solve.
| xorcist wrote:
| Isn't Kubernetes a pretty lousy scheduler when it doesn't
| take this into consideration? There are a number of
| schedulers used in high performance computing that should be
| able to do a better job.
| chubot wrote:
| Yeah exactly... This seems closer to an HPC problem, not a
| "cloud" problem.
|
| Related comment from 6 months ago about Kubernetes use cases:
| https://lobste.rs/s/kx1jj4/what_has_your_experience_with_kub...
|
| Summary: scale has at least 2 different meanings. Scaling
| in resources doesn't really mean you need Kubernetes.
| Scaling in terms of workload diversity is a better use case
| for it.
|
| Kubernetes is basically a knockoff of Borg, but Borg is
| designed (or evolved) to run diverse services (search,
| maps, gmail, etc.; batch and low latency). Ironically most
| people who run their own Kube clusters don't seem to have
| much workload diversity.
|
| On the other hand, HPC is usually about scaling in terms of
| resources: running a few huge jobs on many nodes. A single
| job will occupy an entire node (and thousands of nodes),
| which is what's happening here.
|
| I've never used these HPC systems but it looks like they
| are starting to run on the cloud. Kubernetes may still have
| been a defensible choice for other reasons, but as someone
| who used Borg for a long time, it's weird what it's turned
| into. Sort of like protobufs now have a weird "reflection
| service". Huh?
|
| https://aws.amazon.com/blogs/publicsector/tag/htcondor/
|
| https://aws.amazon.com/marketplace/pp/Center-for-High-Throug...
| [deleted]
| jacobr1 wrote:
| Exactly. We migrated to k8s not because we needed better
| scaling (EC2 auto scaling groups were working reasonably
| well for us) but because we kept inventing our own ways to
| do rolling deploys or run scheduled jobs, and had a
| variety of ways to store secrets. On top of that,
| developers were increasingly running their own containers,
| with docker-compose test services talking to each other.
| We migrated to k8s to A) have a way to standardize how to
| run containerized builds and get the benefit of "it works
| on my laptop" matching how it works in production (at
| least functionally), and B) get a common set of patterns
| for managing deployed software. Resource scheduling only
| became of interest after we migrated, when we realized
| that aggregating our workloads allowed us to use things
| like spot instances without jeopardizing availability.
| vergessenmir wrote:
| It may be an HPC problem, but I'm not sure the available
| solutions come close to k8s in terms of functionality, and
| I'm not talking about scheduling.
|
| I used to work in HPC/grid - it's been a while, but I do
| remember Condor being clunky even though it had its uses.
|
| And the commercial grid offerings couldn't scale to
| almost 10k nodes back then (I'm not sure about now, or
| whether they even exist anymore).
| toomuchtodo wrote:
| Condor is clunky, but still in use in high energy
| physics, for example (LHC CMS detector data processing).
|
| For greenfield deployments, I would recommend HashiCorp's
| Nomad before Kubernetes or Condor if your per-server
| container intent is ~1 (bare metal with a light
| hypervisor for orchestration), but would still steer you
| to Kubernetes for microservices and web-based cookie-
| cutter apps (I know many finance shops using Nomad, but
| Cloudflare uses it with Consul, so no hard and fast
| rules).
|
| Disclosure: I worked in the HPC space managing a cluster
| for high energy physics. I also use (the free version of)
| Nomad for personal cluster workload scheduling.
| jedbrown wrote:
| Condor and the like are for independent jobs ("throughput
| computing"), but the authors here are using MPI for
| tightly coupled jobs. SLURM and Flux are actively
| developed schedulers for this kind of job.
|
| https://slurm.schedmd.com/
|
| https://flux-framework.readthedocs.io/en/latest/
| AlphaSite wrote:
| If all you care about is whether a node is in use or not,
| I think it's fine. You don't need anything complex from
| the scheduler.
| stonogo wrote:
| Are you starting from scratch? This architecture seems like a
| pretty standard HPC deployment with unnecessary
| containerization involved.
| hamandcheese wrote:
| Not to mention it's a well-known skillset that can more
| easily be hired for, as opposed to "come work on our
| crazy-sauce job scheduler, you'll love it!"
| dijit wrote:
| I feel like we solved this problem over a decade ago (if
| you're keeping machines warm anyway) with job brokers. Am I
| somehow mistaken?
| dssdd wrote:
| >Pod network traffic shaping
|
| Have you considered EDT-based rate limiting for Pods? This should
| scale well compared to TBF or HTB. Cilium developers have
| integrated this natively:
| https://cilium.io/blog/2020/11/10/cilium-19#bwmanager
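|
| The nice part is that, from the workload side, it's just the
| standard pod annotation; the bandwidth manager enforces it via
| EDT in eBPF rather than queueing through qdiscs. A sketch with
| the Python client (pod name, image, and the 10G figure are all
| made up):
|
|   from kubernetes import client, config
|
|   config.load_kube_config()
|
|   pod = client.V1Pod(
|       metadata=client.V1ObjectMeta(
|           name="train-worker-0",
|           # Cilium's bandwidth manager picks this up and
|           # enforces it with EDT-based rate limiting.
|           annotations={"kubernetes.io/egress-bandwidth": "10G"},
|       ),
|       spec=client.V1PodSpec(containers=[client.V1Container(
|           name="worker", image="example/train:latest",
|       )]),
|   )
|   client.CoreV1Api().create_namespaced_pod("default", pod)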
| benchess wrote:
| Hi, co-author here. Yes we are excited about the potential of
| this!