[HN Gopher] Scaling Kubernetes to 7,500 Nodes
___________________________________________________________________
 
Scaling Kubernetes to 7,500 Nodes
 
Author : sillysaurusx
Score  : 109 points
Date   : 2021-01-25 19:09 UTC (3 hours ago)
 
web link (openai.com)
w3m dump (openai.com)
 
| sandGorgon wrote:
| > _Pods communicate directly with one another on their pod IP
| addresses with MPI via SSH, not service endpoints._
| 
| This is super interesting. They are using MPI on Kubernetes for
| their AI training, I suppose.
| 
| So they are not using anything like Kubeflow. Any idea which
| framework this is?
| 
| The state of AI training on Kubernetes is not so hot, and this
| would be a good case to learn from. There is Ray Distributed
| today, which claims better performance (as well as a better
| developer experience) than OpenMPI -
| https://www.usenix.org/system/files/osdi18-moritz.pdf
| 
| I wonder why these choices were made.
 
  | benchess wrote:
  | Hi, co-author here!
  | 
  | We use a pretty standard tech stack of PyTorch + NCCL + MPI.
  | We've used both OpenMPI and MPICH to varying degrees.
  | 
  | Kubeflow is interesting, but it solves a slightly different
  | problem of scheduling/coordinating ML workflows on top of Kube.
  | It doesn't get involved with how an ML job communicates within
  | itself cross-node.
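  | 
  | For a rough idea of the wiring (a minimal sketch, not our exact
  | code - the rendezvous address is a placeholder): the MPI
  | launcher starts one process per GPU, each process reads its
  | rank from the MPI environment, and NCCL does the actual GPU
  | collectives.
  | 
  |     import os
  |     import torch
  |     import torch.distributed as dist
  | 
  |     # Rank/size come from the MPI launcher (Open MPI env vars
  |     # shown; MPICH exposes PMI_RANK / PMI_SIZE instead).
  |     rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
  |     world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
  |     local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
  | 
  |     torch.cuda.set_device(local_rank)
  | 
  |     # NCCL handles the GPU-to-GPU collectives; MPI is only the
  |     # launcher / rendezvous mechanism here.
  |     dist.init_process_group(
  |         backend="nccl",
  |         init_method="tcp://10.0.0.1:29500",  # placeholder
  |         rank=rank,
  |         world_size=world_size,
  |     )
  | 
  |     t = torch.ones(1, device="cuda")
  |     dist.all_reduce(t)  # gradient-style all-reduce across pods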
 
    | mmq wrote:
    | OP was probably referring to the MPIOperator, TFOperator,
    | PyTorchOperator, ... They are under the Kubeflow org, but
    | can be deployed independently of Kubeflow itself. Several
    | other projects use those operators to provide abstractions
    | similar to the ones you mention in your blog post, e.g. gang
    | scheduling, cross-node communication, ...
    | 
    | One difference is that these operators use the Kubernetes
    | service interface for communication, generally exposing a
    | headless service for each replica.
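    | 
    | Concretely, "headless" just means a Service with clusterIP:
    | None, so the replica's DNS name resolves straight to its pod
    | IP. A minimal sketch with the Python client (names and labels
    | are made up):
    | 
    |     from kubernetes import client, config
    | 
    |     config.load_kube_config()
    | 
    |     svc = client.V1Service(
    |         metadata=client.V1ObjectMeta(name="trainer-worker-0"),
    |         spec=client.V1ServiceSpec(
    |             cluster_ip="None",  # headless: no virtual IP
    |             selector={"job-name": "trainer",
    |                       "replica-index": "0"},
    |             ports=[client.V1ServicePort(name="ssh", port=22)],
    |         ),
    |     )
    |     client.CoreV1Api().create_namespaced_service("default", svc)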
 
| trhway wrote:
| >Our biggest jobs run MPI, and all pods within the job are
| participating in a single MPI communicator. If any of the
| participating pods die, the entire job halts and needs to be
| restarted.
| 
| Sounds like a strange approach for ML jobs - you'd expect those
| parallel subtasks not to be tightly interconnected with one
| another midway through, so that a failed subtask could simply be
| restarted on its own. With thousands of subtasks running in
| parallel, some are bound to fail for whatever reason. Their
| choice of MPI over HTTP suggests, though, that they pay a
| premium for low latency, which in turn suggests the subtasks
| actively cross-communicate - a typical case for MPI.
 
  | jamesblonde wrote:
  | This is probably data-parallel training with collective all-
  | reduce (Horovod probably, as they are using MPI). Membership of
  | the ring in Horovod is static - you can't recover from a failed
  | worker. You would need to build a consistent hashing ring (like
  | a DHT), so that workers could identify and agree on failing
  | workers (heartbeats) and evict them. None of those goodies in
  | Horovod yet.
  | 
  | The workaround is to have a chief node do periodic
  | checkpointing of the model weights and epoch/iteration, so that
  | you can recover from the checkpoint if a worker fails.
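  | 
  | Something along these lines (a Horovod-style sketch; the model,
  | path and interval are made up):
  | 
  |     import torch
  |     import horovod.torch as hvd
  | 
  |     hvd.init()
  |     torch.cuda.set_device(hvd.local_rank())
  | 
  |     model = torch.nn.Linear(10, 10).cuda()  # stand-in model
  |     opt = torch.optim.SGD(model.parameters(),
  |                           lr=0.01 * hvd.size())
  |     opt = hvd.DistributedOptimizer(
  |         opt, named_parameters=model.named_parameters())
  |     hvd.broadcast_parameters(model.state_dict(), root_rank=0)
  | 
  |     for epoch in range(100):
  |         loss = model(torch.randn(32, 10).cuda()).pow(2).mean()
  |         opt.zero_grad()
  |         loss.backward()
  |         opt.step()  # gradients all-reduced across the ring
  |         if hvd.rank() == 0:
  |             # The chief checkpoints weights + progress; if any
  |             # worker dies, the job restarts from this file
  |             # rather than from scratch.
  |             torch.save({"epoch": epoch,
  |                         "model": model.state_dict()},
  |                        "/checkpoints/latest.pt")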
 
| andyxor wrote:
| Excellent engineering but I wish they also worked on AI
 
| jacques_chester wrote:
| > _That said, strain on the kube-scheduler is spiky. A new job
| may consist of many hundreds of pods all being created at once,
| then return to a relatively low rate of churn._
| 
| Last I checked, the default scheduler places Pods one at a time.
| It might be advantageous to use a gang/batch scheduler like kube-
| batch[0], Poseidon[1] or DCM[2].
| 
| Edit: looks like they're already investigating that approach --
| 
| > _We tried a few things needing a custom scheduler, but ran into
| edge cases that caused conflicts with how normal pods were
| scheduled. Kubernetes 1.18 introduced a plugin architecture for
| the core Kubernetes scheduler, making it much easier to add
| features like this natively. We recently landed on the
| Coscheduling plugin as a good way to solve this problem._
| 
| [0] https://github.com/kubernetes-sigs/kube-batch
| 
| [1] https://github.com/kubernetes-sigs/poseidon
| 
| [2] https://github.com/vmware/declarative-cluster-management
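| 
| For reference, the Coscheduling plugin keys off a PodGroup object
| with a minMember threshold, so none of the pods bind until the
| whole gang fits. A rough sketch (the API group/version and label
| key vary across scheduler-plugins releases, so treat them as
| placeholders):
| 
|     from kubernetes import client, config
| 
|     config.load_kube_config()
| 
|     pod_group = {
|         "apiVersion": "scheduling.sigs.k8s.io/v1alpha1",
|         "kind": "PodGroup",
|         "metadata": {"name": "train-job-123"},
|         "spec": {
|             "minMember": 128,  # hold all pods until 128 fit
|             "scheduleTimeoutSeconds": 600,
|         },
|     }
|     client.CustomObjectsApi().create_namespaced_custom_object(
|         group="scheduling.sigs.k8s.io", version="v1alpha1",
|         namespace="default", plural="podgroups", body=pod_group,
|     )
|     # Each pod in the job then carries a matching label, e.g.
|     # pod-group.scheduling.sigs.k8s.io: train-job-123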
 
| sillysaurusx wrote:
| Gotta hand it to OpenAI. My opinion about them is slowly
| reversing. I was nervous when they went full API, but CLIP is a
| fantastic model that they released for free.
 
| chubot wrote:
| _A large machine learning job spans many nodes and runs most
| efficiently when it has access to all of the hardware resources
| on each node ... So for many of our workloads, a single pod
| occupies the entire node_
| 
| Hm, why not just use the underlying nodes then, without
| Kubernetes?
| 
| Is the underlying cloud that bad at scheduling, and are they
| keeping the VMs warm all the time?
| 
| What are they gaining for this indirection? Is it to get a common
| interface across GCP and other clouds?
| 
|  _Bin-packing or fragmentation is not a common problem_
| 
|  _there's relatively low strain on the scheduler._
| 
|  _That said, strain on the kube-scheduler is spiky_
 
  | benchess wrote:
  | Hi! Co-author here. We do keep the nodes running 24/7, so
  | Kubernetes still provides the scheduling to decide which nodes
  | are free or not at any given time. Generally starting a
  | container on a pre-warmed node is still much much faster than
  | booting a VM. Also, some of our servers are bare-metal.
  | 
  | EDIT: Also don't discount the rest of the Kubernetes ecosystem.
  | It's more than just a scheduler. It provides configuration,
  | secrets management, healthchecks, self-healing, service
  | discovery, ACLs... there are absolutely other ways to solve
  | each of these things. But when starting from scratch there's a
  | wide field of additional questions to answer, problems to
  | solve.
 
    | xorcist wrote:
    | Isn't Kubernetes a pretty lousy scheduler when it doesn't
    | take this into consideration? There are a number of
    | schedulers used in high performance computing that should be
    | able to do a better job.
 
      | chubot wrote:
      | Yeah exactly... This seems closer to an HPC problem, not a
      | "cloud" problem.
      | 
      | Related comment from 6 months ago about Kubernetes use
      | cases: https://lobste.rs/s/kx1jj4/what_has_your_experience_
      | with_kub...
      | 
      | Summary: scale has at least 2 different meanings. Scaling
      | in resources doesn't really mean you need Kubernetes.
      | Scaling in terms of workload diversity is a better use case
      | for it.
      | 
      | Kubernetes is basically a knockoff of Borg, but Borg is
      | designed (or evolved) to run diverse services (search,
      | maps, gmail, etc.; batch and low latency). Ironically most
      | people who run their own Kube clusters don't seem to have
      | much workload diversity.
      | 
      | On the other hand, HPC is usually about scaling in terms of
      | resources: running a few huge jobs on many nodes. A single
      | job will occupy an entire node (and thousands of nodes),
      | which is what's happening here.
      | 
      | I've never used these HPC systems but it looks like they
      | are starting to run on the cloud. Kubernetes may still have
      | been a defensible choice for other reasons, but as someone
      | who used Borg for a long time, it's weird what it's turned
      | into. Sort of like protobufs now have a weird "reflection
      | service". Huh?
      | 
      | https://aws.amazon.com/blogs/publicsector/tag/htcondor/
      | 
      | https://aws.amazon.com/marketplace/pp/Center-for-High-
      | Throug...
 
        | [deleted]
 
        | jacobr1 wrote:
        | Exactly. We migrated to k8s not because we needed better
        | scaling (EC2 auto scaling groups were working reasonably
        | well for us), but because we kept inventing our own ways
        | to do rolling deploys or run scheduled jobs, and had a
        | variety of ways to store secrets. On top of that,
        | developers were increasingly running their own containers
        | with docker compose, test services talking to each
        | other. We migrated to k8s to A) have a way to standardize
        | how to run containerized builds and get the benefit of
        | "it works on my laptop" matching how it works in
        | production (at least functionally), and B) get a common
        | set of patterns for managing deployed software. Resource
        | scheduling only became of interest after we migrated,
        | when we realized that aggregating our workloads let us
        | use things like spot instances without jeopardizing
        | availability.
 
        | vergessenmir wrote:
        | It may be an HPC problem, but I'm not sure the available
        | solutions come close to k8s in terms of functionality,
        | and I'm not talking about scheduling.
        | 
        | I used to work in HPC/grid - it's been a while - but I do
        | remember Condor being clunky even though it had its uses.
        | 
        | And the commercial grid offerings couldn't scale to
        | anywhere near 10k nodes back then (I'm not sure about
        | now, or whether they even exist anymore).
 
        | toomuchtodo wrote:
        | Condor is clunky, but it's still in use in high energy
        | physics, for example (LHC CMS detector data processing).
        | 
        | For greenfield deployments, I would recommend Hashicorp's
        | Nomad ahead of Kubernetes or Condor if your per-server
        | container intent is ~1 (bare metal with a light
        | hypervisor for orchestration), but would still steer you
        | to Kubernetes for microservices and web-based cookie-
        | cutter apps (I know many finance shops using Nomad, and
        | Cloudflare uses it with Consul, so there are no hard and
        | fast rules).
        | 
        | Disclosure: Worked in HPC space managing a cluster for
        | high energy physics. I also use (free version) Nomad for
        | personal cluster workload scheduling.
 
        | jedbrown wrote:
        | Condor and the like are for independent jobs ("throughput
        | computing"), but the authors here are using MPI for
        | tightly coupled jobs. SLURM and Flux are actively
        | developed schedulers for these kinds of jobs.
        | 
        | https://slurm.schedmd.com/
        | 
        | https://flux-framework.readthedocs.io/en/latest/
 
      | AlphaSite wrote:
      | If all you care about is whether a node is in use or not,
      | I think it's fine. You don't need anything complex from
      | the scheduler.
 
    | stonogo wrote:
    | Are you starting from scratch? This architecture seems like a
    | pretty standard HPC deployment with unnecessary
    | containerization involved.
 
    | hamandcheese wrote:
    | Not to mention it's a well-known skillset that can more
    | easily be hired for, as opposed to "come work on our crazy
    | sauce job scheduler, you'll love it!"
 
    | dijit wrote:
    | I feel like we solved this problem over a decade ago (if
    | you're keeping machines warm anyway) with job brokers. Am I
    | somehow mistaken?
 
| dssdd wrote:
| >Pod network traffic shaping
| 
| Have you considered EDT-based rate limiting for Pods? This should
| scale well compared to TBF or HTB. Cilium developers have
| integrated this natively:
| https://cilium.io/blog/2020/11/10/cilium-19#bwmanager
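| 
| For anyone curious: with the bandwidth manager enabled, the limit
| is just the standard egress-bandwidth pod annotation. A small
| sketch with the Python client (pod name and value are arbitrary,
| and in practice you'd set this in the pod template rather than
| patch a live pod):
| 
|     from kubernetes import client, config
| 
|     config.load_kube_config()
| 
|     # Cilium's EDT-based bandwidth manager (like the bandwidth
|     # CNI plugin) reads this annotation to cap pod egress.
|     patch = {"metadata": {"annotations": {
|         "kubernetes.io/egress-bandwidth": "10M"}}}
|     client.CoreV1Api().patch_namespaced_pod(
|         name="trainer-worker-0", namespace="default", body=patch)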
 
  | benchess wrote:
  | Hi, co-author here. Yes we are excited about the potential of
  | this!
 
___________________________________________________________________
(page generated 2021-01-25 23:00 UTC)