[HN Gopher] Writing a Kubernetes Operator
___________________________________________________________________
 
Writing a Kubernetes Operator
 
Author : todsacerdoti
Score  : 141 points
Date   : 2023-03-09 13:41 UTC (9 hours ago)
 
web link (metalbear.co)
w3m dump (metalbear.co)
 
| sigwinch28 wrote:
| I find myself conflicted between two approaches at work:
| 
| 1. Write a provider/extension/whatever for a tool like Terraform
| or Pulumi. I live in a world where the infrastructure doesn't
| move underneath my feet. I am the source of truth. I feel like I
| only need to reconcile changes when _I_ make changes to my IaC
| repositories.
| 
| 2. I could write something that exists in a control plane, like
| Kubernetes operators or Crossplane. I live in a world where I
| look at the world, find the delta between current state and
| desired state, then try to reconcile. This is an endless loop.
| 
| I feel like these are different approaches with the same goal.
| Why should I decide either way beyond tossing a coin?
| 
| Some use cases:
| 
| - an internal enterprise DNS system which is not standards-
| compliant with the world at large
| 
| - an internal certificate authority and certificate issuing
| system.
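| 
| The loop described in option 2 can be sketched in a few lines
| of Python. This is only an illustration under assumptions: the
| function names `diff` and `reconcile` and the DNS-style records
| are hypothetical, and a dict stands in for the real system.

```python
def diff(desired, current):
    """Return the changes needed to turn `current` into `desired`.

    Both arguments are plain dicts mapping a resource name to its spec.
    """
    to_create = {k: v for k, v in desired.items() if k not in current}
    to_update = {k: v for k, v in desired.items()
                 if k in current and current[k] != v}
    to_delete = [k for k in current if k not in desired]
    return to_create, to_update, to_delete


def reconcile(desired, current):
    """Apply one reconciliation pass to `current` in place.

    Returns True if anything changed, so a caller can tell when the
    observed state has converged on the desired state.
    """
    to_create, to_update, to_delete = diff(desired, current)
    current.update(to_create)
    current.update(to_update)
    for name in to_delete:
        del current[name]
    return bool(to_create or to_update or to_delete)


# Hypothetical internal DNS records: `desired` would come from the IaC
# repo, `observed` is what the (simulated) DNS system currently holds.
desired = {"app.corp": "10.0.0.5", "db.corp": "10.0.0.9"}
observed = {"app.corp": "10.0.0.4", "old.corp": "10.0.0.1"}

changed_first = reconcile(desired, observed)   # drift is corrected
changed_second = reconcile(desired, observed)  # nothing left to do
```

| In this framing, an IaC tool calls `reconcile` only when
| `desired` changes, while a control-plane operator keeps calling
| it, so drift in `observed` gets corrected too.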
 
| debarshri wrote:
| A better way to write an operator these days is to use
| kubebuilder [1].
| 
| My complaint is that I have seen orgs write operators for random
| stuff, often reinventing the wheel. A lot of operators in orgs
| are the result of resume-driven development. Having said that,
| they often come in handy for complex orchestration.
| 
| [1] https://github.com/kubernetes-sigs/kubebuilder
 
  | casperc wrote:
  | What would be a good example where an operator would make
  | sense?
 
    | EdwardDiego wrote:
    | I worked on an operator that manages Kafka in K8s. If you
    | want to upgrade the brokers in a Kafka cluster, you generally
    | do a rolling upgrade to ensure availability.
    | 
    | The operator will do this for you, you just update the
    | version of the broker in the CR spec, it notices, and then
    | applies the change.
    | 
    | Likewise, some configuration options can be applied at
    | runtime, some need the broker to be restarted to be applied,
    | the operator knows which are which, and will again manage the
    | process of a rolling restart if needed to apply the change.
    | 
    | You can also define topics and users as custom resources, so
    | have a nice Gitops approach to declaring resources.
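    | 
    | A rough Python sketch of the two decisions described above:
    | whether a config change needs a restart, and how a rolling
    | upgrade proceeds one broker at a time. All names here
    | (`needs_restart`, `rolling_upgrade`, `RUNTIME_KEYS`, the
    | setting name and version strings) are hypothetical, and the
    | health check is a caller-supplied stand-in.

```python
# Hypothetical setting names: which keys can change at runtime is an
# assumption made for this sketch, not Kafka's actual list.
RUNTIME_KEYS = {"example.runtime.setting"}


def needs_restart(old_config, new_config):
    """Return True if any changed key cannot be applied at runtime."""
    changed = {
        key
        for key in old_config.keys() | new_config.keys()
        if old_config.get(key) != new_config.get(key)
    }
    return bool(changed - RUNTIME_KEYS)


def rolling_upgrade(brokers, target_version, is_healthy):
    """Upgrade brokers one at a time, halting if the cluster degrades.

    `brokers` is a list of dicts with "id" and "version" keys.
    `is_healthy` is a callback reporting whether the cluster has
    recovered after the latest broker was restarted.
    Returns the ids of the brokers that were touched.
    """
    upgraded = []
    for broker in brokers:
        if broker["version"] == target_version:
            continue  # already at the desired version
        broker["version"] = target_version  # stand-in for a restart
        upgraded.append(broker["id"])
        if not is_healthy(brokers):
            break  # stop the rollout; remaining brokers stay untouched
    return upgraded


cluster = [
    {"id": 0, "version": "3.3"},
    {"id": 1, "version": "3.3"},
    {"id": 2, "version": "3.4"},
]
done = rolling_upgrade(cluster, "3.4", lambda brokers: True)
```

    | A real operator would presumably wait and retry rather than
    | give up when the health check fails, but the one-at-a-time
    | ordering is the core of the idea.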
 
    | debarshri wrote:
    | There is a whole list of public operators that you can find
    | on OperatorHub [1].
    | 
    | [1] https://operatorhub.io/
 
    | spenczar5 wrote:
    | Operators make sense when you need to automatically modify
    | resources in response to changes in the cluster's state.
    | 
    | An example that has come up for me is an operator for a Kafka
    | Schema Registry. This is a service that needs some
    | credentials in a somewhat obscure format so it can
    | communicate very directly with a Kafka broker. If the
    | broker's certificates (or CA) are modified, then the Schema
    | Registry needs to have new credentials generated, and needs
    | to be restarted. But the registry shouldn't (obviously) have
    | direct access to the broker's certificates. Instead, there's
    | a more-privileged subsystem which orchestrates that dance;
    | that's the operator.
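    | 
    | As a sketch of that dance, assuming the operator can read the
    | broker's CA as PEM bytes: remember a fingerprint of the last
    | CA seen and only act when it changes. The function and action
    | names below are hypothetical, written in Python for brevity.

```python
import hashlib


def fingerprint(pem_bytes):
    """Return a stable digest of the certificate material."""
    return hashlib.sha256(pem_bytes).hexdigest()


def sync_registry_credentials(broker_ca_pem, state):
    """Decide what to do for the registry given the broker's current CA.

    `state` is a dict the operator persists between runs; it holds the
    fingerprint of the CA the registry credentials were last built from.
    Returns a list of actions for the privileged side to carry out.
    """
    current = fingerprint(broker_ca_pem)
    if state.get("ca_fingerprint") == current:
        return []  # nothing changed, leave the registry alone
    state["ca_fingerprint"] = current
    return ["regenerate-credentials", "restart-registry"]


state = {}
first = sync_registry_credentials(b"ca-v1", state)   # initial setup
again = sync_registry_credentials(b"ca-v1", state)   # no change
rotated = sync_registry_credentials(b"ca-v2", state)  # CA was rotated
```

    | The registry itself never sees the broker's certificates; it
    | only receives whatever the operator generates for it.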
 
    | sleepybrett wrote:
    | kubernetes itself is a collection of controllers/operators.
    | It takes manifests like pods and uses that information to
    | create the workload in your container runtime on a node with
    | the resources it needs.
 
    | debarshri wrote:
    | A good example from my perspective is when you are delivering
    | an application as a third-party vendor and you wish to
    | automate a lot of operational work such as backups, scaling
    | based on events, and other reactions to cluster events. It
    | starts becoming very valuable. I am sure there are many more
    | use cases for it.
 
      | jrockway wrote:
      | I would not write an operator to do any of these things. To
      | me an "operator" strongly implies the existence of a CRD
      | and the need to manage it. So for autoscaling, HPA/VPA are
      | built into k8s. Backups should be an application-level
      | feature; when the "take a backup" RPC or time arrives, take
      | a backup and dump it in configured object storage.
      | Automating stuff based on cluster events also doesn't
      | require an operator; call client.V1().Whatever().Watch and
      | do what you need to do.
      | 
      | The only moderately justifiable operator I've ever seen is
      | cert-manager. Even then, one wonders what it would be like
      | if it just updated a Secret every 3 months based on a hard-
      | coded config passed to it, and skipped the CRDs.
 
    | jhoelzel wrote:
    | - creating databases for your app on the fly.
    | 
    | - scaling applications up and down based on time instead of
    | demand, or on other non-metric triggers
    | 
    | - Extending kubernetes to understand your workload
    | 
    | - Automating configuration and management of complex
    | applications
    | 
    | - Managing legacy applications that cannot be easily
    | containerized or migrated to the cloud.
    | 
    | if you love k8s you'll love operators
    | 
    | the list is endless!
 
      | dilyevsky wrote:
      | With respect, being "in love" with a technology is not a
      | good way to go about it - it leads to tunnel vision
 
    | remram wrote:
    | An operator operates something, i.e. it actively makes
    | changes. If you want to deploy an application, a Helm Chart
    | is the correct way. It will allow you to have deterministic
    | deployment, that you can duplicate multiple times in your
    | cluster, and you can dry-run it and see the generated
    | manifests.
    | 
    | An operator is needed when you can't just deploy and forget
    | about it. An example is the Prometheus operator, which will
    | track annotations created by users to configure the scraping
    | configuration of your Prometheus instances. Another example
    | is cert-manager, which gets certificates into secrets based
    | on Certificate and Ingress objects, renews them automatically
    | before expiry, and does that by creating ingresses picked up
    | by your ingress controller.
    | 
    | The advantage of an operator is that it will react to stuff
    | happening in the cluster. The drawback is that it reacts to
    | stuff happening, potentially doing unexpected things because
    | changes happen at any time and you can't dry-run them.
    | Another drawback is that they are usually global, so you
    | can't run multiple versions at the same time for different
    | namespaces (mainly because custom resource definitions are
    | global).
    | 
    | Unfortunately many people think packaging an application =
    | creating an operator, and that operator does nothing a chart
    | couldn't do.
 
      | stasmo wrote:
      | The CockroachDB example in the article is a perfect
      | example of an unnecessary CRD. Acquiring certificates
      | within a Kubernetes cluster is a common requirement for
      | lots of applications and there are lots of solutions out
      | there. Is it really necessary to spend time writing your
      | own operator? Now you have a second helm chart and an
      | operator to maintain. Now you have to explain to people
      | which chart to use. You could get rid of the non-operator
      | chart but now I have operators within the cluster acquiring
      | certificates in 5 or 6 different ways. Do I have to
      | configure the credentials for 6 operators so they can make
      | Route53 DNS challenge records?
      | 
      | Edit: maybe we could shift left and ask the app developers
      | to add certificate acquisition directly into the app
      | source.
 
        | outworlder wrote:
        | > Do I have to configure the credentials for 6 operators
        | so they can make Route53 DNS challenge records?
        | 
        | A certificate for service to service communication does
        | not have to correspond to a public endpoint.
 
      | mdaniel wrote:
      | > that operator does nothing a chart couldn't do.
      | 
      | Or it can be _actively harmful_ when it doesn't do any
      | error checking whatsoever, making it less accurate than
      | `helm template` would be. Relatedly, it's also one more
      | thing to monitor because it can decide to start vomiting
      | errors for whatever random reason.
 
      | dpkirchner wrote:
      | Neither of those cases really needs an operator --
      | Prometheus and cert-manager both have code that watches for
      | changes on ingresses/services/custom resources and reacts
      | to changes (using permissions granted via RBAC). I've used
      | both without an operator and still use Prometheus without
      | one.
 
  | cacois wrote:
  | I've found operator-sdk [1] (which uses kubebuilder under the
  | hood) to be a better starting point for operator development.
  | 
  | [1] https://github.com/operator-framework/operator-sdk
 
    | MuffinFlavored wrote:
    | Can you give me an example use case you've run into where you
    | need to write a custom k8s operator/API?
 
      | [deleted]
 
| darren0 wrote:
| I'm not sure why this is a top post. The definitions of
| controller and operator are completely wrong. The example code is
| for creating a custom api server which is only done in the most
| advanced of advanced use cases. The implementation of the
| apiserver is too naive to demonstrate they have any understanding
| of the complexity that supporting watch will cause.
 
  | mfer wrote:
  | The article gets the description of an operator wrong. The
  | definition of an operator originally was...
  | 
  | > An Operator is an application-specific controller that
  | extends the Kubernetes API to create, configure, and manage
  | instances of complex stateful applications on behalf of a
  | Kubernetes user. It builds upon the basic Kubernetes resource
  | and controller concepts but includes domain or application-
  | specific knowledge to automate common tasks.
  | 
  | This is the original definition of an operator [1]. People now
  | use them for stateless things, and domain-specific work has
  | taken off.
  | 
  | You can look at the Kubernetes docs [2] to see refinements on
  | it...
  | 
  | > Kubernetes' operator pattern concept lets you extend the
  | cluster's behaviour without modifying the code of Kubernetes
  | itself by linking controllers to one or more custom resources.
  | Operators are clients of the Kubernetes API that act as
  | controllers for a Custom Resource.
  | 
  | [1]
  | https://web.archive.org/web/20190113035722/https://coreos.co...
  | 
  | [2] https://kubernetes.io/docs/concepts/extend-
  | kubernetes/operat...
 
    | richardwhiuk wrote:
    | You don't need to implement a custom API server to implement
    | an operator - you can just watch a CR.
 
      | jhoelzel wrote:
      | for an operator you do, what you mean is a controller =)
 
      | [deleted]
 
  | timelapse wrote:
  | > The definitions of controller and operator are completely
  | wrong.
  | 
  | mind clarifying?
 
| devkulkarni wrote:
| We have an FAQ about Operators here: https://github.com/cloud-
| ark/kubeplus/blob/master/Operator-F...
| 
| It should be helpful if you are new to the Operator concept.
| 
| Operators are generally useful for handling domain-specific
| actions - for example, performing database backups, installing
| plugins on Moodle/Wordpress, etc. If you are looking for
| application deployment then a Helm chart should be sufficient.
 
| kimbernator wrote:
| I didn't really enjoy my experience with the few operators I've
| worked with, mainly because they require the maintainer to build
| in some sort of access to basic kubernetes functionality. I see
| the benefit of operators, but I hated that in order to do
| something as simple as define memory/CPU limits to certain
| containers I would need to open a PR to the repo and wait weeks,
| sometimes months, for a new release.
| 
| It's frustrating to be a kubernetes admin but not have access to
| basic configuration options because the maintainers of even some
| very high-profile operators (looking at you, AWX) neglected to
| build in access to basic functionality.
 
  | evancordell wrote:
  | This is a common frustration of mine as well!
  | 
  | In the latest release of the spicedb-operator[0], I added a
  | feature that allows users to specify arbitrary patches over
  | operator-managed resources directly in the API (examples in the
  | link).
  | 
  | There are some other projects like Kyverno and Gatekeeper that
  | try to do this generically with mutating webhooks, but
  | embedding a `patches` API into the operator itself gives the
  | operator a chance to ensure the changes are within some
  | reasonable guardrails.
  | 
  | [0]: https://github.com/authzed/spicedb-
  | operator/releases/tag/v1....
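  | 
  | To illustrate the guardrail idea (not the spicedb-operator's
  | actual API, which is not reproduced here), a Python sketch:
  | user patches are merged into the operator-managed resource
  | only if every patched field sits under an allowed path. The
  | names `ALLOWED_PREFIXES`, `apply_patch`, and the resource
  | layout are hypothetical.

```python
import copy

# Hypothetical guardrail: only fields under these paths may be patched.
ALLOWED_PREFIXES = (("spec", "resources"), ("metadata", "labels"))


def merge(base, patch):
    """Recursively merge `patch` into a copy of `base` (dicts only)."""
    result = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def leaf_paths(patch, prefix=()):
    """Yield the path (as a tuple of keys) of every leaf in `patch`."""
    for key, value in patch.items():
        path = prefix + (key,)
        if isinstance(value, dict) and value:
            yield from leaf_paths(value, path)
        else:
            yield path


def apply_patch(resource, patch):
    """Return a patched copy of `resource`, or raise if a guardrail trips."""
    for path in leaf_paths(patch):
        if not any(path[:len(p)] == p for p in ALLOWED_PREFIXES):
            raise ValueError("patch touches a managed field: " + ".".join(path))
    return merge(resource, patch)


managed = {
    "metadata": {"labels": {"app": "db"}},
    "spec": {"image": "db:1", "resources": {}},
}
patched = apply_patch(
    managed, {"spec": {"resources": {"limits": {"memory": "1Gi"}}}}
)
```

  | Here setting a memory limit goes through, while a patch to
  | `spec.image` would be rejected with a `ValueError`.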
 
    | remram wrote:
    | The SpiceDB operator looks like a prime example of something
    | that should have been a Helm Chart. Migrations can be run in
    | the containers.
    | 
    | Operators are just the non-containerized daemons of the
    | Kubernetes OS. We did all this work to run everything in
    | neatly encapsulated containers, and then everyone wants to
    | run stuff globally on the whole cluster. What's the point? Do
    | we just containerize clusters and start over?
 
      | xyzzy_plugh wrote:
      | I'm not sure what you're on about. Operators don't need to
      | run in cluster at all. And even then, they can absolutely
      | run as containers. And as far as permissions go, that's up
      | to you. They're just regular service accounts.
 
      | evancordell wrote:
      | I get the sentiment. We held off on building an operator
      | until we felt there was actually value in doing so (for the
      | most part, Deployments cover the operational needs pretty
      | well).
      | 
      | Migrations can be run in containers (and they are, even
      | with the operator), but it's actually a lot of work to run
      | them at the right time, only once, with the right flags, in
      | the right order, waiting for SpiceDB to reach a specific
      | spot in phased migrations, etc.
      | 
      | Moving from v1.13.0 to v1.14.0 of SpiceDB requires a multi-
      | phase migration to avoid downtime[0], as could any phased
      | migration for any stateful workload. The operator will walk
      | you through them correctly, without intervention. Users who
      | aren't running on Kubernetes or aren't using the operator
      | often have problems running these steps correctly.
      | 
      | The value is in this automation, but also in the API
      | interface itself. RDS is just some automation and an API on
      | top of EC2, and I think RDS has value over running postgres
      | on EC2 myself directly.
      | 
      | As for helm charts, this is just my opinion, but I don't
      | think they're a good way to distribute software to end
      | users. The interface for a helm chart becomes polluted over
      | time in the same way that most operator APIs become
      | polluted over time, as more and more configuration is
      | pulled up to the top. I think helm is better suited to
      | managing configuration you write yourself to deploy on your
      | own clusters (I realize I'm in the minority here).
      | 
      | [0]:
      | https://github.com/authzed/spicedb/releases/tag/v1.14.0
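      | 
      | The "right order, only once, wait for the rollout" part
      | can be sketched as a small resumable state machine. This
      | is a Python illustration under assumptions, not the
      | operator's code: the phase names, `step`, and the two
      | callbacks are hypothetical.

```python
# Hypothetical phase names; a real phased migration defines its own.
PHASES = ["phase-1", "phase-2", "phase-3"]


def step(phases, state, run_migration, rollout_ready):
    """Advance a phased migration as far as currently possible.

    `state` is a dict persisted between calls. Each phase's migration
    runs at most once, in order, and the next phase starts only after
    `rollout_ready(phase)` reports that the workload has caught up.
    Returns the phase being waited on, or None when all are done.
    """
    migrated = state.setdefault("migrated", [])
    rolled_out = state.setdefault("rolled_out", [])
    for phase in phases:
        if phase in rolled_out:
            continue
        if phase not in migrated:
            run_migration(phase)
            migrated.append(phase)
        if not rollout_ready(phase):
            return phase  # wait here; a later call resumes at this phase
        rolled_out.append(phase)
    return None


runs = []                    # records every migration actually executed
ready = {"phase-1": False}   # simulate a rollout that is still pending
state = {}

waiting = step(PHASES, state, runs.append, lambda p: ready.get(p, True))
ready["phase-1"] = True
finished = step(PHASES, state, runs.append, lambda p: ready.get(p, True))
```

      | Calling `step` repeatedly is safe: the first call stops
      | at `phase-1`, and the second resumes without re-running
      | its migration.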
 
    | ojhughes wrote:
    | Adding the patch api is neat! I've solved this in the past by
    | embedding the entire PodSpec etc into the CRD
 
      | remram wrote:
      | Did you call your CRD "Deployment"?
 
      | sklarsa wrote:
      | I might have to borrow that! Very clever
 
  | hintymad wrote:
  | > I would need to open a PR to the repo and wait weeks,
  | sometimes months, for a new release.
  | 
  | Just curious, is this a limitation of the Operators framework,
  | or that of your system's implementation? My knee-jerk reaction
  | is that any implementation should absolutely not require
  | opening a ticket. After all, Amazon's API mandate happened 20
  | years ago, and Netflix followed suit to achieve phenomenal
  | productivity for their engineers. I have a hard time imagining
  | why any engineer would think that gatekeeping configuration
  | behind a PR is a good idea (a UI with proper automation and an
  | approval process that hides the generated PR for specific use
  | cases is a different matter).
 
    | IceWreck wrote:
    | Not a kubernetes expert, but my understanding is that
    | operators are regular programs that run in a kubernetes
    | container and interact with the kubernetes API to
    | launch/manage other containers and custom kubernetes
    | resources.
    | 
    | An operator (or its custom resource) can be configured by
    | Kubernetes YAML/API and it's up to the creator of the operator
    | to specify the kind of configuration. If the operator creator
    | did not specify options to set cpu/memory limits on the pods
    | managed by the operator, then you can't do anything. You have
    | to add that feature into the operator and then make a pull
    | request and wait for it to be upstreamed.
    | 
    | Or fork it instead. Same thing for helm charts (except
    | forking and patching them is easier than forking an
    | operator).
 
| fedreg wrote:
| Here's another example of a custom rust operator,
| https://github.com/mach-kernel/databricks-kube-operator
| 
| Written by a co-worker to help manage our databricks projects
| across clusters. Works wonderfully!!
 
  | alexott wrote:
  | But why such complexity? Is it easier to maintain than
  | terraform code?
 
    | EdwardDiego wrote:
    | Yes. Terraform doesn't actively manage resources, operators
    | do.
 
| jhoelzel wrote:
| Oh i love operators they usually tie the entire cluster together
| and lead to amazing things! Think of Kubernetes as an advanced
| API server that can be extended endlessly and operators are the
| way to do it.
| 
| There really is no magic, it's all there, and with Go the images
| are usually what? Like 10 MB?
| 
| It's essential to have a solid understanding of Kubernetes
| architecture, concepts such as custom resources and controllers,
| and the tools and APIs available for working with Operators.
| 
| Don't use Rust though, use an SDK like the Operator SDK or
| kubebuilder. It's native to k8s and you will have a much easier
| time too.
 
| Thaxll wrote:
| Using Rust for that is a bad idea, just use the official and
| native SDKs (in Go). Rust does not have any equivalent to
| https://sdk.operatorframework.io/
 
| jzelinskie wrote:
| Since Go got generics, working with the Kubernetes API could
| become far more ergonomic. It's been pulling teeth until now. I'm
| eager to see how the upstream APIs change over time.
| 
| In the meantime, one of the creators of the Operator
| Framework[0] built a bunch of useful patterns using generics that
| we used to build the SpiceDB Operator[1] called controller-
| idioms[2].
| 
| Does anyone know of other efforts to improve the status quo?
| 
| [0]: https://operatorframework.io
| 
| [1]: https://github.com/authzed/spicedb-operator
| 
| [2]: https://github.com/authzed/controller-idioms
 
| crabbone wrote:
| I've written (well, participated in development of) two
| Kubernetes operators, and support about a dozen of them (in our
| own deployment of Kubernetes): Jupyter, PostgreSQL, a bunch of
| Prometheus operators and a handful of proprietary ones.
| 
| In my years of working with Kubernetes I cannot shake the feeling
| that it's, basically, an MLM. It carefully obscures its
| functionality by hiding behind opaque definitions. It doesn't
| really work, when push comes to shove. And, most importantly, it
| survives in a parasitic kind of way: by piggybacking on those who
| develop all kinds of extensions, be it operators, custom
| networking or storage plugins, authentication and so on.
| 
| My problem is I cannot find who stands at the top of the pyramid.
| There's the Cloud Native Computing Foundation, but all it does is
| sell certifications nobody really needs... so, that cannot
| possibly be it. No big name really benefits from this in an
| obvious way...
| 
| So... anyways, when I hear people argue about how to implement
| this or another extension of Kubernetes, it rings the same as
| when people argue about styles of agile, or code readability etc.
| nonsense. There isn't a good way. There are no acceptance
| criteria. The whole system is flawed to no end.
 
| _muff1nman_ wrote:
| This article is mistaken from the get-go as an operator is not
| the same as an apiservice. Rather an operator is a wider term for
| something that includes a controller. See
| https://kubernetes.io/docs/concepts/extend-kubernetes/operat...
| 
| Also it's important for people reading this article - an
| apiservice (which this article talks about) is very rarely
| something that should be done. An operator is more appropriate
| for nearly all cases except for when you truly need your state
| stored outside of the internal Kubernetes etcd datastore.
 
  | reedjosh wrote:
  | Custom Resource + Controller = Operator. Good call!
  | 
  | > Operators are clients of the Kubernetes API that act as
  | controllers for a Custom Resource.
 
    | jhoelzel wrote:
    | exactly! controlling refers to directing or regulating the
    | behavior of something, while operating refers to the actual
    | execution or manipulation.
 
  | tenac23 wrote:
  | After reading the comments we updated the article
 
| rdtsc wrote:
| You have a problem: orchestrating some thing in kube, so you
| write some custom operator logic running alongside your main
| product; but now you have two problems to worry about.
| 
| I've seen just as many issues, if not more, with debugging the
| operator logic itself as with the main pods/deployments it was
| trying to manage.
| 
| So just from a practical point of view, I think it should be a
| last resort after everything else fails (helm charts, etc).
 
___________________________________________________________________
(page generated 2023-03-09 23:01 UTC)