[HN Gopher] Execute Docker Containers as QEMU MicroVMs
___________________________________________________________________
 
Execute Docker Containers as QEMU MicroVMs
 
Author : DarkPlayer
Score  : 122 points
Date   : 2021-06-16 16:05 UTC (6 hours ago)
 
web link (mergeboard.com)
w3m dump (mergeboard.com)
 
| [deleted]
 
| riobard wrote:
| A few years ago I invested in a small startup called `hyper.sh`.
| It open sourced a container runtime called `runV` which provided
| exactly this: security of virtual machines plus convenience of
| containers.
| 
| The project later merged with Intel Clear Containers to become
| what's now called Kata Containers (https://katacontainers.io/)
| and is now widely used by several Internet giants like Alibaba
| and Baidu.
| 
| The startup was acquired by Ant Financial a couple of years ago.
| 
| (I recorded a podcast with one of hyper.sh's engineers, if you
| understand Mandarin: https://pan.icu/25)
 
  | [deleted]
 
  | polskibus wrote:
  | How does it differ from Firecracker?
 
    | riobard wrote:
    | I'm not familiar with later developments, but AFAIK
    | Firecracker came much later and now you can actually use
    | Firecracker as Kata Containers' hypervisor in addition to
    | QEMU.
 
  | temp_praneshp wrote:
  | Probably off topic: Back in 2014-15 at my first job, when I was
  | working on openstack, they used to show up at the summits. They
  | were super smart and very generous with their time when I had
  | questions. I wondered sometime in 2020 what happened to them,
  | I'm happy they had a decent exit.
 
  | lifty wrote:
  | I worked with their tech, testing it, and I loved the product.
  | It was definitely ahead of its time. Similar in some ways to
  | what Fly is doing these days, without the edge.
 
  | cptnapalm wrote:
  | I was looking at Kata containers a few days ago. I'm pretty new
  | to trying to use VMs/containers for services; purely hobby
  | level. Couldn't figure out how to use them, but that's not
  | necessarily a knock on them, as I can't get OpenBSD WireGuard
  | to work either.
 
| forty wrote:
| Isn't Firecracker an AWS tech?
 
  | cpach wrote:
  | That's correct.
  | 
  | https://github.com/firecracker-microvm/firecracker
 
| encryptluks2 wrote:
| Why not run containers in VMs in containers in VMs? :)
| 
| Seriously, VMs are hardly as secure as many people want to
| believe unless you're utilizing enclaves and even that has
| vulnerabilities. I think a better approach is Seccomp and
| whatever other filtering makes sense.
 
  | dboreham wrote:
  | Machine Turducken.
 
  | handrous wrote:
  | A while back I did some looking at FreeBSD jails to try to
  | figure out why they don't have more mindshare (especially when
  | paired with the nigh-superpower-granting ZFS).
  | 
  | I came away baffled that they weren't more widely-promoted,
  | compared with Docker and friends. After thinking about it for a
  | while, all I can figure is they're so straightforward to use
  | and well-documented that there's no room to make one's name, or
  | to make a buck, re-packaging them or wrapping them in complex
  | tools, so there's little money or glory (= personal marketing
  | via open-source project leadership/contributions) in promoting
  | them.
  | 
  | [EDIT] that is: what would be a blog post in LXC/Docker land...
  | doesn't exist, because it's covered perfectly well in the docs.
  | What would be a simple open-source tool... becomes a blog post,
  | because it's short, simple, and clear enough not to merit
  | special software, but just a quick guide to existing tools.
  | What would be a business, becomes a simple open-source tool
  | without enough of a difficulty/convenience "moat" to support a
  | business.
 
    | nicolaslem wrote:
    | TrueNAS exposed me to FreeBSD jails but what put me off is
    | that there does not seem to be an equivalent of "docker
    | build".
    | 
    | Jails seem to be treated like OpenVZ containers in the Linux
    | world: a lighter alternative to virtual machines, not a way
    | to build and distribute applications like Docker.
    | 
    | This is just my take after playing a few hours with jails, I
    | would happily be proven wrong.
 
    | tyingq wrote:
    | If technically best in the container space mattered, Illumos
    | would be everywhere...
 
      | tptacek wrote:
      | People say this a lot too, but Illumos also uses shared-
      | kernel isolation. Linux + gVisor is probably
      | (significantly) superior to it as far as security goes.
 
      | cestith wrote:
      | Or z/OS
 
    | tptacek wrote:
    | Jails are still shared-kernel isolation. Docker's reputation
    | is mired in its earlier implementations, when it wasn't
    | really even intended for multitenant isolation. Modern
    | Docker, running with unprivileged containers (which is the
    | norm), is substantially hardened. The real win over Docker is
    | losing the shared kernel, which is what lots of people are
    | doing, so the win to Jails is marginal.
 
    | boardwaalk wrote:
    | I suspect the answer includes it not being Linux, even with
    | the compatibility layer available.
 
      | handrous wrote:
      | I'm sure that's some of it, but the trend seems to be
      | moving away from leveraging OS-level tools _anyway_. As
      | long as your containers (or jails) and the single important
      | binary in each one start up OK and your network tuning on
      | the parent OS isn't completely screwed up, the rest barely
      | matters anymore.
 
        | coder543 wrote:
        | It seems like you're missing a lot of things.
        | 
        | As a developer, how do I run FreeBSD Jails on my MacBook
        | during development? With Docker for Mac, it is trivial
        | for me to do everything on my Mac, and the fact that
        | there is a virtual machine is completely invisible to me.
        | Everything "Just Works". With FreeBSD Jails, I would have
        | to actually interact with a VM constantly, including the
        | pain of shipping files back and forth.
        | 
        | As a developer, are popular databases and applications
        | pre-packaged as FreeBSD Jails so that I can spin one up
        | on my laptop with a single command? Where is the Docker
        | Hub equivalent?
        | 
        | As a developer, how do I orchestrate a collection of
        | FreeBSD Jails for each project? With Docker, I define a
        | single `docker-compose.yml` file for each project. With a
        | single `docker-compose up`, the entire project is running
        | _including_ dependencies such as databases and other
        | related projects in a completely reproducible fashion.
        | This makes it trivial for coworkers to spin up a project
        | on their machine and immediately be productive without
        | spending an hour trying to get all the right versions of
        | everything installed and up and running.
        | 
        | As someone responsible for deploying an application to
        | production, what is the story around FreeBSD Jails for
        | deploying across a cluster? Is there a Kubernetes-
        | equivalent that can manage the allocation of resources,
        | blue-green deployments, and manage the lifecycle of my
        | FreeBSD Jails?
        | 
        | As someone responsible for deploying an application to
        | production, do any of the major clouds support FreeBSD
        | Jails? With Docker images, I can deploy those straight to
        | ECS Fargate, Google Cloud Run, and half a dozen other
        | services. Then I don't even have to think about my own
        | infrastructure unless I need some really specialized
        | hardware for a specific application.
        | 
        | > the rest barely matters anymore.
        | 
        | _Everything else_ matters so much.
        | 
        | As to your earlier point about ZFS, most Linux distros
        | these days seem to trivially support ZFS. Even TrueNAS is
        | working on switching to Linux with their TrueNAS Scale
        | offering.
        | 
        | It's not that I'm opposed to FreeBSD... FreeBSD is just a
        | hard sell. It's hard to pin down exactly what you're
        | gaining by throwing out all the collective Linux
        | knowledge of an organization and switching to FreeBSD.
        | FreeBSD is an N-th tier platform for pretty much every
        | programming language except C, so good luck when you run
        | into random subtle problems. Also, good luck doing
        | hardware accelerated machine learning inference or
        | training on FreeBSD... it's _probably_ possible?
        | 
        | > the single important binary
        | 
        | This is also such a weird thing to throw out there. I
        | like a good Go program myself, but _most_ companies are
        | not only deploying single-binary statically linked
        | applications. Most companies are also deploying some kind
        | of Ruby, Python, or Java application... none of which are
        | likely to be a single file in practice. Most of them will
        | have a variety of shared libraries, and I don't know if
        | I've ever seen a Ruby application shipped in a `FROM
        | scratch` container before. Technically possible, but
        | that's just not common reality as far as I've seen. It
        | sounds like you're proposing that everyone is already
        | running in `FROM scratch` containers, so a FreeBSD Jail
        | is just a drop-in replacement.
        | 
        | Linux containers are far from perfect, but as a
        | developer... I _have_ played with FreeBSD Jails before,
        | and come away frustrated by all the work you have to do
        | yourself.
 
        | handrous wrote:
        | > > the single important binary
        | 
        | > This is also such a weird thing to throw out there. I
        | like a good Go program myself, but most companies are not
        | only deploying single-binary statically linked
        | applications. Most companies are also deploying some kind
        | of Ruby, Python, or Java application... none of which are
        | likely to be a single file in practice.
        | 
        | Sure, but usual practice with containers is to put each
        | thing in its own, unless they are _very_ tightly coupled.
        | Web-app with a SQL database and a memory cache? Three
        | containers. You _can_ do otherwise, but that's typical.
        | Usually each container ends up with one main, important
        | running process, and not much else.
        | 
        | [EDIT]
        | 
        | > As someone responsible for deploying an application to
        | production, what is the story around FreeBSD Jails for
        | deploying across a cluster? Is there a Kubernetes-
        | equivalent that can manage the allocation of resources,
        | blue-green deployments, and manage the lifecycle of my
        | FreeBSD Jails?
        | 
        | > As someone responsible for deploying an application to
        | production, do any of the major clouds support FreeBSD
        | Jails? With Docker images, I can deploy those straight to
        | ECS Fargate, Google Cloud Run, and half a dozen other
        | services. Then I don't even have to think about my own
        | infrastructure unless I need some really specialized
        | hardware for a specific application.
        | 
        | These are exactly the kinds of things I was thinking of
        | when I noted that the OS itself has been seriously
        | diminished in importance, for modern workflows. I agree
        | that most commercial or high-profile open-source "cloud"
        | tools and platforms are built around LXC/Docker.
 
        | coder543 wrote:
        | > Sure, but usual practice with containers is to put each
        | thing in its own, unless they are very tightly coupled.
        | Web-app with a SQL database and a memory cache? Three
        | containers. You can do otherwise, but that's typical.
        | Usually each container ends up with one main, important
        | running process, and not much else.
        | 
        | I agree, but... getting all the application dependencies
        | in there is more than just getting a single binary in
        | there. If it's just a single-binary Go program, then a
        | Jail works just fine, but it's not that simple for a Ruby
        | application. I'm definitely not talking about databases
        | running in the same container as the application. That's
        | where Kubernetes and docker-compose come in for multi-
        | container orchestration, which are things that FreeBSD
        | Jails don't have as far as I know.
        | 
        | > These are exactly the kinds of things I was thinking of
        | when I noted that the OS itself has been seriously
        | diminished in importance
        | 
        | Yes, but... these are all the things that FreeBSD doesn't
        | offer. These are the real reasons that people don't talk
        | about FreeBSD Jails in the same breath as Docker. The
        | Docker container itself (or the FreeBSD Jail) as a unit
        | of isolation is the least interesting part of the
        | ecosystem. All of the developer tools, orchestration
        | tools, and prebuilt images are what make the Docker
        | universe so interesting, and make FreeBSD Jails... less
        | interesting.
        | 
        | You said you were confused why Jails don't have more
        | mindshare. It has absolutely nothing to do with people
        | being able to invent useless tools and write blog posts
        | about them, and it has absolutely nothing to do with
        | FreeBSD Jails being _too well documented_. You kind of
        | implied those were the best explanations you could come
        | up with. Those are not the problems _at all_, and it
        | seems disingenuous to me to say you think those are the
        | problems unless you _really_ didn't know the things I
        | mentioned in my first reply.
 
        | oarsinsync wrote:
        | FreeBSD introduced Jails in 1999.
        | 
        | I used my first Jail in 2001.
        | 
        | Docker was started over a decade later in 2013.
        | 
        | It's reasonable to be confused why Jails lack the
        | mindshare. "Because it lacks all these other over-the-top
        | features that we need" might be reasonable in response,
        | except that Docker didn't have any of these things on day
        | 0 either.
        | 
        | Jails had a 14-year head start; Docker reinvented the
        | wheel, and not particularly well at first. Why did it
        | succeed more than Jails did? It wasn't because of the
        | piss-poor native Mac support.
 
        | tptacek wrote:
        | It seems pretty obvious that the big thing here is that
        | most people ship apps on Linux, not on FreeBSD.
 
        | handrous wrote:
        | My personal favorite thing about Docker, and the part I'd
        | most miss if I switched to Jails (which I'm fairly
        | confident could meet my needs with some fairly simple
        | scripts and aliases that wouldn't take me long to arrive
        | at, which is why I think there's so much less of an
        | "ecosystem" there, even a nascent and under-developed
        | one) is the way it forces projects to un-fuck their
        | configuration.
        | 
        | 500-line config, much of which few people ever care
        | about, with all kinds of ill-conceived nesting? Better
        | put the ~20 options that 99% of users ever touch in
        | environment variables, and document them. Weird state
        | garbage that's not captured in your config-on-disk?
        | Better figure it out and get it into env vars, and have
        | your startup script use those to transparently manage
        | whatever bad decisions you made re: state in the past.
        | Shit files all over the system? Better get that sorted
        | out so people can handle persistence with at the _very_
        | most three total mounts--and oh, gee, look, now your
        | simple example docker-compose also serves to document
        | where exactly you store files. And so on.
        | 
        | (my second-favorite thing is that it's a de-facto cross-
        | distro package manager with very up-to-date packages that
        | are trivial to completely and cleanly uninstall)
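
The configuration discipline described above (a small, documented set of options read from environment variables) can be sketched roughly as follows. This is a minimal illustration, not taken from any project in the thread: the `APP_*` variable names, the defaults, and the `Settings` fields are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """The handful of options most users touch (hypothetical names)."""
    listen_port: int
    database_url: str
    data_dir: str
    debug: bool


def load_settings(env=None):
    """Read settings from an environment mapping, falling back to defaults.

    `env` defaults to os.environ; passing a dict makes this easy to test.
    """
    env = os.environ if env is None else env
    return Settings(
        listen_port=int(env.get("APP_LISTEN_PORT", "8080")),
        database_url=env.get("APP_DATABASE_URL", "sqlite:////data/app.db"),
        # A single data directory keeps persistence down to one mount.
        data_dir=env.get("APP_DATA_DIR", "/data"),
        debug=env.get("APP_DEBUG", "false").lower() in ("1", "true", "yes"),
    )


if __name__ == "__main__":
    print(load_settings())
```

With this shape, a container only needs the variables that differ from the defaults, and the list of fields doubles as documentation of what can be configured.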
 
        | vermaden wrote:
        | > As a developer, are popular databases and applications
        | pre-packaged as FreeBSD Jails so that I can spin one up
        | on my laptop with a single command?
        | 
        | The closest you can get is BastilleBSD (framework for
        | FreeBSD Jails) and their templates - available here:
        | 
        | https://github.com/BastilleBSD/templates
        | https://bastillebsd.org/templates/
 
  | tptacek wrote:
  | I don't know what people generally believe.
  | 
  | But the attack surface of a Linux kernel is very large, is
  | pretty unpredictable, and can't be coherently masked out with
  | rules (my favorite example is Jann Horn's VM reference count
  | bug, which was a simple concurrency flaw in the core virtual
  | memory system). By comparison, a Linux KVM hypervisor is not
  | just a subset of the kernel by definition, but also a much
  | smaller codebase, a tiny fraction of the whole kernel.
  | 
  | Replacing shared-kernel isolation like seccomp-filtered
  | containers with VMs is, architecturally, simply the replacement
  | of a large trusted computing base with a smaller one. If the
  | overhead is acceptable, it's hard to argue with from a security
  | perspective.
 
  | riobard wrote:
  | That's the approach taken by Google's gVisor (at the cost of
  | I/O and network performance).
 
    | fsociety wrote:
    | gVisor, for better or for worse, does a whole lot of other
    | things than just seccomp filtering, and it shows in
    | performance tests.
 
    | encryptluks2 wrote:
    | gVisor does more than filtering, they basically reimplemented
    | the syscalls in an application kernel. At least with seccomp
    | the performance overhead is minimal.
 
    | tptacek wrote:
    | No, that's really not at all what gVisor is. gVisor is best
    | thought of as user-mode Linux --- a complete reimplementation
    | of most of the OS kernel. It's not a system call filter; it's
    | something much closer to a VM than to seccomp.
    | 
    | gVisor is a very cool codebase. As an illustration of the
    | approach: it includes its own TCP/IP stack; we use it in our
    | command-line dev tool to allow people to SSH to their VMs
    | over WireGuard without having to install WireGuard or obtain
    | privileges to manage WireGuard.
 
  | gorkish wrote:
  | OK; https://github.com/harvester/harvester
  | 
  | Security and performance aren't the only driving forces; there
  | are a lot of technical and operational benefits to the
  | abstraction and standard interfaces that you get when running
  | stacks that might otherwise look like someone took an Xzibit
  | meme too far.
  | 
  | Also remember on a modern system, there are often at least 2
  | additional layers at work abstracting interfaces to the "bare
  | metal" OS already.
 
    | encryptluks2 wrote:
    | I'm not disagreeing that abstraction can be useful, but the
    | overhead of a VM is unnecessary if utilizing the full
    | potential of containers. After all, the Linux kernel is acting
    | as the hypervisor already, so might as well trust it enough
    | to properly sandbox containers too and use the right
    | functionality to do so. I also think that running a
    | virtualization layer adds quite a bit of complexity, so while
    | it is cool that projects and companies have made it work and
    | integrated it with a container solution, eliminating the VM
    | layer altogether seems more ideal IMO.
 
| ashishbijlani wrote:
| > Can we somehow combine the advantages of the docker ecosystem
| with VMs?
| 
| Shameless plug: this is exactly what our goal is with
| https://kwarantine.xyz We are creating a new hypervisor (from
| scratch) that can run strongly isolated Docker/LXC containers.
 
  | mikepurvis wrote:
  | Is this what gvisor is? https://github.com/google/gvisor
 
    | ashishbijlani wrote:
    | No, gVisor is from Google. They emulate system calls in user-
    | space and use VMs, which increases runtime performance
    | overhead. We use hardware virtualization to directly run
    | containers -- no I/O emulation, no expensive VM exits, scale
    | as needed. Initial comparison with FC/gVisor/Xen here:
    | https://github.com/ashishbijlani/kwarantine
 
      | monocasa wrote:
      | I'm not sure gvisor requires vm exits. Their first backend
      | used ptrace very similarly to how user mode Linux worked.
      | 
      | Minor quibble though, since ptrace might even be slower
      | than vm exits; your core point stands.
 
        | rkeene2 wrote:
        | User Mode Linux is still around and works well. I use it
        | when I need a "fakeroot" without any special privileges
        | on the host.
        | 
        | https://rkeene.org/viewer/tmp/fakeroot.sh.htm
 
      | tptacek wrote:
      | It sounds like you just said "yes, but what we're building
      | is faster". The userland Linux emulation is a security
      | benefit, not a liability.
 
  | amscanne wrote:
  | The "fork" sounds like you blue pill the OS for each container?
  | I'm assuming the concept is like Cappsule [1] or Bromium [2]?
  | 
  | [1] https://cappsule.github.io/
  | [2] https://en.wikipedia.org/wiki/Bromium#/media/File:Bromium-en...
 
    | ashishbijlani wrote:
    | fork here is COW on the host kernel (i.e., copying EPT
    | entries). We will post detailed technical documentation soon.
 
| eatonphil wrote:
| There are a few existing projects out there like this (running
| Docker images as virtual machines, specifically) if folks are
| interested. Slim [0] is the one I can remember off the top of my
| head. I think there are a couple more.
| 
| Still, neat to have the walkthrough here in this post.
| 
| [0] https://github.com/ottomatica/slim
 
| thekevjames wrote:
| I had fun exploring Docker->VM conversion a while back [1],
| though the larger goal in my case was to be able to make the
| build path to custom GCP VM Images a bit simpler. Exciting to see
| other cases where folks are finding this sort of flow useful!
| 
| 1: https://thekev.in/blog/2019-08-05-dockerfile-bootable-vm/ind...
 
| rwmj wrote:
| https://katacontainers.io/ ?
 
  | bonzini wrote:
  | Yes, indeed. However it's nice to see directly the mechanisms
  | that let Kata do its magic.
 
| gravypod wrote:
| Something I'd be very interested in: building a PXE image from
| something declarative like Dockerfiles.
 
  | justincormack wrote:
  | Try LinuxKit https://github.com/linuxkit/linuxkit
 
  | laurencerowe wrote:
  | Google Container Optimized OS is basically this I think. It's
  | what's used when you start a GCE instance with a docker image.
  | 
  | https://cloud.google.com/container-optimized-os/
 
| OldGoodNewBad wrote:
| I think a lot of folks are going out of their way to
| misunderstand what happened. Yes, there are other similar
| projects and containers. No, none come from a long-established
| _COMMUNITY RUN PROJECT_. This is something akin to the difference
| between VirtualBox and OpenBSD's vmd. One's a product with a
| "free" tier, the other is a community project.
 
| tptacek wrote:
| As I understand the landscape here, the big enabling win of
| microvms is faster boot time; there's a cool qemu-lite slide deck
| that goes into detail about how they cut down boot time:
| 
| https://www.linux-kvm.org/images/d/d2/03x05B-Chao_Peng-Light...
| 
| The big win was slashing away the BIOS stuff.
| 
| We use AWS's Firecracker to turn our customers' Docker containers
| into Firecracker microvms (Firecracker is Amazon's Rust VMM, the
| engine for Fargate and Lambda). Anecdotally: in my dev
| environment, the difference between Firecracker boot times and
| native Docker container startup is imperceptible; the logging we
| do swamps the VM boot stuff. It's _very_ fast.
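
For readers who want to try the container-to-microVM idea, a Firecracker VM can be described by a small JSON document passed with `--config-file`. The sketch below only builds that document. It is not how the commenter's service works, and it assumes you already have an uncompressed kernel image and a root filesystem image built from a container (for example from the output of `docker export`). The file names `vmlinux.bin` and `rootfs.ext4` and the function name are hypothetical.

```python
import json


def firecracker_config(kernel_path, rootfs_path, vcpus=1, mem_mib=256):
    """Build a Firecracker-style VM description as a plain dict."""
    return {
        "boot-source": {
            "kernel_image_path": kernel_path,
            # Typical minimal kernel arguments: serial console, no PCI bus.
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [
            {
                "drive_id": "rootfs",
                "path_on_host": rootfs_path,
                "is_root_device": True,
                "is_read_only": False,
            }
        ],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
    }


if __name__ == "__main__":
    # Paths are placeholders; write the output to a file for --config-file.
    print(json.dumps(firecracker_config("vmlinux.bin", "rootfs.ext4"), indent=2))
```

The kernel is loaded directly from `kernel_image_path` rather than through a bootloader, which is part of why this style of VM starts quickly.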
 
___________________________________________________________________
(page generated 2021-06-16 23:00 UTC)