
	[HN Gopher] Execute Docker Containers as QEMU MicroVMs ___________________________________________________________________ Execute Docker Containers as QEMU MicroVMs Author : DarkPlayer Score : 122 points Date : 2021-06-16 16:05 UTC (6 hours ago)
	web link (mergeboard.com)
	w3m dump (mergeboard.com)
	\| [deleted] \| riobard wrote: \| A few years ago I invested in a small startup called `hyper.sh`. \| It open sourced a container runtime called `runV` which provided \| exactly this: security of virtual machines plus convenience of \| containers. \| \| The project later merged with Intel Clear Containers to become \| what's now called Kata Containers (https://katacontainers.io/) \| and is now widely used by several Internet giants like Alibaba \| and Baidu. \| \| The startup was acquired by Ant Finance a couple of years ago. \| \| (I recorded a podcast with one of hyper.sh's engineers if you can \| listen to Mandarin https://pan.icu/25) \| [deleted] \| polskibus wrote: \| How does it differ from Firecracker? \| riobard wrote: \| I'm not familiar with later development, but AFAIK \| Firecracker came much later and now you can actually use \| Firecracker as Kata Container's hypervisor in addition to \| QEMU. \| temp_praneshp wrote: \| Probably off topic: Back in 2014-15 at my first job, when I was \| working on openstack, they used to show up at the summits. They \| were super smart and very generous with their time when I had \| questions. I wondered sometime in 2020 what happened to them, \| I'm happy they had a decent exit. \| lifty wrote: \| I worked with their tech, testing it, and I loved the product. \| It was definitely ahead of its time. Similar in some ways to \| what Fly is doing these days, without the edge. \| cptnapalm wrote: \| I was looking at Kata containers a few days ago. I'm pretty new \| to trying to use VMs/containers for services; purely hobby \| level. Couldn't figure out how to use them, but that's not \| necessarily a knock on them as I also can't get OpenBSD \| wireguard to work either. \| forty wrote: \| Isn't firecracker an AWS tech? \| cpach wrote: \| That's correct. \| \| https://github.com/firecracker-microvm/firecracker \| encryptluks2 wrote: \| Why not run containers in VMs in containers in VMs? 
:) \| \| Seriously, VMs are hardly as secure as many people want to \| believe unless you're utilizing enclaves and even that has \| vulnerabilities. I think a better approach is Seccomp and \| whatever other filtering makes sense. \| dboreham wrote: \| Machine Turducken. \| handrous wrote: \| A while back I did some looking at FreeBSD jails to try to \| figure out why they don't have more mindshare (especially when \| paired with the nigh-superpower-granting ZFS). \| \| I came away baffled that they weren't more widely-promoted, \| compared with Docker and friends. After thinking about it for a \| while, all I can figure is they're so straightforward to use \| and well-documented that there's no room to make one's name, or \| to make a buck, re-packaging them or wrapping them in complex \| tools, so there's little money or glory (= personal marketing \| via open-source project leadership/contributions) in promoting \| them. \| \| [EDIT] that is: what would be a blog post in LXC/Docker land... \| doesn't exist, because it's covered perfectly well in the docs. \| What would be a simple open-source tool... becomes a blog post, \| because it's short, simple, and clear enough not to merit \| special software, but just a quick guide to existing tools. \| What would be a business, becomes a simple open-source tool \| without enough of a difficulty/convenience "moat" to support a \| business. \| nicolaslem wrote: \| TrueNAS exposed me to FreeBSD jails but what put me off is \| that there does not seem to be an equivalent of "docker \| build". \| \| Jails seem to be treated like OpenVZ containers in the Linux \| world: a lighter alternative to virtual machines, not a way \| to build and distribute applications like Docker. \| \| This is just my take after playing a few hours with jails, I \| would happily be proven wrong. \| tyingq wrote: \| If technically best in the container space mattered, Illumos \| would be everywhere... 
\| tptacek wrote: \| People say this a lot too, but Illumos also uses shared- \| kernel isolation. Linux + gVisor is probably \| (significantly) superior to it as far as security goes. \| cestith wrote: \| Or z/OS \| tptacek wrote: \| Jails are still shared-kernel isolation. Docker's reputation \| is mired in its earlier implementations, when it wasn't \| really even intended for multitenant isolation. Modern \| Docker, running with unprivileged containers (which is the \| norm), is substantially hardened. The real win over Docker is \| losing the shared kernel, which is what lots of people are \| doing, so the win to Jails is marginal. \| boardwaalk wrote: \| I suspect the answer includes it not being Linux, even with \| the compatibility layer available. \| handrous wrote: \| I'm sure that's some of it, but the trend seems to be \| moving away from leveraging OS-level tools _anyway_. As \| long as your containers (or jails) and the single important \| binary in each one start up OK and your network tuning on \| the parent OS isn't completely screwed up, the rest barely \| matters anymore. \| coder543 wrote: \| It seems like you're missing a lot of things. \| \| As a developer, how do I run FreeBSD Jails on my MacBook \| during development? With Docker for Mac, it is trivial \| for me to do everything on my Mac, and the fact that \| there is a virtual machine is completely invisible to me. \| Everything "Just Works". With FreeBSD Jails, I would have \| to actually interact with a VM constantly, including the \| pain of shipping files back and forth. \| \| As a developer, are popular databases and applications \| pre-packaged as FreeBSD Jails so that I can spin one up \| on my laptop with a single command? Where is the Docker \| Hub equivalent? \| \| As a developer, how do I orchestrate a collection of \| FreeBSD Jails for each project? With Docker, I define a \| single `docker-compose.yml` file for each project. 
With a \| single `docker-compose up`, the entire project is running \| _including_ dependencies such as databases and other \| related projects in a completely reproducible fashion. \| This makes it trivial for coworkers to spin up a project \| on their machine and immediately be productive without \| spending an hour trying to get all the right versions of \| everything installed and up and running. \| \| As someone responsible for deploying an application to \| production, what is the story around FreeBSD Jails for \| deploying across a cluster? Is there a Kubernetes- \| equivalent that can manage the allocation of resources, \| blue-green deployments, and manage the lifecycle of my \| FreeBSD Jails? \| \| As someone responsible for deploying an application to \| production, do any of the major clouds support FreeBSD \| Jails? With Docker images, I can deploy those straight to \| ECS Fargate, Google Cloud Run, and half a dozen other \| services. Then I don't even have to think about my own \| infrastructure unless I need some really specialized \| hardware for a specific application. \| \| > the rest barely matters anymore. \| \| _Everything else_ matters so much. \| \| As to your earlier point about ZFS, most Linux distros \| these days seem to trivially support ZFS. Even TrueNAS is \| working on switching to Linux with their TrueNAS Scale \| offering. \| \| It's not that I'm opposed to FreeBSD... FreeBSD is just a \| hard sell. It's hard to pin down exactly what you're \| gaining by throwing out all the collective Linux \| knowledge of an organization and switching to FreeBSD. \| FreeBSD is an N-th tier platform for pretty much every \| programming language except C, so good luck when you run \| into random subtle problems. Also, good luck doing \| hardware accelerated machine learning inference or \| training on FreeBSD... it's _probably_ possible? \| \| > the single important binary \| \| This is also such a weird thing to throw out there. 
I \| like a good Go program myself, but _most_ companies are \| not only deploying single-binary statically linked \| applications. Most companies are also deploying some kind \| of Ruby, Python, or Java application... none of which are \| likely to be a single file in practice. Most of them will \| have a variety of shared libraries, and I don't know if \| I've ever seen a Ruby application shipped in a `FROM \| scratch` container before. Technically possible, but \| that's just not common reality as far as I've seen. It \| sounds like you're proposing that everyone is already \| running in `FROM scratch` containers, so a FreeBSD Jail \| is just a drop-in replacement. \| \| Linux containers are far from perfect, but as a \| developer... I _have_ played with FreeBSD Jails before, \| and come away frustrated by all the work you have to do \| yourself. \| handrous wrote: \| > > the single important binary \| \| > This is also such a weird thing to throw out there. I \| like a good Go program myself, but most companies are not \| only deploying single-binary statically linked \| applications. Most companies are also deploying some kind \| of Ruby, Python, or Java application... none of which are \| likely to be a single file in practice. \| \| Sure, but usual practice with containers is to put each \| thing in its own, unless they are _very_ tightly coupled. \| Web-app with a SQL database and a memory cache? Three \| containers. You _can_ do otherwise, but that's typical. \| Usually each container ends up with one main, important \| running process, and not much else. \| \| [EDIT] \| \| > As someone responsible for deploying an application to \| production, what is the story around FreeBSD Jails for \| deploying across a cluster? Is there a Kubernetes- \| equivalent that can manage the allocation of resources, \| blue-green deployments, and manage the lifecycle of my \| FreeBSD Jails? 
\| \| > As someone responsible for deploying an application to \| production, do any of the major clouds support FreeBSD \| Jails? With Docker images, I can deploy those straight to \| ECS Fargate, Google Cloud Run, and half a dozen other \| services. Then I don't even have to think about my own \| infrastructure unless I need some really specialized \| hardware for a specific application. \| \| These are exactly the kinds of things I was thinking of \| when I noted that the OS itself has been seriously \| diminished in importance, for modern workflows. I agree \| that most commercial or high-profile open-source "cloud" \| tools and platforms are built around LXC/Docker. \| coder543 wrote: \| > Sure, but usual practice with containers is to put each \| thing in its own, unless they are very tightly coupled. \| Web-app with a SQL database and a memory cache? Three \| containers. You can do otherwise, but that's typical. \| Usually each container ends up with one main, important \| running process, and not much else. \| \| I agree, but... getting all the application dependencies \| in there is more than just getting a single binary in \| there. If it's just a single-binary Go program, then a \| Jail works just fine, but it's not that simple for a Ruby \| application. I'm definitely not talking about databases \| running in the same container as the application. That's \| where Kubernetes and docker-compose come in for multi- \| container orchestration, which are things that FreeBSD \| Jails don't have as far as I know. \| \| > These are exactly the kinds of things I was thinking of \| when I noted that the OS itself has been seriously \| diminished in importance \| \| Yes, but... these are all the things that FreeBSD doesn't \| offer. These are the real reasons that people don't talk \| about FreeBSD Jails in the same breath as Docker. The \| Docker container itself (or the FreeBSD Jail) as a unit \| of isolation is the least interesting part of the \| ecosystem. 
All of the developer tools, orchestration \| tools, and prebuilt images are what make the Docker \| universe so interesting, and make FreeBSD Jails... less \| interesting. \| \| You said you were confused why Jails don't have more \| mindshare. It has absolutely nothing to do with people \| being able to invent useless tools and write blog posts \| about them, and it has absolutely nothing to do with \| FreeBSD Jails being _too well documented_. You kind of \| implied those were the best explanations you could come \| up with. Those are not the problems _at all_, and it \| seems disingenuous to me to say you think those are the \| problems unless you _really_ didn't know the things I \| mentioned in my first reply. \| oarsinsync wrote: \| FreeBSD introduced Jails in 1999. \| \| I used my first Jail in 2001. \| \| Docker was started over a decade later in 2013. \| \| It's reasonable to be confused why Jails lacks the \| mindshare. "Because it lacks all these other over-the-top \| features that we need" might be reasonable in response, \| except that Docker didn't have any of these things on day \| 0 either. \| \| Jails had a 14 year head start, Docker reinvents the \| wheel, and not particularly well at first. Why did it \| succeed more than Jails did? It wasn't because of the \| piss-poor native Mac support. \| tptacek wrote: \| It seems pretty obvious that the big thing here is that \| most people ship apps on Linux, not on FreeBSD. \| handrous wrote: \| My personal favorite thing about Docker, and the part I'd \| most miss if I switched to Jails (which I'm fairly \| confident could meet my needs with some fairly simple \| scripts and aliases that wouldn't take me long to arrive \| at, which is why I think there's so much less of an \| "ecosystem" there, even a nascent and under-developed \| one) is the way it forces projects to un-fuck their \| configuration. 
\| \| 500-line config, much of which few people ever care \| about, with all kinds of ill-conceived nesting? Better \| put the ~20 options that 99% of users ever touch in \| environment variables, and document them. Weird state \| garbage that's not captured in your config-on-disk? \| Better figure it out and get it into env vars, and have \| your startup script use those to transparently manage \| whatever bad decisions you made re: state in the past. \| Shit files all over the system? Better get that sorted \| out so people can handle persistence with at the _very_ \| most three total mounts--and oh, gee, look, now your \| simple example docker-compose also serves to document \| where exactly you store files. And so on. \| \| (my second-favorite thing is that it's a de-facto cross- \| distro package manager with very up-to-date packages that \| are trivial to completely and cleanly uninstall) \| vermaden wrote: \| > As a developer, are popular databases and applications \| pre-packaged as FreeBSD Jails so that I can spin one up \| on my laptop with a single command? \| \| The closest you can get is BastilleBSD (framework for \| FreeBSD Jails) and their templates - available here: \| \| https://github.com/BastilleBSD/templates \| https://bastillebsd.org/templates/ \| tptacek wrote: \| I don't know what people generally believe. \| \| But the attack surface of a Linux kernel is very large, is \| pretty unpredictable, and can't be coherently masked out with \| rules (my favorite example is Jann Horn's VM reference count bug, \| which was a simple concurrency flaw in the core virtual memory \| system). By comparison, a Linux KVM hypervisor is not just a \| subset of the kernel by definition, but also a much smaller \| codebase, a tiny fraction of the whole kernel. \| \| Replacing shared-kernel isolation like seccomp-filtered \| containers with VMs is, architecturally, simply the replacement \| of a large trusted computing base with a smaller one. 
If the \| overhead is acceptable, it's hard to argue with from a security \| perspective. \| riobard wrote: \| That's the approach taken by Google's gVisor (at the cost of \| I/O and network performance). \| fsociety wrote: \| gVisor, for better or for worse, does a whole lot of other \| things than just seccomp filtering, and it shows in \| performance tests. \| encryptluks2 wrote: \| gVisor does more than filtering, they basically reimplemented \| the syscalls in an application kernel. At least with seccomp \| the performance overhead is minimal. \| tptacek wrote: \| No, that's really not at all what gVisor is. gVisor is best \| thought of as user-mode Linux --- a complete reimplementation \| of most of the OS kernel. It's not a system call filter; it's \| something much closer to a VM than to seccomp. \| \| gVisor is a very cool codebase. As an illustration of the \| approach: it includes its own TCP/IP stack; we use it in our \| command-line dev tool to allow people to SSH to their VMs \| over WireGuard without having to install WireGuard or obtain \| privileges to manage WireGuard. \| gorkish wrote: \| OK; https://github.com/harvester/harvester \| \| Security and performance aren't the only driving forces; there \| are a lot of technical and operational benefits to the \| abstraction and standard interfaces that you get when running \| stacks that might otherwise look like someone took an Xzibit \| meme too far. \| \| Also remember on a modern system, there are often at least 2 \| additional layers at work abstracting interfaces to the "bare \| metal" OS already. \| encryptluks2 wrote: \| I'm not disagreeing that abstraction can be useful, but the \| overhead of a VM is unnecessary if utilizing the full \| potential of containers. After all, the Linux Kernel is acting \| as the hypervisor already, so might as well trust it enough \| to properly sandbox containers too and use the right \| functionality to do so. 
I also think that running a \| virtualization layer adds quite a bit of complexity, so while \| it is cool that projects and companies have made it work and \| integrated it with a container solution, eliminating the VM \| layer altogether seems more ideal IMO. \| ashishbijlani wrote: \| > Can we somehow combine the advantages of the docker ecosystem \| with VMs? \| \| Shameless plug: this is exactly what our goal is with \| https://kwarantine.xyz We are creating a new hypervisor (from \| scratch) that can run strongly isolated Docker/LXC containers. \| mikepurvis wrote: \| Is this what gvisor is? https://github.com/google/gvisor \| ashishbijlani wrote: \| No, gVisor is from Google. They emulate system calls in user- \| space and use VMs, which increases runtime performance \| overhead. We use hardware virtualization to directly run \| containers -- no I/O emulation, no expensive VM exits, scale \| as needed. Initial comparison with FC/GVisor/Xen here: \| https://github.com/ashishbijlani/kwarantine \| monocasa wrote: \| I'm not sure gvisor requires vm exits. Their first backend \| used ptrace very similarly to how user mode Linux worked. \| \| Minor quip though since ptrace might even be slower than vm \| exits; your core point stands. \| rkeene2 wrote: \| User Mode Linux is still around and works well. I use it \| when I need a "fakeroot" without any special privileges \| on the host. \| \| https://rkeene.org/viewer/tmp/fakeroot.sh.htm \| tptacek wrote: \| It sounds like you just said "yes, but what we're building \| is faster". The userland Linux emulation is a security \| benefit, not a liability. \| amscanne wrote: \| The "fork" sounds like you blue pill the OS for each container? \| I'm assuming the concept is like Cappsule [1] or Bromium [2]? \| \| [1] https://cappsule.github.io/ [2] \| https://en.wikipedia.org/wiki/Bromium#/media/File:Bromium-en... \| ashishbijlani wrote: \| fork here is COW on the host kernel (i.e., copying EPT \| entries). 
We will post detailed technical documentation soon. \| eatonphil wrote: \| There are a few existing projects out there like this (running \| Docker images as virtual machines, specifically) if folks are \| interested. Slim [0] is the one I can remember off the top of my \| head. I think there are a couple more. \| \| Still, neat to have the walkthrough here in this post. \| \| [0] https://github.com/ottomatica/slim \| thekevjames wrote: \| I had fun exploring Docker->VM conversion a while back [1], \| though the larger goal in my case was to be able to make the \| build path to custom GCP VM Images a bit simpler. Exciting to see \| other cases where folks are finding this sort of flow useful! \| \| 1: https://thekev.in/blog/2019-08-05-dockerfile-bootable- \| vm/ind... \| rwmj wrote: \| https://katacontainers.io/ ? \| bonzini wrote: \| Yes, indeed. However it's nice to see directly the mechanisms \| that let Kata do its magic. \| gravypod wrote: \| Something I'd be very interested in: building a PXE image from \| something declarative like Dockerfiles. \| justincormack wrote: \| Try LinuxKit https://github.com/linuxkit/linuxkit \| laurencerowe wrote: \| Google Container Optimized OS is basically this I think. It's \| what's used when you start a GCE instance with a docker image. \| \| https://cloud.google.com/container-optimized-os/ \| OldGoodNewBad wrote: \| I think a lot of folks are going out of their way to \| misunderstand what happened. Yes there are other similar projects \| and containers. No, none come from a long established _COMMUNITY \| RUN PROJECT_. This is something akin to the difference between \| VirtualBox and OpenBSD's vmd. One's a product with a "free" tier, \| the other is a community project. 
\| tptacek wrote: \| As I understand the landscape here, the big enabling win of \| microvms is faster boot time; there's a cool qemu-lite slide deck \| that goes into detail about how they cut down boot time: \| \| https://www.linux-kvm.org/images/d/d2/03x05B-Chao_Peng-Light... \| \| The big win was slashing away the BIOS stuff. \| \| We use AWS's Firecracker to turn our customers' Docker containers \| into Firecracker microvms (Firecracker is Amazon's Rust VMM, the \| engine for Fargate and Lambda). Anecdotally: in my dev \| environment, the difference between Firecracker boot times and \| native Docker container startup is imperceptible; the logging we \| do swamps the VM boot stuff. It's _very_ fast. ___________________________________________________________________ (page generated 2021-06-16 23:00 UTC)