(Replying to PARENT post)

I've had this sneaking but hard to articulate suspicion that datacenters, bare metal servers, VMs, operating systems, containers, OS processes, language VMs, and threads are all really attempts to abstract the same thing.

You want to run business code in a way that's protected from other business code but also able to interact with other business code and data in a well defined way.

I also have this sneaking suspicion that new generations are re-inventing the wheel in a lot of ways. If you have minimal containers running on a hypervisor, how is that different from processes running on an operating system? You have all these CPU-provided virtualization instructions to protect guests from each other and the host from the guests, but there's no reason those instructions couldn't have been developed to protect processes from each other. You have indirections to protect one guest from accessing another's storage, but there's no reason processes couldn't have the same protections (and they do in many operating systems). Why container orchestration and overlay networks instead of OS scheduling and IPC?
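
To make the comparison concrete, here's a toy sketch (illustration only, not production code): on Linux, a "container" really is just an ordinary process created with a few extra namespace flags.

    /* Toy example: clone() a child into its own PID, mount, and UTS
       namespaces -- the same kernel facilities container runtimes build on.
       Needs root (or a user namespace) to actually succeed. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int child_main(void *arg) {
        (void)arg;
        /* Inside the new PID namespace, this process sees itself as PID 1. */
        printf("child thinks it is pid %d\n", getpid());
        return 0;
    }

    int main(void) {
        int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | SIGCHLD;
        /* The stack grows down, so pass the top of the buffer. */
        pid_t pid = clone(child_main, child_stack + sizeof(child_stack), flags, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        printf("parent sees the child as pid %d\n", pid);
        waitpid(pid, NULL, 0);
        return 0;
    }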

I'm sure people in academic computer science have already published many a paper about this, but it feels different seeing it from inside an IT organization, where people don't seem to apply the lessons from older technologies to newer ones. We end up in this constant churn of reinvention which, as far as I can tell, is mostly a way for people, both in management and in the trenches, to keep their jobs, at least until you're "too old to learn new things" and get pushed aside.

👤dopylitty🕑6y🔼0🗨️0

(Replying to PARENT post)

And let me state again for the record that all of these promises being made by container systems sound an awful lot like the promises I was offered by 'real' operating systems in the early nineties.

I think the only real difference is that there has been a sea change in public opinion about whether this kind of aggressive isolation-by-default is worthwhile.

But a hypervisor publishing a bunch of services that talk to the world and each other? Things are beginning to look a bit more like microkernels as time goes on.

👤hinkley🕑6y🔼0🗨️0

(Replying to PARENT post)

"Containers" is an unfortunate term, since it really better describes the container image than an actual running process with API virtualisation.

I think VMs-as-containers is where we'll wind up. The container image has turned out to be the real thing of interest, the runtime is almost secondary. Virtual machine systems have closed the performance gap in a variety of ways.

For example: tearing out kernel checks for devices that will never be connected to the VM; taking advantage of hyper-privileged CPU instructions; being able to make and restore higher-fidelity checkpoints than an OS can, for faster launches; and so on.

At which point the isolation benefits of VMs really begin to outweigh everything else. A hypervisor has a much smaller attack surface and has a much simpler role than a full monolithic OS kernel. It partitions the hardware and that's about it. It doesn't exist in a constant tension between kernel-as-resource-manager and kernel-as-service-provider.

👤jacques_chester🕑6y🔼0🗨️0

(Replying to PARENT post)

I'm quite hopeful for the limited re-introduction of hypervisors to "containment", if only because I've become disappointed with the lack of power app authors have to specify fine-grained, strict security details. This is because most of the fun lockdown toys that Linux has (SELinux, seccomp, et al.) aren't easily "stackable", and those toys have been used by the orchestrator to perform the containment. And that's great, but it means containers all end up with broad, generic policies (that only care about container breakout) which can't be further restricted by app/container authors.
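
For a flavour of the kind of per-app tightening I mean, here's a minimal sketch of a process installing an extra seccomp-BPF filter on itself, on top of whatever policy the runtime already applied (the choice of blocked syscall is purely illustrative):

    /* Sketch only: deny openat() with EPERM for this process, allow
       everything else. An installed filter can only ever narrow what the
       outer policy permits. A real filter should also check
       seccomp_data->arch. */
    #include <errno.h>
    #include <fcntl.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    int main(void) {
        struct sock_filter filter[] = {
            /* Load the syscall number. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            /* If it's openat, fall through to the ERRNO return; else skip it. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        /* Required so an unprivileged process may install a filter. */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
            perror("prctl(SECCOMP_MODE_FILTER)");
            return 1;
        }

        /* From here on, openat() fails with EPERM in this process. */
        if (openat(AT_FDCWD, "/etc/hostname", O_RDONLY) < 0)
            perror("openat (expected to fail)");
        return 0;
    }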

My hope is that lightweight virtualisation will give app authors back the ability to tighten their containers' straitjackets. Personally I've got my eye on Kata Containers.

👤ris🕑6y🔼0🗨️0

(Replying to PARENT post)

I have a Xen hypervisor at home, running on a (well-configured) NUC. VMs boot in about 15 seconds. I'm patient!

I'm not doing devops stuff; I no longer code, so I don't need a testing pipeline.

I looked into containers based on LXC when it was first introduced. I decided to stay away - I don't want to get tied into Poettering's code. Yeah, I'm running systemd on some of the VMs, but you don't really have much choice nowadays. I still use Debian, but I'm bitter about their adoption of systemd - and my hypervisor machine runs SysV init, because I know how that works.

I never mixed it with Docker. From what I've been reading, Docker is already old hat (it's only about 3 years old; how did that happen?). Kubernetes seems to be the thing nowadays. I don't even know how to pronounce "Kubernetes".

What was wrong with LXC? Like, LXC comes with the OS. Nothing to install. Why do people love these 3rd-party container engines?

And "wrappers"? what purpose is served by container wrappers?

Serious questions, I'm not trolling. Promise. I'm just a bit out-of-date.

👤denton-scratch🕑6y🔼0🗨️0

(Replying to PARENT post)

> The Docker engine default seccomp profile blocks 44 system calls today, leaving containers running in this default Docker engine configuration with just around 300 syscalls available.

...preventing devs/ops people from running tools like iotop unless extra capabilities are added.

I'm all in for containers and cgroups/namespaces, but at the moment it's namespace isolation at the price of fewer features. Unless namespaces become first-class citizens in the Linux kernel, it will always be more efficient to just run on VMs or even bare metal. At least for non-planet-scale workloads. :-)

👤blablabla123🕑6y🔼0🗨️0

(Replying to PARENT post)

I see container systems such as Docker as packaging systems more than anything else.
👤xfitm3🕑6y🔼0🗨️0

(Replying to PARENT post)

Just waiting for the next rediscovery of hypervisors, but with better marketing names.
👤pjmlp🕑6y🔼0🗨️0

(Replying to PARENT post)

The future is probably WASI: a sandboxed compile target with a capability-based security model, not owned by a single corporation. https://wasi.dev
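
A rough sketch of what the capability model means in practice (assuming the wasi-sdk toolchain and a runtime such as wasmtime; the paths and file names are purely illustrative): a plain C program compiled to WASI has no ambient filesystem access and can only open paths under directories the host explicitly grants, e.g. "wasmtime --dir=. demo.wasm".

    /* Illustration only: build with the wasi-sdk
       (clang --target=wasm32-wasi demo.c -o demo.wasm). */
    #include <stdio.h>

    int main(void) {
        /* Works only if "." was pre-opened for the module (e.g. --dir=.). */
        FILE *ok = fopen("./notes.txt", "r");
        printf("./notes.txt: %s\n", ok ? "opened" : "not reachable");
        if (ok) fclose(ok);

        /* Fails regardless of host permissions unless "/" was granted. */
        FILE *denied = fopen("/etc/passwd", "r");
        printf("/etc/passwd: %s\n", denied ? "opened" : "not reachable");
        if (denied) fclose(denied);
        return 0;
    }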
👤spion🕑6y🔼0🗨️0

(Replying to PARENT post)

Is there still any place in 2019 for "system containers" like LXC/LXD?
👤curt15🕑6y🔼0🗨️0

(Replying to PARENT post)

Key point: AWS Firecracker does NOT run on AWS.

Unless you want to pay for bare metal instances.

AWS Firecracker DOES run on Google Cloud, Azure and DigitalOcean.

👤kresten🕑6y🔼0🗨️0

(Replying to PARENT post)

For the record, hypervisors were introduced almost before operating systems. In the beginning it was just a switch to partition memory and run multiple jobs on a big, expensive machine.

So this is not the first time hypervisors have made a comeback.

👤panpanna🕑6y🔼0🗨️0

(Replying to PARENT post)

Wait... this isn't new - hypervisor-based container isolation has been in Windows Server since 2016 (it's called Hyper-V containers).
👤jasoneckert🕑6y🔼0🗨️0

(Replying to PARENT post)

So I read something succinct a while back - Docker et al. are distribution platforms, not security platforms. They add "0" security against an adversary (think padlocks). How true is it? And what high-performance, securely contained systems are there? OpenVZ?
👤nunchuckninja🕑6y🔼0🗨️0

(Replying to PARENT post)

I’ve been thinking about these problems for a while. Previously, I thought that the “put a VM on it” approach was the right one. In 2015, I wrote novm [1], which I think served as inspiration for some developments that followed. My thinking has changed over the years and I actually work on gVisor today (disclaimer!). I’d like to share some thoughts here.

Hypervisors never left. They are a fundamental building block for infrastructure and will continue to be.

The question is whether there will be a broad shift to start relying on hypervisors to isolate every individual application. In my opinion, just wrapping containers in VMs is not a solution. (Nor do I find it technologically interesting, but that’s me.) I agree that the approach addresses some of the challenges of isolation, but is one step forward, two steps back in other ways.

Virtualizing at the hardware boundary lets you do some things very well. For example, device state is simple, and hardware support lets you track dirty memory passively and efficiently, so you can implement live migration for virtual machines much better than you could for processes. It can divide big machines into fungible, commodity sizes (freeing applications from having to care about NUMA, etc.). It lets you pass through and assign hardware devices. It gives you a strong security baseline.

But abstractions work best when they are faithful. Virtual machines operate on virtual CPUs, memory and devices, and operating systems work best when those abstractions behave like the real thing. That is, CPUs and memory are mostly available, and hardware acts like hardware (works independently, interactions don’t stall).

Containers and applications operate on OS-level abstractions: threads, memory mappings, futexes, etc. These abstractions are the basis for container efficiency — not because startup time is fast, but because these abstractions allow for a lot of statistical multiplexing and over-subscription while still performing well. The abstractions provide a lot of visibility for the OS to make good choices with global information (e.g. informing the scheduler, reclaim policy, etc.).

A problem arises when you decide that you want to bind single applications to single VMs, and then run many VMs instead of many containers. Effectively, the abstractions that you expose are now CPUs and memory, and these just don’t work as well for over-subscription and overall infrastructure efficiency. There’s no shared scheduler or cooperative synchronization (e.g. in an OS, threads waking each other will be moved to the same core), there’s no shared page cache, etc.

There are other problems too: virtualization gives you a very strong security baseline, but you have to start punching significant holes to get the container semantics you want. E.g. the cited virtfs is a great example: it’s easy to reason about the state of a block device, but an effective FUSE API (and shared memory for metadata) is a much larger system surface. The hardware interface itself is not a silver bullet. Devices are still complex (escapes happen), and the last few years have taught us that even the hardware mechanisms can have flaws. For example, AFAIK Kata Containers is still vulnerable to L1TF unless you’re using instance-exclusive cpusets or have disabled hyper-threading. (Whereas native processes and containers are not vulnerable to this particular bug.)

The “put a VM on it” approach also may not have the standard image problems that plain hypervisors have, but you’ve got portability challenges. It seems non-ideal that a container isolation solution can run in infrastructure X and Y, but not in standard public clouds or your on-prem VMware hosts, etc. (There might be specific technologies for each case, but that’s rather the point.)

That’s my 2c. I’m pretty optimistic that we can have strong isolation while still preserving the efficiency, portability and features of container-based infrastructure. I like a lot of these projects (especially the ones doing technologically interesting things, e.g. nabla, x-containers, virtfs, etc.) but I don’t think the straight-up “put a VM on it” approach is going to get us there.

👤amscanne🕑6y🔼0🗨️0