What are the main differences between virtual machines and containers?

Container engine

  • = Docker engine
  • manages, builds and runs Docker containers
  • interacts with user (CLI or API)
    • also translates high-level commands into requests for containerd

Container runtime

  • = containerd [container-dee] and runc
  • 2 parts:
    • high-level container runtime (containerd)
      • long-running daemon process that handles the full lifecycle of the containers
    • low-level container runtime (runc)
      • e.g. when a container is created, runc communicates with the OS kernel to create a separate process for the container
  • abstracts away syscalls and OS details in general

How do these interact?

  • user enters docker run -d nginx in the Docker CLI
  • Docker CLI sends the run command to the Docker Daemon (= container engine)
  • Docker Daemon validates the request and prepares the environment (checks whether the nginx image is available locally and pulls it from the registry if not), then instructs containerd (= container runtime) to create the container
    • the containerd-shim component keeps the container process running (even when containerd itself restarts)
    • runc uses OS kernel features such as Linux namespaces to actually set up the container and then exits (the container process is then parented by containerd-shim, which monitors it and reports back to the higher levels)
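
The hand-offs above can be sketched as a toy model. This is not the Docker API: the function name docker_run_trace and the step strings are made up for illustration, and the sketch only records who talks to whom and in what order.

```python
def docker_run_trace(image: str, local_images: set[str]) -> list[str]:
    """Return the ordered hand-offs for a hypothetical `docker run -d <image>`.

    Purely illustrative: nothing is executed, the steps are only recorded.
    """
    steps = [f"cli -> daemon: run {image}"]
    steps.append("daemon: validate request")
    if image not in local_images:
        # The image is missing locally, so the daemon pulls it first.
        steps.append(f"daemon: pull {image} from registry")
        local_images.add(image)
    steps.append("daemon -> containerd: create container")
    steps.append("containerd: start containerd-shim")
    steps.append("runc: set up namespaces, start container process")
    steps.append("runc: exit")
    steps.append("containerd-shim: monitor container, report status")
    return steps


if __name__ == "__main__":
    for step in docker_run_trace("nginx", local_images=set()):
        print(step)
```

Calling the function a second time with the same local_images set skips the pull step, which mirrors the "is the image available locally?" check.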

Docker terminology

Linux namespaces

  • every Linux process belongs to exactly one namespace of each type
    • just as a namespace in a program isolates names (e.g. packages), a Linux namespace isolates processes in Linux
    • the main reasons are security and isolation
  • connection between namespaces and containers
    • containers are a form of virtualization (they need to be isolated) and the processes “inside” shouldn’t know about processes outside the container
    • when a new container is created, a new set of namespaces is created for it
      • it has its own mounted “filesystem”, its own hostname, PID sequences, users etc.
  • 7 namespace types are covered here (kernels from 5.6 on also add a time namespace):
    • Mount (mnt)
      • isolates filesystem mount points
      • a process in one MNT namespace sees its own set of mount points (regardless of what is mounted on the host)
    • UTS (uts)
      • isolates the hostname (and the NIS domain name)
    • IPC (ipc)
      • isolates communication between processes (message queues, semaphores, shared memory)
      • so processes in different IPC namespaces cannot communicate via shared memory or message queues (unless explicitly allowed)
    • PID (pid)
      • isolates PID number space
      • two processes in different PID namespaces can have the same PID without any collision
    • Network (net)
      • isolates network resources (like network interfaces, routing tables, IP addresses etc.)
      • each container gets its own virtual network interface (typically eth0) and its own IP address
    • User (user)
      • isolates user and group IDs (UID, GID) between processes
      • use case: a process runs as root inside a container but as an unprivileged user on the host
    • Cgroup (cgroup)
      • isolates the view of the cgroup hierarchy (cgroups = control groups, the kernel mechanism for limiting and measuring process resource usage: CPU, memory, I/O etc.)
      • each process only sees its own cgroup subtree, not the cgroups of all processes
        • so it cannot inspect or interfere with resources that are not assigned to it
      • the limits themselves are enforced by the kernel’s cgroup controllers, not by the namespace
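
To see which namespaces a process belongs to, Linux exposes one symbolic link per namespace type under /proc/self/ns, with targets such as pid:[4026531836]. The sketch below reads them; the helper names parse_ns_link and current_namespaces are made up for this example, and it assumes a Linux host (elsewhere it simply returns an empty dict).

```python
import os
import re

NS_DIR = "/proc/self/ns"
# Link targets look like "pid:[4026531836]": namespace type, then inode number.
_LINK_RE = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (type, inode number)."""
    match = _LINK_RE.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def current_namespaces(ns_dir: str = NS_DIR) -> dict[str, int]:
    """Map each entry in ns_dir (mnt, uts, ipc, pid, net, ...) to its inode."""
    result: dict[str, int] = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # not Linux, or /proc is not mounted
    for entry in entries:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue  # skip entries that cannot be read or parsed
        result[entry] = inode
    return result


if __name__ == "__main__":
    for name, inode in sorted(current_namespaces().items()):
        print(f"{name:20} {inode}")
```

Two processes are in the same namespace of a given type exactly when these inode numbers match, so running the script on the host and inside a container should show different numbers for the isolated types.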

OverlayFS and image layering

  • each filesystem-changing instruction in a Dockerfile (e.g. RUN, COPY, ADD) “adds” one read-only layer, so there could be a lot of layers in the image/container
  • OverlayFS mechanism creates a unified “merged” view of all the layers - so the processes in the container see it as one writable filesystem and can work with it
    • it uses the “copy on write” mechanism in the background
    • reading files - OverlayFS first looks in the upper (writable) layer and then falls through to the lower layers, returning the first file found (so higher layers can shadow the lower ones)
    • writing files - OverlayFS looks for the file; if it is in the lower image layers, it gets copied up into the upper writable layer and is modified there (the original file remains untouched)
    • deleting files - OverlayFS looks for the file and then creates a special “whiteout” file in the upper writable layer, signaling that this file is “deleted”
      • the original file remains untouched
  • why?
    • storage efficiency - multiple containers can share the same base image (since image layers are read-only)
      • and each container has its own “upper writable” FS layer
    • images are immutable and we can rely on that
    • speed - a newly created container needs only one empty upper writable layer (and creating that does not take much time)
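
The read, write and delete rules above can be mimicked with plain dictionaries. This is a simplified model, not how the kernel implements OverlayFS: the class OverlayView, its method names and the WHITEOUT marker are made up for illustration, and each layer is just a dict from path to content.

```python
WHITEOUT = object()  # stand-in for the special "whiteout" file


class OverlayView:
    """Merged view over read-only lower layers plus one writable upper layer."""

    def __init__(self, lower_layers):
        # Lower layers are ordered top-most first and are never modified.
        self.lower = lower_layers
        self.upper = {}

    def read(self, path):
        # The upper layer wins; a whiteout hides anything below it.
        if path in self.upper:
            if self.upper[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        for layer in self.lower:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, extra):
        # Copy-up: take the visible content, store the changed copy in upper.
        self.upper[path] = self.read(path) + extra

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the file is not visible
        if any(path in layer for layer in self.lower):
            self.upper[path] = WHITEOUT  # hide the lower copy, keep it intact
        else:
            del self.upper[path]  # file existed only in the upper layer


if __name__ == "__main__":
    base = {"/etc/os-release": "debian", "/app/config": "v1"}
    top = {"/app/config": "v2"}
    a, b = OverlayView([top, base]), OverlayView([top, base])
    a.append("/app/config", "+local")
    print(a.read("/app/config"), b.read("/app/config"))
```

Two views built on the same lower layers model two containers sharing one image: a change made through one view lands only in its own upper dict, so the other view and the shared layers stay as they were.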