It started with a misconfigured CI runner.

A developer had a Jenkins pipeline building Docker images. The container ran as root. A dependency had a known RCE vulnerability. When the exploit landed, the attacker had root inside the container, and from there it was a short step to root on the host. They pivoted to the secrets store, grabbed credentials, and spent three weeks inside the network before anyone noticed.

I’ve seen variations of this scenario more times than I’d like. And every time, the same assumption was at the root of it: people thought containers were a security boundary. They aren’t.

Containers are a convenience boundary. They isolate processes well enough that applications don’t step on each other. But they share the host kernel, and that changes everything about the security model.

This post is a practical walkthrough of the controls that actually matter. Not a threat model lecture. Not a compliance checklist. The things I configure on every containerized workload I run, and why each one earns its place.

The Myth of Container Isolation

Let me be specific about what a container actually is, because the mental model most developers carry is wrong.

A container is not a virtual machine. A VM runs a complete OS with its own kernel, fully isolated from the host. A container is just a combination of Linux namespaces, control groups (cgroups), and a layered filesystem. Namespaces restrict what the process can see (its own PID tree, network interfaces, filesystem mounts). Cgroups limit how much CPU and memory it can consume. The layered filesystem gives it the appearance of a complete OS installation.
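You can see these building blocks on any Linux host: each process's namespace memberships are exposed under /proc. A containerized process is simply one whose entries point at different namespace objects than the host's.

```shell
# Each symlink here is a namespace this process belongs to
# (pid, net, mnt, uts, ipc, and so on). Run the same command
# inside a container and the namespace IDs differ from the host's.
ls -l /proc/self/ns
```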

But the kernel is shared. Every container on a host runs on the same kernel. If that kernel has an unpatched vulnerability, every container is potentially exposed. Dirty COW. Dirty Pipe. The runc container escape (CVE-2019-5736) in 2019. These aren’t theoretical. They’re real exploits that worked precisely because the kernel boundary isn’t as hard as people assumed.

What does this mean practically?

A root process inside a container is root on the host if containment fails. A container with excessive Linux capabilities can call kernel interfaces that no application should ever touch. An image built from an old base layer carries every unpatched CVE from that base, and those CVEs are running inside your production infrastructure right now.

The good news: most of these risks are controllable. You just have to know which controls to apply.

Never Run as Root

This is the most impactful single change you can make, and most base images still default to root. Check for yourself:

docker run --rm ubuntu whoami
# root

That’s root. On a fresh, unmodified Ubuntu base image. And most application images are built from that kind of base without ever adding a non-root user.

Here’s the problem with running as root. If an attacker finds an exploitable vulnerability in your application, they execute code with root privileges. That dramatically expands what they can do: write to any file, read any file, modify network configuration, and potentially escape the container entirely if a kernel exploit is available.

Running as a non-root user limits the blast radius. A vulnerability in a process running as uid 1000 with no special capabilities is much harder to turn into a full system compromise. The attacker has to escalate privileges before doing anything interesting, and that escalation is another opportunity to detect and block them.

Adding a non-root user to a Dockerfile is straightforward:

FROM node:20-alpine

WORKDIR /app

# Create a non-root user and group up front
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

# Copy and install dependencies as root (before switching users)
COPY package*.json ./
RUN npm ci --omit=dev

# Copy application code, owned by the non-root user
# (COPY --chown avoids a separate chown layer that duplicates every file)
COPY --chown=appuser:appgroup . .

# Switch to non-root user
USER appuser

EXPOSE 3000
CMD ["node", "server.js"]

The USER directive applies to everything after it, including the CMD. When the container starts, your application process is non-root.

The complication: volume mounts. If you mount a host directory into the container, the files keep their original host ownership. If they’re owned by uid 0 (root) or some other uid, your non-root container user may not be able to read or write them. The fix is to set volume ownership explicitly: chown the host directory before mounting, use an init container in Kubernetes to fix permissions at startup, or, for volume types that support it, set fsGroup in the pod securityContext.
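A sketch of the init-container approach, with illustrative volume names, paths, and uids: the init container runs as root just long enough to fix ownership, then the app container starts as the non-root user.

```yaml
spec:
  initContainers:
  - name: fix-permissions
    image: busybox:1.36
    command: ["sh", "-c", "chown -R 1000:1000 /data"]
    securityContext:
      runAsUser: 0          # root only for this short-lived setup step
    volumeMounts:
    - name: app-data
      mountPath: /data
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      runAsUser: 1000       # the app itself never runs as root
    volumeMounts:
    - name: app-data
      mountPath: /data
  volumes:
  - name: app-data
    hostPath:
      path: /srv/app-data   # hypothetical host directory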

Entrypoint scripts that need root for setup (installing packages, configuring the OS) are a related trap. Do that work at image build time, not at container start time. If your entrypoint script requires root, you haven’t finished moving the initialization into the Dockerfile.
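A minimal sketch of that move, assuming the entrypoint previously installed an OS package (curl here is a stand-in): the installation happens at build time while the build still runs as root, and nothing after USER needs elevated privileges.

```dockerfile
FROM node:20-alpine

# Build-time setup that needs root: install OS packages here,
# not in an entrypoint script at container start.
RUN apk add --no-cache curl

RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser

# From here on, nothing runs as root -- including the entrypoint.
CMD ["node", "server.js"]
```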

In Kubernetes, you can enforce this at the pod spec level:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000

Set runAsNonRoot: true and Kubernetes will refuse to start a container that would run as root. Pair it with runAsUser and runAsGroup to be explicit rather than relying on whatever uid the image specifies.

Read-Only Filesystems

If your application doesn’t need to write to its own filesystem at runtime, take that ability away.

A read-only filesystem stops a whole category of attack in its tracks. Malware can’t write itself to disk. Configuration files can’t be tampered with after the container starts. Adversaries can’t drop tools, scripts, or payloads anywhere that persists between requests.

In Docker:

docker run --read-only --tmpfs /tmp myapp:latest

In Kubernetes:

securityContext:
  readOnlyRootFilesystem: true

The wrinkle is that most applications write something at runtime: log files, temporary uploads, process IDs, socket files. These writes fail immediately when the filesystem is read-only, and the first time you enable this on a production workload that wasn’t built for it, things will break.

The fix is tmpfs mounts. A tmpfs mount is an in-memory filesystem that allows writes but doesn’t persist beyond the container’s lifetime. For ephemeral writes, this is exactly what you want.

Here’s a Kubernetes pod spec that combines a read-only root filesystem with tmpfs volumes for the directories that need writes:

apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  containers:
  - name: app
    image: myapp:latest
    securityContext:
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    - name: var-run
      mountPath: /var/run
  volumes:
  - name: tmp
    emptyDir:
      medium: Memory
  - name: var-run
    emptyDir:
      medium: Memory

emptyDir with medium: Memory gives you a tmpfs mount. Writes succeed, but nothing persists when the container restarts.

One important distinction: tmpfs is fine for scratch space and runtime-generated ephemeral data. It is not appropriate for application logs you intend to keep. Logs should go to stdout/stderr and be collected by your logging infrastructure. Don’t write logs to disk inside the container at all; let Kubernetes capture them from the container’s standard output.

Drop Linux Capabilities

This one gets skipped more often than it should, usually because people don’t fully understand what Linux capabilities are.

The Linux kernel divides the traditional root privilege into around 40 distinct capabilities. CAP_NET_BIND_SERVICE lets a process bind to ports below 1024. CAP_SYS_ADMIN is a sprawling catch-all covering mounting filesystems, namespace manipulation, and dozens of other privileged operations. CAP_SETUID lets a process switch to arbitrary uids, including uid 0, which makes it a key building block for privilege escalation.

Docker adds a default set of capabilities to every container, and the list is broader than most applications need. You can see the defaults in the Docker documentation, but the short version is: your average web application doesn’t need CAP_NET_ADMIN, CAP_SYS_CHROOT, or CAP_AUDIT_WRITE. They’re there by default anyway.
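You can inspect a process's capability sets directly through /proc. CapEff is the effective set, encoded as a hex bitmask; where capsh is installed, `capsh --decode=<mask>` translates it into capability names.

```shell
# Print this process's effective capability bitmask.
# Run inside a default Docker container, this reflects the set the
# runtime granted; in an unprivileged host shell it is typically all zeros.
grep CapEff /proc/self/status
```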

The right approach is to drop everything and add back only what you actually need:

docker run \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  myapp:latest

In Kubernetes:

securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE

The challenge is figuring out which capabilities you actually need. The honest answer is: trial and error, or tooling.

For trial and error, start with --cap-drop=ALL and run your application through its normal operations. When something fails with a permissions error, check whether a specific capability would fix it. Add the minimum needed. This works but takes time.

A more systematic approach is to use strace to see what system calls your application makes, then cross-reference with which capabilities those calls require. It’s tedious but precise.

Tools like docker-slim can analyze your container and produce a minimal profile, including capabilities. I’ve used it on several projects and it’s useful for both capability analysis and image size reduction.

The investment is worth it. A container with --cap-drop=ALL and a minimal add-back has a dramatically smaller attack surface than one running with Docker’s default capability set. Kernel exploits that require specific capabilities simply won’t work.

Image Scanning

Your application code might be clean. Your base image might not be.

Container images are built on layers. Each layer is an immutable snapshot of a filesystem state. Your application layer sits on top of a base image layer, which sits on top of an OS layer. That OS layer was built at a specific point in time and contains every library and binary that was current when it was built. Libraries that now have published CVEs.

Unpatched base images are one of the most common sources of container vulnerabilities in production. The image you built six months ago has been running fine, but it’s accumulating CVE debt the entire time.

The practical fix is two parts: scan images in CI before they’re deployed, and rebuild images regularly (or at least be alerted when your base image gets a security update).

Trivy is my default recommendation. It’s free, fast, and genuinely easy to add to a CI pipeline:

# Install Trivy (via Homebrew, on macOS or Linux)
brew install aquasecurity/trivy/trivy

# Scan a local image
trivy image myapp:latest

# Fail build if CRITICAL or HIGH vulnerabilities found
trivy image --exit-code 1 --severity CRITICAL,HIGH myapp:latest

A minimal GitHub Actions step:

- name: Scan image for vulnerabilities
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: myapp:${{ github.sha }}
    format: table
    exit-code: 1
    severity: CRITICAL,HIGH

With exit-code: 1, the build fails if critical or high vulnerabilities are found. This gates deployments on scan results.

Docker Scout and Snyk are solid alternatives if you’re already in those ecosystems. The specific tool matters less than the habit of scanning consistently.
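For the rebuild half, a scheduled CI trigger keeps images from accumulating CVE debt between code changes. A sketch for GitHub Actions (the cron expression is illustrative):

```yaml
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 6 * * 1"   # rebuild and rescan every Monday, even with no code changes
```

The scheduled run rebuilds the image against the latest base layers and re-runs the scan, so a CVE published after your last push still fails the pipeline within a week.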

The other lever is base image choice. A full Ubuntu image has hundreds of packages, most of which your application never uses. Each one is a potential CVE surface. Distroless images (from Google) contain only the runtime your application needs. Alpine-based images are small and minimal. The smaller the base image, the smaller the attack surface, and the fewer vulnerabilities you’ll find in scans.

For a Go application, the distroless base is remarkably small:

FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o myapp .

FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/myapp /myapp
USER nonroot:nonroot
ENTRYPOINT ["/myapp"]

The final image contains the static binary and nothing else. No shell, no package manager, no debugging tools. Trivy will find very little to flag.

Secrets: The One People Still Get Wrong

I keep seeing this. Not in greenfield projects. In real, production workloads, maintained by experienced developers who should know better.

ENV DATABASE_PASSWORD=supersecretpassword

Don’t do this.

ENV variables set in a Dockerfile are baked into the image. They appear in docker inspect. They appear in image layer history. Anyone with pull access to the image registry can read them:

docker inspect myapp:latest | grep -i password
# "DATABASE_PASSWORD": "supersecretpassword"

The same problem applies to COPYing secret files into an image. Even if you delete the file in a later layer, the file exists in an earlier layer and can be extracted with docker history or by inspecting the layer directly.

Don’t put secrets in images. Not in ENV, not in ARG (which shows up in build history), not by COPY.

The correct pattern is runtime injection. The application receives secrets at runtime, not at build time. The mechanisms for this differ by platform:

Kubernetes Secrets are the baseline. They’re not encrypted at rest by default (enable etcd encryption), but they’re better than baking secrets into images. Mount them as environment variables or files:

env:
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password
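Mounting the same Secret as files works similarly: each key in the Secret becomes a file under the mount path, which some applications and sidecars prefer over environment variables.

```yaml
volumeMounts:
  - name: db-credentials
    mountPath: /etc/secrets   # password is readable at /etc/secrets/password
    readOnly: true
volumes:
  - name: db-credentials
    secret:
      secretName: db-credentials
```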

HashiCorp Vault is the more robust answer for organizations with complex secret management needs. The Vault Agent Injector can inject secrets directly into pods at startup without the application needing to know anything about Vault.

AWS Secrets Manager / GCP Secret Manager / Azure Key Vault are the cloud-native options if you’re on a specific cloud provider.

I have a dedicated post on managing secrets in Kubernetes at /posts/2026/kubernetes-secrets-management/ if you want more depth on the Kubernetes-specific patterns.

One more connection worth making: image scanning catches hardcoded secrets too. Trivy has secret scanning built in. It will flag API keys, passwords, and private keys found in image layers. This is another reason scanning in CI pays for itself; it catches the mistake before the image ever reaches a registry.

A Practical Starting Checklist

None of these controls are complicated to implement. Each one is an independent improvement you can make to any containerized workload right now. Here’s the floor:

Non-root user. Add a non-root user to your Dockerfile. Switch to it with USER. Enforce it with runAsNonRoot: true in Kubernetes. This is the highest-leverage single change you can make.

Read-only filesystem + tmpfs for writes. Set readOnlyRootFilesystem: true. Mount tmpfs volumes for /tmp and any other paths your application writes to. Log to stdout, not to a file.

Drop all capabilities, add back only what’s needed. Start with --cap-drop=ALL. Test your application. Add capabilities one at a time as needed. Document why each one is there.

Minimal base image + regular scanning in CI. Use distroless or Alpine base images. Add Trivy (or equivalent) to your CI pipeline. Fail builds on CRITICAL/HIGH CVEs. Rebuild images when base image security updates are released.

No secrets in images or ENV. Use runtime secret injection. Kubernetes Secrets are a starting point; Vault or a cloud secrets manager is the more durable answer. Scan images for hardcoded secrets as part of CI.

This is the floor, not the ceiling. There’s more: network policies, pod security standards, admission controllers, runtime threat detection (Falco is excellent for this), supply chain security (Sigstore/Cosign for image signing). Security is iterative. You don’t have to solve everything at once.
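As one example of that next layer, a default-deny ingress NetworkPolicy is the usual starting point for network segmentation: it blocks all inbound traffic to pods in the namespace until you add explicit allow rules.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}     # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress           # no ingress rules listed, so all ingress is denied
```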

But if you implement this checklist on every container you run, you’ve eliminated most of the trivially exploitable weaknesses. You’ve taken away root access, removed the ability to tamper with the filesystem, stripped unnecessary kernel privileges, ensured known CVEs are caught early, and closed the most common secrets exposure patterns.

That’s a meaningfully more secure workload than the default. Start there.


Got questions about any of these controls, or running into complications with specific workloads? I’m happy to dig into specifics. Find me on X or LinkedIn.