Removing HCS containers

The Windows workers in the windows-x86_64 pool have a habit of quietly falling off the cluster. The ocluster-worker service is still RUNNING, but the scheduler lists the worker as disconnected.

I have been manually restarting these workers when this happens, but it’s time to investigate why. The worker log shows that it’s trying to prune, but the items can’t be pruned.

OBuilder partition: 28% free, 0 items
Pruning 100 items
Exec "ctr" "snapshot" "rm" "obuilder-base-63e5...-committed"
...
Pruned 0 items

The workers begin to prune old layers when the store has less than 30% free space. The prune uses a least-recently-used (LRU) strategy. After deleting the children, an independent sweep attempts to remove any base images that no longer have children.
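To make the pruning rule concrete, here is a minimal sketch of an LRU selection over an in-memory model. The item record, lru_candidates, and should_prune are hypothetical names for illustration, not OBuilder’s actual schema or code; the only facts taken from above are the 30% threshold, the LRU ordering, and that only childless layers can be removed.

```ocaml
(* Hypothetical in-memory model of the store; not OBuilder's real schema. *)
type item = {
  id : string;
  parent : string option;  (* None for a base image *)
  last_used : float;       (* timestamp of the last use *)
}

(* An item is a leaf when no other item names it as its parent. *)
let is_leaf items it =
  not (List.exists (fun other -> other.parent = Some it.id) items)

(* Pick up to [n] non-base leaves, least recently used first. *)
let lru_candidates ~n items =
  items
  |> List.filter (fun it -> it.parent <> None && is_leaf items it)
  |> List.sort (fun a b -> compare a.last_used b.last_used)
  |> List.filteri (fun i _ -> i < n)
  |> List.map (fun it -> it.id)

(* Pruning starts only when free space drops below the 30% threshold. *)
let should_prune ~free_percent = free_percent < 30.0
```

With an empty item list, lru_candidates returns nothing however large n is, which is the shape of the "Pruning 100 items" followed by "Pruned 0 items" pair in the log.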

The log shows that it wanted to prune 100 snapshots but found none to prune. It then tried to clean up by deleting base images, in the hope that some were childless.

I tried deleting one of the base images from the log with ctr snapshot rm:

ctr: failed to remove "obuilder-base-63e5...-committed":
  cannot remove snapshot with child: failed precondition

On its own, that error is expected: it just means the base image still has children. The problem is that no children had been pruned. The containerd store had 803 obuilder-* snapshots, but OBuilder’s SQLite database had none. The two had drifted completely apart: OBuilder could not clean up the hundreds of layers it no longer had records for, and the base layers could not be removed because they still had children.
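The drift can be expressed as a set difference between the names in the snapshot store and the ids in the database. The sketch below assumes both have already been collected into lists of strings and normalised to the same naming scheme; the function name orphans and its labelled arguments are hypothetical.

```ocaml
module S = Set.Make (String)

(* Snapshots that exist in the store but have no row in the database.
   The result is sorted, because Set elements are returned in order. *)
let orphans ~store ~db =
  S.diff (S.of_list store) (S.of_list db) |> S.elements
```

When the database list is empty, every snapshot in the store is reported as an orphan, which is the situation described above.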

The question now is, how do you end up with 800 snapshots that the database has forgotten?

Leaked layers could potentially come from restarting the server, but I’ve only done this a handful of times, which can’t really account for 800 layers. Running the log through grep showed 386 User cancelled job and 239 Timeout events. OCaml-CI (the user) cancels jobs when the PR is superseded or the build times out. Thus cancellations are the common case, not the exception.

And there were 18 leftover containers:

> ctr container ls
obuilder-run-376095    -    io.containerd.runhcs.v1
obuilder-run-376449    -    io.containerd.runhcs.v1
... (18 of them) ...

> ctr task ls
TASK    PID    STATUS

Eighteen obuilder-run-* containers and no running tasks. These dead containers were never removed, and they keep the snapshot tree and base image from being deleted.

OBuilder’s HCS sandbox runs each build step like this:

let cmd = [t.ctr_path; "run"; "--rm"; ...; container_id] in
let proc = Os.exec_result ?stdin ~stdout ~stderr ~pp cmd in
Lwt.on_termination cancelled (fun () ->
  ...
  Os.exec_result [t.ctr_path; "task"; "kill"; "-s"; "SIGKILL"; container_id] ~pp);

ctr run --rm runs the container and deletes it when it exits. On cancellation, the code sends ctr task kill -s SIGKILL. This kills the container, ctr run notices, and the --rm cleanup runs. Or does it?

Start a long-running container with --rm:

> ctr run --rm mcr.microsoft.com/windows/nanoserver:ltsc2025 \
    orphan-test cmd /c "ping -n 40 127.0.0.1"
Pinging 127.0.0.1 with 32 bytes of data:
Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
...

Then, from another session, kill the task the way the sandbox does on cancel:

> ctr task kill -s SIGKILL orphan-test

And check what is left:

> ctr container ls
orphan-test    mcr.microsoft.com/windows/nanoserver:ltsc2025    io.containerd.runhcs.v1

The container is still there. --rm did not remove it. The only thing that reliably removes the container is doing it explicitly:

> ctr task delete --force orphan-test
> ctr container rm orphan-test

So the conclusion is that ctr run --rm only deletes the container when the run exits cleanly. If the task was killed because the job was cancelled, the container remains and holds its snapshot.

Putting this together, a complete failure case looks like this:

1. A build is cancelled or times out. ctr task kill stops it, but --rm never removes the container, so it lingers and keeps hold of the snapshot and all of its parents.
2. OBuilder tries to delete the build’s snapshot and fails because the container still has it open. The failure is nevertheless recorded as a success, since delete always returns unit: val delete : t -> id -> unit Lwt.t. The database row is removed while the snapshot stays.
3. That orphaned layer holds the parent layer, so the parent’s delete fails too.
4. Eventually, the database is emptied, and the LRU pruner has nothing to prune.
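The second step of that trace is where the store and the database part ways. As a rough illustration of one way to keep them in step, the sketch below has delete return a result and drops the database row only when the snapshot removal succeeded. Everything here is a hypothetical model built on plain hash tables (the store record, remove_snapshot, the `In_use error); it is not OBuilder’s store interface, and it is synchronous rather than Lwt-based.

```ocaml
(* Hypothetical model: [snapshots] is the snapshotter's contents and
   [pinned] holds the snapshots still referenced by a container. *)
type store = {
  snapshots : (string, unit) Hashtbl.t;
  pinned : (string, unit) Hashtbl.t;
}

(* Removing a snapshot fails while a container still references it. *)
let remove_snapshot store id =
  if Hashtbl.mem store.pinned id then Error (`In_use id)
  else begin
    Hashtbl.remove store.snapshots id;
    Ok ()
  end

(* Drop the database row only after the snapshot is really gone. *)
let delete store db id =
  match remove_snapshot store id with
  | Ok () -> Hashtbl.remove db id; Ok ()
  | Error _ as e -> e
```

With this shape, a failed removal leaves the row in place, so the pruner can retry the layer later instead of forgetting it.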

My proposed solution to this is to stop relying on --rm. In the HCS sandbox, remove the container on every exit path:

(Os.exec_result [t.ctr_path; "task"; "delete"; "--force"; container_id] ~pp
   >>= fun _ -> Lwt.return_unit) >>= fun () ->
Os.exec_result [t.ctr_path; "container"; "rm"; container_id] ~pp
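To show what "every exit path" means, here is a small synchronous sketch using the standard library’s Fun.protect as a stand-in for an Lwt-based wrapper. The with_container function and its exec parameter are hypothetical: exec plays the role of Os.exec_result and is assumed to take an argv list and return a result. Only the two ctr commands come from the snippet above.

```ocaml
(* [exec] stands in for Os.exec_result: it runs an argv and returns a result.
   [body] is the build step; it may return normally or raise. *)
let with_container ~exec ~ctr_path ~container_id body =
  let cleanup () =
    (* Results are ignored: the task may already be gone by this point. *)
    ignore (exec [ctr_path; "task"; "delete"; "--force"; container_id]);
    ignore (exec [ctr_path; "container"; "rm"; container_id])
  in
  (* [finally] runs whether [body] returns or raises. *)
  Fun.protect ~finally:cleanup body
```

The point of the wrapper is that cancellation, modelled here as an exception, still reaches the two cleanup commands, so the container cannot outlive the step.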

I’ll do a little testing before opening a PR. The changes are on the hcs-container-cleanup branch.
