We use OCluster to manage the build cluster for the CI services backing OCaml-CI and opam-repo-ci. However, it is a general-purpose tool rather than a build system: it can distribute any kind of job across multiple worker machines.
In my case, I need to generate training data for Tessera by downloading Sentinel-1 and Sentinel-2 satellite data and applying a tiny AI model for cloud masking. I need to generate about 5,000 tiles. Including the HTTP round-trips, querying the STAC catalogue and downloading a single tile takes about 10 minutes. Experiments with GNU Parallel showed that running more than four downloads concurrently on my machine gave no further speed-up. Enter OCluster.
The README.md on the project homepage explains how to set up a cluster. In brief, you need one machine with a fixed, reachable IP address to act as the scheduler, plus as many worker machines as you like. The workers make an outgoing connection to the scheduler, so they can sit behind NAT. Workers are grouped into pools along administrative boundaries, such as machine architecture. Start the scheduler with ocluster-scheduler ... --capnp-listen-address=tcp:0.0.0.0:9000 --capnp-public-address=tcp:w.x.y.z:9000 --pools=foo,bar, which generates pool-foo.cap and pool-bar.cap, capability files that embed the public address. Start a worker with ocluster-worker --connect=pool-foo.cap --name worker-1 ...
For ease of local testing, I created a Dockerfile which builds my (OCaml) project from source and installs third-party libraries such as the ONNX runtime. I could submit the Dockerfile directly to the cluster, but the cache pruning is more sophisticated when using an OBuilder spec. OCluster runs docker system prune when the Docker partition is low on space, whereas OBuilder prunes individual layers on a least-recently-used basis.
An OBuilder spec is really just an s-expression version of a Dockerfile. For example, this trivial Dockerfile converts into hello.spec as follows:
Dockerfile.hello
FROM debian:13
USER 1000:1000
RUN echo Hello World
hello.spec
((from debian:13)
(user (uid 1000) (gid 1000))
(run (shell "echo Hello World"))
)
This can be submitted to OCluster using your capability.
$ ocluster-client submit-obuilder --connect ~/mtelvers.cap --pool test --local-file ./hello.spec
Tailing log:
Building on worker-1.ci.dev
(from debian:13)
Unable to find image 'debian:13' locally
13: Pulling from library/debian
ac9148dc57ca: Already exists
Digest: sha256:3615a749858a1cba49b408fb49c37093db813321355a9ab7c1f9f4836341e9db
Status: Downloaded newer image for debian:13
2026-03-09 11:28.03 ---> saved as "4ea035d1f0cfdda7660f299954022c3a974ec9e1ba5d06b3a9aa2bca24fdcfb7"
/: (user (uid 1000) (gid 1000))
/: (run (shell "echo Hello World"))
Hello World
2026-03-09 11:28.05 ---> saved as "e3859ae9dcce742a0d612e55f69b5ed1614551ca2b49109e43d08f2f2595fd57"
Job succeeded
Result: "e3859ae9dcce742a0d612e55f69b5ed1614551ca2b49109e43d08f2f2595fd57"
OCluster doesn’t provide any native mechanism to copy artefacts back from the worker machine. The CI pipelines have used some creative workarounds for this. For example:
- print markers to the log, followed by JSON structured data, which can be extracted and parsed
- base64-encode some binary objects and print that to the log
- set up a remote SSH server and add steps to the build to rsync the data
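The first of these can be sketched in a few lines of shell. The marker strings and the JSON payload below are invented for illustration: the build step prints the markers around its structured output, and the submitting side filters the captured log with sed.

```shell
# Simulated build log, as captured from ocluster-client's stdout
# (marker names and JSON fields are made up for this sketch)
cat > job.log <<'EOF'
Building on worker-1
(run (shell "..."))
@@@RESULT@@@
{"tile": "T30UXC", "clouds": 0.12}
@@@END@@@
Job succeeded
EOF

# Print only the lines between the markers, dropping the markers themselves
sed -n '/@@@RESULT@@@/,/@@@END@@@/{/@@@/d;p}' job.log > result.json
cat result.json
```

The same pattern extends to the base64 approach: encode the file on the worker, print it between markers, and pipe the extracted lines through base64 -d on the client.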
In a private environment, it may not be necessary to secure the upload, and a curl -X POST -F file=@somefile.bin http://w.x.y.z:8080 to a one-liner Python HTTP server may be sufficient.
python3 -c "
from http.server import HTTPServer, BaseHTTPRequestHandler
from email.parser import BytesParser
from email.policy import default
class H(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers['Content-Length']))
        # Rebuild a MIME document so the multipart/form-data body can be
        # parsed (the cgi module once used for this was removed in Python 3.13)
        head = ('Content-Type: ' + self.headers['Content-Type']).encode()
        msg = BytesParser(policy=default).parsebytes(head + b'\r\n\r\n' + body)
        for part in msg.iter_parts():
            if part.get_filename():
                open(part.get_filename(), 'wb').write(part.get_payload(decode=True))
                print('Saved', part.get_filename())
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'OK\n')
HTTPServer(('0.0.0.0', 8080), H).serve_forever()
"
However, if you do need an authentication mechanism, ocluster-client supports secrets via the command-line option --secret foo:/path/to/local/file. On the worker, the secret can be read into an environment variable or written to ~/.ssh/id_ed25519 or similar. For example:
(run
 (secrets (foo (target /path/on/remote/filesystem)))
 (shell "TOKEN=$(cat /path/on/remote/filesystem); curl -H \"X-Token: $TOKEN\" ... "))
I used m4 to process my spec, replacing the placeholders LAT and LON with actual latitude and longitude values, and submitted the jobs with a simple bash loop that invoked ocluster-client for each spec, redirecting stdout to a log file.