Building a STAC server to avoid scanning 3.8 million tiles
Mark Elvers

The GeoTessera project produces 128-channel geospatial embeddings from Sentinel satellite imagery. The dataset is tiled at 0.1-degree resolution across the globe, covering 9 years and comprising roughly 3.8 million tiles, each containing embeddings and scale-factor files.

These tiles live on three storage backends: the primary source on okavango here in Cambridge (ZFS over spinning disks), an S3 bucket in AWS us-west-2, and a CephFS cluster in Scaleway Paris. Keeping them in sync was becoming slow because every sync run rescanned both the source and the target.

Tools such as s5cmd sync, rsync, and rclone work, but each starts by listing every file on both sides to compute the diff. With 3.8 million tile directories, each containing 3 files, that is more than 11 million files to list on each side, and the scan takes a very long time.

What I wanted was an index that tracked what each store contained so that the sync could be reduced to a set difference on metadata rather than a filesystem walk.
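
As a rough illustration of that idea (in Python rather than the OCaml the tool is written in), the diff reduces to a set subtraction once each store's contents are held as a set of tile keys. The key shape (year, grid name) and the sample values are assumptions made for this sketch, not the real manifest schema.

```python
# Hypothetical tile key: (year, grid name). The real manifests may key
# tiles differently; this shape is assumed for illustration only.
TileKey = tuple[int, str]


def missing_from_target(source: set[TileKey], target: set[TileKey]) -> list[TileKey]:
    """Return tiles present in the source manifest but absent from the target."""
    return sorted(source - target)


# Made-up sample manifests.
source = {
    (2024, "grid_0.85_49.95"),
    (2024, "grid_0.95_49.95"),
    (2023, "grid_0.85_49.95"),
}
target = {(2024, "grid_0.85_49.95")}

missing = missing_from_target(source, target)
print(missing)
```

The cost depends only on the size of the two manifests, not on how long either store takes to list.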

The existing registry

There is already registry.parquet which lists every tile on okavango with coordinates, year, file sizes, and hashes. For the target stores, I needed an equivalent parquet file per store that records which tiles it has.

Initially, the sync tool builds each store's parquet manifest from the output of s5cmd ls or find on that store. From then on, diffs are fast:

=== GeoTessera Sync Status ===

Registry: 3831542 tiles across 9 year(s)

Stores:
  okavango        3831542 tiles
  s3              3831566 tiles
  scaleway        3822382 tiles

Pairwise diffs (missing from target):
  okavango -> scaleway: 9184 missing
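
The one-off manifest build from a file listing might look something like the sketch below, written in Python for illustration. It assumes find-style paths of the form .../<year>/grid_<lon>_<lat>/<file>; only the <year>/grid_<lon>_<lat> fragment appears in this post, and the file names in the sample are invented.

```python
import re

# Assumed layout: .../<year>/grid_<lon>_<lat>/<file>. Anything beyond the
# "<year>/grid_<lon>_<lat>" fragment is a guess made for this sketch.
TILE_RE = re.compile(r"(?:^|/)(\d{4})/(grid_-?\d+\.\d+_-?\d+\.\d+)(?:/|$)")


def tiles_from_listing(lines):
    """Collect the distinct (year, grid) pairs found in a find-style listing."""
    tiles = set()
    for line in lines:
        match = TILE_RE.search(line.strip())
        if match:
            tiles.add((int(match.group(1)), match.group(2)))
    return tiles


# Hypothetical listing; the file names are placeholders.
listing = [
    "./2024/grid_0.85_49.95/embeddings.npy",
    "./2024/grid_0.85_49.95/scales.npy",
    "./2023/grid_-0.05_51.45/embeddings.npy",
    "./README.txt",
]

tiles = tiles_from_listing(listing)
print(sorted(tiles))
```

Several files in the same tile directory collapse to one key, and lines that do not look like tile paths are ignored.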

Copying the missing tiles then becomes a targeted operation where I can pipe the manifest into xargs -P 32 with s5cmd cp, rather than letting sync discover what’s missing by scanning everything.
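
A minimal sketch of generating such a copy list, in Python for illustration: one line per missing tile, holding a source prefix and a destination prefix. The root paths and the bucket name are hypothetical, and how the lines are handed to xargs and s5cmd is left open here.

```python
def copy_list(missing, src_root, dst_root):
    """Build one 'source destination' line per missing (year, grid) tile."""
    src_root = src_root.rstrip("/")
    dst_root = dst_root.rstrip("/")
    lines = []
    for year, grid in sorted(missing):
        rel = f"{year}/{grid}/"
        lines.append(f"{src_root}/{rel} {dst_root}/{rel}")
    return lines


# Hypothetical roots; neither path comes from the real deployment.
lines = copy_list(
    {(2024, "grid_0.85_49.95"), (2023, "grid_0.85_49.95")},
    "/data/tiles",
    "s3://example-bucket/tiles/",
)
for line in lines:
    print(line)
```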

Fixing the Arrow library

The tool is written in OCaml using mtelvers/arrow for parquet I/O. The upstream registry.parquet uses the large_string Arrow type (int64 offsets) for its hash column, which the OCaml bindings didn’t support. They only handled regular utf8 (int32 offsets). Reading the column would silently pass the C++ type check (thanks to a special-case hack) but then crash when the OCaml code tried to interpret int64 offsets as int32.
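
To show why the offset width matters, here is a small Python sketch (not the OCaml binding code) that decodes an Arrow-style string column from an offsets buffer and a data buffer. An n-element column carries n+1 offsets; utf8 stores them as int32 and large_utf8 as int64. Reading int64 offsets with the int32 width produces the wrong number of offsets and nonsense slice boundaries.

```python
import struct


def decode_strings(offsets_buf, data, width):
    """Decode a string column; width is 4 (utf8) or 8 (large_utf8) bytes per offset."""
    fmt = "<i" if width == 4 else "<q"
    count = len(offsets_buf) // width
    offsets = [struct.unpack_from(fmt, offsets_buf, i * width)[0] for i in range(count)]
    # Each string spans data[offsets[i]:offsets[i + 1]].
    return [data[a:b].decode() for a, b in zip(offsets, offsets[1:])]


data = b"abcde"                              # "ab" followed by "cde"
large_offsets = struct.pack("<3q", 0, 2, 5)  # int64 offsets, as in large_utf8

right = decode_strings(large_offsets, data, 8)
wrong = decode_strings(large_offsets, data, 4)  # same bytes misread as int32
print(right)
print(wrong)
```

Python slicing tolerates the bad boundaries and just returns wrong strings; a reader that trusts the offsets to index memory directly has less room for error.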

The fix added first-class LargeUtf8 support across the library: new read_large_utf8 / read_large_utf8_opt reader functions with int64 offset handling, large_utf8 / large_utf8_opt writer functions, a LargeUtf8 variant in the high-level Table.col_type GADT, and updates to fast_read for automatic type detection. The silent special case in the C++ layer was removed in favour of proper type dispatch. The library was also bumped from C++17 to C++20 to support Arrow 23 headers.

I didn’t need a STAC server

STAC (SpatioTemporal Asset Catalog) is a standard for describing geospatial data. The sync tool doesn’t use it. Previously, I created mtelvers/tile-server, which served as the basis for this project. It works directly with parquet files. But since we had all the tile metadata loaded anyway, wrapping it in a STAC API was straightforward and gives us:

  • A standard API that tools like pystac, QGIS, and STAC browsers can query
  • Per-tile asset links showing which stores have each tile and where to download it
  • Spatial search by bounding box
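
The bounding-box search can be sketched as a rectangle intersection test over the in-memory tiles. This Python version is illustrative only: it assumes the coordinates in a tile name are the tile centre, which this post does not confirm, and it uses the STAC bbox order of west, south, east, north. A linear scan is shown for clarity; a real server over millions of tiles would likely want a spatial index.

```python
from dataclasses import dataclass

TILE_SIZE = 0.1  # degrees, matching the 0.1-degree tiling


@dataclass(frozen=True)
class Tile:
    year: int
    lon: float  # assumed to be the tile centre (an assumption of this sketch)
    lat: float

    @property
    def bbox(self):
        """Tile extent as (west, south, east, north)."""
        half = TILE_SIZE / 2
        return (self.lon - half, self.lat - half, self.lon + half, self.lat + half)


def intersects(a, b):
    """True when two (west, south, east, north) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(tiles, bbox):
    """Return the tiles whose extent intersects the query box."""
    return [t for t in tiles if intersects(t.bbox, bbox)]


tiles = [Tile(2024, 0.85, 49.95), Tile(2024, 5.05, 52.05)]
hits = search(tiles, (0.0, 49.0, 1.0, 50.0))
print(hits)
```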

The server loads the parquet files at startup, builds an in-memory index, and serves STAC-compliant JSON. The first store listed is the primary (its tiles form the catalogue); others are cross-referenced to populate asset links.

{
  "id": "2024_grid_0.85_49.95",
  "assets": {
    "okavango": {
      "href": "https://dl2.geotessera.org/.../2024/grid_0.85_49.95",
      "file:size": 108527232,
      "file:checksum": "sha256:..."
    },
    "s3": {
      "href": "https://tessera-embeddings.s3.us-west-2.amazonaws.com/.../2024/grid_0.85_49.95"
    },
    "scaleway": {
      "href": "https://dl1.scw.geotessera.org/.../2024/grid_0.85_49.95"
    }
  }
}
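
The cross-referencing step that produces those asset links could be sketched as follows, again in Python for illustration. The base URLs are placeholders, the manifests are made up, and the result mirrors only the abridged example above: it omits the size and checksum fields and everything else a complete STAC Item requires.

```python
def build_item(key, manifests, base_urls):
    """Build an abridged item: one asset per store whose manifest holds the tile."""
    year, grid = key
    assets = {}
    for store, tiles in manifests.items():  # dicts preserve insertion order
        if key in tiles:
            assets[store] = {"href": f"{base_urls[store]}/{year}/{grid}"}
    return {"id": f"{year}_{grid}", "assets": assets}


tile_a = (2024, "grid_0.85_49.95")
tile_b = (2024, "grid_0.95_49.95")

# Made-up manifests; the first store listed plays the role of the primary.
manifests = {"okavango": {tile_a, tile_b}, "scaleway": {tile_a}}
# Placeholder URLs, not the real download hosts or paths.
base_urls = {
    "okavango": "https://primary.example.org",
    "scaleway": "https://mirror.example.org",
}

item = build_item(tile_b, manifests, base_urls)
print(item)
```

A tile missing from a store simply has no asset entry for it, so the item doubles as a per-tile view of sync status.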

Map envy

The real motivation for the frontend was seeing the GeoTessera coverage map. It’s a beautiful visualisation of global tile coverage, and I felt a bit left out with plain data tables. With a MapLibre GL frontend on top of the STAC API and a Sentinel-2 satellite basemap, you can browse the tile inventory spatially, inspect per-tile metadata and store locations, and more.

It’s live at stac.mint.caelum.ci.dev.

The stack

The project is two OCaml binaries: stac-server, which serves the STAC API from the parquet files, and stac-sync, a CLI for scanning stores, diffing manifests, generating copy lists, and recording synced tiles.

Caddy sits in front as a reverse proxy, serving the static frontend at / and proxying /api/* to the OCaml server.

The source is at mtelvers/stac-server.