In a previous post, I compared the ONNX Runtime with PyTorch on the CPU and GPU. In this post, I take this to the extreme to see if a CPU can outpace the NVIDIA L4 GPU.
I’m going to use my OCaml bindings to the ONNX Runtime to benchmark the inference performance of the Tessera model on an NVIDIA L4 GPU against an AMD EPYC 9965 192-core CPU with AVX-512 support.
The Model
The model produces 128-dimensional embeddings from multi-temporal Sentinel-2 and Sentinel-1 satellite imagery. Each inference takes two inputs:
- S2 input: `[batch, 40, 11]`, 40 time-sampled Sentinel-2 observations across 11 bands
- S1 input: `[batch, 40, 3]`, 40 time-sampled Sentinel-1 SAR observations across 3 channels
The output is a `[batch, 128]` tensor: one 128-dimensional embedding per pixel.
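As a concrete sketch of the shapes, here are NumPy stand-ins for the dummy tensors the benchmark below fills (variable names are illustrative, not from the actual code):

```python
import numpy as np

batch = 2048
s2_input = np.zeros((batch, 40, 11), dtype=np.float32)  # 40 timestamps x 11 Sentinel-2 bands
s1_input = np.zeros((batch, 40, 3), dtype=np.float32)   # 40 timestamps x 3 Sentinel-1 SAR channels

# The model maps each pixel's time series to one 128-d embedding:
output_shape = (batch, 128)
```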
The Benchmark
For the benchmark, I’m going to use a minimal OCaml program that isolates pure ONNX Runtime inference, removing all data loading and preprocessing overhead. It pre-fills input tensors with dummy data and runs the model repeatedly in a loop:
```ocaml
for _ = 0 to num_batches - 1 do
  let _outputs =
    Onnxruntime.Session.run_cached_ba session
      [| (s2_input,
          [| Int64.of_int bs; Int64.of_int sample_size_s2; 11L |]);
         (s1_input,
          [| Int64.of_int bs; Int64.of_int sample_size_s1; 3L |]) |]
      ~output_sizes:[| bs * latent_dim |]
  in
  ()
done
```
The initial tests use a batch size of 2048 and run ten batches (20,480 pixels). Testing showed that this scaled linearly to longer runs. The ONNX Runtime version is 1.24.1.
Results
GPU vs CPU
| | GPU (NVIDIA L4) | CPU (8 threads) |
|---|---|---|
| Total (10 batches) | 9.0s | 113.6s |
| Per batch | 900ms | 11,360ms |
| Per pixel | 0.44ms | 5.55ms |
| GPU speedup | 12.6x | baseline |
As expected, the GPU comes out ahead: 12.6 times faster!
CPU Thread Scaling
The first test used 8 threads, a fairly arbitrary choice. How does performance scale across a range of thread counts? More cores should give better performance.
| Threads | ms/batch | Speedup vs 1 thread | vs GPU (900ms) |
|---|---|---|---|
| 1 | 48,168 | 1.0x | 53.5x slower |
| 4 | 16,494 | 2.9x | 18.3x slower |
| 8 | 11,360 | 4.2x | 12.6x slower |
| 16 | 11,138 | 4.3x | 12.4x slower |
| 32 | 10,383 | 4.6x | 11.5x slower |
| 64 | 10,389 | 4.6x | 11.5x slower |
| 128 | 10,161 | 4.7x | 11.3x slower |
| 192 | 9,989 | 4.8x | 11.1x slower |
Analysis
The CPU thread scaling plateaus at about 16 threads. Going from 1 to 8 threads gave a 4.2x speedup, but doubling further to 16 adds only 2%. Adding the remaining 176 cores contributes almost nothing. This will be investigated below.
The GPU wins by 11-12x for a single job. Even using all 192 cores of the EPYC 9965 in a single process, the L4 GPU is still 11x faster. A single CPU process can only effectively use one or two NUMA nodes’ memory controllers (~75-150 GB/s), while the L4 delivers 300 GB/s to a single computation. More on this later.
A single Sentinel-2 MGRS tile at 10m resolution is 10,980 x 10,980 pixels. Processing this with 1,024 x 1,024 blocks yields 121 blocks averaging ~500K-1M pixels each. At these single-job rates, a full tile takes approximately:
- GPU: ~16 hours (inference only)
- CPU (8 threads): ~200 hours
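As a back-of-envelope check of these projections (inference only, using the single-job per-pixel rates measured above; the headline figures include some rounding headroom for overhead):

```python
# One Sentinel-2 MGRS tile at 10 m resolution.
pixels = 10_980 * 10_980

# Measured single-job rates from the tables above (ms per pixel).
gpu_hours = pixels * 0.44 / 1_000 / 3_600
cpu_hours = pixels * 5.55 / 1_000 / 3_600

print(f"{pixels:,} pixels -> GPU ~{gpu_hours:.0f} h, CPU ~{cpu_hours:.0f} h")
```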
These projections get revisited after optimisation, and the final numbers look very different!
What About PyTorch?
An obvious question: is this gap specific to the ONNX Runtime with OCaml bindings, or does native PyTorch show the same behaviour? An equivalent benchmark using PyTorch 2.10 with CUDA 12.6 loads the original checkpoint directly and runs the same model architecture with dummy tensors:
```python
with torch.no_grad():
    for _ in range(num_batches):
        _ = model(s2_input, s1_input)
```
PyTorch vs ONNX Runtime
| Framework | GPU (L4) | CPU (8 threads) | GPU speedup |
|---|---|---|---|
| ONNX Runtime 1.24.1 | 9.0s (900 ms/batch) | 113.6s (11,360 ms/batch) | 12.6x |
| PyTorch 2.10 | 10.3s (1,026 ms/batch) | 140.2s (14,016 ms/batch) | 13.6x |
The results speak for themselves: across both frameworks, the GPU is the clear winner, 12-14x faster than the CPU. ONNX Runtime does edge out PyTorch, by about 14% on the GPU and 23% on the CPU, likely due to graph optimisations applied during model export. But the dominant factor is GPU vs CPU, not the inference framework!
Does Batch Size Matter?
All the results above used a batch size of 2048. The GPU has thousands of CUDA cores while the CPU has far fewer. Should the CPU use smaller batches that fit better in cache?
Sweeping batch sizes from 32 to 5120 on both the CPU and GPU gave these results:
| Batch Size | CPU ms/pixel | GPU ms/pixel | GPU speedup |
|---|---|---|---|
| 32 | 5.62 | 0.63 | 8.9x |
| 64 | 5.56 | 0.49 | 11.3x |
| 128 | 5.40 | 0.45 | 12.0x |
| 256 | 5.05 | 0.44 | 11.5x |
| 512 | 5.16 | 0.46 | 11.2x |
| 1024 | 6.11 | 0.45 | 13.6x |
| 2048 | 6.07 | 0.44 | 13.8x |
| 4096 | 6.18 | 0.44 | 14.0x |
| 5120 | — | 0.43 | — |
| 6144+ | — | OOM | — |
The CPU’s optimum batch size is 256 (5.05 vs 6.07 ms/pixel at 2048). At small batch sizes the working set fits in cache, avoiding expensive main-memory accesses; from 1024 upwards, performance degrades as intermediate tensors spill to DRAM.
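The raw input footprint alone shows the trend (a rough calculation from the input shapes given earlier; the intermediate activations, which we can’t see from outside the graph, dominate the working set but scale the same way):

```python
def input_mb(batch: int) -> float:
    """Combined S2 + S1 input size in MB, float32 elements."""
    s2_bytes = batch * 40 * 11 * 4
    s1_bytes = batch * 40 * 3 * 4
    return (s2_bytes + s1_bytes) / 1e6

print(f"batch 256:  {input_mb(256):.2f} MB")   # small fraction of a 32 MB L3
print(f"batch 2048: {input_mb(2048):.2f} MB")  # inputs alone are sizeable before activations
```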
For the GPU, the per-pixel throughput is nearly flat from batch size 128 upwards, with a marginal improvement as the size increases. The maximum batch size is constrained by VRAM (24GB).
Even comparing each device at its optimal batch size (CPU at 256, GPU at 5120), the GPU is 11.5x faster. Tuning the batch size helps the CPU modestly but does not bridge the gap.
NUMA Topology: The Hidden Variable
The thread scaling results above were surprisingly poor with 192 threads barely faster than 8. Could NUMA (Non-Uniform Memory Access) explain this?
The machine is a 2-socket AMD EPYC 9965 system with 24 NUMA nodes (12 per socket). Each node has 16 physical cores, its own 32MB L3 cache, and a local memory controller serving ~128GB of DDR5. The key insight is in the distance table:
```
node distances:
      0   1  ...  12  13  ...
 0:  10  11       32  32
 1:  11  10       32  32
12:  32  32       10  11
13:  32  32       11  10
```
Accessing memory on the same node costs 10 (local). A different node on the same socket costs 11 (a 10% penalty). But crossing to the other socket costs 32, a 3.2x latency penalty. When ONNX Runtime spawns a large number of threads without NUMA awareness, they scatter across nodes and sockets, and every shared tensor access becomes a cross-socket round trip.
To test this, all threads and memory can be pinned to a single NUMA node using numactl:
```shell
numactl --cpunodebind=17 --membind=17 ./bench_onnx.exe \
  --model tessera_model.onnx --batch_size 256 --num_threads 16
```
NUMA-Pinned Thread Scaling (Node 17, batch_size=256)
| Threads | ms/pixel | Speedup vs 4 |
|---|---|---|
| 4 | 6.45 | 1.0x |
| 8 | 4.18 | 1.5x |
| 12 | 3.75 | 1.7x |
| 16 | 3.38 | 1.9x |
Within a single NUMA node, scaling is nearly linear. All 16 cores share the same L3 cache and memory controller with no cross-node traffic.
Distributing Across NUMA Nodes
Pinning to one node gives great per-core efficiency, but limits throughput to a single memory controller’s bandwidth. Would spreading across multiple nodes aggregate their bandwidth and cache?
Using numactl --interleave to stripe memory across nodes while binding threads to the corresponding cores:
| NUMA Nodes | Cores | ms/pixel | vs 1 node | vs GPU |
|---|---|---|---|---|
| 1 | 16 | 3.35 | 1.00x | 7.8x |
| 2 | 32 | 3.09 | 1.08x | 7.2x |
| 4 | 64 | 3.27 | 1.02x | 7.6x |
| 6 | 96 | 3.45 | 0.97x | 8.0x |
| 12 (full socket) | 192 | 3.64 | 0.92x | 8.5x |
Two nodes is the sweet spot, giving a modest 8% improvement from the extra bandwidth; beyond that, cross-node synchronisation overhead outweighs the gains.
Quickly verifying that the optimal batch size still holds in a 2-node configuration:
| Batch Size | 2 nodes (32 cores) ms/pixel |
|---|---|
| 128 | 3.09 |
| 256 | 3.07 |
| 512 | 3.18 |
| 1024 | 3.58 |
| 2048 | 3.79 |
Parallel Jobs: Exploiting the Full Machine
As shown above, a single inference job can’t efficiently use more than 1-2 NUMA nodes. But satellite tile processing is embarrassingly parallel as each tile, block and pixel is independent. What happens when multiple NUMA-pinned jobs run simultaneously?
I tested two strategies:
- 2 NUMA nodes per job (32 cores, the best single-job configuration), and
- 1 NUMA node per job (16 cores, maximum parallelism).
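A launcher for the second strategy can be sketched as follows (the node count and the `bench_onnx.exe` invocation are taken from the examples above; this builds one pinned command line per node and runs them concurrently):

```python
import shutil
import subprocess

def job_cmd(node: int) -> list[str]:
    """Pin one inference job's threads and memory to a single NUMA node."""
    return [
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "./bench_onnx.exe", "--model", "tessera_model.onnx",
        "--batch_size", "256", "--num_threads", "16",
    ]

cmds = [job_cmd(n) for n in range(24)]  # one job per NUMA node

# Only launch where numactl is actually available.
if shutil.which("numactl"):
    procs = [subprocess.Popen(c) for c in cmds]
    for p in procs:
        p.wait()
```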
2 Nodes Per Job (32 cores each)
| Jobs | Wall Time | Total Pixels | Per-job ms/pixel | Aggregate ms/pixel | vs GPU |
|---|---|---|---|---|---|
| 1 | 31s | 10,240 | 3.07 | 3.07 | 7.1x slower |
| 2 | 34s | 20,480 | 3.29 | 1.66 | 3.9x slower |
| 4 | 36s | 40,960 | 3.54 | 0.89 | 2.1x slower |
| 6 | 36s | 61,440 | 3.35 | 0.59 | 1.4x slower |
| 12 | 47s | 122,880 | 4.33 | 0.38 | 1.1x faster |
1 Node Per Job (16 cores each)
| Jobs | Wall Time | Total Pixels | Per-job ms/pixel | Aggregate ms/pixel | vs GPU |
|---|---|---|---|---|---|
| 1 | 34s | 10,240 | 3.33 | 3.33 | 7.7x slower |
| 2 | 36s | 20,480 | 3.54 | 1.77 | 4.1x slower |
| 4 | 40s | 40,960 | 3.86 | 0.97 | 2.3x slower |
| 6 | 41s | 61,440 | 3.96 | 0.67 | 1.6x slower |
| 12 | 42s | 122,880 | 4.00 | 0.34 | 1.3x faster |
| 24 | 55s | 245,760 | 5.21 | 0.22 | 2.0x faster |
The 1-node-per-job strategy wins decisively at this scale. Each individual job is slightly slower (3.33 vs 3.07 ms/pixel), but with 24 jobs instead of 12, the aggregate throughput is twice as fast as the GPU!
I am impressed that the scaling is so linear: going from 1 to 24 jobs, the wall time only increases from 34s to 55s while processing 24 times the data. The key is that each job is fully contained within its NUMA node using its own 16 cores, 32MB L3 cache, and memory controller with zero cross-node traffic.
Summary
| Configuration | Per-job ms/pixel | Aggregate ms/pixel | vs GPU |
|---|---|---|---|
| Default (8 threads, batch=2048) | 5.55 | 5.55 | 12.6x slower |
| Tuned batch (8 threads, batch=256) | 5.05 | 5.05 | 11.5x slower |
| NUMA 1 node, 16 threads, batch=256 | 3.33 | 3.33 | 7.7x slower |
| NUMA 2 nodes, 32 threads, batch=256 | 3.07 | 3.07 | 7.1x slower |
| 12x NUMA jobs (2 nodes each) | 4.33 | 0.38 | 1.1x faster |
| 24x NUMA jobs (1 node each) | 5.21 | 0.22 | 2.0x faster |
| GPU (NVIDIA L4) | 0.43 | 0.43 | baseline |
The optimisation improved performance by 25 times, making the job twice as fast as the GPU.
Revisiting the full-tile projection from earlier: at 0.22 ms/pixel aggregate, the optimised CPU processes a 120-million-pixel MGRS tile in approximately 7.5 hours, about half the GPU’s ~16 hours.
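The revised figure checks out with the same arithmetic as before, now at the 24-job aggregate rate:

```python
pixels = 10_980 * 10_980               # one MGRS tile at 10 m
hours = pixels * 0.22 / 1_000 / 3_600  # 24-job aggregate rate, ms/pixel
print(f"~{hours:.1f} hours per tile")
```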
Conclusion
You can beat a GPU with enough CPU cores if you take into account the NUMA topology.
Of course, the GPU is the easy option: adding --cuda 0 to the command line gives 0.43 ms/pixel with zero tuning. The CPU approach took considerably more effort: tuning the batch size and using numactl to pin work to the nodes.