GPU vs CPU for ONNX Inference: NVIDIA L4 vs AMD EPYC 9965
Mark Elvers

Categories

  • ocluster

Tags

  • tunbury.org

In a previous post, I compared the ONNX Runtime with PyTorch on the CPU and GPU. In this post, I take this to the extreme to see if a CPU can outpace the NVIDIA L4 GPU.

I’m going to use my OCaml bindings to the ONNX Runtime and benchmark the inference performance of the Tessera model on an NVIDIA L4 GPU against an AMD EPYC 9965 192-core CPU with AVX-512 support.

The Model

The model produces 128-dimensional embeddings from multi-temporal Sentinel-2 and Sentinel-1 satellite imagery. Each inference takes two inputs:

  • S2 input: [batch, 40, 11] 40 time-sampled Sentinel-2 observations across 11 bands
  • S1 input: [batch, 40, 3] 40 time-sampled Sentinel-1 SAR observations across 3 channels

The output is a [batch, 128] tensor: one 128-dimensional embedding per pixel.
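For reference, the input and output shapes can be sketched with dummy NumPy arrays (the float32 dtype is an assumption; the benchmarks below fill tensors with dummy data in exactly this shape):

```python
import numpy as np

batch = 2048  # the batch size used in the initial benchmarks

# Dummy inputs matching the model's expected shapes
s2_input = np.zeros((batch, 40, 11), dtype=np.float32)  # 40 Sentinel-2 samples x 11 bands
s1_input = np.zeros((batch, 40, 3), dtype=np.float32)   # 40 Sentinel-1 SAR samples x 3 channels

# The model maps each pixel's time series to a 128-dimensional embedding
output_shape = (batch, 128)
```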

The Benchmark

For the benchmark, I’m going to use a minimal OCaml program that isolates pure ONNX Runtime inference, removing all data loading and preprocessing overhead. It pre-fills input tensors with dummy data and runs the model repeatedly in a loop:

(* Input tensors are pre-filled with dummy data; run num_batches inferences back to back *)
for _ = 0 to num_batches - 1 do
  let _outputs = Onnxruntime.Session.run_cached_ba session
    [|(s2_input,
       [| Int64.of_int bs; Int64.of_int sample_size_s2; 11L |]);
      (s1_input,
       [| Int64.of_int bs; Int64.of_int sample_size_s1; 3L |])|]
    ~output_sizes:[| bs * latent_dim |]
  in
  ()
done

The initial tests use a batch size of 2048 and run ten batches (20,480 pixels). Testing showed that this scales linearly to longer runs. The ONNX Runtime version is 1.24.1.

Results

GPU vs CPU

|                    | GPU (NVIDIA L4) | CPU (8 threads) |
|--------------------|-----------------|-----------------|
| Total (10 batches) | 9.0s            | 113.6s          |
| Per batch          | 900ms           | 11,360ms        |
| Per pixel          | 0.44ms          | 5.55ms          |
| GPU speedup        | 12.6x           | baseline        |

As we would expect, the GPU is faster: 12.6 times faster!
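The per-pixel and speedup figures follow directly from the per-batch times; the conversion is simple arithmetic:

```python
batch_size = 2048
gpu_ms_per_batch = 900.0
cpu_ms_per_batch = 11_360.0

gpu_ms_per_pixel = gpu_ms_per_batch / batch_size  # ~0.44 ms/pixel
cpu_ms_per_pixel = cpu_ms_per_batch / batch_size  # ~5.55 ms/pixel
speedup = cpu_ms_per_batch / gpu_ms_per_batch     # ~12.6x
```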

CPU Thread Scaling

I picked 8 threads arbitrarily for the first test. How does performance scale across a range of thread counts? More cores should mean more throughput.

| Threads | ms/batch | Speedup vs 1 thread | vs GPU (900ms) |
|---------|----------|---------------------|----------------|
| 1       | 48,168   | 1.0x                | 53.5x slower   |
| 4       | 16,494   | 2.9x                | 18.3x slower   |
| 8       | 11,360   | 4.2x                | 12.6x slower   |
| 16      | 11,138   | 4.3x                | 12.4x slower   |
| 32      | 10,383   | 4.6x                | 11.5x slower   |
| 64      | 10,389   | 4.6x                | 11.5x slower   |
| 128     | 10,161   | 4.7x                | 11.3x slower   |
| 192     | 9,989    | 4.8x                | 11.1x slower   |

Analysis

The CPU thread scaling plateaus at about 16 threads. Going from 1 to 8 threads gave a 4.2x speedup, but doubling further to 16 adds only 2%. Adding the remaining 176 cores contributes almost nothing. This will be investigated below.

The GPU wins by 11-12x for a single job. Even using all 192 cores of the EPYC 9965 in a single process, the L4 GPU is still 11x faster. A single CPU process can only effectively use one or two NUMA nodes’ memory controllers (~75-150 GB/s), while the L4 delivers 300 GB/s to a single computation. More on this later.

A single Sentinel-2 MGRS tile at 10m resolution is 10,980 x 10,980 pixels. Processing this with 1,024 x 1,024 blocks yields 121 blocks averaging ~500K-1M pixels each. At these single-job rates, a full tile takes approximately:

  • GPU: ~16 hours (inference only)
  • CPU (8 threads): ~200 hours

These projections get revisited after optimisation, and the final numbers look very different!
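As a sanity check, the projection is just pixels multiplied by per-pixel time. The exact hours depend on rounding; the article's figures are order-of-magnitude estimates:

```python
tile_pixels = 10_980 * 10_980  # ~120.6 million pixels per MGRS tile

# Hours of inference at the single-job rates measured above
gpu_hours = tile_pixels * 0.44 / 1000 / 3600  # ~15 h, i.e. roughly the ~16 h quoted
cpu_hours = tile_pixels * 5.55 / 1000 / 3600  # ~186 h, i.e. roughly the ~200 h quoted
```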

What About PyTorch?

An obvious question: is this gap an artefact of ONNX Runtime and the OCaml bindings, or does native PyTorch show the same behaviour? An equivalent benchmark using PyTorch 2.10 with CUDA 12.6 loads the original checkpoint directly and runs the same model architecture with dummy tensors:

with torch.no_grad():
    for _ in range(num_batches):
        _ = model(s2_input, s1_input)

PyTorch vs ONNX Runtime

| Framework           | GPU (L4)             | CPU (8 threads)           | GPU speedup |
|---------------------|----------------------|---------------------------|-------------|
| ONNX Runtime 1.24.1 | 9.0s (900 ms/batch)  | 113.6s (11,360 ms/batch)  | 12.6x       |
| PyTorch 2.10        | 10.3s (1,026 ms/batch) | 140.2s (14,016 ms/batch) | 13.6x       |

The results speak for themselves: across both frameworks, the GPU is the clear winner, 12-14x faster than the CPU. ONNX Runtime does edge out PyTorch, by about 14% on the GPU and 23% on the CPU. This is likely due to graph optimisations applied during model export, but the dominant factor is GPU vs CPU, not the inference framework!

Does Batch Size Matter?

All the results above used a batch size of 2048. The GPU has thousands of CUDA cores while the CPU has far fewer. Should the CPU use smaller batches that fit better in cache?

Sweeping batch sizes from 32 to 5120 on both the CPU and GPU gave these results:

| Batch Size | CPU ms/pixel | GPU ms/pixel | GPU speedup |
|------------|--------------|--------------|-------------|
| 32         | 5.62         | 0.63         | 8.9x        |
| 64         | 5.56         | 0.49         | 11.3x       |
| 128        | 5.40         | 0.45         | 12.0x       |
| 256        | 5.05         | 0.44         | 11.5x       |
| 512        | 5.16         | 0.46         | 11.2x       |
| 1024       | 6.11         | 0.45         | 13.6x       |
| 2048       | 6.07         | 0.44         | 13.8x       |
| 4096       | 6.18         | 0.44         | 14.0x       |
| 5120       |              | 0.43         |             |
| 6144+      |              | OOM          |             |

The CPU has an optimum batch size of 256 (5.05 vs 6.07 ms/pixel). At small batch sizes the working set fits in cache, avoiding expensive main memory accesses. Above 512, performance degrades as intermediate tensors spill to DRAM.

For the GPU, the per-pixel throughput is nearly flat from batch size 128 upwards, with a marginal improvement as the size increases. The maximum batch size is constrained by VRAM (24GB).

Even comparing each device at its optimal batch size (CPU at 256, GPU at 5120), the GPU is 11.5x faster. Tuning the batch size helps the CPU modestly but does not bridge the gap.

NUMA Topology: The Hidden Variable

The thread scaling results above were surprisingly poor, with 192 threads barely faster than 8. Could NUMA (Non-Uniform Memory Access) explain this?

The test machine is a 2-socket AMD EPYC 9965 system with 24 NUMA nodes (12 per socket). Each node has 16 physical cores, its own 32MB L3 cache, and a local memory controller serving ~128GB of DDR5. The key insight is in the distance table:

node distances:
       0    1    ...  12   13   ...
  0:  10   11        32   32
  1:  11   10        32   32
 12:  32   32        10   11
 13:  32   32        11   10

Accessing memory on the same node costs 10 (local). A different node on the same socket costs 11 (a 10% penalty). But crossing to the other socket costs 32, a 3.2x latency penalty. When ONNX Runtime spawns a large number of threads without NUMA awareness, they scatter across nodes and sockets, and every shared tensor access becomes a cross-socket round trip.
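The distance values can be read as relative latencies, which makes the penalties easy to quantify:

```python
# Relative access costs from the numactl distance table above
local, same_socket, cross_socket = 10, 11, 32

same_socket_penalty = same_socket / local    # 1.1x: a 10% penalty
cross_socket_penalty = cross_socket / local  # 3.2x: crossing the socket interconnect
```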

To test this, all threads and memory can be pinned to a single NUMA node using numactl:

numactl --cpunodebind=17 --membind=17 ./bench_onnx.exe \
    --model tessera_model.onnx --batch_size 256 --num_threads 16

NUMA-Pinned Thread Scaling (Node 17, batch_size=256)

| Threads | ms/pixel | Speedup vs 4 threads |
|---------|----------|----------------------|
| 4       | 6.45     | 1.0x                 |
| 8       | 4.18     | 1.5x                 |
| 12      | 3.75     | 1.7x                 |
| 16      | 3.38     | 1.9x                 |

Within a single NUMA node, scaling is nearly linear. All 16 cores share the same L3 cache and memory controller with no cross-node traffic.

Distributing Across NUMA Nodes

Pinning to one node gives great per-core efficiency, but limits throughput to a single memory controller’s bandwidth. Would spreading across multiple nodes aggregate their bandwidth and cache?

Using numactl --interleave to stripe memory across nodes while binding threads to the corresponding cores:

| NUMA Nodes       | Cores | ms/pixel | vs 1 node | vs GPU |
|------------------|-------|----------|-----------|--------|
| 1                | 16    | 3.35     | 1.00x     | 7.8x   |
| 2                | 32    | 3.09     | 1.08x     | 7.2x   |
| 4                | 64    | 3.27     | 1.02x     | 7.6x   |
| 6                | 96    | 3.45     | 0.97x     | 8.0x   |
| 12 (full socket) | 192   | 3.64     | 0.92x     | 8.5x   |

Two nodes is the best configuration, with a modest 8% improvement from the extra bandwidth. Beyond that, cross-node synchronisation overhead outweighs the gains.

Quickly verifying that the optimal batch size still holds in a 2-node configuration:

| Batch Size | ms/pixel (2 nodes, 32 cores) |
|------------|------------------------------|
| 128        | 3.09                         |
| 256        | 3.07                         |
| 512        | 3.18                         |
| 1024       | 3.58                         |
| 2048       | 3.79                         |

Parallel Jobs: Exploiting the Full Machine

As shown above, a single inference job can’t efficiently use more than 1-2 NUMA nodes. But satellite tile processing is embarrassingly parallel as each tile, block and pixel is independent. What happens when multiple NUMA-pinned jobs run simultaneously?

I tested two strategies:

  • 2 NUMA nodes per job (32 cores, the best single-job configuration), and
  • 1 NUMA node per job (16 cores, maximum parallelism).
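A launcher for the 1-node-per-job strategy can be sketched as below. The binary name and flags mirror the numactl invocation shown earlier; the one-job-per-node assignment and the wrapper itself are assumptions, not the author's actual script:

```python
# Hypothetical launcher: one numactl-pinned command line per NUMA node.
def job_command(node: int) -> list[str]:
    return [
        "numactl",
        f"--cpunodebind={node}",  # pin threads to this node's 16 cores
        f"--membind={node}",      # allocate only from this node's memory
        "./bench_onnx.exe",
        "--model", "tessera_model.onnx",
        "--batch_size", "256",
        "--num_threads", "16",
    ]

# 24 jobs, one per NUMA node; launch each with subprocess.Popen in practice
commands = [job_command(n) for n in range(24)]
```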

2 Nodes Per Job (32 cores each)

| Jobs | Wall Time | Total Pixels | Per-job ms/pixel | Aggregate ms/pixel | vs GPU      |
|------|-----------|--------------|------------------|--------------------|-------------|
| 1    | 31s       | 10,240       | 3.07             | 3.07               | 7.1x slower |
| 2    | 34s       | 20,480       | 3.29             | 1.66               | 3.9x slower |
| 4    | 36s       | 40,960       | 3.54             | 0.89               | 2.1x slower |
| 6    | 36s       | 61,440       | 3.35             | 0.59               | 1.4x slower |
| 12   | 47s       | 122,880      | 4.33             | 0.38               | 1.1x faster |

1 Node Per Job (16 cores each)

| Jobs | Wall Time | Total Pixels | Per-job ms/pixel | Aggregate ms/pixel | vs GPU      |
|------|-----------|--------------|------------------|--------------------|-------------|
| 1    | 34s       | 10,240       | 3.33             | 3.33               | 7.7x slower |
| 2    | 36s       | 20,480       | 3.54             | 1.77               | 4.1x slower |
| 4    | 40s       | 40,960       | 3.86             | 0.97               | 2.3x slower |
| 6    | 41s       | 61,440       | 3.96             | 0.67               | 1.6x slower |
| 12   | 42s       | 122,880      | 4.00             | 0.34               | 1.3x faster |
| 24   | 55s       | 245,760      | 5.21             | 0.22               | 2.0x faster |

The 1-node-per-job strategy wins decisively at this scale. Each individual job is slightly slower (3.33 vs 3.07 ms/pixel), but with 24 jobs instead of 12, the aggregate throughput is twice as fast as the GPU!

I am impressed that the scaling is so linear: going from 1 to 24 jobs, the wall time only increases from 34s to 55s while processing 24 times the data. The key is that each job is fully contained within its NUMA node using its own 16 cores, 32MB L3 cache, and memory controller with zero cross-node traffic.
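The aggregate figure is simply wall time divided by total pixels, so near-linear scaling drives it down even as per-job latency creeps up:

```python
jobs = 24
pixels_per_job = 10_240
wall_time_s = 55.0

total_pixels = jobs * pixels_per_job                        # 245,760 pixels
aggregate_ms_per_pixel = wall_time_s * 1000 / total_pixels  # ~0.22 ms/pixel
vs_gpu = 0.43 / aggregate_ms_per_pixel                      # ~1.9x, the ~2x quoted
```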

Summary

| Configuration                        | Per-job ms/pixel | Aggregate ms/pixel | vs GPU       |
|--------------------------------------|------------------|--------------------|--------------|
| Default (8 threads, batch=2048)      | 5.55             | 5.55               | 12.6x slower |
| Tuned batch (8 threads, batch=256)   | 5.05             | 5.05               | 11.5x slower |
| NUMA 1 node, 16 threads, batch=256   | 3.33             | 3.33               | 7.7x slower  |
| NUMA 2 nodes, 32 threads, batch=256  | 3.07             | 3.07               | 7.1x slower  |
| 12x NUMA jobs (2 nodes each)         | 4.33             | 0.38               | 1.1x faster  |
| 24x NUMA jobs (1 node each)          | 5.21             | 0.22               | 2.0x faster  |
| GPU (NVIDIA L4)                      | 0.43             | 0.43               | baseline     |

The optimisation improved performance by 25 times, making the job twice as fast as the GPU.

Revisiting the full-tile projection from earlier: at 0.22 ms/pixel aggregate, the optimised CPU processes a 120-million-pixel MGRS tile in approximately 7.5 hours, about half the GPU’s ~16 hours.
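The revised projection is the same arithmetic as before, just with the optimised aggregate rate:

```python
tile_pixels = 10_980 * 10_980                 # ~120.6 million pixels per MGRS tile
cpu_hours = tile_pixels * 0.22 / 1000 / 3600  # ~7.4 h across 24 NUMA-pinned jobs
```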

Conclusion

You can beat a GPU with enough CPU cores if you take into account the NUMA topology.

Of course, the GPU is the easy option; adding --cuda 0 to the command line gives you 0.43 ms/pixel with zero tuning. The CPU approach took much more effort: tuning the batch size and using numactl to pin jobs to NUMA nodes.