GPU acceleration is the default assumption for machine learning inference. But Intel’s AMX (Advanced Matrix Extensions) may close the gap. AMX is built into recent Xeon processors, which are available from Azure. Can they compete with similarly priced GPU-based machines for the Tessera pipeline?
The Tessera encoder produces 128-dimensional embeddings from multi-temporal Sentinel-2 and Sentinel-1 satellite imagery.
- Inputs: 40 time-sampled optical observations `[batch, 40, 11]` and 40 SAR observations `[batch, 40, 3]`
- Output: one embedding per pixel `[batch, 128]`
The model processes the Earth’s land surface as a grid of 0.1° tiles. At 10m resolution, each tile is roughly 1,000 x 1,000 pixels. The exact dimensions vary with latitude, but 1 million is a good representative figure per tile.
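The tile arithmetic can be sketched in a few lines (a rough illustration assuming ~111.32 km per degree of latitude; the real grid will differ slightly):

```python
import math

def tile_pixels(lat_deg, res_m=10.0, tile_deg=0.1):
    """Approximate pixel dimensions of a 0.1-degree tile at a given latitude."""
    m_per_deg_lat = 111_320.0  # mean metres per degree of latitude (assumption)
    height = round(tile_deg * m_per_deg_lat / res_m)
    # Degrees of longitude shrink by cos(latitude)
    width = round(tile_deg * m_per_deg_lat * math.cos(math.radians(lat_deg)) / res_m)
    return height, width

h, w = tile_pixels(53.5)   # roughly Manchester's latitude
print(h, w, h * w)         # on the order of 1,100 x 660, ~0.7M pixels
```

At mid-latitudes this comes out somewhat under the 1 million representative figure; near the equator, where `cos(lat)` approaches 1, a tile is closer to 1,113 x 1,113 pixels.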
Hardware and Cost
In this post, I am going to compare our monster AMD EPYC machine (with its NVIDIA L4) against two commonly available Azure machines with nearly identical hourly rates:
| Machine | Hardware | Cores | Accelerator | $/hr |
|---|---|---|---|---|
| Azure D16s_v6 | Intel Xeon 8573C | 8 physical | AMX (bf16) | $0.98 |
| Azure NC8as_T4_v3 | AMD EPYC 7V12 | 4 physical | Tesla T4 | $0.94 |
| Monteverde | AMD EPYC 9965 (x2) | 384 physical | AVX-512 | — |
| Monteverde | — | — | NVIDIA L4 | — |
The T4 is the most common cloud GPU. The AMX VM is a standard compute instance with no GPU drivers, no CUDA libraries, and no special VM image. The L4 represents the current generation of inference GPUs.
Convert to bfloat16
AMX accelerates bfloat16 matrix operations but does nothing for float32. The model was trained in float32, but to use AMX, we must convert to bfloat16. However, to do a like-for-like comparison, we must acknowledge that a GPU has Tensor cores designed for bfloat16 calculations, and if we accept a reduction in resolution on the CPU, we should also do so on the GPU. The code change is simple; we only need to add one line:
```python
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(s2_input, s1_input)
```
PyTorch’s autocast automatically converts supported operations (linear layers, matmuls, attention) to bfloat16 while keeping numerically sensitive operations (layer norms, softmax) in float32. On CPU, this routes matrix multiplications through the AMX tile units. On the GPU, it engages the Tensor Cores: same API, same line of code, different hardware backend.
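The dtype routing is easy to observe directly. A minimal sketch (standalone tensors, not the pipeline's `model` or inputs):

```python
import torch

a = torch.randn(256, 128)  # float32 inputs, matching the trained weights
b = torch.randn(128, 128)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = torch.mm(a, b)   # matmul is autocast-eligible: executes in bfloat16

print(out.dtype)  # torch.bfloat16 -- inputs remain float32
```

The float32 tensors are untouched; autocast casts operands on the way into eligible ops, which is what lets one context manager cover the whole forward pass.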
Results
For a realistic comparison, I ran the Tessera pipeline on downloaded pixel data for my favourite area of Manchester, which has 772,875 pixels.
| Configuration | Inference (mm:ss) | Cost/hr | Cost/tile |
|---|---|---|---|
| L4 GPU, bfloat16 | 2:46 | — | — |
| L4 GPU, float32 | 7:21 | — | — |
| T4 GPU, float32 | 14:31 | $0.94 | $0.23 |
| AMX CPU, bfloat16 | 17:31 | $0.98 | $0.29 |
| T4 GPU, bfloat16 | 19:53 | $0.94 | $0.31 |
| AMX CPU, float32 | 40:36 | $0.98 | $0.66 |
Three results stand out:
At the same price point (~$1/hr), the AMX CPU processes a tile in 17.5 minutes versus the T4's 14.5 minutes: about 20% slower and slightly more expensive ($0.29/tile vs $0.23/tile), but genuinely competitive.
The T4 is slower with bfloat16 than with float32, which is surprising at first, as it has Tensor Cores. The likely explanation is that the T4's Turing-generation Tensor Cores have no native bfloat16 support (bf16 arrived with Ampere), so those operations fall back to slower paths.
The L4 runs the inference in 2:46, which is five times faster than the T4 and six times faster than AMX. Modern Tensor Cores (Ada Lovelace generation) are clearly very efficient with bfloat16!
Does bfloat16 affect output quality?
Given the time and effort required to compute embeddings, I was concerned that converting from float32 to bfloat16 would degrade the resulting embeddings. Going fast is great, but not at the expense of data quality. As I now had the same tile processed six times, I could compare the results.
The pipeline outputs int8 quantised embeddings: each pixel gets a 128-dimensional vector of integers (−128 to +127) plus a per-pixel scale factor that reconstructs the original magnitude.
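A sketch of this quantisation scheme (my reconstruction from the description above, not the pipeline's actual code; the per-pixel scaling rule is an assumption):

```python
import numpy as np

def quantise_int8(emb):
    """Quantise [pixels, 128] float32 embeddings to int8 plus a per-pixel scale."""
    scale = np.abs(emb).max(axis=-1, keepdims=True) / 127.0  # assumed scaling rule
    q = np.clip(np.round(emb / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    """Reconstruct approximate float32 embeddings from int8 values and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 128)).astype(np.float32)
q, scale = quantise_int8(emb)
err = np.abs(dequantise(q, scale) - emb).max()  # at most half a quantisation step
```

Under this scheme each pixel's reconstruction error is bounded by half its own quantisation step, which is why "off by 1" differences in the tables below correspond to tiny absolute errors.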
Firstly, all the float32 runs produce bit-identical output regardless of hardware: L4, T4, and AMX CPUs.
bfloat16 introduces small rounding differences:
| Comparison | Exact match | Off by 1 | Off by 2+ | Cosine similarity |
|---|---|---|---|---|
| L4 float32 vs L4 bfloat16 | 64.9% | 33.7% | 1.4% | 0.99991 |
| T4 float32 vs T4 bfloat16 | 64.9% | 33.7% | 1.4% | 0.99991 |
| AMX float32 vs AMX bfloat16 | 62.3% | 35.9% | 1.8% | 0.99990 |
About 65% of embedding values are identical between float32 and bfloat16. Of the rest, almost all differ by just 1 quantisation step out of 256. Cosine similarity averages 0.9999, and no pixel in the entire tile falls below 0.999.
The int8 quantisation itself introduces far more rounding than the float32-to-bfloat16 precision change.
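The comparison metrics above are straightforward to reproduce. A hypothetical helper, assuming int8 embedding arrays of shape `[pixels, 128]`:

```python
import numpy as np

def compare_embeddings(a, b):
    """Exact-match / off-by-1 / off-by-2+ fractions and mean cosine similarity."""
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))  # int16 avoids overflow
    exact = float((diff == 0).mean())
    off1 = float((diff == 1).mean())
    off2plus = float((diff >= 2).mean())
    af, bf = a.astype(np.float64), b.astype(np.float64)
    cos = (af * bf).sum(axis=1) / (
        np.linalg.norm(af, axis=1) * np.linalg.norm(bf, axis=1))
    return exact, off1, off2plus, float(cos.mean())
```

Comparing an array against itself returns an exact-match fraction of 1.0 and cosine similarity of 1.0, which also makes the helper easy to sanity-check.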
Framework Matters: PyTorch vs ONNX Runtime
Previously, I found that ONNX outperformed PyTorch by 14% on both GPU and CPU; however, on AMX hardware, that reverses completely.
I tested ONNX Runtime 1.24.4 with both the original float32 model and a float16-converted variant against PyTorch bfloat16:
| Threads | PyTorch bfloat16 | ONNX float32 | ONNX float16 |
|---|---|---|---|
| 4 | 1.81 | 8.88 | 8.91 |
| 8 | 1.08 | 5.75 | 5.57 |
| 16 | 1.20 | 4.62 | 4.33 |
(ms/pixel, synthetic benchmark)
ONNX Runtime doesn’t appear to use AMX here: its float16 model runs no faster than its float32 model. ONNX Runtime’s MLAS library does include AMX-aware kernels, but there is no equivalent of PyTorch’s autocast; the model would likely need to be exported with explicit bfloat16 operations to engage them.
Tuning Details
For those who want to reproduce or adapt these results, here are the key tuning parameters we discovered.
Thread Count
On the 16-core AMX VM (32 vCPUs with hyperthreading):
| Threads | bfloat16 ms/pixel | float32 ms/pixel | AMX speedup |
|---|---|---|---|
| 1 | 7.07 | — | — |
| 2 | 3.48 | — | — |
| 4 | 1.81 | 6.03 | 3.3x |
| 8 | 1.08 | 3.06 | 2.8x |
| 16 | 1.20 | 1.89 | 1.6x |
| 32 | 17.71 | — | worse |
Hyperthreading seems to hurt AMX performance, and beyond 8 cores, the performance tails off. 8 physical cores with bfloat16 (1.08 ms/pixel) outperform 16 physical cores with float32 (1.89 ms/pixel).
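For anyone reproducing the thread sweep, `torch.set_num_threads` controls PyTorch's intra-op parallelism. A minimal micro-benchmark sketch (timings and the sweet spot will vary with hardware):

```python
import time
import torch

def ms_per_iter(threads, n=1024, d=256, iters=10):
    """Time bf16-autocast matmuls at a given intra-op thread count."""
    torch.set_num_threads(threads)
    a, b = torch.randn(n, d), torch.randn(d, d)
    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        start = time.perf_counter()
        for _ in range(iters):
            torch.mm(a, b)
        elapsed = time.perf_counter() - start
    return 1000 * elapsed / iters

for t in (1, 2, 4):
    print(t, "threads:", round(ms_per_iter(t), 3), "ms")
```

Pinning the count explicitly matters because PyTorch otherwise defaults to one thread per logical CPU, which on this VM means hyperthreaded siblings contending for the same AMX tile units.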
Batch Size
On the 4-core AMX VM, all with bfloat16 autocast:
| Threads | Batch 64 | Batch 128 | Batch 256 | Batch 512 | Batch 1024 |
|---|---|---|---|---|---|
| 1 | 5.55 | 5.20 | 5.06 | 5.44 | 5.95 |
| 2 | 3.26 | 3.20 | 2.79 | 3.00 | 3.15 |
| 4 | 1.87 | 1.53 | 1.42 | 1.48 | 1.64 |
(ms/pixel)
A batch size of 256 achieves the best performance, confirming the results from the previous AMD EPYC benchmark. Presumably a 256-pixel batch fits in L2/L3 cache, while larger batches spill to main memory.
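In the driver loop, the batch size simply determines how a tile's pixels are chunked (a sketch; `iter_batches` is a hypothetical helper, not pipeline code):

```python
def iter_batches(n_pixels, batch_size=256):
    """Yield (start, end) index ranges covering every pixel of a tile exactly once."""
    for start in range(0, n_pixels, batch_size):
        yield start, min(start + batch_size, n_pixels)

n_batches = sum(1 for _ in iter_batches(1_000_000))
print(n_batches)  # 3907 forward passes for a representative 1M-pixel tile
```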
Cost
At the ~$1/hr price point, two options deliver similar throughput:
| Configuration | Time/tile | $/hr | $/tile | Tiles/$ |
|---|---|---|---|---|
| T4 GPU, float32 | 14.5 min | $0.94 | $0.23 | 4.4 |
| AMX CPU, bfloat16 | 17.5 min | $0.98 | $0.29 | 3.5 |
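The $/tile figures follow directly from time per tile and the hourly rate:

```python
def cost_per_tile(minutes_per_tile, dollars_per_hour):
    """Cost to process one tile at a given hourly rate."""
    return minutes_per_tile / 60 * dollars_per_hour

print(round(cost_per_tile(14.5, 0.94), 2))  # 0.23  (T4, float32)
print(round(cost_per_tile(17.5, 0.98), 2))  # 0.29  (AMX, bfloat16)
```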
Stepping up to an A10 GPU (~$4/hr, Ampere generation) would likely process a tile in ~3 minutes with bfloat16, giving ~$0.20/tile. That is slightly cheaper per tile, but at four times the hourly rate the machine would need to run at capacity to come out ahead.
If you need tiles quickly, regardless of cost, an L4 or A10 with bfloat16 will process a tile in about 3 minutes. If cost is the deciding factor, and given that the Tessera pipeline is embarrassingly parallel when you have an entire planet to process, the T4 running the float32 model both outperforms and undercuts the AMX bfloat16.
However, if you are prepared to use spot pricing, the AMX machines can be had at a substantial discount, as low as $0.1785/hr, while the T4 machines are in higher demand at $0.5176/hr.