GPU acceleration is the default assumption for machine learning inference. But Intel’s AMX (Advanced Matrix Extensions) may close the gap. AMX is built into recent Xeon processors, which are available from Azure. Can they compete with similarly priced GPU-based machines for the Tessera pipeline?
The Tessera encoder produces 128-dimensional embeddings from multi-temporal Sentinel-2 and Sentinel-1 satellite imagery.
- Inputs: 40 time-sampled optical observations `[batch, 40, 11]` and 40 SAR observations `[batch, 40, 3]`
- Output: one embedding per pixel `[batch, 128]`
The model processes the Earth’s land surface as a grid of 0.1° tiles. At 10m resolution, each tile is roughly 1,000 x 1,000 pixels. The exact dimensions vary with latitude, but 1 million is a good representative figure per tile.
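The tile arithmetic can be sketched in a few lines (a rough illustration assuming ~111.32 km per degree of latitude; the real grid will differ slightly):

```python
import math

def tile_pixels(lat_deg, res_m=10.0, tile_deg=0.1):
    """Approximate pixel dimensions of a 0.1-degree tile at a given latitude."""
    m_per_deg_lat = 111_320.0  # mean metres per degree of latitude (assumption)
    height = round(tile_deg * m_per_deg_lat / res_m)
    # Degrees of longitude shrink by cos(latitude)
    width = round(tile_deg * m_per_deg_lat * math.cos(math.radians(lat_deg)) / res_m)
    return height, width

h, w = tile_pixels(53.5)   # roughly Manchester's latitude
print(h, w, h * w)         # on the order of 1,100 x 660, ~0.7M pixels
```

At mid-latitudes this comes out somewhat under the 1 million representative figure; near the equator, where `cos(lat)` approaches 1, a tile is closer to 1,113 x 1,113 pixels.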
Hardware and Cost
In this post, I am going to compare our monster AMD EPYC machine (with its NVIDIA L4) against two commonly available Azure machines with nearly identical hourly rates:
| Machine | Hardware | Cores | Accelerator | $/hr |
|---|---|---|---|---|
| Azure D16s_v6 | Intel Xeon 8573C | 8 physical | AMX (bf16) | $0.98 |
| Azure NC8as_T4_v3 | AMD EPYC 7V12 | 4 physical | Tesla T4 | $0.94 |
| Monteverde | AMD EPYC 9965 (x2) | 384 physical | AVX-512 | — |
| Monteverde | — | — | NVIDIA L4 | — |
The T4 is the most common cloud GPU. The AMX VM is a standard compute instance with no GPU drivers, no CUDA libraries, and no special VM image. The L4 represents the current generation of inference GPUs.
Convert to bfloat16
AMX accelerates bfloat16 matrix operations but does nothing for float32. The model was trained in float32, but to use AMX, we must convert to bfloat16. However, to do a like-for-like comparison, we must acknowledge that a GPU has Tensor cores designed for bfloat16 calculations, and if we accept a reduction in resolution on the CPU, we should also do so on the GPU. The code change is simple; we only need to add one line:
```python
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(s2_input, s1_input)
```
PyTorch’s autocast automatically converts supported operations (linear layers, matmuls, attention) to bfloat16 while keeping numerically sensitive operations (layer norms, softmax) in float32. On CPU, this routes matrix multiplications through the AMX tile units. On the GPU, it engages the Tensor Cores: same API, same line of code, different hardware backend.
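The dtype routing is easy to observe directly. A minimal sketch (standalone tensors, not the pipeline's `model` or inputs):

```python
import torch

a = torch.randn(256, 128)  # float32 inputs, matching the trained weights
b = torch.randn(128, 128)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = torch.mm(a, b)   # matmul is autocast-eligible: executes in bfloat16

print(out.dtype)  # torch.bfloat16 -- inputs remain float32
```

The float32 tensors are untouched; autocast casts operands on the way into eligible ops, which is what lets one context manager cover the whole forward pass.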
Results
For a realistic comparison, I ran the Tessera pipeline on downloaded pixel data for my favourite area of Manchester, which has 772,875 pixels.
| Configuration | Inference (mm:ss) | Cost/hr | Cost/tile |
|---|---|---|---|
| L4 GPU, bfloat16 | 2:46 | — | — |
| L4 GPU, float32 | 7:21 | — | — |
| T4 GPU, float32 | 14:31 | $0.94 | $0.23 |
| AMX CPU, bfloat16 | 17:31 | $0.98 | $0.29 |
| T4 GPU, bfloat16 | 19:53 | $0.94 | $0.31 |
| AMX CPU, float32 | 40:36 | $0.98 | $0.66 |
Three results stand out:
At the same price point (~$1/hr), the AMX CPU processes a tile in 17.5 minutes versus the T4's 14.5 minutes: about 20% slower and slightly more expensive ($0.29/tile vs $0.23/tile), but genuinely competitive.
The T4 is slower with bfloat16 than with float32, which is surprising at first, as it has Tensor Cores. The likely explanation is that the T4's Turing-generation Tensor Cores have no native bfloat16 support (bf16 arrived with Ampere), so those operations fall back to slower paths.
The L4 runs the inference in 2:46, which is five times faster than the T4 and six times faster than AMX. Modern Tensor Cores (Ada Lovelace generation) are clearly very efficient with bfloat16!
Does bfloat16 affect output quality?
Given the time and effort required to compute embeddings, I was concerned that converting from float32 to bfloat16 would degrade the resulting embeddings. Going fast is great, but not at the expense of data quality. As I now had the same tile processed six times, I could compare the results.
The pipeline outputs int8 quantised embeddings: each pixel gets a 128-dimensional vector of integers (−128 to +127) plus a per-pixel scale factor that reconstructs the original magnitude.
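A sketch of this quantisation scheme (my reconstruction from the description above, not the pipeline's actual code; the per-pixel scaling rule is an assumption):

```python
import numpy as np

def quantise_int8(emb):
    """Quantise [pixels, 128] float32 embeddings to int8 plus a per-pixel scale."""
    scale = np.abs(emb).max(axis=-1, keepdims=True) / 127.0  # assumed scaling rule
    q = np.clip(np.round(emb / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    """Reconstruct approximate float32 embeddings from int8 values and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 128)).astype(np.float32)
q, scale = quantise_int8(emb)
err = np.abs(dequantise(q, scale) - emb).max()  # at most half a quantisation step
```

Under this scheme each pixel's reconstruction error is bounded by half its own quantisation step, which is why "off by 1" differences in the tables below correspond to tiny absolute errors.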
Firstly, all the float32 runs produce bit-identical output regardless of hardware: L4, T4, and AMX CPUs.
bfloat16 introduces small rounding differences:
| Comparison | Exact match | Off by 1 | Off by 2+ | Cosine similarity |
|---|---|---|---|---|
| L4 float32 vs L4 bfloat16 | 64.9% | 33.7% | 1.4% | 0.99991 |
| T4 float32 vs T4 bfloat16 | 64.9% | 33.7% | 1.4% | 0.99991 |
| AMX float32 vs AMX bfloat16 | 62.3% | 35.9% | 1.8% | 0.99990 |
About 65% of embedding values are identical between float32 and bfloat16. Of the rest, almost all differ by just 1 quantisation step out of 256. Cosine similarity averages 0.9999, and no pixel in the entire tile falls below 0.999.
The int8 quantisation itself introduces far more rounding than the float32-to-bfloat16 precision change.
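The comparison metrics above are straightforward to reproduce. A hypothetical helper, assuming int8 embedding arrays of shape `[pixels, 128]`:

```python
import numpy as np

def compare_embeddings(a, b):
    """Exact-match / off-by-1 / off-by-2+ fractions and mean cosine similarity."""
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))  # int16 avoids overflow
    exact = float((diff == 0).mean())
    off1 = float((diff == 1).mean())
    off2plus = float((diff >= 2).mean())
    af, bf = a.astype(np.float64), b.astype(np.float64)
    cos = (af * bf).sum(axis=1) / (
        np.linalg.norm(af, axis=1) * np.linalg.norm(bf, axis=1))
    return exact, off1, off2plus, float(cos.mean())
```

Comparing an array against itself returns an exact-match fraction of 1.0 and cosine similarity of 1.0, which also makes the helper easy to sanity-check.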
Framework Matters: PyTorch vs ONNX Runtime
Previously, I found that ONNX outperformed PyTorch by 14% on both GPU and CPU; however, on AMX hardware, that reverses completely.
I tested ONNX Runtime 1.24.4 with both the original float32 model and a float16-converted variant against PyTorch bfloat16:
| Threads | PyTorch bfloat16 | ONNX float32 | ONNX float16 |
|---|---|---|---|
| 4 | 1.81 | 8.88 | 8.91 |
| 8 | 1.08 | 5.75 | 5.57 |
| 16 | 1.20 | 4.62 | 4.33 |
(ms/pixel, synthetic benchmark)
ONNX Runtime doesn’t appear to use AMX here: its float16 model runs no faster than its float32 model. ONNX Runtime’s MLAS library does include AMX-aware kernels, but there is no equivalent of PyTorch’s autocast; the model would likely need to be exported with explicit bfloat16 operations to engage them.
Tuning Details
For those who want to reproduce or adapt these results, here are the key tuning parameters we discovered.
Thread Count
On the 16-core AMX VM (32 vCPUs with hyperthreading):
| Threads | bfloat16 ms/pixel | float32 ms/pixel | AMX speedup |
|---|---|---|---|
| 1 | 7.07 | — | — |
| 2 | 3.48 | — | — |
| 4 | 1.81 | 6.03 | 3.3x |
| 8 | 1.08 | 3.06 | 2.8x |
| 16 | 1.20 | 1.89 | 1.6x |
| 32 | 17.71 | — | worse |
Hyperthreading seems to hurt AMX performance, and beyond 8 cores, the performance tails off. 8 physical cores with bfloat16 (1.08 ms/pixel) outperform 16 physical cores with float32 (1.89 ms/pixel).
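For anyone reproducing the thread sweep, `torch.set_num_threads` controls PyTorch's intra-op parallelism. A minimal micro-benchmark sketch (timings and the sweet spot will vary with hardware):

```python
import time
import torch

def ms_per_iter(threads, n=1024, d=256, iters=10):
    """Time bf16-autocast matmuls at a given intra-op thread count."""
    torch.set_num_threads(threads)
    a, b = torch.randn(n, d), torch.randn(d, d)
    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        start = time.perf_counter()
        for _ in range(iters):
            torch.mm(a, b)
        elapsed = time.perf_counter() - start
    return 1000 * elapsed / iters

for t in (1, 2, 4):
    print(t, "threads:", round(ms_per_iter(t), 3), "ms")
```

Pinning the count explicitly matters because PyTorch otherwise defaults to one thread per logical CPU, which on this VM means hyperthreaded siblings contending for the same AMX tile units.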
Batch Size
On the 4-core AMX VM, all with bfloat16 autocast:
| Threads | Batch 64 | Batch 128 | Batch 256 | Batch 512 | Batch 1024 |
|---|---|---|---|---|---|
| 1 | 5.55 | 5.20 | 5.06 | 5.44 | 5.95 |
| 2 | 3.26 | 3.20 | 2.79 | 3.00 | 3.15 |
| 4 | 1.87 | 1.53 | 1.42 | 1.48 | 1.64 |
(ms/pixel)
A batch size of 256 achieves the best performance, confirming the results from the previous AMD EPYC benchmark. Presumably a 256-pixel batch fits in L2/L3 cache, while larger batches spill to main memory.
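In the driver loop, the batch size simply determines how a tile's pixels are chunked (a sketch; `iter_batches` is a hypothetical helper, not pipeline code):

```python
def iter_batches(n_pixels, batch_size=256):
    """Yield (start, end) index ranges covering every pixel of a tile exactly once."""
    for start in range(0, n_pixels, batch_size):
        yield start, min(start + batch_size, n_pixels)

n_batches = sum(1 for _ in iter_batches(1_000_000))
print(n_batches)  # 3907 forward passes for a representative 1M-pixel tile
```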
Cost
At the ~$1/hr price point, two options deliver similar throughput:
| Configuration | Time/tile | $/hr | $/tile | Tiles/$ |
|---|---|---|---|---|
| T4 GPU, float32 | 14.5 min | $0.94 | $0.23 | 4.4 |
| AMX CPU, bfloat16 | 17.5 min | $0.98 | $0.29 | 3.5 |
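The $/tile figures follow directly from time per tile and the hourly rate:

```python
def cost_per_tile(minutes_per_tile, dollars_per_hour):
    """Cost to process one tile at a given hourly rate."""
    return minutes_per_tile / 60 * dollars_per_hour

print(round(cost_per_tile(14.5, 0.94), 2))  # 0.23  (T4, float32)
print(round(cost_per_tile(17.5, 0.98), 2))  # 0.29  (AMX, bfloat16)
```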
Stepping up to an A10 GPU (~$4/hr, Ampere generation) would likely process a tile in ~3 minutes with bfloat16, giving ~$0.20/tile. That is slightly cheaper per tile, but at four times the hourly rate the machine would need to run at capacity to come out ahead.
If you need tiles quickly, regardless of cost, an L4 or A10 with bfloat16 will process a tile in about 3 minutes. If cost is the deciding factor, and given that the Tessera pipeline is embarrassingly parallel when you have an entire planet to process, the T4 running the float32 model both outperforms and undercuts the AMX bfloat16.
However, if you are prepared to use spot pricing, the AMX machines can be had at a substantial discount, as low as $0.1785/hr, while the T4 machines are in higher demand at $0.5176/hr.