Anima Mundi

Strategic analysis: the AI compute wars (2026)

The AI industry has shifted from training to inference.

In 2024, the constraint was capital.
In 2026, the constraint is token margins.

The competitive frontier is no longer maximum compute density. It is minimum cost per token. Compute has stopped being a headline metric and become a unit economics problem.
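The unit economics framing above can be sketched in a few lines. This is a minimal illustration, not a pricing model: the function name, the utilization parameter, and every number below are hypothetical and chosen only to show the shape of the calculation.

```python
def cost_per_million_tokens(hourly_cost_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """USD spent to serve one million tokens on a single accelerator."""
    if hourly_cost_usd < 0 or tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("cost must be >= 0, throughput > 0, utilization in (0, 1]")
    # Tokens actually served per hour, after idle time is accounted for.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: $4/hour accelerator, 2,000 tokens/s, busy half the time.
baseline = cost_per_million_tokens(4.0, 2000, utilization=0.5)
# Same hardware, twice the sustained throughput: cost per token halves.
faster = cost_per_million_tokens(4.0, 4000, utilization=0.5)

print(f"baseline: ${baseline:.2f} per 1M tokens")
print(f"faster:   ${faster:.2f} per 1M tokens")
```

Hourly cost, throughput, and utilization all move the same number, which is why density alone stops being the headline metric.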


1. The TPU arbitrage

Google’s TPU strategy is not about hardware leadership. It is about vertical integration.

By keeping TPUs proprietary, Google avoids the enterprise tax: supporting thousands of environments, edge cases, and developer workflows. They support exactly one stack. One model family. One deployment path.

That simplicity shows up directly in costs.

A ~4.5× cost advantage at the hardware layer compounds up the stack. Google can price Gemini at market rates while capturing 70–80% gross margins. Those margins are then recycled into the next TPU generation.
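A rough way to see what those figures imply for a competitor is sketched below. It rests on one loud assumption that the figures themselves do not guarantee: that the ~4.5× advantage carries through fully to the cost of serving a token. The function name is hypothetical.

```python
def rival_margin_at_same_price(own_margin: float, cost_ratio: float) -> float:
    """Gross margin of a rival whose unit cost is `cost_ratio` times ours,
    selling at the same price we do.

    ASSUMPTION: the hardware-level cost ratio applies to total serving cost.
    """
    own_cost_share = 1.0 - own_margin          # our cost as a fraction of price
    return 1.0 - cost_ratio * own_cost_share   # rival's margin at that price


for own_margin in (0.70, 0.80):
    rival = rival_margin_at_same_price(own_margin, cost_ratio=4.5)
    print(f"own margin {own_margin:.0%} -> rival margin {rival:.0%}")
```

Under that assumption, a seller at 80% margin leaves a 4.5×-cost rival with roughly 10% at the same price, and a seller at 70% leaves the rival underwater.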

OpenAI and Anthropic raise capital to pay the Nvidia tax. Google uses margin to eliminate it.


2. The $20B regulatory bypass

Nvidia’s move around Groq is not a traditional acquisition. It is IP denial without triggering merger review.

The structure:

The signal:

The objective:

This was not about revenue. It was about removing a future branch of the design tree.


3. HBM vs SRAM: the latency wall

Modern GPUs are optimized for throughput, not determinism.

HBM delivers high capacity and bandwidth, but with significant latency variance. That works well for prompt prefill. It performs poorly during generation, especially in agentic or real-time loops where jitter compounds.

SRAM-based architectures behave differently. Execution is deterministic. Scheduling is static. Tail latency largely disappears.

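The emerging split is structural: HBM-backed GPUs for throughput-bound prefill, SRAM-based hardware for latency-sensitive generation.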

Inference is no longer a single phase. It is a pipeline with different bottlenecks at each stage.
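The claim that jitter compounds in agentic loops can be checked with a small simulation. This is a sketch under invented assumptions: a chain of 50 sequential steps, a 20 ms base latency, and a 1% chance per step of a 200 ms spike. None of these numbers describe real hardware.

```python
import random


def chance_of_any_spike(spike_prob: float, steps: int) -> float:
    """Probability that a chain of sequential steps hits at least one spike."""
    return 1.0 - (1.0 - spike_prob) ** steps


def simulate_chain_ms(steps, base_ms, spike_prob, spike_ms, trials, seed=0):
    """Sorted total latencies of `trials` chains of `steps` sequential steps."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    totals = []
    for _ in range(trials):
        total = 0.0
        for _ in range(steps):
            total += base_ms
            if rng.random() < spike_prob:
                total += spike_ms  # occasional slow step
        totals.append(total)
    return sorted(totals)


def percentile(sorted_values, q):
    """Simple index-based percentile on an already sorted list, q in [0, 1]."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


STEPS, BASE_MS = 50, 20.0
deterministic = simulate_chain_ms(STEPS, BASE_MS, 0.0, 0.0, trials=2000)
jittery = simulate_chain_ms(STEPS, BASE_MS, 0.01, 200.0, trials=2000)

print(f"chains hitting a spike: {chance_of_any_spike(0.01, STEPS):.1%}")
print(f"deterministic p99: {percentile(deterministic, 0.99):.0f} ms")
print(f"jittery p99:       {percentile(jittery, 0.99):.0f} ms")
```

A 1% tail per step becomes a roughly 40% tail per task once 50 steps run back to back. That is the sense in which per-step variance matters more than per-step averages.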


4. The compatibility matrix

The software moat is being attacked from two directions: portability and performance.

Framework   Backend  Hardware     Status
PyTorch     CUDA     Nvidia GPU   Native, deeply optimized
PyTorch     XLA      Google TPU   Improving via torch_xla
JAX         XLA      Google TPU   First-class, production
JAX         CUDA     Nvidia GPU   Mature, high performance
TensorFlow  XLA      TPU / GPU    Legacy, maintenance mode

The framework matters less than the compiler boundary it implies.


5. The real war: CUDA vs XLA

This is not a PyTorch vs JAX debate. It is a compiler war.

CUDA is a powerful lock-in. Most researchers live inside it, and leaving usually means weeks of subtle debugging.

XLA treats models as pure functions and pushes hardware decisions downstream. Portability is the default, not an afterthought.
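The "pure function first, hardware later" idea can be illustrated with a toy. To be clear, this is not XLA and uses none of its APIs: `Node`, `trace`, `interpret`, and `emit_source` are hypothetical names for a deliberately tiny sketch. It records a pure function as a graph once, then hands that same graph to two unrelated backends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One node of a toy expression graph: an input, a constant, or a binary op."""
    op: str                 # "input", "const", "add", or "mul"
    args: tuple = ()        # child nodes for "add" / "mul"
    value: object = None    # input name or constant value

    def __add__(self, other):
        return Node("add", (self, _wrap(other)))

    def __mul__(self, other):
        return Node("mul", (self, _wrap(other)))


def _wrap(x):
    return x if isinstance(x, Node) else Node("const", value=x)


def trace(fn, *input_names):
    """Call fn once on symbolic inputs; the return value is the recorded graph."""
    return fn(*(Node("input", value=name) for name in input_names))


def interpret(node, env):
    """Backend 1: evaluate the graph directly."""
    if node.op == "input":
        return env[node.value]
    if node.op == "const":
        return node.value
    left, right = (interpret(arg, env) for arg in node.args)
    return left + right if node.op == "add" else left * right


def emit_source(node):
    """Backend 2: lower the same graph to an expression string."""
    if node.op in ("input", "const"):
        return str(node.value)
    left, right = (emit_source(arg) for arg in node.args)
    symbol = "+" if node.op == "add" else "*"
    return f"({left} {symbol} {right})"


def model(x, w):
    # A pure function: no side effects, output depends only on its inputs.
    return x * w + 1


graph = trace(model, "x", "w")
env = {"x": 2.0, "w": 3.0}
print(interpret(graph, env))
print(emit_source(graph))
```

The model author never names a backend. That decision is made after tracing, which is the property that makes portability a default rather than a port.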

The key development is PyTorch frontends that lower into XLA graphs. If developers can write idiomatic PyTorch and get XLA-level portability, the CUDA moat weakens quickly.

Once code is portable, hardware differentiation collapses. Once hardware is interchangeable, margins compress.


6. The data center is the chip

Nvidia’s strongest moat is no longer silicon. It is the interconnect.

Raw FLOPS do not matter if you do not control the wire. Performance at scale is a networking problem disguised as a hardware one.
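A back-of-envelope sketch of why the wire dominates at scale: the bandwidth-only cost model for a ring all-reduce, the kind of collective that multi-device training and tensor-parallel inference both lean on. Per-hop latency is ignored, and the payload size, device count, link speeds, and compute time are all hypothetical.

```python
def ring_allreduce_seconds(payload_bytes: float,
                           num_devices: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only time for a ring all-reduce (per-hop latency ignored).

    Each device sends and receives 2 * (N - 1) / N of the payload.
    """
    if num_devices < 2:
        return 0.0  # nothing to exchange on a single device
    traffic = 2 * (num_devices - 1) / num_devices * payload_bytes
    return traffic / link_bytes_per_s


def comm_fraction(compute_s: float, comm_s: float) -> float:
    """Share of a step spent communicating, assuming no compute/comm overlap."""
    return comm_s / (compute_s + comm_s)


# Hypothetical step: 10 GB exchanged across 8 devices, 0.1 s of pure compute.
PAYLOAD, DEVICES, COMPUTE_S = 10e9, 8, 0.1
slow = ring_allreduce_seconds(PAYLOAD, DEVICES, link_bytes_per_s=50e9)
fast = ring_allreduce_seconds(PAYLOAD, DEVICES, link_bytes_per_s=400e9)

print(f"50 GB/s link:  {slow:.3f} s comm, {comm_fraction(COMPUTE_S, slow):.0%} of step")
print(f"400 GB/s link: {fast:.3f} s comm, {comm_fraction(COMPUTE_S, fast):.0%} of step")
```

With the slower link in this made-up case, most of each step is spent waiting on the network, and faster arithmetic units would change almost nothing.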


7. Signals going into 2026

The next winners will not have the biggest models.

They will have the lowest latency, the cheapest tokens, and full control over how bits move through the system.

#ai #research #startups