There Is No Inference Moat - WIP

inference

llm

infrastructure

competitive-landscape

deepseek

The inference serving layer is being attacked from every direction simultaneously. DeepSeek compresses the problem from the model side, Cerebras from the hardware side, open source from the orchestration side. These are just examples. There will be no winner here.

Author

Laith Zumot

Published

May 17, 2026

There is no inference moat

Not for anyone assembling a stack today. It's not 2012, nor 1999. The problem is being compressed vertically from the model layer, horizontally from the hardware layer, and orthogonally from the software layer, faster than any single vendor can lock it down. Model efficiency, hardware architecture, firmware, networking, and orchestration are all in motion simultaneously, across proprietary and open source alike.

The model layer is eating the orchestration layer

Inference drives training decisions. Edge, general, specialised, grid, standalone: the deployment target shapes the architecture from the start. You do not design a model for a training cluster and figure out serving later.

The orchestration complexity that emerged around 2023 was a response to a real problem: disaggregated prefill and decode workers, cross-node KV cache transfer, latency-aware routing. Prefill is compute-bound. Decode is memory-bandwidth-bound. Running both on the same GPU causes contention. Routing KV blocks across nodes is expensive. Tools like NVIDIA Dynamo were built to manage this.
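
The compute-bound versus memory-bandwidth-bound split can be made concrete with a rough arithmetic-intensity estimate. The sketch below is a simplification, not a profiler: it assumes a dense layer whose weights are read once per pass, ignores activation and KV cache traffic, and uses made-up accelerator numbers (`PEAK_FLOPS`, `MEM_BW`) purely for illustration.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one pass over a dense layer.

    Assumes each weight is read once per pass and contributes one
    multiply-add (2 FLOPs) per token; activation traffic is ignored.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def is_compute_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """Compare against the machine balance point (FLOPs available per byte moved)."""
    return intensity > peak_flops / mem_bandwidth


# Illustrative accelerator, not a real part: 1e15 FLOP/s, 3e12 bytes/s.
PEAK_FLOPS, MEM_BW = 1e15, 3e12

prefill = arithmetic_intensity(tokens_per_pass=2048)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)      # one new token per pass

print(is_compute_bound(prefill, PEAK_FLOPS, MEM_BW))  # True
print(is_compute_bound(decode, PEAK_FLOPS, MEM_BW))   # False
```

Under these assumptions a 2048-token prefill pass does thousands of FLOPs per weight byte, while a single-sequence decode step does about one, which is why the two phases stress different parts of the same GPU.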

DeepSeek has been compressing that problem vertically, from inside the model architecture. V3 relies on MLA (multi-head latent attention, first introduced in V2), which reduces KV cache size dramatically, and on DeepEP, which handles cross-node expert communication at the PTX level by overlapping it with computation rather than serialising it. V4 goes further: DSA (DeepSeek Sparse Attention) reduces attention complexity from O(L²) to O(Lk), where L is the sequence length and k is the number of tokens each query attends to, cutting per-token inference FLOPs by 73% and KV cache memory by 90% relative to V3.2, while supporting 1M token context windows.
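
To see what the move from O(L²) to O(Lk) means at long context, the sketch below counts query-key pairs under causal attention. It is a counting exercise under stated assumptions, not DeepSeek's implementation: `K = 2048` is a hypothetical selection budget, and the cost of choosing which k tokens to attend to is not modelled.

```python
def dense_pairs(seq_len: int) -> int:
    """Causal dense attention: token t attends to all t tokens up to itself."""
    return seq_len * (seq_len + 1) // 2


def sparse_pairs(seq_len: int, k: int) -> int:
    """Causal sparse attention: token t attends to at most k earlier tokens."""
    if seq_len <= k:
        return dense_pairs(seq_len)
    # The first k tokens are effectively dense; every later token is capped at k.
    return dense_pairs(k) + (seq_len - k) * k


L, K = 1_000_000, 2048  # K is an assumed budget, chosen for illustration
ratio = dense_pairs(L) / sparse_pairs(L, K)
print(f"dense/sparse pair ratio at L={L}: {ratio:.0f}x")
```

With these numbers the sparse variant touches roughly 244 times fewer pairs, and the gap widens linearly as L grows while k stays fixed.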

That 90% reduction compresses the core problem the orchestration layer was built to solve. If there is less to transfer, the routing infrastructure matters less. If attention is linear, prefill gets cheaper. The model layer is eating the orchestration layer from above.
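
A back-of-envelope calculation shows why a smaller cache weakens the case for routing infrastructure. The cache size and link speed below are hypothetical placeholders; only the 90% figure comes from the text above.

```python
def transfer_seconds(cache_bytes: float, link_bytes_per_s: float) -> float:
    """Time to move a KV cache between nodes, ignoring protocol overhead."""
    return cache_bytes / link_bytes_per_s


CACHE_BYTES = 16 * 2**30  # hypothetical 16 GiB cache for one long request
LINK = 50e9               # hypothetical 400 Gb/s link, i.e. 50 GB/s

before = transfer_seconds(CACHE_BYTES, LINK)
after = transfer_seconds(0.10 * CACHE_BYTES, LINK)  # after a 90% reduction

print(f"before: {before * 1e3:.0f} ms, after: {after * 1e3:.0f} ms")
```

In this sketch the transfer drops from a few hundred milliseconds to a few tens, which shrinks the payoff from clever placement and cache-aware routing by the same factor.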

The layer that eats itself

The KV cache is the load-bearing problem underneath every piece of inference infrastructure discussed so far. Transformers grow the KV cache linearly with sequence length: every new token appends new key and value entries that all subsequent tokens must attend to. At scale this means HBM fills up, prefill and decode start competing for memory, cross-node transfer becomes unavoidable, and you need KVBM (Dynamo's KV Block Manager) to manage block reuse, NIXL (the NVIDIA Inference Xfer Library) to move blocks between workers, and a router smart enough to minimise cache misses. One architectural decision in 2017 spawned an entire infrastructure industry.
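
The linear growth is easy to quantify. The sketch below uses the standard per-token accounting for a transformer with grouped KV heads; the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration, not any specific model.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,
    batch: int = 1,
) -> int:
    """Bytes of KV cache: one key and one value vector per token, layer, and KV head."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **shape)
long_ctx = kv_cache_bytes(128 * 1024, **shape)

print(per_token // 1024, "KiB per token")        # 128 KiB per token
print(long_ctx // 2**30, "GiB at 128k tokens")   # 16 GiB at 128k tokens
```

A single 128k-token request in this configuration occupies 16 GiB before any weights are loaded, and every concurrent request multiplies that figure.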

The model layer compresses it vertically. DeepSeek V4’s DSA cuts KV cache memory 90%. Less cache means less to transfer, less to route, less infrastructure to justify.

The hardware layer disrupts it horizontally. Full SRAM chips eliminate the HBM bottleneck, the constraint that makes decode memory-bound in the first place, removing bandwidth pressure and the disaggregation requirement with it.

And then there is the orthogonal escape hatch. SSMs (state space models; see Hungry Hungry Hippos, Fu, Dao et al.) carry a fixed-size recurrent state rather than a growing KV cache, so memory footprint stays constant regardless of sequence length. There is no KV cache to manage, route, compress, or transfer. On SRAM-native hardware, SSMs make the infrastructure problem that the KV cache tooling was built to solve nonexistent. The layer eats itself.
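
The fixed-size state is the whole point, and a toy recurrence shows it. The sketch below is a minimal diagonal linear state space model, h_t = a·h_{t-1} + b·x_t and y_t = c·h_t, written in plain Python. It is not the H3 architecture, and the parameter values are arbitrary.

```python
from typing import List, Sequence, Tuple


def ssm_scan(
    xs: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Run a diagonal linear SSM over xs; return the outputs and the final state."""
    h = [0.0] * len(a)  # state size is fixed by the model, not by the input length
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


ys, state = ssm_scan([1.0, 1.0, 1.0], a=[0.5], b=[1.0], c=[1.0])
print(ys)  # [1.0, 1.5, 1.75]

_, long_state = ssm_scan([1.0] * 10_000, a=[0.9, 0.5], b=[1.0, 1.0], c=[1.0, 1.0])
print(len(long_state))  # 2
```

After ten thousand tokens the model still carries two numbers of state, whereas a transformer would carry ten thousand cached key and value entries per layer.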

The hardware layer is eating the architecture

The entire disaggregated inference architecture exists because of one constraint: HBM bandwidth. Decode reads all model weights from HBM for every single output token. The ratio of compute to memory transfer is low, so the GPU waits on memory. This is why prefill and decode benefit from separation, why NIXL exists to move KV blocks between workers, and why cluster-level routing is a problem worth solving.
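
The bandwidth ceiling on decode follows directly from that description. The sketch below is an upper bound under simplifying assumptions: every decode step streams all weights once, KV cache reads and compute limits are ignored, and the model size and bandwidth figures are illustrative rather than measurements of any product.

```python
def max_decode_tokens_per_s(
    weight_bytes: float, mem_bandwidth: float, batch_size: int = 1
) -> float:
    """Upper bound on decode throughput when weight streaming is the only cost.

    Each step reads all weights once and yields one token per sequence in the batch.
    """
    steps_per_s = mem_bandwidth / weight_bytes
    return steps_per_s * batch_size


WEIGHT_BYTES = 70e9 * 2  # hypothetical 70B-parameter model at 2 bytes per parameter
HBM_BW = 3.35e12         # illustrative HBM bandwidth in bytes/s

single = max_decode_tokens_per_s(WEIGHT_BYTES, HBM_BW)
batched = max_decode_tokens_per_s(WEIGHT_BYTES, HBM_BW, batch_size=32)

print(f"batch 1: {single:.1f} tok/s, batch 32: {batched:.1f} tok/s")
```

Batching amortises each weight read across more sequences, which is why so much serving infrastructure exists to keep decode batches full.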

Cerebras built a chip without HBM. The WSE uses on-chip SRAM throughout and eliminates the memory bandwidth bottleneck at the source, removing the constraint that motivates prefill/decode disaggregation. On Cerebras hardware, the architectural distinction between prefill and decode largely disappears, model switching becomes near-instantaneous, and cold starts approach zero.

At the edge the argument gets stronger. Edge nodes cannot afford cross-node KV cache transfers over a radio link; the latency makes disaggregated prefill and decode pools unworkable. Prefill has to happen locally, but local nodes rarely have enough memory to hold both model weights and KV cache simultaneously. DeepSeek V4’s 90% vertical compression makes edge prefill feasible without offloading. Full SRAM removes the HBM bandwidth bottleneck and the memory pressure simultaneously, collapsing the prefill/decode disaggregation complexity onto a single node and making local inference architecturally viable.

The telco inference play is weaker than the pitch implies. Tokens carry negligible volume, so transport was never the bottleneck. Latency is manageable with small on-device models; Apple already runs 3B parameter models on device with central server fallback, and telcos already route requests to minimise path length.

RAN traffic is overwhelmingly traditional protocol processing: scheduling, beamforming, HARQ, handover. AI is a small slice augmenting those functions. The merchant silicon vendors know this and are positioning accordingly. MWC 2026 crystallised the split: Nokia and Samsung are betting on merchant silicon, while Ericsson, Huawei, and ZTE are staying with purpose-built ASICs. AMD is partnering with Ericsson on Open Telco AI with GSMA backing. Intel is pushing CPU-based inference on Xeon, arguing you do not need a GPU for inference at all. NVIDIA is selling Grace Hopper into base stations with Nokia for concurrent 5G and AI workloads, with commercial trials starting 2026.

None of them have the SRAM argument. ASICs and FPGAs are built for the deterministic, low-latency protocol work that dominates RAN. A Cerebras chip in a RAN node is a more credible inference bet than a GPU: SRAM handles bursty low-latency workloads without the HBM bandwidth penalty, and the power envelope fits edge deployment constraints that GPUs do not. Selling a sledgehammer for a job that needs a scalpel is a revenue strategy with a shelf life.

The software stack has no floor

CUDA is the deepest layer of NVIDIA’s lock-in. Dynamo, Grove, and TensorRT all assume CUDA underneath. Mojo attacks that assumption orthogonally: a systems language built on MLIR, designed to compile efficiently for any accelerator, including CPUs, GPUs, TPUs, ASICs, and SRAM-native chips with no CUDA driver stack. If the compiler layer becomes hardware-agnostic, the entire proprietary software edifice above it has no foundation to stand on.

NVIDIA assembled AI Grid from acquired components faster than any team can integrate them. Chris Lattner is taking the opposite bet: build from the compiler layer up with MLIR, ship Mojo as the hardware-agnostic surface, and let the cohesion come from the foundation rather than the press release. Genuine platform integration takes years; acquired stacks assembled under a new brand name are roadmaps until that work is done. The difference shows up in production.

NVIDIA’s software play with Dynamo is structurally similar to CUDA: own the layer between the hardware and the workload before the ecosystem consolidates, then let switching costs do the rest. The open source alternatives are maturing in parallel and in 2026 several are production-grade.

CSPs and enterprises evaluating inference infrastructure have more options than the NVIDIA stack implies. dstack is an open-source, vendor-agnostic orchestrator that handles disaggregated prefill and decode, cache-aware routing, and cluster management across NVIDIA, AMD, TPU, and Tenstorrent hardware. It covers the same orchestration surface as Dynamo’s Grove and KAI Scheduler in a single tool. Grove requires KAI Scheduler for placement, KAI Scheduler assumes NVIDIA MIG, and MIG assumes NVIDIA GPUs; pull one card and the whole stack has nothing to stand on. dstack carries no such dependency chain: it negotiates directly with whatever compute is present.

llm-d, backed by Red Hat, brings KV-aware routing natively to Kubernetes on top of vLLM, targeting the same cross-node routing problem Dynamo addresses through an entirely different software surface.
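
The idea behind KV-aware routing fits in a few lines. The sketch below is a generic illustration of prefix-cache-aware routing, not the algorithm used by llm-d, Dynamo, or dstack: `Worker`, `route`, and the block size are hypothetical names and values, and cache eviction is not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Set

BLOCK = 4  # tokens per cache block (hypothetical)


def block_hashes(tokens: Sequence[int], block: int = BLOCK) -> List[int]:
    """Hash each full block chained with its prefix, so equal hashes imply equal prefixes."""
    hashes, prev = [], 0
    for i in range(0, len(tokens) - len(tokens) % block, block):
        prev = hash((prev, tuple(tokens[i:i + block])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    cached: Set[int] = field(default_factory=set)
    active_requests: int = 0


def cached_prefix_blocks(worker: Worker, hashes: Sequence[int]) -> int:
    """Number of leading blocks of the request already cached on this worker."""
    n = 0
    for h in hashes:
        if h not in worker.cached:
            break
        n += 1
    return n


def route(workers: Sequence[Worker], tokens: Sequence[int]) -> Worker:
    """Prefer the longest cached prefix; break ties by the lowest load."""
    hashes = block_hashes(tokens)
    best = max(workers, key=lambda w: (cached_prefix_blocks(w, hashes), -w.active_requests))
    best.cached.update(hashes)  # the chosen worker now holds these blocks
    best.active_requests += 1
    return best


workers = [Worker("a"), Worker("b")]
first = route(workers, list(range(8)))           # nothing cached anywhere
second = route(workers, list(range(12)))         # shares an 8-token prefix
third = route(workers, list(range(100, 108)))    # unrelated prompt
print(first.name, second.name, third.name)       # a a b
```

The second request follows its cached prefix to the same worker, and the unrelated third request goes to the idle one. The smaller the cache, the less this placement decision is worth, which is the pressure the model layer applies.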

The open source stack is converging on Dynamo’s capabilities. The question is whether NVIDIA can force adoption fast enough that switching costs take over before convergence completes.

The founding members of llm-d include Google, IBM, CoreWeave, and Red Hat, alongside NVIDIA. The motivations are commercial on all sides. Google has TPUs to protect. IBM has its own silicon roadmap. Both run cloud businesses where an NVIDIA software monopoly on top of an NVIDIA hardware monopoly is an existential cost problem. They are funding the open alternative because the alternative is paying a software tax on top of a hardware tax indefinitely. This is a Thucydides trap at the infrastructure layer: the rising software power and the incumbents who benefit from keeping that layer contested are funding the same project for opposite reasons.

The competition at every level

The fragmentation runs through every level of the stack simultaneously. Each layer has multiple credible alternatives attacking from different directions, with different incentives and no coordination.

Layer	Incumbents	Challengers	What they compress
Model architecture	OpenAI, Anthropic	DeepSeek, Qwen, Gemma	KV cache, FLOPs, context cost
Training hardware	NVIDIA H100/B200	Cerebras WSE, Google TPU, AMD MI350, Intel Gaudi	HBM dependency, power, cost
Inference hardware	NVIDIA GPU	Cerebras WSE, Marvell ASIC, Broadcom, Intel Xeon	Memory bandwidth bottleneck, RAN fit
Transport	Ethernet, InfiniBand	Ciena, Infinera, Broadcom optical	Inter-node transfer cost
Serving engine	vLLM, TRT-LLM	SGLang, lmcache, Modular MAX	Throughput, latency, memory efficiency
Inference orchestration	NVIDIA Dynamo	dstack, llm-d, AIBrix	Cross-node routing, KV management
Cluster management	NVIDIA KAI/Grove	Slurm, BCM, dstack	GPU allocation, scheduling
Compiler / runtime	CUDA	Mojo/MLIR, ROCm, OpenCL	Hardware lock-in
Multi-model serving	vLLM multi-model	InferX, dstack	Cold start, model switching cost
Agent frameworks	OpenAI, LangChain	Every open source framework built this week	Demand assumption itself

The decision criteria for a CSP choosing between these stacks have little to do with technical merit. Cost is the only real criterion, and it compounds: high interest rates raised capex costs, 5G never delivered the revenue to justify the spend, and every CSP is now carrying that debt while being asked to bet again on AI infrastructure. The vendor that comes in cheapest with a credible reference customer wins; technical superiority carries little weight when the CFO is running the room.

The game is financing: who can subsidise CSP adoption long enough to become the default before the cycle turns. CSPs know this; it is a buyer’s market. Every vendor is competing to give away the most to lock in the least. That continues until forward spend can no longer be justified by agent demand that has not materialised while rates stay high, at which point the whole thing reprices. The vendors who spent the most to manufacture the default lose the most when it does.

The counter-argument

SemiAnalysis makes the strongest case for the moat. Four million CUDA developers create a self-reinforcing flywheel: breakthrough ideas land on CUDA first, get tuned on CUDA first, and attract the next wave of developers. NCCL is pulling away from AMD’s RCCL fork with every release; AMD spends engineering cycles copying changes while NVIDIA advances to the next version. Python interfaces at every layer of the stack make CUDA easier to reach and harder to leave. The velocity argument has teeth.

But velocity in a dead end usually results in hitting a wall.

The NCCL gap matters less when the workload it optimises is being compressed out of existence. DeepSeek V4 cut KV cache memory 90%; the collective communications problem NCCL is being tuned for shrinks with it. The developer flywheel spins fast, but SSMs carry no KV cache at all, and the infrastructure those four million developers are building around may not be the infrastructure the next generation of models requires. And if DeepSeek V4 is already migrating inference from CUDA to Huawei’s CANN, the flywheel has a leak that velocity alone cannot seal.

Epoch AI puts numbers on the trajectory: inference cost at a fixed capability level falls 5-10x per year. Network latency is the primary constraint for fast inference at real-world batch sizes. NVIDIA’s NCCL improvements target collective communications bandwidth, which only becomes the binding constraint above 500 requests per GPU; below that threshold the bottleneck sits elsewhere. Optimising the wrong constraint faster is a way to spend capital, not build a moat.
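
The compounding in that cost trajectory is worth spelling out. The sketch below simply applies a constant annual decline factor; the three-year horizon is an arbitrary choice for illustration, and a constant rate is an assumption rather than a forecast.

```python
def relative_cost(annual_decline: float, years: float) -> float:
    """Cost relative to today if it falls by `annual_decline`x every year."""
    return annual_decline ** -years


for factor in (5.0, 10.0):
    remaining = relative_cost(factor, years=3)
    print(f"{factor:.0f}x per year for 3 years: {1 / remaining:.0f}x cheaper")
```

At the low end of the range a fixed capability level costs 125 times less after three years, and at the high end 1000 times less, so any pricing power built on today's serving cost has to survive that slope.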

Speed of iteration on a shrinking problem does not compound; it converges.

What fragments

Nothing survives intact. The orchestration layer gets pulled apart by each adjacent layer solving its own piece more efficiently than any integrated stack can.

The model layer compresses the KV cache problem vertically until the routing infrastructure built around it becomes overhead. Different silicon removes the architecture constraint without touching the model. The compiler and open source layers reach the same hardware through a different software surface, leaving switching costs with nothing to attach to. Each pressure runs on its own timeline, for its own reasons, with no coordination required.

NVIDIA knows this. AI Grid is modular by design because when DeepSeek V4, Google, Cerebras, IBM, and a long tail of smaller players are each attacking a different layer simultaneously, tight integration becomes a liability. You stay loose enough to shed whatever layer stops mattering next. The stack that looks like a platform today is a collection of options for tomorrow.

The inference moat narrative is manufactured the same way pop music was in the 2000s. There is a quote, attributed variously, from a label executive stepping out of a listening session in the late 90s: “Right, now how do we get this one to number one.” Someone decides what the market needs, builds the narrative, spends to make it true, and calls it a platform before the demand has been validated. The underlying question is the same: is the demand real or manufactured?

The agent bet is load-bearing for the whole stack. If agents work at scale, the multi-model switching problem is real, the inference orchestration complexity is justified, and the capital allocation makes sense. If agents work but with small specialised models on-device, the hyperscaler inference stack is oversized for the actual demand. If agents do not work at scale at all, the entire narrative collapses. Nobody knows yet. Every infrastructure decision being made today is a bet on that answer.

There is no inference moat. There is a race to be the last layer standing when everything above and below has been commoditised, built on a demand assumption that has not been proven. Nobody is winning that race yet.

All of this capital, all of this fragmentation, all of this manufactured urgency, to allow little Johnny to solve a math problem.

References

Models

DeepSeek V3 Technical Report – arxiv.org/abs/2412.19437
DeepEP: Expert-parallel communication library – github.com/deepseek-ai/DeepEP
DeepSeek V4: Towards Highly Efficient Million-Token Context Intelligence – puiching-memory.github.io
Hungry Hungry Hippos: Towards Language Modeling with State Space Models (H3, ICLR 2023) – arxiv.org/abs/2212.14052

Hardware

Cerebras WSE-3: comparison with NVIDIA GPU systems – arxiv.org/html/2503.11698v1

Orchestration and serving

NVIDIA Dynamo architecture – docs.nvidia.com/dynamo
dstack: vendor-agnostic GPU orchestrator – github.com/dstackai/dstack
llm-d: Kubernetes-native distributed inference (CNCF Sandbox, Red Hat / Google / IBM / NVIDIA / CoreWeave) – github.com/llm-d/llm-d
InferX: multi-model serverless inference runtime – inferx.com

Software

Mojo: hardware-agnostic systems language built on MLIR – docs.modular.com/mojo/vision

Telco / RAN

MWC 2026: The RAN Chip Wars – merchant vs ASIC strategies – 650group.com
AMD advances AI for telco networks, MWC 2026 – amd.com