TokenSpeed, Blackwell, and Vera Rubin: The Runtime Boundary Is Moving

Start here: what this post is really about

This post is about a simple shift: AI applications are becoming more like operating systems. They do not just send one prompt to one model and wait for one answer. They run loops, call tools, search files, spawn sub-agents, remember context, retry failed steps, and keep moving state between software, CPU memory, GPU memory, and networked machines.

That changes the bottleneck. The question is no longer only “which model is fastest?” or “which GPU has the most FLOPs?” The question is: where should each part of the AI task live while the system is running?

TokenSpeed’s launch is worth paying attention to, but not because the industry needed another OpenAI-compatible inference server. The interesting part is the boundary it attacks.

If you are not deep in inference infrastructure, that sentence needs unpacking. An inference server is the software that receives a request, prepares the model input, schedules GPU work, manages the attention (KV) cache, streams tokens back, and keeps many users from blocking each other. vLLM, SGLang, TensorRT-LLM, Dynamo, and TokenSpeed all sit somewhere in this serving stack. They differ in which layer they optimize and where they believe the “hot path” should live.

Agentic coding workloads have made request lifecycle, KV ownership, prefix reuse, prefill/decode overlap, and scheduler jitter first-order performance concerns. The question is no longer whether vLLM, SGLang, TensorRT-LLM, or TokenSpeed can serve tokens. They can. The question is where the runtime should place ownership of request state, cache state, scheduler transitions, and CPU/GPU handoff.

Terms used in this post

Runtime: The software layer that keeps an AI task moving while it runs: request handling, scheduling, memory ownership, cache lookup, GPU dispatch, streaming, retries, and tool calls.

Boundary: A handoff point between layers: Python to C++, CPU to GPU, one GPU to another GPU, prefill to decode, local cache to remote cache, or agent runtime to model server.

KV cache: The attention memory created while a model reads context. Reusing it can make repeated long-context work cheaper. Losing it can force the system to recompute expensive context.

Prefill: The phase where the model reads the prompt/context and builds attention state. Long prompts make prefill expensive.

Decode: The phase where the model generates output tokens. Decode wants low latency and stable streaming.

Agentic workload: A workload where the AI system performs steps, calls tools, reads files, spawns helpers, and loops until a task is done. Coding agents are the clearest example.
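To make the KV cache term concrete, here is a rough sizing sketch. It assumes standard multi-head or grouped-query attention, where each layer stores one key and one value vector per KV head per token. The model shape below is hypothetical and chosen only for illustration, and the formula does not apply to compressed schemes such as MLA.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for multi-head or grouped-query attention.

    The leading factor of 2 covers one key tensor and one value tensor
    per layer. bytes_per_value=2 corresponds to FP16/BF16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape (an assumption, not any specific model).
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
context_100k = kv_cache_bytes(100_000, num_layers=32, num_kv_heads=8, head_dim=128)

print(per_token)                         # 131072 bytes, i.e. 128 KiB per token
print(round(context_100k / 2**30, 2))    # about 12.21 GiB for 100k tokens
```

Under these assumptions a single long agent context holds gigabytes of attention state, which is why losing or moving it is expensive.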

Agentic coding is not a chatbot workload

The important part of NVIDIA’s Claude Code trace is not simply that agents use more tokens. The important part is the shape: many requests, sub-agent fanout, context growth, tool calls, compaction, and repeated cache-sensitive transitions.

agent loop
  → file reads
  → terminal commands
  → tool results
  → sub-agent calls
  → context growth
  → prefix-cache reuse
  → prefill
  → decode
  → compaction
  → repeat

A normal chatbot request is usually one user message followed by one assistant response. A coding agent is different. It may inspect a repository, run commands, read errors, ask another agent to review a file, update context, compact history, and then call the model again. That means the model server sees a chain of related requests rather than one isolated request.
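A small piece of arithmetic shows why a chain of related requests behaves differently from isolated ones. The sketch below is a simplification under stated assumptions: each turn resends the whole context so far, model output tokens are ignored, and a cache hit is treated as perfect. The turn sizes are hypothetical.

```python
def prefill_tokens(turn_sizes: list[int], cache_hit: bool) -> int:
    """Total tokens the server must prefill across one agent session.

    turn_sizes[i] is the number of NEW tokens added to the context at
    turn i (user text, tool output, file contents).
    """
    total = 0
    context = 0
    for new_tokens in turn_sizes:
        context += new_tokens
        # With a perfect prefix-cache hit only the new suffix is computed;
        # with a miss the entire context so far is recomputed.
        total += new_tokens if cache_hit else context
    return total


turns = [2_000] * 20   # hypothetical: 20 turns, 2k new tokens each
print(prefill_tokens(turns, cache_hit=True))    # 40000
print(prefill_tokens(turns, cache_hit=False))   # 420000
```

In this toy session the same 40k tokens of real content cost more than ten times as much prefill when every turn misses the cache, because recomputation grows with the square of the number of turns.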

For this workload, the expensive path is not just the attention kernel inside the GPU. It is the full path around the model: how context is built, how cache is reused, how requests are scheduled, how tool outputs inflate the prompt, and how state moves between components.

agent runtime
  → tokenizer / context builder
  → scheduler
  → prefix cache
  → KV allocation
  → prefill
  → KV transfer / offload
  → decode
  → tool loop

This is why “tokens per second” alone is not enough. A system can generate tokens quickly and still waste money if it repeatedly rebuilds the same context, misses cache reuse, blocks decode behind long prefill, or lets tool output make every next request larger.
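One way to picture cache reuse is block-level prefix matching. The sketch below is a minimal illustration of the general idea, not the implementation used by any particular engine. The block size, the hashing scheme, and the names `block_keys` and `cached_prefix_len` are all assumptions made for this example.

```python
import hashlib

BLOCK = 4  # tokens per cache block; tiny here so the example is easy to follow


def block_keys(tokens: list[int]) -> list[str]:
    """Hash each full block together with everything before it.

    Chaining the previous key into the next hash means a block only
    matches when the entire prefix up to that block is identical.
    """
    keys, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK   # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK]))
        prev = hashlib.sha256((prev + "|" + chunk).encode()).hexdigest()
        keys.append(prev)
    return keys


def cached_prefix_len(tokens: list[int], cache: set[str]) -> int:
    """Number of leading tokens whose KV blocks are already cached."""
    hits = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        hits += 1
    return hits * BLOCK


cache: set[str] = set()
first = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cache.update(block_keys(first))                      # first request fills the cache

second = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]    # same prefix, new suffix
print(cached_prefix_len(second, cache))              # 8

edited = [0, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]    # first token changed
print(cached_prefix_len(edited, cache))              # 0
```

The second case is the one that costs money: a change near the start of the context, such as a reordered system prompt or a timestamp at the top, invalidates every block after it.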

The CPU is not just the host anymore

The old mental model was simple: the CPU received HTTP requests, tokenized text, and launched GPU work, while the GPU did the model compute. That was good enough when the workload looked like chat completion.

The new model is different. The CPU now owns much more of the task:

agent runtime behavior
tool execution and sandbox environments
file I/O and terminal command output
request lifecycle and scheduler state
prefix-cache metadata and KV ownership metadata
context compaction and memory movement
cache/offload coordination and telemetry

NVIDIA’s Vera CPU positioning makes this explicit. NVIDIA describes Vera as purpose-built for agentic AI, including code, tools, data workflows, data movement, memory management, and system control. That is the CPU becoming part of the inference story, not just a box that launches GPU work.

The ratio shift is in cores, not just chips

At the rack level, GB200 NVL72 and Vera Rubin NVL72 both expose the same visible package ratio:

36 CPUs : 72 GPUs
= 1 CPU : 2 GPUs

That sounds unchanged at first glance. But the useful ratio for inference engineers is not only CPUs per GPU. It is orchestration capacity per GPU: how much CPU-side control, memory, tool execution, and scheduling capacity exists for each accelerator. NVIDIA’s Vera Rubin NVL72 specification lists 3,168 custom NVIDIA Olympus CPU cores across the rack, or 88 cores per Vera CPU. That gives:

36 Vera CPUs × 88 cores = 3,168 CPU cores
3,168 CPU cores / 72 GPUs = 44 CPU cores per GPU

That does not mean every GPU “needs” exactly 44 cores. It means NVIDIA is designing the platform around the reality that agentic AI needs a lot of CPU-side work to keep GPU-side model compute fed, coordinated, and useful.
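The arithmetic above is easy to reproduce, and a comparison makes the shift in cores visible. The Vera Rubin figures come from the text. The Grace figure of 72 cores per CPU in GB200 NVL72 is not stated in this post; it is an assumption here and should be checked against NVIDIA's published specification.

```python
def cores_per_gpu(cpus: int, cores_per_cpu: int, gpus: int) -> float:
    """CPU cores available per GPU at rack level."""
    return cpus * cores_per_cpu / gpus


# Figures from the text: 36 Vera CPUs x 88 cores, 72 GPUs.
vera_rubin = cores_per_gpu(cpus=36, cores_per_cpu=88, gpus=72)
print(vera_rubin)   # 44.0

# Assumption, not from the text: 72 cores per Grace CPU in GB200 NVL72.
gb200 = cores_per_gpu(cpus=36, cores_per_cpu=72, gpus=72)
print(gb200)        # 36.0

# Same 1:2 package ratio, but more orchestration capacity per accelerator.
print(f"{vera_rubin / gb200:.2f}x cores per GPU")
```

Core count is only a proxy for orchestration capacity. It says nothing about per-core performance, memory bandwidth, or how much of the CPU the serving stack actually uses.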

Blackwell is the proving ground

Blackwell is NVIDIA’s current GPU platform for this transition. It attacks the GPU-side problems: low-precision inference paths such as FP4/FP8, MoE throughput, long-context prefill, NVLink scale-up bandwidth, dense rack-level GPU memory, and better cost per token for very large models.

The important point is not “Blackwell is fast.” The important point is that Blackwell makes bigger workloads possible, which exposes the next bottleneck. Once the GPU can serve larger models and longer contexts, the system starts failing at boundaries around the GPU: scheduling, cache placement, network movement, CPU orchestration, and application behavior.

Agentic coding workloads stress more than GPU math. They stress prefill/decode interference, cache-aware routing, prefix reuse, KV placement, tool-call latency, CPU scheduling overhead, sub-agent fanout, context compaction, queueing, and P99 jitter.

This is why production teams can buy better GPUs and still have bad AI economics. If the workload shape is wrong, the runtime can spend the savings on recomputation, waiting, queueing, and cache misses.

Dynamo is the distributed-serving bridge

Dynamo, NVIDIA’s distributed inference-serving framework, acknowledges that prefill, decode, and KV state are separate resource classes. Long-context prefill wants throughput. Decode wants low latency. KV transfer wants locality and bandwidth.

In a small setup, one machine may do everything. In a serious serving system, that stops being optimal. The machine that is good at reading huge context may not be the same machine that should stream low-latency output. The place where KV cache is created may not be the place where it should be used. Dynamo matters because it treats these as separate phases that can be scheduled and coordinated.

prefill worker:
  consume long context
  build KV cache

KV transfer layer:
  move attention state to decode side

decode worker:
  stream low-latency tokens

Dynamo is the rack-level version of the same boundary question TokenSpeed raises locally: what belongs together, what should be separated, and which layer should own the state?
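A toy timeline shows why separating prefill from decode can matter for streaming latency. This is a deliberately crude model, not a description of how Dynamo schedules work. It assumes a colocated worker runs each arriving prefill to completion before the next decode step, ignores KV transfer cost on the disaggregated side, and uses hypothetical timings. Real schedulers can soften the colocated case, for example by chunking prefill.

```python
def token_times(num_tokens: int, decode_step_ms: float,
                prefill_at_step: dict[int, float], colocated: bool) -> list[float]:
    """Timestamps (ms) at which one in-flight request streams each token.

    prefill_at_step maps a decode step index to the duration of a prefill
    from some OTHER request that arrives just before that step.
    """
    now, times = 0.0, []
    for step in range(num_tokens):
        if colocated:
            # The shared GPU runs the new prefill first; decode waits.
            now += prefill_at_step.get(step, 0.0)
        now += decode_step_ms
        times.append(now)
    return times


def worst_gap(times: list[float]) -> float:
    """Largest pause between two consecutive streamed tokens."""
    return max(b - a for a, b in zip(times, times[1:]))


arrivals = {10: 400.0, 30: 900.0}   # hypothetical long-context prefills
colo = token_times(50, 20.0, arrivals, colocated=True)
split = token_times(50, 20.0, arrivals, colocated=False)

print(worst_gap(colo))    # 920.0
print(worst_gap(split))   # 20.0
```

Average tokens per second barely moves between the two cases, but the worst pause the user sees differs by a factor of 46 in this example. That gap is what P99 latency captures and throughput hides.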

TokenSpeed is the local runtime signal

TokenSpeed is interesting because it makes a clear boundary-placement claim: keep Python useful for model definition and research velocity, but move hot request state, lifecycle transitions, KV ownership, page/block metadata, and scheduler control into systems code.

In practical terms, TokenSpeed is saying: do not put every hot transition in the most flexible layer just because it is easy to program there. Python is excellent for research, model wiring, and user-facing ergonomics. But once a serving system is carrying many concurrent requests, the hot path needs predictable memory ownership, predictable scheduling, and lower jitter.

Python remains useful for:
  model definition
  research velocity
  execution ergonomics
  user-facing control surfaces

Systems code should own:
  request lifecycle
  KV ownership
  page/block metadata
  scheduler transitions
  overlap timing
  hot-path dispatch

That does not mean Python is bad. It means Python should not own every hot runtime edge. The question is where flexibility stops helping and starts becoming boundary tax.
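To show what "owning request lifecycle and KV ownership" means in practice, here is a minimal sketch of an explicit state machine tied to block ownership. It is written in Python only for readability; the argument of this section is that this kind of state belongs in systems code. This is not TokenSpeed's design, and every name in it (`State`, `BlockPool`, `Request`, `advance`) is hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    QUEUED = auto()
    PREFILL = auto()
    DECODE = auto()
    FINISHED = auto()


# The only legal lifecycle transitions in this sketch.
ALLOWED = {
    State.QUEUED: {State.PREFILL},
    State.PREFILL: {State.DECODE, State.FINISHED},
    State.DECODE: {State.FINISHED},
    State.FINISHED: set(),
}


class BlockPool:
    """Fixed pool of KV blocks; every block has at most one owner."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.owner: dict[int, str] = {}

    def allocate(self, request_id: str, count: int) -> None:
        if count > len(self.free):
            raise MemoryError("not enough free KV blocks")
        for _ in range(count):
            self.owner[self.free.pop()] = request_id

    def release(self, request_id: str) -> None:
        owned = [b for b, r in self.owner.items() if r == request_id]
        for b in owned:
            del self.owner[b]
            self.free.append(b)


class Request:
    def __init__(self, request_id: str, pool: BlockPool) -> None:
        self.id, self.pool, self.state = request_id, pool, State.QUEUED

    def advance(self, new_state: State, blocks_needed: int = 0) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}")
        if blocks_needed:
            self.pool.allocate(self.id, blocks_needed)
        if new_state is State.FINISHED:
            self.pool.release(self.id)   # ownership ends with the lifecycle
        self.state = new_state


pool = BlockPool(num_blocks=8)
req = Request("r1", pool)
req.advance(State.PREFILL, blocks_needed=4)
req.advance(State.DECODE, blocks_needed=1)
print(len(pool.free))    # 3
req.advance(State.FINISHED)
print(len(pool.free))    # 8
```

The point of the sketch is the shape, not the language: transitions are enumerated, illegal ones fail loudly, and blocks cannot outlive the request that owns them. Those are the properties that become hard to guarantee when the same state is spread across flexible, dynamically typed layers.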

Python 3.14 changes the baseline, not the boundary question

Python 3.14 and free-threaded CPython raise the ceiling for Python-side orchestration. But they do not settle the inference-serving boundary question.

The GIL was never the only issue. The hard problems also include object churn, allocator pressure, queueing behavior, request lifecycle correctness, KV ownership semantics, page/block metadata, P99 scheduler jitter, and kernel dispatch timing.
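Jitter of this kind is measurable without any GPU. The sketch below times a stand-in for per-step scheduler bookkeeping and reports P50 and P99. The function `fake_scheduler_tick` is hypothetical and only imitates the object-churn pattern; the numbers it produces depend entirely on the machine and interpreter, so no expected output is shown.

```python
import math
import time


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range 0 < pct <= 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def fake_scheduler_tick(num_requests: int) -> None:
    # Stand-in for per-step bookkeeping: builds many short-lived objects,
    # then reorders them, as a scheduler might when picking the next batch.
    batch = [{"id": i, "tokens": [i] * 8} for i in range(num_requests)]
    batch.sort(key=lambda r: r["id"], reverse=True)


samples_us = []
for _ in range(2000):
    start = time.perf_counter_ns()
    fake_scheduler_tick(256)
    samples_us.append((time.perf_counter_ns() - start) / 1_000)

p50 = percentile(samples_us, 50)
p99 = percentile(samples_us, 99)
print(f"p50={p50:.1f}us  p99={p99:.1f}us")
```

If P99 sits well above P50 for work that is identical on every iteration, the spread comes from the runtime (allocation, garbage collection, OS scheduling) rather than from the workload. Removing the GIL does not by itself remove that spread.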

This matters because a lot of inference debates collapse into “Python versus C++.” That is too simple. The real question is which state needs dynamic programmability and which state needs hard ownership, tight timing, and predictable memory behavior.

Vera Rubin is the hardware answer

NVIDIA’s agentic systems article frames Vera Rubin as an extreme co-design platform: Vera CPUs for orchestration and tool execution, Rubin GPUs for model compute, NVLink and networking for fabric, and Dynamo-style software for distributed serving.

In other words, NVIDIA is not just making a faster GPU. It is describing an entire system: CPU, GPU, memory, network, cache movement, and serving software designed together because agentic workloads stress all of those layers at once.

Rubin GPU:
  long-context model compute
  prefill
  decode
  attention / MLA
  MoE expert compute
  HBM-resident KV

Vera CPU:
  tool execution
  agent environments
  control-heavy software
  cache/offload coordination
  memory movement
  orchestration

Fabric:
  KV movement
  expert traffic
  storage/network coordination
  distributed serving

This is why Vera Rubin belongs in the same conversation as TokenSpeed and Dynamo. TokenSpeed points at local runtime ownership. Dynamo points at distributed serving ownership. Vera Rubin points at hardware platform ownership. They are different layers of the same problem: state has to live in the right place while the task is running.

The real concept: boundary tax

At Touchdown Labs, we call this boundary tax: the cost paid when runtime state crosses the wrong boundary or lives at the wrong layer.

CPU waiting on GPU
GPU waiting on CPU
prefill blocking decode
decode starved by long-context prefill
prefix-cache miss from bad context shape
KV recomputation from bad routing
KV offload stall
scheduler queue delay
tool output inflating context
Python/C++ boundary overhead
fabric transfer stall
agent loop doing unnecessary work

Boundary tax is not one metric from one dashboard. It is the sum of small penalties across the path of a task. Individually, each penalty can look acceptable. Together, they explain why the bill grows, why P99 latency gets worse, and why a product feels slower than the raw model benchmark promised.
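The "sum of small penalties" idea can be sketched as a simple aggregation over one task's trace. The record type `Penalty`, the function `boundary_tax`, and all timings below are hypothetical; in a real system the entries would come from tracing spans across the agent runtime and the serving stack.

```python
from dataclasses import dataclass


@dataclass
class Penalty:
    boundary: str     # which handoff the time was lost at
    seconds: float    # time lost to this one occurrence


def boundary_tax(penalties: list[Penalty]) -> dict[str, float]:
    """Total time lost per boundary for one task, largest first."""
    totals: dict[str, float] = {}
    for p in penalties:
        totals[p.boundary] = totals.get(p.boundary, 0.0) + p.seconds
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical trace for a single agent task.
trace = [
    Penalty("prefill blocking decode", 0.5),
    Penalty("prefix-cache miss", 1.5),
    Penalty("scheduler queue delay", 0.25),
    Penalty("prefix-cache miss", 2.0),
    Penalty("KV offload stall", 0.75),
]

tax = boundary_tax(trace)
print(sum(tax.values()))   # 5.0
print(next(iter(tax)))     # prefix-cache miss
```

No single entry in this trace looks alarming, yet the task lost five seconds in total and most of it at one boundary. Ranking by boundary rather than by component is what turns a pile of metrics into a place to start.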

Traditional inference metrics expose pieces of this, but not the whole task path. Tokens/sec tells you how fast the engine moved tokens. TTFT (time to first token) tells you when the first token arrived. GPU utilization tells you whether the accelerator was busy.

But an agentic coding workload needs a higher-level metric: cost per useful AI task.

What “cost per useful AI task” means

Tokens are a billing unit. GPUs are a hardware unit. Requests are an API unit. But users pay for outcomes: a bug fixed, a support ticket resolved, a report summarized, a code review completed, a workflow automated.

Cost per useful AI task asks: how much did the whole system spend to finish the thing the user actually wanted? That includes model tokens, retries, failed tool calls, cache misses, queueing, GPU time, CPU orchestration, and provider routing.
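As a worked example, the metric can be sketched as total spend divided by completed tasks, so that failed and abandoned runs raise the cost of the successful ones. The `TaskRun` fields and the dollar amounts are hypothetical, and how infrastructure cost is attributed to a run is left as an assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    completed: bool     # did the user get the outcome they wanted?
    model_cost: float   # billed tokens, in dollars
    infra_cost: float   # GPU, CPU, and tool time attributed to the run


def cost_per_useful_task(runs: list[TaskRun]) -> float:
    """All spend, including failed runs, divided by completed tasks."""
    total = sum(r.model_cost + r.infra_cost for r in runs)
    useful = sum(1 for r in runs if r.completed)
    if useful == 0:
        raise ValueError("no completed tasks to attribute cost to")
    return total / useful


# Hypothetical runs: two succeed, two burn money without an outcome.
runs = [
    TaskRun(True, 0.5, 0.25),
    TaskRun(False, 0.75, 0.25),
    TaskRun(True, 0.25, 0.25),
    TaskRun(False, 0.5, 0.5),
]

print(cost_per_useful_task(runs))   # 1.625
```

Looking only at the successful runs would give an average of 0.625 dollars per task. Counting the failures as part of the price of success gives 1.625, which is the number the business actually pays.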

What Touchdown Labs is building

Touchdown Labs does not need to predict whether vLLM, SGLang, TensorRT-LLM, TokenSpeed, or Dynamo wins every workload. The useful question is:

Where did this agent task pay the boundary tax?

The answer might be terminal output polluting context, prompt cache misses, prefill blocking decode, scheduler jitter hitting P99, KV cache spill, the wrong model handling an easy step, CPU orchestration starving GPU work, or provider routing burning margin.

This is the practical reason to care about all of this. Most teams do not need to become runtime researchers. They need to know why their AI bill is rising, why latency is unstable, why a self-hosted deployment is underperforming, or why switching models did not fix the product-level economics.

For agentic inference, the unit is not the token. It is not the request. It is not the GPU. The unit is the completed task.

TokenSpeed is the runtime signal. Dynamo is the distributed-serving signal. Blackwell is the current proving ground. Vera Rubin is the hardware roadmap. Python 3.14 changes the orchestration baseline. Touchdown measures the cost when all of those boundaries meet a real workload.

What a team should take away

If you run AI workloads today, the immediate takeaway is not “go rewrite everything.” The takeaway is to measure the task path before buying a bigger machine or switching providers. Start with these questions:

How much of the bill comes from useful completed work versus retries and dead-end calls?
How often are we rebuilding context that should have been reused?
Are long prefill requests blocking latency-sensitive decode?
Are tool outputs making every next request larger than it needs to be?
Is our provider/model/router choice matched to the difficulty of each step?
Are we measuring cache hit rate, TTFT, TPOT (time per output token), queueing, and task completion together?
If self-hosting, do we know whether the bottleneck is GPU math, CPU orchestration, memory, cache, or network?
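Two of the metrics in these questions can be derived from nothing more than client-side timestamps. The sketch below assumes one common definition: TTFT is the delay until the first streamed token, and TPOT (time per output token) is the mean gap between tokens after the first. Other definitions exist, and the timestamps used here are hypothetical.

```python
def ttft_and_tpot(request_sent: float,
                  token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    request_sent is when the request left the client; token_times are
    the arrival times of each streamed output token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were streamed")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # Mean inter-token gap, excluding the wait for the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical timestamps in seconds.
ttft, tpot = ttft_and_tpot(10.0, [10.5, 10.75, 11.0, 11.25, 11.5])
print(ttft)   # 0.5
print(tpot)   # 0.25
```

Recording these per request, next to cache hit rate and whether the surrounding task completed, is what lets a team see them together instead of on separate dashboards.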

These questions are the bridge from infrastructure theory to money. They turn TokenSpeed, Blackwell, Dynamo, and Vera Rubin from buzzwords into a practical map of where the system may be paying unnecessary boundary tax.

Closing

The next phase of inference will not be won by asking which GPU is fastest in isolation. It will be won by asking where the work belongs.

Which state belongs on CPU?
Which state belongs on GPU?
Which state belongs in HBM?
Which state belongs in a cache tier?
Which state belongs in Python?
Which state belongs in C++ or Rust?
Which state belongs in the distributed serving layer?
Which state should the agent runtime expose?

Agentic inference is forcing a new CPU/GPU contract. TokenSpeed shows it in the runtime. Dynamo shows it in distributed serving. Vera Rubin shows it in hardware. Touchdown measures the cost of getting that contract wrong.