---
title: Externalized State: Cut Your Coding-Agent Bill 30% to 70% Without Touching a Kernel
subtitle: A drop-in SKILL.md and action guide for engineers using Claude Code, Codex, Antigravity CLI, Hermes, or OpenClaw
date: 2026-05-23
authors: Touchdown Labs
license: MIT
related: https://touchdown-labs.com/blogs/openenv-hackathon-recap
---

# Externalized State for Coding Agents: Action Guide

## The CEO TL;DR: The Blind Agent Tax

**You do not need to be a kernel engineer. You can just store everything in a markdown file, create a table of contents, and optimize your Claude Code or Codex bill so you do not max it out. It is that simple.**

If your team is running AI developers or research agents, you are likely burning thousands of dollars a month on unnecessary LLM bills. This is not because your engineers are writing bad code, and it is not because the models are inefficient. 

**It is because you are paying a Blind Agent Tax.**

Every single time a coding agent (like Claude Code, Codex, or Hermes) boots up to perform a task, it has complete amnesia. It does not know where your configuration folders, database clients, router endpoints, or test runners live. 

To orient itself, the agent spends its first 5 to 10 iterations running shell commands, searching directories, and re-reading configurations. Because these thousands of "orientation tokens" are loaded at the very start of the session, they reside in the active prompt. For the rest of the task, every single subsequent tool call must repeatedly carry this bloated orientation context.

**You are paying a recurring tax, roughly 80% of the bill in the worked example below, to have the model rediscover what it already worked out a dozen times.**

Externalized state is a new, first-class category of inference optimization. Instead of paying specialized engineers to write complex GPU memory hacks, you write down the agent's workspace orientation into a lightweight, budgeted cheat sheet on local disk (`CONTEXT_MAP.md`). 

The agent loads this 500-word cheat sheet instantly, bypasses the discovery loop entirely, and goes straight to work. 

*   **Zero training required.** No model fine-tuning or custom adapters.
*   **Zero GPU specialists.** Bypasses the GPU and serving stack entirely.
*   **30% to 70% Cost Reduction.** Shipped in 30 minutes by any developer or non-technical manager.

---

## The Convergence: Prompt, Context, and Harness Engineering

To understand why this is a first-class paradigm, we have to look at how modern agent workflows have evolved:

1. **Prompt Engineering**: Focused on phrasing. We tried to write the perfect English instructions to make the model behave. It was ad-hoc and unscalable.
2. **Context Engineering**: Focused on token budgeting. We tried to compress, trim, and prioritize exactly which raw files fit into the model's active window.
3. **Harness Engineering**: Focused on the environment. We built tools, shell execution sandboxes, and file-reading APIs surrounding the agent.

**Externalized state is where prompt engineering, context engineering, and harness engineering converge.** 

Instead of treating prompt instructions as static text, context as a raw file dump, and harnesses as blind tool execution, the PEEK framework represents a **research-driven approach** that tries to **formalize and standardize** this intersection. We treat the agent's orientation context as a budgeted, evictable database cache map managed by standard text-orchestration tools. 

By formalizing these ad-hoc prompt and context tricks into a repeatable, standard cache schema on disk, you build a clean, auditable engineering discipline. Let us walk you through how this works step-by-step.

---

## What is a Cache Schema? The Legend & Table of Contents

At its core, a **Cache Schema** under the PEEK framework is not a complex database system or a proprietary binary index. 

**It is simply a Legend, a Table of Contents (TOC), or a Map Index of your codebase stored as a lightweight markdown file.**

In agentic workflows, the single largest token waste and cost driver is not the final code the agent generates. It is the intermediate discovery tools it runs. When a blind agent is asked to locate a database configuration or an API path, it runs a recursive `grep` or directory-listing commands such as `find . -name "*.py"` or `ls -R`.

These operations are **the most expensive calls in the entire agentic loop**. They flood the prompt context with thousands of lines of irrelevant file lists, unrelated code paths, and diagnostic outputs. Since these tokens live in the context window, they bloat the GPU's memory for the entire task.

The PEEK **Context Map** acts as a pre-built Legend. It serves as a constant-sized, budgeted Table of Contents that answers the "where is the code" and "how do I run tests" questions immediately:

```
                  THE BLIND AGENT                       THE PEEK CACHED AGENT
             (Without a Cache Schema)                  (With a Cache Schema)
             
               "Where is Stripe key?"                    "Where is Stripe key?"
                         │                                         │
               ┌─────────┴─────────┐                     ┌─────────┴─────────┐
               ▼                   ▼                     ▼                   ▼
          [Tool: grep]      [Tool: find .]        [CONTEXT_MAP.md]    Bypass discovery
               │                   │                     │            tools completely
         (2,500 tokens)      (3,000 tokens)        (500 tokens)
               │                   │                     │
               └─────────┬─────────┘                     └─────────┬─────────┘
                         ▼                                         ▼
                 GPU KV Thrashing!                     Instant direct file access
```

By reading this 500-token Table of Contents first, the agent bypasses the expensive `grep` and directory indexing loops completely, navigating directly to the correct file in a single turn.
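
To make the "Table of Contents" idea concrete, here is a minimal sketch of how a harness could treat the map as a lookup table. It assumes the map uses `## ` headings for sections and `- ` bullets for entries, as in the starter template later in this guide. The function names `parse_context_map` and `lookup` are hypothetical and not part of any agent's API.

```python
def parse_context_map(text: str) -> dict[str, list[str]]:
    """Group the bullet entries of a context map under their '## ' headings."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].strip())
    return sections


def lookup(sections: dict[str, list[str]], keyword: str) -> list[str]:
    """Return every entry that mentions the keyword, case-insensitively."""
    kw = keyword.lower()
    return [e for entries in sections.values() for e in entries if kw in e.lower()]


# Illustrative map content; the entries mirror the starter template.
SAMPLE_MAP = """\
# Context Map: demo
## Repo layout
- Tests: `tests/`
- Configs: `configs/<env>.yaml`
## Verified patterns
- API routes use FastAPI with `src/api/router.py` as the entrypoint
"""

sections = parse_context_map(SAMPLE_MAP)
print(lookup(sections, "routes"))
```

In practice the agent reads the markdown directly. A parser like this is mainly useful for tooling around the map, such as budget checks or linting.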

---

## Visualizing the Physics: Blind Tax vs. Orientation Cache

```
┌──────────────────────────────────────────┐    ┌──────────────────────────────────────────┐
│      THE BLIND AGENT TAX (Amnesia)       │    │     THE ORIENTATION CACHE (PEEK)         │
│  Repeated discovery loops thrash HBM     │    │  Budgeted Context Map bypasses loops     │
├──────────────────────────────────────────┤    ├──────────────────────────────────────────┤
│ - Agent boots blind with zero memory     │    │ - Agent boots with CONTEXT_MAP.md        │
│ - Runs "find", "ls", and reads files     │    │ - Reads route & test locations instantly │
│ - Orientation bloat = 6,800 prompt tokens│    │ - Bypasses directory search completely   │
│ - Every subsequent turn carries these    │    │ - Prompt remains light & cheap           │
├──────────────────────────────────────────┤    ├──────────────────────────────────────────┤
│ Total Cost: $1.85 (15 turns)             │    │ Total Cost: $0.35 (3 turns) [81% Saved]  │
└──────────────────────────────────────────┘    └──────────────────────────────────────────┘
```

---

## The First-Principles Problem: Orientation Bloat and GPU Thrashing

To understand the exact systems physics, let us compare how an agent handles a common task: adding a Stripe billing webhook endpoint to a backend application.

### The Physics Without PEEK (The Cold Start)

You prompt: *"Add a Stripe webhook to handle subscription updates."*

1.  **Turn 1 (The Goal)**: The user initiates the task.
2.  **Turn 2 (Directory Search)**: The agent has no state. It runs `find . -name "*.py"` to locate routes. This returns a massive list of 150 Python files, bloating the prompt by **2,500 tokens**.
3.  **Turn 3 (Route Analysis)**: The agent locates `src/api/router.py` and reads it to see how endpoints are defined, adding another **1,200 tokens**.
4.  **Turn 4 (Test Discovery)**: The agent wants to verify existing tests. It runs the full test suite, generating build and test logs that add another **2,100 tokens**.
5.  **Turn 5 (Config Discovery)**: The agent searches for where API keys are loaded, reading `configs/dev.json` and `src/core/config.py` (**1,000 tokens**).

*   **The Waste**: Before writing a single line of code, **6,800 tokens** of directory paths, build logs, and config schemas are loaded into the prompt context. For the next 10 turns of writing and debugging the webhook, those 6,800 resident tokens are repeatedly sent, thrashing the GPU's KV cache.
*   **Total Task Cost**: **$1.85** in API bills.
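
The token arithmetic behind the waste figure can be checked in a few lines. This is a sketch using only the numbers from the walkthrough above. It counts tokens, not dollars, because the dollar figure depends on your provider's pricing and on any prompt-cache discounts, which are not modeled here.

```python
# Token counts from the cold-start walkthrough (turns 2 to 5).
ORIENTATION_TOKENS = {
    "directory search": 2_500,
    "route analysis": 1_200,
    "test discovery": 2_100,
    "config discovery": 1_000,
}
FOLLOW_UP_TURNS = 10  # turns spent writing and debugging after orientation

# Tokens that stay resident in the prompt once orientation is done.
resident = sum(ORIENTATION_TOKENS.values())

# Each follow-up turn resends the full resident context as input.
resent = resident * FOLLOW_UP_TURNS

print(f"resident orientation tokens: {resident:,}")
print(f"orientation tokens resent over {FOLLOW_UP_TURNS} turns: {resent:,}")
```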

---

### The Physics With PEEK (The Orientation Cache)

Instead of the cold start, we use the PEEK semantic caching loop:

1.  **The Distiller (Extraction)**: After the webhook is successfully added, the Distiller observes the session and extracts three transferable facts:
    *   *Fact 1*: API routes are registered via the FastAPI router in `src/api/router.py`.
    *   *Fact 2*: Billing tests run using `pytest tests/billing/`.
    *   *Fact 3*: Stripe secrets are loaded from `configs/stripe.yaml`.
2.  **The Cartographer (Writing)**: The Cartographer writes these facts into the persistent `CONTEXT_MAP.md` file in the project.
3.  **The Evictor (Budgeting)**: The Evictor checks the file's token length. Since it is only 500 tokens (well below the 2,000-token budget), it saves the file to disk.
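
The Cartographer's write step can be sketched as plain string manipulation. This is a minimal illustration and not the PEEK implementation: the helper name `add_entry` is hypothetical, and it assumes sections are marked with `## ` headings and entries are `- ` bullets.

```python
def add_entry(map_text: str, section: str, entry: str) -> str:
    """Append '- entry' to the section whose heading starts with '## <section>'."""
    bullet = f"- {entry}"
    lines = map_text.splitlines()
    if bullet in lines:  # already recorded, so the write is idempotent
        return map_text

    start = next(
        (i for i, line in enumerate(lines) if line.startswith(f"## {section}")),
        None,
    )
    if start is None:
        # Section does not exist yet: create it at the end of the map.
        return "\n".join(lines + ["", f"## {section}", bullet]) + "\n"

    # Find the last non-blank line before the next '## ' heading.
    last_content = start
    i = start + 1
    while i < len(lines) and not lines[i].startswith("## "):
        if lines[i].strip():
            last_content = i
        i += 1
    lines.insert(last_content + 1, bullet)
    return "\n".join(lines) + "\n"


MAP = (
    "# Context Map: demo\n\n"
    "## Repo layout\n- Tests: `tests/`\n\n"
    "## Known landmines\n- `legacy/` is read-only.\n"
)
updated = add_entry(MAP, "Repo layout", "Stripe secrets: `configs/stripe.yaml`")
print(updated)
```

Matching on the heading prefix lets a heading such as `## Repo layout (last updated: <date>)` still be found.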

When you boot the agent for the next task (e.g. *"Add a subscription refund endpoint"*), the agent automatically reads `CONTEXT_MAP.md` first:

1.  **Turn 1 (The Goal)**: You prompt: *"Add a subscription refund endpoint."*
2.  **Turn 2 (Direct Action)**: The agent reads `CONTEXT_MAP.md`. It learns that routes live in `src/api/router.py`, secrets live in `configs/stripe.yaml`, and tests are run via `pytest tests/billing/`. 
3.  **Turn 3 (Execution)**: The agent bypasses the directory search entirely. It directly modifies `src/api/router.py`, writes tests in `tests/billing/test_refund.py`, and runs `pytest tests/billing/`. 

*   **Result**: The task is completed in 3 turns instead of 15. The prompt context never bloats.
*   **Total Task Cost**: **$0.35** (an **81% cost reduction**).

---

## How It Works: The PEEK Semantic Cache Loop

The MIT CSAIL PEEK paper (Zhuohan Gu, Qizheng Zhang, Omar Khattab, and Samuel Madden, arXiv:2605.19932, May 2026) proves that we can manage orientation like a hardware database cache, but at the purely semantic text layer:

```
┌─────────────────────────────────────────────────────────┐
│ 1. DISTILLER (The Text Parser)                          │
│    Watches the agent's tool calls and file edits.       │
│    Extracts transferable, repo-specific facts.          │
│    "Tests live in tests/. Configs in configs/."         │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│ 2. CARTOGRAPHER (The File Writer)                       │
│    Writes these distilled text strings into a simple    │
│    persistent markdown file (CONTEXT_MAP.md).           │
│    The map is loaded into the agent prompt next run.    │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│ 3. EVICTOR (The Budget Enforcer)                        │
│    Tracks a hard token or character budget.             │
│    If CONTEXT_MAP.md grows past it, prunes the oldest   │
│    or lowest-priority text blocks to keep it light.     │
└─────────────────────────────────────────────────────────┘
```

1.  **The Distiller**: A lightweight workspace observer watches the agent's actions and pulls out the knowledge that is *transferable* to future tasks.
2.  **The Cartographer**: Turns the distilled knowledge into structured edits to the persistent `CONTEXT_MAP.md` file. The map sits in the agent's system prompt across tasks, so the next agent already knows what the previous one worked out.
3.  **The Evictor**: Manages a hard token budget (e.g., 2,000 tokens). When the context map grows past its budget, the Evictor drops the oldest or lowest-priority entries via an LRU-like policy, keeping the cache extremely light.
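
As an illustration of the Evictor's budget policy, here is a minimal sketch under stated assumptions. Tokens are estimated at roughly four characters each, which is a crude heuristic and not a real tokenizer. The `Entry` fields `last_used` and `priority` are hypothetical bookkeeping, not something the paper or any agent prescribes.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # rough heuristic (assumption), not a real tokenizer


@dataclass
class Entry:
    text: str
    last_used: int     # task counter when the entry was last useful
    priority: int = 0  # higher values are kept longer


def estimate_tokens(text: str) -> int:
    """Ceiling of len(text) / CHARS_PER_TOKEN."""
    return -(-len(text) // CHARS_PER_TOKEN)


def evict(entries: list[Entry], budget_tokens: int) -> list[Entry]:
    """Drop lowest-priority, least-recently-used entries until the map fits."""
    # Eviction order: lowest priority first, then least recently used.
    queue = sorted(entries, key=lambda e: (e.priority, e.last_used))
    total = sum(estimate_tokens(e.text) for e in queue)
    while queue and total > budget_tokens:
        victim = queue.pop(0)
        total -= estimate_tokens(victim.text)
    # Return survivors in their original order.
    survivors = {id(e) for e in queue}
    return [e for e in entries if id(e) in survivors]


entries = [
    Entry("Tests live in tests/", last_used=5, priority=1),
    Entry("Chose Postgres over SQLite", last_used=1),
    Entry("Configs in configs/", last_used=4),
]
kept = evict(entries, budget_tokens=10)
print([e.text for e in kept])
```

For production use, swap `estimate_tokens` for the tokenizer that matches your model so the budget reflects real prompt size.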

---

## Why This is Different: Workload Optimization vs. GPU Hacks

It is critical to distinguish this top-down approach from bottom-up GPU-layer optimizations (like LMCache or vLLM prefix caching):

*   **Bottom-Up (Standard serving-layer work)**: Acts on token tensors. Requires custom CUDA kernels, quantization pipelines, and physical memory tiering servers in the data center. This requires months of specialized systems engineering and breaks with every framework update.
*   **Top-Down (Externalized State / PEEK)**: Bypasses the GPU and the KV cache stack entirely. It is a simple markdown file managed by basic Python or TypeScript string orchestration outside the serving engine. It requires zero CUDA specialists, zero driver updates, and zero framework migrations. Any developer or non-technical product manager can configure and deploy it in an afternoon.

---

## Actionable Quickstart: Pick Your Platform

### 1. Claude Code (Anthropic)

**Setup:** Claude Code utilizes `CLAUDE.md` (project-level memory that auto-loads) and `.claude/skills/` (agent skills with progressive disclosure). 

**Step 1: Create the context map file.** In your repository root:
```bash
mkdir -p .claude/orientation
touch .claude/orientation/context-map.md
```

**Step 2: Drop in the starter context map.** Copy this template into `.claude/orientation/context-map.md`:
```markdown
# Context Map: <project name>
<!-- Token budget: 2000 tokens. Distilled orientation knowledge for agents.
     Updated by the cartographer skill. Do not hand-edit unless you know why. -->

## Repo layout (last updated: <date>)
- Tests: `tests/`
- Source: `src/`
- Configs: `configs/<env>.yaml`
- Evidence files: `*.evidence.json`
- Build output: `dist/` (gitignored)

## Verified patterns (use these before reinventing)
- Database calls go through `src/db/client.py` (never direct psycopg2)
- API routes use FastAPI with `src/api/router.py` as the entrypoint
- Logging: `from src.observability import get_logger`
- Test fixtures live in `tests/conftest.py`

## Known landmines
- `legacy/` is read-only. Do not edit.
- `scripts/migrate.py` requires `--dry-run` first on prod data.
- The `auth/` module has its own test runner (`pytest tests/auth -m auth`).

## Recent decisions (most recent first)
- 2026-05-22: chose Postgres over SQLite for the queue (see ADR-014)
- 2026-05-18: switched from celery to ARQ for async jobs

## Open invariants
- No new dependencies without checking `pyproject.toml` first.
- All new endpoints need a corresponding integration test in `tests/api/`.
```
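
A map is only useful while its entries stay true. The following sketch checks that literal paths mentioned in backticks still exist on disk. The helper names are hypothetical, and the path filter is a heuristic that deliberately skips globs, `<placeholders>`, CLI flags, and multi-word commands.

```python
import re
import tempfile
from pathlib import Path

BACKTICK = re.compile(r"`([^`]+)`")
PLAIN_PATH = re.compile(r"^[\w./-]+$")  # rejects spaces, globs, and <placeholders>


def referenced_paths(map_text: str) -> list[str]:
    """Return backticked spans that look like literal relative paths."""
    return [
        span
        for span in BACKTICK.findall(map_text)
        if PLAIN_PATH.match(span)
        and ("/" in span or "." in span)  # skip bare words
        and not span.startswith("-")      # skip CLI flags such as --dry-run
    ]


def missing_paths(map_text: str, repo_root: Path) -> list[str]:
    """Return referenced paths that do not exist under repo_root."""
    return [p for p in referenced_paths(map_text) if not (repo_root / p).exists()]


SAMPLE_MAP = """\
- Tests: `tests/`
- Configs: `configs/<env>.yaml`
- Database calls go through `src/db/client.py` (never direct psycopg2)
- API routes use FastAPI with `src/api/router.py` as the entrypoint
"""

# Demo against a throwaway directory that lacks src/db/client.py.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tests").mkdir()
    (root / "src" / "api").mkdir(parents=True)
    (root / "src" / "api" / "router.py").touch()
    stale = missing_paths(SAMPLE_MAP, root)

print(stale)
```

Running a check like this in CI or before each commit to the map is one way to catch entries that have drifted from the repository.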

**Step 3: Wire it into your `CLAUDE.md`.** Add this line at the top:
```markdown
# CLAUDE.md
@.claude/orientation/context-map.md

<!-- rest of your CLAUDE.md below -->
```
The `@` import pulls the context map into Claude's context at session start. Claude reads the map first, so it can skip most discovery commands instead of spending tokens on them.

**Step 4: Set up the cartographer skill.** Create `.claude/skills/cartographer/SKILL.md`:
```markdown
---
name: cartographer
description: Update .claude/orientation/context-map.md with new orientation knowledge the agent learned this session. Use at the END of any non-trivial task where you discovered a repo convention, a landmine, or a pattern worth remembering.
---

# Cartographer

When you have finished a task and noticed something worth remembering for next time:

1. Read `.claude/orientation/context-map.md`
2. Identify the section where the new knowledge belongs (layout / patterns / landmines / decisions / invariants)
3. Write a one-line entry: be specific, include file paths
4. If the map is over 2000 tokens after the edit, drop the oldest entry from "Recent decisions" first
5. Commit with `chore(orientation): <one-line summary>`

Do NOT write generic advice. Only write knowledge that is specific to this repo, transferable across future tasks, and verifiable.
```

---

### 2. Codex / Codex CLI (OpenAI)

**Step 1: Create the context map.** Save the template above to `.codex/orientation.md`.

**Step 2: Pull it into `AGENTS.md`** (the project-level memory file Codex auto-loads). Include the orientation text inline under a `## Orientation` heading:
```markdown
# AGENTS.md

## Orientation
<!-- Begin auto-managed context map. Do not hand-edit; use the cartographer flow. -->

<!-- paste .codex/orientation.md content here -->

<!-- End auto-managed context map. -->

## Project rules
...
```
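
Pasting by hand invites drift between the two files. As a sketch, a small script could refresh the block between the two marker comments instead. The function name `sync_orientation` is hypothetical, and the marker strings are assumed to match the `AGENTS.md` snippet above exactly.

```python
BEGIN = "<!-- Begin auto-managed context map. Do not hand-edit; use the cartographer flow. -->"
END = "<!-- End auto-managed context map. -->"


def sync_orientation(agents_md: str, orientation: str) -> str:
    """Replace everything between the BEGIN and END markers with the orientation text."""
    start = agents_md.find(BEGIN)
    end = agents_md.find(END)
    if start == -1 or end == -1 or end < start:
        raise ValueError("AGENTS.md is missing the auto-managed markers")
    head = agents_md[: start + len(BEGIN)]  # keep the BEGIN marker
    tail = agents_md[end:]                  # keep the END marker and the rest
    return f"{head}\n\n{orientation.strip()}\n\n{tail}"


agents = (
    "# AGENTS.md\n\n## Orientation\n"
    f"{BEGIN}\n\nSTALE ENTRY\n\n{END}\n\n## Project rules\n"
)
result = sync_orientation(agents, "- Tests: `tests/`\n")
print(result)

# Usage against real files (paths from the steps above):
# from pathlib import Path
# path = Path("AGENTS.md")
# path.write_text(sync_orientation(path.read_text(), Path(".codex/orientation.md").read_text()))
```

Because the markers are preserved, running the sync twice with the same input produces the same file.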

**Step 3: Wire the cartographer slash command.** Create `.codex/commands/cartographer.md`:
```markdown
# /cartographer

Update the Orientation section of AGENTS.md with a new entry distilled from this session.
Format: one line, file-path-specific, verifiable.
```

---

### 3. Antigravity CLI (Google)

**Step 1: Create the context map.** Save to `.antigravity/orientation.md`.

**Step 2: Reference it from `agent_rules.md`** at the project root:
```markdown
# agent_rules.md

## Always-loaded context
Read `.antigravity/orientation.md` before every task. Treat its contents as ground truth about repo layout, patterns, landmines, and recent decisions.

## Update protocol
After any non-trivial task, evaluate whether the orientation file needs an update. If yes, append a one-line entry to the appropriate section. Do not exceed 2000 tokens total: drop the oldest entry from "Recent decisions" if the file would exceed budget.
```

---

## Measuring Success

Pick at least three of the metrics below, baseline them for a week before turning the orientation cache on, and compare:

| Metric | Before | After | Target |
| --- | --- | --- | --- |
| Median tokens per task | _____ | _____ | 30% to 50% drop |
| Median iterations per task | _____ | _____ | 25% to 40% drop |
| Cost per successful task | _____ | _____ | 30% to 70% drop |
| p95 task latency | _____ | _____ | 20% to 40% drop |
| Task success rate | _____ | _____ | flat or up |

**The direct measurement check:**
*   **Claude Code**: Use the built-in usage reporting (`/cost` slash command per session).
*   **Codex**: Check the OpenAI Usage dashboard tagged by request headers.
*   **Antigravity CLI**: Built-in usage logs at `~/.antigravity/logs/`.
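
Once you have before and after numbers, the comparison is simple arithmetic. The sketch below uses made-up per-task token counts purely as placeholders; replace them with values exported from your own usage logs. The last line reuses the $1.85 and $0.35 figures from the worked example earlier in this guide.

```python
from statistics import median


def pct_drop(before: float, after: float) -> float:
    """Percentage reduction from before to after (positive means improvement)."""
    return (before - after) / before * 100


# Hypothetical per-task token counts; replace with your own measurements.
tokens_before = [42_000, 38_500, 51_000, 47_200, 40_100]
tokens_after = [24_000, 26_500, 31_000, 22_800, 25_900]

drop = pct_drop(median(tokens_before), median(tokens_after))
print(f"median tokens per task: {drop:.1f}% drop")
print(f"worked example cost: {pct_drop(1.85, 0.35):.0f}% drop")
```

Medians are used rather than means so that one runaway task does not dominate the comparison.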

---

## Touchdown Labs: How We Can Help

Touchdown Labs measures and optimizes the hidden infrastructure work behind agentic AI tasks: kernels, serving engines, KV cache, prefix caching, routing, CPU tool loops, workload replay, and hardware-aware diagnostics, all tied back to cost per successful task at p95/p99 latency.

We show teams how **not to max out their Claude Code, Codex, or coding subscriptions through brute force alone**.

That is the whole point. At Touchdown Labs, we believe every layer of the stack has to be intentional and thought through from first principles if you want the system to be optimized from the ground up. 

It is not enough to throw more tokens, more agents, more GPUs, or more subscriptions at the problem. The real leverage comes from understanding how the entire stack works together: prompts, tools, context, cache, runtime, compiler, kernels, hardware, routing, and infrastructure.

That is what we are building toward.

If you deploy this guide and need help automating the distiller-cartographer-evictor loop, or want to integrate our standardized `EvidencePacket` trace-validation system, reach out:

*   **Contact**: [william@touchdown-labs.com](mailto:william@touchdown-labs.com)
*   **Open Careers**: We are hiring kernel, systems, and forward-deployed engineers.
