Supported Inputs

This document specifies the input artifacts that inferguard analyze <results_dir> is planned to accept when analyzing DeepSeek-V4 GMI benchmark outputs from SemiAnalysis InferenceX and AgentX.

The analyzer is best-effort by default: it discovers supported files recursively, records missing-artifact findings, and emits a partial report when enough data exists. Strict mode may treat missing required artifacts as fatal.

Directory discovery

The analyzer walks <results_dir> recursively and groups artifacts into cells using this precedence:

Explicit metadata inside agg_*.json.
Recipe or script directory name.
Parent directory basename.
File path fallback.
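
The precedence above can be sketched as a small resolver. This is a minimal illustration, not the analyzer's actual code: the function name resolve_cell_key, the IDENTITY_KEYS selection, and the "-"-joined key format are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical choice of identity fields used to build a cell key.
IDENTITY_KEYS = ("hw", "framework", "isl", "osl", "conc")


def resolve_cell_key(
    path: Path,
    agg_metadata: Optional[dict] = None,
    recipe_dir: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (cell_key, source), trying each precedence level in order."""
    # 1. Explicit metadata inside agg_*.json.
    if agg_metadata:
        parts = [str(agg_metadata[k]) for k in IDENTITY_KEYS if k in agg_metadata]
        if parts:
            return "-".join(parts), "agg_metadata"
    # 2. Recipe or script directory name.
    if recipe_dir:
        return recipe_dir, "recipe_dir"
    # 3. Parent directory basename.
    if path.parent.name:
        return path.parent.name, "parent_dir"
    # 4. File path fallback.
    return path.as_posix(), "file_path"
```

Returning the source alongside the key lets a report show whether a cell identity came from explicit metadata or from path inference.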

Common layout:

results/gmi-dsv4-YYYYMMDD/
  rigs/
    h200/single_node/<cell>/
    b200/single_node/<cell>/
    b300/single_node/<cell>/
    gb200/multi_node/<recipe>/
  inferguard_report/

Artifact matrix

Artifact	Producer	Required?	Purpose
`agg_*.json`	InferenceX `utils/process_result.py` or `utils/process_agentic_result.py`	Yes for InferenceX fixed-sequence cells	Primary normalized benchmark summary.
`detailed_results.csv`	AgentX trace replay	Yes for AgentX cells	Per-request success, timing, token, and cache-hit data.
`metrics_server_metrics.csv`	AgentX metrics collector	Recommended for AgentX cells	Prefix-cache, KV offload, and server aggregate metrics.
`results*.json`	Benchmark/eval runner	Optional	Raw benchmark or eval outputs.
`sample*.jsonl`	Eval runner	Optional	Sample-level eval outputs.
`meta_env.json`	Runner or workflow	Optional	Environment and commit metadata.
`inferguard_timeline.jsonl`	`inferguard disagg status --json` live overlay loop	Optional enrichment	Live disagg findings and endpoint snapshots.
`summary.csv`	InferenceX workflow or collector	Optional	Sweep-level summary table.
`benchmark_command.txt`	Run harness	Optional	Reproducibility metadata.
`server.log`, `*.log`, `*.tar.gz`	Runner / srt-slurm	Optional	Evidence links in the artifact manifest; not parsed as metrics in v1.
`manifest.json`	Campaign wrapper	Optional	Expected cells, upload targets, and whether live timeline was expected.
`summary.json`	InferGuard Bench native runner	Yes for native InferGuard runs	Aggregate counts, latency, TTFT, throughput, tokens, concurrency, workload breakdown, KVCast mode, and redaction status.
`requests.jsonl`	InferGuard Bench native runner	Yes for native InferGuard runs	Request specs used in the run. Prompt content may be redacted when `--redact-prompts` is used.
`metrics.jsonl`	InferGuard Bench native runner	Yes for native InferGuard runs	Per-request client metrics including latency, TTFT, first SSE timing, token source labels, success/error, and KVCast metadata.
`run.json` / `config.json`	InferGuard Bench native runner	Yes for native InferGuard runs	Reproducibility metadata for the benchmark invocation and artifact bundle.
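
As a rough illustration of how the Required? column could interact with best-effort and strict modes, the sketch below checks one cell's filenames against a required set. The cell kind labels, the missing_artifact finding code, and the MissingArtifactError class are hypothetical, and treating both run.json and config.json as required is an assumption.

```python
from fnmatch import fnmatch

# Required patterns per cell kind, taken from the matrix above.
# The kind labels themselves are hypothetical.
REQUIRED = {
    "inferencex": ["agg_*.json"],
    "agentx": ["detailed_results.csv"],
    "native": ["summary.json", "requests.jsonl", "metrics.jsonl",
               "run.json", "config.json"],
}


class MissingArtifactError(Exception):
    """Raised in strict mode when a required artifact is absent."""


def check_required(cell_kind, filenames, strict=False):
    """Return missing-artifact findings; raise instead when strict."""
    missing = [
        pattern
        for pattern in REQUIRED[cell_kind]
        if not any(fnmatch(name, pattern) for name in filenames)
    ]
    if strict and missing:
        raise MissingArtifactError(", ".join(missing))
    # Best-effort mode: record findings and let the caller continue.
    return [{"code": "missing_artifact", "artifact": p} for p in missing]
```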

InferGuard native bench output

Native InferGuard runs are recognized by summary.json with schema_version: inferguard-bench-summary/v1. The analyzer reports these cells as source_format: inferguard-bench-native.

Expected companion files:

run.json
config.json
requests.jsonl
metrics.jsonl
summary.json
report.md

Native output records KVCast/replay metadata but does not claim to follow official InferenceX methodology. When a native run contains multiple concurrency levels, the cell-level concurrency field is null and the full list is preserved under topology.concurrency_levels.
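
A minimal sketch of the recognition rule and the concurrency handling described above. The function names are hypothetical, and the sketch assumes summary.json has already been parsed into a dict.

```python
from typing import List, Optional

NATIVE_SCHEMA = "inferguard-bench-summary/v1"


def source_format(summary: object) -> Optional[str]:
    """Return the source_format label for a parsed summary.json, if native."""
    if isinstance(summary, dict) and summary.get("schema_version") == NATIVE_SCHEMA:
        return "inferguard-bench-native"
    return None


def cell_concurrency(levels: List[int]) -> Optional[int]:
    """Cell-level concurrency: the single level, or None (null) otherwise."""
    return levels[0] if len(levels) == 1 else None
```

For example, a run with levels [4, 8, 16] yields a null cell-level concurrency, while the list itself stays available under topology.concurrency_levels.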

InferenceX `agg_*.json`

Static and srt-slurm cells should include these fields when available.

Identity fields

Field	Meaning
`hw`	Hardware label, for example `h200`, `b200`, `b300`, `gb200`.
`model`	Model name or path.
`infmax_model_prefix`	InferenceX model prefix, when emitted.
`framework`	Serving stack, for example `vllm` or `dynamo-vllm`.
`precision`	Weight/KV precision label, for example `fp4` or `fp8`.
`image`	Container image.
`disagg`	Whether the run used disaggregated serving.
`is_multinode`	Whether the run was multi-node.

Shape fields

Field	Meaning
`isl`	Input sequence length.
`osl`	Output sequence length.
`conc`	Concurrency.

Topology fields

Single-node fields:

tp
ep
dp_attention

Multi-node/disagg fields:

prefill_tp
prefill_ep
prefill_dp_attention
prefill_num_workers
decode_tp
decode_ep
decode_dp_attention
decode_num_workers
num_prefill_gpu
num_decode_gpu

Throughput fields

tput_per_gpu
output_tput_per_gpu
input_tput_per_gpu
total throughput fields if present
output throughput fields if present
input throughput fields if present

Latency fields

The analyzer should preserve emitted latency keys and normalize common ones:

mean_ttft
p50_ttft
p90_ttft
p95_ttft
p99_ttft
mean_tpot
p50_tpot
p90_tpot
p95_tpot
p99_tpot
mean_itl
p99_itl
intvty
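
One possible reading of "preserve and normalize" is sketched below: keys of the form <stat>_<metric> are grouped by metric, and every other key is kept unchanged. The grouping scheme and the function name are assumptions; the document does not define the normalized layout.

```python
import re

# Matches keys such as mean_ttft, p99_tpot, p50_itl.
LATENCY_KEY = re.compile(r"^(mean|p\d+)_(ttft|tpot|itl)$")


def normalize_latency(fields):
    """Split latency fields into (normalized, preserved) dicts."""
    normalized = {}
    preserved = {}
    for key, value in fields.items():
        match = LATENCY_KEY.match(key)
        if match:
            stat, metric = match.groups()
            normalized.setdefault(metric, {})[stat] = value
        else:
            # Unrecognized keys (for example intvty) keep their emitted name.
            preserved[key] = value
    return normalized, preserved
```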

AgentX `detailed_results.csv`

Expected columns:

Column	Meaning
`success`	Request success flag.
`request_start_time`	Request start timestamp.
`request_complete_time`	Request completion timestamp.
`ttft`	Time to first token.
`ttlt`	Time to last token.
`itl`	Inter-token latency.
`input_tokens`	Prompt token count.
`output_tokens_expected`	Expected generated tokens.
`output_tokens_actual`	Actual generated tokens.
`cache_hit_blocks`	Prefix/KV cache-hit block count.
`cache_miss_blocks`	Prefix/KV cache-miss block count.

Derived metrics:

request count
success rate
QPS
mean/p99 TTFT
mean/p99 TTLT
mean/p99 ITL
output tokens per second
theoretical cache hit rate
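
Several of the derived metrics above can be sketched with the standard library. The sketch makes assumptions the document does not state: success is serialized as "true" or "1", timestamps are numeric seconds, QPS counts all requests over the wall-clock span, p99 uses the nearest-rank method, and the cache hit rate is hits / (hits + misses).

```python
import csv
import io
import math


def percentile(values, pct):
    """Nearest-rank percentile; None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def mean(values):
    return sum(values) / len(values) if values else None


def summarize_detailed_results(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {"request_count": 0}
    # Assumption: success is serialized as "true" or "1".
    ok = [r for r in rows if r["success"].strip().lower() in ("1", "true")]
    # Assumption: timestamps are numeric seconds.
    start = min(float(r["request_start_time"]) for r in rows)
    end = max(float(r["request_complete_time"]) for r in rows)
    duration = end - start
    ttft = [float(r["ttft"]) for r in ok if r["ttft"]]
    out_tokens = sum(int(r["output_tokens_actual"]) for r in ok)
    hits = sum(int(r["cache_hit_blocks"]) for r in rows)
    misses = sum(int(r["cache_miss_blocks"]) for r in rows)
    return {
        "request_count": len(rows),
        "success_rate": len(ok) / len(rows),
        "qps": len(rows) / duration if duration > 0 else None,
        "mean_ttft": mean(ttft),
        "p99_ttft": percentile(ttft, 99),
        "output_tokens_per_second": out_tokens / duration if duration > 0 else None,
        "cache_hit_rate": hits / (hits + misses) if hits + misses else None,
    }
```

TTLT and ITL statistics would follow the same pattern as TTFT.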

AgentX `metrics_server_metrics.csv`

Expected fields when available:

Field	Meaning
`prefix_cache_hits`	Server prefix-cache hit count.
`prefix_cache_queries`	Server prefix-cache query count.
`cpu_prefix_cache_hits`	CPU prefix-cache hit count.
`cpu_prefix_cache_queries`	CPU prefix-cache query count.
`kv_offload_bytes_gpu_to_cpu`	Bytes offloaded from GPU to CPU.
`kv_offload_bytes_cpu_to_gpu`	Bytes restored from CPU to GPU.
`kv_offload_time_gpu_to_cpu`	Time spent on GPU→CPU offload.
`kv_offload_time_cpu_to_gpu`	Time spent on CPU→GPU restore.
`cpu_kv_cache_usage_pct`	CPU KV cache utilization percentage.
`prompt_tokens_total`	Prompt token total.
`generation_tokens_total`	Generated token total.
`request_success_total`	Successful request total.

Derived metrics:

server GPU cache hit rate
server CPU cache hit rate
KV offload bytes by direction
KV offload time by direction
cache/offload pressure findings
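
The hit-rate and offload metrics reduce to guarded ratios and pass-through values. The sketch below assumes the caller has already reduced the CSV to final totals per field (how rows are aggregated is not specified here), and the output key names are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0."""
    return numerator / denominator if denominator else None


def server_cache_metrics(totals):
    """Derive cache hit rates and offload volumes from per-field totals."""
    return {
        "gpu_cache_hit_rate": safe_ratio(
            totals.get("prefix_cache_hits", 0),
            totals.get("prefix_cache_queries", 0),
        ),
        "cpu_cache_hit_rate": safe_ratio(
            totals.get("cpu_prefix_cache_hits", 0),
            totals.get("cpu_prefix_cache_queries", 0),
        ),
        "kv_offload_bytes": {
            "gpu_to_cpu": totals.get("kv_offload_bytes_gpu_to_cpu", 0),
            "cpu_to_gpu": totals.get("kv_offload_bytes_cpu_to_gpu", 0),
        },
    }
```

Returning None instead of 0 for an empty denominator keeps "no queries observed" distinct from "queries observed, none hit".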

Eval artifacts

The v1 analyzer parses eval files tolerantly as JSON/JSONL, because the exact schema can vary by runner.

Supported filenames:

results*.json
sample*.jsonl
meta_env.json

Behavior:

Preserve top-level numeric and string metrics when possible.
Link sample files in artifact_manifest.
Emit eval_regression only when comparable baseline fields are present.
Emit metrics_unavailable only when eval analysis was expected but no eval artifact exists.
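
The first behavior, preserving top-level numeric and string metrics, might look like the sketch below. Excluding booleans is an assumption made for the example (Python treats bool as a subclass of int), and the function name is hypothetical.

```python
def extract_top_level_metrics(payload):
    """Keep top-level numeric and string values; skip nested structures."""
    if not isinstance(payload, dict):
        return {}  # tolerate unexpected shapes rather than failing
    return {
        key: value
        for key, value in payload.items()
        if isinstance(value, (int, float, str)) and not isinstance(value, bool)
    }
```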

srt-slurm multi-node result directories

The analyzer should recurse through recipe output trees and associate files with the nearest cell/recipe directory.

Expected patterns:

**/agg_*.json
**/*results*.json
**/inferguard_timeline.jsonl
**/server*.log
**/benchmark*.log
**/multinode_server_logs.tar.gz

Cell identity should prefer fields from agg_*.json; path inference is fallback only.
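
Recursive discovery over these patterns can be sketched with pathlib. Path.rglob(pattern) is equivalent to glob("**/" + pattern), so the leading **/ is dropped from each pattern; the function name discover and the returned mapping are assumptions.

```python
from pathlib import Path

# Basename patterns from the list above, without the leading "**/".
PATTERNS = (
    "agg_*.json",
    "*results*.json",
    "inferguard_timeline.jsonl",
    "server*.log",
    "benchmark*.log",
    "multinode_server_logs.tar.gz",
)


def discover(results_dir):
    """Map each pattern to sorted matching paths, relative to results_dir."""
    root = Path(results_dir)
    return {
        pattern: sorted(
            p.relative_to(root).as_posix()
            for p in root.rglob(pattern)
            if p.is_file()
        )
        for pattern in PATTERNS
    }
```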

`inferguard_timeline.jsonl`

Timeline input is optional enrichment. A missing timeline should not invalidate a run unless manifest.json declares it as expected.

Supported line shapes:

inferguard-timeline/v1 wrapper records.
Raw disagg-status/v1 records from one-shot captures.

Wrapper record shape:

{
  "schema_version": "inferguard-timeline/v1",
  "observed_at": "2026-04-29T22:01:30Z",
  "sequence": 0,
  "status": "healthy",
  "proof_level": "live",
  "capabilities": {
    "diagnosis": "on",
    "actuation": "off",
    "replay": "off",
    "recall": "off"
  },
  "disagg_status": {
    "schema_version": "disagg-status/v1",
    "prefill": {},
    "decode": {},
    "transfer": null,
    "findings": []
  }
}

Timeline-derived metrics:

sample count
first observed timestamp
last observed timestamp
finding counts by code
first finding timestamp
first critical finding timestamp
first live disagg finding before a post-run TTFT cliff, when computable
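
Most of the timeline-derived metrics above can be computed in one pass over the file. The sketch below accepts both supported line shapes. It assumes each finding is an object with a code field and an optional severity field whose value "critical" marks critical findings; the finding schema is not defined in this document, so both names are hypothetical.

```python
import json
from collections import Counter
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp (trailing Z as UTC); None if invalid."""
    try:
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return None


def summarize_timeline(lines):
    samples, observed = 0, []
    counts, finding_times, critical_times = Counter(), [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate blank, partial, or corrupt lines
        if not isinstance(record, dict):
            continue
        schema = record.get("schema_version")
        if schema == "inferguard-timeline/v1":
            status = record.get("disagg_status")
        elif schema == "disagg-status/v1":
            status = record  # raw one-shot capture
        else:
            continue
        if not isinstance(status, dict):
            status = {}
        samples += 1
        ts = parse_ts(record.get("observed_at"))
        if ts:
            observed.append(ts)
        for finding in status.get("findings") or []:
            if not isinstance(finding, dict):
                continue
            counts[finding.get("code", "unknown")] += 1
            if ts:
                finding_times.append(ts)
                if finding.get("severity") == "critical":
                    critical_times.append(ts)
    return {
        "sample_count": samples,
        "first_observed": min(observed, default=None),
        "last_observed": max(observed, default=None),
        "finding_counts": dict(counts),
        "first_finding": min(finding_times, default=None),
        "first_critical_finding": min(critical_times, default=None),
    }
```

Comparing the first finding timestamp against a post-run TTFT cliff would additionally need per-request timing from the benchmark artifacts, so it is left out of this sketch.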

Unsupported inputs in v1

The planned v1 analyzer does not parse these as structured metrics:

arbitrary server logs
binary profiler dumps
private/pro-tier InferGuard memory or replay outputs
cloud provider billing exports
benchmark harnesses unrelated to InferenceX or AgentX

Unsupported files may still appear in artifact_manifest for traceability.