Analyze a Run

Planned CLI for post-run analysis of DeepSeek-V4 GMI benchmark outputs produced by SemiAnalysis InferenceX and AgentX.

inferguard analyze <results_dir> \
  --output-dir <results_dir>/inferguard_report \
  --format both \
  --fail-on critical \
  --best-effort

inferguard analyze is a read-only report generator. It does not launch benchmarks, change serving configuration, call private or Pro-tier modules, use LLMs, or perform actuation. It complements the live overlay command:

inferguard disagg status --prefill <url> --decode <url> --json

The live command can produce inferguard_timeline.jsonl during a run; the analyzer consumes that timeline after the run alongside InferenceX and AgentX result artifacts.

Command shape

inferguard analyze <results_dir> [OPTIONS]

Argument / flag	Default	Meaning
`<results_dir>`	required	Root directory containing one or more benchmark cells or recipe result directories.
`--output-dir PATH`	`<results_dir>/inferguard_report`	Destination for generated reports.
`--format json|md|both`	`both`	Select `report.json`, `report.md`, or both.
`--fail-on never|warning|critical`	`critical`	Exit non-zero after report generation when a finding at or above this severity exists.
`--strict / --best-effort`	`--best-effort`	Strict mode fails on missing required artifacts; best-effort mode records findings and continues.
`--timeline-glob TEXT`	`**/inferguard_timeline.jsonl`	Discovery pattern for live overlay timeline files.
`--json`	`false`	Also print the generated JSON report to stdout.
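
The default `--timeline-glob` pattern is a recursive glob, so timeline files can sit at any depth under `<results_dir>`. A minimal sketch of that discovery step, assuming standard `pathlib` glob semantics; the helper name `discover_timelines` is hypothetical and not part of InferGuard:

```python
from pathlib import Path


def discover_timelines(results_dir, pattern="**/inferguard_timeline.jsonl"):
    """Return timeline files under results_dir that match the glob pattern.

    Hypothetical helper: the default pattern mirrors the documented
    --timeline-glob default; the sorting is an illustrative choice.
    """
    root = Path(results_dir)
    # "**" matches zero or more directory levels, so a timeline placed
    # directly in results_dir is found as well as nested ones.
    return sorted(p for p in root.glob(pattern) if p.is_file())
```

Sorting the matches keeps the discovery order stable across filesystems, which helps when the resulting list is recorded in a manifest.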

Exit codes

Code	Meaning
`0`	Report written; no finding at or above `--fail-on`.
`1`	Report written; warning threshold tripped.
`2`	Report written; critical threshold tripped.
`3`	No supported benchmark artifacts found, input parsing failed before a report could be produced, or report writing failed.
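
The interaction between `--fail-on` and exit codes `0` to `2` can be sketched as below. This is one reading of the table, not the analyzer's implementation: it assumes the exit code reflects the worst finding at or above the threshold, so a critical finding under `--fail-on warning` yields `2`. The function name and the `info` rank are illustrative assumptions. Code `3` is not modeled because it covers failures before or during report writing.

```python
# Severity ordering assumed for illustration; "info" appears in the finding
# table but can never trip a threshold because --fail-on has no "info" value.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def exit_code(severities, fail_on="critical"):
    """Map finding severities to an exit code once the report is written."""
    if fail_on == "never":
        return 0
    threshold = SEVERITY_RANK[fail_on]
    # default=-1 handles a run with no findings at all.
    worst = max((SEVERITY_RANK[s] for s in severities), default=-1)
    if worst < threshold:
        return 0
    # Assumption: the code names the worst severity seen, not the threshold.
    return 2 if worst == SEVERITY_RANK["critical"] else 1
```

Under this reading, `exit_code(["warning"], "critical")` returns `0` because the only finding is below the threshold, while `exit_code(["warning"], "warning")` returns `1`.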

Output files

By default the analyzer writes:

<results_dir>/inferguard_report/
  report.json
  report.md

Per-cell reports may also live beside raw result artifacts when a caller runs the analyzer on a cell directory directly.

`report.json`

report.json uses schema version inferguard-analyze/v1. See Schemas for the normative field contract.

Top-level sections:

Field	Meaning
`schema_version`	Always `inferguard-analyze/v1`.
`generated_at`	UTC timestamp for report generation.
`input_root`	Analyzer input directory.
`analyzer`	InferGuard version and OSS capability declaration.
`run_summary`	Cell counts, completion status, and missing-artifact summary.
`cells`	Normalized per-cell records.
`cross_run`	Cross-cell comparisons and detected curves.
`findings`	Flattened finding list across the run.
`artifact_manifest`	Files discovered and files written.
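
A consumer of `report.json` might verify the top-level contract before reading further. The sketch below checks only the field names and schema version listed above; the nested structure of each section is defined in Schemas and is not assumed here. The helper names `load_report` and `check_report` are hypothetical.

```python
import json
from pathlib import Path

EXPECTED_SCHEMA = "inferguard-analyze/v1"
TOP_LEVEL_FIELDS = (
    "schema_version", "generated_at", "input_root", "analyzer",
    "run_summary", "cells", "cross_run", "findings", "artifact_manifest",
)


def load_report(path):
    """Parse a report.json file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def check_report(report):
    """Return a list of human-readable problems; empty means the check passed."""
    problems = []
    missing = [name for name in TOP_LEVEL_FIELDS if name not in report]
    if missing:
        problems.append("missing top-level fields: " + ", ".join(missing))
    if report.get("schema_version") != EXPECTED_SCHEMA:
        problems.append(
            "unexpected schema_version: %r" % report.get("schema_version")
        )
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller log every contract violation in a single pass.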

`report.md`

report.md is the human-readable companion report. Planned sections:

Executive summary
Benchmark matrix
Artifact completeness
Per-cell results
Live InferGuard timeline
Bottleneck analysis
Evidence-based next measurements
Co-publish artifact manifest

The report may describe observed cliffs, plateaus, missing artifacts, or follow-up measurements. It must not recommend automatic configuration changes or use Pro-tier advisory/actuation language.

Supported benchmark modes

The analyzer is planned to support these DeepSeek-V4 GMI result shapes:

Mode	Primary artifacts
InferenceX single-node fixed sequence	`agg_*.json`, benchmark `results.json`, optional `inferguard_timeline.jsonl`
InferenceX multi-node srt-slurm disagg	`agg_*.json`, recipe result subdirectories, logs, optional `inferguard_timeline.jsonl`
AgentX trace replay	`detailed_results.csv`, `metrics_server_metrics.csv`, optional `agg_*.json`
Eval outputs	`results.json`, `sample.jsonl`, `meta_env.json`

Detailed input requirements are in Supported inputs.

Finding codes

Analyzer-native codes:

Code	Severity	Meaning
`missing_required_artifact`	warning or critical	A required input for the detected result type is absent.
`invalid_run_no_successful_requests`	critical	The run completed with zero successful requests.
`partial_run`	warning	Success rate is below the planned validity threshold.
`metrics_unavailable`	warning	Metrics needed for a requested analysis section are missing.
`ttft_cliff`	warning or critical	p99 TTFT jumps materially while throughput gain is small.
`tpot_degradation`	warning	p99 TPOT worsens materially versus a comparable point.
`throughput_plateau`	info or warning	Higher concurrency produces little throughput gain while latency worsens.
`kv_pressure`	warning	Cache usage or offload signals indicate sustained KV pressure.
`kv_offload_thrash`	warning	GPU→CPU and CPU→GPU offload both rise with latency degradation.
`prefix_cache_regression`	warning	Observed cache hit rate drops versus comparable points or falls below the theoretical rate.
`eval_regression`	warning	Eval metrics regress versus the selected baseline.
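
To make the `ttft_cliff` definition concrete, the sketch below compares adjacent concurrency points and flags a step where p99 TTFT jumps while throughput barely moves. The document does not define "materially" or "small", so the thresholds `ttft_jump=2.0` and `max_tput_gain=1.10` are hypothetical placeholders, as is the function name. The field names follow the normalized metric groups.

```python
def find_ttft_cliffs(points, ttft_jump=2.0, max_tput_gain=1.10):
    """Return (lower, higher) concurrency pairs that look like a TTFT cliff.

    points: iterable of dicts with "concurrency", "p99_ttft", and
    "total_tput_tps". Both thresholds are illustrative assumptions,
    not values taken from the analyzer.
    """
    ordered = sorted(points, key=lambda p: p["concurrency"])
    cliffs = []
    for prev, curr in zip(ordered, ordered[1:]):
        # Skip degenerate baselines to avoid division by zero.
        if prev["p99_ttft"] <= 0 or prev["total_tput_tps"] <= 0:
            continue
        ttft_ratio = curr["p99_ttft"] / prev["p99_ttft"]
        tput_ratio = curr["total_tput_tps"] / prev["total_tput_tps"]
        if ttft_ratio >= ttft_jump and tput_ratio <= max_tput_gain:
            cliffs.append((prev["concurrency"], curr["concurrency"]))
    return cliffs
```

Comparing only adjacent points keeps the check local; a real detector would also need to confirm that the points are comparable (same hardware, model, and sequence shape) before pairing them.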

Live-overlay codes reused when read from inferguard_timeline.jsonl:

prefill_decode_imbalance
kv_transfer_stall
kv_transfer_errors_present
endpoint_unreachable
engine_unidentified

Metric normalization

The analyzer normalizes metrics into these groups when present:

Group	Fields
Identity	`hardware`, `model`, `framework`, `precision`, `source_format`, `recipe_name`
Shape	`isl`, `osl`, `concurrency`, `scenario_type`, `is_multinode`
Topology	`tp`, `ep`, `dp_attention`, `prefill_*`, `decode_*`, `num_prefill_gpu`, `num_decode_gpu`
Completion	`num_requests_total`, `num_requests_successful`, `success_rate`, `status`
Throughput	`total_tput_tps`, `input_tput_tps`, `output_tput_tps`, `tput_per_gpu`, `input_tput_per_gpu`, `output_tput_per_gpu`
Latency	`mean_ttft`, `p99_ttft`, `mean_tpot`, `p99_tpot`, `mean_itl`, `p99_itl`, `intvty`
Cache/offload	`theoretical_cache_hit_rate`, `server_gpu_cache_hit_rate`, `server_cpu_cache_hit_rate`, `kv_offload_bytes_gpu_to_cpu`, `kv_offload_bytes_cpu_to_gpu`
Timeline	sample count, first finding time, finding counts by code, lead time to detected TTFT cliff

Missing optional metrics are represented as null in JSON and called out only when they block a requested analysis section.
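
The null convention maps directly onto Python's `None`, which `json.dumps` serializes as `null`. A minimal sketch for the Latency group, assuming the raw record is a flat dict keyed by the same field names; the helper name `normalize_latency` is hypothetical:

```python
import json

LATENCY_FIELDS = (
    "mean_ttft", "p99_ttft", "mean_tpot", "p99_tpot",
    "mean_itl", "p99_itl", "intvty",
)


def normalize_latency(raw):
    """Project a raw record onto the Latency group, using None for gaps."""
    # dict.get returns None for absent keys, which serializes as JSON null.
    return {field: raw.get(field) for field in LATENCY_FIELDS}


if __name__ == "__main__":
    print(json.dumps(normalize_latency({"p99_ttft": 0.42}), indent=2))
```

Emitting every field with an explicit `null` keeps the record shape constant across cells, so downstream consumers can distinguish "not measured" from "field unknown to this schema version".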