Architecture
This page is the operator-oriented architecture overview. The specification remains the authoritative source for the detailed schema and the historical architecture; this page is intentionally shorter and is aimed at OSS readers deciding how the pieces fit together.
One-sentence architecture
InferGuard is a read-only evidence pipeline: launch or observe a serving stack, collect request/engine/GPU artifacts, validate whether the evidence is publishable, then produce diagnosis, failure, cost, cliff, and recommendation reports only when the claim status allows it.
Evidence pipeline
flowchart LR
A[simulate-gpu or launch-engine] --> B[request-profile]
A --> C[collect-metrics]
B --> D[validate-completed]
C --> D
D --> E{claim_status}
E -->|measured| F[diagnose-bottleneck]
E -->|measured or inferred| G[classify-failures]
E -->|live_complete only for measured math| H[compute-cost]
E -->|completed sweeps| I[find-cliffs]
F --> J[report-completed]
G --> J
H --> J
I --> J
The most important boundary is validate-completed: downstream commands can still run on incomplete input, but they must downgrade claims or refuse operator recommendations when the required evidence is missing.
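The gating rule can be illustrated with a small sketch. This is not InferGuard's implementation: the `FULL_CLAIM_STATUSES` table and the `gate` function are hypothetical names, and only the two flowchart edges that depend purely on `claim_status` are modeled.

```python
# Hypothetical gate table (not InferGuard's code). It mirrors two flowchart
# edges: diagnose-bottleneck needs "measured"; classify-failures accepts
# "measured" or "inferred".
FULL_CLAIM_STATUSES = {
    "diagnose-bottleneck": frozenset({"measured"}),
    "classify-failures": frozenset({"measured", "inferred"}),
}


def gate(command: str, claim_status: str) -> str:
    """Return "proceed" when claim_status supports the command's full claims.

    Otherwise return "downgrade_or_refuse": the command may still run, but it
    must weaken its claims or decline to recommend.
    """
    allowed = FULL_CLAIM_STATUSES.get(command)
    if allowed is None:
        raise KeyError(f"no gate modeled for {command!r}")
    return "proceed" if claim_status in allowed else "downgrade_or_refuse"


if __name__ == "__main__":
    print(gate("diagnose-bottleneck", "inferred"))
```

The point of the sketch is that a failed gate does not stop the command from running; it only changes what the command is allowed to claim.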
Fourteen core modules
| Module | Primary command(s) | Responsibility |
|---|---|---|
| inferguard.cli | all commands | Typer entrypoint, command wiring, shared signal handling, stdout summaries. |
| inferguard.io | shared | Atomic writes, tolerant JSON readers, JSONL flushing, partial-results registration, child-process cleanup. |
| inferguard.preflight | preflight | Read-only model/engine compatibility checks and tokenizer mismatch probes. |
| inferguard.synthetic | simulate-gpu | Synthetic GPU/Slurm bundle generation for local smoke tests. |
| inferguard.launch_engine | launch-engine | vLLM/SGLang/LMCache/Dynamo-SGLang command capture, external validation, healthcheck artifacts, process cleanup. |
| inferguard.request_profile | request-profile | OpenAI-compatible request loop with TTFT, TPOT, latency, token, error, and per-field claim evidence. |
| inferguard.collect_metrics | collect-metrics | Engine Prometheus, LMCache, and DCGM timeline collection. |
| inferguard.validate | validate-completed | Publishability classification, missing-artifact checks, live-complete quorum, claim downgrades. |
| inferguard.diagnose_bottleneck | diagnose-bottleneck | Bottleneck verdicts across prefill, decode, KV, queue, network, host, launch, and insufficient evidence. |
| inferguard.classify_failures | classify-failures | Failure-class regex/evidence ranking from logs and artifacts. |
| inferguard.report_completed | report-completed | Refusal-gated operator recommendation reports. |
| inferguard.find_cliffs | find-cliffs | Capacity-envelope and cliff detection across sweeps. |
| inferguard.cost_model | compute-cost | Cost-per-useful-task and safe concurrency calculations with validation-aware downgrades. |
| inferguard.agentx_adapter | agentx-ingest | AgentX CSV to canonical InferGuard artifact conversion. |
Adjacent surfaces are still first-class OSS APIs: inferguard.bench for replay/KVCast/cold-start/compare workloads, inferguard.analyze for existing result trees, inferguard.disagg for live Prometheus overlay, inferguard.profile for live and retro profile loops, inferguard.harness for daemon/agent tracing, inferguard.router and inferguard.workload for routing, and inferguard.mcp_server for MCP tools.
Claim status enum
Every public artifact that makes an evidence claim must use one of four canonical values:
| claim_status | When to use it | Publication meaning |
|---|---|---|
| synthetic | Generated by simulate-gpu, dry-run fixtures, or synthetic-only smoke tests. | Useful for local validation; not real GPU evidence. |
| inferred | Evidence is indirect, partial, or missing one or more proof fields. | Can guide next steps; quote with caveats. |
| measured | Live evidence is present and the relevant validation gates pass. | Suitable for measured claims if the artifact set is included. |
| not_proven | The claim failed validation or cannot be checked. | Do not publish as a positive claim. |
Non-canonical labels such as partial, downgraded, or inferred_without_engine_metrics may appear only in explanatory fields like reason, claim_reason, or claim_caveat; they must not be emitted as claim_status.
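One way to enforce this rule is to parse `claim_status` through an enum, so any other label fails loudly. The sketch below is illustrative only: `ClaimStatus` and `check_artifact` are hypothetical names, and the artifact is assumed to be a plain dict with a top-level `claim_status` key.

```python
from enum import Enum


class ClaimStatus(str, Enum):
    """The four canonical claim_status values listed in the table above."""

    SYNTHETIC = "synthetic"
    INFERRED = "inferred"
    MEASURED = "measured"
    NOT_PROVEN = "not_proven"


def check_artifact(artifact: dict) -> ClaimStatus:
    """Return the canonical status, or raise ValueError for any other label.

    Assumption: the artifact is a dict with a top-level "claim_status" key.
    """
    raw = artifact.get("claim_status")
    try:
        return ClaimStatus(raw)
    except ValueError:
        raise ValueError(
            f"non-canonical claim_status {raw!r}; "
            "explanatory labels belong in reason, claim_reason, or claim_caveat"
        ) from None
```

Under this sketch, an artifact with `claim_status` set to `inferred` and `claim_reason` set to `inferred_without_engine_metrics` passes, while one that puts `partial` in `claim_status` is rejected.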
Completed-run statuses
validate-completed emits a run-level status and per-job status:
| Status | Meaning |
|---|---|
| live_complete | Required live request, launch, engine metrics, and GPU metrics evidence exists. |
| live_incomplete | Some live artifacts exist, but the live-complete quorum is not satisfied. |
| synthetic_only | Synthetic mimic markers are present and no live evidence supersedes them. |
| missing_required_artifacts | Required contract or job artifacts are absent. |
| not_publishable | Invalid inputs or synthetic/live conflicts prevent publication. |
A live_complete job requires:
- request_profile/requests_profile.jsonl exists, is non-empty, and has at least one successful row;
- launch/healthcheck.json reports status code 200 or an equivalent success state;
- metrics/engine_metrics_timeline.jsonl exists, is non-empty, and contains recognized engine metrics;
- metrics/gpu_metrics_timeline.jsonl exists, is non-empty, and contains required DCGM GPU utilization and framebuffer signals;
- the artifact contract's required paths are present or explicitly downgraded.
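The file-level part of this quorum can be sketched as follows. This is a simplified illustration, not the validator's logic. The function names and the `success` and `status_code` field names are assumptions. The sketch also leaves out the metric-name checks, the "equivalent success state" case, and the artifact-contract check, because the source does not define them.

```python
import json
from pathlib import Path


def _read_jsonl(path: Path) -> list[dict]:
    """Return parsed rows, or an empty list if the file is missing or empty."""
    if not path.is_file():
        return []
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            rows.append(json.loads(line))
    return rows


def live_complete_gaps(job_dir: Path) -> list[str]:
    """List the quorum items that this simplified check finds missing."""
    gaps = []

    requests = _read_jsonl(job_dir / "request_profile" / "requests_profile.jsonl")
    # Assumption: a successful row carries a truthy "success" field.
    if not any(row.get("success") for row in requests):
        gaps.append("no successful request row")

    health_path = job_dir / "launch" / "healthcheck.json"
    health = {}
    if health_path.is_file():
        health = json.loads(health_path.read_text(encoding="utf-8"))
    # Assumption: the HTTP status is stored under "status_code".
    if health.get("status_code") != 200:
        gaps.append("healthcheck not 200")

    for name in ("engine_metrics_timeline.jsonl", "gpu_metrics_timeline.jsonl"):
        if not _read_jsonl(job_dir / "metrics" / name):
            gaps.append(f"{name} missing or empty")

    return gaps
```

An empty list means that the simplified check found no gaps. It does not mean the job would be classified as live_complete by validate-completed.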
Artifact layout
A completed run root normally looks like this:
results-root/
  matrix_plan.json
  expected_artifact_contract.json
  validation_report.json
  validation_report.md
  jobs/<job-id>/
    request_profile/requests_profile.jsonl
    request_profile/requests_summary.json
    metrics/engine_metrics_timeline.jsonl
    metrics/gpu_metrics_timeline.jsonl
    metrics/metrics_summary.json
    launch/command.json
    launch/healthcheck.json
    diagnosis/bottleneck_diagnosis.json
    diagnosis/failure_classification.json
    report/operator_recommendation.json
Not every command writes every directory. The validator is the authority on which paths are required for a specific matrix or contract.
Network and safety model
InferGuard does not provision cloud resources and does not phone home by default. Runtime network calls are limited to user-supplied endpoints: OpenAI-compatible chat completions, Prometheus engine metrics, DCGM exporter metrics, prefill/decode/transfer metrics, or explicit telemetry audit commands after consent.
launch-engine is the only command that can spawn serving engine subprocesses. v0.7.1 launches workers in their own process group and registers them with shared SIGINT/SIGTERM cleanup so interrupted runs do not leave orphan vLLM/SGLang processes.
How to extend the architecture
- Add tests first for new artifact contracts, claim-status behavior, and CLI flags.
- Keep new public artifact fields additive or version the schema.
- Update CLI reference when command help changes.
- Update hardware coverage when matrix coverage changes.
- Keep private/pro-tier modules outside the OSS import graph; see the public repository.