LMCache Compatibility
InferGuard treats LMCache support as mode-specific evidence, not a single yes/no flag. LMCache can run embedded in an engine process, as a standalone multiprocess service, through controller-backed P2P sharing, or as part of disaggregated prefill. These modes expose different metrics and require different proof.
Architecture Priority
InferGuard's priority is the current LMCache architecture:
- Primary target: standalone MP. LMCache runs as lmcache server; vLLM connects with LMCacheMPConnector; telemetry comes from lmcache_mp_*, LMCache HTTP health/status endpoints, logs, and optional OTel/trace replay.
- Compatibility target: embedded/in-process. LMCache runs inside the engine through LMCacheConnectorV1 or the vLLM LMCache offload flag path; telemetry commonly appears as production lmcache:* metrics and inline engine logs.
- Detection/evidence-only until fixtures exist: P2P, disaggregated prefill, controller/internal API, logs, OTel, and trace recording. HTTP, log, OTel, and .lct inputs are now accepted as packet evidence, but they still need live golden fixtures and detector rules before they can support customer-facing claims by themselves.
The old LMCacheConnector v0-style string is not a priority. InferGuard should flag it as stale/unsupported unless the operator explicitly documents an older pinned stack.
Progress
The upstream tracker for this effort is
/Users/chen/Projects/Touchdown-Labs/docs/sdlc/188-2026-05-07-lmcache-inferguard-observability-source-of-truth.md.
As of that tracker, LMCache observability coverage remains 58 / 100. This
page documents parser, packet, report, and diagnosis behavior that exists in
InferGuard; it does not claim live-complete coverage. The next score movement
requires one clean live vLLM + standalone LMCache MP packet with metrics, HTTP,
logs, trace recording, fixture replay, and tests.
Support Levels
| Surface | Typical launch shape | Evidence InferGuard can use today | Support level |
|---|---|---|---|
| Standalone MP | lmcache server plus vLLM LMCacheMPConnector | LMCache /metrics with lmcache_mp_*, vLLM /metrics, launch/config artifacts | Partial, highest priority |
| Embedded vLLM LMCache | vLLM with LMCacheConnectorV1, LMCacheConnectorV1Dynamic, or --kv-offloading-backend lmcache | Engine /metrics, production lmcache:* metrics, launch/config artifacts, inline vLLM/LMCache logs | Partial, compatibility priority |
| Embedded SGLang LMCache | SGLang --enable-lmcache using LMCacheLayerwiseConnector through SGLang's radix cache | SGLang /metrics, aggregate sglang:cache_hit_rate, HiCache/storage metrics when present, LMCache config/log evidence | Partial, compatibility priority |
| P2P sharing | multiple engines, enable_p2p, controller, NIXL | production lmcache:* P2P metrics when present; logs can be parsed as conservative packet evidence and surfaced by diagnosis | Parser/report partial; live proof missing |
| Disaggregated prefill | prefiller/decoder roles using NIXL | launch/config artifacts and NIXL/PD log hints can be parsed as conservative packet evidence and surfaced by diagnosis | Parser/report partial; live proof missing |
| CacheBlend | blend-mode lookups/retrieve/store with lmcache_blend_* and cb.* spans | lmcache_blend_* metrics are normalized, CacheBlend OTel spans are parsed, and report/diagnosis can surface CacheBlend finding codes | Parser/report partial; live proof missing |
| Lookup-hash JSONL | lookup_hashes_*.jsonl with redacted key-shape metadata | privacy-bounded parser redacts raw hashes and preserves request/model/chunk-shape summaries; packet/report plumbing accepts lookup-hash evidence | Parser/report partial; live proof missing |
| Controller / internal API | lmcache_controller or internal API server | not collected as a structured packet yet | Planned |
| Logs | engine and LMCache logs | copied into packets and parsed for conservative LMCache hints | Partial |
| OTel spans | MP tracing exported to operator-supplied JSONL | parsed into LMCache OTel evidence for mp.store, mp.retrieve, and mp.lookup_prefetch | Partial |
| Trace recording .lct | MP --trace-level storage binary trace recording | parsed as LMCache trace evidence; malformed traces are recorded without aborting packet creation | Partial |
| Trace replay metadata | lmcache trace info / replay JSON, JSONL, and CSV summaries | replay info, JSON, JSONL, and trace_replay_ops.csv evidence can be parsed and surfaced in packet/report/diagnosis flows | Parser/report partial; live proof missing |
What lmcache-compat Does Today
inferguard lmcache-compat compares available Prometheus text against known LMCache and vLLM metric families. It reports whether each family is:
- populated: present with non-zero data;
- zero: present but all values are zero;
- missing: absent from the input;
- partial: some required families are present and others are not.
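The statuses above can be illustrated with a minimal Python sketch. This is not InferGuard's implementation: the function names, the suffix handling, the "present" summary label, and the metric names in the usage note are all assumptions made for illustration, and the parser assumes sample lines carry no trailing timestamp.

```python
from collections import defaultdict

# Suffixes that Prometheus appends to histogram, summary, and counter samples.
SUFFIXES = ("_bucket", "_sum", "_count", "_total")


def strip_suffix(name):
    """Return the family name with one known sample suffix removed."""
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def parse_samples(prom_text):
    """Yield (metric_name, value) pairs from Prometheus text exposition."""
    for line in prom_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the value is the last token.
        head, _, value = line.rpartition(" ")
        name = head.split("{", 1)[0].strip()
        try:
            yield name, float(value)
        except ValueError:
            continue


def classify_families(prom_text, families):
    """Map each family to 'populated', 'zero', or 'missing'."""
    seen = defaultdict(list)
    for name, value in parse_samples(prom_text):
        seen[name].append(value)
        base = strip_suffix(name)
        if base != name:
            seen[base].append(value)
    status = {}
    for family in families:
        values = seen.get(family, [])
        if not values:
            status[family] = "missing"
        elif any(v != 0 for v in values):
            status[family] = "populated"
        else:
            status[family] = "zero"
    return status


def required_summary(status, required):
    """Return 'partial' when only some required families are present."""
    present = [f for f in required if status.get(f, "missing") != "missing"]
    if not present:
        return "missing"
    # 'present' is a hypothetical label for the all-found case.
    return "present" if len(present) == len(required) else "partial"
```

For example, with made-up family names, a text containing `lmcache_mp_demo_reads_total 3` and `lmcache_mp_demo_l2_bytes_bucket{le="+Inf"} 0` classifies `lmcache_mp_demo_reads_total` as populated, `lmcache_mp_demo_l2_bytes` as zero, and any family that never appears as missing.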
Use it when you have one or both of the following metric inputs:
- engine metrics from vLLM/SGLang/Dynamo-compatible Prometheus endpoints;
- LMCache metrics from a standalone MP server or embedded production endpoint.

It also accepts optional LMCache HTTP, .lct, and OTel evidence JSON files produced by collect-lmcache or equivalent local parsing.
For MP runs, prefer packet collection first:
inferguard collect-lmcache \
--output-dir modal-out/lmcache-packet \
--engine-metrics-file vllm.prom \
--lmcache-metrics-file lmcache.prom \
--lmcache-http-base-url http://localhost:7000 \
--lmcache-http-thread-name eviction \
--lmcache-log-file lmcache.log \
--lmcache-trace-file lmcache-trace.lct \
--lmcache-otel-file lmcache-otel.json \
--expect-mode mp \
--mp-trace-recording-enabled \
--mp-tracing-enabled
collect-lmcache fetches safe read-only MP HTTP routes from the base URL and
records destructive routes such as cache clearing and metrics reset as skipped
evidence rather than invoking them.
Example:
inferguard lmcache-compat \
--engine-metrics-file vllm.prom \
--lmcache-metrics-file lmcache.prom \
--lmcache-http-evidence-file lmcache_http_evidence.json \
--lmcache-trace-evidence-file lmcache_trace_evidence.json \
--lmcache-otel-evidence-file lmcache_otel_evidence.json \
--expect-mode mp \
--fail-on missing-required
Use --l2-configured only when the run actually configured an MP L2 adapter.
Without that flag, L2 families are reported as not_applicable so an L1-only
lab is not treated as a failed L2 proof.
For MP runs, pass the observability settings from the launch command when they are known:
inferguard lmcache-compat \
--lmcache-metrics-file lmcache.prom \
--expect-mode mp \
--mp-prometheus-port 9090 \
--mp-event-bus-queue-size 10000 \
--mp-metrics-sample-rate 0.01 \
--mp-tracing-enabled
The JSON report includes an lmcache_mp_observability section with
service_instance_ids from Prometheus target_info, cache_salt cardinality,
L2 adapter labels, EventBus tail-drop risk, sampled-histogram sparsity, and
whether metrics/tracing/logging were disabled by config.
Testing the Packet A missing-family failure mode
Before a live Packet A fixture is accepted, developers can test the strict
failure path with the non-scoreable fixture at
tests/fixtures/lmcache_live/packet_a_missing_prometheus/:
PACKET=tests/fixtures/lmcache_live/packet_a_missing_prometheus
inferguard lmcache-compat \
--engine-metrics-file "$PACKET/vllm_metrics_loaded.prom" \
--lmcache-metrics-file "$PACKET/lmcache_metrics_loaded.prom" \
--lmcache-http-evidence-file "$PACKET/lmcache_http_evidence.json" \
--lmcache-log-evidence-file "$PACKET/lmcache_log_evidence.json" \
--lmcache-lookup-hash-evidence-file "$PACKET/lmcache_lookup_hash_evidence.json" \
--expect-mode mp \
--fail-on missing-required \
--json
This fixture proves only the diagnostic shape. It should show detected_mode=mp,
then fail on missing required Prometheus families such as lookup_tokens and
l1_memory. The bundled HTTP/log/lookup-hash evidence is alternate live-shaped
evidence, not scoreable replacement metrics. Keep coverage at 58 / 100 until
a real Packet A run exports those Prometheus families, is imported as an
accepted compact fixture, and passes tests.
diagnose-bottleneck reads metrics/lmcache_compat_report.json and now
promotes user-facing LMCache finding codes for MP logs, CacheBlend, P2P, PD,
trace replay, and lookup-hash surfaces when those findings are present in the
report. It also reads metrics/lmcache_log_evidence.json from collected
packets and can emit conservative log-backed diagnoses such as
lmcache_log_p2p_evidence_present,
lmcache_log_pd_evidence_present, and lmcache_log_stale_connector. These are
inferred unless paired with measured Prometheus/HTTP/trace evidence.
Metric Surfaces
Embedded / Production lmcache:*
The embedded and production surface commonly includes:
- request counters such as lmcache:num_retrieve_requests, lmcache:num_store_requests, and lmcache:num_lookup_requests;
- token counters such as lmcache:num_requested_tokens, lmcache:num_hit_tokens, lmcache:num_lookup_tokens, and lmcache:num_lookup_hits;
- hit-rate gauges/histograms;
- retrieve/store latency and speed histograms;
- local CPU, remote backend, memory-management, health, and chunk-statistics metrics;
- P2P transfer metrics such as lmcache:num_p2p_requests, lmcache:num_p2p_transferred_tokens, lmcache:p2p_time_to_transfer, and lmcache:p2p_transfer_speed.
Prometheus exporters may normalize colons to underscores depending on the scrape path. InferGuard preserves unknown LMCache-like metric names so new upstream metrics are not discarded.
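A minimal sketch of that normalization and the keep-unknown behavior, in Python. The helper names are hypothetical, the sketch assumes the colon-to-underscore rewrite is the only difference between scrape paths, and `lmcache:some_future_metric` in the usage note is a made-up name standing in for a new upstream metric.

```python
def normalize_metric_name(name):
    """Rewrite colons to underscores so both scrape shapes compare equal."""
    return name.replace(":", "_")


def is_lmcache_like(name):
    """True for names that look like LMCache metrics in either shape."""
    return normalize_metric_name(name).startswith("lmcache_")


def split_known_unknown(names, known):
    """Split LMCache-like names into recognized and preserved-unknown lists.

    Unknown LMCache-like names are kept rather than dropped, so a metric
    added upstream still shows up in the output.
    """
    known_normalized = {normalize_metric_name(k) for k in known}
    recognized, preserved = [], []
    for name in names:
        if not is_lmcache_like(name):
            continue  # engine or unrelated metrics are handled elsewhere
        if normalize_metric_name(name) in known_normalized:
            recognized.append(name)
        else:
            preserved.append(name)
    return recognized, preserved
```

Given the known names `lmcache:num_hit_tokens` and `lmcache:num_lookup_hits`, the inputs `lmcache:num_hit_tokens` and `lmcache_num_lookup_hits` are both recognized, while `lmcache:some_future_metric` lands in the preserved list instead of being discarded.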
For vLLM embedded mode, the current source-backed connector strings are:
- LMCacheConnectorV1;
- LMCacheConnectorV1Dynamic with kv_connector_module_path="lmcache.integration.vllm.lmcache_connector_v1";
- --kv-offloading-backend lmcache / kv_offloading_backend="lmcache" as launch/config evidence for the vLLM LMCache offload path;
- legacy LMCacheConnector, which InferGuard should treat as stale/pinned evidence unless the operator documents an old stack.
For SGLang embedded mode, current mainline source evidence points to:
- python -m sglang.launch_server --enable-lmcache;
- SGLang LMCRadixCache;
- LMCache LMCacheLayerwiseConnector;
- SGLang metrics such as sglang:cache_hit_rate, queue gauges, HiCache host-token gauges, KV-transfer histograms, and storage metrics.
No current-mainline SGLang MP connector contract has been proven yet. InferGuard
must not mark SGLang MP as supported until source and a live fixture prove it.
SGLang HiCache-only metrics are not LMCache proof; InferGuard keeps them as
SGLang cache/storage context unless --enable-lmcache,
LMCacheLayerwiseConnector, LMCRadixCache, or lmcache:* evidence is also
present.
Standalone MP lmcache_mp_*
Standalone MP mode uses lmcache server and exposes lmcache_mp_* metrics from the LMCache server endpoint. Important families include:
- StorageManager read/write counters;
- L1 read/write/eviction counters;
- L1 chunk lifecycle histograms;
- real-reuse histograms with cache_salt;
- L2 store/prefetch counters and throughput histograms;
- lookup requested/hit token counters with model_name and cache_salt;
- L0 GPU block lifecycle histograms;
- L0-L1 throughput histograms;
- engine loaded-chunk counters;
- observable gauges for active prefetch jobs, L1 memory, and in-flight L2 work.
MP runs often produce a mix of populated, zero, and missing families. For example, an L1-only run should not be expected to populate L2 throughput. A run that populates StorageManager/L1 counters but never emits lookup token counters is an integration or workload question, not automatic proof that caching failed.
MP metrics are sampled in several places. Lifecycle and throughput histograms default to a 1% sample rate, while counters count all events. Missing sampled histograms should be explained separately from missing always-counted counters. EventBus is also bounded; if EventBus self-metrics are absent, InferGuard flags tail-drop observability risk instead of pretending drops are impossible.
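To see why a missing sampled histogram is weaker evidence than a missing counter, a short Python calculation helps. It assumes each event is sampled independently at the configured rate, which is a simplification; the actual sampling mechanism in LMCache may differ, and the function names are hypothetical.

```python
def expected_samples(events, sample_rate=0.01):
    """Expected number of histogram observations for a given event count."""
    return events * sample_rate


def prob_no_samples(events, sample_rate=0.01):
    """Probability that a sampled histogram stays empty after `events` events.

    Assumes independent per-event sampling (a Bernoulli trial per event).
    """
    return (1.0 - sample_rate) ** events
```

Under that assumption, a short run of 100 events at a 1% rate leaves the histogram empty with probability of about 0.37, so an empty lifecycle histogram after a brief warmup is unsurprising. A counter over the same 100 events would read 100, which is why the two kinds of absence deserve separate explanations.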
Mode Detection Rules
InferGuard should interpret the packet conservatively:
- lmcache_mp_* present: likely standalone MP.
- lmcache:* or lmcache_ present without lmcache_mp_*: likely embedded/production surface.
- P2P metrics or P2P connection logs present: P2P candidate.
- prefiller/decoder connector roles or PD config present: disaggregated-prefill candidate.
- controller-only API responses without engine/cache metrics: controller-only packet, not cache performance proof.
Do not claim full LMCache compatibility from one metric prefix. A complete packet should include launch/config evidence and the relevant metrics/logs for the mode under test.
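The detection rules above can be sketched as a small Python function that returns candidates rather than a verdict. The function name, the returned labels, the "p2p" substring check, and the `kv_producer` / `kv_consumer` config markers are assumptions chosen for illustration, not InferGuard's actual detector rules.

```python
# Assumed config markers for prefiller/decoder roles; adjust to the stack in use.
PD_MARKERS = ("kv_producer", "kv_consumer")


def detect_mode_candidates(metric_names, log_lines=(), config_text="",
                           controller_responses=False):
    """Return a list of candidate LMCache modes for a packet."""
    names = list(metric_names)
    has_mp = any(n.startswith("lmcache_mp_") for n in names)
    has_lmcache = any(n.startswith(("lmcache:", "lmcache_")) for n in names)

    candidates = []
    if has_mp:
        candidates.append("standalone_mp")
    elif has_lmcache:
        # LMCache metrics without any lmcache_mp_* series.
        candidates.append("embedded")

    p2p_metric = any("p2p" in n.lower() for n in names)
    p2p_log = any("p2p" in line.lower() for line in log_lines)
    if p2p_metric or p2p_log:
        candidates.append("p2p")

    if any(marker in config_text for marker in PD_MARKERS):
        candidates.append("disaggregated_prefill")

    if controller_responses and not names:
        # Controller data alone says nothing about cache performance.
        candidates.append("controller_only")
    return candidates
```

Returning a list keeps the result conservative: a packet with lmcache:num_p2p_requests yields both an embedded and a P2P candidate, and the caller still has to check launch/config evidence before making a claim.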
Required Evidence By Mode
Embedded
- launch command or config showing LMCacheConnectorV1 or LMCache offload flags;
- consistent PYTHONHASHSEED across participating processes;
- engine /metrics output;
- lmcache:* or normalized LMCache metrics;
- logs showing first-request store and repeated-request retrieve when available.
Standalone MP
- lmcache server command and config;
- vLLM connector config showing LMCacheMPConnector;
- LMCache healthcheck;
- vLLM healthcheck;
- LMCache /metrics;
- vLLM /metrics;
- LMCache server logs;
- optional OTel/trace replay artifacts if tracing is in scope.
Important source-backed caveat: current vLLM LMCacheMPConnector does not
export connector-specific Prometheus metrics through vLLM because its
build_prom_metrics() implementation returns None. For MP, the required
cache observability source is the standalone LMCache server, not the vLLM
connector metrics surface.
P2P
- at least two engine launch commands;
- controller launch command;
- P2P config with instance IDs, peer init/lookup ports, controller URLs, and transfer channel;
- NIXL/RDMA/TCP mode;
- consistent PYTHONHASHSEED;
- peer connection logs;
- cross-engine retrieval logs;
- P2P metrics when exposed.
Disaggregated Prefill
- prefiller and decoder launch commands;
- producer/consumer connector roles;
- NIXL config and ports;
- proxy/router behavior if used;
- transfer success evidence;
- TTFT comparison against baseline.
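The per-mode checklists above lend themselves to a simple lookup table. The Python sketch below uses hypothetical evidence labels that paraphrase the unconditional items; entries qualified as optional, "when available", "when exposed", or "if used" are deliberately left out, and none of these names come from InferGuard itself.

```python
# Hypothetical labels for the unconditional checklist items in each mode.
REQUIRED_EVIDENCE = {
    "embedded": [
        "launch_or_config",
        "consistent_pythonhashseed",
        "engine_metrics",
        "lmcache_metrics",
    ],
    "mp": [
        "lmcache_server_command_and_config",
        "vllm_connector_config",
        "lmcache_healthcheck",
        "vllm_healthcheck",
        "lmcache_metrics",
        "vllm_metrics",
        "lmcache_server_logs",
    ],
    "p2p": [
        "engine_launch_commands",
        "controller_launch_command",
        "p2p_config",
        "transfer_mode",
        "consistent_pythonhashseed",
        "peer_connection_logs",
        "cross_engine_retrieval_logs",
    ],
    "disaggregated_prefill": [
        "prefiller_decoder_launch_commands",
        "connector_roles",
        "nixl_config_and_ports",
        "transfer_success_evidence",
        "ttft_baseline_comparison",
    ],
}


def missing_evidence(mode, provided):
    """Return required labels for `mode` that are absent from `provided`."""
    have = set(provided)
    return [item for item in REQUIRED_EVIDENCE[mode] if item not in have]
```

A packet that carries only the two metrics files for an MP run would report five missing items, which makes the gap explicit before anyone reads the metrics as proof.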
Known Gaps
Current InferGuard support is not 100% LMCache compatible. The highest-priority gaps are MP-first:
- Add a clean full MP golden fixture with LMCache MP metrics, HTTP evidence, logs, and optional .lct / OTel evidence.
- Add an MP L2 live fixture.
- Calibrate detector rules for MP mode mismatch, missing MP metrics, zero-hit-after-warmup, hash-seed risk, missing lookup counters, L1 pressure, L2 stalls, EventBus drop risk, and trace/OTel evidence gaps.
- Expand structured log parsing for MP store/retrieve lifecycle proof.
- Validate embedded vLLM and SGLang connector classification against live fixtures. Fixture-backed parser support exists for vLLM LMCacheConnectorV1Dynamic / kv_offloading_backend=lmcache, stale LMCacheConnector, SGLang --enable-lmcache / LMCacheLayerwiseConnector / LMCRadixCache, and HiCache-only separation.
- Add P2P mode detection and P2P metric normalization.
- Add controller and internal API collection.
- Add live compatibility fixtures for embedded, P2P, PD, OTel, and trace recording.
Until those are complete, InferGuard should describe LMCache findings as evidence levels:
- supported: enough telemetry exists for the claim;
- partial: telemetry proves some behavior but not the full claim;
- missing_signal: required telemetry is absent;
- inferred_without_engine_metrics: workload shape suggests a result but live engine/cache evidence is missing.
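One way to make these levels concrete is an enum plus a small decision function. The Python sketch below is an assumption about how the levels could be assigned from a set of required and present signals; the `EvidenceLevel` class, the `evidence_level` function, and the `workload_hint` parameter are hypothetical, and the signal names in the usage note are placeholders.

```python
from enum import Enum


class EvidenceLevel(str, Enum):
    SUPPORTED = "supported"
    PARTIAL = "partial"
    MISSING_SIGNAL = "missing_signal"
    INFERRED = "inferred_without_engine_metrics"


def evidence_level(required, present, workload_hint=False):
    """Pick an evidence level for one claim.

    required: signals the claim needs.
    present: signals actually found in the packet.
    workload_hint: True when only the workload shape suggests the result.
    """
    needed = set(required)
    found = needed & set(present)
    if needed and found == needed:
        return EvidenceLevel.SUPPORTED
    if found:
        return EvidenceLevel.PARTIAL
    if workload_hint:
        return EvidenceLevel.INFERRED
    return EvidenceLevel.MISSING_SIGNAL
```

With required signals `{"a", "b"}`, a packet containing both is supported, a packet containing only `"a"` is partial, and an empty packet is missing_signal unless a workload hint downgrades it to inferred_without_engine_metrics.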
Modal MP Lab Reference
The 2026-05-06 Modal lab validated standalone MP telemetry, not P2P:
- LMCache server ran separately.
- vLLM used LMCacheMPConnector.
- LMCache and vLLM healthchecks passed.
- The latest loaded metrics contained 136 lmcache_mp series, 117 of them non-zero.
- StorageManager, L1, L1 lifecycle, and L0 lifecycle families populated.
- Lookup token counters, real-reuse histograms, L2 families, throughput families, and some gauges did not populate in that L1-only workload/config path.
- diagnose-bottleneck now reads lmcache_compat_report.json and emits specific missing-signal rules, such as lmcache_mp_lookup_counters_missing, instead of only a generic insufficient-evidence result.
This is exactly the kind of packet lmcache-compat is intended to make inspectable before anyone makes a customer or upstream claim.