Vendor-neutral inference optimization for production.

We benchmark, diagnose, and optimize AI inference across every serving engine and every GPU. Independent. No vendor lock-in. Kernel-level depth validated on H100, H200, B200, MI300X, and MI355X.

Validated Across Production Infrastructure
vLLM / Inferact · SGLang / RadixArk · TRT-LLM · NVIDIA Dynamo · LMCache · H100 / H200 / B200 · MI300X / MI355X

/0.1

Our Software

InferScope

26-tool MCP server for KV cache analysis, memory planning, and serving engine configuration. Connects to Cursor or Claude Desktop. Supports vLLM, SGLang, TRT-LLM, NVIDIA Dynamo, and LMCache with live Prometheus telemetry scraping.

Repository
/0.1
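Live telemetry scraping of the kind InferScope does starts with the engine's Prometheus `/metrics` endpoint. A minimal sketch of the parsing step, assuming a `vllm:`-prefixed metric namespace (the sample metric names below are illustrative and may vary by engine version):

```python
def parse_prom_metrics(text: str, prefix: str = "vllm:") -> dict:
    """Parse Prometheus text-exposition lines into {metric_name: value}.

    Keeps only metrics starting with `prefix`. Labels are dropped in
    this sketch, so labeled series with the same name overwrite.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip {label="..."} if present
        if name.startswith(prefix):
            try:
                metrics[name] = float(value)
            except ValueError:
                pass  # ignore non-numeric samples
    return metrics

# Against a live engine this would be fed by, e.g.:
#   urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
```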
ISB-1

5-phase inference serving benchmark. Capacity cliff detection, pressure degradation profiling, KV cache behavior analysis, and disaggregated transfer measurement. Validated across real workloads with reproducible methodology and statistical rigor.

Repository
/0.2
Kernel Track

ISA-level GPU kernel development targeting AMD CDNA4 (gfx950) and NVIDIA Blackwell (sm_100). Automated kernel generation, MFMA/WGMMA instruction scheduling, and FP4/FP8 quantization research. HIP and CUDA.

Repository
/0.3

/0.2

Research Focus

Applied research in inference performance. Not toy experiments. Every result is hardware-grounded, reproducible, and validated against production-scale deployments.

KV Cache Systems

Budget calculation across GQA and MLA architectures. Quantization strategy comparison (FP8, FP4, INT4). Eviction policies, prefix reuse, and session degradation under sustained load.

PagedAttention RadixAttention LMCache
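The GQA side of that budget math fits in a few lines: per-token KV footprint is 2 (K and V) × layers × KV heads × head dim × dtype width. The model shape below is a Llama-3-70B-style example for illustration; MLA needs a different formula entirely, since it caches a compressed latent per layer instead of full K/V heads.

```python
def kv_bytes_per_token_gqa(layers: int, kv_heads: int, head_dim: int,
                           dtype_bytes: int) -> int:
    """Per-token KV cache footprint for a GQA model.

    The factor of 2 covers the K and V tensors; each stores
    kv_heads * head_dim elements per layer.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(budget_bytes: int, bytes_per_token: int) -> int:
    """Tokens that fit in a given KV memory budget."""
    return budget_bytes // bytes_per_token

# Llama-3-70B-shaped GQA config (80 layers, 8 KV heads, head_dim 128), FP16:
per_tok = kv_bytes_per_token_gqa(layers=80, kv_heads=8, head_dim=128,
                                 dtype_bytes=2)   # 327,680 B = 320 KiB/token
fp8_tok = kv_bytes_per_token_gqa(80, 8, 128, 1)   # FP8 cache halves it

# A 40 GiB KV budget then holds 131,072 FP16 tokens:
capacity = max_cached_tokens(40 * 2**30, per_tok)
```

Quantizing the cache dtype shifts this budget directly, which is why dtype selection sits next to eviction policy in the list above.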

Disaggregated Inference

Prefill/decode separation, KV-aware routing, NIXL transfer bandwidth profiling, and CXL-attached memory tier analysis. Measured across NVIDIA Dynamo and vLLM disaggregated serving.

NIXL Dynamo CXL

Hardware-Aware Optimization

ISA-level kernel development for Hopper, Blackwell, and CDNA4. MFMA matrix core scheduling, inline assembly optimization, and LDS/shared memory tiling strategies for inference workloads.

sm_90a sm_100 gfx950

Serving Engine Configuration

Config compilation for vLLM, SGLang, TRT-LLM, and NVIDIA Dynamo. Automated launch flag generation. Tensor-parallel sizing, KV cache dtype selection, and GPU memory utilization tuning.

vLLM SGLang TRT-LLM
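Config compilation ultimately renders a tuned config into engine launch flags. A vLLM-flavored sketch (the flag names match vLLM's CLI; the config keys and values are illustrative, not recommendations):

```python
# Map internal config keys to vLLM CLI flags.
FLAG_MAP = {
    "tp_size": "--tensor-parallel-size",
    "kv_dtype": "--kv-cache-dtype",
    "gpu_mem_util": "--gpu-memory-utilization",
    "max_len": "--max-model-len",
}

def vllm_launch_cmd(cfg: dict) -> str:
    """Render a config dict into a `vllm serve` command line."""
    parts = ["vllm", "serve", cfg["model"]]
    for key, flag in FLAG_MAP.items():
        if key in cfg:
            parts += [flag, str(cfg[key])]
    return " ".join(parts)

cmd = vllm_launch_cmd({
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "tp_size": 4,          # illustrative values, not tuned output
    "kv_dtype": "fp8",
    "gpu_mem_util": 0.92,
    "max_len": 131072,
})
```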

Long-Context Workloads

128K+ context benchmarking for coding agents, RAG pipelines, and multi-turn chat. Capacity cliff analysis at each context length. KV cache pressure profiling under realistic arrival patterns.

Qwen3-Coder Kimi K2 DeepSeek R1

Neocloud Benchmarking

Client-side capacity profiling of Nebius, Fireworks, Together, and Groq. TTFT-based cliff detection without server metrics. Cross-provider comparison validated against self-hosted baselines.

Nebius Fireworks Together

/0.3

ISB-1 Benchmark Methodology

Five phases. Each answers a different question about your inference deployment. Run against your own endpoint or a neocloud API. First useful benchmark costs under $10.

01

Baseline

Throughput and latency at low load. Establishes "what's normal" for the deployment. Full Prometheus metrics on self-hosted, client-side TTFT/TPOT on APIs.
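Client-side TTFT/TPOT needs nothing beyond per-token arrival timestamps from a streaming response. A sketch of the arithmetic, assuming timestamps measured in seconds since the request was sent:

```python
def ttft_tpot(token_times):
    """TTFT and TPOT from per-token arrival timestamps.

    token_times: seconds since the request was sent, one entry per
    streamed token. TTFT is the first arrival; TPOT is the mean
    inter-token gap across the rest (None if only one token).
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0]
    if len(token_times) < 2:
        return ttft, None
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```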

02

Capacity Cliff

Binary search for maximum concurrent sessions at each context length before degradation. The core question: how many users can this actually serve before it breaks?
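The search itself is textbook; the cost is in the trial runs. A sketch where `passes_slo(n)` stands in for a real load trial at n concurrent sessions (a hypothetical callback, and the search assumes degradation is monotonic in n):

```python
def find_capacity_cliff(passes_slo, lo: int = 1, hi: int = 1024) -> int:
    """Binary-search the largest concurrency that still meets the SLO.

    passes_slo(n) -> bool runs a load trial at n concurrent sessions.
    Returns 0 if even `lo` fails; `hi` caps the search range.
    """
    if not passes_slo(lo):
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias up so the loop terminates
        if passes_slo(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```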

03

Pressure

Behavior at 125%, 150%, 200% of measured capacity. Classifies failure mode: preemption, swap, queue overflow, or graceful degradation.
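Classification can be as simple as checking which engine counters moved during the overload window. A sketch with illustrative metric deltas as keys, not exact engine metric names:

```python
def classify_failure_mode(deltas: dict) -> str:
    """Map overload-window metric deltas to a failure-mode label.

    Ordered checks: preemption dominates swap, which dominates
    queueing; if nothing moved, the engine degraded gracefully.
    """
    if deltas.get("preemptions", 0) > 0:
        return "preemption"
    if deltas.get("swapped_blocks", 0) > 0:
        return "swap"
    if deltas.get("queue_depth_growth", 0) > 0:
        return "queue overflow"
    return "graceful degradation"
```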

04

Cache Behavior

Cold start warmup curves, prefix reuse quantification, session degradation over time. Measures whether caching is actually helping or just adding complexity.
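Prefix reuse can be quantified from the client side alone: pair each first-seen prompt with a repeat that shares its prefix and compare TTFTs. A sketch of the ratio (function name and interface are hypothetical):

```python
import statistics

def prefix_reuse_speedup(cold_ttfts, warm_ttfts) -> float:
    """Median cold TTFT over median warm TTFT.

    cold_ttfts: TTFTs for first-seen prompts (cache miss).
    warm_ttfts: TTFTs for repeats sharing a cached prefix.
    A ratio near 1.0 means caching is adding complexity, not speed.
    """
    return statistics.median(cold_ttfts) / statistics.median(warm_ttfts)
```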

05

Offload & Disaggregation

KV cache tier transitions, disaggregated transfer overhead, NIXL bandwidth measurement, LMCache storage backend profiling. Self-hosted only.

26+ MCP diagnostic tools
7+ GPU platforms validated
5 serving engines supported
50+ metrics per benchmark run

/0.4

Kernel-Level Research

Automated kernel generation and ISA-level optimization. Not wrapper code. Direct instruction scheduling for inference-critical paths on current-generation silicon.

// EasyInference repository structure — proof of execution
EasyInference/
  products/
    isb1/ // benchmark standard, harness, configs, analysis
      harness/ // OpenAI-compatible replay, streaming TTFT/TPOT
      analysis/ // statistical validation, claim evaluation, plots
      workloads/ // agent, chat, coding, RAG, SWE-Bench
      quality/ // HumanEval, MMLU-Pro, ROUGE, RULER
    inferscope/ // operator diagnostics and probe tooling
      benchmarks/ // KV capacity probe, pressure ramp, disagg transfer
      engines/ // vLLM, SGLang, TRT-LLM, Dynamo, Atom
      hardware/ // GPU detection, roofline model, profile DB
      optimization/ // memory planner, workload classifier, recommender
      telemetry/ // Prometheus capture, normalizer, live scraping
      tools/ // 26 MCP tools: audit, diagnose, kv_cache, profiling

// Inferscope research repository
Inferscope/
  papers/ // 30 paper analyses: KV compression, disagg, speculative decoding
  competitors/ // 7 deep-dives: Fireworks, Groq, Modal, SambaNova, Together...
  docs/ // Dynamo index (2,700 files), LMCache v0.4.2, Hopper ISA ref
  semianalysis/ // Curated research: disagg serving, KV cache, Hopper arch

/0.5

Education

Technical knowledge shouldn't be gatekept. We publish research findings, practical guides, and short-form content that makes inference infrastructure accessible to everyone.

/0.5.1

Substack

Weekly deep-dives with real benchmark data, technical breakdowns, and case studies from production deployments. Free tier for the community. Paid tier for full datasets, methodology details, and enterprise briefings.

Subscribe
/0.5.2

Micro-Courses

Focused courses on Whop covering self-hosted AI setup, model selection, inference optimization, and deployment security. Each solves one problem in 30–90 minutes. Built for operators, not academics.

Browse courses
/0.5.3

Short-Form Video

TikTok, YouTube Shorts, and Instagram Reels. Comedy skits and educational content that explain inference concepts in 60 seconds. Distribution-first approach to technical education.

Watch
/0.5.4

Open Research

30+ published paper analyses covering KV compression, disaggregated serving, speculative decoding, and hardware benchmarks. 7 competitor deep-dives. All findings are open and reproducible.

View research

/0.6

Inference Consulting

We work directly with engineering teams running production inference. Vendor-neutral. Every recommendation is backed by benchmark data from your actual workloads on your actual hardware.

Inference Audit

Full ISB-1 benchmark of your serving deployment. Capacity cliff detection, KV cache efficiency analysis, and configuration review. Delivered as a scored report with prioritized fixes.

Optimization Sprint

2–4 week engagement. We profile, diagnose, and implement optimizations: serving engine tuning, memory planning, batch configuration, and kernel-level improvements where needed.

Fleet Assessment

Cross-cluster benchmarking for neoclouds and hyperscalers. Comparative analysis across GPU SKUs, serving engines, and model configurations. Data-driven fleet allocation recommendations.

Engine Migration

Guided migration between serving engines (vLLM, SGLang, TRT-LLM, Dynamo). Before-and-after benchmarking with capacity cliff comparison. Zero-downtime transition planning.

Kernel Development

Custom ISA-level kernel work for inference-critical paths. MFMA and WGMMA instruction scheduling, FP4/FP8 quantization kernels, and attention operator optimization for CDNA4 and Blackwell.

Data Center Co-Design

GPU selection modeling, CXL memory tier planning, and disaggregated inference architecture design. For teams building or expanding inference-optimized compute clusters.

Get in touch →

Vendor-neutral inference optimization. Backed by benchmark data.