We benchmark, diagnose, and optimize AI inference across every serving engine and every GPU. Independent. No vendor lock-in. Kernel-level depth validated on H100, H200, B200, MI300X, and MI355X.
/0.1
26-tool MCP server for KV cache analysis, memory planning, and serving engine configuration. Connects to Cursor or Claude Desktop. Supports vLLM, SGLang, TRT-LLM, NVIDIA Dynamo, and LMCache with live Prometheus telemetry scraping.
Repository →
5-phase inference serving benchmark. Capacity cliff detection, pressure degradation profiling, KV cache behavior analysis, and disaggregated transfer measurement. Validated across real workloads with reproducible methodology and statistical rigor.
Repository →
ISA-level GPU kernel development targeting AMD CDNA4 (gfx950) and NVIDIA Blackwell (sm_100). Automated kernel generation, MFMA/WGMMA instruction scheduling, and FP4/FP8 quantization research. HIP and CUDA.
Repository →
/0.2
Applied research in inference performance. Not toy experiments. Every result is hardware-grounded, reproducible, and validated against production-scale deployments.
Budget calculation across GQA and MLA architectures. Quantization strategy comparison (FP8, FP4, INT4). Eviction policies, prefix reuse, and session degradation under sustained load.
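That budget arithmetic can be sketched to first order. The helper names and model shapes below are illustrative assumptions (Llama-3-70B-like GQA, DeepSeek-V3-like MLA), not measured results:

```python
def kv_bytes_per_token_gqa(n_layers, n_kv_heads, head_dim, dtype_bytes):
    # GQA caches full K and V tensors per layer: hence the factor of 2
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def kv_bytes_per_token_mla(n_layers, latent_dim, rope_dim, dtype_bytes):
    # MLA caches one compressed latent plus a decoupled RoPE key per layer
    return n_layers * (latent_dim + rope_dim) * dtype_bytes

def sessions_in_budget(budget_gib, bytes_per_token, context_len):
    # How many full-context sessions fit in a given KV cache budget
    return int(budget_gib * 2**30) // (bytes_per_token * context_len)

# Llama-3-70B-like GQA shape: 80 layers, 8 KV heads, head_dim 128, FP16
gqa = kv_bytes_per_token_gqa(80, 8, 128, 2)   # 327,680 B/token
# DeepSeek-V3-like MLA shape: 61 layers, 512-d latent, 64-d RoPE key, FP16
mla = kv_bytes_per_token_mla(61, 512, 64, 2)  # 70,272 B/token
print(sessions_in_budget(40, gqa, 32_768))    # 4 full 32K sessions in 40 GiB
print(sessions_in_budget(40, mla, 32_768))
```

The same arithmetic is what makes quantization comparisons concrete: switching the KV dtype from FP16 to FP8 halves `dtype_bytes` and roughly doubles the session count at a fixed budget.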
Prefill/decode separation, KV-aware routing, NIXL transfer bandwidth profiling, and CXL-attached memory tier analysis. Measured across NVIDIA Dynamo and vLLM disaggregated serving.
ISA-level kernel development for Hopper, Blackwell, and CDNA4. MFMA matrix core scheduling, inline assembly optimization, and LDS/shared memory tiling strategies for inference workloads.
Config compilation for vLLM, SGLang, TRT-LLM, and NVIDIA Dynamo. Automated launch flag generation. Tensor-parallel sizing, KV cache dtype selection, and GPU memory utilization tuning.
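In spirit, that compilation step reduces a few sizing decisions to a launch command. A minimal sketch, assuming vLLM as the target (the flag names match vLLM's CLI; the values and helper name are illustrative placeholders, not tuned recommendations):

```python
def vllm_launch_cmd(model, tp_size, kv_dtype="fp8", mem_util=0.90,
                    max_model_len=131072):
    """Assemble a vLLM serve command from a handful of sizing decisions."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--kv-cache-dtype", kv_dtype,               # fp8 halves KV footprint
        "--gpu-memory-utilization", str(mem_util),  # headroom vs. capacity
        "--max-model-len", str(max_model_len),
    ]

cmd = vllm_launch_cmd("meta-llama/Llama-3.1-70B-Instruct", tp_size=4)
print(" ".join(cmd))
```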
128K+ context benchmarking for coding agents, RAG pipelines, and multi-turn chat. Capacity cliff analysis at each context length. KV cache pressure profiling under realistic arrival patterns.
Client-side capacity profiling of Nebius, Fireworks, Together, and Groq. TTFT-based cliff detection without server metrics. Cross-provider comparison validated against self-hosted baselines.
/0.3
Five phases. Each answers a different question about your inference deployment. Run against your own endpoint or a neocloud API. First useful benchmark costs under $10.
Throughput and latency at low load. Establishes "what's normal" for the deployment. Full Prometheus metrics on self-hosted, client-side TTFT/TPOT on APIs.
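The client-side numbers need nothing from the server: both fall out of streamed token arrival timestamps. A minimal sketch (function name is ours, timings are made up):

```python
def ttft_tpot(request_start, token_times):
    """Client-side latency metrics from streamed token timestamps (seconds).
    TTFT: delay until the first token; TPOT: mean gap between later tokens."""
    ttft = token_times[0] - request_start
    tpot = ((token_times[-1] - token_times[0]) / (len(token_times) - 1)
            if len(token_times) > 1 else 0.0)
    return ttft, tpot

# First token 420 ms after the request, then one token every 25 ms
times = [0.420 + 0.025 * i for i in range(50)]
ttft, tpot = ttft_tpot(0.0, times)
print(round(ttft, 3), round(tpot, 3))  # 0.42 0.025
```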
Binary search for maximum concurrent sessions at each context length before degradation. The core question: how many users can this actually serve before it breaks?
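The search itself is straightforward; the engineering is in the probe. A sketch of the idea, where `passes_slo` would wrap a real load test at one fixed context length (the stand-in probe below is hypothetical):

```python
def find_capacity_cliff(passes_slo, lo=1, hi=1024):
    """Largest concurrency that still meets the latency SLO.
    passes_slo(n) runs a load test at n concurrent sessions and
    returns True while the deployment stays within target."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if passes_slo(mid):
            best, lo = mid, mid + 1   # still healthy: search higher
        else:
            hi = mid - 1              # degraded: search lower
    return best

# Stand-in probe: pretend the deployment degrades past 37 sessions
print(find_capacity_cliff(lambda n: n <= 37))  # 37
```

This assumes degradation is monotonic in concurrency, which is what makes binary search valid and keeps the benchmark cost logarithmic rather than linear in the search range.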
Behavior at 125%, 150%, and 200% of measured capacity. Classifies the failure mode: preemption, swap, queue overflow, or graceful degradation.
Cold start warmup curves, prefix reuse quantification, session degradation over time. Measures whether caching is actually helping or just adding complexity.
KV cache tier transitions, disaggregated transfer overhead, NIXL bandwidth measurement, LMCache storage backend profiling. Self-hosted only.
/0.4
Automated kernel generation and ISA-level optimization. Not wrapper code. Direct instruction scheduling for inference-critical paths on current-generation silicon.
/0.5
Technical knowledge shouldn't be gatekept. We publish research findings, practical guides, and short-form content that makes inference infrastructure accessible to everyone.
Weekly deep-dives with real benchmark data, technical breakdowns, and case studies from production deployments. Free tier for the community. Paid tier for full datasets, methodology details, and enterprise briefings.
Subscribe →
Focused courses on Whop covering self-hosted AI setup, model selection, inference optimization, and deployment security. Each solves one problem in 30–90 minutes. Built for operators, not academics.
Browse courses →
TikTok, YouTube Shorts, and Instagram Reels. Comedy skits and educational content that explain inference concepts in 60 seconds. Distribution-first approach to technical education.
Watch →
30+ published paper analyses covering KV compression, disaggregated serving, speculative decoding, and hardware benchmarks. 7 competitor deep-dives. All findings are open and reproducible.
View research →
/0.6
We work directly with engineering teams running production inference. Vendor-neutral. Every recommendation is backed by benchmark data from your actual workloads on your actual hardware.
Full ISB-1 benchmark of your serving deployment. Capacity cliff detection, KV cache efficiency analysis, and configuration review. Delivered as a scored report with prioritized fixes.
2–4 week engagement. We profile, diagnose, and implement optimizations: serving engine tuning, memory planning, batch configuration, and kernel-level improvements where needed.
Cross-cluster benchmarking for neoclouds and hyperscalers. Comparative analysis across GPU SKUs, serving engines, and model configurations. Data-driven fleet allocation recommendations.
Guided migration between serving engines (vLLM, SGLang, TRT-LLM, Dynamo). Before-and-after benchmarking with capacity cliff comparison. Zero-downtime transition planning.
Custom ISA-level kernel work for inference-critical paths. MFMA and WGMMA instruction scheduling, FP4/FP8 quantization kernels, and attention operator optimization for CDNA4 and Blackwell.
GPU selection modeling, CXL memory tier planning, and disaggregated inference architecture design. For teams building or expanding inference-optimized compute clusters.