Automated CUDA is almost here. Revenue and capability per GPU are doubling.

TRL OpenEnv integration

Unsloth RL guide

unsloth.ai

Northeastern: Small Collab Leads to Big Win at the 2026 OpenEnv Hackathon

siliconvalley.northeastern.edu

Perplexity inference systems.

Improving Unigram Tokenizer CPU Performance

pplx-garden open-source inference garden

RDMA Point-to-Point Communication for LLM Systems

fabric-lib: RDMA Point-to-Point Communication for LLM Systems

Disaggregated Prefill and Decode

Weight Transfer for RL Post-Training in under 2 seconds

Enabling Trillion-Parameter Models on AWS EFA

Mercor.

RL environments and the future of AI work

mercor.com

Introducing APEX-Agents

mercor.com

APEX-Agents dataset (Hugging Face)

Archipelago: open-source agent-evaluation harness

Expert data drives model performance

mercor.com

RL environments as shared infrastructure.

Prime Intellect: The Open Stack for Self-Improving Agents

Prime Intellect Environments Hub

Environments Hub: A Community Hub To Scale RL To Open AGI

Prime Intellect verifiers docs

Prime Intellect verifiers overview

Prime Intellect verifiers GitHub repository

Will Brown research page

willcb.com

Prime Intellect prime-rl GitHub

Prime Intellect prime-rl trajectories docs

Prime Intellect prime-rl environments docs

PRIME-RL: Async & Decentralized RL Training at Scale

Prime Intellect: Recursive Language Models

Recursive Language Models technical report

Prime Intellect Lab architecture

INTELLECT-2: globally distributed RL training

INTELLECT-2 technical report

INTELLECT-3 technical report

Jack Min Ong Google Scholar profile

scholar.google.com

Jack Min Ong OpenReview profile

TOPLOC: trustless verifiable inference

Prime Collective Communications Library technical report

NVIDIA Dynamo users featuring Prime Intellect

NVIDIA Dynamo developer page

docs.dynamo.nvidia.com · docs.nvidia.com

NVIDIA Dynamo introduction

docs.dynamo.nvidia.com

NVIDIA Dynamo disaggregated serving guide

docs.dynamo.nvidia.com

NVIDIA Dynamo LMCache integration

docs.dynamo.nvidia.com

NVIDIA Dynamo TensorRT-LLM backend

slime: SGLang-native post-training framework

slime speculative decoding docs

thudm.github.io

THUDM/slime GitHub repository

SGLang for RL systems

sgl-project.github.io

SGLang speculative decoding docs

docs.sglang.io

Miles GitHub repository

LMSYS: Introducing Miles for large-scale MoE RL

LMSYS: ROCm support for Miles RL on AMD

LMSYS: DeepSeek-V4 on day zero with SGLang and Miles

LMSYS: SGLang + AMD MoRI on MI355X for DeepSeek-R1

ROCm docs: SGLang MoRI distributed inference on MI355X

rocmdocs.amd.com

SPECTRE: Hybrid ordinary-parallel speculative serving

AMD ROCm: Day-0 support for slime + TritonForge

DeepCoder release writeup

DeepCoder-14B-Preview model card

Profiling and compiler-path evidence.

Hugging Face: Profiling in PyTorch

PyTorch profiler documentation

pytorch.org

PyTorch torch.compile documentation

pytorch.org

PyTorch: Why is torch.compile so fast? Kernel fusion

pytorch.org

NVIDIA Nsight Systems documentation

NVIDIA Nsight Compute documentation

NVIDIA CUDA C Programming Guide

NVIDIA CUDA Binary Utilities

NVIDIA PTX ISA documentation

NVIDIA NVRTC runtime compilation documentation

GPU MODE: PTX/SASS level review (May 28, 2026)

NVIDIA CUDA Toolkit 13.3 release notes

NVIDIA technical blog: CUDA 13.3, Tile C++, CompileIQ, and CUDA Python 1.0

CUDA Tile C++ API Reference 13.3 release notes

OpenAI: Introducing Triton 1.0

openai.com

Triton paper: an intermediate language and compiler for tiled neural network computations

eecs.harvard.edu

Triton matrix multiplication tutorial

triton-lang.org

JAX Pallas kernel language documentation

docs.jax.dev

Apache TVM TensorIR documentation

tvm.apache.org

TileLang paper and documentation

arxiv.org · tilelang.com

CuPy RawKernel documentation

docs.cupy.dev

Numba-CUDA documentation

nvidia.github.io

Kokkos programming model

kokkos.org

SYCL 2020 specification

khronos.org

Commercial serving paths.

Inferact official site

inferact.ai

TechCrunch: Inferact commercializes vLLM

techcrunch.com

vLLM GitHub repository

vLLM speculative decoding docs

vLLM Speculators v0.5.0: DFlash Support and Online Training

vllm.ai

EAGLE: speculative sampling and feature uncertainty

EAGLE-2: dynamic draft trees

EAGLE-3: training-time test and multi-layer feature fusion

vLLM Speculators: EAGLE-3 algorithm docs

vLLM Speculators docs: EAGLE-3 user guide

DFlash: Diffusion Drafting for Fast Speculative Decoding

Speculative Speculative Decoding

YC Paper Club: Inference, Diffusion, World Models, and More

vLLM Speculators docs: DFlash user guide

vLLM Speculators docs: algorithm decision guide

vLLM EAGLE draft-model executable examples

vLLM: EAGLE-3.1 speculative decoding update

vllm.ai

Red Hat: EAGLE-3 in vLLM speculative decoding

developers.redhat.com

Red Hat: speculative decoding with vLLM and gpt-oss

developers.redhat.com

Google Developers: DFlash speculative decoding on TPU v5p

developers.googleblog.com

Makora SMC-SD: Sequential Monte Carlo Speculative Decoding

makora.com

SMC-SD GitHub repository

Faster LLM Inference via Sequential Monte Carlo

SGLang GitHub repository

SGLang v0.5.12 release

TokenSpeed: speed-of-light LLM inference for agentic workloads

lightseek.org

TokenSpeed-kernel README and registry/selection design

TokenSpeed server parameters and backend selection

lightseek.org

TokenSpeed model recipes for Kimi K2.5 / K2.6

lightseek.org

Kog AI homepage

kog.ai

Kog Inference Engine tech preview: 3,000 output tokens/s/request

blog.kog.ai

Kog monokernel deep dive on AMD MI300X

blog.kog.ai

Kog Delayed Tensor Parallelism for faster Transformer inference

blog.kog.ai

LayerScale stateful inference engine docs

docs.layerscale.ai

LayerScale paper: Attention Once Is All You Need

docs.layerscale.ai · docs.layerscale.ai

LayerScale Flash Queries docs

docs.layerscale.ai

LayerScale streaming data and SDK docs

LayerScale multi-agent tool-calling paper

LMCache documentation: KV cache layer for LLM serving

docs.lmcache.ai

Tensormesh: repeated context and cached-token inference path

tensormesh.ai

Tensormesh raises $20M and launches Tensormesh Inference

tensormesh.ai

Tensormesh LMCache storage ROI calculator and cost analysis

tensormesh.ai · tensormesh.ai

RadixArk: $100M seed to build open infrastructure for frontier AI

radixark.com

BusinessWire: RadixArk launches to grow SGLang and Miles

businesswire.com

vLLM x Mooncake Store: distributed KV cache pool for agentic traces

vllm.ai

vLLM MooncakeStoreConnector usage guide

vLLM CPU installation and backend documentation

vLLM PR #39445: CPU FP8 attention for AMX / AVX-512

vLLM PR #40900: MooncakeStoreConnector

Mooncake GitHub: KV-cache-centric disaggregated serving platform

NVIDIA TensorRT-LLM developer page

Hugging Face Text Generation Inference documentation

huggingface.co · github.com

LMDeploy OpenAI-compatible serving documentation

lmdeploy.readthedocs.io · github.com

Ray Serve LLM documentation

docs.ray.io

KServe LLMInferenceService and llm-d

kserve.github.io · kserve.github.io · llm-d.ai

Kueue, Kubernetes DRA, Gateway API inference extension, and LeaderWorkerSet

kueue.sigs.k8s.io · kubernetes.io · gateway-api-inference-extension.sigs.k8s.io · lws.sigs.k8s.io

NVIDIA GPU Operator and DCGM Exporter telemetry

docs.nvidia.com · docs.nvidia.com

BentoML / OpenLLM documentation

docs.bentoml.com · github.com

LiteLLM and Portkey gateway docs

docs.litellm.ai · portkey.ai

llama.cpp, Ollama, and MLX-LM local serving

github.com · github.com · github.com

dstack: open-source TEE runtime for confidential AI

dstack.org

dstack GitHub, usage, and security model

github.com · usage · security

Media, voice, and diffusion serving.

ElevenLabs: latency optimization

ElevenLabs: understanding audio streaming

ElevenLabs: realtime TTS WebSocket guide

LiveKit Agents documentation

LiveKit: open-source realtime platform and Cloud

LiveKit voice AI quickstart

LiveKit turn detection and interruptions

Pipecat: open-source voice and multimodal agent framework

docs.pipecat.ai

faster-whisper: self-hostable Whisper implementation on CTranslate2

Qwen3-TTS repository and package examples

Kokoro-82M: Apache-2.0 local TTS model card

Piper: local neural TTS repository

Coqui XTTS-v2 model card and license context

Resemble AI Chatterbox TTS repository

Canopy Orpheus-TTS repository

F5-TTS repository and license notes

Spark-TTS Triton / TensorRT-LLM serving notes

Higgsfield AI video generator

higgsfield.ai

Higgsfield CLI / MCP for image and video generation

higgsfield.ai

Higgsfield SDK: image-to-video and speech-to-video APIs

npmjs.org

Seedance 1.0 technical report

ByteDance Seedance official page

seed.bytedance.com

Wan2.2 official repository

Wan2.2 T2V-A14B model card

Alibaba Cloud Model Studio video models including Wan2.5 preview APIs

alibabacloud.com

Diffusers Qwen-Image pipeline documentation

Qwen-Image model card quick start

Qwen-Image-2.0 technical report

Diffusers Flux2 pipeline documentation

HiDream-I1-Full model card and license notes

Stable Diffusion 3.5 Medium model card and community license

LTX-2.3 model page and release context

ltx.io

LTX-2.3 supported API models

docs.ltx.video

Diffusers LTX-2 pipeline documentation

LTX-2 pipeline package README

LTX-2 Hugging Face model card and community license

HunyuanVideo official repository

Where Do the Joules Go? Diagnosing Inference Energy Consumption

Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

AdaCache: Adaptive Caching for Faster Video Generation with Diffusion Transformers (ICCV 2025)

openaccess.thecvf.com

TeaCache: Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model (CVPR 2025)

openaccess.thecvf.com

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device (CVPR 2025)

openaccess.thecvf.com

Faster Image2Video Generation / VCUT

Mobile Video Diffusion

StreamWise: adaptive serving for multimodal generation workflows

DDiT: dynamic resource allocation for diffusion transformers

TetriServe: efficient DiT serving for image generation

xDiT: inference engine for diffusion transformers

Latent Parallelism for video diffusion serving

Chorus: inter-request caching for video diffusion serving

Local, edge, and deskside AI.

MLX: array framework for Apple silicon

MLX-LM: text generation and fine-tuning on Apple silicon

Apple Mac Studio technical specifications

apple.com

NVIDIA DGX Spark and DGX Station deskside AI systems

blogs.nvidia.com

NVIDIA DGX Spark User Guide hardware overview

NVIDIA DGX Station Development Guide

AMD Ryzen AI Halo developer platform and Ryzen AI Max PRO 400 Series

AMD Ryzen AI Max+ 395 product page

AMD Ryzen AI Max+ 395 generative AI performance technical article

Rubric and benchmarking discipline.

Standard Kernel R-axis rubric

standardkernel.com

Benchmarking guide

standardkernel.com

PTX-layer work

standardkernel.com

Compiler infrastructure.

LLVM

llvm.org

MLIR

mlir.llvm.org

MLIR: A Compiler Infrastructure for the End of Moore's Law

Modular

MAX

Modular 25.6 NVIDIA/AMD/Apple unification

Modular 26.2 MAX image generation + Mojo kernels

Modular 26.3 Mojo 1.0 Beta + MAX video generation

Structured Mojo Kernels Part 1

Spectral SCALE

Søndergaard: How CUDA won by not being a standard

Søndergaard: The brain still needs the hammer

Søndergaard: CUDA was always cross-platform

SCALE docs: compile CUDA with SCALE and scaleenv

SCALE docs: basic CUDA vector-sum example

SCALE docs: BLAS / cuBLAS compatibility example

Spectral Compute scale-validation repository

Benchmarks.

KernelBench

KernelBench-X

Wafer KernelArena

kernelarena.ai

Wafer #1 Qwen3.5-397B on AMD MI355X via TensorWave

tensorwave.com

Wafer: Achieving Heterogeneous Compute One Kernel at a Time

wafer.ai

SemiAnalysis InferenceX

InferenceX v2

WarpSpeed · SOL-ExecBench · Cursor multi-agent · kernel-design-agents · K-Search.

Core Auto: When AI Starts Writing Systems Code (Mark Saroufim, May 28, 2026)

coreauto.com

MLSys 2026 invited talk: When AI Starts Writing Systems Code

mlsys.org

GPU MODE KernelBook: PyTorch to Triton translations

pygpubench: benchmarking untrustworthy GPU kernels

The Hardware Lottery

ProgramBench: code generation and long-horizon programming behavior

doubleAI: WarpSpeed: Surpassing Expert-Written Kernels At Scale (Mar 31, 2026)

doubleai.com

doubleAI: WarpSpeed and the Need for Artificial Expert Intelligence (Mar 31, 2026)

doubleai.com

doubleAI: Press release: Beats a decade of expert-engineered GPU kernels

businesswire.com

doubleGraph: WarpSpeed-optimized cuGraph (Apache-2.0)

Cursor: Speeding up GPU kernels by 38% with a multi-agent system (Edward Lin, Apr 14, 2026)

cursor.com

Anysphere: kernel-optimization-results (per-problem metrics + solutions)

NVIDIA: SOL-ExecBench paper (arXiv 2603.19173v1)

NVIDIA: SOL-ExecBench repo (Apache-2.0)

NVlabs: SOLAR (Speed-of-Light Analysis for Runtime)

SOL-ExecBench leaderboard (NVIDIA Research)

research.nvidia.com

MIT HAN Lab: kernel-design-agents (Kernel Mafia, MLSys 2026)

PolyArch: Humanize (RLCR plan-execute-review harness)

DongyunZou: KernelWiki (Blackwell/Hopper PR-derived skill, 2,179 refs)

DongyunZou: ncu-report-skill (Nsight Compute for B200/sm_100)

UC Berkeley: K-Search via Co-Evolving Intrinsic World Model (arXiv 2602.19128)

MLSys 2026 FlashInfer AI Kernel Generation Contest

mlsys26.flashinfer.ai

Reward hacking, real failure modes, and the harness response.

KernelBench PR #25: Investigate Sakana Kernel

airevolution.poltextlab.com

Sakana AI CUDA Engineer: public retraction context

Shimizu: Lucas Beyer dissection of Sakana CUDA Engineer

medium.com

Sakana robust-kbench (Apache 2.0)

Wafer: A Field Guide to Reward Hacking in AI Kernel Generation

wafer.ai

DeepReinforce: Hacks and Defenses in Automatic GPU Kernel Generation

deep-reinforce.com

FlashInfer-bench Issue #21: cache-based exploit at MLSys 2026

Makora: Discovery & Mitigation of Reward Hacks

makora.com

Unsloth: RL Reward Hacking guide

unsloth.ai

HF kernels skill · AMD CDNA4 / MI355X R3-R4 stack.

Modal GPU Glossary: README / first-principles GPU map

modal.com

Modal GPU Glossary: contributors (Charles Frye, Matthew Nappo, Modal team)

modal.com

Modal GPU Glossary: CUDA thread example

modal.com

modal-labs/gpu-glossary source repository

Hugging Face: Writing custom kernels with code agents (docs)

Hugging Face: Custom Kernels for All from Codex and Claude

burtenshaw/kernel-skill (cuda-kernels skill source)

Hugging Face: CLI skills (cuda-kernels, rocm-kernels, xpu-kernels)

AMD GPUOpen: machine-readable GPU ISA (CDNA4 XML)

gpuopen.com

ROCm Documentation: GPU hardware specifications (MI355X / gfx950)

Salykova: Matrix Core Programming on AMD CDNA3 and CDNA4

ROCm Blogs: From Naive to Near-Peak: Gluon GEMM Kernels on MI355

RadeonFlow_Kernels: FP8 GEMM, MoE, MLA on MI300X (MIT)

SGLang PR #22409: GLM-5.1-MXFP4 nightly CI for MI30x and MI35x

llama.cpp PR #21570: CDNA4 (gfx950) support for MI350X/MI355X

Quantization and precision formats.

NVIDIA Transformer Engine: low-precision training introduction (BF16/FP16, FP8, MXFP8, NVFP4 recipes)

NVIDIA Transformer Engine: NVFP4 format and scaling

NVIDIA: Pretraining Large Language Models with NVFP4

NVIDIA technical blog: NVFP4 low-precision model training without losing accuracy

NVIDIA Blackwell Ultra technical blog: NVFP4, FP8, HBM, Tensor Cores

NVIDIA TensorRT-LLM on H100: FP8 inference and TCO framing

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

OCP 8-bit floating point specification (OFP8, E4M3/E5M2)

opencompute.org

OCP Microscaling Formats MX v1.0 specification (MXFP4, block size, scale format)

opencompute.org

NVIDIA Transformer Engine NVFP4 recipe and block scaling

Pretraining Large Language Models with MXFP4 on Native FP4 Hardware

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Training LLMs with MXFP4

Diagnosing FP4 inference: NVFP4 and MXFP4 layer/block sensitivity

Benchmarking PTQ under microscaling floating-point formats

RaZeR: Pushing the Limits of NVFP4 Quantization with Redundant Zero Remapping

NVFP4-RaZeR public code and experiment scripts

cuDNN frontend MXFP8 128x4 scale-factor layout

nvidia.github.io

AMD Instinct MI355X product page: FP16/BF16, FP8/OCP-FP8, MXFP6, MXFP4/FP4, HBM

AMD Quark for Instinct accelerators: MI355X float8 and MXFP4 support

quark.docs.amd.com

AMD ROCm: high-accuracy MXFP4, MXFP6, and mixed precision on MI355 GPUs

ROCm Gluon GEMM tutorial: MI355X FP16, BF8, MXFP4 GEMM performance

Evaluation, quality replay, and latency verification.

NVIDIA GenAI-Perf: LLM throughput and latency metrics (TTFT, ITL, request latency)

NVIDIA GenAI-Perf Analyze: p99 latency plus GPU telemetry scenario reports

NVIDIA NIM benchmarking guide: TTFT, e2e latency, ITL, TPS, RPS

Ray LLMPerf: load tests and correctness tests for LLM APIs

SGLang bench_serving: online throughput, TTFT, ITL, TPOT, JSONL records

vLLM bench serve: serving benchmarks and goodput SLO constraints

Revisiting SLOs and System Level Metrics in LLM Serving

EleutherAI lm-evaluation-harness: reproducible model quality evaluation

Stanford CRFM HELM: holistic, reproducible, transparent foundation-model evaluation

OpenAI Evals: framework for LLM and LLM-system evaluations

RAGAS metrics: faithfulness, response relevance, context precision, tool/agent metrics

docs.ragas.io

LongBench v2: long-context reasoning and realistic multitask evaluation

longbench2.github.io

SWE-bench: real GitHub issue resolution benchmark

swebench.com

EvalPlus: rigorous code correctness and code-efficiency evaluation

Eigen AI · Nebius acquisition · full-stack inference optimization.

BusinessWire: Nebius agrees to acquire Eigen AI (May 1, 2026)

businesswire.com

Bloomberg: Nebius to buy startup that makes AI run faster, cheaper ($643M)

bloomberg.com

Eigen AI: The Next Chapter: Joining Nebius (model / system / kernel framing)

Nebius + Eigen AI partnership announcement (Token Factory)

nebius.com

Eigen AI: Day-0 NVFP4 inference on Blackwell for Nemotron 3 Nano Omni

EigenData: Self-Evolving Function-Calling Data (71.5% BFCL critical issues)

Eigen AI: Reliable Post-Training for Interactive Tool-Using Agents

MIT HAN Lab: Sparse Attention (SpAtten), Ryan Hanrui Wang et al.

hanlab.mit.edu

MLSys 2024 Best Paper Award: AWQ, Wei-Chen Wang et al.

mlsys.org

SiliconANGLE: Nebius acquires Eigen AI for $643M (CUDA/Triton kernel detail)

siliconangle.com

FlashAttention NaN bug class.

FA3/FA4 Issue #2374: NaN in unused V cache entries

FA3 Issue #1974: NaN in forward when NaN exists in unread V regions

FA3 Issue #1052: NaN gradients in BW pass

PyTorch PR #130014: fix NaN in CPU flash-attention lazy softmax

AgentX trail.

Bryan Shan × Cameron Quilici: Researcher Conversations at GTC

PR #993: agentic trace replay

PR #1032: ISB-1 converted traces + kv-cache-tester contract helpers

PR #1258: AgentX v0.2

Issue #1358: MI355X CPU offload OOM

Issue #1359: cache-bust mismatch

Issue #1369: Kimi-K2.5 tokenizer trust-remote-code

mKernel and GPU-driven communication.

UCCL: mKernel blog post

uccl-project.github.io

uccl-project/mKernel GitHub repository

mKernel source: MoE Dispatch + Group GEMM fused path

Ziming Mao mKernel launch note

x.com

Ziming Mao: GPU communication, UCCL, and mKernel

maoziming.github.io

Berkeley Sky Computing Lab

UCCL project

uccl-project.github.io

InferenceX: GB200 NVL72 vs B200 on Kimi K2.5, 3.1x from Wide EP vLLM

InferenceX: B200 NVFP4 vs H200 INT4 on Kimi K2.5 / Kimi K2.6

InferenceX: AMD MI355X Kimi K2.5 vLLM / AITER movement row

UCCL-EP: Efficient and Flexible MoE Communication

Zyphra: Implementing Inference Communication Overlap on AWS Inferentia2

zyphra.com

Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping

AWS Neuron Kernel Interface documentation

AWS Neuron Inferentia2 architecture documentation

AWS: Scaling Rufus with over 80,000 AWS Inferentia and Trainium chips for Prime Day

aws.amazon.com

AWS Trainium family overview

aws.amazon.com

AWS Neuron Trainium2 architecture documentation

AWS Neuron: Introducing NeuronX Distributed Inference

Hugging Face TGI: Neuron backend for AWS Trainium and Inferentia

Hugging Face Optimum Neuron: vLLM plugin for AWS Trainium and Inferentia

Microsoft Maia 200: The AI accelerator built for inference

blogs.microsoft.com

Meta: Expanding custom silicon to power AI workloads

about.fb.com

FlashLib and classical ML operator kernels.

FlashLib: Bringing Flash Magic to Classical Machine Learning Operators

flashml-org.github.io

FlashML-org/flashlib GitHub repository

RAPIDS cuML documentation

RAPIDS cuML accelerator documentation

RAPIDS cuDF documentation

RAPIDS RAFT documentation

RAPIDS Memory Manager documentation

FlashLib source: KMeans Triton assignment kernel

FlashLib source: KNN CuteDSL fused kernel

FlashLib source: PCA fused covariance kernels

FlashLib source: runtime/FLOPs/HBM estimator API

AMD MI355X observability anecdote.

Crusoe: Virtualizing AMD Instinct MI355X GPUs with AMD Pensando Pollara 400 AI NIC on Linux KVM

crusoe.ai

AMD Pensando Pollara 400 AI NIC

Crusoe Cloud AMD MI355X instances

crusoe.ai

InferenceX: AMD MI355X Qwen3.5 SGLang 8k/1k throughput curve

SGLang Issue #19633: Qwen3.5-397B FP8 vs BF16 on MI355

SGLang PR #21234: AMD MXFP4 Qwen3.5-397B-A17B

KV-cache compression.

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

SpectralQuant: 3% Is All You Need

nanothoughts.substack.com

SpectralQuant technical note: Breaking TurboQuant's Compression Limit

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

SparseSpec GitHub repository

LM programming, systems search, and externalized state.

Omar Khattab: MIT CSAIL faculty profile

csail.mit.edu

Omar Khattab: research page listing ColBERT, DSPy, GEPA, RLMs

omarkhattab.com

MIT ILP profile: DSPy, STORM, IReRa, PATH, PAPILLON, MIPRO, BetterTogether, ColBERT, Baleen, ARES, PLAID

ilp.mit.edu

Everest Lab: AI-driven data systems, code generation, retrieval, agent operations

dsg.csail.mit.edu

Tim Kraska: DSAIL, Everest, AgentCore, structured knowledge bases, data-science agents

csail.mit.edu

Samuel Madden: Data Systems Group, data systems and ML systems

db.csail.mit.edu

EnCompass: separating agent search strategy from workflow

news.mit.edu

DSPy

ColBERT: late-interaction retrieval

GEPA: Berkeley Sky / cross-lab reflective prompt evolution

Berkeley Sky people page: Lakshya A Agrawal listed as GSR

Recursive Language Models (arXiv)

RLM blog · Alex L. Zhang

alexzhang13.github.io

RLM codebase

Omar Khattab: RLM criteria and dynamic workflows framing

x.com

PEEK: Context Map as an Orientation Cache (arXiv)

Omar Khattab: MIT EECS

eecs.mit.edu

API and agent-side inference optimization.

OpenAI Codex: custom instructions with AGENTS.md

OpenAI: Introducing the Codex app (skills and repeatable workflows)

openai.com

OpenAI Cookbook: context engineering and long-term memory state

OpenAI API prompt caching guide

OpenAI Cookbook: prompt caching 201

OpenAI API cost optimization guide

OpenAI Batch API guide

OpenAI Predicted Outputs guide

OpenAI API full docs text: tool search, deferred tool loading, WebSocket mode

Anthropic / Claude: Introducing dynamic workflows in Claude Code

claude.com

Hacker News: Dynamic Workflows in Claude Code

news.ycombinator.com

Claude Code: memory, CLAUDE.md, imports, and AGENTS.md bridge

Claude Code: costs, token management, code intelligence, hooks, and subagents

Claude Code: slash commands and skills

Claude Code: /context and configuration debugging

Anthropic: improving skill-creator with evals and benchmarks

claude.com

SWE-Skills-Bench: deterministic evals for agent skills

Evaluating AGENTS.md: repository-level context file caveat

Russ Cox: Regular Expression Matching Can Be Simple And Fast

swtch.com

Bell Labs / Dartmouth reader: grep origin from g/re/p

cs.dartmouth.edu

Linux grep manual page

man7.org

ripgrep repository

RTK / Rust Token Killer repository

RTK documentation: hook rewrite and token savings

rtk-ai.app

RTK architecture documentation

RTK technical walkthrough

RTK command module architecture

Zoekt: fast trigram-based code search

Sourcegraph architecture: Zoekt default-branch trigram index

sourcegraph.com

Tree-sitter: incremental parsing system for programming tools

ast-grep: structural search, lint, and rewriting

ast-grep.github.io

Semgrep pattern syntax: metavariables and ellipsis

semgrep.dev

Semgrep repository: static analysis for many languages

Sverklo field study: 14,200 tokens to find one function

sverklo.com

Semble code search benchmark: token-budget comparison

blakecrosley.com

FrugalGPT: model cascade and budget-aware LLM applications

RouteLLM repository

GPTCache repository

vLLM Semantic Router / Cointab repository

RAGAS: automated evaluation for retrieval augmented generation

Lost in the Middle: long-context retrieval behavior

OpenAI Voice Agents guide

OpenAI Realtime VAD guide

OpenAI Realtime costs guide

OpenAI Agents SDK: Twilio realtime voice example

Twilio Media Streams

twilio.com

LiveKit Agents voice AI docs

Pipecat realtime voice and multimodal agent framework

Deepgram endpointing guide

developers.deepgram.com

ElevenLabs WebSocket TTS docs

OpenAI Image Generation guide

OpenAI Images API reference

OpenAI Moderation guide

Google Vertex Imagen API

cloud.google.com

Runway API guide

docs.dev.runwayml.com

Adobe Firefly API docs

developer.adobe.com

Replicate predictions API docs

replicate.com

fal model API queue docs

fal.ai

DeepCache: Accelerating Diffusion Models for Free

Latent Consistency Models

Hugging Face Diffusers: Latent Consistency Models

ComfyUI repository

GPTCache image generation example

gptcache.readthedocs.io

IEEE Access: Cost-Efficiency Metrics for Generative AI Video Models

ieeexplore.ieee.org

USV: Unified Sparsification for Accelerating Video Diffusion Models

DisCa: Learnable Feature Caching for Video Diffusion Transformers

TeaCache: Timestep Embedding Aware Cache

BWCache: Block-Wise Caching for Video Diffusion Transformers

PreciseCache: Precise Feature Caching for Video Generation

FlowCache: Chunkwise Adaptive Caching for Autoregressive Video Generation

Astraea: Token-wise Acceleration Framework for Video Diffusion Transformers

VORTA: Efficient Video Diffusion via Routing Sparse Attention

Sparse Video-Gen: Spatial-Temporal Sparsity for Video Diffusion Transformers

proceedings.mlr.press

CMD: Content-Frame Motion-Latent Decomposition for Efficient Video Diffusion

Lumiere: A Space-Time Diffusion Model for Video Generation

dl.acm.org

Hugging Face Diffusers: T-GATE optimization

Q&C: When Quantization Meets Cache in Efficient Image Generation

TLCM: Training-Efficient Latent Consistency Model

LTS-VoiceAgent: Listen-Think-Speak streaming voice agents

VoiceAgentRAG: Dual-agent memory router for realtime voice RAG

VOXSERVE: Streaming-Centric Serving System for Speech Language Models

TangoFlux: Fast Text-to-Audio Generation with Flow Matching

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation

Hermes Agent skills documentation

OpenClaw skills documentation

docs.openclaw.ai

GStack: Claude Code virtual engineering team

GBrain retrieval architecture

GPU MODE and kernel competitions.

KernelBench: Can LLMs Write Efficient GPU Kernels?

KernelBot: GPU MODE platform for writing heterogeneous GPU code

Alex L. Zhang

alexzhang13.github.io

Hardware diversification & NVIDIA Vera Rubin platform.

NVIDIA Vera CPU

NVIDIA launches the Vera CPU (newsroom)

NVIDIA Vera Rubin platform (newsroom)

NVIDIA: Vera Rubin ramps into full production at GTC Taipei

SemiAnalysis: Vera Rubin - Extreme Co-Design: An Evolution from Grace Blackwell Oberon

NVIDIA: Vera CPU for agents at GTC Taipei

NVIDIA and Microsoft: RTX Spark Windows PCs for personal AI agents

NVIDIA Blog: local AI agents across RTX PCs and DGX Spark

blogs.nvidia.com

NVIDIA: Vera BlueField-4 STX for agentic AI storage security

NVIDIA: Vera Rubin DSX AI Factory reference design

Dell: first systems built on NVIDIA Vera Rubin platform to CoreWeave

CoreWeave: Vera Rubin NVL72 bring-up and validation

investors.coreweave.com

NVIDIA Vera Rubin NVL72 product page

Michael Dell / Dell + NVIDIA Vera Rubin NVL72 for CoreWeave news pickup

digg.com

NVIDIA CMX Context Memory Storage Platform

NVIDIA GB300 NVL72 product page

Dell: Dell delivers market's first NVIDIA GB300 NVL72 to CoreWeave

CoreWeave: first NVIDIA GB300 NVL72 deployment

coreweave.com

Dell: AI infrastructure with NVIDIA Vera Rubin integration

Dell: integrated compute and networking from Dell and NVIDIA

CoreWeave extends its cloud platform with NVIDIA Rubin platform

coreweave.com

CoreWeave Mission Control

coreweave.com

NVIDIA: Spectrum-X Photonics and co-packaged optics networking switches

NVIDIA technical blog: scaling AI factories with co-packaged optics

NVIDIA Silicon Photonics Networking for Agentic AI

Broadcom: Tomahawk 6 / Davisson co-packaged optics switch

broadcom.com

Huawei: Tau Scaling Law and LogicFolding announcement

huawei.com

Serving Large Language Models on Huawei CloudMatrix384

NVIDIA Rubin platform: six new chips (CES newsroom)

NVIDIA Vera CPU technical blog

NVIDIA technical blog: Vera CPU sets a new standard for agentic workloads

NVIDIA Groq 3 LPX / Attention-FFN Disaggregation technical blog

The Register: NVIDIA's Groq-powered LPX racks

theregister.com

Groq LPU architecture: SRAM, compiler control, deterministic execution

groq.com

Groq: What is a Language Processing Unit?

groq.com

Cerebras WSE-3: 4T transistors, 900K AI cores, 44GB on-chip SRAM

cerebras.ai

Cerebras Wafer Scale Engine product page

cerebras.ai/chip

NVIDIA GPU/CPU roadmap to 2028 (The Next Platform)

nextplatform.com

NVIDIA Groq 3 LPX

Cerebras $5.55B IPO

thenextweb.com

Google Ironwood TPU

blog.google

Energy, water, and community constraints.

IEA: Key Questions on Energy and AI executive summary

iea.org

IEA: Energy and AI report (945 TWh by 2030 base case)

iea.org

IEA: data-centre electricity use surged in 2025

iea.org

DOE/LBNL: U.S. data-center electricity use 2014-2028 estimates

energy.gov

CBRE: Global Data Center Trends 2025 (power availability constraint)

cbre.com

The Verge: Utah Stratos Project, power and water-rights backlash

theverge.com

WUNC: North Carolina data centers, water quality, PFAS, and availability

wunc.org

KEDT: Corpus Christi, Sinton, and Texas data-center water demand

kedt.org

North Platte Bulletin / Flatwater Free Press: data-center moratoriums over water and electricity

northplattebulletin.com

Harvard Crimson: Lowell data-center expansion, noise, emissions, and local bills

thecrimson.com

Microsoft: understanding water use at datacenters

local.microsoft.com

Google: 2026 Water Stewardship Project Portfolio

sustainability.google

Meta: restoring water in data-center communities

datacenters.atmeta.com

Generation systems and engines.

Meta KernelEvolve arXiv

Engineering at Meta post

CUDA-Agent

cudaLLM-8B

CUDA-Agent-Ops-6K

CUDA-L1

Sakana AI CUDA Engineer Archive

robust-kbench

AMD GEAK

Makora

SemiAnalysis / Researcher Conversations: Makora interview

Wafer YC launch

ycombinator.com

Together AI coding-agent benchmarks

Inside the Together AI kernels team

Together AI introduces Tri Dao and FlashAttention-2

Together AI: AI Native Conf research announcements

Together AI: ATLAS runtime-learning speculator system

Together AI / CMU / Princeton / Cartesia: Mamba-3

Hugging Face kernels.

Kernel Hub: Enhance Your Models in 5 Minutes

Custom Kernels for All from Codex and Claude (cuda-kernels agent skill)

kernels library docs

Why kernels?

Writing Hub kernels with kernel-builder

kernels benchmark CLI

kernel-builder skills CLI

Kernel Hub

kernels-community org

Open verifiable-reward RL (RLVR) line.

Tülu 3: Pushing Frontiers in Open Language Model Post-Training (arXiv)

Tülu 3: AI2 blog (RLVR introduction)

allenai.org

Tulu3 Reproduction: Open Instruct (RLVR training code)

allenai.github.io

open-instruct repository (AI2)

OLMo 3 (arXiv)

Hermes 4 (arXiv / Hugging Face)

Open-source agent frameworks (2026).

Hermes Agent (Nous Research): built on GEPA

Hermes Agent v0.14.0 release

OpenClaw: Open-Source AI Automation Framework

openclaw.im

OpenClaw docs: Anthropic provider

openclawlab.com

Concrete kernel case studies.

FlashAttention: IO-aware exact attention

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

tridao.me

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling — Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao

arxiv.org · pypi.org

FlashAttention source code: FA4 CuTe-DSL SM100/SM120 paths

GPU MODE: FlashAttention-4 lecture with Ted Zadouri — hosted by Mark Saroufim

Together AI: FlashAttention-4 research post

ThunderKittens: A Simple Embedded DSL for AI kernels

together.ai · arxiv.org

ThunderKittens optimized for NVIDIA Blackwell GPUs

PagedAttention / vLLM paper

NVIDIA TensorRT-LLM multiblock attention on HGX H200

DeepSeek FlashMLA kernel repository

AMD ROCm blog: AITER-enabled MLA layer inference on MI300X

AMD ROCm docs: vLLM optimization with AITER on ROCm

DeepSeek DeepGEMM / Mega MoE repository

NVIDIA cuBLAS documentation: cuBLASLt and GEMM APIs

NVIDIA CUTLASS kernel library

AMD hipBLASLt GEMM documentation

ROCm Gluon GEMM tutorial: MI355X FP16, BF8, MXFP4 GEMM performance

FlashInfer fused MoE API documentation

docs.flashinfer.ai

FlashInfer sampling kernels: faster top-k/top-p sampling

flashinfer.ai · docs.flashinfer.ai

AMD AITER model acceleration library

ROCm docs: model acceleration libraries for inference optimization

Hugging Face Diffusers memory optimization guide

NVIDIA CV-CUDA repository

NVIDIA DALI documentation

Broader pattern.

Berkeley Sky Computing Lab

UC Berkeley CDSS: Sky Computing Lab launch

cdss.berkeley.edu

UC Berkeley EECS: Databricks grew out of AMPLab

eecs.berkeley.edu

UC Berkeley AMPLab overview

amplab.cs.berkeley.edu

SkyPilot documentation

skypilot.readthedocs.io

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

blog.vllm.ai

vLLM releases: v0.22.0 production serving path

vLLM Q2 2026 roadmap

SGLang repository

SkyDiscover

SkyRL: Berkeley Sky Computing Lab project page

SkyRL-SQL documentation: multi-turn Text2SQL recipe

skyrl.readthedocs.io

UCCL project

uccl-project.github.io

Ziming Mao: UCCL and mKernel

maoziming.github.io

Yang Zhou: ML systems and GPU communication

cs.ucdavis.edu

Karpathy autoresearch

AMD wins on real workloads · MI355X vs B200 (2026).

TensorWave: "MI355X Just Flipped the Script on B200 for FP8 DeepSeek Disagg" (Mar 10, 2026)

tensorwave.com

AMD: Single Node & Distributed Inference Performance on MI355X (ATOM, Jan 2026)

SemiAnalysis: InferenceX v2: Blackwell vs AMD vs Hopper (Feb 2026)

Hostrunway: B200 vs MI355X 2026 LLM showdown (MLPerf v6.0, May 2026)

hostrunway.com

SemiAnalysis: The Coding Assistant Breakdown: More Tokens Please (May 23, 2026: 174,264 sessions, 42% CPU / 58% GPU)

SemiAnalysis: CPUs are Back: The Datacenter CPU Landscape in 2026 (Gerald Wong, Feb 9, 2026: Fairwater 1:6 CPU:GPU)

SemiAnalysis: InferenceX v2: Blackwell vs AMD vs Hopper (Feb 16, 2026: MTP $2.35→$0.11, a 21× decrease)

semianalysis.substack.com

SemiAnalysis: GB200 Hardware Architecture & Component Supply Chain BoM (Dylan Patel + Doug O'Laughlin, Jul 2024: $3.3M NVL72)

HPE: NVIDIA GB200 NVL72 data sheet (132 kW rack listing)

hpe.com

SemiAnalysis · Longbridge: Dylan Patel on $10B code-agent revenue + CPU bottleneck

longbridge.com

Gen 3 / ASIC portability · COMPUTEX 2026 · AI TAIWAN Expo.

Spectral Compute: SCALE docs

ai-taiwan.com.tw/exhibitor

Business Insider: Spectral Compute funding (Nov 2025)

businessinsider.com

COMPUTEX 2026: "AI Together"

computextaipei.com.tw

2026 AI TAIWAN Expo: June 24-26 at Taipei Expo Dome

ai-taiwan.com.tw

2026 AI TAIWAN exhibitor page: Global Startup booth, Global Pavilion, AI infrastructure categories

NVIDIA Newsroom: Vera CPU launch ("purpose-built for agentic AI")

web.archive.org/nvidia.com

Three-generation framing · pre-history (CPU + GPU).

NVIDIA: Unveils CUDA (Nov 8, 2006)

NVIDIA: CUDA 1.0 Programming Guide (June 23, 2007)

developer.download.nvidia.com

Krizhevsky, Sutskever, Hinton: AlexNet (NeurIPS 2012)

papers.nips.cc

NVIDIA: Vera CPU product page

NVIDIA: Vera CPU Rack (256 CPUs, 22,528 cores, 22.5K+ envs)

NVIDIA Technical Blog: Vera CPU design (Mar 16, 2026)

Gen 3 · task economics · tokenomics (May 2026 cohort).

Fortune: Microsoft's AI cost problem (May 22, 2026)

fortune.com

The Verge: Microsoft cancels Claude Code licenses

theverge.com

The Information: Uber CTO on Claude Code budget burn

theinformation.com

SemiAnalysis: The Coding Assistant Breakdown: More Tokens Please

SemiAnalysis: AI Dark Output: The Visible Cost of Invisible Output