Open role

Founding Inference Systems Engineer

Why this work matters

Touchdown starts with people doing real work: teach the team, build one useful AI system, manage what runs, and measure what still breaks. When the same limit keeps recurring, we follow it down through software, inference, kernels, memory, hardware, materials, and manufacturing research, but only when the evidence earns the next layer. The ambition is broad. The work is always bounded by a user, a test, and a receipt.

Overview

You will own the inference path behind useful AI work. The job starts with a real workload and its private quality gate, then follows it through request shape, prompt and context layout, tokenization, prefill, decode, KV cache, routing, batching, serving runtime, dedicated or self-hosted capacity, CPU and GPU execution, memory, networking, and operating cost. You will turn messy runtime behavior into production decisions, policies, and systems. A throughput number is not enough. The change has to preserve the accepted result and improve the system at the workload level.

What you will own

Capture workload distributions, service-level objectives, quality constraints, and cost boundaries before tuning the stack.
Profile time to first token, inter-token latency, throughput, queueing, memory residency, cache behavior, retries, and failure recovery.
Design model, provider, runtime, routing, batching, quantization, and hardware comparisons using the same acceptance contract.
Diagnose CPU preprocessing, tokenization, scheduler, network, storage, GPU, and memory movement bottlenecks.
Build and operate serving paths using vLLM, SGLang, LMCache, TensorRT-LLM, or comparable systems when they fit the workload.
Turn traces and benchmarks into production changes, capacity plans, runbooks, and rollback conditions.

Technical territory

Long-context and multi-turn inference; prefill and decode; KV-cache allocation, reuse, movement, and tiering; continuous batching and scheduling; speculative decoding; quantization; model parallelism; disaggregated serving; CPU-GPU boundaries; storage and network paths; NVIDIA and AMD accelerators; workload-level quality and cost accounting.

Representative outputs

A frozen workload corpus with quality gates, request distributions, concurrency, context lengths, service-level objectives, and failure cases.
Reproducible model, provider, runtime, cache, routing, batching, quantization, and hardware comparison packets.
Deployable serving configurations, routing and cache policies, capacity models, alerts, rollback thresholds, and runbooks.
A joined trace showing where time, memory, network traffic, retries, energy, and dollars enter one accepted task.

What success looks like

The team can explain the dominant latency, memory, reliability, and cost components for the workload.
A proposed optimization reproduces on a frozen workload and preserves the acceptance threshold.
Capacity and hardware decisions follow measured workload behavior rather than peak specifications.
Production diagnostics distinguish a model problem from a runtime, cache, network, or hardware problem.

What you bring

Strong systems fundamentals and experience diagnosing performance or reliability problems.
Practical understanding of model serving, concurrency, scheduling, memory, and distributed systems.
Ability to design honest benchmarks and separate fixture proof from live runtime proof.
Experience with profiling, telemetry, capacity analysis, and production failure investigation.
Proficiency in Python, plus at least one systems language or a deep runtime specialization.
Ability to connect low-level measurements to the user-visible result and operating economics.

Helpful experience

GPU profiling and performance modeling
KV-cache and memory-tier systems
Distributed inference and networking
Quantization and numerical evaluation
Serving-engine or open-source infrastructure work

How the role works

Full-time role.
San Francisco / Bay Area preferred. Remote within the United States may be considered for the right person.
Scope, start date, and employment details are discussed during the process.

Applications are reviewed against the work described here. We do not use a degree, title, or keyword list as a substitute for evidence.