Open role

Founding Kernel Engineer, GPU Kernels

Why this work matters

Touchdown starts with people doing real work: teach the team, build one useful AI system, manage what runs, and measure what still breaks. We follow repeated limits through software, inference, kernels, memory, hardware, materials, and manufacturing research only when the evidence earns the next layer. The ambition is broad. The work is always bounded by a user, a test, and a receipt.

Overview

You will own the low-level path where model operations become executable accelerator work. The role begins with the workload, acceptance test, and measured system bottleneck, not with a kernel looking for a use case. The work includes framework dispatch, graph capture and fusion, compiler lowering, kernel selection and generation, device code, quantization, memory movement, synchronization, and profiling across NVIDIA and AMD systems. A microbenchmark matters only when its numerical behavior is correct and its effect survives inside the representative workload.

What you will own

Identify expensive operators and execution gaps from workload traces rather than from whichever isolated benchmark is in fashion.
Implement, select, fuse, or generate kernels for attention, matrix operations, routing, normalization, communication, data movement, and other relevant primitives.
Trace operations through PyTorch or comparable frameworks into graph capture, compiler IR, generated code, runtime dispatch, and device execution.
Measure launch overhead, occupancy, instruction mix, memory traffic, cache behavior, synchronization, stalls, and numerical error.
Compare vendor libraries, generated kernels, custom kernels, and compiler paths before adding maintenance cost.
Build correctness tests, performance harnesses, architecture guards, fallbacks, and end-to-end receipts.

Technical territory

PyTorch compilation; Triton and accelerator DSLs; CUDA and ROCm/HIP; vendor math, attention, communication, and runtime libraries; compiler IR and code generation; PTX or AMDGPU ISA inspection; Tensor Core or matrix-core execution; HBM, cache, shared/local memory, registers, launch geometry, synchronization, collectives, and multi-GPU effects.

Representative outputs

A correctness-gated benchmark with frozen shapes, dtypes, tolerances, hardware identity, software revisions, warmup, variance, and regression coverage.
Framework, compiler, kernel, device-code, memory-traffic, and profiler evidence that explains the measured bottleneck and change.
An integrated NVIDIA and AMD implementation or, where the paths differ, an explicit architecture guard, a fallback, and a documented reason.
An end-to-end workload receipt showing whether the local speedup changed latency, throughput, capacity, energy, or cost without breaking quality.
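
As a rough illustration of the first output above, here is a minimal sketch of a correctness-gated benchmark in plain Python. Every name in it (BenchCase, run_gated, reference_dot, candidate_dot) is hypothetical, and the toy dot product stands in for a real reference operator and an optimized kernel. It is not a description of any existing harness. The point it shows is the ordering: inputs are frozen by a seed, the candidate must pass the tolerance check first, and timing is collected only after that gate.

```python
import math
import platform
import random
import statistics
import sys
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchCase:
    """Frozen description of one benchmark case (hypothetical structure)."""
    size: int        # vector length, standing in for a tensor shape
    seed: int        # fixed seed so the inputs are reproducible
    rel_tol: float   # relative tolerance for the correctness gate
    abs_tol: float   # absolute tolerance for the correctness gate
    warmup: int      # untimed iterations before measurement
    repeats: int     # timed iterations (needs >= 2 for a standard deviation)


def make_inputs(case):
    # Same seed and size always produce the same inputs.
    rng = random.Random(case.seed)
    a = [rng.uniform(-1.0, 1.0) for _ in range(case.size)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(case.size)]
    return a, b


def reference_dot(a, b):
    # Toy reference: accurately rounded summation via math.fsum.
    return math.fsum(x * y for x, y in zip(a, b))


def candidate_dot(a, b):
    # Toy "optimized" candidate: plain left-to-right accumulation.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def run_gated(case, reference, candidate):
    a, b = make_inputs(case)
    expected = reference(a, b)
    actual = candidate(a, b)
    passed = math.isclose(actual, expected,
                          rel_tol=case.rel_tol, abs_tol=case.abs_tol)
    receipt = {
        "case": case,
        "python": sys.version.split()[0],   # software revision
        "machine": platform.machine(),      # coarse hardware identity
        "abs_err": abs(actual - expected),
        "passed": passed,
    }
    if not passed:
        return receipt  # gate failed: report no timing at all

    for _ in range(case.warmup):
        candidate(a, b)
    samples = []
    for _ in range(case.repeats):
        start = time.perf_counter()
        candidate(a, b)
        samples.append(time.perf_counter() - start)
    receipt["median_s"] = statistics.median(samples)
    receipt["stdev_s"] = statistics.stdev(samples)
    return receipt


case = BenchCase(size=1024, seed=0, rel_tol=1e-9, abs_tol=1e-9,
                 warmup=3, repeats=5)
receipt = run_gated(case, reference_dot, candidate_dot)
print(receipt["passed"], receipt["abs_err"], receipt.get("median_s"))
```

A real harness would additionally record GPU model, driver, and library revisions, sweep the claimed shape and dtype range, and compare against a stored baseline for regression coverage. Those parts are left out of this sketch.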

What success looks like

The optimized path is numerically qualified across the precision and shape range it claims to support.
Performance results include reproducible inputs, hardware identity, software revisions, profiler evidence, and fallbacks.
The local optimization improves a representative workload rather than only an isolated kernel.
NVIDIA and AMD paths are treated as first-class engineering targets where the workload and available hardware justify them.

What you bring

Strong C++ and GPU programming fundamentals.
Experience with CUDA, ROCm/HIP, Triton, or comparable accelerator programming.
Ability to reason about memory hierarchy, parallel decomposition, synchronization, precision, and numerical correctness.
Experience using profilers, disassemblers, and benchmark harnesses to validate performance claims.
Understanding of framework dispatch and the compiler/runtime path above a kernel.
Ability to maintain architecture guards, correctness coverage, and portable fallbacks.

Helpful experience

Compiler IR or code generation
Attention, GEMM, MoE, or communication kernels
Collectives and multi-GPU execution
Sub-byte or mixed-precision compute
Upstream open-source contributions

How the role works

Full-time role.
San Francisco / Bay Area preferred. Remote within the United States may be considered for the right person.
Scope, start date, and employment details are discussed during the process.

Apply for this role →

Applications are reviewed against the work described here. We do not use a degree, title, or keyword list as a substitute for evidence.