FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

Sina Heidari, Dimitrios S. Nikolopoulos

Deep learning compilers and vendor libraries deliver strong baseline performance, but they are bounded by finite, engineer-curated catalogs of optimizations. When a catalog omits a needed optimization, practitioners fall back on hand-written CUDA or CUTLASS, which demands expertise in GPU microarchitecture and C++ template metaprogramming. Recent LLM-based agents generate kernels in raw CUDA, which forces them to rediscover optimizations already encoded in mature libraries. We present FACT (Framework for Agentic CUTLASS Transpilation), a three-stage agent-driven workflow that optimizes PyTorch modules through multi-pattern composition while grounding synthesis in CUTLASS C++. Pattern discovery inspects the traced graph, matches subgraphs to optimization rules, retrieves vetted examples, and outputs a prioritized list of patterns. Pattern realization implements each pattern as a CUTLASS kernel, verifies it, and auto-tunes it. Pattern composition assembles the resulting extensions into an optimized module for benchmarking. We evaluate the workflow on KernelBench on NVIDIA A100 and H100 GPUs. On Level 1 GEMM problems (square, batched, and large-K matrix multiply), auto-tuned CUTLASS kernels achieve 1.06x-1.18x speedups over cuBLAS on A100, while results on H100 range from 0.84x to 1.80x relative to cuBLAS. On Level 3 transformer blocks, measured against the PyTorch eager baseline, FACT achieves a 2.03x speedup on MiniGPT (vs. Inductor: 1.89x, TensorRT: 1.85x) and 1.41x on Llama 3 8B (vs. Inductor: 1.17x, TensorRT: 1.18x). Our framework couples agentic graph-level pattern discovery with architecture-specific auto-tuning and a dynamic pattern registry, offering a practical path from traced PyTorch modules to deployable kernels.
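To make the pattern discovery stage more concrete, the following is a minimal sketch in plain Python, not FACT's implementation. It assumes the traced graph has been flattened into a linear sequence of op names, and it uses a hypothetical rule registry (`RULES`) whose pattern names and priorities are invented for illustration. The real workflow operates on a traced PyTorch graph and is agent-driven.

```python
from dataclasses import dataclass

# Hypothetical rule registry: op sequence -> (pattern name, priority).
# Rules, names, and priorities are illustrative assumptions, not FACT's.
RULES = {
    ("linear", "gelu"): ("gemm_bias_gelu", 2),
    ("matmul", "softmax", "matmul"): ("fused_attention", 3),
    ("matmul",): ("plain_gemm", 1),
}


@dataclass(frozen=True)
class Match:
    name: str      # name of the matched optimization pattern
    start: int     # index of the first op covered by the match
    length: int    # number of ops covered
    priority: int  # higher means "realize this pattern first"


def discover_patterns(ops):
    """Greedily take the longest matching rule at each position of the trace."""
    matches = []
    i = 0
    while i < len(ops):
        best = None
        for seq, (name, priority) in RULES.items():
            if tuple(ops[i:i + len(seq)]) == seq:
                if best is None or len(seq) > best.length:
                    best = Match(name, i, len(seq), priority)
        if best is None:
            i += 1            # no rule applies here; leave the op untouched
        else:
            matches.append(best)
            i += best.length  # skip past the ops this pattern consumes
    # Output the patterns in priority order, highest first.
    return sorted(matches, key=lambda m: -m.priority)


# A toy op trace standing in for a traced transformer block.
trace = ["layernorm", "matmul", "softmax", "matmul", "linear", "gelu", "matmul"]
patterns = discover_patterns(trace)
for m in patterns:
    print(m.name, m.start, m.length)
```

In this toy setting, each returned `Match` would be handed to a realization step that implements and tunes a kernel for it. Ops with no matching rule, such as `layernorm` here, would stay on the baseline path.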
