
Source: dev百科

—Christoph Thaiss

My best theory: the fused standard path wins because XLA sees the entire softmax(Q @ K.T) @ V expression at once and compiles it into one optimized kernel, so no intermediate matrices spill to HBM. My flash attention uses fori_loop, which XLA likely compiles as a generic sequential loop. It probably can't fuse across iterations, pipeline memory loads, or interleave independent work. (I haven't dumped the HLO to verify this; it's an inference from the benchmark numbers and XLA's documented behavior.)
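To make the comparison concrete, here is a minimal sketch of the two paths and of one way to dump the compiled HLO and check the theory. It is not the original benchmark code: the function names, the shapes, the block size, and the online-softmax formulation of the loop body are all assumptions made for illustration, and it omits the usual 1/sqrt(d) scaling to match the expression above.

```python
import functools

import jax
import jax.numpy as jnp
from jax import lax


@jax.jit
def standard_attention(q, k, v):
    # The whole expression is handed to XLA as a single graph.
    return jax.nn.softmax(q @ k.T, axis=-1) @ v


@functools.partial(jax.jit, static_argnames=("block",))
def blockwise_attention(q, k, v, block=32):
    # Hypothetical flash-style variant: walk over K/V in blocks with
    # lax.fori_loop, keeping a running max (m), running softmax
    # denominator (l), and an unnormalized output accumulator (acc).
    # Assumes the number of keys is divisible by `block`.
    n_q = q.shape[0]
    num_blocks = k.shape[0] // block

    def body(i, carry):
        m, l, acc = carry
        k_blk = lax.dynamic_slice_in_dim(k, i * block, block, axis=0)
        v_blk = lax.dynamic_slice_in_dim(v, i * block, block, axis=0)
        s = q @ k_blk.T                      # scores for this block
        m_new = jnp.maximum(m, s.max(axis=-1))
        p = jnp.exp(s - m_new[:, None])      # block-local exponentials
        corr = jnp.exp(m - m_new)            # rescale earlier blocks
        l_new = l * corr + p.sum(axis=-1)
        acc_new = acc * corr[:, None] + p @ v_blk
        return m_new, l_new, acc_new

    init = (
        jnp.full((n_q,), -jnp.inf, dtype=q.dtype),
        jnp.zeros((n_q,), dtype=q.dtype),
        jnp.zeros((n_q, v.shape[-1]), dtype=q.dtype),
    )
    _, l, acc = lax.fori_loop(0, num_blocks, body, init)
    return acc / l[:, None]


# Small illustrative inputs (shapes are arbitrary).
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (64, 16))
k = jax.random.normal(kk, (128, 16))
v = jax.random.normal(kv, (128, 16))

out_std = standard_attention(q, k, v)
out_blk = blockwise_attention(q, k, v, block=32)

# Optimized HLO for each path, as text. Searching it for "while" and
# "fusion" shows how the backend actually treated the loop.
hlo_std = standard_attention.lower(q, k, v).compile().as_text()
hlo_blk = blockwise_attention.lower(q, k, v, block=32).compile().as_text()
print("while ops (standard): ", hlo_std.count("while"))
print("while ops (blockwise):", hlo_blk.count("while"))
```

If the blockwise HLO contains a while op whose body holds its own separate fusions, that would support the theory; what the dump shows will depend on the backend and the JAX/XLA version.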
