\(
\newcommand{\cat}[1] {\mathrm{#1}}
\newcommand{\catobj}[1] {\operatorname{Obj}(\mathrm{#1})}
\newcommand{\cathom}[1] {\operatorname{Hom}_{\cat{#1}}}
\newcommand{\multiBetaReduction}[0] {\twoheadrightarrow_{\beta}}
\newcommand{\betaReduction}[0] {\rightarrow_{\beta}}
\newcommand{\betaEq}[0] {=_{\beta}}
\newcommand{\string}[1] {\texttt{"}\mathtt{#1}\texttt{"}}
\newcommand{\symbolq}[1] {\texttt{`}\mathtt{#1}\texttt{'}}
\newcommand{\groupMul}[1] { \cdot_{\small{#1}}}
\newcommand{\groupAdd}[1] { +_{\small{#1}}}
\newcommand{\inv}[1] {#1^{-1} }
\newcommand{\bm}[1] { \boldsymbol{#1} }
\require{physics}
\require{ams}
\require{mathtools}
\)
Math and science::INF ML AI
Naive, safe and online softmax
The safe softmax alters the naive softmax (how softmax is typically conceptualized) by first finding [what?], then subtracting it from every element so as to reduce the chance of overflow and underflow. The safe softmax requires more memory accesses. Online softmax reduces the number of memory accesses back to the same number used by the naive softmax.
Can you remember the three implementations?
Online softmax is a key ingredient of Flash Attention.
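Below is a minimal NumPy sketch (my own illustration, not taken from a particular source) contrasting the three variants; the function names and the 1-D input are assumptions made for the example.

```python
import numpy as np

def naive_softmax(x):
    # Two passes: exponentiate and sum, then normalize.
    # exp(x_i) can overflow for large x_i.
    e = np.exp(x)
    return e / e.sum()

def safe_softmax(x):
    # Extra pass to find the maximum; subtracting it keeps every exponent <= 0,
    # so exp cannot overflow (tiny values may underflow to 0 harmlessly).
    m = np.max(x)
    e = np.exp(x - m)
    return e / e.sum()

def online_softmax(x):
    # Fuse the max and sum passes: keep a running maximum and rescale the
    # running sum whenever a new maximum appears, recovering the naive
    # softmax's memory-access count while staying numerically safe.
    m = -np.inf
    s = 0.0
    for xi in x:
        m_new = max(m, xi)
        s = s * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    return np.exp(x - m) / s

# Quick check that all three agree on well-behaved input.
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(naive_softmax(x), safe_softmax(x))
assert np.allclose(safe_softmax(x), online_softmax(x))
```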