Commit ecaf051c authored by Daan Sprenkels

Resolve TODOs in implementation

parent 1e7ff36e
@@ -72,7 +72,7 @@ on $256$-bit \texttt{ymm} vector registers.
\subheading{Representation of prime-field elements.}
Using doubles with 53-bit mantissa, we can emulate integer registers of 53 bits.
To guarantee that no rounding errors occur in the underlying floating-point arithmetic,
we use carry chains\footnote{Also called ``coefficient reduction''}
we use carry chains\footnote{Also called ``coefficient reduction''.}
to reduce the amount of bits in each register before performing operations that might overflow.
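As an illustration of such a carry chain, the following C sketch shows a single carry step between two double-precision limbs, assuming for concreteness a limb boundary at $2^{21}$; the helper name and constants are chosen for exposition only and are not taken from our code. It uses the standard trick of adding and then subtracting a large constant to round a double to a multiple of $2^{21}$.
\begin{Verbatim}
#include <stdio.h>

/* One carry ("coefficient reduction") step between two double limbs.
 * f0 may have grown close to 53 bits; we round it to a multiple of 2^21,
 * move that part into f1 (whose value already carries the factor 2^21),
 * and keep only the small remainder in f0.
 * alpha = 1.5 * 2^(52+21): adding it forces rounding to a multiple of 2^21
 * in round-to-nearest mode, and subtracting it again is exact. */
static void carry_step(double *f0, double *f1)
{
    const double alpha = 0x1.8p73;
    double carry = (*f0 + alpha) - alpha; /* f0 rounded to a multiple of 2^21 */
    *f0 -= carry;                         /* remainder, at most ~21 bits      */
    *f1 += carry;                         /* absorbed by the next limb        */
}

int main(void)
{
    double f0 = 123456789012345.0, f1 = 0.0;
    carry_step(&f0, &f1);
    printf("f0 = %.0f, f1 = %.0f\n", f0, f1);
    return 0;
}
\end{Verbatim}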
Building on this approach, \cite{Ber06} recommends---%
but does not implement---%
@@ -159,16 +159,27 @@ f_8 &\rightarrow &f_9 &\rightarrow &f_{10} &\rightarrow &f_{11} &\rightarrow &f_
For radix-$2^{21.25}$, we use basic $4\times$ parallel Karatsuba multiplication~\cite{KO62}, drawing inspiration from~\cite{HS15}. An inconvenience of implementing Karatsuba with floating-point limbs is that the shift-by-128-bit operations cannot be optimized out. Instead, we have to explicitly multiply some limbs by~$2^{\pm128}$. This costs $23$ extra multiplication ops (implemented using $12$ \texttt{vmulpd}s and $11$ \texttt{vandpd}s).
Still, the Karatsuba implementation, which contains $131$ \texttt{vmulpd} instructions, was measured to be $8\%$ faster than the schoolbook method (which contains $155$ \texttt{vmulpd} instructions).
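As a sketch of what such an explicit shift amounts to, scaling four limbs at once by $2^{128}$ is one \texttt{vmulpd} with an exact power-of-two constant; the intrinsic code below is illustrative only, and the function name is ours.
\begin{Verbatim}
#include <immintrin.h>

/* With integer limbs, a shift by 128 bits is free (it merely renames limbs);
 * with limbs held in doubles it must be performed explicitly, e.g. as a
 * multiplication by the exact constant 2^128 (or 2^-128), which only changes
 * the exponent field and therefore introduces no rounding error. */
static inline __m256d scale_by_2_128(__m256d limbs)
{
    const __m256d two_128 = _mm256_set1_pd(0x1p128);
    return _mm256_mul_pd(limbs, two_128); /* one vmulpd per four limbs */
}
\end{Verbatim}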
\todo{Tell why we do not batch most of the stuff in Sandy Bridge.}
\todo{Refer to appendix.}
\subheading{Vectorization strategy.}
We group the multiplications from both algorithms in three batches each, which have been chosen such that the amount of carry chains is minimized. In particular, we cannot optimize the squaring operations in \Double{} using the $2\alpha\beta = (\alpha + \beta)^2 - \alpha^2 - \beta^2$ rule, because $\alpha + \beta$ has too little headroom to be squared without doing an additional carry chain.
We group the multiplications from both algorithms in three batches each, which
have been chosen such that the complexity of the operations
in between the multiplications is minimized.
The resulting algorithms are listed in Listings~\ref{alg:add_asm_sandybridge} and~\ref{alg:double_asm_sandybridge}.
In particular, we cannot optimize the squaring operations in \Double{} using the $2\alpha\beta = (\alpha + \beta)^2 - \alpha^2 - \beta^2$ rule, because $\alpha + \beta$ has too little headroom to be squared without doing an additional carry chain.
Because we cannot perform shift operations on floating-point values, and because the reciprocal throughputs of \texttt{vmulpd} and \texttt{v\{add,sub\}pd} are both $1\unit{cc}$, we replace all chained additions by multiplications.
% This eliminates $\{v_{21}, v_{26}, v_{30}, v_{32}\}$ from \Add{}, and $\{v_{7}, v_{10}, v_{16}, v_{21}, v_{23}\}$ from \Double{}.
This trades $8\mathbf{a}$ for $4\mathbf{m}$ in \Add{}, and $10\mathbf{a}$ for $5\mathbf{m}$ in \Double{}.
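For example, doubling a value twice takes two dependent \texttt{vaddpd} instructions, whereas one \texttt{vmulpd} by the exact constant $4$ computes the same result at the same reciprocal throughput; the snippet below is an illustration, not code from our implementation.
\begin{Verbatim}
#include <immintrin.h>

static inline __m256d times4_with_adds(__m256d x)
{
    __m256d t = _mm256_add_pd(x, x);              /* 2x */
    return _mm256_add_pd(t, t);                   /* 4x: two dependent vaddpd */
}

static inline __m256d times4_with_mul(__m256d x)
{
    return _mm256_mul_pd(x, _mm256_set1_pd(4.0)); /* 4x: a single vmulpd      */
}
\end{Verbatim}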
Last, we found that support for shuffling values in \texttt{ymm} registers is
relatively weak and therefore expensive:
Sandy Bridge has no arbitrary cross-lane shuffle instruction,
such as \texttt{vpermq}.
To move every value in a \texttt{ymm} register into the correct lane,
we would need at least two µops on port 5.
It is therefore cheaper to keep all the values in the first lane and
accept that most of the additions and subtractions are unbatched.
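To illustrate the difference, reversing the four doubles in a \texttt{ymm} register needs two shuffle µops on Sandy Bridge (both competing for port 5), whereas AVX2 provides a single cross-lane shuffle; the intrinsics below are a generic example and not taken from our code.
\begin{Verbatim}
#include <immintrin.h>

/* Sandy Bridge (AVX): [a0,a1,a2,a3] -> [a3,a2,a1,a0] takes two port-5 uops. */
static inline __m256d reverse_avx(__m256d x)
{
    __m256d t = _mm256_permute2f128_pd(x, x, 0x01); /* swap 128-bit halves   */
    return _mm256_permute_pd(t, 0x5);               /* swap within each half */
}

#ifdef __AVX2__
/* Haswell (AVX2): the same permutation is a single vpermpd. */
static inline __m256d reverse_avx2(__m256d x)
{
    return _mm256_permute4x64_pd(x, 0x1B);
}
#endif
\end{Verbatim}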
% \begin{Verbatim}
% - No dedicated squaring.
% - If we apply (a + b)^2 - a^2 - b^2 trick, then (a + b)^2 overflows
@@ -271,10 +282,11 @@ In~\cite{OLH+18}, T.~Oliveira et~al.\ make use of the Bit Manipulation Instructi
% Though---in return for this speedup---our additions and subtractions become more expensive.
However, experiments showed that the penalty introduced by the more expensive additions and subtractions outweighed the performance gain achieved by using \texttt{mulx}.
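For reference, the appeal of \texttt{mulx} is that it writes the full $64\times64$-bit product to two freely chosen registers without touching the flags, so it can be interleaved with add-with-carry chains; the snippet below is a generic illustration (not code from either implementation) and requires compiling with \texttt{-mbmi2}.
\begin{Verbatim}
#include <immintrin.h>
#include <stdint.h>

/* Full 64x64 -> 128-bit product via BMI2's mulx; the flags are left intact. */
static inline uint64_t mul64_wide(uint64_t a, uint64_t b, uint64_t *hi)
{
    unsigned long long high;
    uint64_t lo = _mulx_u64(a, b, &high); /* a single mulx instruction */
    *hi = (uint64_t)high;
    return lo;
}
\end{Verbatim}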
\todo{Refer to appendix.}
\subheading{Vectorization strategy.}
Similar to the Sandy Bridge implementation, we batched all multiplications in the \Add{} algorithm into three different batches. In the implementation of \Double{}, we applied the squaring trick described in \Cref{sec:additionformulas}, and rewrite $v_7 = 2XZ = (X+Z)^2 - X^2 - Z^2$. After replacing the multiplication $v_6$ by a squaring, we can replace the first
Similar to the Sandy Bridge implementation, we batched all multiplications in the \Add{} algorithm into three different batches.
The Haswell algorithms are listed in Listings~\ref{alg:add_asm_haswell} and~\ref{alg:double_asm_haswell}.
In the implementation of \Double{}, we apply the squaring trick described in \Cref{sec:additionformulas} and rewrite $v_7 = 2XZ = (X+Z)^2 - X^2 - Z^2$. After replacing the multiplication $v_6$ by a squaring, we can replace the first
of the three multiplication batches in \Double{} with
a batched squaring operation that computes the values $\{(X+Z)^2, v_1, v_2, v_3\}$.
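For reference, the algebra behind this substitution is the expansion
\[
  (X+Z)^2 - X^2 - Z^2 = (X^2 + 2XZ + Z^2) - X^2 - Z^2 = 2XZ,
\]
which trades the multiplication $X \cdot Z$ for one squaring, at the cost of one extra addition and two extra subtractions whenever $X^2$ and $Z^2$ are available anyway.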