impl: Some tweaks

Commit 1e7ff36e authored 5 years ago by Daan Sprenkels
--- a/gen_algorithms.py
+++ b/gen_algorithms.py
@@ -383,10 +383,16 @@ sb_double.write(batch(
 sb_double.write(batch(
    r'-',
    ('v_{12}', 'v_{ 2}', 'v_{11}'),
+    (None, None, None),
+    (None, None, None),
+    (None, None, None),
 ))
 sb_double.write(batch(
    r'+',
    ('v_{13}', 'v_{ 2}', 'v_{11}'),
+    (None, None, None),
+    (None, None, None),
+    (None, None, None),
 ))
 sb_double.write(
    batch(
@@ -398,7 +404,7 @@ sb_double.write(
    ))
 sb_double.write(batch(
    r'c',
-    ('v_{32}', 'v_{15}', 'v_{14}'),
+    ('v_{32}', 'v_{15}', 'v_{14}', None),
 ))
 sb_double.write(batch(
    r'-',

--- a/implementation.tex
+++ b/implementation.tex
@@ -19,11 +19,17 @@ The subroutine $\textsc{RecodeSignedWindow}_5$
 computes a vector of coefficients $k' = (k'_0,\dots,k'_{50})$, 
 such that $k = k'_0 + 32k'_1 + \dots + 2^{250}k'_{50}$ and
 $k'_i \in \{-16,\dots,15\}$.
-\todo{Say something about constant-time lookups and negation.}
+
+The table lookup is implemented in a traditional scanning fashion:
+selecting the required value using a bitwise AND operation.
+Where we use an unsigned representation,
+we compute the conditional negation of $Y$
+by negating $Y$ and selecting the correct result using bitwise operations. When the representation is signed we use a single XOR operation to conditionally flip the sign bit.
+These operations, like the rest of the code, are implemented in constant time.

 \begin{algorithm}[h]
 \caption{Signed double-and-add describe the used functions}\label{alg:doubleandadd}
-\begin{algorithmic}
+\begin{algorithmic}[1]
 \Function{DoubleAndAdd}{$k,P$}%\Comment{Compute $[k]P$}
 \State $\mathbf{T} \gets (\mathcal{O}, P, \ldots, [16]P)$\Comment{Precompute $([2]P, \ldots, [16]P)$}
  \State $k'\gets \textsc{RecodeSignedWindow}_5(k)$
@@ -153,7 +159,10 @@ f_8 &\rightarrow &f_9 &\rightarrow &f_{10} &\rightarrow &f_{11} &\rightarrow &f_
 For radix-$2^{21.25}$, we use basic $4\times$ parallel Karatsuba multiplication~\cite{KO62}, using inspiration from~\cite{HS15}. An inconvenience introduced by implementing Karatsuba using floating points, is that the shift-by-128-bit operations cannot be optimized out. Instead, we have to explicitly multiply some limbs by~$2^{\pm128}$. This costs $23$ extra multiplication ops (implemented using $12$ \texttt{vmulpd}s, and $11$ \texttt{vandpd}s).
 Still, the Karatsuba implementation, which contains $131$ \texttt{vmulpd} instructions, was measured to be $8\%$ faster than the schoolbook method (which contains $155$ \texttt{vmulpd} instructions).

-\subheading{Application of formulas.}
+\todo{Tell why we do not batch most of the stuff in Sandy Bridge.}
+\todo{Refer to appendix.}
+
+\subheading{Vectorization strategy.}
 We group the multiplications from both algorithms in three batches each, which have been chosen such that the number of carry chains is minimized. In particular, we cannot optimize the squaring operations in \Double{} using the $2\alpha\beta = (\alpha + \beta)^2 - \alpha^2 - \beta^2$ rule, because $\alpha + \beta$ has too little headroom to be squared without doing an additional carry chain.

 Because we cannot perform shift operations on floating-point values, and because the reciprocal throughput of \texttt{vmulpd} and \texttt{v\{add,sub\}pd} are both $1\unit{cc}$, we replace all chained additions by multiplications.
@@ -262,7 +271,9 @@ In~\cite{OLH+18}, T.~Oliveira et~al.\ make use of the Bit Manipulation Instructi
 % Though---in return for this speedup---our additions and subtractions become more expensive.
 However, experiments showed that the penalty introduced by more expensive additions/subtractions outweighed the performance gain achieved by using \texttt{mulx}.

-\subheading{Application of formulas.}
+\todo{Refer to appendix.}
+
+\subheading{Vectorization strategy.}
 Similar to the Sandy Bridge implementation, we batched all multiplications in the \Add{} algorithm into three different batches. In the implementation of \Double{}, we applied the squaring trick described in \Cref{sec:additionformulas}, and rewrote $v_7 = 2XZ = (X+Z)^2 - X^2 - Z^2$. After replacing the multiplication $v_6$ by a squaring, we can replace the first
 of the three multiplication batches in \Double{} with
 a batched squaring operation, that computes the values $\{(X+Z)^2, v_1, v_2, v_3\}$.
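
Note on the first implementation.tex hunk: the new paragraph describes a scanning table lookup, a conditional negation by selection, and a sign-bit XOR, next to the existing description of the signed window recoding. The following Python is a minimal sketch of those four ideas and is not taken from the commit. All function names are hypothetical, table entries are single 64-bit words rather than full curve points, and the modulus is a stand-in. Python integers give no constant-time guarantee, so this shows only the masking logic, not a side-channel-safe implementation.

```python
import struct

WORD = (1 << 64) - 1  # emulate 64-bit machine words


def recode_signed_window_5(k, digits=51):
    """Return digits k'_i in {-16..15} with k == sum(k'_i * 32**i)."""
    out = []
    for _ in range(digits):
        d = k & 31        # low five bits, in 0..31
        if d >= 16:       # branch kept for readability; not constant time
            d -= 32       # map 16..31 to -16..-1
        out.append(d)
        k = (k - d) >> 5  # exact division by 32; may carry into the next digit
    if k != 0:
        raise ValueError("scalar too large for the requested number of digits")
    return out


def scan_lookup(table, index):
    """Read every entry and keep table[index] using an AND mask."""
    result = 0
    for i, entry in enumerate(table):
        diff = i ^ index                            # zero only when i == index
        mask = -(((diff - 1) & WORD) >> 63) & WORD  # all ones iff diff == 0
        result |= entry & mask
    return result


def cond_negate_unsigned(y, flag, p):
    """Return p - y if flag == 1, else y; both candidates are always computed."""
    mask = -flag & WORD          # all ones when flag == 1, zero otherwise
    neg = (p - y) & WORD
    return y ^ (mask & (y ^ neg))


def cond_negate_double(x, flag):
    """Flip the IEEE-754 sign bit of x when flag == 1, using a single XOR."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<Q", bits ^ (flag << 63)))[0]
```

In a double-and-add loop, each recoded digit would be split into a magnitude and a sign flag: the magnitude (0 to 16) indexes the 17-entry table through the scan, and the flag drives the conditional negation of $Y$.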

--- a/prelim.tex
+++ b/prelim.tex
@@ -145,7 +145,6 @@ The cost of the doubling formulas is $8\mathbf{M} + 3\mathbf{S} + 2\mathbf{m_b}
 \Procedure{Double}{$(X : Y : Z)$}
 % \Comment{Compute $[2](X : Y : Z)$}
 \vspace{-0.5em}
-
 \begin{multicols}{3}
 \par $v_{ 1} \gets X \cdot X$
 \par $v_{ 2} \gets Y \cdot Y$