From 9858aedb98c3c9a48ca8167d14d8e98d8c45d15f Mon Sep 17 00:00:00 2001
From: Daan Sprenkels <daan@dsprenkels.com>
Date: Thu, 3 Oct 2019 13:13:40 +0200
Subject: [PATCH] Tweak text

---
 implementation.tex | 6 +++---
 results.tex        | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/implementation.tex b/implementation.tex
index 5bef68b..b1f51b2 100644
--- a/implementation.tex
+++ b/implementation.tex
@@ -24,7 +24,7 @@ The table lookup is implemented in a traditional scanning fashion:
 selecting the required value using a bitwise AND operation.
 Where we use an unsigned representation, we compute the conditional negation of $Y$
-by negating $Y$ and selecting the correct result using bitwise operations. When the representation is signed,
+by negating $Y$ and selecting the correct result using bitwise operations. When using floating points,
 we use a single XOR operation to conditionally flip the sign bit.
 These operations are---as well as the rest of the code---implemented in constant-time.
@@ -215,11 +215,11 @@ This substitutes $8\mathbf{a}$ for $4\mathbf{m}$ in \Add{}, and $10\mathbf{a}$ f
 Last, we found that shuffling the \texttt{ymm} registers turns out to be relatively weak and
 expensive.
 That is because Sandy Bridge has no arbitrary shuffle instruction
-(like the \texttt{vpermq} instruction from AVX2).
+(like the \texttt{vpermq} instruction in AVX2).
 To shuffle every value in a \texttt{ymm} register into the correct lane,
 we would need at least two µops on port 5.
 Then it is cheaper to put all the values in the first lane, and
-accept most of the additions and subtractions are not batched.
+accept that most of the additions and subtractions are not batched.

diff --git a/results.tex b/results.tex
index f177f35..46848bc 100644
--- a/results.tex
+++ b/results.tex
@@ -11,7 +11,7 @@ all Hyper-Threading cores shut down, and with the CPU clocked at the maximum
 nominal frequency.
 The STM32F407 device was run with its default settings, as listed in the datasheet~\cite{STM32F407}
-(i.e.~clocked from the internal 16\unit{MHz} internal RC-oscillator).
+(i.e.~clocked from the 16\unit{MHz} internal RC-oscillator).
 We list the benchmarking results in Table~\ref{tab:benchmarks}.
 As expected,
 none of our implementations exceed the performance of Curve25519.
@@ -21,7 +21,7 @@ none of our implementations exceed the performance of Curve25519.
 caption = {Measured cycle counts of the variable-basepoint scalar-multiplication
 routines on the {Sandy Bridge} (SB), {Ivy Bridge} (IB), {Haswell} (H) and
 {Cortex M4} (M4) architectures.
- For completeness (by the request of our reviewers), we have included cycle
+ For completeness, we have included cycle
 counts for Ed25519 signatures verification (which is the operation in
 Ed25519 that computes variable-basepoint scalar-multiplication).
 },
-- 
GitLab