Commit 12e23f72 authored by Benoit Viguier's avatar Benoit Viguier
Browse files

splitting Preliminaries

parent 70243d3b
......@@ -6,10 +6,6 @@ well-marked appendices, and supplementary material."
- Emphasize:\\
+ the added confidence between in X25519 Correctness\\
+ proof of real world software\\
- Changed made to TweetNaCl:\\
+ we contacted the authors.\\
+ only change made are the one required by VST\\
}
With respect to the reviews:
......
......@@ -118,7 +118,7 @@ Finally in \sref{sec:Conclusion} we discuss the trusted code base of our proofs.
\begin{figure}[h]
\centering
\include{tikz/proof}
\caption{Structure of the proof}
\caption{Structure of the proof.}
\label{tikz:ProofOverview}
\end{figure}
......
......@@ -4,386 +4,3 @@
In this section, we first give a brief summary of the mathematical background
on elliptic curves. We then describe X25519 and its implementation in TweetNaCl.
Finally, we provide a brief description of the formal tools we use in our proofs.
\subsection{Arithmetic on Montgomery curves}
\label{subsec:arithmetic-montgomery}
\begin{dfn}
Given a field \K, and $a,b \in \K$ such that $a^2 \neq 4$ and $b \neq 0$,
$M_{a,b}$ is the Montgomery curve defined over $\K$ with equation
$$M_{a,b}: by^2 = x^3 + ax^2 + x.$$
\end{dfn}
\begin{dfn}
For any algebraic extension $\L \supseteq \K$, we call
$M_{a,b}(\L)$ the set of $\L$-rational points, defined as
$$M_{a,b}(\L) = \{\Oinf\} \cup \{(x,y) \in \L \times \L~|~by^2 = x^3 + ax^2 + x\}.$$
Here, the additional element $\Oinf$ denotes the point at infinity.
\end{dfn}
Details of the formalization can be found in \sref{subsec:ECC-Montgomery}.
For $M_{a,b}$ over a finite field $\F{p}$, the parameter $b$ is known as the ``twisting factor''.
For $b'\in \F{p}\backslash\{0\}$ and $b' \neq b$, the curves $M_{a,b}$ and $M_{a,b'}$
are isomorphic via $(x,y) \mapsto (x, \sqrt{b/b'} \cdot y)$.
\begin{dfn}
When $b'/b$ is not a square in \F{p}, $M_{a,b'}$ is a \emph{quadratic twist} of $M_{a,b}$, i.e.,
a curve that is isomorphic over $\F{p^2}$~\cite{cryptoeprint:2017:212}.
\end{dfn}
Points in $M_{a,b}(\K)$ can be equipped with a structure of an abelian group
with the addition operation $+$ and with neutral element the point at infinity $\Oinf$.
For a point $P \in M_{a,b}(\K)$ and a positive integer $n$ we obtain the scalar product
$$n\cdot P = \underbrace{P + \cdots + P}_{n\text{ times}}.$$
In order to efficiently compute the scalar multiplication we use an algorithm
similar to square-and-multiply: the Montgomery ladder where the basic operations
are differential addition and doubling~\cite{MontgomerySpeeding}.
We consider \xcoord-only operations. Throughout the computation,
these $x$-coordinates are kept in projective representation
$(X : Z)$, with $x = X/Z$; the point at infinity is represented as $(1:0)$.
See \sref{subsec:ECC-projective} for more details.
We define two operations:
\begin{align*}
\texttt{xADD} &: (x_{Q-P}, (X_P:Z_P), (X_Q:Z_Q)) \mapsto \\
&(X_{P + Q}:Z_{P + Q})\\
\texttt{xDBL} &: (X_P:Z_P) \mapsto (X_{2 \cdot P}:Z_{2 \cdot P})
\end{align*}
In the Montgomery ladder, % notice that
the arguments of \texttt{xADD} and \texttt{xDBL}
are swapped depending of the value of the $k^{\text{th}}$ bit. We use a conditional
swap \texttt{CSWAP} to change the arguments of the above function while keeping
the same body of the loop.
Given a pair $(P_0, P_1)$ and a bit $b$, \texttt{CSWAP} returns the pair
$(P_b, P_{1-b})$.
By using the differential addition and doubling operations we define the Montgomery ladder
computing a \xcoord-only scalar multiplication (see \aref{alg:montgomery-ladder}).
\begin{algorithm}
\caption{Montgomery ladder for scalar mult.}
\label{alg:montgomery-ladder}
\begin{algorithmic}
\REQUIRE{\xcoord $x_P$ of a point $P$, scalar $n$ with $n < 2^m$}
\ENSURE{\xcoord $x_Q$ of $Q = n \cdot P$}
\STATE $Q = (X_Q:Z_Q) \leftarrow (1:0)$
\STATE $R = (X_R:Z_R) \leftarrow (x_P:1)$
\FOR{$k$ := $m$ down to $1$}
\STATE $(Q,R) \leftarrow \texttt{CSWAP}((Q,R), k^{\text{th}}\text{ bit of }n)$
\STATE $Q \leftarrow \texttt{xDBL}(Q)$
\STATE $R \leftarrow \texttt{xADD}(x_P,Q,R)$
\STATE $(Q,R) \leftarrow \texttt{CSWAP}((Q,R), k^{\text{th}}\text{ bit of }n)$
\ENDFOR
\RETURN $X_Q/Z_Q$
\end{algorithmic}
\end{algorithm}
\subsection{The X25519 key exchange}
\label{subsec:X25519-key-exchange}
From now on let $\F{p}$ be the field with $p=2^{255}-19$ elements.
We consider the elliptic curve $E$ over $\F{p}$ defined by the
equation $y^2 = x^3 + 486662 x^2 + x$.
For every $x \in \F{p}$ there exists a point $P$ in $E(\F{p^2})$
such that $x$ is the \xcoord of $P$.
The core of the X25519 key-exchange protocol is a scalar\hyp{}multiplication
function, which we will also refer to as X25519.
This function receives as input two arrays of $32$ bytes each.
One of them is interpreted as the little-endian encoding of a
non-negative integer $n$ (see \ref{subsec:integer-bytes}).
The other is interpreted as the little-endian encoding of
the \xcoord $x_P \in \F{p}$ of a point in $E(\F{p^2})$,
using the standard mapping of integers modulo $p$ to elements in $\F{p}$.
The X25519 function first computes a scalar $n'$ by setting bit 255 of $n$
to \texttt{0}, setting bit 254 to \texttt{1}, and setting the lower 3 bits to \texttt{0}.
This operation is often called ``clamping'' of the scalar $n$.
Note that $n' \in 2^{254} + 8\{0,1,\ldots,2^{251}-1\}$.
X25519 then computes the \xcoord of $n'\cdot P$.
RFC~7748~\cite{rfc7748} standardized the X25519 Diffie–Hellman key-exchange algorithm.
Given the base point $B$ where $X_B=9$, each party generates a secret random number
$s_a$ (respectively $s_b$), and computes $X_{P_a}$ (respectively $X_{P_b}$), the
\xcoord of $P_A = s_a \cdot B$ (respectively $P_B = s_b \cdot B$).
The parties exchange $X_{P_a}$ and $X_{P_b}$ and compute their shared secret with
X25519 on $s_a$ and $X_{P_b}$ (respectively $s_b$ and $X_{P_a}$).
\subsection{TweetNaCl specifics}
\label{subsec:Number-TweetNaCl}
As its name suggests, TweetNaCl aims for code compactness (\emph{``a crypto library in 100 tweets''}).
As a result it uses a few defines and typedefs to gain precious bytes while
still remaining human-readable.
\begin{lstlisting}[language=Ctweetnacl]
#define FOR(i,n) for (i = 0;i < n;++i)
#define sv static void
typedef unsigned char u8;
typedef long long i64;
\end{lstlisting}
TweetNaCl functions take pointers as arguments. By convention the first one
points to the output array. It is then followed by the input arguments.
Due to some limitations of the VST, indexes used in \TNaCle{for} loops have to be
of type \TNaCle{int} instead of \TNaCle{i64}. We change the code to allow our
proofs to carry through. We believe this does not affect the correctness of the
original code. A complete diff of our modifications to TweetNaCl can be found in
Appendix~\ref{verified-C-and-diff}.
\subsection{X25519 in TweetNaCl}
\label{subsec:X25519-TweetNaCl}
We now describe the implementation of X25519 in TweetNaCl.
\subheading{Arithmetic in \Ffield.}
In X25519, all computations are performed in $\F{p}$.
Throughout the computation, elements of that field
are represented in radix $2^{16}$,
i.e., an element $A$ is represented as $(a_0,\dots,a_{15})$,
with $A = \sum_{i=0}^{15}a_i2^{16i}$.
The individual ``limbs'' $a_i$ are represented as
64-bit \TNaCle{long long} variables:
\begin{lstlisting}[language=Ctweetnacl]
typedef i64 gf[16];
\end{lstlisting}
Conversion from the input byte array to this representation in radix $2^{16}$ is
done as follows:
\begin{lstlisting}[language=Ctweetnacl]
sv unpack25519(gf o, const u8 *n)
{
int i;
FOR(i,16) o[i]=n[2*i]+((i64)n[2*i+1]<<8);
o[15]&=0x7fff;
}
\end{lstlisting}
The radix-$2^{16}$ representation in limbs of $64$ bits is
highly redundant; for any element $A \in \Ffield$ there are
multiple ways to represent $A$ as $(a_0,\dots,a_{15})$.
This is used to avoid carry handling in the implementations of addition
(\TNaCle{A}) and subtraction (\TNaCle{Z}) in $\Ffield$:
\begin{lstlisting}[language=Ctweetnacl]
sv A(gf o,const gf a,const gf b) {
int i;
FOR(i,16) o[i]=a[i]+b[i];
}
sv Z(gf o,const gf a,const gf b) {
int i;
FOR(i,16) o[i]=a[i]-b[i];
}
\end{lstlisting}
Multiplications (\TNaCle{M}) also heavily exploit the redundancy
of the representation to delay carry handling.
\begin{lstlisting}[language=Ctweetnacl]
sv M(gf o,const gf a,const gf b) {
i64 i,j,t[31];
FOR(i,31) t[i]=0;
FOR(i,16) FOR(j,16) t[i+j]+=a[i]*b[j];
FOR(i,15) t[i]+=38*t[i+16];
FOR(i,16) o[i]=t[i];
car25519(o);
car25519(o);
}
\end{lstlisting}
After the multiplication, the limbs of the result \texttt{o} are
too large to be used again as input.
% which is why
The two calls to
\TNaCle{car25519} at the end of \TNaCle{M} propagate the carries through the limbs:
\begin{lstlisting}[language=Ctweetnacl]
sv car25519(gf o)
{
int i;
FOR(i,16) {
o[(i+1)%16]+=(i<15?1:38)*(o[i]>>16);
o[i]&=0xffff;
}
}
\end{lstlisting}
In order to simplify the verification of this function,
we treat the last iteration of the loop $i = 15$ as a separate step.
Inverses in $\Ffield$ are computed with \TNaCle{inv25519}.
This function uses exponentiation by $2^{255}-21$,
computed with the square-and-multiply algorithm.
Fermat's little theorem gives the correctness.
Notice that in this case the inverse of $0$ is defined as $0$.
\begin{lstlisting}[language=Ctweetnacl]
sv inv25519(gf o,const gf i)
{
gf c;
int a;
set25519(c,i);
for(a=253;a>=0;a--) {
S(c,c);
if(a!=2&&a!=4) M(c,c,i);
}
FOR(a,16) o[a]=c[a];
}
\end{lstlisting}
\TNaCle{sel25519} implements a constant-time conditional swap (\texttt{CSWAP}) by
applying a mask between two fields elements.
\begin{lstlisting}[language=Ctweetnacl]
sv sel25519(gf p,gf q,i64 b)
{
int i;
i64 t,c=~(b-1);
FOR(i,16) {
t= c&(p[i]^q[i]);
p[i]^=t;
q[i]^=t;
}
}
\end{lstlisting}
Finally, we need the \TNaCle{pack25519} function,
which converts from the internal redundant radix-$2^{16}$
representation to a unique byte array representing an
integer in $\{0,\dots,p-1\}$ in little-endian format.
\begin{lstlisting}[language=Ctweetnacl]
sv pack25519(u8 *o,const gf n)
{
int i,j;
i64 b;
gf t,m={0};
set25519(t,n);
car25519(t);
car25519(t);
car25519(t);
FOR(j,2) {
m[0]=t[0]- 0xffed;
for(i=1;i<15;i++) {
m[i]=t[i]-0xffff-((m[i-1]>>16)&1);
m[i-1]&=0xffff;
}
m[15]=t[15]-0x7fff-((m[14]>>16)&1);
m[14]&=0xffff;
b=1-((m[15]>>16)&1);
sel25519(t,m,b);
}
FOR(i,16) {
o[2*i]=t[i]&0xff;
o[2*i+1]=t[i]>>8;
}
}
\end{lstlisting}
As we can see, this function is considerably more complex than the
unpacking function. The reason is that it needs to convert
to a \emph{unique} representation, i.e., also fully reduce modulo
$p$ and remove the redundancy of the radix-$2^{16}$ representation.
\subheading{The Montgomery ladder.}
With these low-level arithmetic and helper functions defined,
we can now turn our attention to the core of the X25519 computation:
the \TNaCle{crypto_scalarmult} API function of TweetNaCl,
which is implemented through the Montgomery ladder.
\begin{lstlisting}[language=Ctweetnacl]
int crypto_scalarmult(u8 *q,
const u8 *n,
const u8 *p)
{
u8 z[32];
i64 r;
int i;
gf x,a,b,c,d,e,f;
FOR(i,31) z[i]=n[i];
z[31]=(n[31]&127)|64;
z[0]&=248;
unpack25519(x,p);
FOR(i,16) {
b[i]=x[i];
d[i]=a[i]=c[i]=0;
}
a[0]=d[0]=1;
for(i=254;i>=0;--i) {
r=(z[i>>3]>>(i&7))&1;
sel25519(a,b,r);
sel25519(c,d,r);
A(e,a,c);
Z(a,a,c);
A(c,b,d);
Z(b,b,d);
S(d,e);
S(f,a);
M(a,c,a);
M(c,b,e);
A(e,a,c);
Z(a,a,c);
S(b,a);
Z(c,d,f);
M(a,c,_121665);
A(a,a,d);
M(c,c,a);
M(a,d,f);
M(d,b,x);
S(b,e);
sel25519(a,b,r);
sel25519(c,d,r);
}
inv25519(c,c);
M(a,a,c);
pack25519(q,a);
return 0;
}
\end{lstlisting}
%XXX: I really want line numbers here to link lines in this code to
% the pseudocode description
\subsection{Coq, separation logic, and the VST}
\label{subsec:Coq-VST}
Coq~\cite{coq-faq} is an interactive theorem prover based on type theory. It
provides an expressive formal language to write mathematical definitions,
algorithms, and theorems together with their proofs. It has been used in the proof
of the four-color theorem~\cite{gonthier2008formal} and it is also the system
underlying the CompCert formally verified C compiler~\cite{Leroy-backend}.
Unlike systems like F*~\cite{DBLP:journals/corr/BhargavanDFHPRR17},
Coq does not rely on an SMT solver in its trusted code base.
It uses its type system to verify the applications of hypotheses,
lemmas, and theorems~\cite{Howard1995-HOWTFN}.
Hoare logic is a formal system which allows reasoning about programs.
It uses triples such as
$$\{{\color{doc@lstnumbers}\textbf{Pre}}\}\texttt{~Prog~}\{{\color{doc@lstdirective}\textbf{Post}}\}$$
where ${\color{doc@lstnumbers}\textbf{Pre}}$ and ${\color{doc@lstdirective}\textbf{Post}}$
are assertions and \texttt{Prog} is a fragment of code.
It is read as
``when the precondition ${\color{doc@lstnumbers}\textbf{Pre}}$ is met,
executing \texttt{Prog} will yield postcondition ${\color{doc@lstdirective}\textbf{Post}}$''.
We use compositional rules to prove the truth value of a Hoare triple.
For example, here is the rule for sequential composition:
\begin{prooftree}
\AxiomC{$\{P\}C_1\{Q\}$}
\AxiomC{$\{Q\}C_2\{R\}$}
\LeftLabel{Hoare-Seq}
\BinaryInfC{$\{P\}C_1;C_2\{R\}$}
\end{prooftree}
Separation logic is an extension of Hoare logic which allows reasoning about
pointers and memory manipulation. This logic enforces strict conditions on the
memory shared such as being disjoint.
We discuss this limitation further in \sref{subsec:with-VST}.
The Verified Software Toolchain (VST)~\cite{cao2018vst-floyd} is a framework
which uses Separation logic proven correct with respect to CompCert semantics
to prove the functional correctness of C programs.
The first step consists of translating the source code into Clight,
an intermediate representation used by CompCert.
For such purpose one uses the parser of CompCert called \texttt{clightgen}.
In a second step one defines the Hoare triple representing the specification of
the piece of software one wants to prove. Then using the VST, one uses a strongest
postcondition approach to prove the correctness of the triple.
This approach can be seen as a forward symbolic execution of the program.
\subsection{Arithmetic on Montgomery curves}
\label{subsec:arithmetic-montgomery}
\begin{dfn}
Given a field \K, and $a,b \in \K$ such that $a^2 \neq 4$ and $b \neq 0$,
$M_{a,b}$ is the Montgomery curve defined over $\K$ with equation
$$M_{a,b}: by^2 = x^3 + ax^2 + x.$$
\end{dfn}
\begin{dfn}
For any algebraic extension $\L \supseteq \K$, we call
$M_{a,b}(\L)$ the set of $\L$-rational points, defined as
$$M_{a,b}(\L) = \{\Oinf\} \cup \{(x,y) \in \L \times \L~|~by^2 = x^3 + ax^2 + x\}.$$
Here, the additional element $\Oinf$ denotes the point at infinity.
\end{dfn}
Details of the formalization can be found in \sref{subsec:ECC-Montgomery}.
For $M_{a,b}$ over a finite field $\F{p}$, the parameter $b$ is known as the ``twisting factor''.
For $b'\in \F{p}\backslash\{0\}$ and $b' \neq b$, the curves $M_{a,b}$ and $M_{a,b'}$
are isomorphic via $(x,y) \mapsto (x, \sqrt{b/b'} \cdot y)$.
\begin{dfn}
When $b'/b$ is not a square in \F{p}, $M_{a,b'}$ is a \emph{quadratic twist} of $M_{a,b}$, i.e.,
a curve that is isomorphic over $\F{p^2}$~\cite{cryptoeprint:2017:212}.
\end{dfn}
Points in $M_{a,b}(\K)$ can be equipped with a structure of an abelian group
with the addition operation $+$ and with neutral element the point at infinity $\Oinf$.
For a point $P \in M_{a,b}(\K)$ and a positive integer $n$ we obtain the scalar product
$$n\cdot P = \underbrace{P + \cdots + P}_{n\text{ times}}.$$
In order to efficiently compute the scalar multiplication we use an algorithm
similar to square-and-multiply: the Montgomery ladder where the basic operations
are differential addition and doubling~\cite{MontgomerySpeeding}.
We consider \xcoord-only operations. Throughout the computation,
these $x$-coordinates are kept in projective representation
$(X : Z)$, with $x = X/Z$; the point at infinity is represented as $(1:0)$.
See \sref{subsec:ECC-projective} for more details.
We define two operations:
\begin{align*}
\texttt{xADD} &: (x_{Q-P}, (X_P:Z_P), (X_Q:Z_Q)) \mapsto \\
&(X_{P + Q}:Z_{P + Q})\\
\texttt{xDBL} &: (X_P:Z_P) \mapsto (X_{2 \cdot P}:Z_{2 \cdot P})
\end{align*}
In the Montgomery ladder, % notice that
the arguments of \texttt{xADD} and \texttt{xDBL}
are swapped depending of the value of the $k^{\text{th}}$ bit.
We use a conditional swap \texttt{CSWAP} to change the arguments of the above
function while keeping the same body of the loop. \label{cswap}
Given a pair $(P_0, P_1)$ and a bit $b$, \texttt{CSWAP} returns the pair
$(P_b, P_{1-b})$.
By using the differential addition and doubling operations we define the Montgomery ladder
computing a \xcoord-only scalar multiplication (see \aref{alg:montgomery-ladder}).
\begin{algorithm}
\caption{Montgomery ladder for scalar mult.}
\label{alg:montgomery-ladder}
\begin{algorithmic}
\REQUIRE{\xcoord $x_P$ of a point $P$, scalar $n$ with $n < 2^m$}
\ENSURE{\xcoord $x_Q$ of $Q = n \cdot P$}
\STATE $Q = (X_Q:Z_Q) \leftarrow (1:0)$
\STATE $R = (X_R:Z_R) \leftarrow (x_P:1)$
\FOR{$k$ := $m$ down to $1$}
\STATE $(Q,R) \leftarrow \texttt{CSWAP}((Q,R), k^{\text{th}}\text{ bit of }n)$
\STATE $Q \leftarrow \texttt{xDBL}(Q)$
\STATE $R \leftarrow \texttt{xADD}(x_P,Q,R)$
\STATE $(Q,R) \leftarrow \texttt{CSWAP}((Q,R), k^{\text{th}}\text{ bit of }n)$
\ENDFOR
\RETURN $X_Q/Z_Q$
\end{algorithmic}
\end{algorithm}
\subsection{The X25519 key exchange}
\label{subsec:X25519-key-exchange}
From now on let $\F{p}$ be the field with $p=2^{255}-19$ elements.
We consider the elliptic curve $E$ over $\F{p}$ defined by the
equation $y^2 = x^3 + 486662 x^2 + x$.
For every $x \in \F{p}$ there exists a point $P$ in $E(\F{p^2})$
such that $x$ is the \xcoord of $P$.
The core of the X25519 key-exchange protocol is a scalar\hyp{}multiplication
function, which we will also refer to as X25519.
This function receives as input two arrays of $32$ bytes each.
One of them is interpreted as the little-endian encoding of a
non-negative integer $n$ (see \ref{subsec:integer-bytes}).
The other is interpreted as the little-endian encoding of
the \xcoord $x_P \in \F{p}$ of a point in $E(\F{p^2})$,
using the standard mapping of integers modulo $p$ to elements in $\F{p}$.
The X25519 function first computes a scalar $n'$ by setting bit 255 of $n$
to \texttt{0}, setting bit 254 to \texttt{1}, and setting the lower 3 bits to \texttt{0}.
This operation is often called ``clamping'' of the scalar $n$.
Note that $n' \in 2^{254} + 8\{0,1,\ldots,2^{251}-1\}$.
X25519 then computes the \xcoord of $n'\cdot P$.
RFC~7748~\cite{rfc7748} standardized the X25519 Diffie–Hellman key-exchange algorithm.
Given the base point $B$ where $X_B=9$, each party generates a secret random number
$s_a$ (respectively $s_b$), and computes $X_{P_a}$ (respectively $X_{P_b}$), the
\xcoord of $P_A = s_a \cdot B$ (respectively $P_B = s_b \cdot B$).
The parties exchange $X_{P_a}$ and $X_{P_b}$ and compute their shared secret with
X25519 on $s_a$ and $X_{P_b}$ (respectively $s_b$ and $X_{P_a}$).
\subsection{TweetNaCl specifics}
\label{subsec:Number-TweetNaCl}
As its name suggests, TweetNaCl aims for code compactness (\emph{``a crypto library in 100 tweets''}).
As a result it uses a few defines and typedefs to gain precious bytes while
still remaining human-readable.
\begin{lstlisting}[language=Ctweetnacl]
#define FOR(i,n) for (i = 0;i < n;++i)
#define sv static void
typedef unsigned char u8;
typedef long long i64;
\end{lstlisting}
TweetNaCl functions take pointers as arguments. By convention the first one
points to the output array. It is then followed by the input arguments.
Due to some limitations of the VST, indexes used in \TNaCle{for} loops have to be
of type \TNaCle{int} instead of \TNaCle{i64}. We change the code to allow our
proofs to carry through. We believe this does not affect the correctness of the
original code. A complete diff of our modifications to TweetNaCl can be found in
Appendix~\ref{verified-C-and-diff}.
\subsection{X25519 in TweetNaCl}
\label{subsec:X25519-TweetNaCl}
We now describe the implementation of X25519 in TweetNaCl.
\subheading{Arithmetic in \Ffield.}
In X25519, all computations are performed in $\F{p}$.
Throughout the computation, elements of that field
are represented in radix $2^{16}$,
i.e., an element $A$ is represented as $(a_0,\dots,a_{15})$,
with $A = \sum_{i=0}^{15}a_i2^{16i}$.
The individual ``limbs'' $a_i$ are represented as
64-bit \TNaCle{long long} variables:
\begin{lstlisting}[language=Ctweetnacl]
typedef i64 gf[16];
\end{lstlisting}
Conversion from the input byte array to this representation in radix $2^{16}$ is
done as follows:
\begin{lstlisting}[language=Ctweetnacl]
sv unpack25519(gf o, const u8 *n)
{
int i;
FOR(i,16) o[i]=n[2*i]+((i64)n[2*i+1]<<8);
o[15]&=0x7fff;
}
\end{lstlisting}
The radix-$2^{16}$ representation in limbs of $64$ bits is
highly redundant; for any element $A \in \Ffield$ there are
multiple ways to represent $A$ as $(a_0,\dots,a_{15})$.
This is used to avoid carry handling in the implementations of addition
(\TNaCle{A}) and subtraction (\TNaCle{Z}) in $\Ffield$:
\begin{lstlisting}[language=Ctweetnacl]
sv A(gf o,const gf a,const gf b) {
int i;
FOR(i,16) o[i]=a[i]+b[i];
}
sv Z(gf o,const gf a,const gf b) {
int i;
FOR(i,16) o[i]=a[i]-b[i];
}
\end{lstlisting}
Multiplications (\TNaCle{M}) also heavily exploit the redundancy
of the representation to delay carry handling.
\begin{lstlisting}[language=Ctweetnacl]
sv M(gf o,const gf a,const gf b) {
i64 i,j,t[31];
FOR(i,31) t[i]=0;
FOR(i,16) FOR(j,16) t[i+j]+=a[i]*b[j];
FOR(i,15) t[i]+=38*t[i+16];
FOR(i,16) o[i]=t[i];
car25519(o);
car25519(o);
}
\end{lstlisting}
After the multiplication, the limbs of the result \texttt{o} are
too large to be used again as input.
% which is why
The two calls to
\TNaCle{car25519} at the end of \TNaCle{M} propagate the carries through the limbs:
\begin{lstlisting}[language=Ctweetnacl]
sv car25519(gf o)
{
int i;
FOR(i,16) {
o[(i+1)%16]+=(i<15?1:38)*(o[i]>>16);
o[i]&=0xffff;
}
}
\end{lstlisting}
In order to simplify the verification of this function,
we treat the last iteration of the loop $i = 15$ as a separate step.