% Thesis title: Combining learning with fuzzing for software deobfuscation

% Draft title: Code-coverage guided learning

% Paper title: Coverage Guideld Learning of Reactive Software Systems

\title{Complementing Model Learning with Fuzzing}

\title{Complementing Model Learning with Mutation-Based Fuzzing}

\author{Mark Janssen\inst{1}\andRick Smetsers\inst{2}\and Joshua Moerman\inst{2}\and Sicco Verwer\inst{1}}%\thanks{Supported by NWO project 628.001.009 on Learning Extended State Machines for Malware Analysis (LEMMA).}}

\tocauthor{Janssen, Mark}

\institute{Delft University of Technology, \email{mark@ch.tudelft.nl, s.e.verwer@tudelft.nl}\andRadboud University Nijmegen, \email{r.smetsers@cs.ru.nl, moerman@cs.ru.nl}}

\author{Rick Smetsers\inst{1}\and Joshua Moerman\inst{1}\and Mark Janssen\inst{2}\and Sicco Verwer\inst{2}}%\thanks{Supported by NWO project 628.001.009 on Learning Extended State Machines for Malware Analysis (LEMMA).}}

\tocauthor{Smetsers, Rick}

\institute{Radboud University Nijmegen, \email{r.smetsers@cs.ru.nl, moerman@cs.ru.nl}\andDelft University of Technology, \email{mark@ch.tudelft.nl, s.e.verwer@tudelft.nl}}

%TODO voor eventuele camera ready versie uitgebreide (verplichte) info in institute. Nu is dat zonde van de ruimte.

\maketitle

...

...

@@ -50,9 +50,9 @@

\begin{abstract}

An ongoing challenge for learning algorithms formulated in the Minimally Adequate Teacher framework is to efficiently obtain counterexamples.

In this paper we compare and combine conformance testing and mutation-based fuzzing methods for obtaining counterexamples when learning finite state machine models for the reactive software systems of the Rigorous Exampination of Reactive Systems (RERS) challenge.

We have found that for the LTL problems of the challenge the fuzzer did not find any additional counterexamples for the learner, compared to those found by the tester.

We have found that for the LTL problems of the challenge the fuzzer provided an independent confirmation that the learning process had been successful, since no additional counterexamples were found.

For the reachability problems of the challenge, however, the fuzzer discovered more reachable error states than the learner and tester, albeit in some cases the learner and tester found some that were not discovered by the fuzzer.

This leads us to believe that in some applications, fuzzing is a viable technique for finding additional counterexamples for a learning setup.

This leads us to believe that these orthogonal approaches are complementary in the context of model learning.

\keywords{model learning, conformance testing, mutation-based fuzzing, finite state machines}

\end{abstract}

...

...

@@ -64,7 +64,7 @@ Software systems are becoming increasingly complex.

Instead of viewing a system via its internal structure, model learning algorithms construct a formal model from observations of a system's behaviour.

One prominent approach for model learning is described in a seminal paper by \citet{angluin1987learning}.

In this work, she proved that one can efficiently learn a model that describes the behaviour of a system if a so-called \emph{Minimally Adequate Teacher} is available.

In this work, she proved that one can effectively learn a model that describes the behaviour of a system if a so-called \emph{Minimally Adequate Teacher} is available.

This teacher is assumed to answer two types of questions about the (to the learner unknown) \emph{target}:

\begin{itemize}

\item In a \emph{membership query} (MQ) the learner asks for the system's output in response to a sequence of inputs.

...

...

@@ -91,7 +91,7 @@ In a test query, similarly to a membership query, the learner asks for the syste

If the system's response is the same as the predicted response (by the hypothesis) for all test queries, then the hypothesis is assumed to be equivalent to the target.

Otherwise, if there is a test for which the target and the hypothesis produce different outputs, then this input sequence can be used as a counterexample.

One of the main advantages of using conformance testing is that it can efficiently identify the hypothesis from the set of all finite state machines of size at most $m$.

One of the main advantages of using conformance testing is that it can distinguish the hypothesis from all other finite state machines of size at most $m$, where $m$ is a user-selected bound on the number of states.

This means that if we know a bound $m$ for the size of the system we learn, we are guaranteed to find a counterexample if there exists one.

Unfortunately, conformance testing has some notable drawbacks.

First, it is hard (or even impossible) in practice to determine an upper-bound on the number of states of the system's target FSM.

...

...

@@ -138,7 +138,7 @@ The RERS challenge consists of two parts:

In addition, we have used a mutation-based fuzzing tool (\emph{fuzzer}) to generate potentially interesting traces independently of the learner and the tester.

We have used these traces as a verification for the learned models and found that

\begin{itemize}

\item For part (1) of the challenge the fuzzer did not find any additional counterexamples for the learner, compared to those found by the tester.

\item For part (1) of the challenge the fuzzer did not find any additional counterexamples for the learner, compared to those found by the tester. Therefore the fuzzer provided an independent confirmation that the learning process had been successful.

\item For part (2) of the challenge the fuzzer discovered more reachable error states than the learner and tester, albeit in some cases the learner and tester found some that were not discovered by the fuzzer.

\end{itemize}

...

...

@@ -238,7 +238,7 @@ In this section, we describe preliminaries on finite state machines, model learn

\subsection{Finite state machines}

A \emph{finite state machine} (FSM) is a model of computation that can be used to design computer programs.

At any time, a FSM is in one of its (finite number of) states, called the \emph{current state}.

Generally, the current state of a computer program is determined by the contents of the memory locations (i.e.\ variables) that it currently has access to.

Generally, the current state of a computer program is determined by the contents of the memory locations (i.e.\ variables) that it currently has access to, and the values of its registers, in particular the program counter.

%The state of the vending machine in \autoref{fig:fsm}, for example, is determined by its current balance.

Changes in state are triggered by an event or condition, and are called \emph{transitions}.

We assume that transitions are triggered based on events, or \emph{inputs}, that can be observed.

...

...

@@ -253,7 +253,7 @@ Observe that a FSM is deterministic and input-enabled (i.e.\ complete) by defini

For $q \in Q$, we use $\lfloor q \rfloor_M$ to denote a \emph{representative} access sequence of $q$, i.e.\ $\delta(q_M, \lfloor q \rfloor_M)= q$.

We extend this notation to arbitrary sequences, allowing to transform them into representative access sequences: for $x \in I^{\ast}$, we define $\lfloor x \rfloor_M =\lfloor\delta(q_M, x)\rfloor_M$.

A \emph{discriminator} for a pair of states $q, q'$ is an input sequence $x \in I^{\ast}$ such that $\lambda(q, x)\neq\lambda(q', x)$.

A \emph{discriminator} for a pair of states $q, q'$ is an input sequence $x \in I^{\ast}$ such that $\lambda(q, x)\neq\lambda(q', x)$\cite{Smetsers2016}.

The behaviour of a FSM $M$ is defined by a \emph{characterization function}$A_M : I^{\ast}\to O^{\ast}$ with $A_M(x)=\lambda(q_M, x)$ for $x \in I^{\ast}$.

FSMs $M$ and $M'$ are \emph{equivalent} if $A_M(x)= A_{M'}(x)$ for $x \in I^{\ast}$.

...

...

@@ -334,7 +334,8 @@ The goal of so-called active model learning algorithms is to learn a FSM $H = (I

\medskip

\noindent The distinguishing characteristic of the \emph{TTT algorithm}\cite{Isberner2014a} is its redundancy-free handling of counterexamples.

The \emph{TTT algorithm} is a novel model learning algorithm formulated in the MAT framework \cite{Isberner2014a}.

The distinguishing characteristic of TTT is its redundancy-free handling of counterexamples.

The TTT algorithm maintains a prefix-closed set $S$ of access sequences to states.

These states correspond to leaves of a \emph{discrimination tree}$T$, in which the inner nodes are labeled with elements from a suffix-closed set of discriminators $E$, and its transitions are labeled with an output.

%The $TTT$ algorithm constructs and maintains the following data structures during its execution:

Every location in the array is represented by a compile-time random value.

When an edge in the control flow is taken, the bitmap is updated at the position of the current location and an xor of the previous location value.

This causes every edge in the control flow to be represented by a different byte in the bitmap.

The intention is that every edge in the control flow is mapped to a different byte in the bitmap.

Note that because the size of the bitmap is finite and the values that represent locations in the code are random, the bitmap is probabilistic: there is a change that collisions will occur.

This is especially the case when the bitmap fills up, which can happen when fuzzing large programs with many edges in their control flow.

...

...

@@ -574,7 +575,7 @@ For our learning and testing experiments, we have used LearnLib, an open-source

As a learner in LearnLib consideres its system under learning as a black-box, we have interfaced LearnLib with a compiled binary of each of the 18 problems.

Below, we list and explain the choices we have made regarding our LearnLib setup.

\begin{description}

\item[Learning algorithm] For our learning algorithm, we have chosen the TTT algoritm as implemented in LearnLib, because previous experiments have shown that it scales up to larger SULs; both in the amount of membership queries asked and in the amount of memory used in the process.

\item[Learning algorithm] For our learning algorithm, we have chosen the TTT algoritm as implemented in LearnLib, because previous experiments have shown that it scales up to larger systems under learning; both in the amount of membership queries asked and in the amount of memory used in the process.

\item[Testing algorithm] For our testing algorithm, we have used our own implementation of the Wp method.

Recall that the Wp method in principle generates a test suite whose size is polynomial in the size of the hypothesis and exponential in the upper bound of states in the system, minus the size of the hypothesis.

Instead of exhausting this test suite, our implementation of the algorithm randomly samples test sequences until it finds a counterexample:

...

...

@@ -615,10 +616,10 @@ The output was compared to that of the program binary using \texttt{diff}.

The results for the learning/testing setup described in \autoref{sec:exp_learning} are shown in \autoref{tab:learn_ltl} and \autoref{tab:learn_reach}.

We are `confident' that the learned models for the LTL problems (1 - 9) are complete, as the last hypothesis was learnt within 1 day and no further counterexamples were found in the following week.

We are confident that the learned models for the LTL problems (1--9) are complete, as the last hypothesis was learnt within 1 day and no further counterexamples were found in the following week.

The same holds for the first of the reachability problems (10), but of course we can never be sure with this technique!

For problems 11 - 18 we are confident that we do \emph{not} have complete models, as the learning algorithm was still very active when stopped/crashed after more than one week (it was still finding new states every ~10 minutes).

For problems 11--18 we are confident that we do \emph{not} have complete models, as the learning algorithm was still very active when stopped/crashed after more than one week (it was still finding new states every ~10 minutes).