\documentclass{llncs}
\usepackage{makeidx}
\usepackage{graphicx}
\usepackage[binary-units=true]{siunitx}
\usepackage[numbers]{natbib}
\usepackage{listings}

\usepackage{todonotes}
\usepackage{amsopn}
\usepackage{algorithm}
\usepackage[noend]{algorithmic}
\usepackage{tikz}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{silence}
\WarningFilter{caption}{Unsupported document class} % caused by subfig with llncs document class
\usepackage{subfig}
% \usepackage{showframe} % temporary
\usepackage[bookmarks=false]{hyperref}
\usetikzlibrary{positioning,shapes.geometric,calc,automata,arrows}
\usepackage{multirow}
% Springer's original splncs03.bst does not support natbib, as a workaround we use the patched bst from 
% http://phaseportrait.blogspot.nl/2011/02/natbib-compatible-bibtex-style-bst-file.html
\bibliographystyle{splncsnat}
\renewcommand\bibsection{\section*{References}}

\newcolumntype{Y}{>{\centering\arraybackslash}X}

\newcommand{\etal}{\textit{et al.}\ }
\newcommand{\learnlib}{LearnLib}
\newcommand{\libafl}{libafl}
\DeclareMathOperator{\row}{row}

\begin{document}

\mainmatter

% Thesis title: Combining learning with fuzzing for software deobfuscation
% Draft title: Code-coverage guided learning
% Paper title: Coverage Guided Learning of Reactive Software Systems
\title{Coverage Guided Learning\\of Reactive Software Systems}

\author{Mark Janssen\inst{1} \and Rick Smetsers\inst{2} \and Joshua Moerman\inst{2} \and Sicco Verwer\inst{1}}%\thanks{Supported by NWO project 628.001.009 on Learning Extended State Machines for Malware Analysis (LEMMA).}}
\tocauthor{Janssen, Mark}
\institute{Delft University of Technology, \email{mark@ch.tudelft.nl, s.e.verwer@tudelft.nl} \and Radboud University Nijmegen, \email{r.smetsers@cs.ru.nl, moerman@cs.ru.nl}}
%TODO voor eventuele camera ready versie uitgebreide (verplichte) info in institute. Nu is dat zonde van de ruimte.

\maketitle

\begin{abstract}
An ongoing challenge for learning algorithms formulated in the Minimally Adequate Teacher framework is to efficiently obtain counterexamples.
In this paper we combine traditional conformance testing with mutation-based fuzzing for obtaining counterexamples when learning finite state machine models for the reactive software systems of the Rigorous Examination of Reactive Systems (RERS) Challenge 2016.
The fuzzer evaluates a test based on what code it executes; tests that execute new code are used as a source for mutation.
Iterating this process proves to be effective in finding counterexamples for a learning algorithm.
%In fact, we show that our coverage guided approach significantly outperforms traditional methods on a set of realistic benchmarks.
\keywords{model learning, conformance testing, mutation-based fuzzing, finite state machines}
\end{abstract}

\section{Introduction}

Software systems are becoming increasingly complex.
\emph{Model learning} is quickly becoming a popular technique for reverse engineering such systems.
Instead of viewing a system via its internal structure, model learning algorithms construct a formal model from observations of a system's behaviour.

One prominent approach for model learning is described in a seminal paper by \citet{angluin1987learning}.
In this work, she proved that one can efficiently learn a model that describes the behaviour of a system if a so-called \emph{Minimally Adequate Teacher} is available.
This teacher is assumed to answer two types of questions about the (to the learner unknown) \emph{target}:
\begin{itemize}
\item In a \emph{membership query} (MQ) the learner asks for the system's output in response to a sequence of inputs.
The learner uses the outputs for a set of such queries to construct a \emph{hypothesis}.
\item In an \emph{equivalence query} (EQ) the learner asks if its hypothesis is equivalent to the target. 
If this is not the case, the teacher provides a \emph{counterexample}, which is an input sequence that distinguishes the hypothesis and the target.
The learner then uses this counterexample to refine its hypothesis through a series of membership queries.
\end{itemize}
This process iterates until the learner's hypothesis is equivalent to the target.
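As an illustration only, the query loop can be sketched as follows; the `Teacher` class, the memorising learner, and the toy parity target are our own assumptions, not Angluin's actual data structures.

```python
# Minimal sketch of the MAT query loop for a toy target; the Teacher and
# the memorising learner are illustrative assumptions.

def target(seq):
    """Hidden system under learning: parity of the number of 'a' symbols."""
    return seq.count("a") % 2

class Teacher:
    def __init__(self, tests):
        self.tests = tests                      # finite test set approximating an EQ

    def membership_query(self, seq):
        return target(seq)                      # the system's output for seq

    def equivalence_query(self, hypothesis):
        for seq in self.tests:                  # return a counterexample, if any
            if hypothesis(seq) != target(seq):
                return seq
        return None

def learn(teacher):
    table = {}                                  # observations from membership queries
    hypothesis = lambda seq: table.get(seq, 0)  # trivial first hypothesis
    while True:
        cex = teacher.equivalence_query(hypothesis)
        if cex is None:                         # no counterexample: done
            return hypothesis
        table[cex] = teacher.membership_query(cex)  # refine using the counterexample
```

A real learner generalizes from counterexamples instead of memorising them; only the query structure is the point here.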

\citet{Peled1999} have recognized the value of Angluin's work for learning models of real-world, reactive systems that can be modeled by a finite state machine (FSM).
Membership queries, on the one hand, are implemented simply by interacting with the system. 
Equivalence queries, on the other hand, require a more elaborate approach, as there is no trivial way of implementing them.
Therefore, an ongoing challenge, and the topic of this paper, is to efficiently obtain counterexamples.

% Current techniques for implementing equivalence queries
Several techniques for obtaining counterexamples have been proposed.
The most widely studied approach for this purpose is \emph{conformance testing} \citep{Dorofeeva2010}.
In the context of learning, the goal of conformance testing is to establish an equivalence relation between the current hypothesis and the target.  
This is done by posing a set of so-called \emph{test queries} to the system.
In a test query, similarly to a membership query, the learner asks for the system's response to a sequence of inputs. 
If the system's response is the same as the predicted response (by the hypothesis) for all test queries, then the hypothesis is assumed to be equivalent to the target.
Otherwise, if there is a test for which the target and the hypothesis produce different outputs, then this input sequence can be used as a counterexample.

One of the main advantages of using conformance testing is that it can discover all counterexamples for a hypothesis with $n$ states, under the assumption that the system can be modelled by a target FSM with at most $m$ states, $m \geq n$.
Unfortunately, conformance testing has some notable drawbacks.
First, it is hard (or even impossible) in practice to determine an upper-bound on the number of states of the system's target FSM ($m$).
Second, it is known that testing becomes exponentially more expensive for higher values of $m$ \citep{Vasilevskii1973}. %, as a complete test set should include each sequence in the so-called \emph{traversal set}, which contains all input sequences of length $l = m - n + 1$ . 
%Moreover, this set should be applied to each state of the hypothesis.
%Therefore, in practice, often a small value for $l$ is chosen, in the hope that the test set will at least contain one counterexample that can be used to increase the size of the hypothesis.
%Then, in the next iteration of the learning algorithm, the same value for $l$ is chosen, so one effectively tests for higher values of $m$ (without it becoming exponentially more expensive).
%In some cases, however, the shortest counterexample for the current hypothesis is simply too long, and 
Therefore, in practice one tests only with respect to a small bound, and if the shortest counterexample is longer than this bound allows, the learner might incorrectly assume that its hypothesis is correct.
This motivates the search for alternative techniques for implementing equivalence queries.

% Fuzzing as a way of implementing equivalence queries
The field of \emph{mutation-based fuzzing} provides opportunities here.
In essence, \emph{fuzzers} are programs that apply a test (i.e.\ input sequence) to a target program, and then iteratively modify this sequence to monitor whether or not something interesting happens (e.g.\ crash, different output, increased code coverage \ldots).
Fuzzers are mostly used for security purposes, as a crash could uncover an exploitable buffer overflow, for example. 
%In general, two classes of fuzzers exist. 
%On the one hand, 
\emph{Mutation-based} fuzzers randomly replace or append some inputs in a test query.
%These fuzzers are also known as `dumb' fuzzers, as they lack any understanding of the underlying structure of the program. 
%Therefore, these fuzzers offer no solution to the exponential blow-up of conformance testing. 
%On the other hand, more sophisticated \emph{generation-based} fuzzers modify the test query based on some specification of the target program.
%This is effective in terms of the number interesting test queries generated, but unfortunately also requires one to write this potentially complex specification.
Recently, good results have been achieved by combining mutation-based fuzzing with a genetic (evolutionary) algorithm. 
This requires a fitness function to evaluate the performance of a newly generated test query, i.e.\ a measurement of `how interesting' it is.
In our case, this fitness function is based on what code is executed for a certain test query.
The fittest test cases can then be used as a source for mutation-based fuzzing.
Hence, tests are mutated to see if the coverage of the program is increased.
Iterating this process creates an evolutionary approach which proves to be very effective for various applications \citep{afl-website}.
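The evolutionary loop just described can be sketched as follows; the `run` function below is a toy stand-in for executing an instrumented target and collecting its coverage, and all names are illustrative assumptions rather than any fuzzer's real interface.

```python
import random

# Sketch of an evolutionary, coverage-guided mutation fuzzer.

def run(test):
    """Toy instrumented target: report the 'branches' the input exercises."""
    cov = set()
    for i, x in enumerate(test[:4]):
        cov.add((i, x))                  # a new (position, symbol) pair = new code
    return cov

def fuzz(seed, inputs, rounds, rng):
    corpus = [seed]                      # tests deemed interesting so far
    seen = run(seed)                     # accumulated coverage
    for _ in range(rounds):
        parent = rng.choice(corpus)      # pick a fit test as mutation source
        child = list(parent)
        pos = rng.randrange(len(child))  # mutate: replace one input symbol
        child[pos] = rng.choice(inputs)
        cov = run(child)
        if not cov <= seen:              # new coverage: keep as mutation source
            seen |= cov
            corpus.append(child)
    return corpus
```

Fitness here is simply "covers something not seen before"; real fuzzers such as AFL use finer-grained coverage feedback, but the keep-and-mutate cycle is the same.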

In this paper, we use such an evolutionary fuzzing approach to generate a test set for use in model learning.
This test set is combined with that of a traditional (albeit slightly modified) conformance testing method for implementing equivalence queries.

%\subsubsection*{Related work}
%\citet{Duchene2012} have used a combination of model learning and evolutionary mutation-based fuzzing to detect web injection vulnerabilities.
%Their function to evaluate the fitness of a test is specific to XSS attacks, however.
%This makes their approach less applicable to a more general class of software systems.
%
%\citet{Cho2011} have proposed to use \emph{concolic testing} for implementing equivalence queries. Concolic testing uses a combination of symbolic and concrete execution to build a model of the system, and then uses the model to guide further exploration of the state-space.
%Similarly to our fuzzing approach, the aim of concolic testing is to generate tests that maximize code coverage.
%%The authors use a priority queue to prioritize concrete sequences that are used for symbolic execution. 
%%The sequences that visit a larger number of new basic blocks, unexplored by prior sequences, have higher priority.
%In contrast to our approach, however, it uses a constraint solver to evaluate the fitness of a test, and therefore requires the system to be a \emph{white-box}.
%If such a white-box system is available, then concolic testing would arguably outperform fuzzing.
%Our fuzzing approach, however, can be applied to learn systems for which only a binary executable is available, as it can use runtime instrumentation to measure the fitness of a test.
%This makes our approach applicable to a broader class of software systems.

%Similarly, \citet{Giannakopoulou2012} have introduced a framework that combines model learning with symbolic execution to automatically generate models for white-box components that include methods with parameters. 
%The transitions of the generated models are labeled with method names and guarded with constraints on the corresponding method parameters. 
%The guards partition the input spaces of parameters, and enable a more precise characterization of legal orderings of method calls.
%The performance bottleneck faced by this approach was that the set of test sequences used to implement an equivalence query grew exponentially. 
%Moreover, to handle methods with parameters completely, each sequence in this set was fully explored symbolically, which further increased the cost of the approach.
%As a consequence, the approach could only be used to learn relatively small models.
%To counteract this, \citet{Howar2013} presented a new framework based on that of \citet{Giannakopoulou2012}, with novel algorithms for performing most of the work at a concrete level.

%\citet{Bertolino2012} have suggested to use \emph{monitoring} for implementing equivalence queries, i.e.\ continuously validating the current hypothesis against traces obtained by observing the system while it is running.
%This approach was proposed as a focus point of the Emergent Connectors for Eternal Software Intensive Networked Systems (CONNECT) European project.
%Whilst such an approach of `life-long learning' seems interesting for some applications, its utility is (to the best of our knowledge) yet to be proven in a convincing case study.
%A major limitation of this approach is that the quality of the learned model heavily relies on the diversity of the traces that are (passively) being observed. 

\section{Preliminaries}
In this section, we describe preliminaries on finite state machines, active learning, conformance testing, and fuzzing.
%Where possible, we will illustrate these concepts by means of their application to the vending machine implementation shown in \autoref{fig:fsm}.

%\begin{figure}%
%\centering
%\begin{minipage}{0.35\textwidth}%
%	\begin{algorithmic}
%	\STATE int balance = 0
%	\LOOP
%	\STATE int coin = scan()
%	% \IF{coin is not valid} continue \ENDIF
%	\STATE balance += coin
%	\IF{$\textrm{balance} \geq 50$}
%	\STATE balance -= 50
%    \PRINT 1
%	\ELSE
%	\PRINT 0
%	\ENDIF
%	\ENDLOOP
%	\end{algorithmic}%
%\end{minipage}%
%\quad
%\begin{minipage}{0.60\textwidth}%
%	\begin{tikzpicture}[shorten >=1pt,node distance=1.5cm,>=stealth']
%    \tikzstyle{every state}=[draw=black,text=black,inner sep=1pt,minimum size=12pt,initial text=]
%    \node[state,initial,initial where=above] (0) {0};
%    \node[state] (1) [right of=0] {10};
%    \node[state] (2) [right of=1] {20};
%    \node[state] (3) [right of=2] {30};
%    \node[state] (4) [right of=3] {40};
%    \path[->]
%    (0) edge [loop below] node [below] {50/1}
%    (0) edge node [above] {10/0} (1)
%    (0) edge [bend left=50] node [above] {20/0} (2)
%    
%    (1) edge [loop below] node [below] {50/1}
%    (1) edge node [above] {10/0} (2)
%    (1) edge [bend left=50] node [above] {20/0} (3)
%    
%    (2) edge [loop below] node [below] {50/1}
%    (2) edge node [above] {10/0} (3)
%    (2) edge [bend left=50] node [above] {20/0} (4)
%
%	(3) edge [loop below] node [below] {50/1}
%    (3) edge node [above] {10/0} (4)
%    (3) edge [bend left=60] node [below] {20/1} (0)
%    
%    (4) edge [loop below] node [below] {50/1}
%    (4) edge [bend right=65] node [above] {10/1} (0)
%    (4) edge [bend left=60] node [below] {20/1} (1);
%     
%	\end{tikzpicture}%   
%\end{minipage}%
%\caption{A simple implementation of a vending machine (left) and a FSM for it (right). 
%The machine accepts coins of value 10, 20 and 50.
%The state represents the current balance of the machine.
%An item costs 50.
%After receiving an input, the machine prints 1 if it vends an item, and 0 otherwise. 
%The machine does not give refunds, but instead saves balance for the next customer.}
%\label{fig:fsm}
%\end{figure}

\subsection{Finite state machines}
A \emph{finite state machine} (FSM) is a model of computation that can be used to design computer programs.
At any time, a FSM is in one of its (finite number of) states, called the \emph{current state}.
Generally, the current state of a computer program is determined by the contents of the memory locations (i.e.\ variables) that it currently has access to.
%The state of the vending machine in \autoref{fig:fsm}, for example, is determined by its current balance.
Changes in state are triggered by an event or condition, and are called \emph{transitions}.
We assume that transitions are triggered based on events, or \emph{inputs}, that can be observed. 
%\autoref{fig:fsm} (right) shows a FSM for our vending machine.

Formally, we define a FSM as a Mealy machine $M = (I, O, Q, q_M, \delta, \lambda)$, where $I, O$ and $Q$ are finite sets of \emph{inputs}, \emph{outputs} and \emph{states} respectively, $q_M \in Q$ is the \emph{start state}, $\delta: Q \times I \rightarrow Q$ is a \emph{transition function}, and $\lambda: Q \times I \rightarrow O$ is an \emph{output function}.
The functions $\delta$ and $\lambda$ are naturally extended to $\delta: Q \times I^{\ast} \to Q$ and $\lambda: Q \times I^{\ast} \to O^{\ast}$.
Observe that a FSM is deterministic and input-enabled (i.e.\ complete) by definition.

For $q \in Q$, we use $\lfloor q \rfloor_M$ to denote a \emph{representative} access sequence of $q$, i.e.\ $\delta(q_M, \lfloor q \rfloor_M) = q$.
We extend this notation to arbitrary sequences, allowing us to transform them into representative access sequences: for $x \in I^{\ast}$, we define $\lfloor x \rfloor_M = \lfloor \delta(q_M, x)\rfloor_M$.

A \emph{discriminator} for a pair of states $q, q'$ is an input sequence $x \in I^{\ast}$ such that $\lambda(q, x) \neq \lambda(q', x)$.

The behaviour of a FSM $M$ is defined by a \emph{characterization function} $A_M : I^{\ast} \to O^{\ast}$ with $A_M(x) = \lambda(q_M, x)$ for $x \in I^{\ast}$. 
FSMs $M$ and $M'$ are \emph{equivalent} if $A_M(x) = A_{M'}(x)$ for all $x \in I^{\ast}$.
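The definitions above can be made concrete with a small sketch; the two-state example machine is ours, purely for illustration.

```python
# Sketch of a Mealy machine M = (I, O, Q, q_M, delta, lambda), with the
# extensions of delta and lambda to input sequences.

class Mealy:
    def __init__(self, delta, lam, q0):
        self.delta, self.lam, self.q0 = delta, lam, q0

    def transition(self, q, word):       # delta extended to sequences
        for a in word:
            q = self.delta[q, a]
        return q

    def outputs(self, q, word):          # lambda extended to sequences
        out = []
        for a in word:
            out.append(self.lam[q, a])
            q = self.delta[q, a]
        return out

    def characterize(self, word):        # A_M(x) = lambda(q_M, x)
        return self.outputs(self.q0, word)

# Two-state machine over inputs {a, b}: state 1 is reached exactly after an a;
# output 1 is produced only on input a taken from state 0.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
lam   = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 0}
M = Mealy(delta, lam, 0)
```

Encoding $\delta$ and $\lambda$ as dictionaries makes the machine deterministic and input-enabled by construction, matching the definition.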

\subsection{Active learning} \label{sec:learning}
The goal of active learning algorithms is to learn a FSM $H = (I, O, Q_H, q_H, \delta_H, \lambda_H)$ for a system whose behaviour can be characterized by a (unknown) FSM $M = (I, O, Q_M, q_M, \delta_M, \lambda_M)$, given the set of inputs $I$ and access to the characterization function $A_M$ of $M$.

%Angluin's original $L^{\ast}$ algorithm \cite{angluin1987learning} incrementally constructs and refines a so-called \emph{observation table}.
%The rows and columns of an observation table are labelled by prefixes and suffixes of membership queries respectively, and the cells are filled with the last output in response to this membership query.
%A hypothesis $H$ can be constructed from the observation table if it is \emph{closed} and \emph{consistent} (see \cite{angluin1987learning} for details).
%This hypothesis is presented in an equivalence query, and a sequence $c$ is returned as a counterexample if $A_H(c) \neq A_M(c)$.
%Several different methods for handling counterexamples exist (see \cite{Steffen2011} for an overview).
%In Angluin's original description, the counterexample and all of its prefixes are added to $S$.
%This violates the closedness or consistency of the observation table (or both), and hence new membership queries are required to fill the table.
%This procedure iterates until no counterexample can be found for an hypothesis presented in an equivalence query.
%At this point, it is assumed that $H$ and $M$ are equivalent.

%Formally, an observation table is a triple $(S, E, \row)$, where:
%\begin{itemize}
%\item $S \subset I^{\ast}$ is a finite, prefix-closed set of \emph{prefixes}, extended to $S \cup (S \cdot I)$,
%\item $E \subset I^{\ast}$ is a finite, suffix-closed set of \emph{suffixes}, and
%\item $\row : (S \cup (S \cdot I) \to E) \to O$ is a function mapping prefixes and suffixes to outputs.
%\end{itemize}
%Initially, $S$ contains the empty sequence $\epsilon$, and $E$ contains the set $I$.
%The learner asks membership queries (using the characterization function $A_M$) to fill this initial table.
%
%A hypothesis FSM can be constructed if the observation table is \emph{closed} and \emph{consistent}:
%\begin{itemize}
%\item The table is \emph{closed} if for all $s \in S \cdot I$ there is a $s' \in S$ such that $\row(s) = \row(s')$.
%If there is a row in $S \cdot I$ for which this is not the case, then this row is added to $S$ and the property is checked again. 
%\item The table is \emph{consistent} if for all $s, s' \in S$ such that $\row(s) = \row(s')$ it holds that $\row(sa) = \row(s'a)$ for all $a \in I$.
%If this is not the case for some $s, s' \in S$ and $a \in I$, then a suffix $ae$ such that $\row(s)(ae) \neq \row(s')(ae)$ is added to $E$, and the property is checked again.
%\end{itemize}
%A hypothesis FSM $H$ is constructed from a closed and consistent observation table $(S, E, \row)$ as follows:
%\begin{itemize}
%\item $I$ is given
%\item $O = \{\row(s)(e) | s \in S, e \in E\}$,
%\item $Q_H = \{\row(s) | s \in S\}$,
%\item $q_H = \row(\epsilon)$,
%\item $\delta_H(\row(s), a) = \row(s \cdot a)$ for $a \in I$ and $s \in S$, and
%\item $\lambda_H(\row(s), a) = \row(s)(a)$ for $a \in I$ and $s \in S$.
%\end{itemize}
%The initial observation table and corresponding first hypothesis for the vending machine of Figure~\ref{fig:fsm} are shown in Figure \ref{fig:hyp}.

%\begin{figure}
%\centering
%\begin{minipage}{0.2\textwidth}%
%	\begin{tabular}{| c r | c c c |}
%    \hline
%    & & \multicolumn{3}{| c |}{$E$} \\
%    & & 10 & 20 & 50 \\
%    \hline
%    $S$ & $\epsilon$ & 0 & 0 & 1 \\
%    \hline
%    \multirow{3}{*}{$S \cdot I$} & 10 & 0 & 0 & 1 \\
%    & 20 & 0 & 0 & 1 \\
%    & 50 & 0 & 0 & 1 \\
%    \hline
%    \end{tabular}
%\end{minipage}%
%\qquad
%\begin{minipage}{0.2\textwidth}%
%	\begin{tikzpicture}[shorten >=1pt,node distance=1.5cm,>=stealth']
%    \tikzstyle{every state}=[draw=black,text=black,inner sep=1pt,minimum size=12pt,initial text=]
%    \node[state,initial,initial where=above] (0) {};
%    \path[->]
%    (0) edge [loop right] node [right] {10/0}
%    (0) edge [loop below] node [below] {20/0}
%    (0) edge [loop left] node [left] {50/1} ();
%     
%	\end{tikzpicture}%   
%\end{minipage}%
%\caption{The initial observation table (left) and first hypothesis (right) for the vending machine in Figure~\ref{fig:fsm}.}
%\label{fig:hyp}
%\end{figure}

%A problem with the $L^{\ast}$ algorithm is that it is not prepared to deal with long counterexamples, as these add a lot of (possibly) redundant information to the observation table.
The distinguishing characteristic of the \emph{$TTT$ algorithm} \cite{Isberner2014a} is its redundancy-free handling of counterexamples.
The $TTT$ algorithm maintains a prefix-closed set $S$ of access sequences to states.
These states correspond to leaves of a \emph{discrimination tree} $T$, in which the inner nodes are labeled with elements from a suffix-closed set of discriminators $E$, and its transitions are labeled with an output.
%------------------------hier gebleven-------------------------------
%The $TTT$ algorithm constructs and maintains the following data structures during its execution:
%\begin{itemize}
%\item A \emph{spanning tree} defines a prefix-closed set $S$ of access sequences to states of the hypothesis.
%The spanning tree is augmented to $S \cdot I$ to represent the hypothesis $H$.
%\item States of the hypothesis correspond to leaves of a \emph{discrimination tree} $T$. 
%The inner nodes of $T$ are labelled with elements from a suffix-closed set $E$ of discriminators, and its transitions are labelled with an output from the set $O$.
%Each inner node has $|O|$ children that are possibly empty.
%For a node labelled with discriminator $x$, the subtree reached by transition labelled $o \in O$ contains the states for which $o$ is the last output in response to $x$.
%Hence, for every pair of states, a discriminator can be obtained by looking at the label of the \emph{least common ancestor} of the corresponding leaves.
%\end{itemize}
%Initially, $S$ contains the empty sequence $\epsilon$, and $E$ is empty.
%Therefore, the spanning tree consists of an initial state $q_H$, and the discrimination tree contains a single (leaf) node labelled by $q_H$.
A hypothesis is constructed by \emph{sifting} the sequences in $S \cdot I$ through the discrimination tree: 
Given a prefix $ua$, with $u \in S$ and $a \in I$, starting at the root of $T$, at each inner node labelled with a discriminator $v \in E$ a membership query $A_M(uav)$ is posed. 
Depending on the last output of this query, we move on to the respective child of the inner node. 
This process is repeated until a leaf is reached.
The state in the label of the leaf becomes the target for transition $\delta(\delta(q_H, u),a)$.
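Sifting can be sketched as follows; the tree encoding (an inner node as a pair of discriminator and children indexed by last output, a leaf as a state label) and the toy parity target are illustrative assumptions, not the TTT implementation.

```python
# Sketch of sifting a prefix through a discrimination tree; `mq` returns the
# target's last output for a word.

def sift(tree, prefix, mq):
    """Route `prefix` to the hypothesis state it belongs to."""
    node = tree
    while isinstance(node, tuple):       # inner node: pose a membership query
        v, children = node
        node = children[mq(prefix + v)]  # follow the edge for the observed output
    return node                          # leaf: a state label

# Toy target whose last output is 1 iff the word has an even number of a's.
def mq(word):
    return 1 if word.count("a") % 2 == 0 else 0

# A single discriminator ["a"] separates the even-parity state q0 from q1.
tree = (["a"], {0: "q0", 1: "q1"})
```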
%Since the discrimination tree is empty at first, all transitions lead to $q_H$ in the first hypothesis.
%Therefore, the first hypothesis is the same as in Figure~\ref{fig:hyp}.

The way that the $TTT$ algorithm handles counterexamples is based on the observation by \citet{Rivest1994} that a counterexample $x \in I^{\ast}$ can be decomposed into a prefix $u \in I^{\ast}$, an input $a \in I$, and a suffix $v \in I^{\ast}$ such that $x = uav$ and $A_M(\lfloor u \rfloor_H av) \neq A_M(\lfloor ua \rfloor_H v)$.
Such a decomposition shows that the state $q = \delta_H(\delta_H(q_H, u), a)$ is incorrect, and that this transition should instead point to a new state $q'$ with access sequence $\lfloor u \rfloor_H a$.
Therefore, this sequence is added to $S$.
Observe that this does not affect the prefix-closedness of $S$.
In the discrimination tree $T$, the leaf corresponding to $q$ is replaced with an inner node labelled by the \emph{temporary} discriminator $v$.
A technique known as \emph{discriminator finalization} is applied to construct the subtree of this newly created inner node, and obtain a minimal discriminator for $q$ and $q'$.
For a description of discriminator finalization, we refer to \cite{Isberner2014a}.
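Finding such a decomposition can be sketched as a scan over the split position (in practice a binary search over the position is used); `access` maps a word to the representative access sequence of the hypothesis state it reaches, `mq` returns the target's last output, and the toy mod-3 target and parity hypothesis are our own illustrative assumptions.

```python
# Sketch of locating the decomposition point of a counterexample.

def decompose(cex, access, mq):
    """Find (u, a, v) with cex = u+[a]+v and differing membership outcomes."""
    for i in range(len(cex)):
        u, a, v = cex[:i], cex[i], cex[i + 1:]
        if mq(access(u) + [a] + v) != mq(access(u + [a]) + v):
            return u, a, v
    return None                          # no split point: not a counterexample

# Toy target: last output is 1 when the number of a's so far is divisible by 3.
def mq(word):
    count, out = 0, None
    for _ in word:
        count += 1
        out = 1 if count % 3 == 0 else 0
    return out

# Toy two-state hypothesis tracking parity; access sequences [] and ["a"].
def access(word):
    return ["a"] if len(word) % 2 == 1 else []
```

The returned suffix $v$ then serves as the temporary discriminator for the new state with access sequence $\lfloor u \rfloor_H a$.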


%\begin{figure}
%\centering
%\begin{minipage}{0.2\textwidth}
%\begin{tikzpicture}[shorten >=1pt,node distance=1.5cm,>=stealth']
%    \tikzstyle{every state}=[draw=black,text=black,inner sep=1pt,minimum size=12pt,initial text=]
%    \node[state,initial,initial where=above] (0) {$q_0$};
%    \path[->]
%    (0) edge [loop right] node [right] {10/0}
%    (0) edge [loop below] node [below] {20/0}
%    (0) edge [loop left] node [left] {50/1} (); 
%	\end{tikzpicture}% 
%\end{minipage}\qquad
%\begin{minipage}{0.4\textwidth}%
%\begin{tikzpicture}[shorten >=1pt,>=stealth',node distance=1cm]
%    \tikzstyle{every state}=[draw=black,text=black,inner sep=1pt,minimum size=12pt,initial text=]
%    \node[rectangle, draw=black] (0) {10};
%    \node[rectangle, draw=black] (1) [below left=0.5cm and 0.5cm of 0] {20};
%    \node (2) [below right=0.5cm and 0.5cm of 0] {$\bot$};
%    \node[rectangle, draw=black] (3) [below left=0.5cm and 0.5cm of 1] {50};
%    \node (4) [below right=0.5cm and 0.5cm of 1] {$\bot$};
%    \node (5) [below left=0.5cm and 0.5cm of 3] {$\bot$};
%    \node[state] (6) [below right=0.5cm and 0.5cm of 3] {$q_0$};
%    \path[->]
%    (0) edge node [above left] {0} (1)
%    (0) edge node [above right] {1} (2)
%    (1) edge node [above left] {0} (3)
%    (1) edge node [above right] {1} (4)
%    (3) edge node [above left] {0} (5)
%    (3) edge node [above right] {1} (6);
%	\end{tikzpicture}% 
%\end{minipage}
%\caption{The initial spanning tree and hypothesis (left) and discrimination tree (right) for the vending machine in Figure~\ref{fig:fsm}.}
%\label{fig:ttt}
%\end{figure}

\subsection{Conformance testing} \label{sec:testing}
One of the main advantages of using conformance testing for finding counterexamples is that it can discover all counterexamples for a hypothesis $H = (I, O, Q_H, q_H, \delta_H, \lambda_H)$ with $n$ states, under the assumption that the target FSM $M = (I, O, Q_M, q_M, \delta_M, \lambda_M)$ has at most $m$ states, $m \geq n$.
Several different methods exist for constructing a so-called \emph{$m$-complete} test set (for a complete introduction and an overview of these methods, we refer to \citet{Dorofeeva2010}).
All of these methods require the following information:
\begin{itemize}
\item A set of \emph{access sequences} $S = \{\lfloor q \rfloor_H | q \in Q_H\}$, possibly extended to a \emph{transition cover} set $S \cdot I$.
\item A \emph{traversal set} $I^{l}$ that contains all input sequences of length $l = m - n + 1$, where $m = |Q_M|$ and $n = |Q_H|$.
\item A means of pairwise distinguishing all states of $H$, such as a set of \emph{discriminators} $E$ for all pairs of states in $H$.
\end{itemize}
A test set is then constructed by taking the product of these sets, or subsets of these sets, e.g.\ $S \cdot I^{l} \cdot E$.
The testing methods differ in how states are distinguished (i.e.\ in the last part).

In the so-called \emph{partial W-method}, or \emph{Wp-method} \citep{Fujiwara1991}, states are distinguished based on the current state of the hypothesis:
For each state $q \in Q_H$, a set $E_{q} \subset E$ of discriminators is constructed, such that for each state $q' \in Q_H \setminus \{q\}$ there is a sequence $w \in E_{q}$ that distinguishes $q$ and $q'$, i.e.\ $\lambda_H(q, w) \neq \lambda_H(q', w)$.
Then, each trace $uv$, with $u \in S \cdot I$ and $v \in I^{l}$, is concatenated with the set $E_q$, where $q = \delta_H(q_H, uv)$.
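The resulting test-set construction can be sketched as follows; the two-state example hypothesis, its access sequences, and its per-state discriminators are made up purely for illustration.

```python
from itertools import product

# Simplified sketch of a Wp-style test set: every transition-cover prefix is
# extended with each traversal-set infix and the state-local discriminators
# of the hypothesis state then reached.

def run(delta, q, word):
    for a in word:
        q = delta[q, a]
    return q

def wp_tests(S, I, l, delta, q0, E):
    """S: access sequences, I: inputs, l: traversal length, E: q -> discriminators."""
    tests = []
    cover = [u + (a,) for u in S for a in I]  # transition cover S . I
    for u in cover:
        for v in product(I, repeat=l):        # traversal set I^l
            q = run(delta, q0, u + v)         # state reached in the hypothesis
            for w in E[q]:
                tests.append(u + v + w)
    return tests

# Two-state hypothesis over {a, b}; state 1 is reached exactly on input a.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
S = [(), ("a",)]                              # access sequences
E = {0: [("a",)], 1: [("b",)]}                # per-state discriminators
```

The traversal loop over `product(I, repeat=l)` is exactly where the exponential blow-up discussed below originates.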

%The method that uses a set of discriminators for all pairs of states is called the \emph{W-method} \cite{chow1978testing}.

%An alternative way of pairwise distinguishing all states is by means of an \emph{adaptive distinguishing sequence}, which is a single adaptive test for which all states produce a unique output.
%The test is \emph{adaptive} in the sense that at each step of the test, the next input depends on the previously observed output.
%Unfortunately, an adaptive distinguishing sequence does not always exist.
%Recently, however, \citet{Smeenk2015} have described a test method that uses adaptive distinguishing sequences to construct a test sets that (in most cases) is smaller than one constructed by the W-method (although they have the same worst case complexity).
%We will refer to this method as \emph{ADS}.

Conformance testing is typically very expensive due to the exponential size of the traversal set. 
Given a hypothesis $H$ with $n$ states and $k$ inputs, the worst-case length of a test set (i.e.\ the sum of the lengths of all sequences) is of order $\mathcal{O}(k^{l}n^3)$ (recall that $l = m - n + 1$, where $m$ is the upper bound on the number of states of $M$).
Moreover, it is hard to estimate an upper bound for $M$ in practice. 
Therefore, instead of iterating over the traversal set $I^{l}$ (for each prefix in $S$), we sample the set $I^{\ast}$ by randomly generating sequences $v \in I^{\ast}$  according to a geometric distribution. 
While generating a sequence, the probability of terminating the sequence after each input is $p = 1/x$.
Therefore, the expected (mean) length of such a sequence is $E(l) = 1/p = x$, and the probability that a sequence has length $l$ is $(1-p)^{l-1}p$.
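This sampling procedure can be sketched as follows; the function name and interface are ours, for illustration only.

```python
import random

# Sketch of random test generation with geometrically distributed length:
# after each generated input, terminate with probability p = 1/x, so the
# expected length is x.

def random_word(inputs, x, rng):
    """Sample a word over `inputs` with expected length x (always >= 1)."""
    word = []
    while True:
        word.append(rng.choice(inputs))
        if rng.random() < 1.0 / x:
            return word
```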


%often a value close to $n$ is picked for $m$, in the hope that the test set contains at least some counterexamples. 
%Then, for each next hypothesis, $m$ is assumed to be larger than before, and eventually one hopes to obtain a correct estimate for $m$.

%In some cases, however, the shortest counterexample for the current hypothesis is simply too long.
%Let us illustrate this with (an adaptation of) our running example (\autoref{fig:fsm}).
%A $3$-complete test set would contain any permutation of the sequence 10 20 20, which is a counterexample for the hypothesis in \autoref{fig:hyp}.
%Therefore, we are likely to find a counterexample in this case.
%Now assume that we are forced to increase the price of an item in our vending machine to 500, due to an (unlikely, albeit not inconceivable) currency devaluation.
%Now, we require at least a $10$-complete test set to find a counterexample (ten coins of 50), which is already inconceivable for any of these methods.
%
%In the hypothetical case described above, instead of applying a $3$-complete test set exhaustively, one can randomly sample a $10$-complete set.
%Whilst this does not guarantee that we would find a counterexample, it at least provides us with the possibility of finding one.
%Still, however, it is easy to see that the probability of finding a counterexample is fairly low.
%Therefore, alternative methods for finding counterexamples might yield better results.

\subsection{Fuzzing} \label{sec:fuzzing}
A \emph{fuzzer} is a program that applies a set of tests (i.e.\ input sequences) to a target program, and then iteratively \emph{mutates} (i.e.\ modifies) these tests to see whether `something interesting' happens.
This could be that the target program crashes, or that its output changes.
The \emph{American Fuzzy Lop} (AFL) fuzzer \citep{afl-website} is interesting for its approach of combining mutation-based test case generation with \emph{code coverage} monitoring.
The AFL fuzzer supports programs written in C, C++, or Objective-C, and there are variants that allow fuzzing programs written in Python, Go, Rust or OCaml.
These programs can be provided as source code, or as binaries (compiled by either gcc or clang). 
%Therefore, we do not consider AFL to be a purely white-box tool.
%A screenshot of AFL's interface is shown in \autoref{fig:afl-interface}.
%
%\begin{figure}[h]
%	\centering
%	\includegraphics[width=0.7\textwidth]{figures/afl-interface}
%	\caption[AFL fuzzer interface]{Interface of the AFL fuzzer in interactive mode (colours are inverted). 
%   The status screen shows information about the number of discovered test cases, the effectiveness of used mutation strategies and other statistics.}
%	\label{fig:afl-interface}
%\end{figure}

From a high-level, simplified perspective, AFL works by taking a program and a queue of tests, and iteratively mutating these tests to see if the \emph{coverage} of the program is increased; new tests that increase coverage are added to the queue. 
Test execution speed is increased by using a \emph{fork server}.
In the next paragraphs, we will describe in more detail how coverage is measured by AFL, which mutation strategies are applied, and how the fork server works.

% I have commented out this algorithm, because it is quite easy to capture its purpose in a small piece of text (above).
%\begin{algorithm}[b]
%\caption[High-level overview of AFL]{High-level overview of the American Fuzzy Lop fuzzer}
%\label{alg:afl}
%\begin{algorithmic}
%\REQUIRE A program and a queue of initial test cases
%LOOP
%\STATE take next test case from the queue
%\FORALL{available mutation strategies}
%\STATE mutate test case
%\STATE run target program on test case
%\IF{coverage is increased}
%\STATE add new test case to the queue
%\ENDIF
%\ENDFOR
%\ENDLOOP
%\end{algorithmic}
%\end{algorithm}

\subsubsection*{Measuring coverage}
If a mutated test case results in a higher coverage of the target program, the test case is seen as valuable. 
The intuition behind this is simple: we want to cover as much of the target program's code as possible, which gives us the highest chance of discovering all behaviour (i.e.\ interesting test queries) of the program. 

In order to measure this coverage, AFL uses either compile-time or runtime instrumentation of the control flow of the program (branches, jumps, etc.), to identify which parts of the target program are used in a given test. 
Using this knowledge, AFL can decide which test cases cover behaviour not previously seen in other test cases, simply by comparing the result of the instrumentation. 

Internally, coverage is measured by using a so-called \emph{trace bitmap}, which is a \SI{64}{\kilo\byte} array of memory shared between the fuzzer and the instrumented target. 
This array is updated by the following code every time an edge in the control flow is taken.
\begin{lstlisting}[language=C]
cur_location = <COMPILE_TIME_RANDOM>;       /* random ID of this edge's target block */
shared_mem[cur_location ^ prev_location]++; /* increment hit count for this edge */
prev_location = cur_location >> 1;          /* shift so that A->B and B->A differ */
\end{lstlisting}
Every location in the array is represented by a compile-time random value. 
When an edge in the control flow is taken, the bitmap is updated at the position of the current location and an xor of the previous location value. 
This causes every edge in the control flow to be represented by a different byte in the bitmap.

Note that because the size of the bitmap is finite and the values that represent locations in the code are random, the bitmap is probabilistic: there is a chance that collisions will occur. 
This is especially the case when the bitmap fills up, which can happen when fuzzing large programs with many edges in their control flow. 
AFL can detect and resolve this situation by applying instrumentation on fewer edges in the target or by increasing the size of the bitmap.
A trace bitmap for the program in \autoref{fig:fsm} is shown in \autoref{fig:afl-cfg}. 
%We see that the input sequence 10 20 20 covers all edges of the program control flow.
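How quickly collisions become a problem can be estimated with a standard birthday-style calculation. The sketch below is our own back-of-the-envelope estimate, assuming edge identifiers are uniformly random; the edge counts are hypothetical, and only the \num{65536} slots correspond to AFL's \SI{64}{\kilo\byte} bitmap:

\begin{lstlisting}[language=C]
#include <stdio.h>

/* Expected number of colliding edges when `edges` random identifiers
 * are hashed into `slots` positions: the number of edges minus the
 * expected number of distinct slots that get hit. */
static double expected_collisions(int edges, double slots) {
    double miss = 1.0; /* probability that a given slot is never hit */
    for (int i = 0; i < edges; i++)
        miss *= 1.0 - 1.0 / slots;
    double distinct = slots * (1.0 - miss);
    return edges - distinct;
}

int main(void) {
    const double slots = 65536.0; /* one slot per byte of the bitmap */
    for (int edges = 2000; edges <= 32000; edges *= 2)
        printf("%6d edges: %8.1f expected collisions\n",
               edges, expected_collisions(edges, slots));
    return 0;
}
\end{lstlisting}

For a hypothetical program with \num{10000} instrumented edges this already yields roughly 700 expected collisions, which illustrates why AFL offers coarser instrumentation or a larger bitmap for big targets.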

\begin{figure}[ht]
	\centering
	\includegraphics[width=0.8\textwidth]{figures/vending-cfg}
	\caption[Visualisation of AFL instrumentation of control flow]{The (instrumented) control flow of the program in \autoref{fig:fsm} (left), and a visual representation of the trace bitmap for several tests (right).  
    When a test is in execution, the instrumentation fills the bitmap to keep track of the edges that are used. 
    Every edge in the control flow graph is represented by a position in the bitmap, which is either taken (coloured) or not taken (uncoloured).}
	\label{fig:afl-cfg}
\end{figure}

As previously mentioned, AFL supports compile-time or runtime instrumentation.
The compile-time instrumentation has the best performance, but requires the source code of the target program to be available. 
When the source code is not available, AFL applies runtime instrumentation, which uses emulation (QEMU or Intel Pin) to achieve the result. 
This, however, is 2--5{$\times$} slower than compile-time instrumentation \citep{afl-website}.

\subsubsection*{Mutation strategies}
At the core of AFL is its `engine' to generate new test cases. 
As mentioned earlier, AFL uses a collection of techniques to mutate existing test cases into new ones, starting with basic deterministic techniques and progressing onto more complex ones. 
The author of AFL has described the following strategies \citep{afl-blog-mutation}:
\begin{itemize}
	\item Performing sequential, ordered \emph{bit flips} to a sequence of one, two, or four bits of the input.
    \item An extension of bit flips to (a sequence of one, two or four) \emph{bytes}.
    \item Applying \emph{simple arithmetic} (incrementing and decrementing) to integers in the input.
    \item \emph{Overwriting} integers in the input with values from a set of pre-set integers (such as $-1$, 1024 and \lstinline!MAX_INT!) that are known to trigger edge conditions in many programs.
    \item When the deterministic strategies (above) are exhausted, randomised \emph{stacked operations} can be applied, i.e.\ a sequence of single-bit flips, setting discovered byte values, addition and subtraction, inserting new random single-byte sets, deletion of blocks, duplication of blocks through overwrite or insertion, and zeroing blocks.
    \item The last-resort strategy involves taking two known inputs from the queue that cover different code paths and \emph{splicing} them in a random location. 
\end{itemize}
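The first of these strategies, walking bit flips, can be sketched as follows (a simplified illustration, not AFL's actual implementation; the two-byte input and the counting callback are our own):

\begin{lstlisting}[language=C]
#include <stdio.h>

/* Flip `width` consecutive bits starting at bit position `pos`. */
static void flip_bits(unsigned char *buf, int pos, int width) {
    for (int i = pos; i < pos + width; i++)
        buf[i / 8] ^= (unsigned char)(0x80 >> (i % 8));
}

/* Enumerate all walking one-, two- and four-bit flips of the input,
 * run the target on each mutant, and restore the original bytes
 * (flipping the same bits twice is the identity). */
static void walking_flips(unsigned char *buf, int len,
                          void (*run)(const unsigned char *, int)) {
    const int widths[] = {1, 2, 4};
    for (int w = 0; w < 3; w++)
        for (int pos = 0; pos + widths[w] <= len * 8; pos++) {
            flip_bits(buf, pos, widths[w]);
            run(buf, len);                  /* execute target on mutant */
            flip_bits(buf, pos, widths[w]); /* undo the mutation */
        }
}

static int mutant_count;
static void count_run(const unsigned char *buf, int len) {
    (void)buf; (void)len;
    mutant_count++;
}

int main(void) {
    unsigned char input[2] = {0xAB, 0xCD};
    walking_flips(input, 2, count_run);
    printf("%d mutants generated\n", mutant_count); /* 16 + 15 + 13 */
    return 0;
}
\end{lstlisting}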

\subsubsection*{Fork server}
In general, fuzzers generate a lot of tests.
Therefore, many invocations of the target process are required.
Instead of starting a new process for every test, AFL uses a \emph{fork server} to speed up fuzzing.
The fork server initialises the target process only once, and then forks (clones) it to create a new instance for each test case.

On modern operating systems, a process fork is done in a copy-on-write fashion, which means that any memory allocated by the process is only copied when it is modified by the new instance. 
This eliminates most (slow) memory operations compared to a regular process start \citep{afl-blog-forkserver}, and allows for the execution of approximately \num{10000} tests per second on a single core of our machine.
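The mechanism can be illustrated with a minimal sketch (our own simplification: the real fork server runs the instrumented target and communicates with the fuzzer over pipes, both of which we omit, and \lstinline!run_target! is a hypothetical stand-in for the target's logic):

\begin{lstlisting}[language=C]
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical target logic, normally only reached after
 * expensive process initialisation. */
static int run_target(int test_case) {
    return test_case % 2; /* pretend the exit status depends on the input */
}

int main(void) {
    /* ...expensive one-time initialisation would happen here... */
    for (int test = 0; test < 3; test++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: a copy-on-write clone of the initialised
             * process; no start-up cost is paid again. */
            exit(run_target(test));
        }
        int status = 0;
        waitpid(pid, &status, 0);
        printf("test %d exited with status %d\n", test, WEXITSTATUS(status));
    }
    return 0;
}
\end{lstlisting}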

\section{Combining learning and fuzzing}
%Because of its excecution speed, different mutation strategies and effective instrumentation for measuring coverage, AFL has been found to generate interesting inputs for several complicated programs.
%One example is the creation a valid JPEG images with only the string ``hello'' used as an initial test case%, as shown in \autoref{fig:afl-jpeg-testcases}.
%AFL's success in these scenarios motivate its application to active learning.

%\begin{figure}
%	\centering
%	\includegraphics[height=40mm]{figures/afl-jpeg-testcases}
%	\caption[Test cases for JPEG library generated by AFL]{A number of test cases generated by fuzzing a JPEG library with AFL. The fuzzer was seeded with only the string ``hello''. After a one or two days, the fuzzer came up with a blank grayscale JPEG image, 3 pixels wide and 784 pixels tall. This, and subsequent other images that were generated are shown in this figure \cite{afl-blog-jpeg}.}
%	\label{fig:afl-jpeg-testcases}
%\end{figure}

In this section we describe how the learning algorithms described in Section \ref{sec:learning} and the fuzzer described in Section \ref{sec:fuzzing} can be combined to learn models of reactive software systems.
First, we describe how AFL can be used to find counterexamples, and speed up the execution of membership queries.
Second, we describe the systems presented in the \emph{Rigorous Examination of Reactive Systems challenge 2015} (RERS2015) \cite{rers2015}, which we use as a benchmark for comparing the effectiveness of this approach to the existing techniques described in Section \ref{sec:testing}.
We conclude this section by presenting the results of these experiments. 

\subsection{Experimental setup}
For our experiments, we have combined AFL with LearnLib, an open-source Java library for active learning \citep{merten2011learnlib}\footnote{Code available at \url{https://github.com/praseodym/learning-fuzzing}.}.
To answer equivalence queries using AFL we have implemented a new equivalence oracle, \lstinline!AFLEQOracle!. 
This oracle iteratively loads the tests that AFL marks as interesting, and parses each of them as a sequence of inputs.
As AFL does not use a fixed set of inputs, any inputs that are not known by the learner are skipped. 
The resulting input sequence is posed as a test query to both the target program and the hypothesised model. 
When the outputs of the target and hypothesis do not match, the input sequence is returned as a counterexample.
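The heart of this oracle is a plain comparison loop over both machines. The following sketch illustrates it on two toy Mealy machines encoded as transition tables (the tables and test sequences are invented for illustration; in our implementation the target is the real program and the hypothesis comes from LearnLib):

\begin{lstlisting}[language=C]
#include <stdio.h>

#define STATES 3
#define INPUTS 2

/* Two toy Mealy machines as next-state/output tables: the `tgt'
 * tables play the role of the target, the `hyp' tables the hypothesis. */
static const int tgt_next[STATES][INPUTS] = {{1, 0}, {2, 0}, {2, 2}};
static const int tgt_out [STATES][INPUTS] = {{0, 1}, {0, 1}, {1, 1}};
static const int hyp_next[STATES][INPUTS] = {{1, 0}, {0, 0}, {0, 0}};
static const int hyp_out [STATES][INPUTS] = {{0, 1}, {0, 1}, {0, 0}};

/* Run one test query on both machines; report 1 on the first
 * diverging output, i.e. when the query is a counterexample. */
static int is_counterexample(const int *seq, int len) {
    int qt = 0, qh = 0; /* both machines start in their initial state */
    for (int i = 0; i < len; i++) {
        if (tgt_out[qt][seq[i]] != hyp_out[qh][seq[i]])
            return 1;
        qt = tgt_next[qt][seq[i]];
        qh = hyp_next[qh][seq[i]];
    }
    return 0;
}

int main(void) {
    int t1[] = {0, 1};    /* outputs agree on both machines */
    int t2[] = {0, 0, 0}; /* reaches a state where outputs diverge */
    printf("t1 counterexample: %d\n", is_counterexample(t1, 2));
    printf("t2 counterexample: %d\n", is_counterexample(t2, 3));
    return 0;
}
\end{lstlisting}

Our \lstinline!AFLEQOracle! performs this comparison with AFL's interesting test cases as the candidate queries, returning the first diverging input sequence as a counterexample.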

An overview of the architecture for combining AFL and LearnLib is shown in \autoref{fig:learner-architecture}.
To realise this architecture, we had to tackle the following main issues:
\begin{itemize}
\item As AFL is provided as a standalone tool, we have created a library, libafl, that the learner can communicate with.
\item As LearnLib is written in Java, and AFL (and libafl) are written in C, we needed to bridge all communication between the two.
For this purpose, we have used the Java Native Interface (JNI) programming interface, which is part of the Java language.
JNI allows for code running in the Java Virtual Machine (i.e.\ LearnLib) to interface with platform-specific native binaries or external libraries (i.e.\ libafl).
\item We have added the possibility to embed the target program in AFL's fork server.
For each membership or test query, the fork server creates a new instance of the target process.
This speeds up the execution of learning, independent of the technique used to find counterexamples.
\end{itemize}

\begin{figure}[b]
	\centering
    \begin{tikzpicture}[>=stealth']
    \draw (0,0) rectangle (3,2);
    \node [below] at (1.5,2) {JVM};
    \draw (0.25,0.25) rectangle (2.75,1.5) node[midway] {LearnLib};
    \draw (4.5,0.25) rectangle (7,1.5) node [midway] {libafl};
    \draw (8.5,0) rectangle (11.5,2);
    \node [below] at (10,2) {AFL fork server};
    
	\draw (8.8,0.45) rectangle (11.05,1.45) [fill=white];
	\draw (8.9,0.35) rectangle (11.15,1.35) [fill=white];
    \draw (9,0.25) rectangle (11.25,1.25) [fill=white] node [midway] {Target process};
    \draw [<->] (3,0.75) -- (4.5,0.75) node[midway,fill=white] {JNI};
    \draw [<->] (7,0.5) -- (9,0.5) node[pos=0.4,fill=white] {queries};
    \draw [->] (8.65,0.5) -- (8.65,0.75) -- (8.9,0.75);
    \draw [->] (8.65,0.75) -- (8.65,1) -- (8.8,1);
    \draw [->] (7,1.25) -- (8.5,1.25) node[midway,fill=white] {setup};
    \end{tikzpicture}
	\caption{Architecture for combining LearnLib and AFL.}
	\label{fig:learner-architecture}
\end{figure}

\noindent There were some other issues that we had to address:
\begin{itemize}
\item A design decision in AFL is that it does not care about the target program's output.
Instead, only coverage data is used as a measure of test case relevancy. 
The learner, however, relies on output behaviour. 
Therefore, we have extended AFL to always save data from the target's \lstinline{stdout} into a shared memory buffer (shared between libafl and the fork server process). 
The content of this shared memory buffer is returned to LearnLib after a successful query.
\item AFL runs the target program in a non-interactive manner, i.e.\ it provides the program with input once and then expects it to terminate and reset its state. 
This is in contrast to the default behaviour of LearnLib, which expects a \textit{single-step SUL} that repeatedly accepts an input value, returns the associated output, and has an explicit option to reset.
We initially simulated this behaviour in AFL by running the target program once for each prefix of an input sequence.
For our benchmarks, however, we could run each input sequence once, as it was easy to correlate individual inputs to their corresponding outputs.
\end{itemize}
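The prefix-based simulation mentioned in the last item can be sketched as follows (\lstinline!run_program! is a stand-in for one non-interactive run of the target; in reality each run goes through the fork server):

\begin{lstlisting}[language=C]
#include <stdio.h>

/* Stand-in for one non-interactive run of the target: consume the
 * whole input sequence and return the final output (a toy checksum). */
static int run_program(const int *seq, int len) {
    int out = 0;
    for (int i = 0; i < len; i++)
        out = out * 3 + seq[i];
    return out;
}

/* Answer an interactive-style query on a non-interactive program:
 * the i-th output is the final output of a fresh run on seq[0..i]. */
static void query_by_prefixes(const int *seq, int len, int *outputs) {
    for (int i = 0; i < len; i++)
        outputs[i] = run_program(seq, i + 1);
}

int main(void) {
    int seq[] = {1, 0, 2};
    int outputs[3];
    query_by_prefixes(seq, 3, outputs);
    for (int i = 0; i < 3; i++)
        printf("output after prefix of length %d: %d\n", i + 1, outputs[i]);
    return 0;
}
\end{lstlisting}

Note that this costs one program run per prefix, which is why we preferred running each input sequence once whenever individual inputs could be correlated with their outputs directly.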

\subsection{The RERS2015 benchmark}
RERS2015 was a program analysis challenge on reactive systems to evaluate the effectiveness of different software verification techniques.
% TODO: Problems 4-6 are runtime verification problems; the sentence below is not entirely correct
% The challenge consisted of three artificially generated white-box problems (problems 1 through 3), i.e.\ programs for which obfuscated source code was made available, and three artificially generated black-box problems (problems 4 through 6), i.e.\ programs for which part of the source code was unavailable.
The challenge consisted of three artificially generated white-box problems (problems 1 through 3), for which the source code could be fully viewed and modified, and three artificially generated runtime verification problems (problems 4 through 6), for which parts of the source code were not made available until a few days before the contest deadline, and whose source code was not allowed to be modified.
% TODO: http://www.rers-challenge.org/2015/index.php?page=trainingphase says the following
% "The semantics of these external methods are subject to change, and the source file should not be modified for instrumentation purposes."
% For problems 4-6 we did instrument the target binary using compile time code injection, unsure if this violates that rule
In each of the categories, the problems were of increasing difficulty and size.
Problems 1 and 4 accept 5 inputs, problems 2 and 5 accept 10 inputs, and problems 3 and 6 accept 20 inputs.

One of the aims of the challenge was to analyse a set of reachability properties of each of these problems.
These reachability properties were implicitly defined by a set of 100 unique erroneous outputs (exceptions or failed assumptions) that were present in the problems.
Not all error states were reachable, however, and therefore not all erroneous outputs could be produced by the problems' programs.
The goal of the challenge was to classify for each problem which of the erroneous outputs could be produced, i.e.\ to determine which of the \emph{reachability states} were present in the problem's target model.

This proved to be a challenging task, as only one of the two participants managed to correctly classify all properties for the white-box problems.
None of the participants managed to correctly classify all properties for any of the runtime verification problems.
After the competition had ended, the results for all six problems were made available.

In our experiments, we compare several learning and testing approaches on their ability to discover the `reachability states' that are (now) known to be present in the RERS2015 problems.
We use the approach described in this section for learning models for both the white-box problems and the runtime verification problems, and we compare them against models learned with the conformance testing methods described in \autoref{sec:testing}.
The goal in our experiments is to reach as many of the existing reachability states as possible.
Since a unique erroneous output is generated for each reachability state, this is easy to measure.

Besides comparing the different approaches for their ability to discover the reachability states, we compare the number of states learned.
As a learned model is guaranteed to be minimal, having more states indicates that more behaviour of the target program has been modelled.

\subsection{Results} \label{sec:results}
An overview of the results is shown in \autoref{tab:learning}.
In addition to the number of (reachability) states learned, this table compares learning performance in terms of learning time and the number of queries needed (lower is better).
In all cases, using fuzzing for equivalence testing delivers models with more states and more reachability states found, in a shorter learning time. 
For most problems these results are impressive: for Problem 4, for example, we learned about 350{$\times$} more states using fuzzing than using the W-method (7402 versus 21 states) and found 21 reachability labels using fuzzing versus only one using the W-method. 
For other problems, the difference is less striking but still significant: for Problem 3, for example, we learned only about 1.4{$\times$} more states (1094 versus 798 states) and three more reachability labels using fuzzing.

In addition, for Problem 1, we have run an experiment in which we have combined fuzzing with conformance testing. 
In this setup, denoted with an asterisk ($^{\ast}$) in \autoref{tab:learning}, we first used a test set generated by AFL to find counterexamples, and then verified the final hypothesis with a test set generated by the W-method with a traversal set that has depth 1 (i.e.\ contains all sequences of length one -- single inputs).
During this verification phase, more counterexamples were found, which indicates that a combination of these testing methods leads to even better results.
Unfortunately, this experiment took more than 8 days to complete.
Therefore, instead of running this setup for the other (more complicated) problems, we consider this combination of fuzzing and conformance testing a very interesting direction for future work.

For several of the learned models we have rendered pictures that visually show the difference between the models learned with the different methods.
These pictures are shown in \autoref{fig:figures}.
In the rendered figures, we excluded the sink state that is reached when an invalid input is given. 
In this sink state, the program returns \lstinline!invalid_state! as its output for any input. 
Because many other states in the program have a transition to this state for some input value, visualising the models with this state included creates a cluttered picture.

For Problems 4, 5 and 6 we used only the TTT learning algorithm in combination with our fuzzing equivalence test, because the L* method did not finish within a reasonable time (several days).
The maximum W-method depth we chose was also limited by the execution time; using a larger maximum depth leads to an exponentially longer equivalence testing time. 
For depth values with reasonable runtime (shorter than a day), increasing the depth did not lead to learning of a significant number of new states.

For the `random' equivalence tester we generated random sequences of between 200 and 300 inputs.
Moreover, for the adaptive distinguishing sequences (ADS), we randomly generated infixes with an average length of 100 to replace the traversal set.
For both of these equivalence testing methods we set a limit of \num{1000000} test queries per hypothesis.
If no counterexample was found at that point, the learning process terminated.

Note that the learning times we report include only the time the learning process ran, not the time the fuzzer ran. 
We ran the AFL fuzzer on each problem for one day, and the test cases generated during that time were used for equivalence testing during the learning process. 
As we have described, however, the performance of other testing methods was not improved if they were run with parameters that yield a runtime shorter than a day.
Therefore, if we include fuzzing time in our comparison, the other testing methods are still outperformed.

\begin{table}[p]
	\caption{Results for the RERS2015 challenge problems on an Intel Xeon CPU E5-2430 v2 @ 2.50GHz (virtualised server), with Oracle Java 8 JVM configured with 4GB heap and 32MB thread stack.}
	\label{tab:learning}
    \sisetup{table-alignment=center,table-figures-decimal=0}
    %\hspace*{-5mm}
	\begin{tabularx}{\textwidth}{lX
				    S[table-figures-integer=5]
				    cr
                    S[table-figures-integer=10]
                    l}
	% Random eq. params: min. length 1, max. length 100, max. queries per round 1 million
    % ADS eq. params: max. k 1, max. length 100, max. queries per round 1 million
	\toprule
		{Target} 		& {Method}				& {States}	& {Reachability} & {Time} 	& {Queries}	& {Fig.}	\\               		
	\midrule
    	Problem 1		& TTT, Random			& 25		& 19/29			& 00:06:17	& 1039184	& - \\
    	Problem 1		& TTT, ADS				& 25		& 19/29			& 00:07:37  & 1001077   & - \\
        Problem 1		& TTT, W-method 1		& 25		& 19/29			& 00:00:04	& 7342		& -	\\
		Problem 1		& L*, W-method 8		& 25	    & 19/29			& 13:49:10	& 246093350 & \ref{fig:rers1-wmethod} \\
   		Problem 1		& TTT, fuzzing			& 334		& 29/29			& 00:00:21	& 16731  	& -	\\
		Problem 1		& \textbf{L*, fuzzing}	& 1027		& 29/29 		& 00:44:40	& 2860990 	& \ref{fig:rers1-afleq} \\
      	Problem 1		& TTT, combined$^{\ast}$ & 1688		& 29/29			& 210:43:23 & 1767418085& - \\
	\midrule
    	Problem 2		& TTT, Random			& 98		& 15/30			& 00:23:58	& 3603639	& - \\
    	Problem 2		& TTT, ADS				& 202		& 15/30			& 00:25:01	& 3057833	& - \\
        Problem 2		& TTT, W-method 1		& 188		& 15/30			& 01:00:05	& 8156730	& -	\\
		Problem 2		& L*, W-method 3		& 195	    & 15/30			& 17:05:59	& 239851191 & \ref{fig:rers2-wmethod} \\
   		Problem 2		& TTT, fuzzing			& 2985		& 24/30			& 00:13:06	& 412340  	& -	\\
		Problem 2		& \textbf{L*, fuzzing}	& 3281		& 24/30			& 13:28:42	& 42120554 	& \ref{fig:rers2-afleq} \\
	\midrule
    	Problem 3		& TTT, Random			& 134		& 16/32			& 00:33:32	& 5046743	& - \\
    	Problem 3		& TTT, ADS				& 747		& 16/32			& 01:09:11	& 6425786	& - \\
		Problem 3		& L*, W-method 1		& 798	    & 16/32			& 110:15:36	& 1421330223& \ref{fig:rers3-wmethod} \\
   		Problem 3		& TTT, fuzzing			& 1054		& 19/32			& 00:13:27	& 698409  	& -	\\
		Problem 3		& \textbf{L*, fuzzing}	& 1094		& 19/32			& 13:28:42	& 23464145 	& \ref{fig:rers3-afleq} \\
	\midrule
    	Problem 4		& TTT, Random			& 21	    & 1/23			& 00:07:01	& 1007331	& - \\
    	Problem 4		& TTT, ADS				& 21	    & 1/23			& 00:06:31	& 1001266	& \ref{fig:rers4-adseq} \\
		Problem 4		& TTT, W-method 7		& 21	    & 1/23			& 04:11:13	& 51760913  & - \\
   		Problem 4		& L*, W-method 7		& 21 	    & 1/23			& 03:10:51	& 51759741  & -	\\
   		Problem 4		& \textbf{TTT, fuzzing}	& 7402		& 21/23			& 00:16:48	& 458763  	& \ref{fig:rers4-afleq} \\
	\midrule
    	Problem 5		& TTT, Random			& 122 		& 15/30			& 00:22:15	& 3359153	& - \\
    	Problem 5		& TTT, ADS				& 183 		& 15/30			& 00:11:36	& 1311556	& \ref{fig:rers5-adseq} \\
		Problem 5		& L*, W-method 1		& 183       & 15/30			& 00:13:17	& 2203813   & - \\
   		Problem 5		& \textbf{TTT, fuzzing}	& 3376		& 24/30			& 00:08:00	& 416943  	& \ref{fig:rers5-afleq} \\    
	\midrule
    	Problem 6		& TTT, Random			& 75 		& 16/32			& 00:21:42	& 3233034	& - \\
    	Problem 6		& TTT, ADS				& 739 		& 16/32			& 02:37:34	& 15045326	& \ref{fig:rers6-adseq} \\
		Problem 6		& L*, W-method 1		& 671       & 16/32			& 93:33:10	& 889677067 & - \\
   		Problem 6		& \textbf{TTT, fuzzing}	& 3909		& 23/32			& 00:45:00	& 1804595  	& \ref{fig:rers6-afleq} \\
	\bottomrule
	\end{tabularx}
\end{table}

\clearpage

\begin{figure}
    \hspace{-0.35\textwidth}
	\begin{tabularx}{1.6\textwidth}{YYYY}

\vspace*{20mm}
\subfloat[P.1, L* + W-method]{
	\includegraphics[height=30mm]{figures/rers1-wmethod8}
	\label{fig:rers1-wmethod}
}

&

\subfloat[P.1, L* + fuzzing]{
 	\includegraphics[height=50mm]{figures/rers1-afleq}
	\label{fig:rers1-afleq}
}

&

\vspace*{5mm}
\subfloat[P.2, L* + W-method]{
	\includegraphics[height=45mm]{figures/rers2-wmethod3}
	\label{fig:rers2-wmethod}
}

&

\subfloat[P.2, L* + fuzzing]{
	\includegraphics[height=50mm]{figures/rers2-afleq}
	\label{fig:rers2-afleq}
}

\\

\vspace*{18mm}
\subfloat[P.3, L* + W-method]{
	\includegraphics[height=32mm]{figures/rers3-wmethod1}
	\label{fig:rers3-wmethod}
}

&

\vspace*{10mm}
\subfloat[P.3, L* + Fuzzing]{
	\includegraphics[height=40mm]{figures/rers3-afleq}
	\label{fig:rers3-afleq}
}

&

\vspace*{10mm}
\subfloat[P.4, TTT + ADS]{
	\includegraphics[height=40mm]{figures/rers4-ttt-adseq}
	\label{fig:rers4-adseq}
}

&

\subfloat[P.4, TTT + Fuzzing]{
	\includegraphics[height=50mm]{figures/rers4-ttt-afleq}
	\label{fig:rers4-afleq}
}

\\

\vspace*{20mm}
\subfloat[P.5, TTT + ADS]{
	\includegraphics[height=30mm]{figures/rers5-ttt-adseq}
	\label{fig:rers5-adseq}
}

&

\subfloat[P.5, TTT + Fuzzing]{
	\includegraphics[height=50mm]{figures/rers5-ttt-afleq}
	\label{fig:rers5-afleq}
}

&

\vspace*{3mm}
\subfloat[P.6, TTT + ADS]{
	\includegraphics[height=47mm]{figures/rers6-ttt-adseq}
	\label{fig:rers6-adseq}
}

&

\subfloat[P.6, TTT + Fuzzing]{
	\includegraphics[height=50mm]{figures/rers6-ttt-afleq}
	\label{fig:rers6-afleq}
}
  
\end{tabularx}

	\caption{Rendering of learned models for RERS2015 challenge problems.}
	\label{fig:figures}
    \vspace{-34pt} % get rid of "float too large" warning
\end{figure}

\clearpage

\section{Conclusion \& Future Work}
An ongoing challenge for learning algorithms formulated in the Minimally Adequate Teacher framework is to efficiently obtain counterexamples.
In this paper we have presented mutation-based fuzzing as a technique for obtaining counterexamples when learning finite state machine models for reactive software systems.
Specifically, we have combined learning algorithms with the American Fuzzy Lop fuzzer, which is interesting for its approach of combining mutation-based test case generation with code coverage monitoring.
This coverage-guided approach significantly outperforms traditional conformance testing methods on the RERS2015 benchmarks.

There are several directions in which this work can be extended in future research.
First, as noted in \autoref{sec:results}, combining fuzzing with conformance testing for finding counterexamples might lead to better results than using fuzzing in isolation.
Currently, this approach requires a lot of time.
Therefore, it would be interesting to combine the approaches in a way that harnesses the strengths of both methods (i.e.\ a `small', yet complete test set with maximum coverage).
Second, it is possible to use \emph{passive learning} techniques (i.e. learning from observations) on test cases generated by a mutation-based fuzzer. 
It would be interesting to compare this to the active learning approaches discussed in this paper with regard to learning time and the quality of the learned model in terms of overall correctness. 
Third, it would be interesting to relate transitions in a learned finite state machine to the code that is executed during these transitions.
This could be combined with tainting to show how input values influence behaviour.
Finally, the AFL-LLVM compiler could be adapted to work on LLVM IR \citep{LLVMLangRef} (generally encoded as LLVM Bitcode \citep{LLVMBitCodeFormat}), which would allow it to instrument existing binaries if they can be disassembled to LLVM IR. 
This would give large performance benefits over AFL's QEMU fuzzing, the current go-to solution for fuzzing binary-only programs that cannot be recompiled with instrumentation.

\bibliography{library}

\end{document}