# Preliminaries

## Install Rust
See: [https://www.rust-lang.org/tools/install](https://www.rust-lang.org/tools/install)

## Install Anaconda
See: [https://docs.anaconda.com/anaconda/install/index.html](https://docs.anaconda.com/anaconda/install/index.html)

## (Recommended) install mamba for faster virtual environment installations
See: [https://github.com/mamba-org/mamba](https://github.com/mamba-org/mamba)

## Install and activate the virtual environment:
```bash
<mamba/conda> env create -f conda_environment.yml && \
conda activate natural_computing
```

# The dataset

## Obtain the raw data
Run the following commands:
```bash
mkdir -p data
wget https://nilsgolembiewski.nl/public_files/uploads/2IGcXY8HeE69JlgFk1QLCvBh7NRxAV/full_export.txt.gz -O - | gunzip -c > data/raw_data.txt
```

## Generate dataset from raw data
```bash
cargo run --manifest-path=data_generation/Cargo.toml --release -- -d ./data/raw_data.txt -o ./data/dataset -l 20
```

This will result in the following dataset:

Folder structure: `folder/<canvas_id>/<canvas_id>_<user_id>_<idx>_<label>_<info>.<data_type>`.
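
As an illustration of the naming scheme above, here is a minimal Python sketch that splits a dataset path into its fields. It assumes that `<canvas_id>`, `<user_id>`, `<idx>` and `<label>` contain no underscores (while `<info>` may, as in `mask_points`). The `SampleFile` type, the `parse_sample_path` function and the IDs in the example path are hypothetical and not part of this repository.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class SampleFile(NamedTuple):
    canvas_id: str
    user_id: str
    idx: str
    label: str
    info: str       # e.g. "before", "delta", "mask_points", "sequence"
    data_type: str  # file extension without the dot, e.g. "png" or "txt"


def parse_sample_path(path: str) -> SampleFile:
    p = PurePosixPath(path)
    # Split on the first four underscores only, so that an <info> part such
    # as "mask_points" stays intact (assumes the IDs contain no underscores).
    parts = p.stem.split("_", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected file name: {p.name}")
    return SampleFile(*parts, data_type=p.suffix.lstrip("."))


# Made-up example path, for illustration only.
sample = parse_sample_path("data/dataset/c1/c1_u7_0_1_mask_points.txt")
print(sample.info, sample.data_type)  # mask_points txt
```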

### `before.png`
The state of the canvas as it was before the modification.

### `delta.png`
The modifications since the canvas was moved.

### `mask_points.txt` columns
`x`, `y`. The first `y` value (`y = 0`) is the top of the image.

### `sequence.txt` columns
Each row concatenates the information of the latest placed pixel and of the previously placed pixel (if any):

- Placed pixel: `canvas_id`, `user_id`, `x`, `y`, `r`, `g`, `b`, `timestamp`, `is_grief`
- Previous pixel: `exists`, `user_id`, `r`, `g`, `b`, `timestamp`. If `exists` is zero, all other values are -1.
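
The column layout above can be sketched as a small Python parser. The field separator is not stated here, so the comma used below is an assumption, as are the helper names `PLACED_COLUMNS`, `PREVIOUS_COLUMNS` and `parse_sequence_row`. Values are kept as strings because their exact types are not specified, and the example row is made up.

```python
PLACED_COLUMNS = ["canvas_id", "user_id", "x", "y", "r", "g", "b", "timestamp", "is_grief"]
PREVIOUS_COLUMNS = ["exists", "user_id", "r", "g", "b", "timestamp"]


def parse_sequence_row(line: str, sep: str = ","):
    """Split one row of sequence.txt into (placed, previous) dictionaries.

    `previous` is None when `exists` is zero, i.e. there is no earlier pixel.
    The default separator is an assumption; pass `sep` to override it.
    """
    values = [v.strip() for v in line.strip().split(sep)]
    expected = len(PLACED_COLUMNS) + len(PREVIOUS_COLUMNS)
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    placed = dict(zip(PLACED_COLUMNS, values[:len(PLACED_COLUMNS)]))
    previous = dict(zip(PREVIOUS_COLUMNS, values[len(PLACED_COLUMNS):]))
    if float(previous["exists"]) == 0:
        previous = None
    return placed, previous


# Made-up row: a pixel placed at (10, 20) with no previously placed pixel.
placed, previous = parse_sequence_row("3,42,10,20,255,0,0,1650000000,0,0,-1,-1,-1,-1,-1")
print(placed["x"], placed["y"], previous)  # 10 20 None
```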

## Downloads
A pregenerated dataset can be downloaded here: [https://nilsgolembiewski.nl/public_files/uploads/fDhANiJtdVw7EZSoW3sFyunk6mRL9q/dataset.zip](https://nilsgolembiewski.nl/public_files/uploads/fDhANiJtdVw7EZSoW3sFyunk6mRL9q/dataset.zip).

The corresponding `train_metadata.yml` can be downloaded from [https://nilsgolembiewski.nl/public_files/uploads/dXuJ7lqc6WPKh4ebgfVOw523vnSAjN/train_metadata.yml.gz](https://nilsgolembiewski.nl/public_files/uploads/dXuJ7lqc6WPKh4ebgfVOw523vnSAjN/train_metadata.yml.gz)

Or use the following commands (unzipping may take a while):
```bash
mkdir -p data
cd data
wget https://nilsgolembiewski.nl/public_files/uploads/fDhANiJtdVw7EZSoW3sFyunk6mRL9q/dataset.zip -O dataset.zip \
    && unzip -q dataset.zip \
    && rm dataset.zip
wget https://nilsgolembiewski.nl/public_files/uploads/dXuJ7lqc6WPKh4ebgfVOw523vnSAjN/train_metadata.yml.gz -O - | gunzip -c > train_metadata.yml
```


# Training

## Vision model
### Train for the folds
```bash
python train_vision.py -t data/train_metadata.yml -j 4 -e configurations/train_vision/<configuration> -o output/vision_models
```
For the experiments in the report, the configurations `resnet_18.yml` and `simple.yml` were used.

## Sequence model
### Train for the folds
```bash
python train_sequence.py -t data/train_metadata.yml -j 4 -e configurations/train_vision/resnet_18.yml -o output/sequence_models
```
For the experiments in the report, the configurations `simple.yml` and `simple_lstm_400.yml` were used.

## Inspect results
Run:
```bash
mlflow ui
```
Then view the results in a browser by opening the link that is printed. Each fold is a separate run, but all folds share a common `unique_id`, which can be found in the run parameters.

The best models for each fold can be found in the output folder (`output/<type>_models/<unique_id>`) if the commands above were used, where `<type>` is either `vision` or `sequence`.

## Run an ensemble
### Put the relevant checkpoints in a directory
This can be done either manually, by copying the checkpoints from the output directory, or with the `scripts/extr_best_models.py` script. WARNING: The best models are extracted based on their filenames. If the filenames of the models in the output directory have changed, the script won't work.

For example:
```bash
python scripts/extr_best_models.py -i <unique_id> -o models/<type>   
```

Create a `yaml` ensemble configuration file (the files used for the report can be found in `configurations/ensemble`).
Example:
```yml
vision_model_paths:
  - "models/vision"
sequence_model_paths:
  - "models/sequence"
ensemble_strategy: "avg_prob" 
```
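
The name `avg_prob` suggests that the class probabilities predicted by the individual models are averaged before a decision is made. The sketch below illustrates that reading in plain Python. It is an assumption about the strategy, not the code used by `ensemble_analysis.py`, and `average_probabilities` is a hypothetical helper.

```python
def average_probabilities(model_probs):
    """Average per-class probabilities over models for a single sample.

    `model_probs` holds one list of class probabilities per model.
    Returns the averaged probabilities and the index of the most likely class.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_classes = len(model_probs[0])
    if any(len(p) != n_classes for p in model_probs):
        raise ValueError("all models must predict the same number of classes")
    avg = [sum(p[c] for p in model_probs) / len(model_probs) for c in range(n_classes)]
    predicted = max(range(n_classes), key=avg.__getitem__)
    return avg, predicted


# Three made-up models predicting two classes for one sample.
avg, predicted = average_probabilities([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]])
print(predicted)  # 0
```

In this example two of the three models favor class 1, yet the averaged probabilities favor class 0, because the first model is much more confident than the other two.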

### Run the ensemble and generate results
```bash
python ensemble_analysis.py -t data/train_metadata.yml -e '<path to ensemble configuration>'
```

### Inspect the ensemble results
Run (if not done previously):
```bash
mlflow ui
```
Then inspect the results in a browser (default: http://localhost:5000), in the `ensemble_analyisis` experiment.