Merge latest experiments into main #14

Open: wants to merge 33 commits into base: main

Commits (33)
d7e8ca2
Move evaluation to Python script
anjsimmo Sep 15, 2023
c2032f3
Update the experiment. In particular, configure EFDT to leaves to be …
anjsimmo Sep 27, 2023
e69be80
Extract number of nodes and save tree for each batch
anjsimmo Sep 27, 2023
7dd0e1c
Automate experiment on batches of 1000 from all datasets
anjsimmo Sep 30, 2023
6e75ce9
Support multiple batches
anjsimmo Oct 6, 2023
c09c513
Log time, experiment with inf depth trees (slow!)
anjsimmo Oct 7, 2023
7e1b5ae
Revert to depth of 4, but larger alpha (penalise complexity)
anjsimmo Oct 7, 2023
ca69754
Minor optimisations when splitting data
anjsimmo Oct 7, 2023
e7ad795
Fix bug where was calling original tree rather than sklearn tree
anjsimmo Oct 7, 2023
dfeab7f
Go back to trying inf depth now that using sklearn
anjsimmo Oct 7, 2023
7803615
Try unbounded tree depth with low complexity penalty
anjsimmo Oct 7, 2023
aefad97
FIX: Test data leaking into training during tree training process
anjsimmo Oct 7, 2023
0858fcb
reduce to flat 10 penalty
anjsimmo Oct 7, 2023
95a657c
reduce to flat 1 penalty
anjsimmo Oct 7, 2023
282d16e
increase back to 30
anjsimmo Oct 7, 2023
86545fc
Based on experimentation set alpha to flat 10 penalty (seems to work …
anjsimmo Oct 7, 2023
1314a84
Add more exception handling to work around bug in similar tree and EFDT
anjsimmo Oct 7, 2023
39ef189
FIX exception handling logic so get accuracy results for Poker (but n…
anjsimmo Oct 7, 2023
5039717
Add parameter variations for full experiment
anjsimmo Oct 7, 2023
50db5c6
Add preliminary analysis of results
anjsimmo Oct 8, 2023
7d2c581
Change group markers in analysis to use same symbol as scatterplot da…
anjsimmo Oct 8, 2023
347d036
Rerun experiments to focus on alpha=5 rather than alpha=10
anjsimmo Oct 8, 2023
64274da
Add analysis for alpha=5
anjsimmo Oct 8, 2023
9e8d313
Add analysis of how acc,similairy,nodes change as a function of batch…
anjsimmo Oct 8, 2023
e23ddbf
Add vfdt to comparison
anjsimmo Oct 8, 2023
49a8635
Add example of similarity score failure
anjsimmo Oct 9, 2023
b422663
Add notebooks demonstrating bugs in similarity metric
anjsimmo Oct 12, 2023
db9b366
Update notebook demonstrating bug for trees consting of only an empty…
anjsimmo Oct 12, 2023
505491c
Document experiment 2 analysis
anjsimmo Oct 13, 2023
a164a66
Merge pull request #1 from anjsimmo/re-evaluation
anjsimmo Sep 26, 2024
4e76c23
Update README
anjsimmo Sep 26, 2024
9e9ec40
Remove rulecosi (not used in final paper)
anjsimmo Sep 26, 2024
0c8e2f4
Remove unused/broken top level files
anjsimmo Sep 26, 2024
1 change: 0 additions & 1 deletion 3rdparty/rulecosi
Submodule rulecosi deleted from e913c2
19 changes: 0 additions & 19 deletions Dockerfile

This file was deleted.

8 changes: 0 additions & 8 deletions Dockerfile.Notebook

This file was deleted.

73 changes: 56 additions & 17 deletions README.md
@@ -1,38 +1,77 @@
# tree_diff

To get started:
This repository contains code for the paper [Minimising changes to audit when updating decision trees](https://arxiv.org/abs/2408.16321).

# Interactive notebooks

We provide notebooks that can be used to explore how the algorithm works without running the full experiment.

To get started with the interactive notebooks:
* Install poetry (https://python-poetry.org/)
* `poetry install`
* `export PYTHONPATH="$PWD"`
* `poetry run jupyter notebook`
* Navigate to `tree_diff/notebooks` in the browser
* Done!

# Run pipeline
* `poetry run python -m tree_diff assembler=baseline mode=train input_path=<project directory>/input`

# Additional dependencies

We use the `dot` command line tool for generating tree figures:
* `dot` command line tool on path (brew install graphviz / conda install graphviz)
* `rulecosi` algorithm needs to be git pulled into a `3rdparty` directory:
* Run `mkdir 3rdparty` from the project directory
* `cd 3rdparty`
* `git clone https://github.com/jobregon1212/rulecosi.git`

In order to run the experiments in the `tree-dff/notebooks/Evaluation.ipynb`.

Install the dependencies using the command below:
The full experiment was run in a conda environment. In addition to standard conda dependencies (sklearn, pandas, numpy, matplotlib), comparisons to EFDT need the river library. Install the additional dependencies using the command below:

```console
$ pip install -r requirements.txt
```
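
As a quick sanity check before running the full experiment, the sketch below (an illustration, not part of the repository) reports which of the libraries named above cannot be imported in the current environment. The helper name `missing_modules` is hypothetical, and the module list only mirrors the libraries mentioned in the text; `requirements.txt` may pin others.

```python
from importlib.util import find_spec

# Import names for the libraries mentioned in the text above.
# Assumption: scikit-learn is imported as `sklearn` and river as `river`.
REQUIRED_MODULES = ["sklearn", "pandas", "numpy", "matplotlib", "river"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

An empty result only shows that the modules can be located; it does not check versions.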

# Datasets
The following datasets are used in the experiments and evaluation for our model with other decision trees:

* [`Adult`](https://www.kaggle.com/datasets/wenruliu/adult-income-dataset)
* [`Mushroom`](https://www.kaggle.com/datasets/uciml/mushroom-classification)
* [`HIGGS`](https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz)
* [`Android Malware`](https://archive.ics.uci.edu/ml/machine-learning-databases/00622/TUANDROMD.csv)
Datasets for the paper are available from the UCI Machine Learning Repository at https://archive.ics.uci.edu. Extract the datasets to a directory called `datasets`. Your folder structure should look like this:

```
tree_diff/
└...

datasets/
├── covertype
│   ├── covtype.data
│   ├── covtype.info
│   └── old_covtype.info
├── covertype.zip
├── hepmass
│   ├── 1000_test.csv
│   ├── 1000_train.csv
│   ├── all_test.csv
│   ├── all_train.csv
│   ├── not1000_test.csv
│   └── not1000_train.csv
├── hepmass.zip
├── higgs
│   └── HIGGS.csv
├── higgs.zip
├── poker+hand
│   ├── poker-hand.names
│   ├── poker-hand-testing.data
│   └── poker-hand-training-true.data
├── poker+hand.zip
├── skin+segmentation
│   └── Skin_NonSkin.txt
├── skin+segmentation.zip
├── susy
│   └── SUSY.csv
└── susy.zip
```
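
The layout above can be verified before starting a long run. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_files` is hypothetical, the file list is copied from the listing above (restricted to the extracted data files), and the `../datasets` location is an assumption based on the relative paths used in `data_scripts.sh`.

```python
from pathlib import Path

# Data files taken from the folder listing above, relative to `datasets/`.
EXPECTED_FILES = [
    "covertype/covtype.data",
    "hepmass/all_train.csv",
    "hepmass/all_test.csv",
    "higgs/HIGGS.csv",
    "poker+hand/poker-hand-training-true.data",
    "poker+hand/poker-hand-testing.data",
    "skin+segmentation/Skin_NonSkin.txt",
    "susy/SUSY.csv",
]


def missing_dataset_files(root):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumed location: `datasets/` as a sibling of the repository checkout.
    missing = missing_dataset_files("../datasets")
    print("Missing files:", missing if missing else "none")
```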

# Running the experiments

After downloading the datasets, you can run the experiment in the paper:

```
python3 experiment2.py
```

# Analysis of experiment results

The experiment will write results to an output directory (starting with `out` followed by a number). The notebook used to analyse these experiment results is `notebooks/experiment2_analysis.ipynb`
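
Because each run writes to a new numbered directory, it can help to locate the most recent one before opening the analysis notebook. The sketch below is an illustration, not code from the repository: the helper name `latest_output_dir` is hypothetical, and the exact naming pattern (`out`, an optional `_` or `-`, then digits) is an assumption, since the text only says the name starts with `out` followed by a number.

```python
import re
from pathlib import Path

# Assumed naming pattern: "out", an optional "_" or "-", then digits.
OUT_DIR_PATTERN = re.compile(r"out[_-]?(\d+)")


def latest_output_dir(base="."):
    """Return the matching directory with the largest number, or None."""
    candidates = []
    for path in Path(base).iterdir():
        match = OUT_DIR_PATTERN.fullmatch(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare numerically so that out10 sorts after out9.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    print("Latest output directory:", latest_output_dir("."))
```

The numeric comparison matters: sorting the names as strings would rank `out9` above `out10`.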

After downloading these datasets, place them in the `tree-diff/notebooks` folder to run the experiments.
3 changes: 3 additions & 0 deletions data_scripts.sh
@@ -0,0 +1,3 @@
head -n 10000 ../datasets/higgs/HIGGS.csv > HIGGS10000.csv
head -n 10000 ../datasets/susy/SUSY.csv > SUSY10000.csv
head -n 10000 ../datasets/hepmass/all_train.csv > hepmass10000.csv
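
For environments without the `head` utility, the same truncation can be sketched in Python. This is an illustrative equivalent, not part of the pull request; the function name `head` and its parameters are hypothetical.

```python
from itertools import islice


def head(src, dst, n=10000):
    """Copy the first `n` lines of `src` to `dst` and return the count."""
    written = 0
    # newline="" keeps the original line endings unchanged.
    with open(src, "r", newline="") as fin, open(dst, "w", newline="") as fout:
        for line in islice(fin, n):
            fout.write(line)
            written += 1
    return written


if __name__ == "__main__":
    import os

    # Source path copied from data_scripts.sh; skipped if the file is absent.
    source = "../datasets/higgs/HIGGS.csv"
    if os.path.exists(source):
        print(head(source, "HIGGS10000.csv"), "lines written")
```

Like `head -n`, this keeps the first lines in file order rather than a random sample, and a header row, if a file has one, counts toward the limit.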
193 changes: 0 additions & 193 deletions dodo.py

This file was deleted.
