[ENH] add PermutationForests and FeatureImportanceForests to sktree #125

Merged
merged 79 commits into from
Oct 5, 2023
Changes from 67 commits
Commits
5a19fc6
ENH initialize with MIRF_AUC and MIRF_MV
PSSF23 Sep 11, 2023
f1a8e49
ENH add statistic alternatives
PSSF23 Sep 11, 2023
fd0b937
no axis=1 when taking posterior slice in MI
sampan501 Sep 12, 2023
efc2587
ENH add y-label permutation test to MIGHT
PSSF23 Sep 12, 2023
d1f7748
FIX rename import
PSSF23 Sep 12, 2023
d4abb4a
FIX, correct function param
PSSF23 Sep 12, 2023
a342482
FIX remove axis param & TST initialize test file
PSSF23 Sep 12, 2023
69c76a8
Adding modularity
adam2392 Sep 12, 2023
11f38c2
Merge branch 'might' of https://github.com/neurodata/scikit-tree into…
adam2392 Sep 12, 2023
b3ab11d
TST experiment with unit test
PSSF23 Sep 12, 2023
3ef1d7e
Merge branch 'main' into might
PSSF23 Sep 12, 2023
4ed31f8
FIX correct variable name
PSSF23 Sep 12, 2023
9114859
TST remove patch oblique tree tests
PSSF23 Sep 12, 2023
bbdfa33
Merged main
adam2392 Sep 12, 2023
5e1127f
Merge branch 'main' into might
adam2392 Sep 12, 2023
6b52608
WIP
adam2392 Sep 13, 2023
1a5ebe3
Linear model not working?
adam2392 Sep 14, 2023
d8da658
Adding permutation forest that seems to work
adam2392 Sep 15, 2023
3d666af
Upload notebook
adam2392 Sep 17, 2023
aab5df3
FIX correct MI calculation for MIGHT 2-class
PSSF23 Sep 18, 2023
972fe95
Correlated logit model
adam2392 Sep 18, 2023
f26825d
Patch up python code
adam2392 Sep 18, 2023
168935c
Add documentation
adam2392 Sep 18, 2023
01f2085
Adding posterior
adam2392 Sep 19, 2023
e9d07d8
Fix mesonb uild
adam2392 Sep 19, 2023
1dcd4f6
Fix unit-test
adam2392 Sep 19, 2023
37b1648
Fix docs errors
adam2392 Sep 19, 2023
5f9954d
Merge branch 'main' into might
adam2392 Sep 19, 2023
a28a842
Fix unit-test
adam2392 Sep 19, 2023
b88e12e
Clean up API
adam2392 Sep 19, 2023
579003f
Working clean code
adam2392 Sep 19, 2023
2fa68ce
Working clean code
adam2392 Sep 19, 2023
7fbdd38
Working clean code
adam2392 Sep 19, 2023
918f934
Working clean code
adam2392 Sep 19, 2023
bd02877
Fixed unit-tests
adam2392 Sep 19, 2023
1eb8604
Adding coverage
adam2392 Sep 19, 2023
8c1d15c
Updated example
adam2392 Sep 20, 2023
c40e866
Fix bug
adam2392 Sep 20, 2023
a6b1a0c
Fix example
adam2392 Sep 20, 2023
21f3b22
Fix unit-test
adam2392 Sep 20, 2023
c8af371
Remove unnecessary doc string
adam2392 Sep 20, 2023
fd5a63c
DOC correct result evaluation comment
PSSF23 Sep 20, 2023
aa404b0
Merge branch 'might' of https://github.com/neurodata/scikit-tree into…
adam2392 Sep 20, 2023
eee4cbc
Try redirect
adam2392 Sep 20, 2023
527397d
Try again
adam2392 Sep 20, 2023
fe848a4
Try again
adam2392 Sep 20, 2023
9a05235
Improve the checking inputs of feature importance forests
adam2392 Sep 21, 2023
dddacb1
Fix unit test
adam2392 Sep 21, 2023
33c9a74
Try again
adam2392 Sep 21, 2023
64f2017
Fix docs
adam2392 Sep 21, 2023
b8dc3a2
Fix pvalue sampling
adam2392 Sep 28, 2023
028f17d
Fixed pvalue issue
adam2392 Sep 28, 2023
2f06e76
Cleanup
adam2392 Sep 28, 2023
dc6dd29
Fix docs biuld
adam2392 Sep 28, 2023
dbf079e
Fix unit-test
adam2392 Sep 28, 2023
1c0f66b
Add update reshapes
adam2392 Sep 29, 2023
39aef2a
Add some todos and fixes from quick call w/ sambit/hao
adam2392 Oct 2, 2023
48d889b
set covariate_index to None by default and change self.n_classes
sampan501 Oct 3, 2023
bbb5c7c
Fix a few issues and consolidate todos
adam2392 Oct 3, 2023
563be04
Merge branch 'might' of https://github.com/neurodata/scikit-tree into…
adam2392 Oct 3, 2023
5ad1b44
Fix
adam2392 Oct 3, 2023
80ada68
Add clone to get estimators
adam2392 Oct 3, 2023
ff37740
ENH mark all default tests as MI and correct posterior return parameter
PSSF23 Oct 3, 2023
aed9179
FIX unify all variable names so posteriors are not saved twice
PSSF23 Oct 3, 2023
c716440
Add additional testing
adam2392 Oct 3, 2023
f6cb04b
Fix CI
adam2392 Oct 3, 2023
8df008d
Adding parallelization test
adam2392 Oct 3, 2023
7964d99
FIX remove extra print statememts
PSSF23 Oct 4, 2023
8b5a7d1
Remove numpy nanmean warnings and also bug fix of some code (#133)
adam2392 Oct 4, 2023
3a2279a
Add fixes
adam2392 Oct 4, 2023
37e6643
Merge branch 'might' of https://github.com/neurodata/scikit-tree into…
adam2392 Oct 4, 2023
3a4a4b4
Add parallelization to the tree building and predicting posteriors
adam2392 Oct 4, 2023
e91060f
ENH add MIGHT example notebook on AUC
PSSF23 Oct 4, 2023
efbd440
Consolidate parallleization
adam2392 Oct 4, 2023
8718b0f
Merge branch 'might' of https://github.com/neurodata/scikit-tree into…
adam2392 Oct 4, 2023
be16e5a
set default for covariate_index in ForestHT test
sampan501 Oct 5, 2023
26b5b5f
Add unit-test for small sample sizes
adam2392 Oct 5, 2023
80a4304
Final commit
adam2392 Oct 5, 2023
60d9c85
Release v0.2
adam2392 Oct 5, 2023
3 changes: 2 additions & 1 deletion .codespellignore
@@ -1,4 +1,5 @@
raison
nd
parth
ot
ot
fpr
7 changes: 4 additions & 3 deletions .github/workflows/circle_artifacts.yml
@@ -4,20 +4,21 @@ on: [status]
# Restrict the permissions granted to the use of secrets.GITHUB_TOKEN in this
# github actions workflow:
# https://docs.github.com/en/actions/security-guides/automatic-token-authentication
permissions:
statuses: write
permissions: read-all

jobs:
circleci_artifacts_redirector_job:
runs-on: ubuntu-20.04
if: "github.repository == 'neurodata/scikit-tree' && github.event.context == 'ci/circleci: build_docs'"
permissions:
statuses: write
name: Run CircleCI artifacts redirector
steps:
- name: GitHub Action step
uses: larsoner/circleci-artifacts-redirector-action@master
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
api-token: ${{ secrets.CIRCLE_TOKEN }}
api-token: ${{ secrets.CIRCLECI_TOKEN }}
artifact-path: 0/dev/index.html
circleci-jobs: build_docs
job-title: Check the rendered docs here!
113 changes: 110 additions & 3 deletions .github/workflows/main.yml
@@ -116,7 +116,115 @@ jobs:
./spin --help
./spin coverage --help
./spin test --help
./spin coverage
./spin test

- name: debug
run: |
ls $PWD/build-install/usr/lib/python${{matrix.python-version}}/site-packages/
echo "Okay..."
ls $PWD/build
ls ./

- name: Save build
uses: actions/upload-artifact@v3
with:
name: sktree-build
path: $PWD/build

build_and_test_slow:
name: Slow Meson build ${{ matrix.os }} - py${{ matrix.python-version }}
timeout-minutes: 20
needs: [build_and_test]
strategy:
fail-fast: false
matrix:
os: [ubuntu-22.04]
python-version: ["3.11"]
poetry-version: [1.5.0]
runs-on: ${{ matrix.os }}
defaults:
run:
shell: bash
env:
# to make sure coverage/test command builds cleanly
FORCE_SUBMODULE: True
steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Setup Python ${{ matrix.python-version }}
uses: actions/setup-python@v4.6.1
with:
python-version: ${{ matrix.python-version }}
architecture: "x64"
cache: "pip"
cache-dependency-path: "requirements.txt"

- name: show-gcc
run: |
gcc --version

- name: Install Ccache for MacOSX
if: ${{ matrix.os == 'macos-latest'}}
run: |
brew install ccache

- name: Install packages for Ubuntu
if: ${{ matrix.os == 'ubuntu-22.04'}}
run: |
sudo apt-get update
sudo apt-get install -y libopenblas-dev libatlas-base-dev liblapack-dev gfortran libgmp-dev libmpfr-dev libsuitesparse-dev ccache libmpc-dev

- name: Install Python packages
run: |
python -m pip install -r build_requirements.txt
python -m pip install spin
python -m pip install -r test_requirements.txt

- name: Prepare compiler cache
id: prep-ccache
shell: bash
run: |
mkdir -p "${CCACHE_DIR}"
echo "dir=$CCACHE_DIR" >> $GITHUB_OUTPUT
NOW=$(date -u +"%F-%T")
echo "timestamp=${NOW}" >> $GITHUB_OUTPUT

- name: Setup compiler cache
uses: actions/cache@v3
id: cache-ccachev1
# Reference: https://docs.github.com/en/actions/guides/caching-dependencies-to-speed-up-workflows#matching-a-cache-key
# NOTE: The caching strategy is modeled in a way that it will always have a unique cache key for each workflow run
# (even if the same workflow is run multiple times). The restore keys are not unique and for a partial match, they will
# return the most recently created cache entry, according to the GitHub Action Docs.
with:
path: ${{ steps.prep-ccache.outputs.dir }}
# Restores ccache from either a previous build on this branch or on main
key: ${{ github.workflow }}-${{ matrix.python-version }}-ccache-linux-${{ steps.prep-ccache.outputs.timestamp }}
# This evaluates to `Linux Tests-3.9-ccache-linux-` which is not unique. As the CI matrix is expanded, this will
# need to be updated to be unique so that the cache is not restored from a different job altogether.
restore-keys: |
${{ github.workflow }}-${{ matrix.python-version }}-ccache-linux-

- name: Setup build and install scikit-tree
run: |
./spin build -j 2 --forcesubmodule

- name: Ccache performance
shell: bash -l {0}
run: ccache -s

- name: build-path
run: |
echo "$PWD/build-install/"
export INSTALLED_PATH=$PWD/build-install/usr/lib/python${{matrix.python-version}}/site-packages

- name: Run unit tests and coverage
run: |
./spin --help
./spin coverage --help
./spin test --help
./spin coverage -k "slowtest"
cp $PWD/build-install/usr/lib/python${{matrix.python-version}}/site-packages/coverage.xml ./coverage.xml

- name: debug
@@ -127,7 +235,6 @@ jobs:
ls ./

- name: Upload coverage stats to codecov
if: ${{ matrix.os == 'ubuntu-22.04' && matrix.python-version == '3.10'}}
uses: codecov/codecov-action@v3
with:
# python spin goes into the INSTALLED path in order to run pytest
@@ -146,7 +253,7 @@ jobs:
release:
name: Release
runs-on: ubuntu-latest
needs: [build_and_test]
needs: [build_and_test_slow]
if: startsWith(github.ref, 'refs/tags/')
steps:
- name: Checkout repository
51 changes: 19 additions & 32 deletions .spin/cmds.py
@@ -1,5 +1,4 @@
import os
import shutil
import subprocess
import sys

@@ -13,39 +12,27 @@ def get_git_revision_hash(submodule) -> str:


@click.command()
@click.option("--build-dir", default="build", help="Build directory; default is `$PWD/build`")
@click.option("--clean", is_flag=True, help="Clean previously built docs before building")
@click.option("--noplot", is_flag=True, help="Build docs without plots")
@click.argument("slowtest", default=True)
@click.pass_context
def docs(ctx, build_dir, clean=False, noplot=False):
"""📖 Build documentation"""
if clean:
doc_dir = "./docs/_build"
if os.path.isdir(doc_dir):
print(f"Removing `{doc_dir}`")
shutil.rmtree(doc_dir)

site_path = meson._get_site_packages()
if site_path is None:
print("No built scikit-tree found; run `./spin build` first.")
sys.exit(1)

util.run(["pip", "install", "-q", "-r", "doc_requirements.txt"])

ctx.invoke(meson.docs)
# os.environ["SPHINXOPTS"] = "-W"
# os.environ["PYTHONPATH"] = f'{site_path}{os.sep}:{os.environ.get("PYTHONPATH", "")}'
# if noplot:
# util.run(["make", "-C", "docs", "clean", "html-noplot"], replace=True)
# else:
# util.run(["make", "-C", "docs", "clean", "html"], replace=True)


@click.command()
@click.pass_context
def coverage(ctx):
def coverage(ctx, slowtest):
"""📊 Generate coverage report"""
pytest_args = ("-o", "python_functions=test_*", "sktree", "--cov=sktree", "--cov-report=xml")
if slowtest:
pytest_args = (
"-o",
"python_functions=test_*",
"sktree",
"--cov=sktree",
"--cov-report=xml",
"-k slowtest",
)
else:
pytest_args = (
"-o",
"python_functions=test_*",
"sktree",
"--cov=sktree",
"--cov-report=xml",
)
ctx.invoke(meson.test, pytest_args=pytest_args)
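
The two branches in the `coverage` command differ only in the trailing `-k slowtest` filter. As a minimal sketch (not the code merged in this PR), the shared arguments could be built once and the filter appended conditionally. `build_pytest_args` is a hypothetical helper name, and passing `-k` and `slowtest` as two separate tokens is an assumption about how one might prefer to hand them to pytest.

```python
def build_pytest_args(slowtest: bool) -> tuple:
    """Return pytest arguments for a coverage run (hypothetical helper)."""
    # Arguments shared by both branches of the diff.
    args = (
        "-o",
        "python_functions=test_*",
        "sktree",
        "--cov=sktree",
        "--cov-report=xml",
    )
    if slowtest:
        # Assumption: pass the keyword filter as two tokens rather than one string.
        args += ("-k", "slowtest")
    return args


if __name__ == "__main__":
    print(build_pytest_args(True))
    print(build_pytest_args(False))
```

The resulting tuple could then be passed on as `pytest_args`, in the same way the diff passes its own tuple to `ctx.invoke(meson.test, ...)`.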


14 changes: 8 additions & 6 deletions README.md
@@ -13,19 +13,14 @@ scikit-tree is a scikit-learn compatible API for building state-of-the-art decis

Tree-models have withstood the test of time, and are consistently used for modern-day data science and machine learning applications. They especially perform well when there are limited samples for a problem and are flexible learners that can be applied to a wide variety of different settings, such as tabular, images, time-series, genomics, EEG data and more.

We welcome contributions for modern tree-based algorithms. We use Cython to achieve fast C/C++ speeds, while abiding by a scikit-learn compatible (tested) API. Moreover, our Cython internals are easily extensible because they follow the internal Cython API of scikit-learn as well.

Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a fork of scikit-learn at https://github.com/neurodata/scikit-learn when
extending the decision tree model API of scikit-learn. Specifically, we extend the Python and Cython API of the tree submodule in scikit-learn in our submodule, so we can introduce the tree models housed in this package. Thus these extend the functionality of decision-tree based models in a way that is not possible yet in scikit-learn itself. As one example, we introduce an abstract API to allow users to implement their own oblique splits. Our plan in the future is to benchmark these functionalities and introduce them upstream to scikit-learn where applicable and inclusion criterion are met.

Documentation
=============

See here for the documentation for our dev version: https://docs.neurodata.io/scikit-tree/dev/index.html

Why oblique trees and why trees beyond those in scikit-learn?
=============================================================
In 2001, Leo Breiman proposed two types of Random Forests. One was known as ``Forest-RI``, which is the axis-aligned traditional random forest. One was known as ``Forest-RC``, which is the random oblique linear combinations random forest. This leveraged random combinations of features to perform splits. [MORF](1) builds upon ``Forest-RC`` by proposing additional functions to combine features. Other modern tree variants such as Canonical Correlation Forests (CCF), or unsupervised random forests are also important at solving real-world problems using robust decision tree models.
In 2001, Leo Breiman proposed two types of Random Forests. One was known as ``Forest-RI``, which is the axis-aligned traditional random forest. One was known as ``Forest-RC``, which is the random oblique linear combinations random forest. This leveraged random combinations of features to perform splits. [MORF](1) builds upon ``Forest-RC`` by proposing additional functions to combine features. Other modern tree variants such as Canonical Correlation Forests (CCF), Extended Isolation Forests, Quantile Forests, or unsupervised random forests are also important at solving real-world problems using robust decision tree models.
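
To make the distinction concrete, here is a minimal sketch of the two split types, written in plain Python and not taken from scikit-tree. An axis-aligned split thresholds a single feature, while an oblique split thresholds a random linear combination of features. The function names, the {-1, +1} weights, and the number of combined features are illustrative assumptions.

```python
import random


def axis_aligned_split(x, feature, threshold):
    """Forest-RI style: send the sample left if one feature is below the threshold."""
    return x[feature] <= threshold


def random_projection(n_features, n_combined, rng):
    """Draw sparse weights: n_combined features get a weight of -1 or +1, the rest 0."""
    weights = [0.0] * n_features
    for idx in rng.sample(range(n_features), n_combined):
        weights[idx] = rng.choice([-1.0, 1.0])
    return weights


def oblique_split(x, weights, threshold):
    """Forest-RC style: threshold a linear combination of features."""
    projected = sum(w * v for w, v in zip(weights, x))
    return projected <= threshold


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [0.5, 2.0, -1.0, 3.0]
    weights = random_projection(len(sample), n_combined=2, rng=rng)
    print("axis-aligned:", axis_aligned_split(sample, feature=1, threshold=1.0))
    print("oblique:", oblique_split(sample, weights, threshold=0.0))
```

An axis-aligned split is the special case in which the weight vector has exactly one nonzero entry, equal to +1.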

Installation
============
@@ -105,6 +100,13 @@ Alternatively, you can use editable installs

pip install --no-build-isolation --editable .

Development
===========
We welcome contributions for modern tree-based algorithms. We use Cython to achieve fast C/C++ speeds, while abiding by a scikit-learn compatible (tested) API. Moreover, our Cython internals are easily extensible because they follow the internal Cython API of scikit-learn as well.

Due to the current state of scikit-learn's internal Cython code for trees, we have to instead leverage a fork of scikit-learn at https://github.com/neurodata/scikit-learn when
extending the decision tree model API of scikit-learn. Specifically, we extend the Python and Cython API of the tree submodule in scikit-learn in our submodule, so we can introduce the tree models housed in this package. Thus these extend the functionality of decision-tree based models in a way that is not possible yet in scikit-learn itself. As one example, we introduce an abstract API to allow users to implement their own oblique splits. Our plan in the future is to benchmark these functionalities and introduce them upstream to scikit-learn where applicable and inclusion criterion are met.

References
==========
[1]: [`Li, Adam, et al. "Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks" SIAM Journal on Mathematics of Data Science, 5(1), 77-96, 2023`](https://doi.org/10.1137/21M1449117)