Add fit, fit_transform, transform, predict, and score methods #119

richford · 2022-06-18T20:49:16Z

...to AFQDataset

Resolves #118

arokem · 2022-06-19T06:41:06Z

Overall, LGTM. The test failures probably have to do with API changes in new sklearn versions.

I wonder whether we'd like to further automate train/test(/validation?) splitting in the imputation scenarios. Also wondering how this would play with sklearn Pipeline objects.

richford · 2022-06-19T07:27:39Z

I don't think this will play well with the pipeline objects since the API is different. Whereas sklearn fit, transform, etc. expect X, y = None as input, these methods take a model parameter. That is, the sklearn API brings data to a model, whereas our API brings a model to the dataset.

For train/test split, I was thinking of something like

dataset = AFQDataset.from_files(...)
dataset_train, dataset_test = train_test_split(dataset)
imputer = SimpleImputer()
imputer = dataset_train.fit(imputer)
dataset_train = dataset_train.transform(imputer)
dataset_test = dataset_test.transform(imputer)

Since we can already train/test split the datasets themselves.

But I'm certainly open to other suggestions for this workflow.
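
The "model to the dataset" idea above can be sketched with standard-library Python only. This is a minimal illustration, not the AFQDataset implementation: ToyDataset and MeanImputer are hypothetical stand-ins for AFQDataset and an sklearn-style imputer, and the assumption is that transform returns a new dataset rather than modifying the original in place.

```python
import copy
from statistics import mean


class MeanImputer:
    """Hypothetical stand-in for an sklearn-style transformer (fit/transform on X)."""

    def fit(self, X, y=None):
        # Column means computed over the observed (non-None) values only.
        columns = list(zip(*X))
        self.means_ = [mean(v for v in col if v is not None) for col in columns]
        return self

    def transform(self, X):
        # Replace each missing value with the mean of its column.
        return [
            [self.means_[j] if v is None else v for j, v in enumerate(row)]
            for row in X
        ]


class ToyDataset:
    """Hypothetical, simplified stand-in for AFQDataset: it only holds X and y."""

    def __init__(self, X, y=None):
        self.X = X
        self.y = y

    def fit(self, model):
        # Bring the model to the data: the dataset supplies its own X and y.
        return model.fit(self.X, self.y)

    def transform(self, model):
        # Return a new dataset so that the original stays untouched.
        new = copy.deepcopy(self)
        new.X = model.transform(self.X)
        return new


train = ToyDataset([[1.0, None], [3.0, 4.0]])
test = ToyDataset([[None, 8.0]])

imputer = train.fit(MeanImputer())
train_imputed = train.transform(imputer)
test_imputed = test.transform(imputer)  # uses the means learned on train
```

Fitting on the training split and reusing the fitted imputer on the test split keeps test-set statistics out of the imputation.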

arokem · 2022-06-19T07:32:08Z

This all makes sense. We'd still need to implement our own train_test_split that wraps around sklearn's to mediate the second line in this code example, right? Do you want to add it to this PR?

richford · 2022-06-19T07:34:20Z

I'm not sure I understand. The second line works as is right now. Do you mean that I should write a wrapper function that implements all of lines 2-6 in my example?

arokem · 2022-06-19T07:36:53Z

OK OK. I think that I understand. The AFQDataset duck-types the array interface, so sklearn's train_test_split works without the need for anything else. All good. So, what else needs to be done here, apart from updating the sklearn API?

richford · 2022-06-19T07:50:01Z

Yeah, I need to do that. Also, I'm wondering what should be returned by the predict method. In the sklearn API, it returns y_pred, a numpy array. And that's how I've written it already. But this choice doesn't parallel the design choices that we've already made for transform and fit_transform, which return transformed AFQDataset objects. I think I'm okay with this but just wanted to run this by others explicitly.

Also, I need to add tests.

richford · 2022-06-19T10:02:29Z

Ugh, it seems that the sklearn API does not want us to build an object that has both __len__/shape attributes and also a fit method. Inside of train_test_split, sklearn calls its own _num_samples function, which first checks that the input does not have a fit method.

One proposed solution is to prepend model_ to each of the current methods (fit, transform, predict, fit_transform, and score), so that the example above becomes

dataset = AFQDataset.from_files(...)
dataset_train, dataset_test = train_test_split(dataset)
imputer = SimpleImputer()
imputer = dataset_train.model_fit(imputer)
dataset_train = dataset_train.model_transform(imputer)
dataset_test = dataset_test.model_transform(imputer)

I'll make that change now but welcome other suggested solutions.
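
To make the name clash concrete, here is a minimal sketch with no sklearn dependency. The num_samples function below is a simplified stand-in that mirrors the check described above; it is not sklearn's source. DatasetWithFit and DatasetWithModelFit are hypothetical toy classes, not AFQDataset.

```python
def num_samples(x):
    """Simplified stand-in for the length check described above."""
    if hasattr(x, "fit") and callable(x.fit):
        # Anything with a callable ``fit`` is treated as an estimator, not data.
        raise TypeError(f"Expected sequence or array-like, got {type(x)}")
    return len(x)


class DatasetWithFit:
    """Toy dataset that exposes ``fit`` and is therefore mistaken for an estimator."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def fit(self, model):
        return model


class DatasetWithModelFit:
    """Same toy dataset, but the method is renamed with the ``model_`` prefix."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def model_fit(self, model):
        return model


def is_accepted(dataset):
    """Return True if the length check passes, False if it raises TypeError."""
    try:
        num_samples(dataset)
    except TypeError:
        return False
    return True


rejected = is_accepted(DatasetWithFit([1, 2, 3]))       # has ``fit``
accepted = is_accepted(DatasetWithModelFit([1, 2, 3]))  # only has ``model_fit``
```

Because the check only looks for an attribute named fit, renaming the methods is enough to let the dataset pass through as data.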

arokem · 2022-06-20T07:39:32Z

This API makes sense to me. It's actually sensible that sklearn is preventing us from mimicking the estimator API, because it needs to be clear that AFQDataset is not an estimator. These method names hopefully make it clear that an estimator is needed as input.

Regarding the output of model_predict, I agree that a numpy array is sensible. I'd avoid creating even more functionality for model evaluation, instead directly using functions that take numpy arrays as inputs (e.g., r2_score).

arokem

A few small questions/comments

afqinsight/cross_validate.py

afqinsight/datasets.py

afqinsight/tests/test_datasets.py

doc/conf.py

arokem · 2022-06-22T09:56:54Z

I'm running into some version incompatibilities between sklearn and skopt.

Running with sklearn 0.23.2 (which is what I had in the env when I fetched this branch), I get:

In [1]: run plot_hbn_site_profiles.py
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/source/AFQ-Insight/examples/plot_hbn_site_profiles.py in <module>
     34 import numpy as np
     35 
---> 36 from afqinsight import AFQDataset
     37 from afqinsight.plot import plot_tract_profiles
     38 from neurocombat_sklearn import CombatModel

~/source/AFQ-Insight/afqinsight/__init__.py in <module>
      2 from . import datasets  # noqa
      3 from . import utils  # noqa
----> 4 from .cross_validate import *  # noqa
      5 from .datasets import *  # noqa
      6 from .pipeline import *  # noqa

~/source/AFQ-Insight/afqinsight/cross_validate.py in <module>
     13 from sklearn.metrics._scorer import _check_multimetric_scoring
     14 from sklearn.model_selection._split import check_cv
---> 15 from sklearn.model_selection._validation import (
     16     _aggregate_score_dicts,
     17     _fit_and_score,

ImportError: cannot import name '_normalize_score_results' from 'sklearn.model_selection._validation' (/Users/arokem/miniconda3/envs/afqinsight/lib/python3.8/site-packages/sklearn/model_selection/_validation.py)

Then, after installing AFQ-Insight with pip install -e ., which upgrades sklearn to 1.1.1, I get:

In [1]: run plot_hbn_site_profiles.py
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/source/AFQ-Insight/examples/plot_hbn_site_profiles.py in <module>
     34 import numpy as np
     35 
---> 36 from afqinsight import AFQDataset
     37 from afqinsight.plot import plot_tract_profiles
     38 from neurocombat_sklearn import CombatModel

~/source/AFQ-Insight/afqinsight/__init__.py in <module>
      1 """AFQ-Insight is a Python library for statistical learning of tractometry data."""
----> 2 from . import datasets  # noqa
      3 from . import utils  # noqa
      4 from .cross_validate import *  # noqa
      5 from .datasets import *  # noqa

~/source/AFQ-Insight/afqinsight/datasets.py in <module>
     10 from dipy.utils.optpkg import optional_package
     11 from dipy.utils.tripwire import TripWire
---> 12 from groupyr.transform import GroupAggregator
     13 from sklearn.preprocessing import LabelEncoder
     14 

~/miniconda3/envs/afqinsight/lib/python3.8/site-packages/groupyr/__init__.py in <module>
      7 from . import datasets  # noqa
      8 from . import utils  # noqa
----> 9 from .sgl import *  # noqa
     10 from .logistic import *  # noqa
     11 from ._version import version as __version__  # noqa

~/miniconda3/envs/afqinsight/lib/python3.8/site-packages/groupyr/sgl.py in <module>
      9 from scipy import sparse
     10 from scipy.optimize import root_scalar
---> 11 from skopt import BayesSearchCV
     12 from tqdm.auto import tqdm
     13 

~/miniconda3/envs/afqinsight/lib/python3.8/site-packages/skopt/__init__.py in <module>
     53     from .optimizer import gp_minimize
     54     from .optimizer import Optimizer
---> 55     from .searchcv import BayesSearchCV
     56     from .space import Space
     57     from .utils import dump

~/miniconda3/envs/afqinsight/lib/python3.8/site-packages/skopt/searchcv.py in <module>
     14 from sklearn.model_selection._search import BaseSearchCV
     15 from sklearn.utils import check_random_state
---> 16 from sklearn.utils.fixes import MaskedArray
     17 
     18 from sklearn.utils.validation import indexable, check_is_fitted

ImportError: cannot import name 'MaskedArray' from 'sklearn.utils.fixes' (/Users/arokem/miniconda3/envs/afqinsight/lib/python3.8/site-packages/sklearn/utils/fixes.py)

@richford : what versions of skopt, sklearn (other things?) do you have in your env?

arokem · 2022-06-22T10:22:59Z

OK - installing into a clean env works. I don't exactly understand how, but I now get skopt 0.9.0 (not yet on pypi?? https://pypi.org/project/scikit-opt/#history). Where is this coming from? Is this installed from github in one of the AFQ-Insight dependencies? I don't see it in groupyr. At any rate, should we pin these versions here?

richford · 2022-06-22T10:33:11Z

We now require sklearn >= 1.0.0 (see e.g. this line), but you're right that there's no version requirement for skopt. We only get the skopt requirement through groupyr, it's not required in AFQ-Insight.

Actually, I think a good solution would be to use

from dipy.utils.optpkg import optional_package
from dipy.utils.tripwire import TripWire

in groupyr and move skopt to an optional install in groupyr.

I think for now, we can merge this (assuming everything else looks good to you) and then open a groupyr issue about skopt inclusion.
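
The optional-dependency pattern suggested above can be sketched with the standard library alone. This is not dipy's implementation of optional_package or TripWire; optional_import, MissingPackage, and OptionalImportError are hypothetical names, and the package name used below is assumed not to be installed.

```python
import importlib


class OptionalImportError(AttributeError):
    """Raised when a missing optional package is actually used.

    Subclassing AttributeError keeps ``hasattr`` checks on the placeholder safe.
    """


class MissingPackage:
    """Hypothetical placeholder that fails only on attribute access."""

    def __init__(self, message):
        self._message = message

    def __getattr__(self, name):
        # Only reached for attributes that are not found the normal way.
        raise OptionalImportError(self._message)


def optional_import(name, message=None):
    """Return (module_or_placeholder, available) without failing at import time."""
    try:
        return importlib.import_module(name), True
    except ImportError:
        message = message or f"{name} is required for this feature; please install it."
        return MissingPackage(message), False


# A package name assumed to be absent, used only for illustration.
fake_pkg, has_fake_pkg = optional_import("hypothetical_bayes_search_pkg_xyz")

# A standard-library module, which is always importable.
json_mod, has_json = optional_import("json")
```

With this pattern the importing package loads even when the optional dependency is absent, and the error surfaces only when code touches the missing package.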

arokem

A couple of really small things related to the examples. I think that we can punt most of these to another PR and go ahead and merge this, though.

examples/plot_hbn_site_profiles.py

examples/demo_afq_dataset.py

arokem

I wonder what it would take to make neurocombat_sklearn fully sklearn-compatible. I remember looking into this at some point and not having fun.

examples/plot_hbn_site_profiles.py

richford · 2022-06-22T16:58:45Z

I'm not sure what it would take exactly but I know they'd at least have to deprecate the positional arguments and rename the data parameter to X. Then I'm sure that check_estimator would probably also have lots of other complaints.

examples/plot_hbn_site_profiles.py

richford · 2022-06-23T08:59:06Z

I think this may be good to go. I started a new PR for HBN and downloading issues/enhancements.

richford added 16 commits June 21, 2022 09:43

Add fit, fit_transform, transform, predict, and score methods to AFQD…

3342ef4

…ataset

DEP: Update groupyr dependency

da8202a

MAINT: Update .zenodo.json to add Jason and John

41b1537

BF: Add model_ prefix to the fit, transform, predict, etc. methods

7c39696

Use allclose in doctest instead of exact values

a0a1d31

Update cross_validate doctest expected result

7dd8099

BF: Okay really correct the doctest values

d567fc6

Use np.allclose in cross_validate doctest

0a96670

Add tests of model fit, transform, predict, etc.

2c6f1b7

Add AFQDataset to doc/api.rst

068226f

Remove y param from model.transform

ebf7c6d

Add a doc example to show manipulation of AFQDataset

a6cacd1

STY: Fix flake error in demo_afq_dataset.py

110fa5d

Add plot_hbn_site_profiles example

3da02c9

Add s3fs to dev dependencies

2378560

BF: Fix input checking for plot_bundle_profiles function

f54c4c8

richford force-pushed the enh/fit-on-dataset branch from 59bfbb7 to f54c4c8 on June 21, 2022 08:44

richford added 3 commits June 21, 2022 09:52

Undo the redundant commits to plot_bundle_profiles

5de5c45

Add test for AFQDataset.copy()

d87fc35

BF: Use equal_nan=True in unit tests for AFQDataset.copy()

ef4e9a2

arokem reviewed Jun 21, 2022

Update afqinsight/datasets.py

2df83e0

Co-authored-by: Ariel Rokem <arokem@gmail.com>

richford and others added 7 commits June 21, 2022 11:31

Update afqinsight/datasets.py

644e6d0

Co-authored-by: Ariel Rokem <arokem@gmail.com>

Update afqinsight/datasets.py

877a5a0

Co-authored-by: Ariel Rokem <arokem@gmail.com>

Update afqinsight/datasets.py

31d1bcc

Co-authored-by: Ariel Rokem <arokem@gmail.com>

Update afqinsight/datasets.py

8b0511b

Co-authored-by: Ariel Rokem <arokem@gmail.com>

Use np.allclose to verify deep copy in AFQDataset.copy unit test

e0a3520

Use ellipses and normalize whitespace in doctest for cross_validate

3354d87

Merge branch 'enh/fit-on-dataset' of github.com:richford/AFQ-Insight …

edce6a5

…into enh/fit-on-dataset

richford changed the title ~~WIP: Add fit, fit_transform, transform, predict, and score methods~~ Add fit, fit_transform, transform, predict, and score methods Jun 21, 2022

arokem reviewed Jun 22, 2022

arokem reviewed Jun 22, 2022

arokem reviewed Jun 22, 2022

DOC: Incorporate @arokem's suggestions into autodoc examples

7bcb14c

arokem reviewed Jun 22, 2022

richford and others added 3 commits June 22, 2022 18:08

STY: Fix flake8 trailing whitespace error

9e22072

Update examples/plot_hbn_site_profiles.py

4356673

Co-authored-by: Ariel Rokem <arokem@gmail.com>

STY: change 'for' to 'to'

8c6c626

This was referenced Jun 23, 2022

Add HBN #121

Merged

Add Python 3.10 to CI workflows #122

Merged

arokem merged commit 5bdfe0b into main Jun 23, 2022
