Add fit, fit_transform, transform, predict, and score methods #119

Merged
arokem merged 31 commits into main
Jun 23, 2022

Conversation

richford
Collaborator

...to AFQDataset

Resolves #118

@arokem
Collaborator

arokem commented Jun 19, 2022

Overall, LGTM. The test failures probably have to do with API changes in new sklearn versions.

I wonder whether we'd like to further automate train/test(/validation?) splitting in the imputation scenarios. Also wondering how this would play with sklearn Pipeline objects.

@richford
Collaborator Author

I don't think this will play well with the pipeline objects since the API is different. Whereas sklearn fit, transform, etc. expect X, y = None as input, these methods take a model parameter. That is, the sklearn API brings data to a model, whereas our API brings a model to the dataset.
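The difference between the two calling conventions can be sketched with toy classes. This is only an illustration: `MeanImputer` and `ToyDataset` are hypothetical stand-ins written for this example, not sklearn's `SimpleImputer` or the real `AFQDataset`, and the method bodies are assumptions about the general shape of the idea rather than the code in this PR.

```python
class MeanImputer:
    """Toy stand-in for an sklearn-style transformer (hypothetical)."""

    def fit(self, X, y=None):
        # Learn one mean per column, ignoring missing (None) entries.
        n_cols = len(X[0])
        self.means_ = []
        for j in range(n_cols):
            observed = [row[j] for row in X if row[j] is not None]
            self.means_.append(sum(observed) / len(observed))
        return self

    def transform(self, X):
        # Replace each missing entry with the learned column mean.
        return [
            [self.means_[j] if v is None else v for j, v in enumerate(row)]
            for row in X
        ]


class ToyDataset:
    """Hypothetical, stripped-down dataset; not the real AFQDataset."""

    def __init__(self, X, y=None):
        self.X = X
        self.y = y

    def fit(self, model):
        # The dataset hands its own X and y to the model.
        return model.fit(self.X, self.y)

    def transform(self, model):
        # Return a new dataset rather than a bare array.
        return ToyDataset(model.transform(self.X), self.y)


X = [[1.0, None], [3.0, 4.0], [None, 8.0]]

# sklearn convention: bring the data to the model.
filled = MeanImputer().fit(X).transform(X)

# Convention discussed here: bring the model to the dataset.
dataset = ToyDataset(X)
imputer = dataset.fit(MeanImputer())
dataset_filled = dataset.transform(imputer)
```

Because the dataset-side methods take a model rather than `X, y`, an object like this cannot be dropped into a slot that expects an estimator, which is the Pipeline concern raised above.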

For train/test split, I was thinking of something like

dataset = AFQDataset.from_files(...)
dataset_train, dataset_test = train_test_split(dataset)
imputer = SimpleImputer()
imputer = dataset_train.fit(imputer)
dataset_train = dataset_train.transform(imputer)
dataset_test = dataset_test.transform(imputer)

Since we can already train/test split the datasets themselves.

But I'm certainly open to other suggestions for this workflow.

@arokem
Collaborator

arokem commented Jun 19, 2022

This all makes sense. We'd still need to implement our own train_test_split that wraps around sklearn's to mediate the second line in this code example, right? Do you want to add it to this PR?

@richford
Collaborator Author

I'm not sure I understand. The second line works as is right now. Do you mean that I should write a wrapper function that implements all of lines 2-6 in my example?

@arokem
Collaborator

arokem commented Jun 19, 2022

OK OK. I think that I understand. The AFQDataset ducktypes the array interface, so sklearn's train_test_split works without the need for anything else. All good. So, what else needs to be done here, apart from updating the sklearn API?

@richford
Collaborator Author

Yeah, I need to do that. Also, I'm wondering what should be returned by the predict method. In the sklearn API, it returns y_pred, a numpy array, and that's how I've written it already. But this choice doesn't parallel the design choices that we've already made for transform and fit_transform, which return transformed AFQDataset objects. I think I'm okay with this but just wanted to run this by others explicitly.

Also, I need to add tests.

@richford
Collaborator Author

Ugh, it seems that the sklearn API does not want us to build an object that has both __len__/shape attributes and also a fit method. Inside of train_test_split, sklearn calls its own _num_samples function, which first checks that the input does not have a fit method.

One proposed solution is to prepend model_ to each of the current methods (fit, transform, predict, fit_transform, and score), so that the example above becomes

dataset = AFQDataset.from_files(...)
dataset_train, dataset_test = train_test_split(dataset)
imputer = SimpleImputer()
imputer = dataset_train.model_fit(imputer)
dataset_train = dataset_train.model_transform(imputer)
dataset_test = dataset_test.model_transform(imputer)

I'll make that change now but welcome other suggested solutions.
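The clash and the effect of the rename can be sketched as follows. This is a simplified paraphrase of the guard described above, not sklearn's actual code: `num_samples_sketch`, `WithFit`, and `WithModelFit` are hypothetical names invented for this illustration.

```python
def num_samples_sketch(x):
    """Simplified paraphrase of the guard described above (not sklearn's code)."""
    # Anything with a callable ``fit`` is treated as an estimator and rejected.
    if hasattr(x, "fit") and callable(x.fit):
        raise TypeError(
            f"Expected sequence or array-like, got {type(x).__name__}"
        )
    return len(x)


class WithFit:
    """Hypothetical dataset exposing both __len__ and a fit method."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def fit(self, model):
        return model


class WithModelFit:
    """Same data, but the method is renamed so it no longer looks like an estimator."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def model_fit(self, model):
        return model


rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# The object with ``fit`` is rejected before its length is ever read.
try:
    num_samples_sketch(WithFit(rows))
    rejected = False
except TypeError:
    rejected = True

# After the rename, the same check passes and the length is returned.
n_ok = num_samples_sketch(WithModelFit(rows))
```

The rename avoids the guard without changing what the dataset stores, which is why it is enough to get the split working again.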

@arokem
Collaborator

arokem commented Jun 20, 2022

This API makes sense to me. It's actually sensible that sklearn is preventing us from mimicking the estimator API, because it needs to be clear that AFQDataset is not an estimator. These method names hopefully make it clear that an estimator is needed as input.

Regarding the output of model_predict, I agree that a numpy array is sensible. I'd avoid creating even more functionality for model evaluation, instead directly using functions that take numpy arrays as inputs (e.g., r2_score).

Collaborator

@arokem arokem left a comment

A few small questions/comments

afqinsight/cross_validate.py (outdated, resolved)
afqinsight/datasets.py (outdated, resolved)
afqinsight/datasets.py (outdated, resolved)
afqinsight/datasets.py (outdated, resolved)
afqinsight/datasets.py (outdated, resolved)
afqinsight/datasets.py (outdated, resolved)
afqinsight/tests/test_datasets.py (resolved)
doc/conf.py (resolved)
Co-authored-by: Ariel Rokem <arokem@gmail.com>
@richford richford changed the title from "WIP: Add fit, fit_transform, transform, predict, and score methods" to "Add fit, fit_transform, transform, predict, and score methods" Jun 21, 2022
@arokem
Collaborator

arokem commented Jun 22, 2022

I'm running into some version incompatibilities between sklearn and skopt.

Running with sklearn 0.23.2 (which is what I had in the env when I fetched this branch), I get:

In [1]: run plot_hbn_site_profiles.py
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/source/AFQ-Insight/examples/plot_hbn_site_profiles.py in <module>
     34 import numpy as np
     35 
---> 36 from afqinsight import AFQDataset
     37 from afqinsight.plot import plot_tract_profiles
     38 from neurocombat_sklearn import CombatModel

~/source/AFQ-Insight/afqinsight/__init__.py in <module>
      2 from . import datasets  # noqa
      3 from . import utils  # noqa
----> 4 from .cross_validate import *  # noqa
      5 from .datasets import *  # noqa
      6 from .pipeline import *  # noqa

~/source/AFQ-Insight/afqinsight/cross_validate.py in <module>
     13 from sklearn.metrics._scorer import _check_multimetric_scoring
     14 from sklearn.model_selection._split import check_cv
---> 15 from sklearn.model_selection._validation import (
     16     _aggregate_score_dicts,
     17     _fit_and_score,

ImportError: cannot import name '_normalize_score_results' from 'sklearn.model_selection._validation' (/Users/arokem/miniconda3/envs/afqinsight/lib/python3.8/site-packages/sklearn/model_selection/_validation.py)

Then, after installing AFQ-Insight with pip install -e ., which upgrades sklearn to 1.1.1, I get:

In [1]: run plot_hbn_site_profiles.py
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/source/AFQ-Insight/examples/plot_hbn_site_profiles.py in <module>
     34 import numpy as np
     35 
---> 36 from afqinsight import AFQDataset
     37 from afqinsight.plot import plot_tract_profiles
     38 from neurocombat_sklearn import CombatModel

~/source/AFQ-Insight/afqinsight/__init__.py in <module>
      1 """AFQ-Insight is a Python library for statistical learning of tractometry data."""
----> 2 from . import datasets  # noqa
      3 from . import utils  # noqa
      4 from .cross_validate import *  # noqa
      5 from .datasets import *  # noqa

~/source/AFQ-Insight/afqinsight/datasets.py in <module>
     10 from dipy.utils.optpkg import optional_package
     11 from dipy.utils.tripwire import TripWire
---> 12 from groupyr.transform import GroupAggregator
     13 from sklearn.preprocessing import LabelEncoder
     14 

~/miniconda3/envs/afqinsight/lib/python3.8/site-packages/groupyr/__init__.py in <module>
      7 from . import datasets  # noqa
      8 from . import utils  # noqa
----> 9 from .sgl import *  # noqa
     10 from .logistic import *  # noqa
     11 from ._version import version as __version__  # noqa

~/miniconda3/envs/afqinsight/lib/python3.8/site-packages/groupyr/sgl.py in <module>
      9 from scipy import sparse
     10 from scipy.optimize import root_scalar
---> 11 from skopt import BayesSearchCV
     12 from tqdm.auto import tqdm
     13 

~/miniconda3/envs/afqinsight/lib/python3.8/site-packages/skopt/__init__.py in <module>
     53     from .optimizer import gp_minimize
     54     from .optimizer import Optimizer
---> 55     from .searchcv import BayesSearchCV
     56     from .space import Space
     57     from .utils import dump

~/miniconda3/envs/afqinsight/lib/python3.8/site-packages/skopt/searchcv.py in <module>
     14 from sklearn.model_selection._search import BaseSearchCV
     15 from sklearn.utils import check_random_state
---> 16 from sklearn.utils.fixes import MaskedArray
     17 
     18 from sklearn.utils.validation import indexable, check_is_fitted

ImportError: cannot import name 'MaskedArray' from 'sklearn.utils.fixes' (/Users/arokem/miniconda3/envs/afqinsight/lib/python3.8/site-packages/sklearn/utils/fixes.py)

@richford : what versions of skopt, sklearn (other things?) do you have in your env?
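One way to answer that kind of question reproducibly is to query the installed distribution metadata rather than importing the packages, since the imports themselves are what fail here. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper name, and the list of distributions to check is an assumption based on the packages named in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in distributions:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


# PyPI distribution names: sklearn ships as scikit-learn, skopt as scikit-optimize.
report = installed_versions(["scikit-learn", "scikit-optimize", "groupyr"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

Reading metadata this way works even when a package is installed but cannot be imported, as in the tracebacks above.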

@arokem
Collaborator

arokem commented Jun 22, 2022

OK - installing into a clean env works. I don't exactly understand how, but I now get skopt 0.9.0 (not yet on pypi?? https://pypi.org/project/scikit-opt/#history). Where is this coming from? Is this installed from github in one of the AFQ-Insight dependencies? I don't see it in groupyr. At any rate, should we pin these versions here?

@richford
Collaborator Author

We now require sklearn >= 1.0.0 (see e.g. this line), but you're right that there's no version requirement for skopt. We only get the skopt requirement through groupyr; it's not required in AFQ-Insight.

Actually, I think a good solution would be to use

from dipy.utils.optpkg import optional_package
from dipy.utils.tripwire import TripWire

in groupyr and move skopt to an optional install in groupyr.
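The general optional-dependency pattern behind that suggestion can be sketched with the standard library alone. This is not dipy's implementation of optional_package or TripWire, and it is not what groupyr does today: `optional_import` and `MissingPackage` are hypothetical names used only to show the idea of deferring the failure until the package is actually used.

```python
import importlib


class MissingPackage:
    """Placeholder that raises only when the missing package is actually used."""

    def __init__(self, name):
        self._name = name

    def __getattr__(self, attr):
        # Called only for attributes that do not exist on the placeholder.
        raise ImportError(
            f"'{self._name}' is required for this feature but is not installed"
        )


def optional_import(name):
    """Return (module_or_placeholder, is_available) without failing at import time."""
    try:
        return importlib.import_module(name), True
    except ImportError:
        # Covers both a missing package and one that is installed but broken.
        return MissingPackage(name), False


# skopt may or may not be importable; either way this line does not raise.
skopt, have_skopt = optional_import("skopt")

# A module that is always present, for comparison.
json_mod, have_json = optional_import("json")
```

With this shape, importing the library succeeds even without skopt, and only code paths that touch the placeholder raise an error.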

I think for now, we can merge this (assuming everything else looks good to you) and then open a groupyr issue about skopt inclusion.

Collaborator

@arokem arokem left a comment

A couple of really small things related to the examples. I think that we can punt most of these to another PR and go ahead and merge this, though.

examples/plot_hbn_site_profiles.py (resolved)
examples/plot_hbn_site_profiles.py (outdated, resolved)
examples/demo_afq_dataset.py (resolved)
Collaborator

@arokem arokem left a comment

I wonder what it would take to make neurocombat_sklearn fully sklearn-compatible. I remember looking into this at some point and not having fun.

examples/plot_hbn_site_profiles.py (resolved)
examples/plot_hbn_site_profiles.py (outdated, resolved)
@richford
Collaborator Author

richford commented Jun 22, 2022

I'm not sure what it would take exactly, but I know they'd at least have to deprecate the positional arguments and rename the data parameter to X. Then check_estimator would probably have lots of other complaints as well.

@richford
Collaborator Author

I think this may be good to go. I started a new PR for HBN and downloading issues/enhancements.

This was referenced Jun 23, 2022
@arokem arokem merged commit 5bdfe0b into main Jun 23, 2022
Development

Successfully merging this pull request may close these issues.

Add ability to transform datasets
2 participants