Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make AFQDataset work with different input formats and make indexable #105

Merged
merged 11 commits into from
Jan 12, 2022

Conversation

richford
Copy link
Collaborator

@richford richford commented Jan 7, 2022

Resolves #98
This PR

  • makes AFQDataset initialization more general so that user can provide X, y, groups, feature_names, group_names, subjects, sessions, classes explicitly on init.
  • create a new static method from_files() that takes all of the filenames that are in the current AFQDataset init method and returns an AFQDataset object with all of the data read in.
  • Make AFQDataset more array-like by adding __getitem__ and __len__ methods and a shape parameter. This means users can do things like dataset[10:20] and also that AFQDatasets can be used as input to sklearn functions like train_test_split.
  • Adds a lot of detail and doctests to AFQDataset's docstring

Note that I chose a different solution to dataset splitting than the one I proposed in #98. I think making the datasets indexable and therefore interoperable with scikit-learn's already existing, performant, and robust model selection routines is a much better solution than rolling our own split method.

This doesn't resolve the need for imputation methods, which we hint at in #98 but I now think that if we still want those, we should open up another issue and PR for that.

@richford richford added enhancement New feature or request documentation labels Jan 7, 2022
@richford richford requested a review from arokem January 7, 2022 17:52
Copy link
Collaborator

@arokem arokem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks great! I think that a sphinx gallery example of using train_test_split with some real data should be helpful. But we can take that on a separate PR.

afqinsight/datasets.py Outdated Show resolved Hide resolved
afqinsight/tests/test_datasets.py Outdated Show resolved Hide resolved
@arokem arokem merged commit 9836dfb into main Jan 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add capabilities to split and impute AFQDataset
2 participants