[r] Add feature selection methods by variance, dispersion, and mean accessibility #169
Open
immanuelazn wants to merge 16 commits into main from ia/feature-selection
Conversation
immanuelazn changed the title from Ia/feature selection to [r] Add feature selection methods by variance, dispersion, and mean accessibility on Dec 15, 2024
immanuelazn force-pushed the ia/feature-selection branch from c8194ac to c50ead2 on January 10, 2025 at 10:48
EDIT: I am also putting the current state of the normalization functions here, coming from ia/normalizations, to remove a little bit of branching between the PRs. The current state of the normalizations is reflected in the first four commits. Nonetheless, a granular diff can be seen in PR #168.
For Feature Selection Methods
Details
Create functions to perform feature selection, as a foundation for LSI and iterative LSI. Each takes the number of features to select and an optional normalization function, applied when the method scores by variance or dispersion. The end result is a tibble with columns names, score, and highly_variable.
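As a concrete illustration of that interface, here is a minimal dense-matrix sketch; the function name, argument names, and the variance scoring are assumptions for illustration, not the PR's actual code (the real functions operate on BPCells matrices).

```r
library(tibble)

# Hypothetical selector sketch (name and signature assumed, not from the PR).
# `mat` is a feature-by-cell matrix with rownames; `normalization` is an
# optional transform applied before scoring.
select_features_by_variance <- function(mat, n_features = 2000,
                                        normalization = NULL) {
  if (!is.null(normalization)) mat <- normalization(mat)
  score <- apply(mat, 1, stats::var)  # per-feature variance score
  tibble(
    names           = rownames(mat),
    score           = score,
    highly_variable = rank(-score, ties.method = "first") <= n_features
  )
}
```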
Tests
Since the interfaces are very similar, I just decided to throw all of them in a loop and test that the tibbles are formed as we expect (see the sketch below). I don't know whether it would make sense to test the actual feature selection logic, because that would just be re-doing the same operations on a dgCMatrix. Otherwise, do you have test ideas with better signal on whether these methods perform as we expect?
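A self-contained sketch of that looped test, assuming testthat; the stand-in scorers and selector below exist only to make the example runnable and are not the PR's code:

```r
library(testthat)
library(tibble)

# Stand-in scoring functions and selector, for illustration only.
score_fns <- list(
  variance   = function(mat) apply(mat, 1, stats::var),
  dispersion = function(mat) apply(mat, 1, stats::var) / rowMeans(mat),
  mean       = function(mat) rowMeans(mat)
)
select_features_stub <- function(mat, n_features, score_fn) {
  score <- score_fn(mat)
  tibble(names = rownames(mat), score = score,
         highly_variable = rank(-score, ties.method = "first") <= n_features)
}

set.seed(42)
mat <- matrix(rpois(2000, lambda = 2), nrow = 20,
              dimnames = list(paste0("feat", 1:20), NULL))
for (name in names(score_fns)) {
  res <- select_features_stub(mat, n_features = 5, score_fns[[name]])
  expect_s3_class(res, "tbl_df")                             # tibble returned
  expect_named(res, c("names", "score", "highly_variable"))  # expected columns
  expect_equal(sum(res$highly_variable), 5)                  # n_features flagged
}
```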
Notes
I have this merging into the normalization branch, but only to allow the normalization logic to work within feature selection. I think it would make sense to merge normalizations into main once that PR is approved, then set the base of this branch to main.
I think the underlying logic is essentially the same between each feature selection method, so I am leaning closer and closer to just putting all of the logic into a single function with an enum param to select a specific feature selection method (see the sketch below). However, this might clash with LSI/iterative LSI unless we are okay with putting a purrr::partial() eval statement directly in the default args. That is, until we develop the option to do implicit partials.
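A sketch of that consolidated design, and of the purrr::partial() default-arg pattern it would force on the orchestrators; all names here are illustrative, not the PR's code:

```r
library(tibble)

# One entry point with an enum-style `method` param (illustrative names).
select_features <- function(mat, n_features = 2000,
                            method = c("variance", "dispersion", "mean"),
                            normalization = NULL) {
  method <- match.arg(method)
  if (!is.null(normalization) && method != "mean") {
    mat <- normalization(mat)  # variance/dispersion score normalized values
  }
  score <- switch(method,
    variance   = apply(mat, 1, stats::var),
    dispersion = apply(mat, 1, stats::var) / rowMeans(mat),
    mean       = rowMeans(mat)
  )
  tibble(names = rownames(mat), score = score,
         highly_variable = rank(-score, ties.method = "first") <= n_features)
}

# The clash: an orchestrator like LSI then needs a pre-bound variant as its
# default argument, i.e. a purrr::partial() eval directly in the signature.
run_lsi <- function(mat,
                    feature_selector = purrr::partial(select_features,
                                                      method = "variance")) {
  features <- feature_selector(mat)
  # ... downstream LSI steps
  features
}
```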
For Normalization Methods
Details (original)
As discussed, we are looking to add normalization functions to allow passing transformations into orchestrator functions like LSI/iterative LSI. I add two functions, normalize_tfidf() and normalize_log() (shown in #167).
There were a few departures from the design doc, in order to provide a little bit more flexibility. In particular, I was thinking about the case where the feature means are not ordered in the same way as the matrix features. To add a little bit of safety, I added some logic for index-invariant matching of feature means to matrix features.
Other than that, I also provided an option to do a traditional log transform via a boolean flag, rather than log1p. As we don't directly expose a log function on the BPCells C++ side, I just added a - 1 to the input of log1p. However, I'm noticing that a zero entry isn't translated into a -Inf like in a dgCMatrix/generic matrix, and is instead a very small number. We might need to evaluate whether this is something we want to support (see the sketch below).
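Two small sketches of those departures, assuming dense matrices for clarity (the helper name is illustrative, not the PR's):

```r
# Index-invariant matching: reorder provided feature means by name to line up
# with the matrix rows, instead of trusting positional order.
match_feature_means <- function(mat, feature_means) {
  stopifnot(all(rownames(mat) %in% names(feature_means)))
  feature_means[rownames(mat)]
}

# Plain log via log1p: log(x) == log1p(x - 1). In base R a zero entry yields
# -Inf, whereas the C++ backend reportedly returns a very small finite number.
log1p(c(0, 1, 10) - 1)  # -Inf 0.0000000 2.3025851
```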
Changes
As a result of a round of PR review, I made a variety of changes. normalize_log() now always uses log1p, divides by colSums prior to multiplying by a scale factor, and uses matrix_stats() for multi-threaded calculation. normalize_tfidf() also implements a logarithm transform. Both functions now describe the specific normalization steps they use as a math equation, which can be viewed within the reference.
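For reference, a dense-matrix sketch of the normalize_log() steps as described above (the real implementation computes column sums via matrix_stats() and operates on BPCells matrices; the scale-factor default here is an assumption):

```r
# Divide each column by its sum, multiply by a scale factor, then log1p.
normalize_log_sketch <- function(mat, scale_factor = 1e4) {
  log1p(t(t(mat) / colSums(mat)) * scale_factor)
}
```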