[r] Add feature selection methods by variance, dispersion, and mean accessibility #169
Open
immanuelazn wants to merge 16 commits into main from ia/feature-selection
Conversation
immanuelazn changed the title from Ia/feature selection to [r] Add feature selection methods by variance, dispersion, and mean accessibility on Dec 15, 2024
immanuelazn force-pushed the ia/feature-selection branch from c8194ac to c50ead2 on January 10, 2025 at 10:48
EDIT: I am also putting the current state of the normalization functions here, coming from ia/normalizations, to remove a little bit of branching between the PRs. The current state of the normalizations is reflected in the first four commits. Nonetheless, a granular diff can be seen in PR #168.
For Feature Selection Methods
Details
Create functions to perform feature selection, as a foundation for LSI and iterative LSI. Each takes the number of features to select and an optional normalization function, applied when the method scores by variance or dispersion. The end result is a tibble with columns names, score, and highly_variable.
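As a concrete illustration of that interface, here is a minimal dense-matrix sketch; the function name, argument names, and the variance scoring are assumptions for illustration, not the PR's actual code (the real functions operate on BPCells matrices).

```r
library(tibble)

# Hypothetical selector sketch (name and signature assumed, not from the PR).
# `mat` is a feature-by-cell matrix with rownames; `normalization` is an
# optional transform applied before scoring.
select_features_by_variance <- function(mat, n_features = 2000,
                                        normalization = NULL) {
  if (!is.null(normalization)) mat <- normalization(mat)
  score <- apply(mat, 1, stats::var)  # per-feature variance score
  tibble(
    names           = rownames(mat),
    score           = score,
    highly_variable = rank(-score, ties.method = "first") <= n_features
  )
}
```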
Tests
Since the interfaces are very similar, I just decided to throw all of them in a loop and test that the tibbles are formed as we expect (see the sketch below). I don't know whether it would make sense to test the actual feature selection logic, because that would just be re-doing the same operations on a dgCMatrix. Otherwise, do you have test ideas with better signal on whether these methods perform as we expect?
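A self-contained sketch of that looped test, assuming testthat; the stand-in scorers and selector below exist only to make the example runnable and are not the PR's code:

```r
library(testthat)
library(tibble)

# Stand-in scoring functions and selector, for illustration only.
score_fns <- list(
  variance   = function(mat) apply(mat, 1, stats::var),
  dispersion = function(mat) apply(mat, 1, stats::var) / rowMeans(mat),
  mean       = function(mat) rowMeans(mat)
)
select_features_stub <- function(mat, n_features, score_fn) {
  score <- score_fn(mat)
  tibble(names = rownames(mat), score = score,
         highly_variable = rank(-score, ties.method = "first") <= n_features)
}

set.seed(42)
mat <- matrix(rpois(2000, lambda = 2), nrow = 20,
              dimnames = list(paste0("feat", 1:20), NULL))
for (name in names(score_fns)) {
  res <- select_features_stub(mat, n_features = 5, score_fns[[name]])
  expect_s3_class(res, "tbl_df")                             # tibble returned
  expect_named(res, c("names", "score", "highly_variable"))  # expected columns
  expect_equal(sum(res$highly_variable), 5)                  # n_features flagged
}
```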
Notes
I have this merging into the normalization branch, but only to allow the normalization logic to work within feature selection. I think it would make sense to merge normalizations into main once that PR is approved, then set the base of this branch to main.
I think the underlying logic is essentially the same between each feature selection method, so I am leaning closer and closer to just putting all of the logic into a single function with an enum param to select a specific feature selection method (see the sketch below). However, this might clash with LSI/iterative LSI unless we are okay with putting a purrr::partial() eval statement directly in the default args. That is, until we develop the option to do implicit partials.
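A sketch of that consolidated design, and of the purrr::partial() default-arg pattern it would force on the orchestrators; all names here are illustrative, not the PR's code:

```r
library(tibble)

# One entry point with an enum-style `method` param (illustrative names).
select_features <- function(mat, n_features = 2000,
                            method = c("variance", "dispersion", "mean"),
                            normalization = NULL) {
  method <- match.arg(method)
  if (!is.null(normalization) && method != "mean") {
    mat <- normalization(mat)  # variance/dispersion score normalized values
  }
  score <- switch(method,
    variance   = apply(mat, 1, stats::var),
    dispersion = apply(mat, 1, stats::var) / rowMeans(mat),
    mean       = rowMeans(mat)
  )
  tibble(names = rownames(mat), score = score,
         highly_variable = rank(-score, ties.method = "first") <= n_features)
}

# The clash: an orchestrator like LSI then needs a pre-bound variant as its
# default argument, i.e. a purrr::partial() eval directly in the signature.
run_lsi <- function(mat,
                    feature_selector = purrr::partial(select_features,
                                                      method = "variance")) {
  features <- feature_selector(mat)
  # ... downstream LSI steps
  features
}
```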
For Normalization Methods
Details (original)
As discussed, we are looking to add normalization functions to allow passing transformations into orchestrator functions like LSI/iterative LSI. I add two functions, normalize_tfidf() and normalize_log() (shown in #167).
There were a few departures from the design doc, in order to provide a little bit more flexibility. In particular, I was thinking about the case where the feature means are not ordered in the same way as the matrix features. To add a little bit of safety, I added some logic for index-invariant matching of feature means to matrix features.
Other than that, I also provided an option to do a traditional log transform via a boolean flag, rather than log1p. As we don't directly expose a log function on the BPCells C++ side, I just added a - 1 to the input of log1p. However, I'm noticing that a zero entry isn't translated into a -Inf like in a dgCMatrix/generic matrix, and is instead a very small number. We might need to evaluate whether this is something we want to support (see the sketch below).
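Two small sketches of those departures, assuming dense matrices for clarity (the helper name is illustrative, not the PR's):

```r
# Index-invariant matching: reorder provided feature means by name to line up
# with the matrix rows, instead of trusting positional order.
match_feature_means <- function(mat, feature_means) {
  stopifnot(all(rownames(mat) %in% names(feature_means)))
  feature_means[rownames(mat)]
}

# Plain log via log1p: log(x) == log1p(x - 1). In base R a zero entry yields
# -Inf, whereas the C++ backend reportedly returns a very small finite number.
log1p(c(0, 1, 10) - 1)  # -Inf 0.0000000 2.3025851
```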
Changes
As a result of a round of PR review, I made a variety of changes. normalize_log() now always uses log1p, divides by colSums prior to multiplying by a scale factor, and uses matrix_stats() for multi-threaded calculation. normalize_tfidf() also implements a logarithm transform. Both functions now describe the specific normalization steps they use as a math equation, which can be viewed within the reference.
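For reference, a dense-matrix sketch of the normalize_log() steps as described above (the real implementation computes column sums via matrix_stats() and operates on BPCells matrices; the scale-factor default here is an assumption):

```r
# Divide each column by its sum, multiply by a scale factor, then log1p.
normalize_log_sketch <- function(mat, scale_factor = 1e4) {
  log1p(t(t(mat) / colSums(mat)) * scale_factor)
}
```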