
Support veRSA #92

Open · wants to merge 9 commits into base: development
Conversation

SergeantChris
Collaborator

@SergeantChris SergeantChris commented Jan 28, 2025

The goal of this PR is to support computing RSA on top of voxelwise encoding, implementing the method sometimes referred to as veRSA. While implementing it, I ended up making some other changes as well, to support use cases I considered important and to avoid recomputing transforms.

Altogether, the changes are outlined as follows:

  1. veRSA support: A parameter is added to the encoding functions to optionally create RDMs and compute RSA between predicted and real voxels. For now only the most basic RSA is supported, with Pearson correlation for the RDMs and Spearman correlation for the RSA comparison; this can be extended in the future. When veRSA is enabled, the regression returns a single value instead of a list of coefficients, as the result is no longer in voxel space. Consequently, the return_correlations option is not supported with veRSA.
  2. The existing code recomputed the PCA transform on the model layer features for every ROI, even though the ROIs do not influence this computation (the same random seeds are used for the folds of each ROI, e.g. 42, 43, and 44 by default for 3 folds). Since the PCA is probably the biggest overhead in this evaluation, and the number of ROIs can easily be in the range of 20-40, this was a major bottleneck. In this PR, the ROI loop is moved inside the encoding_metric function, as the innermost loop. Since the fold splitting is likewise not influenced by the subject, the way I see multi-subject ROIs being handled is to give them unique file names, e.g. V1_subj1.npy, and process them as separate ROIs.
  3. In the encode_layer function, activations for all samples were stacked in memory, and for some models with very large feature vectors this no longer fits (even with 64G of RAM). A mem_mode option is therefore added so the user can choose to either stack the features or transform them one by one (they are still stacked up to batch_size to fit the PCA). Also, the last batch was not processed correctly when the number of samples was not exactly divisible by batch_size; this is fixed here.
  4. There is a use case where the user might want to train the cross-validated regressions and then use them to predict voxels in an unseen test set. In this case the regression models of all folds must be saved, as well as the PCA transforms.
  5. (Minor) A shuffle argument is added and integer input is supported in train_test_split; these support training and evaluating on an exact (known) train-test split.
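For readers unfamiliar with veRSA (item 1), the core computation can be sketched as follows. This is a minimal illustration of "Pearson for RDMs, Spearman for RSA", not the PR's actual function names or signatures:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def versa_score(pred_voxels, real_voxels):
    """Compare predicted and measured voxel patterns via RSA.

    RDMs are built as 1 - Pearson correlation between condition
    patterns; the two RDMs are then compared with Spearman correlation.
    Hypothetical helper for illustration only.
    """
    # condition-by-condition dissimilarities (condensed upper triangles)
    rdm_pred = pdist(pred_voxels, metric="correlation")
    rdm_real = pdist(real_voxels, metric="correlation")
    rho, _ = spearmanr(rdm_pred, rdm_real)
    return rho

rng = np.random.default_rng(42)
real = rng.normal(size=(20, 100))          # 20 conditions x 100 voxels
pred = real + 0.1 * rng.normal(size=real.shape)
print(versa_score(pred, real))             # high for good predictions
```

Note that the result is a single scalar per layer/fold, which matches why return_correlations (a per-voxel quantity) cannot be supported in this mode.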
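The last-batch bug and the mem_mode idea from item 3 can be sketched together. The helper name and the mode names are assumptions for illustration; the point is that the final partial batch must be included, and that transforming batch by batch bounds peak memory:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def iter_batches(n_samples, batch_size):
    """Yield (start, end) pairs covering every sample, including the
    final partial batch when batch_size does not divide n_samples
    evenly (the case that was previously dropped)."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 50))              # 70 samples x 50 features
pca = IncrementalPCA(n_components=8).fit(X)

# "saver"-style mode: transform one batch at a time instead of
# stacking the full transformed matrix in memory at once.
parts = [pca.transform(X[s:e]) for s, e in iter_batches(len(X), 32)]
X_low = np.vstack(parts)
print(list(iter_batches(10, 4)))           # [(0, 4), (4, 8), (8, 10)]
print(X_low.shape)                         # (70, 8)
```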

Finally, although I implemented these for both the Linear and Ridge regression functions, I haven't tested the Ridge one, and I am a bit confused about what takes place in encode_layer_ridge. Is this code complete?

Also, TODO: Update user-guide notebooks with new parameters.
and TODO: Re-target this PR on development after the small_fixes PR is merged.

@ToastyDom
Collaborator

Hey, thank you for the PR!

I love the veRSA feature addition - it's a really valuable contribution! The improved computational speed of the encoding functionality is also fantastic. I've tested the code and it works well!

Regarding the encode_layer_ridge function - good point. This is actually a leftover from the linear encoding function, which uses PCA for dimensionality reduction. The ridge regression version simply splits and flattens the activation without PCA. The docstring incorrectly states it's using Incremental PCA. I'll fix this in another push to dev and add averaging across features to make it equivalent to linear encoding.
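To make the flatten-versus-average distinction concrete, here is one plausible reading of the two operations; the activation shape and the averaging axis are my assumptions, not necessarily Net2Brain's exact semantics:

```python
import numpy as np

# Suppose one sample's layer activation is (channels, h, w).
act = np.random.rand(64, 7, 7)

# Ridge path today: flatten to one long vector per sample (no PCA).
flat = act.reshape(-1)                       # shape (3136,)

# Averaging across features: collapse the spatial dims to get a
# much smaller summary vector per sample.
avg = act.reshape(64, -1).mean(axis=1)       # shape (64,)

print(flat.shape, avg.shape)
```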

While replicating some notebooks to test the function, I noticed it only runs when save_model and save_pca are set to true. Otherwise, I get this error:

```
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
```

which probably comes from

```python
pca_trn, pca_tst = encode_layer(trn_Idx, tst_Idx, feat_path, layer_id, avg_across_feat,
                                batch_size, n_components, mem_mode=mem_mode, save_pca=save_pca,
                                save_path=f'{prediction_save_path}/pca.pkl' if save_pca else None)
```

This is probably not planned, right?

As always great contribution, thank you so much!!


Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@SergeantChris
Collaborator Author

SergeantChris commented Feb 4, 2025

Again thanks for the review!
Good catch with the save_model / save_pca False cases, I guess it escaped my attention - I fixed it now.
Also, I updated the notebook documentation (you can check it on the ReviewNB link above).

@SergeantChris SergeantChris changed the base branch from small_fixes to development February 4, 2025 15:17
@ToastyDom
Collaborator

It looks great, thank you so much Christina!

I honestly think it's my inability, but I still struggle with the save_model / save_pca False cases. I am executing the Cognition Academy Dresden Notebook 2.ipynb, which I usually use to test Linear Encoding (because it's quick and has some data), and here the function still only runs when the parameters are set to true. See the error below:

```
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
```

```python
# Relevant traceback:
File ~/Documents/Repositories/Net2Brain/net2brain/evaluations/encoding.py:104, in encode_layer
    if mem_mode == 'saver' or os.path.exists(save_path):
```
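For context, the crash happens because os.path.exists raises TypeError on None. A None-safe check would short-circuit first; this is a sketch of the likely fix, not necessarily the code that lands in the PR:

```python
import os

save_path = None              # what the caller passes when save_pca=False
mem_mode = "performance"

# Original check raises TypeError when save_path is None:
#   if mem_mode == 'saver' or os.path.exists(save_path):
# Checking for None before touching the filesystem avoids it:
if mem_mode == "saver" or (save_path is not None and os.path.exists(save_path)):
    print("stream batches / reuse saved PCA")
else:
    print("fit and apply PCA in memory")
```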

The code is from this section:

```python
# Start Linear Encoding
from net2brain.evaluations.encoding import Linear_Encoding, Ridge_Encoding

# Function to run linear encoding
def run_linear_encoding(models, subjects, hemispheres, rois, n_folds, n_components, batch_size):
    if n_components >= batch_size:
        print("n_components must be smaller than batch_size")
        return

    for subject in subjects:
        for hemisphere in hemispheres:
            for roi in rois:
                roi_file = f'{roi}_{hemisphere}_subj{subject}.npy'
                roi_data_path = os.path.join(roi_data, roi_file)

                config = f"{n_folds}f_{n_components}c_{batch_size}b"

                for model in models:
                    model_name = model + '_feats'

                    # Start Net2Brains Linear Encoding
                    print(os.path.join(current_directory, model_name))
                    print(roi_data_path)

                    Linear_Encoding(
                        feat_path=os.path.join(current_directory, model_name),
                        roi_path=roi_data_path,
                        model_name=f"{model_name}_{config}",
                        trn_tst_split=0.8,
                        n_folds=n_folds,
                        n_components=n_components,
                        batch_size=batch_size,
                        random_state=42,
                        return_correlations=True,
                        save_path=f"Tutorial_LE_Results_Harry/subj{subject}",
                        file_name=f"{model_name}_{roi}_{hemisphere}_{config}",
                        avg_across_feat=True
                    )

                    print("")
                    print(f"Finished running Linear Encoding for subject={subject}, hemisphere={hemisphere}, roi={roi}, model={model_name}")


# Create and display widgets for linear encoding
print(current_directory)
create_linear_encoding_widgets()
```

Am I missing something?

Thanks again for everything!!

@SergeantChris
Collaborator Author

You're right, I missed one case. Sorry about that! Can you check if the notebook runs now?
