Should data be normalized along time or voxels? #653
Comments
I think it makes sense to normalize each voxel over time before PCA. I don't think the second normalization (across both voxels and time) is necessary, although I'm not sure if it's a problem. I don't think it makes sense to normalize each time-point over voxels. We had a similar issue in #228 with the PCA output.
I totally agree with @tsalo. Normalizing the time dimension makes sense, but normalizing across voxels doesn't.
In that case, I've updated an older PR to normalize over time rather than over voxels. The current mapca code is (1) not doing any normalization in the beginning and (2) normalizing over the voxel dimension while subsampling. I've also updated the original comment in this issue. I didn't correctly implement one of the strategies. Now the results are more favorable if normalization is done over the time dimension.
@notZaki would you be willing to look into whether the second normalization in strategy A (
The second step shouldn't substantially change the result, because normalizing along rows (or columns, doesn't matter) will also make the global mean close to zero and the global std close to 1. As an example, for the three-echo data in the tests:

```python
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T
data_z.mean()     # = 2e-17
1 - data_z.std()  # = -2e-16
```
Canonically, I think we'd be interested in spatial ICA. This is the form that Kundu describes in motivating ME-ICA in his 2017 NeuroImage paper. So I'd echo others that strategy A is most appropriate!
If the ICA sources are spatial, then I would expect the normalization to be along the spatial dimension, i.e. strategy B. That would also explain why GIFT uses strategy B, since it also does spatial ICA (I think).
Thanks again to everyone for weighing in here! This is something that's definitely important to think through, and I had a chance to gather my thoughts on it. Just to talk through my thinking:

Let's consider the fMRI timeseries to be a t x v matrix, where t is the number of timepoints and v is the number of voxels. With spatial ICA, we assume that we have t observations of the spatial random variables that we want to recover (these variables will be 1 x v vectors, but can be reconstructed into our expected volume). In that case, normalizing each voxel over time should improve convergence by putting the individual voxel timecourses on the same scale. The alternative, normalizing each time point over voxels, would result in each voxel having a "jumpy" timecourse (since its value at each time step would depend not on itself but on all other voxels' values at that same time step).

I'd be a bit surprised if GIFT is in fact doing the second normalization. The authors suggest here that they'd advocate for "intensity normalization", which is a method for normalizing each voxel over time. Admittedly it's an older paper (2010), but I'm still having trouble imagining how to justify normalizing each time point independently. Please let me know if I'm misunderstanding something re: the normalization options under consideration!
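To make the contrast concrete, here is a minimal numpy sketch of the two normalizations on a t x v matrix. The toy data and variable names are invented (not from tedana or GIFT); the point is only that, under per-timepoint normalization, a voxel's normalized values depend on every other voxel at the same timepoint:

```python
import numpy as np

rng = np.random.default_rng(0)
t, v = 100, 500
data = rng.normal(loc=1000, scale=10, size=(t, v))  # t timepoints x v voxels

# Normalize each voxel over time: each column gets zero mean and unit variance.
vox_over_time = (data - data.mean(axis=0)) / data.std(axis=0)

# Normalize each timepoint over voxels: each row gets zero mean and unit variance.
time_over_voxels = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

# Perturb a single voxel at a single timepoint.
data2 = data.copy()
data2[0, 0] += 50.0
vox_over_time2 = (data2 - data2.mean(axis=0)) / data2.std(axis=0)
time_over_voxels2 = (data2 - data2.mean(axis=1, keepdims=True)) / data2.std(axis=1, keepdims=True)

# Per-voxel normalization leaves every other voxel's timecourse unchanged...
print(np.allclose(vox_over_time[:, 1:], vox_over_time2[:, 1:]))      # True
# ...but per-timepoint normalization changes other voxels' values at that timepoint.
print(np.allclose(time_over_voxels[:, 1:], time_over_voxels2[:, 1:]))  # False
```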
I also had some thoughts about this. An important point is that normalizing in space (i.e. making each volume zero-mean and unit-variance) is equivalent to performing global signal regression, since one is actually forcing the global signal to be zero. As far as my long-term memory allows, I think we discussed this during the Tedana Hackathon in DC (@javiergcas @eurunuela) and we decided that this should not be done. I might be wrong. Even so, the procedure still needs some kind of normalization of the features (the input to PCA), and therefore (again, my memory might be wrong) we considered temporal normalization more appropriate.

Btw, @emdupre thanks for the abstract (excellent search engine skills)... it is really helpful, and it is true that the maPCA paper is an old one. The abstract advocates for just demeaning the voxel time series, with no variance normalization. Although I believe in these evaluations, I'm slightly surprised, because PCA is usually applied with feature normalization so that all features have the same variance.
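A small numpy check of that equivalence claim (toy data, not tedana code): z-scoring each volume forces the spatial mean of every volume, and hence the global signal timecourse, to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=1000, scale=10, size=(100, 500))  # t timepoints x v voxels

# Z-score each volume (each timepoint) over voxels.
vol_z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)

# The spatial mean of every volume (the global signal timecourse) is now exactly zero,
# which is also what regressing out the global signal would enforce.
global_signal = vol_z.mean(axis=1)
print(np.allclose(global_signal, 0))  # True
```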
Thank you both @emdupre and @CesarCaballeroGaudes. This whole thing has been driving me nuts for the past month.

What GIFT is doing is taking each time-point and z-scoring the map. Mind that for them

The abstract actually reports the bad test-retest reliability @notZaki has seen when z-scoring the timecourses of each voxel (which, as @CesarCaballeroGaudes mentioned, we thought was the correct thing to do in the hackathon). Now, the abstract also reports that the best approach for test-retest reliability is "intensity normalization", i.e. dividing the timecourses of each voxel by their mean (signal percentage change without removing the mean). However, the abstract does not study the test-retest reliability of z-scoring the maps at each time-point, which, as @CesarCaballeroGaudes mentioned, is equivalent to performing global signal regression. We still do not know why GIFT is doing this.

I agree with @emdupre that normalizing each voxel over time makes more sense, and that's why we thought of implementing this approach during the hackathon.
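For reference, a one-line sketch of the "intensity normalization" described above, next to voxel-wise z-scoring for comparison (toy data; names are illustrative, not from GIFT or tedana):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=1000, scale=10, size=(100, 500))  # t timepoints x v voxels

# "Intensity normalization": divide each voxel's timecourse by its temporal mean
# (percent-signal-change style scaling, without removing the mean).
intensity_norm = data / data.mean(axis=0)

# Z-scoring each voxel's timecourse, for comparison.
voxel_z = (data - data.mean(axis=0)) / data.std(axis=0)
```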
I'm now wondering how much I really understand GIFT's implementation 😅 I also don't have an intuition for why they would perform this kind of GSR-esque normalization, but I think it stands against their own code comment (in L206) which explicitly identifies the normalization as variance normalization. To make sure I wasn't misunderstanding the previously linked OHBM abstract, I went back to the Beckmann and Smith (2004) paper, which introduced variance normalization and describes their motivation as the following:
So, I think we'd be on reasonable ground to move forward with the time-series normalization approach!
Thanks for getting back to the Probabilistic ICA paper. A notable difference is that Beckmann & Smith propose variance normalization, not intensity normalization as suggested in Allen & Calhoun's abstract.
Yes, but my point was that the code comment explicitly identifies the procedure as variance normalization. They describe the normalization methods implemented in GIFT here as:
and performed here. Someone who's more familiar with the GIFT code base -- could you confirm where the
For GIFT, it looks like the default is to

I tried looking into nilearn's CanICA, and it looks like it does temporal normalization during the PCA step, but spatial normalization during ICA. I'll have to dive into it again to make sure I understood it properly.

For my own understanding: would the normalization strategy depend on whether the ICA is spatial or temporal? I assumed it did, so my reasoning was that temporal normalization would be appropriate for temporal ICA, and spatial normalization for spatial ICA. I justified this pairing because I think ICA expects the components to have zero mean, though I don't know if this is necessarily true.

Throwing another hat in the ring: the data could be normalized using only the global mean/std. This would be a uniform transformation, so neither the time series nor the spatial components would have any new "jumpy" changes. A sketch of this option follows below.
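A minimal sketch of that third option, global scalar normalization applied uniformly to the whole matrix (toy data, invented names):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=1000, scale=10, size=(100, 500))  # t timepoints x v voxels

# Normalize with a single global mean and std: every element gets the same affine
# transform, so the shapes of both the timecourses and the spatial maps are preserved.
global_norm = (data - data.mean()) / data.std()
```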
I feel much more well-equipped to speak on CanICA than GIFT 😅 I'll give a brief summary of my understanding below, but I think that if the goal is to match the GIFT implementation, this is sufficiently different that it's unlikely to help us here. Instead, if we're able to find a normalization that gives similar results when running the same data through GIFT and maPCA, I think we can move forward with that solution.

CanICA feeds the voxelwise data directly (without any normalization) into an initial PCA to reduce the data dimensionality across subjects; specifically, to reduce the time dimension. This is done as part of the

The stacked data is normalized in MultiPCA for each component (i.e., a scaling factor

Now we get to CanICA itself. Using the components returned from
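For orientation only, a rough conceptual sketch of the first reduction stage described above: per-subject PCA that shrinks the time dimension, followed by stacking across subjects. This is not nilearn's actual implementation; the dimensions, component counts, and variable names are invented.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: a few subjects, each with t timepoints and v voxels (no normalization applied).
rng = np.random.default_rng(4)
n_subjects, t, v, n_keep = 3, 100, 500, 20

reduced = []
for _ in range(n_subjects):
    subject_data = rng.normal(size=(t, v))
    # PCA over the voxel samples, keeping n_keep components: this shrinks the
    # time-like dimension from t to n_keep for this subject.
    pca = PCA(n_components=n_keep)
    reduced.append(pca.fit_transform(subject_data.T).T)  # shape (n_keep, v)

# Stack the reduced subjects before the group-level decomposition.
stacked = np.vstack(reduced)  # shape (n_subjects * n_keep, v)
```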
I totally agree, and #636 does that right now; i.e. it z-scores the volume for each time-point. We should merge that in before we cut the release. As I previously said (maybe in the PR, idk), we could further study these differences in the maPCA repo, and, why not, it could even be an abstract for OHBM next year.

Re CanICA: thank you for the explanation @emdupre. I will have to read the two referenced papers to learn more about the reasoning behind the 3 different PCAs and their corresponding normalization (or lack of it).
I think I was confused, since #636 only removes a variable assignment and a code comment. So thanks for clarifying!
Lots of decomp options, for sure 😄 So this is definitely something we can/should come back to!
Should we close this and #655 now that #636 is merged?
Adding numbers to what @eurunuela mentioned: I ran GIFT's
Actually, the pre-PR636 branch (no normalization) gave the same result too. Regarding closing this issue: the normalization question affects multiple parts of tedana (maPCA, kundu-PCA, and ICA).
This is great news, as this is what we aimed for during the hackathon. I believe the second normalization step in maPCA is the key one, and that's why the results are still the same after #636 was merged.
I agree that we should keep this issue open. I believe the different normalization steps are worth investigating. Thank you once again @notZaki for bringing all this up.
If #702 gets merged, then this can be closed. The discussion could be resumed in
Summary
Input data is normalized to zero-mean and unit-variance prior to PCA. This normalization can be done along two different directions, and this choice affects the output.
Additional Detail
Strategy A: Normalize along second dimension (time)
The `data` matrix with shape NxT is normalized along the second dimension here: `tedana/tedana/decomposition/pca.py`, lines 186 to 187 in 63070e2.
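As a rough sketch of what that normalization amounts to (it mirrors the snippet quoted later in this thread; the random `data` below is only a stand-in for the real NxT matrix, not tedana's actual input):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1000, scale=10, size=(200, 75))  # N masked voxels x T timepoints

# Strategy A: z-score each voxel's timecourse, i.e. normalize along the time dimension.
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T
```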
Strategy B: Normalize along first dimension (masked voxels)
The MA-PCA code uses `sklearn.preprocessing.StandardScaler`, which is initialized here: `tedana/tedana/decomposition/ma_pca.py`, line 480 in 63070e2.
The scaling performed by this scaler is along the first dimension, so it would be expressed as shown in the sketch below.
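A minimal sketch of that expression, assuming the same kind of stand-in NxT `data` matrix as above; this matches what `StandardScaler` computes column-wise, with statistics taken over the first dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1000, scale=10, size=(200, 75))  # N masked voxels x T timepoints

# Strategy B: z-score each timepoint over voxels, i.e. normalize along the first
# dimension, as sklearn.preprocessing.StandardScaler does for each column.
data_z = (data - data.mean(axis=0)) / data.std(axis=0)
```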
The PCA implementation in sklearn also internally centers the data along this dimension (it removes each feature's mean, although it does not scale to unit variance).
Comparison using the two strategies
These two strategies do not produce the same normalized matrix, and the choice changes the downstream results. As an example, here's what happens on the five-echo test data.

With strategy A:
- MA-PCA estimates 72 components.
- The ICA decomposition fails to converge and the seed has to be changed at least 3 times until convergence.
- Final classification (on a local run): accepted = 28, rejected = 13, ignored = 31.
- On CI: ICA fails to converge after 10 attempts [log].

Potential points for discussion