[FIX] Normalize data to zero mean and unit variance before dimension estimation #636
Conversation
Hi @notZaki. Thank you for noticing this bug. @eurunuela and I are actually working on moving the MA-PCA code out of tedana and into another repository, so that folks can use it without having to install tedana. We're just waiting on licensing info from the GIFT devs before we can make that repository public. In the meantime, I would recommend changing the data that …

Sounds good. In that case, I'll hold off on this PR until the new repository becomes public. The MATLAB code mentions GPLv2 at the top of the script, but the year is listed as 2003-2009, so it's probably better that you're confirming with the devs.
I re-opened this PR because there's something I needed to check with the CI.
Force-pushed from ddcb82d to 9b72780 (compare).
Codecov Report

```
@@            Coverage Diff             @@
##           master     #636      +/-   ##
==========================================
- Coverage   93.64%   93.64%   -0.01%
==========================================
  Files          26       26
  Lines        2030     2029       -1
==========================================
- Hits         1901     1900       -1
  Misses        129      129
==========================================
```

Continue to review the full report at Codecov.
I'm sorry I haven't responded before. We wrote this code in November 2019 and it's been some time, so I can't quite remember the thinking behind some of the decisions made during the hackathon. I don't know why we added line 483. In the original GIFT code, they do z-score the data before estimating the number of components to keep, so my guess is we were testing how the z-scoring affected the decomposition and forgot to remove that line. Now, since we're working on …

Edit: Good catch @notZaki! Thanks!
@eurunuela now that you bring that up, I actually do remember that from when I was reviewing the code, because I'm the one who asked you to put in that TODO. So that's why it's there.
I guess jetlag happened and we just forgot we had to do that. Sorry about that. Next steps before we merge this PR should be: …
Would you like to do that @notZaki? Or would you prefer I push changes to this PR?

Edit: The second point is mainly for the …

Edit 2: We could simply z-score in time without the if statement.
@eurunuela I'd open an issue in the mapca repo cross-referencing this one so we don't forget a second time ;-)
Now that I think about it, I'm starting to remember why we used …
Yes, I also remember that.
I've been looking into the different normalization approaches and how they differ. I've checked the one used in GIFT, the sklearn scaler we added during the hackathon, and the z-score applied to the temporal dimension. Here's an overview of what I've found. I will use this dummy matrix to make the comparisons:

```python
array([[4, 1, 8, 8],
       [7, 3, 7, 5]])
```

with dimensions 2 (space) by 4 (time).

**GIFT**

This is how GIFT normalizes variance (see here):

```matlab
for n = 1:size(data, 2)
    data(:, n) = detrend(data(:, n), 0) ./ std(data(:, n));
end
```

This normalization returns the following matrix:

```python
array([[-0.70710678, -0.70710678,  0.70710678,  0.70710678],
       [ 0.70710678,  0.70710678, -0.70710678, -0.70710678]])
```

If you try to replicate this in Python, you will see that the std functions return different values. To compute the std like MATLAB does, you'd need to add the `ddof=1` argument.

**Sklearn scaler**

This is how we implemented it during the hackathon for maPCA (see here):

```python
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data)
```

And it returns:

```python
array([[-1., -1.,  1.,  1.],
       [ 1.,  1., -1., -1.]])
```

**Z-score in temporal dimension**

This is how tedana does the z-score in the temporal dimension (see here):

```python
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T
```

And it returns:

```python
array([[-0.42409446, -1.44192118,  0.93300782,  0.93300782],
       [ 0.90453403, -1.50755672,  0.90453403, -0.30151134]])
```

**Conclusion**

It is clear that the three approaches yield very different results. I've gone through the paper again, and having zero mean and unit variance in the temporal dimension is what makes sense. Mathematically speaking, we're interested in calculating the entropy rate of a Gaussian random process x[n], n = 1, 2, ..., N, where N is the number of samples, E{x[n]} = 0, and E{x^2[n]} = 1. To me, z-scoring in the temporal domain makes the most sense and is what I'd do (and that's what has been suggested so far in this PR). However, I wanted to make sure we all see exactly what the different approaches are doing.
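For reproducibility, here is a self-contained sketch of the three normalizations on the same dummy matrix. This is illustrative only: it assumes numpy and scikit-learn are installed, and the variable names (`gift`, `scaled`, `data_z`) are mine, not tedana's.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 2 (space) x 4 (time) dummy matrix from the comment above.
data = np.array([[4, 1, 8, 8],
                 [7, 3, 7, 5]], dtype=float)

# GIFT-style: demean each column, then divide by the sample standard
# deviation (MATLAB's std normalizes by N-1, i.e. ddof=1 in numpy).
gift = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Hackathon version: StandardScaler also works column-wise, but uses
# the population standard deviation (ddof=0).
scaled = StandardScaler(with_mean=True, with_std=True).fit_transform(data)

# tedana-style z-score along the temporal dimension (row-wise here).
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T

print(gift)    # +/- 0.7071...
print(scaled)  # +/- 1.0
print(data_z)  # rows now have zero mean and unit variance
```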
Thanks for sharing the results @eurunuela. I think the difference between GIFT and StandardScaler will become negligible with a larger matrix. They seem very different on the example data because the first dimension (space) only has two values, so the sample and population standard deviations produce very different results. If z-scoring in the temporal domain is indeed what makes sense, then perhaps this can be relayed to the GIFT group to confirm whether this was a bug or intentional. This would be more relevant for the mapca repo because it would be diverging from GIFT.
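For intuition on why the gap shrinks, a tiny numpy check (my own illustration, not from the thread): the two estimators differ by a factor of sqrt(N / (N - 1)), which is sqrt(2) ≈ 1.41 for the two-row toy matrix but essentially 1 for realistic N.

```python
import numpy as np

rng = np.random.default_rng(0)

for n_rows in (2, 10, 20000):
    x = rng.standard_normal((n_rows, 4))
    # Sample std (MATLAB/GIFT, ddof=1) vs population std (sklearn, ddof=0).
    ratio = x.std(axis=0, ddof=1) / x.std(axis=0, ddof=0)
    print(n_rows, ratio[0])  # -> sqrt(n_rows / (n_rows - 1))
```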
@eurunuela would you mind quickly opening an issue explaining briefly what …
You're right. I didn't think of it, as I was just trying to keep the demonstration simple. I've just tested the same commands with a bigger matrix (20000 x 100) and the results between GIFT and sklearn are very similar. The biggest difference is on the order of …
There will be no need for that. I've checked that the sklearn scaler z-scores in the temporal domain, so we're safe.
I thought that the earlier results were showing that the sklearn scaler is z-scoring in the spatial domain instead of the temporal domain.
I've looked into that in more depth (see #655). The sklearn scaler appears to be z-scoring in both the spatial and temporal domains.
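One quick way to see what StandardScaler actually guarantees (an illustrative snippet with assumed shapes, not tedana code): it standardizes exactly along axis 0, so any normalization observed along the other axis is approximate and data-dependent.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 50))
scaled = StandardScaler().fit_transform(x)

# Columns (axis 0) are exactly standardized:
print(np.allclose(scaled.mean(axis=0), 0.0))  # True
print(np.allclose(scaled.std(axis=0), 1.0))   # True

# Rows (axis 1) are only approximately standardized:
print(np.ptp(scaled.mean(axis=1)), np.ptp(scaled.std(axis=1)))
```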
Thank you @notZaki! I think we should have this PR in our next release to ensure both normalization steps are done.
Thanks everyone for digging into this! Just to be clear: is this PR ready for review?
I think it's ready to get merged, so yes.
@notZaki could you please confirm by switching this PR from "Draft" to "Ready for Review"?
Revisiting this: if the next release is scheduled for next week, then some kind of change should be made, because `mdl` is the default option and I think everyone agrees that the current version (no normalization) is not ideal. Options are:

- Option A: z-score in the temporal dimension, diverging from GIFT.
- Option B: mimic GIFT's normalization.

I am leaning towards option B because I think it's safer to follow an established library (GIFT), but I'm curious what everyone else's thoughts are.
Why don't we go with option B (safest, and it will select more components) and study option A on the …
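For concreteness, option B amounts to something like the following helper (a sketch under my own naming; `normalize_like_gift` is hypothetical, not an existing tedana function):

```python
import numpy as np

def normalize_like_gift(data):
    """Demean each column and divide by its sample standard deviation,
    mirroring GIFT's detrend(data(:, n), 0) ./ std(data(:, n))."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
```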
Thank you @notZaki
Yes, I think we should! @notZaki had made a great point re: which to use, and I wanted to drag out my old Linear Algebra text to confirm. But I can commit to doing that by Wednesday, which should still leave us enough time to get it merged before cutting the release on Friday!
Sounds great! Right now, the PR mimics GIFT. I will have a look at the papers too, just to make sure.
Thanks @notZaki! 🚀
References #653.

Input data is normalized prior to dimension estimation in the GIFT/MATLAB implementation, but not in `ma_pca`. There is a normalized array `data_z`, but it isn't used as the input to `ma_pca`:

tedana/tedana/decomposition/pca.py, lines 186 to 193 in 45e5ad7

Not sure if this was intentional, but this PR normalizes the input in `ma_pca`.
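The shape of the fix, as described above, is simply to pass the already-computed normalized array into `ma_pca`. A hedged sketch follows: the variable names echo the linked pca.py snippet, but the exact call signature of `ma_pca` is assumed here, not quoted from tedana.

```python
# Z-score each voxel's time series (zero mean, unit variance in time).
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T

# Previously ma_pca received the raw data; the fix feeds it data_z.
u, s, varex_norm, v = ma_pca(data_z, mask, criteria="mdl")  # signature assumed
```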