[FIX] Normalize data to zero mean and unit variance before dimension estimation #636
Conversation
Hi @notZaki. Thank you for noticing this bug. @eurunuela and I are actually working on moving the MA-PCA code out of tedana and into another repository, so that folks can use it without having to install tedana. We're just waiting on licensing info from the GIFT devs before we can make that repository public. In the meantime, I would recommend changing the data that …

Sounds good. In that case, I'll hold off on this PR until the new repository becomes public. The MATLAB code mentions GPLv2 at the top of the script, but the year is listed as 2003-2009, so it's probably better that you're confirming with the devs.
I re-opened this PR because there's something I needed to check with the CI.
Force-pushed from ddcb82d to 9b72780 (compare).
Codecov Report

```
@@            Coverage Diff             @@
##           master     #636      +/-   ##
==========================================
- Coverage   93.64%   93.64%   -0.01%
==========================================
  Files          26       26
  Lines        2030     2029       -1
==========================================
- Hits         1901     1900       -1
  Misses        129      129
==========================================
```

Continue to review the full report at Codecov.
I'm sorry I haven't responded before. We wrote this code in November 2019 and it's been some time, so I can't quite remember the thinking behind some of the decisions made during the hackathon. I don't know why we added line 483. In the original GIFT code, they do z-score the data before estimating the number of components to keep, so my guess is we were testing how the z-scoring affected the decomposition and forgot to remove that line. Now, since we're working on …

Edit: Good catch @notZaki! Thanks!
@eurunuela now that you bring that up, I actually do remember that from when I was reviewing the code, because I'm the one who asked you to put in that TODO. So that's why it's there.
I guess jetlag happened and we just forgot we had to do that. Sorry about that. Next steps before we merge this PR should be: …
Would you like to do that @notZaki? Or would you prefer I push changes to this PR?

Edit: The second point is mainly for the …

Edit 2: We could simply z-score in time without the if statement.
@eurunuela I'd open an issue in the mapca repo cross-referencing this one so we don't forget a second time ;-)
Now that I think about it, I'm starting to remember why we used …
Yes, I also remember that.
I've been looking into the different normalization approaches and how they differ. I've checked the one used in GIFT, the sklearn scaler we added during the hackathon, and the z-score applied to the temporal dimension. Here's an overview of what I've found. I will use this dummy matrix to make the comparisons:

```python
array([[4, 1, 8, 8],
       [7, 3, 7, 5]])
```

with dimensions 2 (space) by 4 (time).

**GIFT**

This is how GIFT normalizes variance (see here):

```matlab
for n = 1:size(data, 2)
    data(:, n) = detrend(data(:, n), 0) ./ std(data(:, n));
end
```

This normalization returns the following matrix:

```python
array([[-0.70710678, -0.70710678,  0.70710678,  0.70710678],
       [ 0.70710678,  0.70710678, -0.70710678, -0.70710678]])
```

If you try to replicate this in Python, you will see that the std functions return different values. To compute the std like MATLAB does, you'd need to add the `ddof=1` argument.

**Sklearn scaler**

This is how we implemented it during the hackathon for maPCA (see here):

```python
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data)
```

And it returns:

```python
array([[-1., -1.,  1.,  1.],
       [ 1.,  1., -1., -1.]])
```

**Z-score in temporal dimension**

This is how tedana does the z-score in the temporal dimension (see here):

```python
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T
```

And it returns:

```python
array([[-0.42409446, -1.44192118,  0.93300782,  0.93300782],
       [ 0.90453403, -1.50755672,  0.90453403, -0.30151134]])
```

**Conclusion**

It is clear that the three approaches yield very different results. I've gone through the paper again, and having zero mean and unit variance in the temporal dimension is what makes sense. Mathematically speaking, we're interested in calculating the entropy rate of a Gaussian random process x[n], n = 1, 2, ..., N, where N is the number of samples, E{x[n]} = 0, and E{x^2[n]} = 1. To me, z-scoring in the temporal domain makes the most sense and is what I'd do (and that's what has been suggested so far in this PR). However, I wanted to make sure we all see exactly what the different approaches are doing.
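For reproducibility, here is a self-contained sketch of the three normalizations on the same dummy matrix. This is illustrative only: it assumes numpy and scikit-learn are installed, and the variable names (`gift`, `scaled`, `data_z`) are mine, not tedana's.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 2 (space) x 4 (time) dummy matrix from the comment above.
data = np.array([[4, 1, 8, 8],
                 [7, 3, 7, 5]], dtype=float)

# GIFT-style: demean each column, then divide by the sample standard
# deviation (MATLAB's std normalizes by N-1, i.e. ddof=1 in numpy).
gift = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Hackathon version: StandardScaler also works column-wise, but uses
# the population standard deviation (ddof=0).
scaled = StandardScaler(with_mean=True, with_std=True).fit_transform(data)

# tedana-style z-score along the temporal dimension (row-wise here).
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T

print(gift)    # +/- 0.7071...
print(scaled)  # +/- 1.0
print(data_z)  # rows now have zero mean and unit variance
```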
Thanks for sharing the results @eurunuela. I think the difference between GIFT and StandardScaler will become negligible with a larger matrix. They seem very different on the example data because the first dimension (space) only has two values, so the sample and population standard deviations produce very different results. If z-scoring in the temporal domain is indeed what makes sense, then perhaps this can be relayed to the GIFT group to confirm whether this was a bug or intentional. This would be more relevant for the mapca repo because it would be diverging from GIFT.
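For intuition on why the gap shrinks, a tiny numpy check (my own illustration, not from the thread): the two estimators differ by a factor of sqrt(N / (N - 1)), which is sqrt(2) ≈ 1.41 for the two-row toy matrix but essentially 1 for realistic N.

```python
import numpy as np

rng = np.random.default_rng(0)

for n_rows in (2, 10, 20000):
    x = rng.standard_normal((n_rows, 4))
    # Sample std (MATLAB/GIFT, ddof=1) vs population std (sklearn, ddof=0).
    ratio = x.std(axis=0, ddof=1) / x.std(axis=0, ddof=0)
    print(n_rows, ratio[0])  # -> sqrt(n_rows / (n_rows - 1))
```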
@eurunuela would you mind quickly opening an issue explaining briefly what …
You're right. I didn't think of it, as I was just trying to keep the demonstration simple. I've just tested the same commands with a bigger matrix (20000 x 100) and the results between GIFT and sklearn are very similar. The biggest difference is on the order of …
There will be no need for that. I've checked that the sklearn scaler z-scores in the temporal domain, so we're safe.
I thought that the earlier results were showing that the sklearn scaler is z-scoring in the spatial domain instead of the temporal domain.
I've looked into that in more depth (see #655). The sklearn scaler appears to be z-scoring in both the spatial and temporal domains.
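One quick way to see what StandardScaler actually guarantees (an illustrative snippet with assumed shapes, not tedana code): it standardizes exactly along axis 0, so any normalization observed along the other axis is approximate and data-dependent.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 50))
scaled = StandardScaler().fit_transform(x)

# Columns (axis 0) are exactly standardized:
print(np.allclose(scaled.mean(axis=0), 0.0))  # True
print(np.allclose(scaled.std(axis=0), 1.0))   # True

# Rows (axis 1) are only approximately standardized:
print(np.ptp(scaled.mean(axis=1)), np.ptp(scaled.std(axis=1)))
```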
Thank you @notZaki! I think we should have this PR in our next release to ensure both normalization steps are done.
Thanks everyone for digging into this! Just to be clear: is this PR ready for review?
I think it's ready to get merged, so yes.
@notZaki could you please confirm by switching this PR from "Draft" to "Ready for Review"?
Revisiting this: if the next release is scheduled for next week, then some kind of change should be made, because `mdl` is the default option and I think everyone agrees that the current version (no normalization) is not ideal. Options are:

- Option A: z-score in the temporal dimension, diverging from GIFT.
- Option B: mimic GIFT's normalization.

I am leaning towards option B because I think it's safer to follow an established library (GIFT), but I'm curious what everyone else's thoughts are.
Why don't we go with option B (safest, and it will select more components) and study option A on the …
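For concreteness, option B amounts to something like the following helper (a sketch under my own naming; `normalize_like_gift` is hypothetical, not an existing tedana function):

```python
import numpy as np

def normalize_like_gift(data):
    """Demean each column and divide by its sample standard deviation,
    mirroring GIFT's detrend(data(:, n), 0) ./ std(data(:, n))."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
```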
Thank you @notZaki
Yes, I think we should! @notZaki had made a great point re: which to use, and I wanted to drag out my old Linear Algebra text to confirm. But I can commit to doing that by Wednesday, which should still leave us enough time to get it merged before cutting the release on Friday!
Sounds great! Right now, the PR mimics GIFT. I will have a look at the papers too, just to make sure.
Thanks @notZaki! 🚀
References #653.

Input data is normalized prior to dimension estimation in the GIFT/MATLAB implementation, but not in `ma_pca`. There is a normalized array `data_z`, but it isn't used as the input to `ma_pca`:

tedana/tedana/decomposition/pca.py, lines 186 to 193 in 45e5ad7

Not sure if this was intentional, but this PR normalizes the input in `ma_pca`.
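The shape of the fix, as described above, is simply to pass the already-computed normalized array into `ma_pca`. A hedged sketch follows: the variable names echo the linked pca.py snippet, but the exact call signature of `ma_pca` is assumed here, not quoted from tedana.

```python
# Z-score each voxel's time series (zero mean, unit variance in time).
data_z = ((data.T - data.T.mean(axis=0)) / data.T.std(axis=0)).T

# Previously ma_pca received the raw data; the fix feeds it data_z.
u, s, varex_norm, v = ma_pca(data_z, mask, criteria="mdl")  # signature assumed
```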