Should PCA component selection leverage multi-echo information? #101
I have been frustrated by failures to converge in the past, and the tunable parameters kdaw and rdaw also add greatly to the complexity. However, I would be concerned about selecting with a hard threshold, based primarily on Figure 10 from the 2017 multi-echo fMRI paper, which suggests that a kappa/rho-based PCA approach captures more of the useful variance. I'm not sure how often folks just use the denoised vs. high-kappa data, or the component maps. I stick with the denoised data myself, as the high-kappa time series can miss brain components or include things that are BOLD but not 'brain activity' (e.g., breathing related).
My concern is that this is a single figure from a single dataset, and the comparison between the ICA results from each threshold isn't really shown. The threshold cutoffs are also a bit odd because the PCA components are sorted by kappa, not variance, so it's unclear how many components remain under each threshold method. Also, looking at the PCA selection code (tedana/decomposition/eigendecomp.py, lines 218 to 232 at commit 31504f6), it's clear that there are cases where a component would be rejected by the ME-PCA method but would have been retained with the strict variance method.
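To make that concrete, here is a deliberately simplified, hypothetical sketch. It is not the actual logic in eigendecomp.py, and the thresholds and metric values are invented; it only illustrates how a kappa/rho-based keep rule and a strict variance cutoff can disagree in both directions:

```python
import numpy as np

def variance_keep(varex, cutoff=0.01):
    # Keep components explaining at least `cutoff` fraction of total variance.
    return varex >= cutoff

def kappa_rho_keep(kappa, rho, kappa_thr, rho_thr):
    # Keep components that score highly on either the TE-dependence metric
    # (kappa) or the TE-independence metric (rho), regardless of variance.
    return (kappa >= kappa_thr) | (rho >= rho_thr)

varex = np.array([0.30, 0.10, 0.05, 0.0005])  # fraction of variance explained
kappa = np.array([80.0, 45.0, 12.0, 90.0])
rho = np.array([15.0, 60.0, 10.0, 8.0])

print(variance_keep(varex))                    # [ True  True  True False]
print(kappa_rho_keep(kappa, rho, 40.0, 30.0))  # [ True  True False  True]
# Component 2 clears the variance cutoff but fails the kappa/rho rule;
# component 3 is the reverse.
```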
The dataset in Figure 10 (probably short TR) seems to have retained more components with the ME-PCA selection, but I'm not sure this is universally true, and I've suspected that the opposite holds in some cases. Given that this is so central to the code, I think it would be good to have multiple thresholding options for this section of the code so that it would be possible to compare them. I've repeatedly had situations where something goes wrong and I wished I could easily see how the selection process would have looked with a different component selection criterion. If others agree, the question is which selection option should be the default, or whether this is a parameter where users should need to actively make a choice.
Across the board I agree, and I thought the same about the TR, but it looks like that data is actually derived from this PNAS paper, which had a TR of 2.47 s with 4 echoes. Regardless, that image is (probably) from a single subject and was chosen to highlight the method, so I certainly don't want to lean on it too much. I think the best discussion of the motivation for this ME-PCA approach that I have come across is in the supplement of the above paper, here (link may not work). At the very least, a good start would be the ability to see which (and how many) PCs would have been selected by targeting variance vs. ME-PCA, in table format; that way we could start answering the question of how universal this approach is and what happens in different datasets. It also dovetails nicely with the other discussions (#86, #84) of selection during the ICA routine. As for the default, that is a tough choice. I would be inclined to stick with ME-PCA and allow user selection of the other options, but that is based purely on my experience, not rigorous testing across different datasets. Interested to see how others feel on this point.
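One possible shape for that comparison table, as a hedged sketch. All function and column names here are hypothetical, not existing tedana outputs, and the per-component metrics and keep/reject decisions are assumed to have been computed upstream:

```python
import pandas as pd

def selection_comparison(kappa, rho, varex, keep_variance, keep_mepca):
    """Tabulate, per component, the metrics and each method's keep decision."""
    df = pd.DataFrame({
        "kappa": kappa,
        "rho": rho,
        "variance_explained": varex,
        "kept_variance": keep_variance,
        "kept_mepca": keep_mepca,
    })
    df.index.name = "component"
    print("variance-based keeps {} components; ME-PCA keeps {}".format(
        int(sum(keep_variance)), int(sum(keep_mepca))))
    return df

# Example usage with the toy arrays and rules from the earlier sketch:
# table = selection_comparison(kappa, rho, varex,
#                              variance_keep(varex),
#                              kappa_rho_keep(kappa, rho, 40.0, 30.0))
# table.to_csv("pca_selection_comparison.csv")
```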
@dowdlelt It sounds like we're on the same page regarding having both options. I like Prantik's reasoning, but I haven't seen this specific step validated across a wide range of data parameters, and I do worry about what happens when it messes up. I've been particularly suspicious that this might be the root of some problems in data with <=150 volumes or data with certain types of noise or artifacts. As for the default, I don't have strong opinions. If you asked me today, I'd say we shouldn't have a default and this should be a required parameter. After we use the code for a while, if it seems like ME-PCA really is robust across many types of datasets, I'd set it as the default at a later date. There's still a lot of work to do, so we will likely be making breaking changes, but setting a default and later changing it is the type of thing that can cause users unneeded headaches. This is a weak opinion and I'd defer to others who have stronger opinions.
When running tedana, there is a non-negligible number of cases in which the ICA fails to converge. @handwerkerd confirmed that this most likely results from the tedpca step; specifically, from an overly aggressive decision tree that removes too many components and therefore causes the ICA to fail. Given that we've now moved to using the MLE automatic dimensionality estimation from sklearn, should we just drop the decision tree entirely? @handwerkerd said he should be able to share some example results.
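For reference, a minimal sketch of the sklearn MLE option being referred to; the data shape and variable names are illustrative toy values, not tedana's actual inputs:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
data = rng.standard_normal((200, 50))  # toy: e.g., time points x voxels

# n_components='mle' uses Minka's MLE to estimate the dimensionality;
# it requires svd_solver='full' and n_samples >= n_features.
pca = PCA(n_components='mle', svd_solver='full')
reduced = pca.fit_transform(data)
print("MLE-estimated number of components:", pca.n_components_)
```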
I'll start with my recommendation that we just switch to a standard variance-based component selection criterion and write the code in a way that makes it practical for people to play with other options in the future. The rest of this is just documenting what I discussed with @emdupre earlier today, which explains why I'm making this recommendation.

Elizabeth and I were looking at the code together and noticed one change between versions that might have unintentionally introduced problems. In the original code (starting at: https://bitbucket.org/prantikk/me-ica/src/8cc47cfed0203b3d6d187935ad3c2823b3e36a88/meica.libs/tedana.py?at=master&fileviewer=file-view-default#tedana.py-372 ), the component selection for the PCA components is done on the full SVD output. In the newer version, it seems like low-variance components are removed before the PCA component decision tree is run (tedana/decomposition/eigendecomp.py, line 144 at commit a19bd77).
While the goal of the PCA decision tree was to potentially keep low-variance components that had high kappa or rho, in the newer version this would never happen because those components seem to have already been removed.

That said, even with the original decision tree, I've noticed issues. In theory, the number of kept PCA components could be the same or higher when kappa and rho are considered rather than just using a variance-based threshold. In reality, I've seen many situations where the number of components is seriously reduced. For example, I ran several scans across multiple days on a Siemens Prisma 3T (TR=2 s, 160 volumes, 3 mm^3, SMS=2, iPAT=2, 5 echoes). As background, I collected these scans because I was getting unacceptably few components given the input data, and I suspected there was a problem with the pulse sequence options until I realized this was an issue with the algorithm. For several of these scans, I compared the number of ICA components (the same as the number of PCA components) that came out of the tedana currently distributed with AFNI vs. a stand-alone sklearn implementation with a variance-based threshold. The component selection method in tedana always resulted in fewer components, and sometimes non-trivially fewer. If there were a good rationale for why fewer components were better, I'd dig into this more, but short of that, I don't see much reason to give this decision too much more thought at this time.
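A stand-alone variance-based selection along these lines can be expressed with sklearn directly. This is a hedged sketch with toy data; the 95% threshold is only an example value, not the exact threshold or implementation used in the comparison above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy low-rank data: 20 latent time courses mixed into 3000 "voxels" plus noise,
# shaped like 160 volumes x masked voxels.
latent = rng.standard_normal((160, 20))
mixing = rng.standard_normal((20, 3000))
data = latent @ mixing + 0.1 * rng.standard_normal((160, 3000))

# Passing a float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance ratio exceeds that fraction.
pca = PCA(n_components=0.95, svd_solver='full')
reduced = pca.fit_transform(data)
print("components retained at the 95% threshold:", pca.n_components_)
print("cumulative variance captured:", np.cumsum(pca.explained_variance_ratio_)[-1])
```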
I think @tsalo has done a lot of good work in #122 to try to enable tracking of why components are retained (or discarded) across both the PCA and the ICA. It's also a nice place to address the potential "bug" that @handwerkerd pointed out, of the dimensionality reduction occurring before the selection tree. Given that convergence is such a consistent issue, I'm suggesting that we make the automatic dimensionality estimate from MLE the default behavior, but retain the decision tree (on all returned PCs, as implemented in ME-ICA <= v2.5) as a user-accessible option. What do you think, @dowdlelt @handwerkerd @tsalo (and everyone else)? Would this be a reasonable solution? Then, future work could extensively test the decision tree without changing the default behavior -- at least until we have significant evidence that it performs better than the automatic dimensionality reduction in a variety of contexts and for all the relevant result files.
I like making variance norm the default, as I've also run into the same problem as @handwerkerd with convergence difficulties. A tiny bit of testing led to similar results: with some data, the current default (kappa and rho sorting) led to fewer components compared to a variance-based approach. It is frustrating and confusing behavior for 'good' data to fail to converge. It also flows naturally from the idea of having other options. I don't want to encourage people to pick and choose their way to optimal data, but it may be valuable for some output to mention the number of components that would be selected with the unchosen method(s). Though I suppose that would be computationally expensive... It would make testing to find the right context slightly easier, so maybe it could be another option.
I agree that setting the automatic dimensionality estimate as the default and maintaining a way to run the decision tree is a good path forward. I personally suspect that the decision tree in its current form should only be used with significant caution, but keeping it there will help make sure there's a structure for future development in this area. In response to @dowdlelt: until we can show there is an empirically validated best path (or paths) to optimal data, we need to let power users pick and choose processing options. To me, that feels like the only way we can eventually converge on a consistent best-practice pipeline.
I may not have been clear about that: we definitely need to let folks collectively choose the right way forward. I was optimistically imagining a future in which there are countless options within tedana to select for denoising, combined with tedana outputting information about the number of components for each one. Using this, in theory, a user could p-hack their way to findings. This is always a concern, but not one that is particularly pertinent to tedana, particularly since we are discussing only 2(!) options at the moment. So, that was an undercooked thought. I agree with @handwerkerd: without a validated method, or more likely multiple methods depending on the data, informed selection is required.
I like your idea of testing both options across different datasets, @dowdlelt, but I think this is closer to what Dan was calling a "power user" approach. We're hopefully setting defaults for users to "succeed", in that they are our reasonable best guess for what is likely to work for most datasets, and therefore what we recommend. I'm learning to never underestimate the power of defaults, so setting a default that provides both options (beyond the computational cost) strikes me as likely to confuse an average user rather than empower them. It sounds like we're converging on retaining the selection tree as an accessible option, but setting automatic dimensionality estimation (via sklearn's MLE approach) as the default. Does this agree with everyone else's understanding? Or does anyone else want to chime in on this topic?
Agree on confusion - better to set a strong, presumably robust-ish default and let the power users play more with the toys as desired. I agree on the automatic dimensionality estimation approach and apologize for any confusion I contributed. It's just fantastic to see these discussions occurring and to see all the input and work!
No, it's great to have this discussion! Thanks as always for your input ✨
Hi all, late in joining this conversation; I can post this elsewhere if it doesn't fit. However, a quick question: do we know whether setting the random seed has an effect on how well the ICA performs? Is it robust enough to display similar behavior regardless of seed, or does it have trouble with some sets of numbers vs. others? It's unclear to me, but I did notice that there's an option for setting one. I chatted briefly with @dowdlelt about this and it's unclear to us. We can perform some tests here at MUSC if they haven't been done before.
The random seed will affect the precise results of the ICA, but it's doubtful that some ICA outputs would be clearly better rather than merely different. Specifically for this processing stream, some of this variation will be removed when the components are recombined into the denoised time series. As long as portions of the total signal variance don't shift between accepted and rejected components, the seed won't affect the denoised time series at all. I expect some shifting, so the denoised time series won't be identical, but, in my experience, they are very similar across repeated runs with different seeds. Some versions of ICA run the component estimation algorithm several times with multiple random seeds and create average components. I'm fairly sure the current version of this code doesn't do that, but if someone shows this is a serious problem, it's worth considering. That said, one reason the random seed is an input parameter is so that, if issues arise, it's possible to re-run the algorithm with the same seed and get an identical result.
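As an illustration of the reproducibility point, here is a toy sketch using sklearn's FastICA, whose random_state parameter plays the same role as the seed discussed here; this is not tedana's own ICA wrapper:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
t = np.linspace(0, 8, 500)
sources = np.c_[np.sin(2 * t),            # sinusoid
                np.sign(np.cos(3 * t)),   # square wave
                rng.laplace(size=500)]    # sparse noise
mixed = sources @ rng.standard_normal((3, 3)).T  # toy mixed observations

ica_a = FastICA(n_components=3, random_state=42, max_iter=1000).fit(mixed)
ica_b = FastICA(n_components=3, random_state=42, max_iter=1000).fit(mixed)
ica_c = FastICA(n_components=3, random_state=7, max_iter=1000).fit(mixed)

print(np.allclose(ica_a.components_, ica_b.components_))  # True: same seed, same result
print(np.allclose(ica_a.components_, ica_c.components_))  # typically False: order/sign can differ
```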
@tsalo @emdupre this seems like an issue that could be answered by the reliability analysis. Maybe we could add a pause tag so that people don't sift through this until we have more data to inform the discussion?
That works for me.
I think I'd like to close this issue, as it's diverged pretty substantially from its original question. Maybe we could open an issue on tedana-reliability specifically about testing the new GIFT-style PCA vs. the decision-tree PCA?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana :tada:!
MDL, AIC, and KIC are all included in the plans for the validation analysis (ME-ICA/tedana-comparison#4 (comment)), so are we good to close this? I'd rather actively close the issue than let stalebot do it.
Before running independent component analysis (ICA), we first run principal component analysis (PCA) and select a limited number of the resulting principal components. This is a common approach across the community, and it makes a great deal of sense: the data need to be whitened, and the dimensionality reduction helps us avoid over-fitting to noise.
Edited to add: see this conversation for an explanation of why we need both the PCA and ICA steps to reduce dimensionality and whiten: #44 (comment)
However, the way that principal components are selected here is not by a noise threshold, but rather via a decision tree similar to the process by which independent components are later selected. This can yield very high-dimensional datasets that have a low probability of converging during ICA. Is it worth revisiting whether principal components should be selected by a hard threshold?
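For context, a minimal sketch of the conventional pipeline described above: PCA for whitening and dimensionality reduction under an example hard threshold (95% cumulative variance, a placeholder value), followed by ICA on the reduced data. The data shapes and values are toy assumptions, not tedana's actual implementation:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
# Toy data with some low-rank, non-Gaussian structure: volumes x masked voxels.
latent = rng.laplace(size=(300, 10))
mixing = rng.standard_normal((10, 3000))
data = latent @ mixing + 0.1 * rng.standard_normal((300, 3000))

# Step 1: whiten and reduce dimensionality with a hard variance threshold.
pca = PCA(n_components=0.95, whiten=True, svd_solver='full')
whitened = pca.fit_transform(data)
print("PCA components retained:", pca.n_components_)

# Step 2: run ICA on the reduced data; whiten=False because the PCA step
# already whitened it.
ica = FastICA(whiten=False, max_iter=1000, random_state=42)
ica_sources = ica.fit_transform(whitened)
print("ICA source time series shape:", ica_sources.shape)
```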