
[WIP, ENH] Add masking functions and simplify IO #70

Closed
wants to merge 13 commits

Conversation

@tsalo (Member) commented May 27, 2018

Changes proposed in this pull request:

  • Enhance tedana.utils.load_data to accept a multi-echo or z-concatenated gii/nii file, or a list of single-echo gii/nii files. It will return a single img-like object.
  • Add masking/unmasking functions for multi-echo gifti or nifti data (a rough sketch of the proposed interface follows this list).
    • Masking functions should take in a multi-echo img-like object and a mask img-like object and should return an M x E x T numpy array.
    • Unmasking functions should take in an M x E x T numpy array and a mask img-like object and should return a multi-echo img-like object.
    • Mask images can either be general (i.e., the same for all echoes) or can vary by echo. Masked data will have NaNs in voxels/vertices outside an echo-specific mask but inside the overall mask in order to maintain the M x E x T shape without splitting the data array by echo.
  • Once this is merged, all unmasked data should be in img-like objects and masked data should be in numpy arrays.
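
As a rough sketch of the proposed masking interface for the nifti case (apply_mask_me is the name used later in this thread, but the body here is illustrative only, not the PR's implementation):

```python
import numpy as np


def apply_mask_me(img, mask_img):
    """Illustrative sketch: mask an X x Y x Z x E x T image to an M x E x T array.

    If mask_img varies by echo (X x Y x Z x E), voxels outside an echo-specific
    mask but inside the union mask are set to NaN so the M x E x T shape holds.
    """
    data = img.get_fdata()                    # X x Y x Z x E x T
    mask = mask_img.get_fdata().astype(bool)  # X x Y x Z or X x Y x Z x E
    if mask.ndim == 3:
        return data[mask]                     # M x E x T
    union = mask.any(axis=-1)                 # overall (union) mask, X x Y x Z
    out = data[union]                         # M x E x T
    out[~mask[union]] = np.nan                # NaN where an echo's mask excludes the voxel
    return out
```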

@rmarkello (Member) left a comment:

Okay! Overall, a good start towards your proposed layout. I've made a few comments in places I think some clarification might be warranted: namely, there are a few spots where I think your gifti file manipulation isn't going to work. Also, and perhaps more importantly, I'm not sure about the utility of echo-specific masks! What's your thinking here? If there's no reason, I think that removing that functionality would make for some much easier code!

data = [nib.load(f) for f in data]

if isinstance(data[0], nib.gifti.gifti.GiftiImage):
    arrays = [np.vstack([dat.data for dat in img.darrays]) for img

Member:
You'd need np.hstack here if you want samples x time arrays.

Member Author (@tsalo):
I was doing this section wrong in a couple of ways, so I've had to alter it significantly in my newest commit, but thanks for catching this. It convinced me to take a closer look.

arrays = [np.vstack([dat.data for dat in img.darrays]) for img in data]
arr = np.stack(arrays, axis=1)
out_img = nib.gifti.GiftiImage(img.header, arr)

Member:
I believe img.header is called before being defined, here!

But, perhaps more importantly, I don't think this will actually generate a working GiftiImage. You'd need to use kwargs when instantiating, as in nib.gifti.GiftiImage(header=img.header, darrays=arr). However, it's worth noting that while that will technically work, it wouldn't be a valid gifti image file since the data is being stored as a standard numpy.ndarray and not a GiftiDataArray instance (trying to save it to a file would raise an error).

You could modify the tedana.utils.new_gii_like function (to account for the echos dimension) to output an appropriate gifti image?
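
For reference, one way a valid construction might look (a sketch only; new_gii_echos is a hypothetical stand-in for an updated new_gii_like):

```python
import numpy as np
import nibabel as nib


def new_gii_echos(ref_img, arrays):
    """Hypothetical sketch: wrap per-volume arrays in GiftiDataArrays so the
    result is a GiftiImage that nibabel can actually save."""
    out_img = nib.gifti.GiftiImage(header=ref_img.header)
    for arr in arrays:
        # Raw ndarrays are not serializable as gifti data; each volume needs
        # to be stored as a GiftiDataArray instance.
        darray = nib.gifti.GiftiDataArray(np.asarray(arr, dtype=np.float32))
        out_img.add_gifti_data_array(darray)
    return out_img
```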

Member Author (@tsalo):
You're totally right. I will try to shift the work to tedana.utils.new_gii_like soon, but am pushing a commit that works on some simulated data for now.

img : :obj:`nibabel.gifti.gifti.GiftiImage`
    3D (Vertex, Echo, Time) gifti image containing data.
mask_img : :obj:`nibabel.gifti.gifti.GiftiImage`
    1D (Vertex) or 2D (Vertex, Echo) gifti image with boolean or bool-like

Member:
I can't think of how we would use a mask that differed based on echo. I don't think that the current pipeline has such a use case, at least! What was your intended application?

raise TypeError('Provided data img is not a GiftiImage.')

img_arrs = [np.array(darr.data) for darr in img.darrays]
all_imgs_same = np.all(arr.shape == img_arrs[0].shape for arr in img_arrs)

Member:
You'll need a list comprehension here otherwise you get a generator: all_imgs_same = np.all([arr.shape == img_arrs[0].shape for arr in img_arrs]).
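
A quick toy illustration of why the generator version silently passes (not code from the PR):

```python
import numpy as np

shapes = [(2, 3), (4, 5)]
# A generator is a single truthy object to np.all, so this is always True.
print(np.all(s == shapes[0] for s in shapes))    # True (misleading)
# The list comprehension is evaluated element-wise, as intended.
print(np.all([s == shapes[0] for s in shapes]))  # False
```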

Member Author (@tsalo):
Thx. Nice catch!

masked_arrs = [np.empty((np.sum(union_mask), n_echos))
               for _ in range(len(img_arrs))]
for i_vol in range(n_vols):
    masked_arrs[i_vol][:] = np.nan

Member:
I'm not totally clear on why we want np.nan in the masked data. Rather, I understand why the np.nan are there, but I think this goes back to my comment above: I'm not sure that we need echo-specific masks! I'm happy to be convinced of their utility, but I can't envision a use case at the current juncture.

Member Author (@tsalo):
I guess that having echo-specific masks is the same as using make_adaptive_mask, but this way only data need to be passed around after initial masking, rather than data + a mask. I'd rather avoid having any masks or mask-like arrays passed around if the data can carry the same information implicitly. It should be easy enough to get the min_mask or the adaptive_mask from an array that has NaNs in voxels from echoes with dropout in those areas.
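
For what it's worth, a rough sketch of recovering that implicit mask information from the NaNs (masks_from_nans is a hypothetical helper, not part of this PR):

```python
import numpy as np


def masks_from_nans(masked_data):
    """Sketch: derive mask-like arrays from an M x E x T array with NaNs."""
    # True where a voxel has any valid data for a given echo (M x E)
    good_by_echo = ~np.isnan(masked_data).all(axis=2)
    # Adaptive-mask-like count of usable echoes per voxel (M,)
    adaptive_mask = good_by_echo.sum(axis=1)
    # Min-mask-like indicator of voxels with valid data in every echo (M,)
    min_mask = good_by_echo.all(axis=1)
    return adaptive_mask, min_mask
```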

               idx_val in abs_echo_idx])
for j_vol in range(n_vols):
    masked_arrs[j_vol][i_echo, rel_echo_idx] = \
        img_arrs[j_vol][i_echo, abs_echo_idx]

Member:
I'm not sure I understand what shape the darrays in the initial gifti image are in. Are we storing them as samples x echos arrays, where each GiftiDataArray is a separate volume? That's what I assumed based on load_data2(), but here it seems they're echo x samples arrays.

Member Author (@tsalo):
Honestly, I wasn't as careful about the order of dimensions as I should have been. Do you (or @emdupre) have a preference? Either way, each GiftiDataArray will be a separate volume, but I don't currently have a preference for echo x sample or sample x echo.

Member:
I think I might prefer sample x echo to be consistent with the original formatting X x Y x Z x echo x time, where X x Y x Z are being collapsed into sample. WDYT ?
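
A tiny illustration of that convention (assumed shapes only, not code from this PR):

```python
import numpy as np

x, y, z, n_echos, n_vols = 4, 4, 4, 3, 10
img_data = np.random.rand(x, y, z, n_echos, n_vols)
# Collapse X x Y x Z into a single sample axis, keeping echo before time
# to mirror the X x Y x Z x E x T ordering of the source image.
samples = img_data.reshape(-1, n_echos, n_vols)
print(samples.shape)  # (64, 3, 10)
```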

@tsalo (Member Author) commented May 29, 2018

I was under the impression that tedana used multi-echo masks. Is that not the case?

@codecov bot commented May 29, 2018

Codecov Report

Merging #70 into master will decrease coverage by 2.97%.
The diff coverage is 1.72%.

@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   45.81%   42.84%   -2.98%     
==========================================
  Files          29       30       +1     
  Lines        1602     1718     +116     
==========================================
+ Hits          734      736       +2     
- Misses        868      982     +114
Impacted Files            Coverage        Δ
tedana/utils/masking.py   0% <0%>         (ø)
tedana/utils/utils.py     69.19% <5.4%>   (-14.66%) ⬇️


@emdupre (Member) commented Jun 1, 2018

@rmarkello just issued a hotfix for file handling in #72 -- could you merge in master on this, @tsalo ? I'd like to try and have both of these PRs in a 0.0.2 release, if possible ! But I'm planning to cut one soon with at least the corrected file handling.

@tsalo (Member Author) commented Jul 25, 2018

Okay, I think this should work now. The unmask_me function is pretty quick, but the apply_mask_me function is really slow (~30-40 seconds on simulated full-size data with 20 TRs). We can probably make it much faster when an echo-varying ME mask isn't used, which is good, but I don't know how to speed it up for ME masks.

I'm holding off on writing tests until after we've reviewed and agreed on an implementation.
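
For the case where the mask does not vary by echo, one possible fast path might look like this (a sketch under assumed shapes, not the PR's apply_mask_me):

```python
import numpy as np


def apply_mask_uniform(data, mask):
    """Sketch of a fast path when one X x Y x Z mask applies to every echo.

    data : X x Y x Z x E x T array; mask : X x Y x Z boolean-like array.
    Returns an M x E x T array, where M = mask.sum().
    """
    # Boolean indexing over the three spatial axes collapses them into one
    # sample axis in a single vectorized step, with no per-echo looping.
    return data[np.asarray(mask, dtype=bool)]
```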

@tsalo (Member Author) commented Jul 27, 2018

@rmarkello @emdupre I have a question about T2* estimation that might significantly affect this PR. It seems that in the t2ss array (the one we have documented with "???"), which is shaped M x E-1, the first column is the voxel-wise estimates of T2* using the first two echoes, the next column uses three echoes, etc. Then, the full T2* map (t2saf) uses T2* estimated from the first two echoes for voxels with good data for only one echo (see here). Does that make sense?

I had assumed that the full map would use one of the following:

  1. Some notation so that make_optcom would use the single echo's value (but it doesn't seem to do that).
  2. Whatever estimated T2* that would correspond to a simple average across echoes.
  3. Or, following the logic that seems to be used in the code, fitting to all of the echoes even though only one is "good."

If data from "bad" echoes is going to be used (as with the T2* estimation method currently implemented or the second and third ideas above), then the whole img-based masking/unmasking approach probably needs to be rethought ☹️

@emdupre (Member) left a comment:

My reading is that t2ss is the full log-linear fit, t2sa is the limited map only considering those voxels which have good signal from two or more echoes, and t2saf is t2sa but adding back in information from t2ss for the single-echo signal voxels.

I might be totally misinterpreting, so I'd appreciate if you could set me straight, @tsalo ! Or from anyone else, too. It'd be great to get this sorted. It'd also be great if we gave these things better names ! fl is maybe my least favorite of the variable names in the code you linked, though none are great....

@@ -106,6 +107,87 @@ def load_image(data):
    return fdata


def load_data2(data, n_echos=None):

Member:
Is this being used anywhere, at this point ?

Member Author (@tsalo), Jul 30, 2018:
It would replace load_data as the input function for multi-echo data files, but it isn't used yet.

Member:
Got it, thanks !! Can we drop the gifti support in there too, then ?


Notes
-----
This works amazingly well, but is quite slow.

Member:
We can make this much faster by subbing out pandas, if we stick with this approach !

elif len(img.shape) == 5:
    _, _, _, n_echos, n_vols = img.shape
else:
    raise ValueError('Input img must be 4D (X x Y x Z x E) or '

Member:
Is this a valid error to pass back up to the user ? They'll have either passed in a zcat ( X x Y x Z x E x T) or individual echo files, so I don't think they would interact with a (X x Y x Z x E) image..?

Member Author (@tsalo):
load_data2 would load zcat (X x Y x Z*E x T), the preferred 5D (X x Y x Z x E x T), and individual echo files all into the same X x Y x Z x E x T-ordered image type, so masking and unmasking would respectively take in and return only that image type. We could allow other image types, but I think that's unnecessary. Multiple individual echo images could just be masked/unmasked with nilearn's functions, and we shouldn't have to deal with zcat images (even when we do have zcat files) because we'll have load_data2.

Member Author (@tsalo):
Oh sorry, I misinterpreted your point. I guess it would be rare to feed in an image that isn't X x Y x Z x E x T or X x Y x Z x E, but I think that just makes this error unlikely (but not invalid).

Member:
Sorry for being unclear ! To clarify: shouldn't it be a X x Y x Z x T file, not a X x Y x Z x E file ? The former would be an individual echo file, the latter is something I don't think users will interact with.

Member Author (@tsalo):
Images with X x Y x Z x T data shouldn't be used with the multi-echo masking/unmasking functions. In that situation, nilearn's functions are more appropriate. Plus, load_data would take in multiple X x Y x Z x T files and return an X x Y x Z x E x T image, so X x Y x Z x T images shouldn't occur with multi-echo data.
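
For context, the stacking step described above might look roughly like this (stack_echoes is a hypothetical name, not load_data itself):

```python
import numpy as np
import nibabel as nib


def stack_echoes(echo_files):
    """Sketch: stack single-echo X x Y x Z x T files into one X x Y x Z x E x T image."""
    imgs = [nib.load(f) for f in echo_files]
    # Insert the echo axis before time so the result is X x Y x Z x E x T.
    data = np.stack([img.get_fdata() for img in imgs], axis=3)
    return nib.Nifti1Image(data, imgs[0].affine, header=imgs[0].header)
```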

Member:
Completely agree, but I'm not sure how this error would make sense to a user, since they'll have only supplied the individual echo files or the zcat image (as far as I understand) ! The X x Y x Z x E file would then be created "under-the-hood" in a workflow.

@tsalo (Member Author) commented Jul 30, 2018

From what I can tell, your interpretation of t2ss isn't quite correct. It seems to be the log-linear fit for ascending numbers of echoes, starting with 2, which is why it's shaped M x E-1 (because 1-echo fit is not done/included). One problem is that it thus has values for voxels using more echoes than are good. For example, imagine a voxel j that has good data for only one echo (of 4). t2ss is going to have the beta for the model fitted on the first two echoes at t2ss[j, 0], the beta for the first three echoes at t2ss[j, 1], and the beta for all four at t2ss[j, 2]. Then, the value in t2sa[j] is going to be 0 or NaN (or whatever we set it to), but the value in t2saf[j] is going to be the fit using the first two echoes from t2ss (i.e., t2ss[j, 0]).

It doesn't make a ton of sense to me, and I think any of the three possibilities I mentioned above make more sense.

I can make a new branch with a notebook to show what I mean.
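
To make my reading concrete, here is a toy numpy version of the indexing I'm describing (illustrative only; not the actual tedana code):

```python
import numpy as np

n_voxels, n_echos = 5, 4
# t2ss[:, k] holds the fit that uses the first (k + 2) echoes, hence E - 1 columns.
t2ss = np.random.rand(n_voxels, n_echos - 1)
# Adaptive mask: number of "good" echoes per voxel.
adaptive = np.array([1, 2, 3, 4, 4])

# t2sa keeps only voxels with >= 2 good echoes, using the fit limited to those
# echoes (the "subtract two from the echo number" indexing).
t2sa = np.zeros(n_voxels)
good = adaptive >= 2
t2sa[good] = t2ss[good, adaptive[good] - 2]

# t2saf starts from t2sa and fills single-good-echo voxels with the two-echo fit.
t2saf = t2sa.copy()
single = adaptive == 1
t2saf[single] = t2ss[single, 0]
```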

@emdupre (Member) commented Jul 30, 2018

I think the critical line is here and I can remember we had a discussion about it, though I don't remember where or the resolution ! Will look for it. The main issue is that it seems that when we subtract two from the echo number it wraps back around for the first echo, filling that value for the last echo in the list. Is that also your understanding ?

@tsalo (Member Author) commented Jul 30, 2018

I think that's meant to be a feature, not a bug. I.e., the last column is filled first and then overwritten because the author didn't want to use those values at all. At least that's my interpretation based on the fact that t2ss is preallocated with n_echos - 1 columns. If the single-echo fit was going to be kept around and used, then the array would have been made with n_echos columns.

t2ss = np.zeros([n_samp, n_echos - 1])
s0vs = np.zeros([n_samp, n_echos - 1])

But yeah, we could change the preallocation, the lines you referenced, and the usage of start_echo (which may be an input but is treated like a constant throughout) in order to add the single-echo fit to t2ss and t2saf (and their S0 equivalents). However, do we want to use the single-echo fits? The values look funny (very small and low variance), which makes sense given that it's fitting to multiple values (from different volumes) of a single input (the one echo). We could use the full all-echo fit (including bad echoes) in t2saf or find some way to use the echo-averaged or single-echo value in make_optcom.

@emdupre (Member) commented Jul 30, 2018

I see. Thanks for helping me remember all of this ! 👍

Yes, I think you're right that it's a (desired) feature to overwrite the first echo, probably for the reasons you outlined, i.e., because those values aren't useful.

WDYT are the next steps, then ? I'm rather reluctant to change something so big as the T2* generation without a very good reason and a broader discussion....

@tsalo (Member Author) commented Jul 30, 2018

I agree that we should discuss this with everyone before changing the behavior at all. It could be something for the group to discuss in person at INCF or we could bring it up on Gitter or over email.

My concern for this PR is that the current approach doesn't work with my proposed changes because, if we want to use data from bad echoes for full or even two-echo fits (as is currently implemented), then explicitly masking the bad echoes with the multi-echo masks I've been working on will prevent it.

@emdupre (Member) commented Jul 30, 2018

Ok -- WDYT of closing this for now, then, and opening a new issue to continue the discussion on the T2* map generation ? This, #101, and #14 seem like things we should decide upon for our 1.0.0 release, but we have a little time between now and then, IMHO.

@tsalo (Member Author) commented Jul 30, 2018

I'm happy to close this for the foreseeable future, and do think that changing the fitting behavior on a conceptual level should wait until a major release.
