
Diagnose ds210 failure #2731

Merged: 5 commits into nipreps:maint/21.0.x, Apr 21, 2022
Conversation

@effigies (Member) commented Mar 3, 2022

Starting by setting a random seed to verify that it can be reproduced.
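
(For context, here is a minimal sketch of what pinning the seed typically involves in this Python stack. The helper name, the seed value, and the idea of exporting ANTS_RANDOM_SEED are illustrative assumptions, not necessarily how fMRIPrep wires it up.)

# Hypothetical seeding helper; not the actual fMRIPrep code.
import os
import random

import numpy as np


def set_workflow_seed(seed=20220303):
    """Pin the RNGs a run may touch so a failure can be replayed."""
    random.seed(seed)                            # Python-level randomness
    np.random.seed(seed)                         # NumPy-based steps
    os.environ["ANTS_RANDOM_SEED"] = str(seed)   # respected by recent ANTs tools


set_workflow_seed()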

@effigies (Member Author) commented Mar 4, 2022

Looks like that does it. Rerunning to confirm. Will try to reproduce locally.

@mgxd (Collaborator) commented Apr 13, 2022

@effigies were you able to replicate this? I just tried locally (with the same random seed) and it completed.

@effigies (Member Author) commented:

Sorry, should have updated. I did try locally and couldn't replicate. If we upload the entire working directory as artifacts, that might be the best bet for tracking it down.

@mgxd (Collaborator) commented Apr 14, 2022

Update: I reran, this time using the fast-track anatomicals provided, and was able to replicate the crash. As predicted, the BOLD mask is awful. After looking into this, I think the failure point is final_boldref_wf.enhance_and_skullstrip_bold_wf.n4_correct (see attachment).
[Screenshot: Screen Shot 2022-04-14 at 9.03.22 AM]

Two things:

  • N4Correct sets dimensionality to 3, but the input is a 4D file
  • Shouldn't RobustAverage be producing a single volume?

cc @effigies @oesteban
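
(A quick way to confirm the first bullet is to inspect the image handed to the n4_correct node. Below is a minimal, hypothetical check with nibabel; the path is a placeholder for whatever sits in the node's working directory.)

# Hypothetical debugging check: is the input to n4_correct 3D or still 4D?
import nibabel as nb

# Placeholder path; point it at the actual input in the node's working directory.
img = nb.load("work/final_boldref_wf/enhance_and_skullstrip_bold_wf/n4_correct/input.nii.gz")

print("shape:", img.shape)
if img.ndim == 4:
    print("4D input: RobustAverage did not collapse the series to a single volume.")
else:
    print("Single 3D volume, as N4 (dimension=3) expects.")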

@oesteban (Member) commented:

Wow, great job catching it.

Your two hypotheses are correct. The question is what, in the fast track, ends up giving N4 a 4D file.

Can you check which inputs to final_boldref_wf.enhance_and_skullstrip_bold_wf change between the fast track and the regular track?

@mgxd (Collaborator) commented Apr 14, 2022

Two things:

  • N4Correct sets dimensionality to 3, but the input is a 4D file
  • Shouldn't RobustAverage be producing a single volume?

I think I got confused across working directories, because this didn't seem to be the case after retesting. But I think I have found a simple solution: we have not been passing the already-calculated BOLD mask into the final boldref workflow. I submitted a quick fix (3911638) for the particular pathway our ds210 test takes (ME, SDC), but the others are unaccounted for and thus failing.
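
(To illustrate the idea of the fix, here is a rough nipype-style sketch of reusing the first-pass mask. The workflow and field names are approximations for illustration, not the exact ones in commit 3911638.)

# Illustrative nipype sketch, not the literal fMRIPrep code: hand the BOLD mask
# computed by the initial boldref workflow to the final one, instead of letting
# enhance_and_skullstrip_bold_wf re-derive a (potentially bad) mask.
from nipype.interfaces.utility import IdentityInterface
from nipype.pipeline import engine as pe

# Stand-ins for the real sub-workflows; only the connection pattern matters here.
initial_boldref = pe.Node(IdentityInterface(fields=["bold_mask"]), name="initial_boldref_outputnode")
final_boldref = pe.Node(IdentityInterface(fields=["bold_mask"]), name="final_boldref_inputnode")

wf = pe.Workflow(name="func_preproc_sketch")
wf.connect([
    (initial_boldref, final_boldref, [("bold_mask", "bold_mask")]),
])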

mgxd marked this pull request as ready for review on April 14, 2022 at 20:58
mgxd requested a review from oesteban on April 14, 2022 at 20:58
@mgxd (Collaborator) commented Apr 14, 2022

@effigies I can't request you as a reviewer since it's your PR, but I would appreciate a quick look (if you find some time).

@@ -993,6 +994,9 @@ def init_func_preproc_wf(bold_file, has_fieldmap=False):
(bold_hmc_wf, bold_bold_trans_wf, [
("outputnode.xforms", "inputnode.hmc_xforms"),
]),
(initial_boldref_wf, final_boldref_wf, [
@effigies (Member Author) commented on the diff:

I thought the idea of having a second go-round was to get a better mask.

@oesteban (Member) commented Apr 14, 2022

This could be okay IMHO, but to me the problem is that we do not yet know what makes the fast track fall into this while the regular track is fine for the given seed.

We first need to bisect the issue and determine the point where inputs and/or outputs deviate from the regular track.

We may live with this just fine, but if we do it without knowing why the error occurs, a regression is almost certain.

@mgxd (Collaborator) commented:

The (silent) failure point is n4_correct, though from looking at the inputs it is not very clear why one case is failing while the other is fine. I've compiled the working directories of fast-track (ft) / no fast-track (noft) into a tarball (ds210_n4_error.tar.gz) if either of you would like to take a look.

@effigies (Member Author) commented:

@mgxd Very unlikely today, as we'll be wrapping our sprint. I'll have some time on the plane and train, though.

@oesteban (Member) commented:

The masks are different. That doesn't justify the crash (as in, it's crazy the mask makes such a big difference for N4), but at least we know where the problem is coming from, right?

[Screenshot: Screen Shot 2022-04-15 at 4.57.43 PM]

Some thoughts that I'm skeptical will fully solve the issue on their own, but that together should make the full process more reliable:

Only once these things have been tested and the outputs of N4 in both conditions are more similar would I consider whether we also want to use a prior mask to avoid these steps altogether.

However, these steps are going to happen anyway the first time around, so we want to make sure the workflow is more reliable.

WDYT?

@oesteban (Member) commented:

I'm digging more into the inputs:

  • The masks are float32 (which is not great for a mask).
  • The _average input images are very close to one another, but not exactly the same. It would be good to trace that back to the fast-track divergence.
  • The _average images should be clipped for N4 to perform appropriately. Currently they have a range from 0 through ~12000, with a median of ~6.5. I'm sure reducing the dynamic range to 0-255 would basically compress all those low intensities within one bin, making the job easier for N4 (see the clipping sketch below).

That all said, there is also appeal in giving SynthStrip a go and replacing all of this if it works.
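
(To make the clipping point concrete, here is a minimal sketch assuming a NIfTI input and only numpy/nibabel; the function name and percentile cutoffs are illustrative, and this is not the niworkflows implementation.)

# Hypothetical intensity-clipping sketch, not the niworkflows implementation:
# squash extreme values and rescale so N4 sees a compact ~0-255 dynamic range.
import nibabel as nb
import numpy as np


def clip_for_n4(in_file, out_file, p_low=2.0, p_high=99.8):
    img = nb.load(in_file)
    data = img.get_fdata()

    # Clip to robust percentiles, then rescale into 0-255 so the huge tail of
    # high intensities no longer dominates N4's histogram.
    lo, hi = np.percentile(data, [p_low, p_high])
    data = np.clip(data, lo, hi)
    data = (data - lo) / (hi - lo) * 255.0

    nb.Nifti1Image(data.astype(np.float32), img.affine).to_filename(out_file)
    return out_file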

@oesteban (Member) commented:

  • The _average input images are very close to one another, but not exactly the same. It would be good to trace that back to the fast-track divergence.

Q: Is this happening with single-echo too?

@mgxd (Collaborator) commented Apr 15, 2022

I would guess the difference is because the anatomical derivatives have been run on the high-res ds210, whereas we're calculating the T1w/MNI transform using the downsampled outputs (and using --sloppy). I tried testing with the latest ANTs version (2.3.5), but no luck. I'm rerunning now using your suggestions (#2731 (comment)).

That all said, there is also appeal in giving SynthStrip a go and replacing all of this if it works.

Yes, it would be good to test. But we'd still want to backport a fix for 21.0.x.

Q: Is this happening with single-echo too?

Potentially (see #2761), but I haven't personally replicated or seen it.

@oesteban (Member) commented Apr 15, 2022

Are we using FSL 6? Could the reason these errors bubble up be an upgrade, with this code https://github.com/nipreps/niworkflows/blob/e1f4267eb5fd878b2a001f184e1bddbb3f4a6843/niworkflows/func/util.py#L378-L386 introducing weirdness?

Please check some of the ideas in nipreps/niworkflows#707.

I'm pretty positive we want to do the clipping before registration and N4, but I don't have time now to take a stab at it.

@mgxd (Collaborator) commented Apr 15, 2022

I just ran with the following changes: nipreps/niworkflows@maint/1.4.x...mgxd:dbg/ds210-failure

and the run completed successfully, with normal-looking reports.

@oesteban (Member) commented:

and the run completed successfully, with normal-looking reports.

Yup, that's the weights instead of the mask, I'm sure of that.

I have updated my PR with further changes (esp. removing the binarization and binary dilation for the no-premask pathway).

@mgxd (Collaborator) commented Apr 20, 2022

Interestingly, we're now running into:

220420-17:06:37,576 nipype.workflow ERROR:
	 Saving crash info to /out/fmriprep/sub-02/log/20220420-161238_95a4bf05-53b1-4a12-965b-3f1ff395c91f/crash-20220420-170637-UID1001-skullstrip_first_pass-4731a275-68da-44b0-8314-e8ce701ce2f9.txt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/opt/conda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 516, in run
    result = self._run_interface(execute=True)
  File "/opt/conda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 635, in _run_interface
    return self._run_command(execute)
  File "/opt/conda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 741, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/opt/conda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 428, in run
    runtime = self._run_interface(runtime)
  File "/opt/conda/lib/python3.8/site-packages/nipype/interfaces/fsl/preprocess.py", line 163, in _run_interface
    runtime = super(BET, self)._run_interface(runtime)
  File "/opt/conda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 822, in _run_interface
    self.raise_exception(runtime)
  File "/opt/conda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 749, in raise_exception
    raise RuntimeError(
RuntimeError: Command:
bet vol0000_xform-00000_clipped_merged_average_corrected.nii vol0000_xform-00000_clipped_merged_average_corrected_brain.nii.gz -f 0.20 -m
Standard output:
/opt/fsl-6.0.5.1/bin/bet failed during command:vol0000_xform-00000_clipped_merged_average_corrected.nii vol0000_xform-00000_clipped_merged_average_corrected_brain.nii.gz -f 0.20 -m
Standard error:
/opt/fsl-6.0.5.1/bin/bet: line 399:  1528 Segmentation fault      (core dumped) ${FSLDIR}/bin/bet2 $IN $OUT $bet2opts
Return code: 1

mgxd added this to the 21.0.2 milestone on Apr 21, 2022
@mgxd (Collaborator) commented Apr 21, 2022

@effigies @oesteban I think the latest niworkflows changes have fixed this. Are we fine with merging (after removing the random seed) and cutting 21.0.2?

mgxd force-pushed the diagnose-ds210-failure branch from 3e5b0f2 to 916022f on April 21, 2022 at 18:15
@mgxd (Collaborator) commented Apr 21, 2022

Merging since I'd like to get a release in before the weekend.

mgxd merged commit 1ec8149 into nipreps:maint/21.0.x on Apr 21, 2022
@effigies (Member Author) commented:

Thanks for going ahead. No objections.
