Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limited Use of Multithreading during SyN Registration (b-spline syn) #1017

Closed
NitramNimod opened this issue Jun 4, 2020 · 60 comments
Closed

Comments

@NitramNimod
Copy link

Hi!
When using b-spline syn during ANTs registration only one cpu core is working, while normal SyN behaves in the way it should behave, when properly setting and using the ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS variable. I can see in the resource monitor a short cpu work load spike right at the beginning of each convergence step, but 99% of the time it's single core processing.
This behavior is reproducible on a Mac running El Capitan (10.11.6) or an Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-53-generic x86_64).
The Mac is running the latest ANTs version (2.3.3.dev168-g29bdf) compiled this morning, on Ubuntu slightly older (2.3.1.dev120-g2f09e) compiled in March.
I've found an older similar thread which is already closed because the problem was solved by compiling against a newer ITK, which here obviously doesn't work. [https://github.com//issues/604]
As the b-spline syn takes a lot more time a multi-threading would be highly appreciated! :)

Thanks and best regards,
Martin

@ntustison
Copy link
Member

All multi-threading is handled by ITK. The SyN and B-spline SyN classes are derived from the same parent class with the only difference being the use of B-spline smoothing and that B-spline smoothing class has been there for ~10 years. Did you try compiling a fresh checkout directly on your machine?

@NitramNimod
Copy link
Author

Yes, I was using the small installation script that is offered on the ANTs website. I suppose, the git command there will checkout the latest working version. For the Mac a few hours ago…
There were no errors during compilation, everything else is working normal.
How to solve this? Or what can I do to help?

@ntustison
Copy link
Member

It's really hard to say what's going on because, as I said, multi-threading is handled by the ITK foundation. You might want to ask over at the ITK discussion forum.

@cookpa
Copy link
Member

cookpa commented Jun 9, 2020

I've not come across similar problems since that old thread. Happy to test again with a reproducible example. But my knowledge of the underlying ITK behavior is limited, so we'd probably have to ask over there.

@NitramNimod
Copy link
Author

Example in the sense of "data" or "parameters" for an antsRegistration-call?

@cookpa
Copy link
Member

cookpa commented Jun 10, 2020

A reproducible example would include data and code.

@NitramNimod
Copy link
Author

Ok, I've assembled a simple example, I just used the famous "anatomy of an antsRegistration call" call and changed the SyN stage to BSplineSyN:

antsRegistration --dimensionality 3 --float 0 --verbose 1 \
--output [T1w_to_T2_Flair_,T1w_to_T2_Flair_Warped.nii.gz] \
--interpolation Linear \
--winsorize-image-intensities [0.005,0.995] \
--use-histogram-matching 0 \
--initial-moving-transform [sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1] \
--transform Rigid[0.1] \
--metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
--convergence [1000x500x250x100,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox \
--transform Affine[0.1] \
--metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
--convergence [1000x500x250x100,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox \
--transform BSplineSyN[0.1,1.5,0,3] \
--metric CC[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,4] \
--convergence [100x70x50x20,1e-6,10] \
--shrink-factors 8x4x2x1 \
--smoothing-sigmas 3x2x1x0vox

As can be seen I'm trying to coregister a T1-weighted image with a T2 Flair image. For the linear stages the cpu workload always went to the allotted load, but during BSplineSyN it goes to one core on average, spiking in between to 2 or 3 cores.
Here's the data, stripped for skull/face and meta data.
sub-01_T1w_brain.nii.gz
sub-01_FLAIR_brain.nii.gz

Thank you very much, best regards,
Martin

@ntustison
Copy link
Member

Thanks for providing this. I just ran it on my machine using multithreading and the number of threads stayed the same through all 3 stages. How are you monitoring thread use?

@ntustison
Copy link
Member

However, I just noticed that it's running unexpectedly slower than typical and I noticed that you're using a spline distance of 1.5mm for human brains. Where did you get this parameter choice?

@gdevenyi
Copy link
Contributor

I can also confirm I'm seeing single-thread processing during the BSpline Stage:
image

@gdevenyi
Copy link
Contributor

And regular SyN uses all cores:
image

@ntustison
Copy link
Member

Well, this is one possible difference that might point towards an explanation. As I mentioned, all the threading is handled by ITK, so if this is indeed the culprit, there's really nothing we can do on the ANTs side.

@cookpa
Copy link
Member

cookpa commented Jun 11, 2020

Looks like many threads are being created, but they can't all run in parallel for some reason.

@ntustison
Copy link
Member

I take that back. If that were the case, then N4 should see a similar throttling and I haven't noticed anything but should probably check. @gdevenyi --- if you get a chance, can you just verify N4 isn't seeing something similar? And, then, can you run the following on just some random displacement field and see what the thread usage is like?

# Gaussian smoothing
$ SmoothDisplacementField 3 $displacementField $output 3.0 
# Bspline smoothing
$ SmoothDisplacementField 3 $displacementField $output 1x1x1 4

@cookpa
Copy link
Member

cookpa commented Jun 11, 2020

I adapted the script slightly. I found a similar result, multiple threads are spawned but only one is active. The CPU utilization goes up as I increase the spline spacing. I also turned on --float 1 to see if memory has an impact, but I don't think that made much difference.

export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4

if [[ ! -f T1w_to_T2_Flair_0GenericAffine.mat ]]; then
  $HOME/bin/ants-2.3.4/antsRegistration --dimensionality 3 --float 0 --verbose 1 \
    --output [T1w_to_T2_Flair_,T1w_to_T2_Flair_Warped.nii.gz] \
    --interpolation Linear \
    --winsorize-image-intensities [0.005,0.995] \
    --use-histogram-matching 0 \
    --initial-moving-transform [sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1] \
    --transform Rigid[0.1] \
    --metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
    --convergence [1000x500x250x100,1e-6,10] \
    --shrink-factors 8x4x2x1 \
    --smoothing-sigmas 3x2x1x0vox \
    --transform Affine[0.1] \
    --metric MI[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,32,Regular,0.25] \
    --convergence [1000x500x250x100,1e-6,10] \
    --shrink-factors 8x4x2x1 \
    --smoothing-sigmas 3x2x1x0vox
fi

$HOME/bin/ants-2.3.4/antsRegistration --dimensionality 3 --float 1 --verbose 1 \
  --output [T1w_to_T2_Flair_Spline,T1w_to_T2_Flair_SplineWarped.nii.gz] \
  --initial-moving-transform T1w_to_T2_Flair_0GenericAffine.mat \
  --transform BSplineSyN[0.1,32,0,3] \
  --metric CC[sub-01_T1w_brain.nii.gz,sub-01_FLAIR_brain.nii.gz,1,4] \
  --convergence [100x70x50x20,1e-6,10] \
  --shrink-factors 8x4x2x1 \
  --smoothing-sigmas 3x2x1x0vox

@ntustison
Copy link
Member

Just so people have an idea of the pieces involved:

I wrote much of the B-spline code in ITK except for the interpolator and original FFD-style image registration. All of the higher-level B-spline applications (e.g., N4, B-spline smoothing, B-splinev4/BsplineSyN image registration) use this filter as the underlying fitting/smoothing workhorse. I'm pretty sure that I remember correctly that this is the only filter that has explicit multithreading code. With SyN vs. B-spline SyN, although there are some differences in point-set handling, the only differences for the displacement field-only case is the actual smoothing. SyN uses Gaussian smoothing and BSplineSyN calls this analogous function which calls this filter which simply uses the B-spline workhorse mentioned above. As I mentioned above, N4 uses the same multi-threaded workhorse so if that's working properly, it's really puzzling to me that B-spline SyN wouldn't also work.

@gdevenyi
Copy link
Contributor

SmoothDisplacementField 3 $displacementField $output 3.0
SmoothDisplacementField 3 T1w_to_T2_Flair_1Warp.nii.gz smoothed.nii.gz 1x1x1 4

Both seem to alternate between all cores loaded and a single thread loaded.

The second does more than once, but I think that's because of the "4" levels, whereas the other call only has one.

@ntustison
Copy link
Member

I've tried to reproduce for a different number of threads. Attached are two screenshots during the affine stage and bsplinesyn stage where I've specified 4 threads and it looks as expected. So, @cookpa , since you use a Mac as well, I'm guessing you see "n" threads activated in the Threads column of the Activity Monitor but only see a single bar active on your CPU usage monitor, correct?
![Uploading affine.png…](Affine stage)
![Uploading bsplinesyn.png…](B-spline stage)

@gdevenyi
Copy link
Contributor

Looks like you hit the comment button before the picture uploads completed :)

@cookpa
Copy link
Member

cookpa commented Jun 11, 2020

I'm remoting to my mac, using ps -cM [pid]

USER    PID   TT   %CPU STAT PRI     STIME     UTIME COMMAND
pcook   871 s000  100.0 R    31T   0:01.94   5:04.04 antsRegistration
        871         0.0 S    31T   0:00.37   0:08.00 
        871         0.0 S    31T   0:00.37   0:08.10 
        871         0.0 S    31T   0:00.38   0:08.09 
        871         0.0 S    31T   0:00.37   0:08.03 

There's multiple threads but only one is running, and top shows 100% CPU usage.

@ntustison
Copy link
Member

![Uploading affine.png…](Affine stage)
bsplinesyn

bsplinesyn

@ntustison
Copy link
Member

@cookpa @gdevenyi Thanks.

@gdevenyi
Copy link
Contributor

Limiting the threading with ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4 does impact affine and SyN, but BSplineSyN is still one thread.

@ntustison
Copy link
Member

@gdevenyi I don't understand what you're saying. The second image is of B-spline SyN. Actually both images are of B-spline SyN (my mistake).

@gdevenyi
Copy link
Contributor

**When I change ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4 on my system, affine and SyN are properly adjusted, BSpineSyN is still single threaded.

@ntustison
Copy link
Member

@gdevenyi Okay, I wasn't disputing that. I was simply pointing out that I can't reproduce on my machine.

@gdevenyi
Copy link
Contributor

Okay, so I'm wondering if this is a platform/build system detail here.

I'm on Linux 18.04, with gcc-10.1. Are you using gcc or clang-as-gcc on OSX?

I've also tested ITK_GLOBAL_DEFAULT_THREADER=Platform and ITK_GLOBAL_DEFAULT_THREADER=Pool with the same results. I'll have to rebuild with TBB to see if its different, however TBB with ANTs still segfaults randomly so I don't bother hacking it in right now.

@cookpa
Copy link
Member

cookpa commented Jun 11, 2020

@ntustison using the script I posted I see CPU usage similar to yours. With the spacing of 1.5mm as in the original example, it hovers around 100%.

@ntustison
Copy link
Member

@gdevenyi

(base) [ntustison@Infinite-Resignation Thu Jun 11 11:13:06] $ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.0 (clang-1100.0.33.8)
Target: x86_64-apple-darwin19.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Okay, @cookpa , interesting. So with four threads it doesn't drop to a single CPU but you do see a disparity

Screen Shot 2020-06-11 at 11 16 44 AM

@gdevenyi
Copy link
Contributor

Will try a build with clang-11 here

@ntustison
Copy link
Member

SmoothDisplacementField wraps this functionality. Does it exhibit the same behavior (i.e., slow, single-threaded)? Be sure to specify a n-D vector (e.g., 10x10x10 for a 3-D displacement field) for the variance_or_mesh_size_base_level option or it'll smooth with a Gaussian kernel.

@cookpa
Copy link
Member

cookpa commented Oct 3, 2022

@chrisadamsonmcri what data and spline spacing are you using to observe the slowdown?

If I recall these experiments correctly, the appropriate number of threads were created but they did not execute in parallel as the spline resolution increased. Once the spline spacing got sufficiently small, the CPU usage approached 100% - making it appear to be single-threaded, but really all the threads were there, they were just running one at a time.

@chrisadamsonmcri
Copy link
Contributor

Here is more information about the call I used

antsRegistration -v -d 3 -u 1 --verbose 1 --float 1 --collapse-output-transforms 1 \
	--initial-moving-transform [${OUTPUTPREFIX}_alberts_to_native_affine_reg0GenericAffine.mat,0] \
	--transform BSplineSyN[0.25,5,0,3] \
	--metric CC[ ${OUTPUTPREFIX}_t2w_restore_brain_dilated.nii.gz,$TEMPLATEDIR/template-40_brain_dilated.nii.gz,0.2,2 ] \
 	--convergence [ 50x0,1e-6,100 ] --shrink-factors 2x1 --smoothing-sigmas 2x1vox --masks ${OUTPUTPREFIX}_brain_mask_dilated.nii.gz] \
	--output ${OUTPUTPREFIX}_alberts_to_native_skullstrip_reg

The moving image,$TEMPLATEDIR/template-40_brain_dilated.nii.gz is

dim1		117
dim2		159
dim3		126
dim4		1
pixdim1		0.859375
pixdim2		0.859375
pixdim3		0.859375

The fixed image ${OUTPUTPREFIX}_t2w_restore_brain_dilated.nii.gz is

dim1		182
dim2		243
dim3		177
dim4		1
pixdim1		0.629961
pixdim2		0.629961
pixdim3		0.629961

I can't send the specific images used for this test but I do get consistent behaviour across different images. Hope this helps.

@chrisadamsonmcri
Copy link
Contributor

chrisadamsonmcri commented Oct 3, 2022

@ntustison I ran SmoothDisplacementField on a warp that was previously computed with the same size as the fixed image in the previous message

dim1		182
dim2		243
dim3		177
dim4		1
pixdim1		0.629961
pixdim2		0.629961
pixdim3		0.629961
pixdim4		0.000000

I used the following call:

SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
Elapsed time: 30.6202
RMSE = 0.4206
  rmse[0] = 0.262028
  rmse[1] = 0.241287
  rmse[2] = 0.223518

I took 26x34x26 from the number of control points in my testing post from earlier.
Observing top there was a brief period at the start with >500% CPU utilisation but it dropped to 100% for the majority of execution.

Using fewer control points was much faster and remained more multithreaded:

SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 10x10x10
Elapsed time: 6.84312
RMSE = 0.743995
  rmse[0] = 0.413912
  rmse[1] = 0.436904
  rmse[2] = 0.43606

@ntustison
Copy link
Member

Okay, I'd like to focus on the multi-threading (elapsed time is correlated with the number of control points). What do you mean it "remained more multi-threaded"? The number of threads isn't explicitly coded to change during the execution of the filter.

@chrisadamsonmcri
Copy link
Contributor

chrisadamsonmcri commented Oct 3, 2022

A bit more information on what I mean by that.
I ran top while running the smoother.

At the very start of the smoothing step all threads are active, for about 0.1 seconds:

  15012 addo      20   0 2538072 567748  12368 S  28.0   0.9   0:00.85 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15004 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15005 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15006 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15007 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15008 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15009 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15010 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15015 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15017 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15019 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15020 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15021 addo      20   0 2538072 567748  12368 S  27.6   0.9   0:00.84 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15000 addo      20   0 2538072 567748  12368 S  27.3   0.9   0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15011 addo      20   0 2538072 567748  12368 S  27.3   0.9   0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15016 addo      20   0 2538072 567748  12368 S  27.3   0.9   0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15018 addo      20   0 2538072 567748  12368 S  27.3   0.9   0:00.83 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15003 addo      20   0 2538072 567748  12368 S  27.0   0.9   0:00.82 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15013 addo      20   0 2538072 567748  12368 S  27.0   0.9   0:00.82 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1War+ 
  15014 addo      20   0 2538072 567748  12368 S  20.7   0.9   0:00.63 /home/addo/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Wa

But after that there is only one active thread

 15249 addo      20   0 2538072 568056  12676 R  99.7   0.9   0:07.34 SmoothDisplacem...

It isnt the case that multiple threads are active and each one taking up very little CPU. Perhaps there is a loop that is running in serial? I'll try to put some timing information in the filter itself, will report back.

@ntustison
Copy link
Member

Perhaps there is a loop that is running in serial?

Nothing that should take significant time. The displacement field B-spline smoothing filter is basically a wrapper for the more generalized B-spline approximation filter. A for loop in the former iterates through the input field and collects the data into a point-set data structure which calls the latter for fitting and sampling of the B-spline object. In the latter, both the approximation part and reconstruction part should be explicitly multi-threaded.

@chrisadamsonmcri
Copy link
Contributor

It is this loop that is slow ITKv5/Modules/Filtering/ImageGrid/include/itkBSplineScatteredDataPointSetToImageFilter.hxx in BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::UpdatePointSet() at around line 990:

  while (ItIn != this->m_InputPointData->End())
  {
    
    PointType point;
    point.Fill(0.0);

    input->GetPoint(ItIn.Index(), &point);

    for (unsigned int i = 0; i < ImageDimension; ++i)
    {
      U[i] = static_cast<RealType>(totalNumberOfSpans[i]) * static_cast<RealType>(point[i] - this->m_Origin[i]) /
             (static_cast<RealType>(this->m_Size[i] - 1) * this->m_Spacing[i]);

      if (itk::Math::abs(U[i] - static_cast<RealType>(totalNumberOfSpans[i])) <= epsilon[i])
      {
        U[i] = static_cast<RealType>(totalNumberOfSpans[i]) - epsilon[i];
      }
      if (U[i] < NumericTraits<RealType>::ZeroValue() && itk::Math::abs(U[i]) <= epsilon[i])
      {
        U[i] = NumericTraits<RealType>::ZeroValue();
      }

      if (U[i] < NumericTraits<RealType>::ZeroValue() || U[i] >= static_cast<RealType>(totalNumberOfSpans[i]))
      {
        itkExceptionMacro("The collapse point component "
                          << U[i] << " is outside the corresponding parametric domain of [0, " << totalNumberOfSpans[i]
                          << ").");
      }
    }
    
    for (int i = ImageDimension - 1; i >= 0; i--)
    {
      if (Math::NotExactlyEquals(U[i], currentU[i]))
      {
        for (int j = i; j >= 0; j--)
        {
          this->CollapsePhiLattice(collapsedPhiLattices[j + 1], collapsedPhiLattices[j], U[j], j);
          currentU[j] = U[j];
        }
        break;
      }
    }

    this->m_OutputPointData->CastToSTLContainer()[ItIn.Index()] = collapsedPhiLattices[0]->GetPixel(startPhiIndex);
    ++ItIn;
    
  }

This runs in serial. I ran this command

~/dev/ANTs/ANTs-build/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
Smoothing starting
DisplacementFieldToBSplineImageFilter<TInputImage, TInputPointSet, TOutputImage>::GenerateData()
bsplinePhysicalDomainField: 0
if (this->m_EnforceStationaryBoundary && !this->m_UseInputFieldToDefineTheBSplineDomain): 33
if (inputField) 565
if (inputPointSet) 0
bspliner setup: 0
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData()
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); output->Allocate(): 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->AfterThreadedGenerateData() 954
this->UpdatePointSet(); for (unsigned int i = 0; i < ImageDimension; ++i) 0
this->m_InputPointData->Size() 8064504
this->UpdatePointSet(); while (ItIn != this->m_InputPointData->End()) 28935ms

The output is all the timings I added in. The final 28935ms is the total elapsed time of the loop.

@ntustison
Copy link
Member

This is interesting. I have a pretty good guess as to what's happening (although not the "why"). But first can you do me a favor and send me the screen dump from the following (with this image):

N4BiasFieldCorrection -d 3 -i mprage.nii -v 1

using ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=4?

@ntustison
Copy link
Member

As a follow-up, I'll explain what I think is happening here. In this section of code, the B-spline object is being evaluated at each input parametric point. This is a relatively computational intense process as one has to find the weighted average based on the local neighborhood control point values and corresponding B-spline basis functions evaluated at that parametric point. For this situation involving 3-D B-spline objects and third order B-splines, evaluation at each point involves the sum of 4 * 4 * 4 = 64 control point * basis function products. (Cf. for the corresponding function I wrote in ITK). One can imagine that doing this at each voxel in an image can be quite time consuming.

However, for computational purposes in the BSplineScatteredDataApproximation filter, we took advantage of the fact that the B-spline parametric domain (i.e., ITK image) is a rectilinear grid and iterating along the same row or column maintains constant parametric values in all but one of the parametric directions such that now evaluation requires only the weighted average 4 control point * basis function products with some minor overhead when we change slices or rows during our iterating. That overhead is what the CollapsePhiLattice function is doing. This check is what is used to determine if we've gone to the next rows/columns or slices necessitating recalculation of the basis functions. If this check is erroneously met at every single point, then evaluation will reduce to the naive (i.e., more intense) computation.

I'd like to see what running N4 produces given that the same issue would apply.

@chrisadamsonmcri
Copy link
Contributor

Will do. Is there any code that you would like timed or debug information to be printed?

@ntustison
Copy link
Member

Yeah, just let me know the elapsed time given at the end of the screen dump from the command posted above.

@chrisadamsonmcri
Copy link
Contributor

Elapsed time: 16.0856
CPU utilisation was about 350 consistently.

@cookpa
Copy link
Member

cookpa commented Oct 6, 2022

I also ran the N4 command with the above parameters, and it uses about 350% CPU with 4 threads on my Intel Mac (2.3 GHz Quad-Core Intel Core i7).

Elapsed time: 28.4026
N4BiasFieldCorrection -d 3 -i mprage_hippmapp3r.nii.gz -v 1  100.36s user 0.73s system 353% cpu 28.623 total

If I reduce the spline distance to 5mm, as used in the registration call from @chrisadamsonmcri above, the CPU utilization drops slightly, but isn't bad

Elapsed time: 16.6224
N4BiasFieldCorrection -d 3 -b [ 5 ] -c [ 100,0 ] -i mprage_hippmapp3r.nii.gz   54.78s user 0.69s system 328% cpu 16.862 total

But if I drop the spline resolution to 1.5mm (as in the OP's command at the top of this issue), the utilization is reduced more substantially.

Elapsed time: 45.3176
N4BiasFieldCorrection -d 3 -b [ 1.5 ] -c [ 100,0 ] -i mprage_hippmapp3r.nii.gz  86.11s user 2.91s system 195% cpu 45.566 total

@chrisadamsonmcri
Copy link
Contributor

Ok so I think I get what the optimisation is @ntustison

    for (int i = ImageDimension - 1; i >= 0; i--)
    {
      if (Math::NotExactlyEquals(U[i], currentU[i]))
      {
        for (int j = i; j >= 0; j--)
        {
          this->CollapsePhiLattice(collapsedPhiLattices[j + 1], collapsedPhiLattices[j], U[j], j);
          currentU[j] = U[j];
        }
        break;
      }
    }

In a lattice organisation of control points most of the CollapsePhiLattice evaluations will be skipped since when you move one row or one column 2 of the 3 U[j]'s will be unmodified so you skip it. I counted the times that happened and you get this:

~/dev/ANTs/ANTs-build-with-timings-for-bsplinesyn/ANTS-build/Examples/SmoothDisplacementField 3 blah_alberts_to_native_skullstrip_reg1Warp.nii.gz tmp.nii.gz 26x34x26
DisplacementFieldToBSplineImageFilter<TInputImage, TInputPointSet, TOutputImage>::GenerateData()
bsplinePhysicalDomainField: 0
if (this->m_EnforceStationaryBoundary && !this->m_UseInputFieldToDefineTheBSplineDomain): 31
if (inputField) 553
if (inputPointSet) 0
bspliner setup: 0
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData()
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); output->Allocate(): 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->AfterThreadedGenerateData() 949
this->UpdatePointSet(); for (unsigned int i = 0; i < ImageDimension; ++i) 0
this->m_InputPointData->Size(): 8064504
CollapsePhiLatticeEvaluations: 8064504
NotEqualEvaluations: 13125704
this->UpdatePointSet(); while (ItIn != this->m_InputPointData->End()) 28531
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->UpdatePointSet(); 28531
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData();   if (this->m_DoMultilevel) 0
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->m_GenerateOutputImage: 1
BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData(); this->SetPhiLatticeParametricDomainParameters(); 30
void BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::GenerateData() end 
bspliner->Update(); 29559
Elapsed time: 30.1587
RMSE = 0.4206
  rmse[0] = 0.262028
  rmse[1] = 0.241287
  rmse[2] = 0.223518

NotEqualEvaluations is the count of skipped evaluations which are indeed more than the number of points and indicative that the optimisation is being activated. However, it seems without multithreading performing that many evaluations is slow.

@ntustison
Copy link
Member

Hey, thanks for the additional information. Let me mentally process this info a bit more before I take a deeper dive. Right now, one obvious question might be why UpdatePointSet() is not multi-threaded. I wrote this code over 10 years ago and I'm having difficulty remembering why but given all the multi-threading I wrote for the surrounding code, I'm guessing there's a reason. I had a procedure today so I'll postpone any code writing in the short term---previous attempts at sedated code writing were not very successful.

@ntustison
Copy link
Member

Can you check this minor change on your end?

Modules/Filtering/ImageGrid/include/itkBSplineScatteredDataPointSetToImageFilter.hxx
diff --git a/Modules/Filtering/ImageGrid/include/itkBSplineScatteredDataPointSetToImageFilter.hxx b/Modules/Filtering/ImageGrid/include/itkBSplineScatteredDataPointSetToImageFilter.hxx
index 77b03198df..8f595b56be 100644
--- a/Modules/Filtering/ImageGrid/include/itkBSplineScatteredDataPointSetToImageFilter.hxx
+++ b/Modules/Filtering/ImageGrid/include/itkBSplineScatteredDataPointSetToImageFilter.hxx
@@ -269,10 +269,10 @@ BSplineScatteredDataPointSetToImageFilter<TInputPointSet, TOutputImage>::Generat
   multiThreader->SingleMethodExecute();
   this->AfterThreadedGenerateData();
 
-  this->UpdatePointSet();
-
   if (this->m_DoMultilevel)
   {
+    this->UpdatePointSet();
+
     this->m_PsiLattice->SetRegions(this->m_PhiLattice->GetLargestPossibleRegion());
     this->m_PsiLattice->Allocate();
     PointDataType P{};

On my machine

% echo $ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS 
4
% PrintHeader warp.nii.gz| head -n 8
 Spacing [0.629961, 0.629961, 0.629961]
 Origin [0, 0, 0]
 Direction 
1 0 0
0 1 0
0 0 1

 Size : 182 243  177
  • Current
% SmoothDisplacementField 3 warp.nii.gz testPre.nii.gz 10x10x10
Elapsed time: 7.50856
% SmoothDisplacementField 3 warp.nii.gz testPre.nii.gz 40x40x40
Elapsed time: 7.67891
% SmoothDisplacementField 3 warp.nii.gz testPre.nii.gz 80x80x80
Elapsed time: 7.97037
  • Proposed change
% SmoothDisplacementField 3 warp.nii.gz testPost.nii.gz 10x10x10   
Elapsed time: 6.89464
% SmoothDisplacementField 3 warp.nii.gz testPost.nii.gz 40x40x40
Elapsed time: 6.95888
% SmoothDisplacementField 3 warp.nii.gz testPost.nii.gz 80x80x80
Elapsed time: 7.04218

@ntustison
Copy link
Member

Okay, here are the links to a proposed update of the B-spline filter (file.h and file.hxx). There are two main improvements:

  1. Eliminate unnecessary calls to UpdatePointSet().

  2. Multi-thread UpdatePointSet() (Now called ThreadedGenerateDataForUpdatePointSetValues()).

I stand corrected in that I apparently wrote this code back in 2005. It's hard to remember what I was thinking back then but my guess is that I had in mind that the approximated point set values would also be an output and the inefficiencies are related to that. However, one doesn't need those values if only a single approximation level is being performed and that's the case when we run both N4 and BSplineSyN registration. And I have no idea why UpdatePointSet() was never multi-threaded.

I'll do some more testing and submit this to ITK hopefully early next week but would certainly welcome any feedback.

@chrisadamsonmcri
Copy link
Contributor

chrisadamsonmcri commented Oct 10, 2022

Success @ntustison . Here are benchmark results for the SmoothDisplacementField test that I posted about before. After the update it took 1.6 seconds compared to 30.2 seconds on a 5900X (24 threads). Also it seems to produce the same output given that it reproduces the same rmse values and the new version produces the same output file as the old one. I'm sure we have all done "sedated code writing" at some point in time.

Original:
Elapsed time: 30.2396
RMSE = 0.4206
rmse[0] = 0.262028
rmse[1] = 0.241287
rmse[2] = 0.223518

After update:
Elapsed time: 1.65599
RMSE = 0.4206
rmse[0] = 0.262028
rmse[1] = 0.241287
rmse[2] = 0.223518

@ntustison
Copy link
Member

Awesome. Glad it seems to be working now.

@gdevenyi
Copy link
Contributor

Thanks to @chrisadamsonmcri for running down this optimization!

@NitramNimod
Copy link
Author

Yeah, nice work @chrisadamsonmcri! Thank you all for all that effort.

@chrisadamsonmcri
Copy link
Contributor

Thank you all! Thanks to @ntustison for implementing the whole class and speedup. Implementing and speeding up BSplineSyN, N4 is a big deal.

@ntustison
Copy link
Member

Okay, I'm closing this issue as the fix was merged and I updated the ANTs ITK git tag.

@cookpa
Copy link
Member

cookpa commented Oct 12, 2022

I ran a regression test with antsRegistration - identical results before and after this fix.

@ntustison
Copy link
Member

Awesome. Thanks @cookpa .

@cookpa cookpa removed the unsolved mysteries Issues that require further investigation to diagnose label Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants