
Preprocessing #8

Merged: 11 commits into main on May 18, 2023
Conversation

LorenzLamm (Collaborator):

This branch implements the following:

  1. Pixel size matching: Fourier cropping / Fourier extension to achieve the specified tomogram pixel size.
    For both cropping and extension, an ellipsoid mask with cosine decay to zero is applied to avoid artifacts.
  2. Spectral matching: adapted from the implementation of DeePiCt (https://github.com/ZauggGroup/DeePiCt/tree/main/spectrum_filter). I adjusted some details to avoid artifacts and division by values close to zero.

Let me know what you think :)

LorenzLamm (author):

Minor adjustments to normalize tomograms and return pixel size from the header.

LorenzLamm (author):

Found a small error in the augmentation script: if "prob_to_one" (i.e. maximum augmentations) is specified, it should apply the flipping with probability 0.5 instead of 1.0. Otherwise the flipping always happens and is redundant.
(Probably I should have added this in another branch, but I hope it's also okay here)
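A minimal sketch of the corrected behaviour (function and argument names here are illustrative, not the actual augmentation code): in maximum-augmentation mode, each flip is sampled with probability 0.5, since a flip that always happens adds no augmentation diversity.

```python
import numpy as np


def sample_flip_axes(prob_to_one: bool, p: float = 0.3):
    """Decide which of the three axes to flip for one augmentation call.

    With prob_to_one (maximum-strength augmentation), each flip is applied
    with probability 0.5 rather than 1.0: always flipping is deterministic
    and therefore redundant as an augmentation.
    """
    flip_p = 0.5 if prob_to_one else p
    return [axis for axis in range(3) if np.random.rand() < flip_p]
```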

alisterburt (Contributor):

all good here for sure!

LorenzLamm (author):

Another small training adjustment (sorry for placing it here).
This multiplication by the number of elements leads to the correct computation of training loss for logging.
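The logging fix can be illustrated with a small sketch (hypothetical helper, not the actual training code): a per-batch mean loss must be re-weighted by its element count before averaging over the epoch, otherwise batches of different sizes are weighted incorrectly.

```python
def epoch_loss(batch_losses, batch_sizes):
    """Average per-element loss over an epoch.

    Each batch loss is a mean over its elements, so it is re-weighted by the
    number of elements before summing; dividing by the total element count
    then gives the true epoch mean.
    """
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)
```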

alisterburt (Contributor):

all good!

LorenzLamm (author):

This is copied from the DeePict repository. I'm not sure how to properly credit them. I copied these comments to the top of the file and hope that is okay. Is it?

The script opens the tomogram, normalizes it, extracts the rotationally averaged Fourier spectrum, and stores it in a csv file.
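The rotational averaging step can be sketched like this (hypothetical NumPy code, not the DeePiCt implementation): amplitudes of the centered FFT are binned into integer-radius shells and averaged per shell.

```python
import numpy as np


def radial_average_3d(volume: np.ndarray) -> np.ndarray:
    """Rotationally averaged amplitude spectrum of a 3D volume."""
    amplitudes = np.abs(np.fft.fftshift(np.fft.fftn(volume)))
    center = np.array(volume.shape) // 2
    zz, yy, xx = np.indices(volume.shape)
    radii = np.sqrt(
        (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    ).astype(int)
    # Mean amplitude in each integer-radius shell, up to the Nyquist radius.
    sums = np.bincount(radii.ravel(), weights=amplitudes.ravel())
    counts = np.maximum(np.bincount(radii.ravel()), 1)
    nyquist = min(volume.shape) // 2
    return (sums / counts)[: nyquist + 1]
```

The resulting 1D vector is what would be written to the csv file.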

LorenzLamm (author):

This is the script for matching pixel sizes.
It first computes the output shape. Then, depending on whether the output pixel size is smaller or larger than the input pixel size, it performs either Fourier cropping (cropping Fourier components to the shape of the output tomogram) or Fourier extension (padding the Fourier space with zeros to reach the desired shape).
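The crop-or-extend decision can be sketched as follows (hypothetical NumPy code; a mean-preserving rescale is included, while the ellipsoid cosine mask used in the PR to suppress artifacts is omitted for brevity):

```python
import numpy as np


def match_pixel_size(volume, pixel_size_in, pixel_size_out):
    """Resample by centered cropping or zero-padding of the Fourier transform."""
    out_shape = tuple(
        int(round(s * pixel_size_in / pixel_size_out)) for s in volume.shape
    )
    ft = np.fft.fftshift(np.fft.fftn(volume))
    resized = np.zeros(out_shape, dtype=ft.dtype)
    # Overlap region: centered crop (output smaller) or centered pad (larger).
    src_slices, dst_slices = [], []
    for s_in, s_out in zip(ft.shape, out_shape):
        n = min(s_in, s_out)
        src_start = (s_in - n) // 2
        dst_start = (s_out - n) // 2
        src_slices.append(slice(src_start, src_start + n))
        dst_slices.append(slice(dst_start, dst_start + n))
    resized[tuple(dst_slices)] = ft[tuple(src_slices)]
    out = np.fft.ifftn(np.fft.ifftshift(resized)).real
    # Rescale so mean intensity is preserved after changing the element count.
    return out * (np.prod(out_shape) / np.prod(volume.shape))
```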

LorenzLamm (author):

This is the interface script for matching the spectrum of two tomograms. It is also adjusted from the DeePict repository.

It first loads the tomogram and the target spectrum (csv) and then matches the input tomogram spectrum to the target spectrum.

I adjusted the script to additionally accept arguments "--shrink_excessive_value" to avoid too large matching vector components and "--almost_zero_cutoff", which cuts off the matching vector at the first Fourier coefficient that is below 0.1.
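A simplified sketch of how those two safeguards could act on a 1D equalization vector (illustrative code, not the adjusted DeePiCt script; the epsilon guarding the division is an assumption):

```python
import numpy as np


def match_spectrum_vector(input_spec, target_spec,
                          almost_zero_cutoff=0.1, shrink_excessive_value=50.0):
    """Per-frequency equalization factors mapping input to target spectrum."""
    equal_v = target_spec / np.maximum(input_spec, 1e-12)
    # Zero everything from the first input coefficient below the cutoff:
    # dividing by near-zero input values would otherwise explode the factors.
    low = np.nonzero(input_spec < almost_zero_cutoff)[0]
    if low.size:
        equal_v[low[0]:] = 0.0
    # Clamp any remaining excessively large factors.
    equal_v[equal_v > shrink_excessive_value] = shrink_excessive_value
    return equal_v
```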

LorenzLamm (author):

Copied from the DeePict repository.
Some utilities to deal with the spectrum vector.

LorenzLamm (author):

Utils for the pixel size matching:

  • Fourier cropping
  • Fourier extension
  • ellipsoid mask (maximally fit into tomogram shape with certain border)
  • smooth cosine decay
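A hypothetical sketch of such an ellipsoid mask with cosine decay (illustrative, not the PR's utils): values are 1 inside the largest ellipsoid that fits within the given border, fall to 0 over the border width with a half-cosine, and stay 0 outside.

```python
import numpy as np


def ellipsoid_cosine_mask(shape, border=4):
    """Soft ellipsoid mask maximally fitting the volume inside a border."""
    center = (np.array(shape) - 1) / 2
    radii = np.array(shape) / 2 - border
    grids = np.indices(shape)
    # Normalized ellipsoid radius: 1.0 on the inner ellipsoid surface.
    r = np.sqrt(sum(((g - c) / rad) ** 2
                    for g, c, rad in zip(grids, center, radii)))
    # Map radius 1 .. 1+decay to a smooth cosine falloff from 1 to 0.
    decay = border / radii.min()
    t = np.clip((r - 1.0) / decay, 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * t))
```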

LorenzLamm (author):

This is also mainly copied from DeePict, but adjusted in some cases to avoid artifacts.

almost_zero_cutoff_value = np.maximum(
    np.minimum(np.min(almost_zeros_input) - 4, np.min(almost_zeros_target) - 4), 0
)

LorenzLamm (author):

This cuts off the FFT values from the first value that is below 0.1 in the input spectrum. The target spectrum will be divided by the input spectrum, so low values in the input spectrum can be problematic.

try:
    equal_v[cutoff:] = 0
except IndexError:
    warnings.warn("Flat cutoff is higher than maximum frequency")

LorenzLamm (author):

I adjusted this smoothing as well: it is normally used for a sigmoid decay to zero starting at the "cutoff" value. However, I think it is more intuitive if all values after the cutoff are zero, so I shifted the sigmoid to smaller values and set all values above the cutoff to 0.


if shrink_excessive_value:
    equal_v[equal_v > shrink_excessive_value] = shrink_excessive_value
# Create the equalization kernel

LorenzLamm (author):

This is another safety check to avoid excessively large factors (by default, limited to shrink_excessive_value=50).

LorenzLamm marked this pull request as ready for review on May 16, 2023, 13:45.
alisterburt (Contributor) left a comment:

hey @LorenzLamm

First: this is really awesome, I'm glad we have this functionality in here. The PR is too big to comment on everything individually without iteration becoming slow and frustrating, so instead I suggest we merge this and I will provide some overarching comments that can guide improvements as we move forward.

  • x/y/z ordering of images is a little strange, could you explain what's going on there? :)

  • matching utils should probably be a subpackage of the preprocessing package; if writing this myself, I would probably use the following organisation:

  • preprocessing

    • pixel_size_matching
      • _cli.py
      • match_pixel_size.py
    • amplitude_spectrum_matching
      • _cli.py
      • match_amplitude_spectrum.py

Rather than .py scripts which have to be located/added to the path/executed, you can install scripts during package installation automatically with the project.scripts block in the pyproject.toml file; here is a PR to a different project where I template/discuss this for someone else: bbarad/ETSegTools#1

Discussed in the PR above, I really like using Typer for turning a simple type annotated function into a script which can be executed from the command line - worth trying!
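As a sketch of that pattern (hypothetical module, function, and parameter names, not the actual membrain-seg code): a type-annotated function plus typer.run gives a command line interface, and a [project.scripts] entry in pyproject.toml installs it as a console command.

```python
# Hypothetical sketch of a Typer-based CLI entry point; the function and
# parameter names are illustrative, not the actual membrain-seg code.
import typer


def match_pixel_size_cli(
    input_tomogram: str,
    output_tomogram: str,
    pixel_size_out: float = 10.0,
) -> None:
    """Fourier-crop or -extend a tomogram to the target pixel size."""
    # Real work would happen here; this sketch only echoes the parsed arguments.
    typer.echo(f"{input_tomogram} -> {output_tomogram} @ {pixel_size_out} A/px")


if __name__ == "__main__":
    # Turns the annotated function into a command line interface.
    typer.run(match_pixel_size_cli)
```

With a pyproject.toml entry such as match_pixel_size = "membrain_seg.preprocessing.pixel_size_matching._cli:match_pixel_size_cli" (path hypothetical), pip install then exposes the command directly.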

In general it would be great to have more explicit function names e.g. radial_average_3d rather than rad_avg

Does the spectrum matching take a while because of the large fft it computes or is it no big deal? If so we might consider calculating the sum of spectra over a number of smaller 3D patches to do the estimation - this can also be used to increase signal by taking overlapping patches

Before:

def load_tomogram(filename, return_header=False, normalize_data=False):

After:

def load_tomogram(
    filename, return_pixel_size=False, return_header=False, normalize_data=False
):
    """
    Loads data and transposes s.t. we have data in the form x,y,z.

alisterburt (Contributor):


I hadn't noticed this before - could you explain why the transpose is necessary?

LorenzLamm (author):

The mrcfile package loads tomograms by default in (z, y, x) order if I remember correctly. It's not really necessary to transpose the axes; I just find it more intuitive to have them ordered (x, y, z). Removing this transpose should not affect any functionality, though.

alisterburt (Contributor):

all good here for sure!

On src/membrain_seg/dataloading/memseg_augmentation.py (outdated):
alisterburt (Contributor):

all good!

alisterburt (Contributor):

to be clear - the path forward here is to merge and iterate! 🙂

Simplify redundant condition

Co-authored-by: alisterburt <alisterburt@gmail.com>
LorenzLamm (author):


Thanks a lot for your suggestions. I'll definitely try to implement them in the next iteration. The project.scripts block and the Typer command line interface sound like they can make the whole package much more convenient to use! :)

Regarding the timing of the FFT: it does indeed take a while to compute the FFTs for the large tomograms. So you would propose a sliding window approach for extraction / matching of the frequencies? Since FFT scales with O(n*log(n)), it should be more efficient to compute many smaller FFTs than the FFT of the entire volume, right?
But is the transform then still roughly equivalent?
I guess I'll do some experiments on this! :)

@alisterburt
Copy link
Contributor

Regarding the timing of the FFT: it does indeed take a while to compute the FFTs for the large tomograms. So you would propose a sliding window approach for extraction / matching of the frequencies? Since FFT scales with O(n*log(n)), it should be more efficient to compute many smaller FFTs than the FFT of the entire volume, right?
But is the transform then still roughly equivalent?

I wouldn't match on a window directly, I would average the FFTs over the sliding windows then do my spectrum matching on that average spectrum - dealing with the smaller spectrum should be a little easier and if needed the FFTs of the windows could be evaluated in parallel.
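A sketch of that patch-averaging idea (hypothetical NumPy code using non-overlapping patches for simplicity, though overlapping patches would add signal as noted above):

```python
import numpy as np


def patch_averaged_spectrum(volume, patch=64):
    """Average the amplitude spectra of non-overlapping cubic patches.

    A stand-in for one huge FFT: each patch spectrum is smaller and noisier,
    but their mean gives a stable estimate to match against, and the patch
    FFTs could be evaluated in parallel.
    """
    acc = np.zeros((patch, patch, patch))
    count = 0
    zs, ys, xs = (s // patch for s in volume.shape)
    for i in range(zs):
        for j in range(ys):
            for k in range(xs):
                block = volume[i * patch:(i + 1) * patch,
                               j * patch:(j + 1) * patch,
                               k * patch:(k + 1) * patch]
                acc += np.abs(np.fft.fftn(block))
                count += 1
    return acc / count
```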

A quick look at complexity suggests batching should be quicker (n*log(n/8) vs n*log(n)), but actual tests don't seem to show a huge benefit:

a = torch.rand((256, 256, 256))
b = torch.rand((8, 128, 128, 128))
%timeit torch.fft.fftn(a, dim=(-3, -2, -1))
%timeit torch.fft.fftn(b, dim=(-3, -2, -1))
148 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
117 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

LorenzLamm merged commit 09571cb into main on May 18, 2023.