Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merfish #33

Merged
merged 18 commits into from
Jun 20, 2023
Merged

Merfish #33

merged 18 commits into from
Jun 20, 2023

Conversation

quentinblampey
Copy link
Contributor

Adding a MERFISH reader (latest vizgen folder structure, with parquet boundary files).

Comments:

  • Vigzen has a post-processing tool called VPT (https://vizgen.github.io/vizgen-postprocessing/) that helps the user improve the default segmentation. Since it's very flexible, the user can define in which folder the new files will be stored, and can also define a new name for each file. For now, I thought that the easiest way to deal with that is: (i) if the user is using the default segmentation, then vpt_outputs remains None as default ; (ii) if the user is using VPT with standard file names, it only has to provide a directory path as vpt_outputs ; (iii) if the user has chosen new filenames for VPT, then vpt_outputs is a dict with all the file paths.
  • Sometimes, vizgen data contains some "blank" gene names/fields (but still has some counts), so I moved such fields into adata.obsm to not disturb the analysis performed on adata.X
  • The images are pseudo-3D because its composed of seven 2D-images separated by few microns on the z-axis. For now, I made multiple 2D-images, whose name are z<i> where <i> is the z-layer index.

Important note:
For now, it seems that the points are not aligned on the images/shapes when using napari-spatialdata. The transformation seems correct and work perfectly fine for shapes, but not for points. I tried switching the axis, inversing the transformation, but didn't fixed the issue. I don't understand why it works for shapes but not for points although they are defined on the same space and have the same x/y coordinates range. (Currently discussing about this with @LucaMarconato)

@codecov-commenter
Copy link

codecov-commenter commented Apr 20, 2023

Codecov Report

Merging #33 (71f075a) into main (664fa44) will decrease coverage by 1.02%.
The diff coverage is 43.01%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #33      +/-   ##
==========================================
- Coverage   42.98%   41.96%   -1.02%     
==========================================
  Files          14       16       +2     
  Lines         663      853     +190     
==========================================
+ Hits          285      358      +73     
- Misses        378      495     +117     
Impacted Files Coverage Δ
src/spatialdata_io/readers/merscope.py 27.39% <27.39%> (ø)
src/spatialdata_io/__init__.py 100.00% <100.00%> (ø)
src/spatialdata_io/_constants/_constants.py 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

@LucaMarconato
Copy link
Member

Thanks, look amazing, and I think that the use case around vpt_outputs is a very handy one.

Copy link
Member

@giovp giovp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @quentinblampey thank you so much for this, this is super useful and looks really good! apologies that it took us so long to get back to you.

A major question I have regarding the scope is the name of the module/function. As in, if it's only specific to merscope type of platform, that I would rename everything to merscope as it is done for the other platforms (we call it curio, not slide-seq for example).

if it's more general then just merscope, then I would split it in 2 modules, one for merscope, and one for general merfish (some pointers would be useful to what kind of tech that would be). They could ofc share majority of the code (if not all) but the user would see 2 separate functions.

wdyt ? also @LucaMarconato @kevinyamauchi

also, looks like most the pre-commits missing are type errors, let me know if you need any help in tackling those as they can be quite painful.

thanks again and looking forward to get this in!

src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merfish.py Outdated Show resolved Hide resolved
@quentinblampey
Copy link
Contributor Author

Hey @giovp, thanks for your comments on the PR! I will make a new commit soon

Yep, the function can be renamed "merscope" (the reader is indeed specific to the merscope). As far as I know, only the merscope is merfish-based, so this was why I used this name, but I agree that this is ambiguous (and new merfish technologies may be developed) :)

@giovp
Copy link
Member

giovp commented Jun 12, 2023

Hey @giovp, thanks for your comments on the PR! I will make a new commit soon

Yep, the function can be renamed "merscope" (the reader is indeed specific to the merscope). As far as I know, only the merscope is merfish-based, so this was why I used this name, but I agree that this is ambiguous (and new merfish technologies may be developed) :)

sounds great! I would then rename also the module and follow same structure as the one for other readers. Thanks a lot again for this and looking forward to getting this in!

@quentinblampey
Copy link
Contributor Author

Hi again @giovp, I renamed the function and fixed the mypy/flake8 pre-commit errors
We now still have to decide on the three remaining comments you did above

Another thing: for now I supposed that all cell boundary polygons were the same on all z-levels, and I save only one ShapesModel instance for all z-layers. Indeed, all the seven z-level are very similar, so I personally always perform a segmentation that is common to all z-layers. But I don't know if some people run a segmentation on each z-layer separately (to aggregate the polygons in a 3D shape), and I don't know if spatialdata can handle 3D boundaries?

@giovp
Copy link
Member

giovp commented Jun 13, 2023

Hi @quentinblampey , spatialdata cannot support 3d boundaries on shapes. What is a good resolution to save the polygons at a fixed z layer? I wonder if all of them should be loaded/saved?

@quentinblampey
Copy link
Contributor Author

Hi @quentinblampey , spatialdata cannot support 3d boundaries on shapes. What is a good resolution to save the polygons at a fixed z layer? I wonder if all of them should be loaded/saved?

Ok, I was just anticipating if some users want to use 3D boundaries. Maybe it's okay to only save the polygons of one z-layer (i.e., what I currently implemented)
Indeed, the segmentation is 2D by default, and I don't know if people are really making 3D boundaries on MERSCOPE data.

Also, I'm not sure I understood your first question, what do you mean exactly?

@giovp
Copy link
Member

giovp commented Jun 13, 2023

ah sorry, I meant indeed what is the best z stack to choose for the 2D polygons, which I understand it is what you are doing now. I replied in line to the comment, I think it makes sense to alow user to selectively load parts, all or None of the image, what do you think>?

@quentinblampey
Copy link
Contributor Author

As you suggested @giovp, I added an argument to choose between no image, or one z-layer, or a list of z-layers
By default, I load only images of z_layer=3. I think this is a good default value, as this is also the one by default in the official Vizgen visualizer

For the transcripts, I also now use only one 2D PointsModel object

And for the polygons, I supposed that users are not performing 3D boundaries. In this case, the vigzen preprocessing tool simply duplicates the same polygons on all z-layers, so I just select the polygons from z_layer=0 (actually, I'm not even sure that the vizgen segmentation can handle/produce 3D boundaries, so it should be fine for now)

NB: why did mypy complained about the python 3.10 union typing (with | instead of Union) while other readers use the 3.10 syntax?

@giovp giovp self-requested a review June 19, 2023 13:25
Copy link
Member

@giovp giovp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks really good, some very minor things on documentation.

Genral question, is there any way we can get a unique id for the dataset (similar to 10x genomics library_id?). this way it would be easier to name elements in a unique way that then is useful when concatenated. E.g. right now shapes are defined like this

shapes = {"polygons": polygons}

it would be better to have something like this

shapes = {f"{dataset_id}_polygons": polygons}

if it's not possible to infer it from the dataset, maybe we can ask the user to input it as an argument? which then will be used to populate the name of all elements (e.g. f"{dataset_id}_images" etc). What do you think?

Regarding mypy, I think you were missing a from future import annotations, can you try with that one added?

src/spatialdata_io/readers/merscope.py Show resolved Hide resolved
src/spatialdata_io/readers/merscope.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merscope.py Outdated Show resolved Hide resolved
src/spatialdata_io/readers/merscope.py Outdated Show resolved Hide resolved
@quentinblampey
Copy link
Contributor Author

Indeed it's great to add a dataset_id!
We can easily infer it given the provided directory path. This path typically looks like 202305021210_XXX_VMSC09302/region_0, where region_0 is the name of the region of interest, and 202305021210_XXX_VMSC09302 is the name of the slide (i.e. one MERSCOPE run).
Most of the time, there is only one region per slide, but it's possible to have multiple (for instance, it's possible to have multiple patients for one MERSCOPE run). In case there are multiple regions, it raises another question: should we load all the regions at once? For now, I ask for the region directory, but I can ask for the slide directory and iterate over all "region_X"

I added arguments for the region and slide name, but by default, I try to infer it using the provided directory path
Then, I simply define dataset_id = f"{slide_name}_{region_name}"

Thanks for helping me solve the mypy issue :)

@giovp
Copy link
Member

giovp commented Jun 19, 2023

great! happy to get it merged, one very last thing, can you add merscope to the API docs, see in docs/api.md you can also see how docs are rendered in this PR if you click to details int he rtd CI

image

it might be that they fail cause a newline is missing for the bulletpoint in the params. I find that bulletpoint usually works if they are indetened and with 1 line space before and after the bulletpoint block

@quentinblampey
Copy link
Contributor Author

Yep, this is added in the docs! I also updated the README so that the "MERSCOPE reader" is not in the "Coming soon" section anymore

I checked the docs, and it looks all good to me 👍

Copy link
Member

@giovp giovp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fantastic work @quentinblampey ! LGTM!

@giovp giovp merged commit 07053ae into scverse:main Jun 20, 2023
@LucaMarconato
Copy link
Member

LucaMarconato commented Jul 14, 2023

Thanks again @quentinblampey for the PR!

Important note:
For now, it seems that the points are not aligned on the images/shapes when using napari-spatialdata. The transformation seems correct and work perfectly fine for shapes, but not for points. I tried switching the axis, inversing the transformation, but didn't fixed the issue. I don't understand why it works for shapes but not for points although they are defined on the same space and have the same x/y coordinates range. (Currently discussing about this with @LucaMarconato)

I just found out the cause, it was a bug in PointsModels, not in the transformations code. This is the code of PointsModel.parse(), the bug is in one of the last lines, I explain it in a comment.

    @parse.register(pd.DataFrame)
    @parse.register(DaskDataFrame)
    @classmethod
    def _(
        cls,
        data: pd.DataFrame,
        coordinates: Mapping[str, str],
        feature_key: str | None = None,
        instance_key: str | None = None,
        transformations: MappingToCoordinateSystem_t | None = None,
        **kwargs: Any,
    ) -> DaskDataFrame:
        if "npartitions" not in kwargs and "chunksize" not in kwargs:
            kwargs["npartitions"] = cls.NPARTITIONS
        ndim = len(coordinates)
        axes = [X, Y, Z][:ndim]
        if isinstance(data, pd.DataFrame):
            table: DaskDataFrame = dd.from_pandas(  # type: ignore[attr-defined]
                pd.DataFrame(data[[coordinates[ax] for ax in axes]].to_numpy(), columns=axes), **kwargs
            )
            if feature_key is not None:
                feature_categ = dd.from_pandas(
                    data[feature_key].astype(str).astype("category"),
                    **kwargs,
                )  # type: ignore[attr-defined]
                table[feature_key] = feature_categ
        elif isinstance(data, dd.DataFrame):  # type: ignore[attr-defined]
            table = data[[coordinates[ax] for ax in axes]]
            table.columns = axes
            if feature_key is not None and data[feature_key].dtype.name != "category":
                table[feature_key] = data[feature_key].astype(str).astype("category")
        if instance_key is not None:
            table[instance_key] = data[instance_key]
		# BUG HERE: before the code was without ", *coordinates.keys()"
		# since coordinates was set to {'x': 'global_x', 'y': 'global_y'}, we had that table['x'] and table['y'] were 
		# correctly set to data['global_x'] and data['global_y'], but then the for loop below, was replacing table['x'] and table['y']
		# with data['x'] and data['y'], causing the misalignment of the data. I will make a PR for correcting this (also adding tests)
        for c in set(data.columns) - {feature_key, instance_key, *coordinates.values(), *coordinates.keys()}:
            table[c] = data[c]
        return cls._add_metadata_and_validate(
            table, feature_key=feature_key, instance_key=instance_key, transformations=transformations
        )

I will make a PR in spatialdata to fix the bug and I will merge the PR in spatialdata-sandbox to download and transform to Zarr the merfish data. Now everything works.

@quentinblampey
Copy link
Contributor Author

Thanks @LucaMarconato for fixing this, I confirm that now everything is aligned as intended!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants