Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using convert to merge only two compartments causes either data loss or data mismatching #86

Closed
jenna-tomkinson opened this issue Aug 14, 2023 · 4 comments · Fixed by #183
Assignees
Labels
bug Something isn't working

Comments

@jenna-tomkinson
Copy link
Member

Problem

I have tried many different variations of convert to avoid this issue. I will provide one example in this issue.

To explain, I have a SQLite file acquired directly from CellProfiler but the data within the file is different from the traditional image-based analysis pipeline where there are three object compartments (nuclei, cells, and cytoplasm).

In the case of the project I am working on, I only have one object compartment which is nuclei along with whole image features. I tried using the convert function to merge the nuclei features and image features + metadata using the below code:

# Only select from the two compartments that we have and merge them
preset_join = """
SELECT
    *
FROM
    read_parquet('per_image.parquet') as per_image
INNER JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
    per_nuclei.Metadata_ImageNumber = per_image.Metadata_ImageNumber
"""

# merge single cells and output as parquet file
    convert(
        source_path=str(plate_folder),
        dest_path=str(output_path),
        dest_datatype=dest_datatype,
        metadata=["image"],
        compartments=["nuclei"],
        identifying_columns=["ImageNumber"],
        joins=preset_join,
        parsl_config=local_htex,
        chunk_size=10000
    )

The data is a 384 well plate with 4 FOVs per well, meaning there are a total of 1536 image sets. But everytime I have tried running the convert function with different parameter, I will get the correct number of rows of nuclei based on the SQLite database compartment "per_nuclei", but I will have missing Wells and ImageNumbers.

For example, I have a converted parquet plate loaded in a dataframe that shows 256,074 rows which matches the SQLite file:

image

This makes sense, but for a sanity check, I went to see if I had all of the wells and image numbers since when I skimmed the SQLite file I did not see any missing wells when looking at all of the nuclei. I see that I am missing a ton of rows and their corresponding images:

image
image

Using the wells was the easiest to use to go back into the SQLite file and check to see if there are any nuclei from that well. Using well O06, I went into the SQLite file and checked for what image numbers are associated with this well and use the image number to find if there were any nuclei found in that well. This is what I discovered when only looking at one of the 4 image numbers (4 FOVs) for that well:

image

There are a ton of nuclei from that well, which means that somehow through my parameters that the convert function is either not including the data or because we see the correct amount of rows/nuclei that cells are getting labelled wrong.

Solution

It would help to go through this in person to see what is going on since this is dataset specific. It could be because of the parameters for the query, but it is just not clear to me why this is happening.

@d33bs d33bs moved this to In Progress in SET Projects Aug 15, 2023
@d33bs
Copy link
Member

d33bs commented Aug 15, 2023

Thank you for raising this @jenna-tomkinson ! I believe this may relate to the JOIN type being used (INNER) and the dataset specifics (where Image and Nuclei tables do not contain all the same join key values). See responses under #85 (comment) . I hope to uncover more information from discussing this with you further.

@jenna-tomkinson
Copy link
Member Author

I wanted to update this issue as I have found this bug to happen when using CytoTable with the standard compartments. Please see my PR in the NF1 repo where I am making a fix for this bug: WayScience/nf1_schwann_cell_painting_data#35.

It seems like there is still an issue where extra rows with all NAs for each column are showing up in the final parquet file.

@d33bs d33bs added the bug Something isn't working label Mar 13, 2024
@d33bs d33bs self-assigned this Mar 14, 2024
@d33bs
Copy link
Member

d33bs commented Mar 14, 2024

Hi @jenna-tomkinson thanks for raising this issue and for adding to it recently! I took a moment to try and validate what you mentioned earlier in the issue description using the most recent version of the CytoTable code you referenced in WayScience/nf1_schwann_cell_painting_data#35 .

Mostly, I focused on your statements as the basis of troubleshooting what you mentioned:

But everytime I have tried running the convert function with different parameter, I will get the correct number of rows of nuclei based on the SQLite database compartment "per_nuclei", but I will have missing Wells and ImageNumbers.

I found I was not able to get the same number of rows in the CytoTable result as the SQLite Per_Nuclei table. I also found that there were no missing ImageNumbers, Image Well Numbers, or Nuclei ObjectNumbers. This seems to go against what you mentioned earlier in this issue. Within the below example, I focused only on unmodified CytoTable output to help isolate where the issue may be occurring. When you have the chance, could you confirm you see the same or help guide me to what might be missing?

Please see this Google Colab notebook or related Gist I created as a backup.

It seems like there is still an issue where extra rows with all NAs for each column are showing up in the final parquet file.

CytoTable provides a parameter, convert(..., drop_null=False) which will by default leave intact any null columns which are provided from data sources or created as a result of join operations. As an aside, join operations can sometimes lead to NA/Null column values and it might be necessary to adjust to avoid this. As a double check, is this a new element as part of this issue, or should we create a different issue to cover this aspect?

@d33bs
Copy link
Member

d33bs commented Apr 1, 2024

After discussing this further with @jenna-tomkinson (thanks Jenna!) it seems there is likely an issue with the preset SQL for joins which allows image table data with no matching compartment data be added to CytoTable output. This would only manifest when the image table has metadata which does not show up in the compartment tables. We believe this state of the source data (SQLite or CSV files) would only occur when CellProfiler processed an image but did not find compartment-level data.

This is an example of current SQL used for joins in CytoTable (modified based on this issue):

WITH Per_Image_Filtered AS (
    SELECT
        Metadata_ImageNumber,
        Image_Metadata_Well,
        Image_Metadata_Plate
    FROM
        read_parquet('per_image.parquet')
    )
SELECT
    *
FROM
    Per_Image_Filtered AS per_image
LEFT JOIN read_parquet('per_cytoplasm.parquet') AS per_cytoplasm ON
    per_cytoplasm.Metadata_ImageNumber = per_image.Metadata_ImageNumber
LEFT JOIN read_parquet('per_cells.parquet') AS per_cells ON
    per_cells.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
    AND per_cells.Metadata_Cells_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
    per_nuclei.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
    AND per_nuclei.Metadata_Nuclei_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Nuclei

The following would likely avoid the rows with NA's (by using the cytoplasm compartment table as the basis for all joined data).

WITH Per_Image_Filtered AS (
    SELECT
        Metadata_ImageNumber,
        Image_Metadata_Well,
        Image_Metadata_Plate
    FROM
        read_parquet('per_image.parquet')
    )
SELECT
    *
FROM
    read_parquet('per_cytoplasm.parquet') AS per_cytoplasm
LEFT JOIN Per_Image_Filtered AS per_image ON
     per_image.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
LEFT JOIN read_parquet('per_cells.parquet') AS per_cells ON
    per_cells.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
    AND per_cells.Metadata_Cells_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
    per_nuclei.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
    AND per_nuclei.Metadata_Nuclei_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Nuclei

To fix this I think we need to:

  • Add test data to match this scenario
  • Add a test which checks for this type of output
  • Change the presets so that the cytoplasm compartment table is used as the basis for all joins

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants