Using `convert` to merge only two compartments causes either data loss or data mismatching #86

jenna-tomkinson · 2023-08-14T21:37:09Z

Problem

I have tried many different variations of convert to avoid this issue. I will provide one example in this issue.

To explain, I have a SQLite file acquired directly from CellProfiler but the data within the file is different from the traditional image-based analysis pipeline where there are three object compartments (nuclei, cells, and cytoplasm).

In the case of the project I am working on, I only have one object compartment which is nuclei along with whole image features. I tried using the convert function to merge the nuclei features and image features + metadata using the below code:

# Only select from the two compartments that we have and merge them
preset_join = """
SELECT
    *
FROM
    read_parquet('per_image.parquet') as per_image
INNER JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
    per_nuclei.Metadata_ImageNumber = per_image.Metadata_ImageNumber
"""

# merge single cells and output as parquet file
    convert(
        source_path=str(plate_folder),
        dest_path=str(output_path),
        dest_datatype=dest_datatype,
        metadata=["image"],
        compartments=["nuclei"],
        identifying_columns=["ImageNumber"],
        joins=preset_join,
        parsl_config=local_htex,
        chunk_size=10000
    )

The data is a 384 well plate with 4 FOVs per well, meaning there are a total of 1536 image sets. But everytime I have tried running the convert function with different parameter, I will get the correct number of rows of nuclei based on the SQLite database compartment "per_nuclei", but I will have missing Wells and ImageNumbers.

For example, I have a converted parquet plate loaded in a dataframe that shows 256,074 rows which matches the SQLite file:

This makes sense, but for a sanity check, I went to see if I had all of the wells and image numbers since when I skimmed the SQLite file I did not see any missing wells when looking at all of the nuclei. I see that I am missing a ton of rows and their corresponding images:

Using the wells was the easiest to use to go back into the SQLite file and check to see if there are any nuclei from that well. Using well O06, I went into the SQLite file and checked for what image numbers are associated with this well and use the image number to find if there were any nuclei found in that well. This is what I discovered when only looking at one of the 4 image numbers (4 FOVs) for that well:

There are a ton of nuclei from that well, which means that somehow through my parameters that the convert function is either not including the data or because we see the correct amount of rows/nuclei that cells are getting labelled wrong.

Solution

It would help to go through this in person to see what is going on since this is dataset specific. It could be because of the parameters for the query, but it is just not clear to me why this is happening.

The text was updated successfully, but these errors were encountered:

d33bs · 2023-08-15T15:38:09Z

Thank you for raising this @jenna-tomkinson ! I believe this may relate to the JOIN type being used (INNER) and the dataset specifics (where Image and Nuclei tables do not contain all the same join key values). See responses under #85 (comment) . I hope to uncover more information from discussing this with you further.

jenna-tomkinson · 2024-02-20T16:53:51Z

I wanted to update this issue as I have found this bug to happen when using CytoTable with the standard compartments. Please see my PR in the NF1 repo where I am making a fix for this bug: WayScience/nf1_schwann_cell_painting_data#35.

It seems like there is still an issue where extra rows with all NAs for each column are showing up in the final parquet file.

d33bs · 2024-03-14T18:07:11Z

Hi @jenna-tomkinson thanks for raising this issue and for adding to it recently! I took a moment to try and validate what you mentioned earlier in the issue description using the most recent version of the CytoTable code you referenced in WayScience/nf1_schwann_cell_painting_data#35 .

Mostly, I focused on your statements as the basis of troubleshooting what you mentioned:

But everytime I have tried running the convert function with different parameter, I will get the correct number of rows of nuclei based on the SQLite database compartment "per_nuclei", but I will have missing Wells and ImageNumbers.

I found I was not able to get the same number of rows in the CytoTable result as the SQLite Per_Nuclei table. I also found that there were no missing ImageNumbers, Image Well Numbers, or Nuclei ObjectNumbers. This seems to go against what you mentioned earlier in this issue. Within the below example, I focused only on unmodified CytoTable output to help isolate where the issue may be occurring. When you have the chance, could you confirm you see the same or help guide me to what might be missing?

Please see this Google Colab notebook or related Gist I created as a backup.

It seems like there is still an issue where extra rows with all NAs for each column are showing up in the final parquet file.

CytoTable provides a parameter, convert(..., drop_null=False) which will by default leave intact any null columns which are provided from data sources or created as a result of join operations. As an aside, join operations can sometimes lead to NA/Null column values and it might be necessary to adjust to avoid this. As a double check, is this a new element as part of this issue, or should we create a different issue to cover this aspect?

d33bs · 2024-04-01T21:11:15Z

After discussing this further with @jenna-tomkinson (thanks Jenna!) it seems there is likely an issue with the preset SQL for joins which allows image table data with no matching compartment data be added to CytoTable output. This would only manifest when the image table has metadata which does not show up in the compartment tables. We believe this state of the source data (SQLite or CSV files) would only occur when CellProfiler processed an image but did not find compartment-level data.

This is an example of current SQL used for joins in CytoTable (modified based on this issue):

WITH Per_Image_Filtered AS (
    SELECT
        Metadata_ImageNumber,
        Image_Metadata_Well,
        Image_Metadata_Plate
    FROM
        read_parquet('per_image.parquet')
    )
SELECT
    *
FROM
    Per_Image_Filtered AS per_image
LEFT JOIN read_parquet('per_cytoplasm.parquet') AS per_cytoplasm ON
    per_cytoplasm.Metadata_ImageNumber = per_image.Metadata_ImageNumber
LEFT JOIN read_parquet('per_cells.parquet') AS per_cells ON
    per_cells.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
    AND per_cells.Metadata_Cells_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
    per_nuclei.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
    AND per_nuclei.Metadata_Nuclei_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Nuclei

The following would likely avoid the rows with NA's (by using the cytoplasm compartment table as the basis for all joined data).

WITH Per_Image_Filtered AS (
    SELECT
        Metadata_ImageNumber,
        Image_Metadata_Well,
        Image_Metadata_Plate
    FROM
        read_parquet('per_image.parquet')
    )
SELECT
    *
FROM
    read_parquet('per_cytoplasm.parquet') AS per_cytoplasm
LEFT JOIN Per_Image_Filtered AS per_image ON
     per_image.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
LEFT JOIN read_parquet('per_cells.parquet') AS per_cells ON
    per_cells.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
    AND per_cells.Metadata_Cells_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
    per_nuclei.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
    AND per_nuclei.Metadata_Nuclei_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Nuclei

To fix this I think we need to:

Add test data to match this scenario
Add a test which checks for this type of output
Change the presets so that the cytoplasm compartment table is used as the basis for all joins

d33bs added this to SET Projects Aug 15, 2023

d33bs moved this to In Progress in SET Projects Aug 15, 2023

This was referenced Aug 16, 2023

Move to PyArrow parquet writer for data extraction #87

Merged

Chunk_size parameter when using with preset + joins variables causes ordering of Metadata_ImageNumber column #85

Open

d33bs moved this from In Progress to Paused in SET Projects Aug 22, 2023

d33bs added the bug Something isn't working label Mar 13, 2024

d33bs self-assigned this Mar 14, 2024

d33bs mentioned this issue Apr 4, 2024

Use compartment tables as basis of preset joins #183

Merged

13 tasks

d33bs closed this as completed in #183 Apr 18, 2024

github-project-automation bot moved this from Paused to Done in SET Projects Apr 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using `convert` to merge only two compartments causes either data loss or data mismatching #86

Using `convert` to merge only two compartments causes either data loss or data mismatching #86

jenna-tomkinson commented Aug 14, 2023

d33bs commented Aug 15, 2023

jenna-tomkinson commented Feb 20, 2024

d33bs commented Mar 14, 2024 •

edited

Loading

d33bs commented Apr 1, 2024 •

edited

Loading

Using convert to merge only two compartments causes either data loss or data mismatching #86

Using convert to merge only two compartments causes either data loss or data mismatching #86

Comments

jenna-tomkinson commented Aug 14, 2023

Problem

Solution

d33bs commented Aug 15, 2023

jenna-tomkinson commented Feb 20, 2024

d33bs commented Mar 14, 2024 • edited Loading

d33bs commented Apr 1, 2024 • edited Loading

Using `convert` to merge only two compartments causes either data loss or data mismatching #86

Using `convert` to merge only two compartments causes either data loss or data mismatching #86

d33bs commented Mar 14, 2024 •

edited

Loading

d33bs commented Apr 1, 2024 •

edited

Loading