-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using convert
to merge only two compartments causes either data loss or data mismatching
#86
Comments
Thank you for raising this @jenna-tomkinson ! I believe this may relate to the JOIN type being used (INNER) and the dataset specifics (where Image and Nuclei tables do not contain all the same join key values). See responses under #85 (comment) . I hope to uncover more information from discussing this with you further. |
I wanted to update this issue as I have found this bug to happen when using CytoTable with the standard compartments. Please see my PR in the NF1 repo where I am making a fix for this bug: WayScience/nf1_schwann_cell_painting_data#35. It seems like there is still an issue where extra rows with all NAs for each column are showing up in the final parquet file. |
Hi @jenna-tomkinson thanks for raising this issue and for adding to it recently! I took a moment to try and validate what you mentioned earlier in the issue description using the most recent version of the CytoTable code you referenced in WayScience/nf1_schwann_cell_painting_data#35 . Mostly, I focused on your statements as the basis of troubleshooting what you mentioned:
I found I was not able to get the same number of rows in the CytoTable result as the SQLite Per_Nuclei table. I also found that there were no missing ImageNumbers, Image Well Numbers, or Nuclei ObjectNumbers. This seems to go against what you mentioned earlier in this issue. Within the below example, I focused only on unmodified CytoTable output to help isolate where the issue may be occurring. When you have the chance, could you confirm you see the same or help guide me to what might be missing? Please see this Google Colab notebook or related Gist I created as a backup.
CytoTable provides a parameter, |
After discussing this further with @jenna-tomkinson (thanks Jenna!) it seems there is likely an issue with the preset SQL for joins which allows image table data with no matching compartment data be added to CytoTable output. This would only manifest when the image table has metadata which does not show up in the compartment tables. We believe this state of the source data (SQLite or CSV files) would only occur when CellProfiler processed an image but did not find compartment-level data. This is an example of current SQL used for joins in CytoTable (modified based on this issue): WITH Per_Image_Filtered AS (
SELECT
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('per_image.parquet')
)
SELECT
*
FROM
Per_Image_Filtered AS per_image
LEFT JOIN read_parquet('per_cytoplasm.parquet') AS per_cytoplasm ON
per_cytoplasm.Metadata_ImageNumber = per_image.Metadata_ImageNumber
LEFT JOIN read_parquet('per_cells.parquet') AS per_cells ON
per_cells.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
AND per_cells.Metadata_Cells_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
per_nuclei.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
AND per_nuclei.Metadata_Nuclei_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Nuclei The following would likely avoid the rows with NA's (by using the cytoplasm compartment table as the basis for all joined data). WITH Per_Image_Filtered AS (
SELECT
Metadata_ImageNumber,
Image_Metadata_Well,
Image_Metadata_Plate
FROM
read_parquet('per_image.parquet')
)
SELECT
*
FROM
read_parquet('per_cytoplasm.parquet') AS per_cytoplasm
LEFT JOIN Per_Image_Filtered AS per_image ON
per_image.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
LEFT JOIN read_parquet('per_cells.parquet') AS per_cells ON
per_cells.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
AND per_cells.Metadata_Cells_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Cells
LEFT JOIN read_parquet('per_nuclei.parquet') AS per_nuclei ON
per_nuclei.Metadata_ImageNumber = per_cytoplasm.Metadata_ImageNumber
AND per_nuclei.Metadata_Nuclei_Number_Object_Number = per_cytoplasm.Metadata_Cytoplasm_Parent_Nuclei To fix this I think we need to:
|
Problem
I have tried many different variations of convert to avoid this issue. I will provide one example in this issue.
To explain, I have a SQLite file acquired directly from CellProfiler but the data within the file is different from the traditional image-based analysis pipeline where there are three object compartments (nuclei, cells, and cytoplasm).
In the case of the project I am working on, I only have one object compartment which is nuclei along with whole image features. I tried using the
convert
function to merge the nuclei features and image features + metadata using the below code:The data is a 384 well plate with 4 FOVs per well, meaning there are a total of 1536 image sets. But everytime I have tried running the
convert
function with different parameter, I will get the correct number of rows of nuclei based on the SQLite database compartment "per_nuclei", but I will have missing Wells and ImageNumbers.For example, I have a converted parquet plate loaded in a dataframe that shows 256,074 rows which matches the SQLite file:
This makes sense, but for a sanity check, I went to see if I had all of the wells and image numbers since when I skimmed the SQLite file I did not see any missing wells when looking at all of the nuclei. I see that I am missing a ton of rows and their corresponding images:
Using the wells was the easiest to use to go back into the SQLite file and check to see if there are any nuclei from that well. Using well
O06
, I went into the SQLite file and checked for what image numbers are associated with this well and use the image number to find if there were any nuclei found in that well. This is what I discovered when only looking at one of the 4 image numbers (4 FOVs) for that well:There are a ton of nuclei from that well, which means that somehow through my parameters that the convert function is either not including the data or because we see the correct amount of rows/nuclei that cells are getting labelled wrong.
Solution
It would help to go through this in person to see what is going on since this is dataset specific. It could be because of the parameters for the query, but it is just not clear to me why this is happening.
The text was updated successfully, but these errors were encountered: