chunk_size parameter when used with preset + joins variables causes unexpected ordering of the Metadata_ImageNumber column #85
Comments
Update: It looks like when using the chunk_size parameter, the rows are unsorted rather than truncated. Once I sorted by the "Metadata_ImageNumber" column, I could see that the values did not actually start from 695 and leave out all numbers before. But I am finding that I am missing ImageNumbers and wells. This may be a different issue, but I wanted to make sure it is known. I would prefer to see the image numbers in order and not starting at a random number.
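As a concrete version of the check described above, one can sort the converted output by the image number column; a minimal sketch, assuming pandas and a hypothetical output path:
import pandas as pd

# read the converted output (file name is a hypothetical placeholder)
df = pd.read_parquet("per_nuclei.parquet")

# sort by the image number column to inspect which values are present
df = df.sort_values("Metadata_ImageNumber")
print(df["Metadata_ImageNumber"].min(), df["Metadata_ImageNumber"].max())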
Thank you for opening this issue, @jenna-tomkinson! From what I read, you've highlighted two challenges in this issue. I'll try to organize these below and respond to them afterwards. Please don't hesitate to let me know if these align with what you're seeing. I'd be interested in getting your input on these.
I don't believe CytoTable currently sorts the values of the output. The reason for the mixed ordering is nondeterministic, unsorted chunked concatenation of smaller tables. The values aren't currently sorted because that would require additional complexity, and I wasn't certain how opinionated CytoTable should be here. That said, should the values be sorted in a particular way? If so, which column(s) should be used, and which sorting direction would be preferred (ascending, descending, etc.)? A pattern used in testing for sorting values and determining table equivalency is to sort by all columns ascending (see below).
from pyarrow import parquet

table = parquet.read_table("file.parquet")
sorted_table = table.sort_by(
    [(name, "ascending") for name in table.column_names]
)
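For the specific case in this issue, a narrower sort on just the image number column (name taken from the report above) would likely be enough:
sorted_table = table.sort_by([("Metadata_ImageNumber", "ascending")])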
When performing a SQL INNER JOIN, we're telling the engine to only return rows whose join key values exist in both tables. While I can't be certain without seeing the dataset, I suspect there may be differing ImageNumber values between the tables being joined. Could you reference the dataset you're using so we may troubleshoot this further here? Just in case, I wrote the following example using Pandas DataFrames and DuckDB SQL queries over those DataFrames to demonstrate the difference between SQL JOIN types. While these are DataFrames instead of SQLite tables being queried, the join behavior is the same.
import duckdb
import pandas as pd
# create an image dataframe from a dictionary
df_image = pd.DataFrame(
{
"ImageNumber": [1, 3, 4, 5],
"Image_Feature_x": [0.1, 0.3, 0.4, 0.5],
}
)
# create a nuclei dataframe from a dictionary
df_nuclei = pd.DataFrame(
{
"ImageNumber": [1, 2, 3, 4],
"Nuclei_Feature_y": [0.5, 0.4, 0.3, 0.2],
}
)
# show the initial dataframes
print(df_image, end="\n\n")
print(df_nuclei, end="\n\n")
# show the mismatching ImageNumber series
print(
"df_image.ImageNumber equals df_nuclei.ImageNumber: ",
df_image["ImageNumber"].equals(df_nuclei["ImageNumber"]),
end="\n\n",
)
# demonstrate the result of an inner join
print("Example INNER JOIN:")
sql_stmt = """
SELECT
image.ImageNumber,
image.Image_Feature_x,
nuclei.Nuclei_Feature_y
FROM df_image AS image
INNER JOIN df_nuclei AS nuclei ON
nuclei.ImageNumber = image.ImageNumber
"""
print(
duckdb.connect().execute(sql_stmt).df(),
end="\n\n",
)
# demonstrate the result of a left join
print("Example LEFT JOIN:")
sql_stmt = """
SELECT
image.ImageNumber,
image.Image_Feature_x,
nuclei.Nuclei_Feature_y
FROM df_image AS image
LEFT JOIN df_nuclei AS nuclei ON
nuclei.ImageNumber = image.ImageNumber
"""
print(
duckdb.connect().execute(sql_stmt).df(),
end="\n\n",
)
# demonstrate the result of a right join
print("Example RIGHT JOIN:")
sql_stmt = """
SELECT
image.ImageNumber,
image.Image_Feature_x,
nuclei.Nuclei_Feature_y
FROM df_image AS image
RIGHT JOIN df_nuclei AS nuclei ON
nuclei.ImageNumber = image.ImageNumber
"""
print(
duckdb.connect().execute(sql_stmt).df(),
end="\n\n",
)
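Given the ImageNumber values above (image: 1, 3, 4, 5; nuclei: 1, 2, 3, 4), the INNER JOIN returns only the shared values 1, 3, and 4; the LEFT JOIN also keeps image row 5 with NaN for Nuclei_Feature_y; and the RIGHT JOIN also keeps nuclei row 2 with NaN for Image_Feature_x. Those NaN-filled rows match the "extra rows where the nuclei features were all NaN" described in the original report.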
Thank you for the response! Regarding the sorting of the values in the ImageNumber column, that makes a lot of sense as to why the values aren't in ascending order! I didn't know that, and it isn't a huge deal; I was expecting ascending order, but I can easily reorder the rows afterwards. For the missing values issue, I have gone through the SQLite file many times, and I am seeing ImageNumbers associated with wells that are not included in the final conversion. It is even stranger because the number of rows matches that of the SQLite file, so how are there missing values? It is very confusing. The dataset comes from a private repository, and I had to rerun the analysis, so the data wouldn't be available even if I gave access. I think an in-person/remote discussion would help.
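One way to narrow down which image numbers go missing is to compare the distinct values on both sides of the conversion; a minimal sketch, assuming sqlite3 and pandas, where the file paths and the Per_Nuclei table name are hypothetical placeholders:
import sqlite3
import pandas as pd

# distinct image numbers in the source SQLite file
# (table and column names are assumptions for illustration)
with sqlite3.connect("analysis.sqlite") as con:
    sqlite_images = set(
        pd.read_sql("SELECT DISTINCT ImageNumber FROM Per_Nuclei", con)["ImageNumber"]
    )

# distinct image numbers in the converted parquet output
parquet_images = set(pd.read_parquet("per_nuclei.parquet")["Metadata_ImageNumber"])

# any values present in the source but absent from the output
print(sorted(sqlite_images - parquet_images))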
Moving forward, I propose dedicating focus to the mismatching or missing data in #86. This issue also contains relevant discussion beyond the scope of #86 which could turn into new issues.
Problem
When using the chunk_size parameter above 1000 with my code, the "ImageNumber" column in the per_nuclei.parquet file starts from 695 instead of 1. The file after concat therefore only contains image numbers from 695 to the last image number, although it has the correct number of nuclei rows compared to the SQLite file. I had to change from a LEFT JOIN to an INNER JOIN so that I could get the correct number of rows based on the SQLite file. Using LEFT or RIGHT caused chaos because it generated extra rows where the nuclei features were all NaN. Here is the code I was using:
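A minimal sketch of what such a call might look like, assuming CytoTable's convert API; the paths, preset name, and chunk_size value here are illustrative assumptions, not the author's original code:
import cytotable

# a sketch of the kind of call described above; paths, preset name,
# and chunk_size are illustrative placeholders
result = cytotable.convert(
    source_path="analysis.sqlite",
    dest_path="per_nuclei.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite_pycytominer",
    chunk_size=10000,
    # the report also mentions a custom joins value (INNER vs. LEFT JOIN),
    # omitted here because the original SQL was not part of this thread
)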
Solution
If I don't use the chunk_size parameter, the issue appears to be fixed. However, the run time goes from 1 minute to 8 minutes without the increased chunk_size. I am not sure what is going on at all.
Please feel free to change the title to make it more concise, and I am happy to explain and demonstrate with my dataset what is happening.