Move to PyArrow parquet writer for data extraction #87

d33bs · 2023-08-16T04:08:57Z

Description

This PR addresses concerns raised by @jenna-tomkinson in #85 and #86. There appears to be a problem when extracting certain data from SQLite through DuckDB and writing it directly to a file with DuckDB. This change keeps DuckDB for extraction (unchanged) and switches to PyArrow's Parquet table writer for writing to file (new). When testing with currently non-public datasets, I observed the correct number of unique ImageNumber keys using only an Image and Nuclei table with SQL inner joins (as described in the issues).

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

falquaddoomi

I'm unfortunately not familiar enough with either DuckDB or SQLite to evaluate the underlying difference, but the changes here look modest and reasonable.

gwaybio

Two minor questions, which may not require any changes and should not delay merging.

Do you need to update the documentation above the try statement (lines 311-313) to reflect the parquet writer change?
I imagine that the except clause would catch when the _duckdb_reader() fails, so probably no need to change this clause alongside the try logic change?

d33bs · 2023-08-18T16:13:53Z

Thanks @gwaybio and @falquaddoomi for the reviews! Addressing questions below:

> Do you need to update the documentation above the try statement (lines 311-313) to reflect the parquet writer change?

Thank you, I've updated this to be a bit more explicit, also moving around some of the mentions to appropriate sections of the code.

> I imagine that the except clause would catch when the _duckdb_reader() fails, so probably no need to change this clause alongside the try logic change?

Thank you for checking! Any exceptions would still emerge here when they're encountered. The "mixed-type" DuckDB exception traces to the nested select within the COPY functionality and this will persist in the new format as well (the SELECT query doesn't change). This functionality is tested within test_cell_health_cellprofiler_to_cytominer_database_legacy, which utilizes a dataset that includes a SQLite string value within a float column.
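As a small aside on how a string can end up inside a float column in SQLite at all: SQLite uses type affinity rather than strict column types, so text that cannot be converted to a number is stored as text even in a REAL column. The sketch below uses only the standard library; the table layout, column name, and the value "unknown" are made up for illustration and are not taken from the test dataset.

```python
import sqlite3

# In-memory SQLite database with a REAL (float-affinity) column.
sqlite_con = sqlite3.connect(":memory:")
sqlite_con.execute(
    "CREATE TABLE Nuclei (ImageNumber INTEGER, AreaShape_Area REAL)"
)

# The second row stores non-numeric text in the REAL column.
# SQLite accepts it because column types are affinities, not constraints.
sqlite_con.executemany(
    "INSERT INTO Nuclei VALUES (?, ?)",
    [(1, 10.5), (2, "unknown")],
)

# typeof() reports the storage class actually used for each value.
rows = sqlite_con.execute(
    "SELECT ImageNumber, AreaShape_Area, typeof(AreaShape_Area) "
    "FROM Nuclei ORDER BY ImageNumber"
).fetchall()
sqlite_con.close()

for row in rows:
    print(row)
```

A strictly typed reader that expects every value in that column to be a float has to reject or cast the text value, which is the kind of situation the "mixed-type" exception handling is there for.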

move to pyarrrow parquet write

efdb2ef

d33bs requested review from falquaddoomi and gwaybio August 16, 2023 04:08

falquaddoomi approved these changes Aug 17, 2023

gwaybio approved these changes Aug 17, 2023

updating docs surrounding data extract and excepts

f64ccd9

d33bs merged commit 162533a into cytomining:main Aug 18, 2023

d33bs deleted the mismatched-compartment-joins branch August 18, 2023 18:52
