Move to PyArrow parquet writer for data extraction #87

Merged
merged 2 commits into from
Aug 18, 2023

@d33bs d33bs commented Aug 16, 2023

Description

This PR addresses concerns raised by @jenna-tomkinson in #85 and #86. There appears to be a problem when extracting certain data from SQLite through DuckDB and writing it directly to a file with DuckDB. This change keeps DuckDB for extraction (unchanged) and switches to PyArrow's Parquet table writer for writing to file (updated). In testing with currently non-public datasets, I observed the correct number of unique ImageNumber keys when using only an Image and Nuclei table with SQL inner joins (as described in the issues).

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@d33bs d33bs requested review from falquaddoomi and gwaybio August 16, 2023 04:08

@falquaddoomi falquaddoomi left a comment

I'm unfortunately not familiar enough with either DuckDB or SQLite to evaluate the underlying difference, but the changes here look modest and reasonable.

@gwaybio gwaybio left a comment

Two minor questions, which may not require any changes and should not delay merging.

  1. Do you need to update the documentation above the try statement (lines 311-313) to reflect the parquet writer change?
  2. I imagine that the except clause would catch when the _duckdb_reader() fails, so probably no need to change this clause alongside the try logic change?

d33bs commented Aug 18, 2023

Thanks @gwaybio and @falquaddoomi for the reviews! Addressing questions below:

  1. Do you need to update the documentation above the try statement (lines 311-313) to reflect the parquet writer change?

Thank you, I've updated this to be a bit more explicit, also moving around some of the mentions to appropriate sections of the code.

  2. I imagine that the except clause would catch when the _duckdb_reader() fails, so probably no need to change this clause alongside the try logic change?

Thank you for checking! Any exceptions will still surface here when they occur. The "mixed-type" DuckDB exception originates in the nested SELECT within the COPY functionality, and it will persist in the new approach as well because the SELECT query doesn't change. This behavior is tested in test_cell_health_cellprofiler_to_cytominer_database_legacy, which uses a dataset that includes a SQLite string value within a float column.
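For context on how a string can end up in a float column at all: SQLite uses type affinity rather than strict column types, so a column declared as FLOAT will store text that cannot be converted to a number. The sketch below uses only the standard-library `sqlite3` module; the table name, column name, and helper `column_storage_classes` are hypothetical and are not taken from the test dataset.

```python
import sqlite3


def column_storage_classes(conn: sqlite3.Connection, table: str, column: str) -> dict:
    """Count the rows of `column` by SQLite storage class (real, text, ...)."""
    rows = conn.execute(
        f'SELECT typeof("{column}") AS t, COUNT(*) FROM "{table}" GROUP BY t ORDER BY t'
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
# FLOAT gives the column REAL affinity, but SQLite does not enforce it.
conn.execute("CREATE TABLE Nuclei (ImageNumber INTEGER, Measurement FLOAT)")
conn.executemany(
    "INSERT INTO Nuclei VALUES (?, ?)",
    [(1, 0.5), (2, "nan"), (3, 1.5)],  # "nan" is not numeric text, so it stays text
)

print(column_storage_classes(conn, "Nuclei", "Measurement"))
# → {'real': 2, 'text': 1}
```

A check like this can help locate mixed-type columns in a SQLite file before a stricter reader such as DuckDB raises on them.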

@d33bs d33bs merged commit 162533a into cytomining:main Aug 18, 2023
@d33bs d33bs deleted the mismatched-compartment-joins branch August 18, 2023 18:52