Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bsweger/expand moto fixtures #66

Merged
merged 3 commits into from
Dec 6, 2024
Merged

Bsweger/expand moto fixtures #66

merged 3 commits into from
Dec 6, 2024

Conversation

bsweger
Copy link
Collaborator

@bsweger bsweger commented Dec 3, 2024

Resolves #65

This one can be reviewed commit by commit (each commit has a descriptive message).

We could do more robust S3 mocking by running moto in server mode or figuring out to mock the fsspec interface. At this point in time, I don't think the juice is worth the sqeeze, so this PR contains a hacky workaround that will execute the code path that caused last week's bug.

Additionally, the first commit updates the mock S3 fixture to contain files with sensible data (since now we want to test file contents in addition to S3 version lookups).

Originally, the s3_setup fixture created in conftest.py
was designed to unit test cladetime's ability to pull
the correct versionId of S3 objects when provided with
a specific date. Thus, the content of the objects was
irrelevant.

Since then, we've added Cladetime features that also require
testing the content of the files on S3. This changeset
updates the s3_setup pytest fixture to include realistic
metadata, sequence, and ncov_metadata files. Rather than
using file content to test the version, we can now
check a metadata field called "version".
This seems obvious in hindsignt, but for .zst files, the
sequence.get_metadata function uses polars to access URLs
directly (via scan_csv). Polars uses fsspec to open remote
files, so if we pass it a url to a mock, moto-created S3
bucket, it will simply try to access a real S3 bucket
(hence the 403 errors)

The moto setup works for .xz files, because in that case,
the actual file-handling is done by requests, which then
feeds the data to polars.
@bsweger bsweger requested a review from zkamvar December 3, 2024 22:13
@bsweger bsweger marked this pull request as draft December 3, 2024 22:13
@bsweger bsweger changed the title Bsweger/expand moto fixtures WIP Bsweger/expand moto fixtures Dec 3, 2024
@bsweger bsweger marked this pull request as ready for review December 4, 2024 13:58
@bsweger bsweger changed the title WIP Bsweger/expand moto fixtures Bsweger/expand moto fixtures Dec 4, 2024
zkamvar
zkamvar previously approved these changes Dec 4, 2024
Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I appreciate the detailed commit comments and your checks to confirm the shape of the output are cromulent.

The only other thing I'm wondering: should we test for column classes and that they don't contain missing values (for paranoia's sake)?

These additional checks do some basic asserts to ensure
the schema of the metadata columns used by Cladetime
and to ensure completeness/uniqueness of the strain
column (which acts, essentially, as a primary key)
@bsweger
Copy link
Collaborator Author

bsweger commented Dec 6, 2024

The only other thing I'm wondering: should we test for column classes and that they don't contain missing values (for paranoia's sake)?

Thanks for the review! Some of the columns may contain missing values, since here we're checking the metadata download before it goes through cladetime's function to filter out missing/bad data.

But I did add a commit that checks the data type of the columns we care about and checks the integrity of the column that acts as the metadata's unique id.

Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Double approved!

@bsweger bsweger merged commit cba8c60 into main Dec 6, 2024
2 checks passed
@bsweger bsweger deleted the bsweger/expand-moto-fixtures branch December 6, 2024 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Verify LazyFrame contents for get_metadata
2 participants