Bsweger/expand moto fixtures #66
Originally, the s3_setup fixture created in conftest.py was designed to unit test cladetime's ability to pull the correct versionId of S3 objects when provided with a specific date. Thus, the content of the objects was irrelevant. Since then, we've added Cladetime features that also require testing the content of the files on S3. This changeset updates the s3_setup pytest fixture to include realistic metadata, sequence, and ncov_metadata files. Rather than using file content to test the version, we can now check a metadata field called "version".
This seems obvious in hindsight, but for .zst files, the sequence.get_metadata function uses polars to access URLs directly (via scan_csv). Polars uses fsspec to open remote files, so if we pass it a URL to a mock, moto-created S3 bucket, it will simply try to access a real S3 bucket (hence the 403 errors). The moto setup works for .xz files because, in that case, the actual file handling is done by requests, which then feeds the data to polars.
Thank you! I appreciate the detailed commit comments and your checks to confirm the shape of the output are cromulent.
The only other thing I'm wondering: should we test for column classes and that they don't contain missing values (for paranoia's sake)?
These additional checks add basic asserts that verify the schema of the metadata columns used by Cladetime and the completeness/uniqueness of the strain column (which acts, essentially, as a primary key).
Thanks for the review! Some of the columns may contain missing values, since here we're checking the metadata download before it goes through cladetime's function to filter out missing/bad data. But I did add a commit that checks the data type of the columns we care about and checks the integrity of the column that acts as the metadata's unique id.
Thank you! Double approved!
Resolves #65
This one can be reviewed commit by commit (each commit has a descriptive message).
We could do more robust S3 mocking by running moto in server mode or figuring out how to mock the fsspec interface. At this point in time, I don't think the juice is worth the squeeze, so this PR contains a hacky workaround that will execute the code path that caused last week's bug.
Additionally, the first commit updates the mock S3 fixture to contain files with sensible data (since now we want to test file contents in addition to S3 version lookups).