Sgkit compatible ancestors #778
I thought it worth doing some perf now, so the effect of the lmdb removal can be seen separately, this is initially just for
So this branch is slightly slower at File sizes (number is chunk size in variant dimension):
This branch gives smaller ancestor files. Decreasing the chunk size (variant dimension) gives a small increase in file size and almost no change in runtime (~1s).
That's looking good. The |
@@ -1932,29 +1932,34 @@ def verify_data_round_trip(self, sample_data, ancestor_data, ancestors):
stored_start = ancestor_data.ancestors_start[:]
stored_end = ancestor_data.ancestors_end[:]
stored_time = ancestor_data.ancestors_time[:]
stored_ancestors = ancestor_data.ancestors_haplotype[:]
# Remove the ploidy dimension
stored_ancestors = ancestor_data.ancestors_full_haplotype[:, :, 0]
should we assert here that the ploidy is 1, (i.e. ancestor_data.ancestors_full_haplotype.shape[2] == 1), just to make sure?
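A minimal sketch of the check suggested above, using a plain NumPy array as a stand-in for `ancestor_data.ancestors_full_haplotype`. The helper name `drop_ploidy` and the example shape are hypothetical, and the dimension order (variants, ancestors, ploidy) is an assumption based on the indexing in the diff.

```python
import numpy as np


def drop_ploidy(full_haplotypes):
    """Return a 2D view with the trailing ploidy dimension removed.

    Hypothetical helper: asserts that ploidy is 1 before indexing, so a
    higher-ploidy array is not silently truncated.
    """
    full_haplotypes = np.asarray(full_haplotypes)
    assert full_haplotypes.ndim == 3, "expected a 3D array"
    assert full_haplotypes.shape[2] == 1, "expected ploidy of 1"
    return full_haplotypes[:, :, 0]


# Stand-in data: 5 variants, 3 ancestors, ploidy 1.
full = np.zeros((5, 3, 1), dtype=np.int8)
stored_ancestors = drop_ploidy(full)
```

Without the assertion, indexing `[:, :, 0]` on an array with ploidy 2 would still succeed and quietly discard half the data, which is the failure mode the check guards against.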
Looks like a good start. Re perf, just for kicks I wonder if it's worth perf testing ancestor generation with e.g. 100K sites and 1M samples? Or instead doing it with some missing data, which will generate lots of ancestors.
d68141c to c969a56
Codecov Report
@@ Coverage Diff @@
## main #778 +/- ##
==========================================
+ Coverage 93.31% 93.35% +0.03%
==========================================
Files 17 17
Lines 5597 5657 +60
Branches 999 1012 +13
==========================================
+ Hits 5223 5281 +58
- Misses 246 247 +1
- Partials 128 129 +1
c969a56 to aada29a
@jeromekelleher Now rebased onto #779, so it is a bit simpler.
839a6f1 to 3662ae8
Will I take a look once #779 is merged?
3662ae8 to 120137e
1e767f8 to 4f8a840
The ancestors load into sgkit, but you can't do anything with them yet, as xarray is turning the genotypes into a floating-point array. I'm investigating.
xarray seems keen on that sort of thing - time to start filing some upstream bugs?
It was the same one I hit before, so it's already filed, but I can work around it.
Can you link to the upstream issue please?
pydata/xarray#7292 as zarr's fill_value is |
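As a rough illustration of how integer genotypes can end up as floating point, the sketch below mimics mask-style decoding with plain NumPy: values equal to a fill value are replaced by NaN, which an integer dtype cannot represent. This is an assumption about the general mechanism, not a reproduction of xarray's code or of the linked issue; the fill value of -1 and the helper `decode_with_mask` are hypothetical.

```python
import numpy as np

FILL_VALUE = -1  # assumption: sentinel marking missing genotype calls


def decode_with_mask(values, fill_value):
    """Mimic mask-style decoding: fill values become NaN.

    NaN has no integer representation, so the result must be float.
    """
    decoded = values.astype(np.float64)
    decoded[values == fill_value] = np.nan
    return decoded


genotypes = np.array([[0, 1], [FILL_VALUE, 0]], dtype=np.int8)
decoded = decode_with_mask(genotypes, FILL_VALUE)
print(genotypes.dtype, "->", decoded.dtype)  # int8 -> float64
```

If this is the mechanism at play, one possible workaround (not necessarily the one used in this PR) is to disable that decoding when opening the store, for example through xarray's `mask_and_scale` option.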
One more wrinkle, I thought that |
4f8a840 to efbc68c
So far |
23c4764 to fd7e066
What function were you using that requires it?
True, but on the other hand the mask compresses very well so isn't much storage overhead... I note that https://github.com/pystatgen/vcf-zarr-spec/blob/main/vcf_zarr_spec.md says masks are optional ("An array called |
That seems like a good test. sgkit doesn't really mandate a particular set of variables - it depends on what you want to do. It has certain conventions (which you can see by looking at the variables that are populated by the various functions to load from plink/bgen/vcf), but each function implementing a particular genetic method will typically depend on different variables.
fd7e066 to 3fb2a58
Right. You could perhaps add it as needed with data_vars["call_genotype_mask"] = ([DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY], call_genotype < 0)
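A small sketch of deriving the mask on demand, as suggested above. The dimension names assigned to `DIM_VARIANT`, `DIM_SAMPLE` and `DIM_PLOIDY` are assumptions for illustration, and the genotype values are made up; negative values are taken to mean missing calls.

```python
import numpy as np

# Assumed dimension names, standing in for the sgkit constants.
DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY = "variants", "samples", "ploidy"

# Made-up genotypes: 2 variants, 2 samples, ploidy 1; -1 marks a missing call.
call_genotype = np.array([[[0], [1]], [[-1], [0]]], dtype=np.int8)

data_vars = {
    "call_genotype": ([DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY], call_genotype),
}
# The mask is fully determined by the genotypes, so it can be computed
# when needed instead of being stored alongside them.
data_vars["call_genotype_mask"] = (
    [DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY],
    call_genotype < 0,
)
```

A dictionary in this `(dims, data)` form is the kind of mapping that could then be handed to `xarray.Dataset`. Computing the mask lazily trades a little compute at load time for not writing the genotypes out twice.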
3fb2a58 to e0600b5
Yikes, segfault on CircleCI. Adding more test output to see which one it was.
9303e13 to 0201165
Now that we have something that works in sgkit (to a first approximation) I've re-run the perf numbers. To recap this was the state when we were just writing
Now with writing
Sadly this is quite a bit slower, but that is expected, seeing that we're essentially writing out the genotypes twice. @jeromekelleher Is this slowdown OK, or should we look at making the mask optional in sgkit?
Overall slowdown is about 1/3, I think? If so, let's not worry about it; this bit isn't a bottleneck anyway.
0615f02 to 6056ee7
@jeromekelleher I've factored out the common |
LGTM, one minor thing
tsinfer/formats.py
prev_chunk_id = chunk_id
yield chunk[:, j % chunk_size]
else:
raise ValueError("Only first two dimensions supported")
Needs a test (or just assert?)
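To make the request concrete, here is a hedged sketch of a chunk-wise iterator in the spirit of the reviewed lines, together with the kind of test that would cover the error branch. It is not the code in `tsinfer/formats.py`: the function name `iter_slices`, its signature, and the use of a NumPy array in place of a chunked store are all assumptions.

```python
import numpy as np


def iter_slices(array, chunk_size, dimension=0):
    """Yield slices along `dimension`, reading one chunk at a time.

    Hypothetical stand-in for iterating a chunked array; only the
    first two dimensions are supported.
    """
    if dimension not in (0, 1):
        raise ValueError("Only first two dimensions supported")
    prev_chunk_id = -1
    chunk = None
    for j in range(array.shape[dimension]):
        chunk_id = j // chunk_size
        if chunk_id != prev_chunk_id:
            # Load the next chunk only when we cross a chunk boundary.
            start = chunk_id * chunk_size
            stop = start + chunk_size
            chunk = array[start:stop] if dimension == 0 else array[:, start:stop]
            prev_chunk_id = chunk_id
        if dimension == 0:
            yield chunk[j % chunk_size]
        else:
            yield chunk[:, j % chunk_size]


def check_unsupported_dimension():
    """Test-style check that the error branch is reachable."""
    a = np.arange(12).reshape(3, 4)
    try:
        # The function is a generator, so the error surfaces on iteration.
        list(iter_slices(a, 2, dimension=2))
    except ValueError:
        return True
    return False
```

Because the function is a generator, the `ValueError` is raised when iteration starts rather than when the function is called, which is worth keeping in mind when writing the test.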
b1f980f to 44418bf
44418bf to a8a108a
Work in progress.
Tests finally pass with a genotype array that is sgkit-compatible.
Next I need to remove the lmdb store (without breaking SampleData which shares the storage base class) and then do some perf tests.