Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sgkit compatible ancestors #778

Merged
merged 1 commit into from
Jan 20, 2023
Merged

Conversation

benjeffery
Copy link
Member

Work in progress.

Tests finally pass with a genotype array that is sgkit-compatible.
Next I need to remove the lmdb store (without breaking SampleData which shares the storage base class) and then do some perf tests.

@benjeffery
Copy link
Member Author

I thought it worth doing some perf now, so the effect of the lmdb removal can be seen separately, this is initially just for generate-ancestors, on sample data from this ts:

║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │      61779║
╟───────────────┼───────────╢
║Sequence Length│ 50000000.0║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │      30000║
╟───────────────┼───────────╢
║Total Size     │   17.8 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤══════╤═════════╤════════════╗
║Table      │Rows  │Size     │Has Metadata║
╠═══════════╪══════╪═════════╪════════════╣
║Edges      │294197│  9.0 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Individuals│     0│ 24 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Migrations │     0│  8 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Mutations  │ 65362│  2.3 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Nodes      │102737│  2.7 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Populations│     1│ 16 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Provenances│     1│955 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Sites      │ 65362│  1.6 MiB│          No║
╚═══════════╧══════╧═════════╧════════════╝

main:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.78kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:23, 323it/s]

branch:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.34kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:58, 260it/s]

So this branch is slightly slower at ga-gen

File sizes (number is chunk size in variant dimension):

95M 100k.main.ancestors
62M 100k.branch.2048.ancestors
59M 100k.branch.16384.ancestors
59M 100k.branch.65536.ancestors

This branch gives smaller ancestor files, decreasing the chunk size (variant dimension) has a small increase in file size, and almost no change in runtime (~1s)

@hyanwong
Copy link
Member

That's looking good. The ga step was never much of a time-blocker anyway, so although 25% increase in time isn't ideal, it's not something worth spending huge amounts of time trying to change, either.

@@ -1932,29 +1932,34 @@ def verify_data_round_trip(self, sample_data, ancestor_data, ancestors):
stored_start = ancestor_data.ancestors_start[:]
stored_end = ancestor_data.ancestors_end[:]
stored_time = ancestor_data.ancestors_time[:]
stored_ancestors = ancestor_data.ancestors_haplotype[:]
# Remove the ploidy dimension
stored_ancestors = ancestor_data.ancestors_full_haplotype[:, :, 0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we assert here that the ploidy is 1, (i.e. ancestor_data.ancestors_full_haplotype.shape[2] == 1), just to make sure?

Copy link
Member

@hyanwong hyanwong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good start. Re perf, just for kicks I wonder if it's worth perf testing ancestor generation with e.g. a 100K sites and 1M samples? Or instead doing it with some missing data, which will generate lots of ancestors.

@codecov
Copy link

codecov bot commented Nov 28, 2022

Codecov Report

Merging #778 (a8a108a) into main (ff1d06d) will increase coverage by 0.03%.
The diff coverage is 95.08%.

@@            Coverage Diff             @@
##             main     #778      +/-   ##
==========================================
+ Coverage   93.31%   93.35%   +0.03%     
==========================================
  Files          17       17              
  Lines        5597     5657      +60     
  Branches      999     1012      +13     
==========================================
+ Hits         5223     5281      +58     
- Misses        246      247       +1     
- Partials      128      129       +1     
Flag Coverage Δ
C 93.35% <95.08%> (+0.03%) ⬆️
python 96.31% <95.08%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tsinfer/formats.py 97.48% <95.04%> (-0.03%) ⬇️
tsinfer/inference.py 98.56% <100.00%> (-0.01%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@benjeffery
Copy link
Member Author

@jeromekelleher Now rebased onto #779 so is a bit simpler.

@benjeffery benjeffery force-pushed the sgkit-ancestor-2 branch 2 times, most recently from 839a6f1 to 3662ae8 Compare November 29, 2022 11:03
@jeromekelleher
Copy link
Member

Will I take a look once #779 is merged?

@benjeffery
Copy link
Member Author

The ancestors load into sgkit, but you can't do anything with them yet as xarray is turning the genotypes into a floating-point array, investigating.

@jeromekelleher
Copy link
Member

xarray seems keen on that sort of thing - time to start filing some upstream bugs?

@benjeffery
Copy link
Member Author

xarray seems keen on that sort of thing - time to start filing some upstream bugs?

It was the same one I hit before, so already filed, but I can work around it.

@jeromekelleher
Copy link
Member

Can you link to the upstream issue please?

@benjeffery
Copy link
Member Author

pydata/xarray#7292 as zarr's fill_value is 0 by default.

@benjeffery
Copy link
Member Author

One more wrinkle, I thought that call_genotype_mask was optional in sgkit, but appears that it is needed. This means creating and storing an additional array of the same shape and almost identical content to the genotypes.

@benjeffery
Copy link
Member Author

So far sgkit.load_dataset, sgkit.variant_stats, sgkit.display_genotypes are working on the ancestors. As I added them, display_genotypes required sample_id and variant_stats did not. @tomwhite what other methods should I add that give the full set of requirements on a dataset? Or is there a better way of testing "sgkit compatible"-ness?

@benjeffery benjeffery force-pushed the sgkit-ancestor-2 branch 3 times, most recently from 23c4764 to fd7e066 Compare December 1, 2022 23:42
@tomwhite
Copy link

tomwhite commented Dec 2, 2022

One more wrinkle, I thought that call_genotype_mask was optional in sgkit, but appears that it is needed.

What function were you using that requires it?

This means creating and storing an additional array of the same shape and almost identical content to the genotypes.

True, but on the other hand the mask compresses very well so isn't much storage overhead...

I note that https://github.com/pystatgen/vcf-zarr-spec/blob/main/vcf_zarr_spec.md says masks are optional ("An array called <name> may have an accompanying array called <name>_mask..."), so we might want to revisit. E.g. compute the mask on demand perhaps.

@tomwhite
Copy link

tomwhite commented Dec 2, 2022

So far sgkit.load_dataset, sgkit.variant_stats, sgkit.display_genotypes are working on the ancestors. As I added them, display_genotypes required sample_id and variant_stats did not. @tomwhite what other methods should I add that give the full set of requirements on a dataset? Or is there a better way of testing "sgkit compatible"-ness?

That seems like a good test.

sgkit doesn't really mandate a particular set of variables - it depends on what you want to do. It has certain conventions (which you can see by looking at the variables that are populated by the various functions to load from plink/bgen/vcf), but each function implementing a particular genetic method will typically depend on different variables.

@benjeffery
Copy link
Member Author

What function were you using that requires it?

variant_stats required the mask.

@tomwhite
Copy link

tomwhite commented Dec 2, 2022

variant_stats required the mask.

Right. You could perhaps add it as needed with

data_vars["call_genotype_mask"] = ([DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY], call_genotype < 0)

@benjeffery
Copy link
Member Author

Yikes, segfault on circle CI. Adding more test output to see which one it was.

@benjeffery benjeffery force-pushed the sgkit-ancestor-2 branch 5 times, most recently from 9303e13 to 0201165 Compare December 5, 2022 13:06
@benjeffery
Copy link
Member Author

Now that we have something that works in sgkit (to a first approximation) I've re-run the perf numbers. To recap this was the state when we were just writing call_genotype:
main:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.78kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:23, 323it/s]

branch:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.34kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:58, 260it/s]

Now with writing call_genotype_mask in addition:

ga-add   (1/2)100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.78kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46.4k/46.4k [03:29, 221it/s]

Sadly this is quite a bit slower, but is expected seeing that we're essentially writing out the genotypes twice.
The mask is very compressible though, its information content being equal to the lengths of the ancestors, only taking up 2MB.

@jeromekelleher Is this slow down ok, or should we look at making the mask optional in sgkit?

@benjeffery
Copy link
Member Author

benjeffery commented Dec 5, 2022

Now that we have something that works in sgkit (to a first approximation) I've re-run the perf numbers. To recap this was the state when we were just writing call_genotype:
main:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.78kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:23, 323it/s]

branch:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.34kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:58, 260it/s]

Now with writing call_genotype_mask in addition:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.78kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████| 46.4k/46.4k [03:29, 221it/s]

Sadly this is quite a bit slower, but is expected seeing that we're essentially writing out the genotypes twice.
The mask is very compressible though, its information content being equal to the lengths of the ancestors, only taking up 2MB.

@jeromekelleher Is this slow down ok, or should we look at making the mask optional in sgkit?

@jeromekelleher
Copy link
Member

Overall slowdown is about 1/3 I think? If so let's not worry about it, this bit isn't a bottleneck anyway.

@benjeffery benjeffery force-pushed the sgkit-ancestor-2 branch 3 times, most recently from 0615f02 to 6056ee7 Compare December 7, 2022 12:01
@benjeffery benjeffery marked this pull request as ready for review December 7, 2022 12:48
@benjeffery
Copy link
Member Author

@jeromekelleher I've factored out the common create_dataset args as discussed in our meeting. Would appreciate a review here now.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one minor thing

prev_chunk_id = chunk_id
yield chunk[:, j % chunk_size]
else:
raise ValueError("Only first two dimensions supported")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a test (or just assert?)

@benjeffery benjeffery force-pushed the sgkit-ancestor-2 branch 2 times, most recently from b1f980f to 44418bf Compare January 20, 2023 16:41
@mergify mergify bot merged commit ca1d1cc into tskit-dev:main Jan 20, 2023
@benjeffery benjeffery deleted the sgkit-ancestor-2 branch January 23, 2023 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants