Sgkit compatible ancestors #778

benjeffery · 2022-11-22T23:54:13Z

Work in progress.

Tests finally pass with a genotype array that is sgkit-compatible.
Next I need to remove the lmdb store (without breaking SampleData which shares the storage base class) and then do some perf tests.

benjeffery · 2022-11-23T12:20:57Z

I thought it worth doing some perf measurements now, so that the effect of the lmdb removal can be seen separately. This is initially just for generate-ancestors, on sample data from this ts:

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │      61779║
╟───────────────┼───────────╢
║Sequence Length│ 50000000.0║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │      30000║
╟───────────────┼───────────╢
║Total Size     │   17.8 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤══════╤═════════╤════════════╗
║Table      │Rows  │Size     │Has Metadata║
╠═══════════╪══════╪═════════╪════════════╣
║Edges      │294197│  9.0 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Individuals│     0│ 24 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Migrations │     0│  8 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Mutations  │ 65362│  2.3 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Nodes      │102737│  2.7 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Populations│     1│ 16 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Provenances│     1│955 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Sites      │ 65362│  1.6 MiB│          No║
╚═══════════╧══════╧═════════╧════════════╝

main:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.78kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:23, 323it/s]

branch:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.34kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:58, 260it/s]

So this branch is slightly slower at ga-gen (02:58 vs 02:23, roughly 25% longer).

File sizes (number is chunk size in variant dimension):

95M 100k.main.ancestors
62M 100k.branch.2048.ancestors
59M 100k.branch.16384.ancestors
59M 100k.branch.65536.ancestors

This branch gives smaller ancestor files (59-62M vs 95M on main). Decreasing the chunk size (variant dimension) gives a small increase in file size and almost no change in runtime (~1s).

hyanwong · 2022-11-23T12:42:33Z

That's looking good. The ga step was never much of a time-blocker anyway, so although a 25% increase in time isn't ideal, it's not something worth spending huge amounts of time trying to change, either.

hyanwong · 2022-11-23T12:44:51Z

tests/test_formats.py

@@ -1932,29 +1932,34 @@ def verify_data_round_trip(self, sample_data, ancestor_data, ancestors):
        stored_start = ancestor_data.ancestors_start[:]
        stored_end = ancestor_data.ancestors_end[:]
        stored_time = ancestor_data.ancestors_time[:]
-        stored_ancestors = ancestor_data.ancestors_haplotype[:]
+        # Remove the ploidy dimension
+        stored_ancestors = ancestor_data.ancestors_full_haplotype[:, :, 0]


Should we assert here that the ploidy is 1 (i.e. ancestor_data.ancestors_full_haplotype.shape[2] == 1), just to make sure?
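A minimal sketch of the check suggested above, using a small numpy array as a stand-in for ancestor_data.ancestors_full_haplotype. The helper name drop_ploidy_dimension and the (sites, ancestors, ploidy) axis order are illustrative assumptions, not tsinfer code:

```python
import numpy as np


def drop_ploidy_dimension(full_haplotype):
    """Return the 2D array left after removing a ploidy dimension of size 1.

    Hypothetical helper; the axis order (sites, ancestors, ploidy) is assumed.
    """
    # Fail loudly instead of silently discarding data when indexing [:, :, 0]
    assert full_haplotype.ndim == 3, "expected a 3-dimensional array"
    assert full_haplotype.shape[2] == 1, "expected a ploidy of 1"
    return full_haplotype[:, :, 0]


# Stand-in for ancestor_data.ancestors_full_haplotype[:]
full = np.zeros((5, 3, 1), dtype=np.int8)
stored_ancestors = drop_ploidy_dimension(full)
```

With the assertion in place, a future change that stores more than one haplotype per ancestor would fail the round-trip test immediately rather than comparing only the first ploidy slice.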

hyanwong

Looks like a good start. Re perf, just for kicks I wonder if it's worth perf testing ancestor generation with e.g. 100K sites and 1M samples? Or instead doing it with some missing data, which will generate lots of ancestors.

codecov · 2022-11-28T18:01:20Z

Codecov Report

Merging #778 (a8a108a) into main (ff1d06d) will increase coverage by 0.03%.
The diff coverage is 95.08%.

@@            Coverage Diff             @@
##             main     #778      +/-   ##
==========================================
+ Coverage   93.31%   93.35%   +0.03%     
==========================================
  Files          17       17              
  Lines        5597     5657      +60     
  Branches      999     1012      +13     
==========================================
+ Hits         5223     5281      +58     
- Misses        246      247       +1     
- Partials      128      129       +1

Flag	Coverage Δ
C	`93.35% <95.08%> (+0.03%)`	⬆️
python	`96.31% <95.08%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
tsinfer/formats.py	`97.48% <95.04%> (-0.03%)`	⬇️
tsinfer/inference.py	`98.56% <100.00%> (-0.01%)`	⬇️

benjeffery · 2022-11-28T18:04:58Z

@jeromekelleher Now rebased onto #779, so it is a bit simpler.

jeromekelleher · 2022-11-29T13:16:42Z

Will I take a look once #779 is merged?

benjeffery · 2022-11-30T11:33:23Z

The ancestors load into sgkit, but you can't do anything with them yet as xarray is turning the genotypes into a floating-point array. Investigating.

jeromekelleher · 2022-12-01T09:50:28Z

xarray seems keen on that sort of thing - time to start filing some upstream bugs?

benjeffery · 2022-12-01T10:13:12Z

> xarray seems keen on that sort of thing - time to start filing some upstream bugs?

It was the same one I hit before, so already filed, but I can work around it.

jeromekelleher · 2022-12-01T10:14:29Z

Can you link to the upstream issue please?

benjeffery · 2022-12-01T12:02:13Z

pydata/xarray#7292. It comes up here because zarr's fill_value is 0 by default.
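To illustrate why treating a fill value as "missing" changes the dtype, here is a numpy-only sketch. It is not xarray's decoding code; it only shows that an integer array cannot hold NaN, so replacing entries equal to the fill value (0 here, as noted above) with NaN yields a floating-point array. The toy genotypes are made up for the example:

```python
import numpy as np

fill_value = 0  # zarr's default fill_value, per the comment above
genotypes = np.array([[0, 1], [1, -1]], dtype=np.int8)

# Entries equal to the fill value become NaN; since int8 has no NaN,
# the result is promoted to a floating-point dtype.
masked = np.where(genotypes == fill_value, np.nan, genotypes)

print(genotypes.dtype, "->", masked.dtype)
```

For genotype data this is doubly unfortunate, since 0 is a perfectly valid allele index rather than a missing value.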

benjeffery · 2022-12-01T12:03:47Z

One more wrinkle: I thought that call_genotype_mask was optional in sgkit, but it appears that it is needed. This means creating and storing an additional array of the same shape and almost identical content to the genotypes.

benjeffery · 2022-12-01T12:44:41Z

So far sgkit.load_dataset, sgkit.variant_stats, sgkit.display_genotypes are working on the ancestors. As I added them, display_genotypes required sample_id and variant_stats did not. @tomwhite what other methods should I add that give the full set of requirements on a dataset? Or is there a better way of testing "sgkit compatible"-ness?

tomwhite · 2022-12-02T07:54:55Z

> One more wrinkle: I thought that call_genotype_mask was optional in sgkit, but it appears that it is needed.

What function were you using that requires it?

> This means creating and storing an additional array of the same shape and almost identical content to the genotypes.

True, but on the other hand the mask compresses very well so isn't much storage overhead...

I note that https://github.com/pystatgen/vcf-zarr-spec/blob/main/vcf_zarr_spec.md says masks are optional ("An array called <name> may have an accompanying array called <name>_mask..."), so we might want to revisit. E.g. compute the mask on demand perhaps.

tomwhite · 2022-12-02T07:58:31Z

> So far sgkit.load_dataset, sgkit.variant_stats, sgkit.display_genotypes are working on the ancestors. As I added them, display_genotypes required sample_id and variant_stats did not. @tomwhite what other methods should I add that give the full set of requirements on a dataset? Or is there a better way of testing "sgkit compatible"-ness?

That seems like a good test.

sgkit doesn't really mandate a particular set of variables - it depends on what you want to do. It has certain conventions (which you can see by looking at the variables that are populated by the various functions to load from plink/bgen/vcf), but each function implementing a particular genetic method will typically depend on different variables.

benjeffery · 2022-12-02T13:17:19Z

> What function were you using that requires it?

variant_stats required the mask.

tomwhite · 2022-12-02T13:25:31Z

> variant_stats required the mask.

Right. You could perhaps add it as needed with

data_vars["call_genotype_mask"] = ([DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY], call_genotype < 0)
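Spelled out on a toy array, the suggestion above amounts to something like the following. This sketch uses plain numpy; the dimension-name constants are defined locally as stand-ins for the sgkit constants of the same names (their string values here are assumptions), and data_vars is just a dict of (dims, array) pairs:

```python
import numpy as np

# Local stand-ins for the sgkit dimension constants (values assumed)
DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY = "variants", "samples", "ploidy"

# Toy genotypes: 3 variants x 2 samples x ploidy 1; negative marks a missing call
call_genotype = np.array(
    [[[0], [-1]], [[1], [1]], [[-1], [0]]], dtype=np.int8
)

data_vars = {
    "call_genotype": ([DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY], call_genotype),
}

# Derive the mask from the genotypes on demand instead of storing it
data_vars["call_genotype_mask"] = (
    [DIM_VARIANT, DIM_SAMPLE, DIM_PLOIDY],
    call_genotype < 0,
)

mask = data_vars["call_genotype_mask"][1]
print(mask.dtype, mask.shape, int(mask.sum()))
```

The mask is a boolean array of the same shape as the genotypes, true exactly where a call is negative, so it carries no information that is not already in call_genotype.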

benjeffery · 2022-12-02T13:28:47Z

Yikes, segfault on CircleCI. Adding more test output to see which one it was.

benjeffery · 2022-12-05T14:42:22Z

Now that we have something that works in sgkit (to a first approximation) I've re-run the perf numbers. To recap, this was the state when we were just writing call_genotype:

main:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.78kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:23, 323it/s]

branch:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.34kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████████| 46.4k/46.4k [02:58, 260it/s]

Now with writing call_genotype_mask in addition:

ga-add   (1/2)100%|██████████████████████████████████████████████████████| 65.4k/65.4k [00:07, 8.78kit/s]
ga-gen   (2/2)100%|████████████████████████████████████████████████████| 46.4k/46.4k [03:29, 221it/s]

Sadly this is quite a bit slower, but that is expected given that we're essentially writing out the genotypes twice.
The mask is very compressible though, taking up only 2MB, as its information content is equivalent to the lengths of the ancestors.

@jeromekelleher Is this slowdown ok, or should we look at making the mask optional in sgkit?

jeromekelleher · 2022-12-05T17:45:03Z

Overall slowdown is about 1/3 I think? If so let's not worry about it, this bit isn't a bottleneck anyway.
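As a quick check of that estimate, the ga-gen elapsed times and rates from the progress bars above can be compared directly. This is only arithmetic on the reported numbers; the labels and the helper to_seconds are made up for the sketch:

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' progress-bar time into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# ga-gen elapsed time and rate (it/s), copied from the progress bars above
runs = {
    "main": ("02:23", 323),
    "branch, call_genotype only": ("02:58", 260),
    "branch, with call_genotype_mask": ("03:29", 221),
}

base_seconds = to_seconds(runs["main"][0])
base_rate = runs["main"][1]

summary = {}
for name, (elapsed, rate) in runs.items():
    seconds = to_seconds(elapsed)
    extra_time = seconds / base_seconds - 1  # fractional increase in wall time
    rate_drop = 1 - rate / base_rate  # fractional drop in iterations per second
    summary[name] = (seconds, extra_time, rate_drop)
    print(f"{name}: {seconds}s, +{extra_time:.0%} time, -{rate_drop:.0%} rate")
```

On these numbers the run with the mask takes about 46% longer than main (209s vs 143s), which is the same thing as the iteration rate falling by roughly a third (221 vs 323 it/s).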

benjeffery · 2022-12-19T14:01:48Z

@jeromekelleher I've factored out the common create_dataset args as discussed in our meeting. Would appreciate a review here now.

jeromekelleher

LGTM, one minor thing

jeromekelleher · 2023-01-11T13:21:31Z

tsinfer/formats.py

+                prev_chunk_id = chunk_id
+            yield chunk[:, j % chunk_size]
+    else:
+        raise ValueError("Only first two dimensions supported")


Needs a test (or just assert?)
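For context, here is a sketch of the kind of chunk-wise iterator the diff fragment above appears to come from, together with the error branch under discussion. It is not the tsinfer implementation: plain numpy slicing stands in for reading a compressed chunk, and the function name and signature are assumptions.

```python
import numpy as np


def chunk_iterator(array, chunk_size, dimension=0):
    """Yield slices of a 2D array along `dimension`, one chunk in memory at a time.

    Illustrative only; numpy slicing stands in for decompressing a chunk.
    """
    prev_chunk_id = -1
    chunk = None
    if dimension == 0:
        for j in range(array.shape[0]):
            chunk_id = j // chunk_size
            if chunk_id != prev_chunk_id:
                # Crossed a chunk boundary: read the next block of rows
                start = chunk_id * chunk_size
                chunk = array[start : start + chunk_size]
                prev_chunk_id = chunk_id
            yield chunk[j % chunk_size]
    elif dimension == 1:
        for j in range(array.shape[1]):
            chunk_id = j // chunk_size
            if chunk_id != prev_chunk_id:
                # Crossed a chunk boundary: read the next block of columns
                start = chunk_id * chunk_size
                chunk = array[:, start : start + chunk_size]
                prev_chunk_id = chunk_id
            yield chunk[:, j % chunk_size]
    else:
        raise ValueError("Only first two dimensions supported")


a = np.arange(12).reshape(3, 4)
columns = list(chunk_iterator(a, chunk_size=3, dimension=1))
```

A test for the error branch could be as small as consuming the generator inside pytest.raises(ValueError). Because this sketch is a generator, the error is only raised once iteration starts, not when the function is called.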

hyanwong reviewed Nov 23, 2022

benjeffery force-pushed the sgkit-ancestor-2 branch from d68141c to c969a56 November 28, 2022 17:52

benjeffery force-pushed the sgkit-ancestor-2 branch from c969a56 to aada29a November 28, 2022 18:03

benjeffery force-pushed the sgkit-ancestor-2 branch 2 times, most recently from 839a6f1 to 3662ae8 November 29, 2022 11:03

benjeffery force-pushed the sgkit-ancestor-2 branch from 3662ae8 to 120137e November 29, 2022 13:36

benjeffery mentioned this pull request Nov 29, 2022

Minimum viable sgkit dataset #748

Merged

benjeffery force-pushed the sgkit-ancestor-2 branch 2 times, most recently from 1e767f8 to 4f8a840 November 30, 2022 11:32

benjeffery force-pushed the sgkit-ancestor-2 branch from 4f8a840 to efbc68c December 1, 2022 12:42

benjeffery force-pushed the sgkit-ancestor-2 branch 3 times, most recently from 23c4764 to fd7e066 December 1, 2022 23:42

benjeffery force-pushed the sgkit-ancestor-2 branch from fd7e066 to 3fb2a58 December 2, 2022 13:18

benjeffery force-pushed the sgkit-ancestor-2 branch from 3fb2a58 to e0600b5 December 2, 2022 13:28

benjeffery force-pushed the sgkit-ancestor-2 branch 5 times, most recently from 9303e13 to 0201165 December 5, 2022 13:06

benjeffery force-pushed the sgkit-ancestor-2 branch 3 times, most recently from 0615f02 to 6056ee7 December 7, 2022 12:01

benjeffery marked this pull request as ready for review December 7, 2022 12:48

jeromekelleher approved these changes Jan 11, 2023

benjeffery force-pushed the sgkit-ancestor-2 branch 2 times, most recently from b1f980f to 44418bf January 20, 2023 16:41

Store ancestor data in sgkit format

a8a108a

benjeffery force-pushed the sgkit-ancestor-2 branch from 44418bf to a8a108a January 20, 2023 22:33

benjeffery added the AUTOMERGE-REQUESTED label Jan 20, 2023

mergify bot merged commit ca1d1cc into tskit-dev:main Jan 20, 2023

mergify bot removed the AUTOMERGE-REQUESTED label Jan 20, 2023

benjeffery deleted the sgkit-ancestor-2 branch January 23, 2023 13:53
