Decouple SampleData from AncestorData #779

benjeffery · 2022-11-23T16:14:06Z

I thought this best done as a separate PR from the sgkit ancestors work.

This ended up a bit more invasive than I had hoped. It looks like ancestor_data.sequence_length is used in quite a few places, mostly for making tree sequence tables of the ancestors, so I've passed that through too, though it might be possible to use max(position)? The main changes are in generate_ancestors and AncestorsGenerator, as the AncestorData can no longer be made ahead of time and has to be made once the sites are known.
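
To make the max(position) idea concrete, here is a minimal sketch using a hypothetical stand-in container. `ToyAncestorData` and `make_toy_ancestor_data` are illustrative names, not tsinfer's API. The sketch assumes site positions must lie strictly below the sequence length, so the fallback uses `max(position) + 1` rather than `max(position)` itself; whether that would be an acceptable default in tsinfer is not settled in this thread.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ToyAncestorData:
    """Hypothetical stand-in for an ancestor container (not the tsinfer class)."""

    position: Tuple[float, ...]
    sequence_length: float


def make_toy_ancestor_data(
    position: Sequence[float], sequence_length: Optional[float] = None
) -> ToyAncestorData:
    """Build the container from site positions alone, with no SampleData needed."""
    position = tuple(position)
    # Positions are assumed to be unique and sorted in increasing order.
    if list(position) != sorted(set(position)):
        raise ValueError("positions must be strictly increasing")
    if sequence_length is None:
        # Assumed fallback: the smallest length that still contains every site.
        if not position:
            raise ValueError("cannot infer sequence_length without any sites")
        sequence_length = max(position) + 1
    if position and sequence_length <= position[-1]:
        raise ValueError("sequence_length must exceed the last site position")
    return ToyAncestorData(position, sequence_length)
```

With an explicit length the value is passed through unchanged; without one, the positions `[1, 5, 9]` would give a length of 10 under the assumed fallback.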

codecov · 2022-11-28T15:08:10Z

Codecov Report

Merging #779 (c7f62d5) into main (6ca6edc) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #779      +/-   ##
==========================================
- Coverage   93.33%   93.30%   -0.03%     
==========================================
  Files          17       17              
  Lines        5581     5571      -10     
  Branches      991      990       -1     
==========================================
- Hits         5209     5198      -11     
  Misses        246      246              
- Partials      126      127       +1

| Flag | Coverage Δ | |
|---|---|---|
| C | `93.30% <100.00%> (-0.03%)` | ⬇️ |
| python | `96.30% <100.00%> (-0.04%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| tsinfer/eval_util.py | `90.51% <100.00%> (ø)` | |
| tsinfer/formats.py | `97.49% <100.00%> (-0.10%)` | ⬇️ |
| tsinfer/inference.py | `98.57% <100.00%> (+<0.01%)` | ⬆️ |

benjeffery · 2022-11-28T15:38:06Z

Hmm, tests are passing locally. Looking into it.

benjeffery · 2022-11-28T15:54:54Z

Not sure why docs are failing either.

benjeffery · 2022-11-28T16:37:42Z

Rebased onto #780 to fix the CI issues. Should be good to go now.

jeromekelleher

LGTM!

jeromekelleher · 2022-11-29T13:11:42Z

tsinfer/formats.py

@@ -2252,7 +2242,7 @@ def __eq__(self, other):

 class AncestorData(DataContainer):
    """
-    AncestorData(sample_data, *, path=None, num_flush_threads=0, compressor=None, \
+    AncestorData(position, *, path=None, num_flush_threads=0, compressor=None, \


Missing sequence_length here

jeromekelleher · 2022-11-29T13:15:52Z

Can you take a look through please @hyanwong?

hyanwong · 2022-11-29T13:51:33Z

evaluation.py

@@ -819,7 +823,9 @@ def sim_true_and_inferred_ancestors(args):
    sample_data = generate_samples(ts, args.error)

    inferred_anc = tsinfer.generate_ancestors(sample_data, engine=args.engine)
-    true_anc = tsinfer.AncestorData(sample_data)
+    true_anc = tsinfer.AncestorData(
+        sample_data.sites_position, sample_data.sequence_length


Trivial, but if you want to shorten sample_data to sd (which we do sometimes anyway in the tests), it will make all this fit on a single line. There are a whole set of other examples like this too.

hyanwong · 2022-11-29T13:52:25Z

tests/test_inference.py

    def match_ancestors_ancestors_unfinalised(self, path=None):
        with tsinfer.SampleData(sequence_length=2) as sample_data:
            sample_data.add_site(1, genotypes=[0, 1, 1, 0], alleles=["G", "C"])
-        with tsinfer.AncestorData(sample_data, path=path) as ancestor_data:
+        with tsinfer.AncestorData(


Ditto, sd rather than sample_data might make it all more readable (fewer lines)

hyanwong · 2022-11-29T13:54:36Z

tsinfer/formats.py

-        self._num_alleles = self.sample_data.num_alleles()
-        position = self.sample_data.sites_position[:]
+        self.data.attrs["sequence_length"] = sequence_length
+        if self.sequence_length == 0:


Should this be if self.sequence_length <= 0?
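
A minimal sketch of the stricter check suggested here, written as a standalone function. `check_sequence_length` is a hypothetical helper for illustration, not code from this PR. Comparing with `<= 0` rejects negative lengths as well as zero, which an `== 0` test would let through.

```python
def check_sequence_length(sequence_length):
    """Return sequence_length unchanged if it is positive, else raise ValueError."""
    # "<= 0" catches both zero and negative values; "== 0" would miss negatives.
    if sequence_length <= 0:
        raise ValueError("sequence_length must be greater than zero")
    return sequence_length
```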

hyanwong · 2022-11-29T14:11:55Z

tests/test_inference.py

-        with tempfile.TemporaryDirectory(prefix="tsinf_inference_test") as tempdir:
-            filename = os.path.join(tempdir, "samples.tmp")
-            self.make_ancestor_data_unfinalised(filename)
-
    def test_match_ancestors_ancestors(self):
        self.match_ancestors_ancestors_unfinalised()


Do we test match_ancestors_ancestors_unfinalised with the path argument anywhere?

Yes, on line 78.

hyanwong

This all LGTM, but I'd need to work with it for a bit to see what some of the changes imply in terms of workflow (should be the same, I'm guessing). It looks like it's passing the truncate_ancestors tests too, which I wasn't expecting 👍

benjeffery force-pushed the ancestor-position-only branch from 8d0c30c to 398061c Compare November 28, 2022 15:01

benjeffery force-pushed the ancestor-position-only branch from 398061c to 8a27ef4 Compare November 28, 2022 15:14

benjeffery changed the title ~~WIP - decouple SampleData from AncestorData~~ Decouple SampleData from AncestorData Nov 28, 2022

benjeffery force-pushed the ancestor-position-only branch from 8a27ef4 to 215ef34 Compare November 28, 2022 15:15

benjeffery marked this pull request as ready for review November 28, 2022 15:16

benjeffery force-pushed the ancestor-position-only branch from 215ef34 to 600d4a6 Compare November 28, 2022 15:41

benjeffery force-pushed the ancestor-position-only branch 3 times, most recently from ae3c09f to 2630169 Compare November 28, 2022 16:35

benjeffery mentioned this pull request Nov 28, 2022

Sgkit compatible ancestors #778

Merged

benjeffery force-pushed the ancestor-position-only branch from 2630169 to 201afa3 Compare November 29, 2022 10:55

jeromekelleher approved these changes Nov 29, 2022

jeromekelleher requested a review from hyanwong November 29, 2022 13:15

hyanwong reviewed Nov 29, 2022

hyanwong approved these changes Nov 29, 2022

Decouple SampleData from AncestorData

c7f62d5

benjeffery force-pushed the ancestor-position-only branch from 201afa3 to c7f62d5 Compare November 30, 2022 10:36

benjeffery added the AUTOMERGE-REQUESTED label Nov 30, 2022

mergify bot merged commit b4432e1 into tskit-dev:main Nov 30, 2022

mergify bot removed the AUTOMERGE-REQUESTED label Nov 30, 2022

benjeffery deleted the ancestor-position-only branch November 30, 2022 11:01
