Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple SampleData from AncestorData #779

Merged
merged 1 commit into from
Nov 30, 2022

Conversation

benjeffery
Copy link
Member

@benjeffery benjeffery commented Nov 23, 2022

I thought this best done as a separate PR from the sgkit ancestors work.

This ended up a bit more invasive than I had hoped. It looks like ancestor_data.sequence_length is used in quite a few places - mostly for making tree sequence tables of the ancestors, so I've also passed that through too, thought it might be possible to use max(position)? The main changes are in generate_ancestors and AncestorsGenerator as the AncestorData can no longer be made ahead of time, and has to be made once the sites are known.

@codecov
Copy link

codecov bot commented Nov 28, 2022

Codecov Report

Merging #779 (c7f62d5) into main (6ca6edc) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #779      +/-   ##
==========================================
- Coverage   93.33%   93.30%   -0.03%     
==========================================
  Files          17       17              
  Lines        5581     5571      -10     
  Branches      991      990       -1     
==========================================
- Hits         5209     5198      -11     
  Misses        246      246              
- Partials      126      127       +1     
Flag Coverage Δ
C 93.30% <100.00%> (-0.03%) ⬇️
python 96.30% <100.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
tsinfer/eval_util.py 90.51% <100.00%> (ø)
tsinfer/formats.py 97.49% <100.00%> (-0.10%) ⬇️
tsinfer/inference.py 98.57% <100.00%> (+<0.01%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@benjeffery benjeffery changed the title WIP - decouple SampleData from AncestorData Decouple SampleData from AncestorData Nov 28, 2022
@benjeffery benjeffery marked this pull request as ready for review November 28, 2022 15:16
@benjeffery
Copy link
Member Author

Hmm, tests are passing locally. Looking into it.

@benjeffery
Copy link
Member Author

Not sure why docs are failing either.

@benjeffery benjeffery force-pushed the ancestor-position-only branch 3 times, most recently from ae3c09f to 2630169 Compare November 28, 2022 16:35
@benjeffery
Copy link
Member Author

Rebased onto #780 to fix the Ci issues. Should be good to go now.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -2252,7 +2242,7 @@ def __eq__(self, other):

class AncestorData(DataContainer):
"""
AncestorData(sample_data, *, path=None, num_flush_threads=0, compressor=None, \
AncestorData(position, *, path=None, num_flush_threads=0, compressor=None, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing sequence_length here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@jeromekelleher
Copy link
Member

Can you take a look through please @hyanwong?

@@ -819,7 +823,9 @@ def sim_true_and_inferred_ancestors(args):
sample_data = generate_samples(ts, args.error)

inferred_anc = tsinfer.generate_ancestors(sample_data, engine=args.engine)
true_anc = tsinfer.AncestorData(sample_data)
true_anc = tsinfer.AncestorData(
sample_data.sites_position, sample_data.sequence_length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trivial, but if you want to shorten sample_data to sd (which we do sometimes anyway in the tests), it will make all this fit on a single line. There are a whole set of other examples like this too.

def match_ancestors_ancestors_unfinalised(self, path=None):
with tsinfer.SampleData(sequence_length=2) as sample_data:
sample_data.add_site(1, genotypes=[0, 1, 1, 0], alleles=["G", "C"])
with tsinfer.AncestorData(sample_data, path=path) as ancestor_data:
with tsinfer.AncestorData(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, sd rather than sample_data might make it all more readable (fewer lines)

self._num_alleles = self.sample_data.num_alleles()
position = self.sample_data.sites_position[:]
self.data.attrs["sequence_length"] = sequence_length
if self.sequence_length == 0:
Copy link
Member

@hyanwong hyanwong Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be if self.sequence_length <= 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

with tempfile.TemporaryDirectory(prefix="tsinf_inference_test") as tempdir:
filename = os.path.join(tempdir, "samples.tmp")
self.make_ancestor_data_unfinalised(filename)

def test_match_ancestors_ancestors(self):
self.match_ancestors_ancestors_unfinalised()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we test match_ancestors_ancestors_unfinalised with the path argument anywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, on line 78.

Copy link
Member

@hyanwong hyanwong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all LGTM, but I'd need to work with it for a bit to see what some of the changes imply in terms of workflow (should be the same, I'm guessing). It looks like it's passing the truncate_ancestors tests too, which I wasn't expecting 👍

@mergify mergify bot merged commit b4432e1 into tskit-dev:main Nov 30, 2022
@benjeffery benjeffery deleted the ancestor-position-only branch November 30, 2022 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants