-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decouple SampleData from AncestorData #779
Conversation
8d0c30c
to
398061c
Compare
Codecov Report
@@ Coverage Diff @@
## main #779 +/- ##
==========================================
- Coverage 93.33% 93.30% -0.03%
==========================================
Files 17 17
Lines 5581 5571 -10
Branches 991 990 -1
==========================================
- Hits 5209 5198 -11
Misses 246 246
- Partials 126 127 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
398061c
to
8a27ef4
Compare
8a27ef4
to
215ef34
Compare
Hmm, tests are passing locally. Looking into it. |
215ef34
to
600d4a6
Compare
Not sure why docs are failing either. |
ae3c09f
to
2630169
Compare
Rebased onto #780 to fix the Ci issues. Should be good to go now. |
2630169
to
201afa3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
tsinfer/formats.py
Outdated
@@ -2252,7 +2242,7 @@ def __eq__(self, other): | |||
|
|||
class AncestorData(DataContainer): | |||
""" | |||
AncestorData(sample_data, *, path=None, num_flush_threads=0, compressor=None, \ | |||
AncestorData(position, *, path=None, num_flush_threads=0, compressor=None, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing sequence_length
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Can you take a look through please @hyanwong? |
@@ -819,7 +823,9 @@ def sim_true_and_inferred_ancestors(args): | |||
sample_data = generate_samples(ts, args.error) | |||
|
|||
inferred_anc = tsinfer.generate_ancestors(sample_data, engine=args.engine) | |||
true_anc = tsinfer.AncestorData(sample_data) | |||
true_anc = tsinfer.AncestorData( | |||
sample_data.sites_position, sample_data.sequence_length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trivial, but if you want to shorten sample_data
to sd
(which we do sometimes anyway in the tests), it will make all this fit on a single line. There are a whole set of other examples like this too.
def match_ancestors_ancestors_unfinalised(self, path=None): | ||
with tsinfer.SampleData(sequence_length=2) as sample_data: | ||
sample_data.add_site(1, genotypes=[0, 1, 1, 0], alleles=["G", "C"]) | ||
with tsinfer.AncestorData(sample_data, path=path) as ancestor_data: | ||
with tsinfer.AncestorData( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, sd
rather than sample_data
might make it all more readable (fewer lines)
tsinfer/formats.py
Outdated
self._num_alleles = self.sample_data.num_alleles() | ||
position = self.sample_data.sites_position[:] | ||
self.data.attrs["sequence_length"] = sequence_length | ||
if self.sequence_length == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be if self.sequence_length <= 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
with tempfile.TemporaryDirectory(prefix="tsinf_inference_test") as tempdir: | ||
filename = os.path.join(tempdir, "samples.tmp") | ||
self.make_ancestor_data_unfinalised(filename) | ||
|
||
def test_match_ancestors_ancestors(self): | ||
self.match_ancestors_ancestors_unfinalised() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we test match_ancestors_ancestors_unfinalised
with the path
argument anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, on line 78.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all LGTM, but I'd need to work with it for a bit to see what some of the changes imply in terms of workflow (should be the same, I'm guessing). It looks like it's passing the truncate_ancestors
tests too, which I wasn't expecting 👍
201afa3
to
c7f62d5
Compare
I thought this best done as a separate PR from the sgkit ancestors work.
This ended up a bit more invasive than I had hoped. It looks like
ancestor_data.sequence_length
is used in quite a few places - mostly for making tree sequence tables of the ancestors, so I've also passed that through too, thought it might be possible to usemax(position)
? The main changes are ingenerate_ancestors
andAncestorsGenerator
as theAncestorData
can no longer be made ahead of time, and has to be made once the sites are known.