Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"empty" chromosomes should not be in tree sequence, at least sometimes? #205

Closed
petrelharp opened this issue Aug 10, 2021 · 20 comments
Closed
Labels
bug priority trees related to tree-seq, tskit, etc.

Comments

@petrelharp
Copy link
Collaborator

Over in tskit-dev/pyslim#192 (reply in thread) we (me and @elissasoroj) we got tripped up by 'empty' chromosomes being included in the tree sequence. If they are really supposed to be "empty chromosomes", then they should be included, as they are, but what @elissasoroj was using them for in the script was essentially as non-chromosomes. Perhaps there's a less confusing way to arrange all this?

@bhaller
Copy link
Contributor

bhaller commented Aug 10, 2021

Yeah, this definitely deserves to be an issue, but I'm not sure how to handle it. When simulating sex chromosomes, SLiM has a concept of "null chromosomes" that are, as you say, essentially non-chromosomes – placeholders. In other situations, such as this, SLiM has no way to know whether the user's intent is to have a chromosome that contains no mutations, or a null chromosome. There needs to be a way, particularly with addRecombinant(), to express these semantics in distinct ways. That would allow SLiM to do a variety of nice/smart things. I'm not sure what the right path forward is; there are backward compatibility issues to worry about too, although perhaps they are minor.

@bhaller bhaller added the trees related to tree-seq, tskit, etc. label Sep 11, 2021
@bhaller
Copy link
Contributor

bhaller commented Sep 17, 2021

OK, this bug has become high-priority because #208 reveals that not treating null genomes correctly actually leads to inconsistent internal state for certain models involving haploids and overlapping generations. Anybody who is interested can read over there for the gory details. Here, suffice to say that this needs to be fixed immediately because of that. @petrelharp and I have discussed this offline and arrived at a proposal, enumerated below.

SLiM already has a concept of "null genomes" – genome objects that are not just empty, but conceptually don't even exist. When simulating the X chromosome, SLiM uses null genomes as placeholders for the Y in males; when simulating the Y, it uses null genomes as placeholders for the X in both sexes. What we propose is to extend this existing concept to other cases where it applies – in particular, to the second genome of haploids, but maybe there are other cases where this will turn out to apply as well.

Right now, passing (NULL, NULL, NULL) to addRecombinant() for the two parental genomes and the breakpoint vector used to generate one genome of the resulting offspring already conceptually has the semantics that it is creating a "null genome" – it just doesn't actually do so, but creates an empty genome instead. The semantics are already there because in this specific case of (NULL, NULL, NULL), addRecombinant() does not generate new mutations into the resulting genome; instead, it guarantees that the new offspring genome will be completely empty. This policy was explicitly chosen in order to enable easy implementation of haploid models and similar models (e.g., haplodiploids). The right move would have been to actually create a null genome for that case, but I didn't realize that at the time. We propose to fix that error by simply making that semantic policy "official", generating a null genome instead of an empty genome.

If you actually want an empty but non-null genome, you will be allowed to pass (NULL, NULL, 0) to request that. Right now passing that combination of arguments to addRecombinant() is illegal, so this will carve out a new policy with no backward-compatibility drawbacks. (However, we think use of this feature will be very rare; it is provided only for completeness.) This option, unlike the current (NULL, NULL, NULL) functionality, will add new mutations to the resulting offspring genome, since it is conceptually generating the new offspring genome via reproduction of some kind from an implicit wild-type ancestral population.

This shift will break backward compatibility for some models that rely upon the existing (NULL, NULL, NULL) functionality. If the intent is to create an empty non-null genome, they should shift to using (NULL, NULL, 0) instead to do that. If the intent is to create a null genome, conceptually, then all is well – except that the model will now be under SLiM's existing restrictions that disallow any change/inspection/manipulation of null genomes, so model code that previously executed operations over all genomes will now need to be narrowed down to involve only the non-null genomes. This should result in a cleaner model design, where it is more clear which genomes are actually being inspected/modified in a meaningful way, so it should be an improvement, but it will break some existing code.

To minimize the impact of that, a new property will be added to Subpopulation and Individual, named genomesNonNull, that will return a vector containing only those genomes that are non-null. Using this instead of the existing genomes property will probably fix most cases of non-compliant code. Additionally, SLiM will not raise an exception in the specific case of calling removeMutations() on a null genome; that will simply do nothing. That policy is consistent with the fact that you are already allowed to call removeMutations() to remove a mutation that is not, in fact, present in the target genome(s); it just does nothing. We expect that removeMutations(), to remove fixed mutations from haploid models, will be the most common violation of the null-genome restrictions, so carving out this exception to those restrictions will hopefully make this change a bit less painful.

A benefit of these changes is that substitution of fixed mutations will now be automatic in virtually all models, whether haploid, haplodiploid, or anything else. The reason that, e.g., haploid models have needed to explicitly remove/substitute fixed mutations themselves is that SLiM didn't know whether empty genomes were simply empty (in which case a given mutation would not be fixed) or were null (in which case the genome would be irrelevant for evaluation of fixation). Now SLiM will know the difference between these two things, and so it will be able to evaluate fixation correctly for non-diploid models, allowing most uses of removeMutations() to be removed from existing models.

On the tree-seq side, we propose that null genomes should simply be omitted from the tree sequence tables entirely. They exist as placeholders on the SLiM side, for ease of bookkeeping; there is no need for them to exist on the tskit side. This will be a change of policy for sex-chromosome simulations that output tree sequences; and it will also change the tree sequence observed for haploid etc. models. As a consequence, some backward compatibility may be broken and some analysis code may have to change. The upside is that these confusing, cluttering placeholders will no longer be present in the tree sequence, which should make visualization, interpretation, and analysis of such tree sequences much easier.

The existing policy is that substitute=T should only be passed to removeMutations() if the mutations being removed are in fact fixed. Enforcing this has not been possible, however, since SLiM did not know the difference between empty genomes and null genomes. Now it will know the difference, and so we will be able to strictly enforce this policy (which is necessary for the tree sequence to correctly match SLiM's internal state). This may break backward compatibility, but only for models that violated the existing policy. It will have the benefit that new derived states will never need to be recorded upon substitution, whether that substitution is done by SLiM directly or by the user's script with substitute=T, because by definition no derived state in the tree sequence will actually have changed – the substituted mutation was already present in every genome.

Haploid etc. models that are WF models will not change their mode of operation, because there will be no way to express to SLiM the semantic concept that a given genome should be null, rather than empty. Instead, such models will continue to remove mutations from second genomes using removeMutations(). The strict enforcement of substitute=T mentioned in the previous paragraph will not be enforced for WF models, to allow this to work. Ensuring consistency of the model will be up to the model implementor, in that case.

A new method, addHaploid(), may be added as a convenience. It will do the same thing as addRecombinant(), but assuming the values (NULL, NULL, NULL) for (strand3, strand4, breaks2) to ensure the second genome of the offspring is null.

This bug has only survived this long because of multiple failures in the internal consistency-checking code employed by SLiM. Those failure points will be patched up as part of the fix of this bug.

Whoo, that was complicated. It would be great if interested parties would read the above and comment as needed, so that any issues can be anticipated prior to implementation. Thanks! @petrelharp @elissasoroj @jeanrjc @mam737 @dinhe878 @yannickwurm

@petrelharp
Copy link
Collaborator Author

petrelharp commented Sep 17, 2021

Question: will NULL, NULL, 0 add mutations to the new, empty genome? Edit: whoops, you said. The answer is "yes".

If not, then I guess I'd think about making NULL, NULL, NULL give you the empty genome, and just add addHaploid to provide the empty genome functionality. That'd be fully backwards compatible.

@bhaller
Copy link
Contributor

bhaller commented Sep 17, 2021

Indeed, yes. I think the motivation for breaking backward compatibility with addRecombinant() is that, since it already does not add new mutations, it already really expresses the semantics of making a null genome rather than merely an empty genome; it just doesn't follow through on that promise, which seems inconsistent and even simply a bug. But if the break with backward compatibility seems too large, leaving addRecombinant() alone is certainly an option; I just think it would leave us with an ugly inconsistency in the APIs, and now seems like a good time to fix that. :->

@petrelharp
Copy link
Collaborator Author

This does seem a less bad break than starting to add mutations to the empty genomes thus obtained.

@bhaller
Copy link
Contributor

bhaller commented Sep 21, 2021

So, I'm partway through implementing this, and discovered an unexpected consequence. Recipe 16.13 (vanilla nonWF haploid model) no longer functions properly, because it removes mutations itself at a frequency of 0.5. Now that (NULL, NULL, NULL) makes null genomes instead of empty genomes, the frequency calculations in SLiM now automatically do the right thing – fixation in a haploid model now happens at a calculated frequency of 1.0, not 0.5. So the model's code is removing mutations that are only halfway to fixation. This is bad – the model runs, but produces unintended/incorrect results. I think the problem is that I relaxed the "null genome" checks inside removeMutations(), as described above, in an attempt to make it "just work" for models like this. I should leave those null genome checks in place, so that models like this will now error out when they try to remove mutations from null genomes. That will alert the user to the fact that something has changed that has broken backward compatibility, and hopefully they will figure out the right fix.

@bhaller
Copy link
Contributor

bhaller commented Sep 21, 2021

Another update: I decided that (NULL, NULL, 0) to addRecombinant() to request a non-null empty genome is not needed, and will not be implemented. It's just an ugly hack in the API, using the breaks vector to pass in a flag to the method (eww), and it's really not needed. Virtually nobody is expected to even want to do this. If you want an offspring individual with both genomes empty, use addEmpty(). If you really want an offspring individual with one empty (but non-null) genome, make it inherit from an individual in a dummy subpop that you create to represent "wild-type ancestors" or whatever. I.e., make the inheritance pattern reflect biological reality.

@bhaller
Copy link
Contributor

bhaller commented Sep 21, 2021

Other implementation decision: for right now, null genomes will continue to be recorded in the tree sequence. This bug will still be fixed, because empty genomes will be changed to null genomes, as they should be, and then you can filter them out with a simplification if you want to. This bug is really about empty non-null genomes hanging around in the tree sequence, primarily; they are harder to filter out. Null genomes are hard to keep out of the tree sequence because there's a fair bit of code in SLiM that assumes that there are exactly two nodes per individual, with specific IDs (* 2 and * 2 + 1). Once the work I'm doing is done, perhaps we can revisit the idea of keeping null genomes out of the tree sequence, too, but it's an orthogonal issue I think.

@yannickwurm
Copy link

@roddypr

@bhaller
Copy link
Contributor

bhaller commented Sep 21, 2021

@roddypr

Thanks @yannickwurm; I would've tagged him but I couldn't find his username on GitHub.

@bhaller
Copy link
Contributor

bhaller commented Sep 21, 2021

Another implementation decision: I won't be adding an addHaploid() method. It would basically just be a wrapper for addRecombinant() that would assume (NULL, NULL, NULL) for the second triplet of parameters. Since it wouldn't really cut down on the conceptual complexity of addRecombinant() significantly, it doesn't seem worthwhile; just API cruft.

On the other hand, it looks like I will be adding arguments to both addEmpty() and addSubpop() that let you request null genomes in the new individual(s), to create new individuals/populations composed of haploids.

@bhaller
Copy link
Contributor

bhaller commented Sep 22, 2021

I have realized that these changes open up the topic of dominance coefficients. Normally, mutations in SLiM use a dominance coefficient kept by the MutationType when they are heterozygous. Right now, the only use of "null genomes" is in sex-chromosome simulations, where individuals that are XX can be either homozygous or heterozygous for an X-linked mutation they possess, while individuals that are XY can only possess one copy and the Y chromosome is a null genome. This special case, of the XY fitness, is presently controlled with a special X-dominance coefficient kept by SLiMSim, which is the same for all mutation types (representing the degree of dosage compensation exhibited by the organism, I guess).

This strategy was already pretty clunky, and now that null genomes will be present in other cases (e.g., haploids), there is a need for a more general mechanism. In haplodiploid models, or models of alternation of generations, there will similarly be both haploids and diploids present, so a different dominance coefficient is needed to govern fitness in the haploid case (including the XY case, conceptually) than in the diploid case (including the XX case). So I propose to remove the current X-dominance coefficient completely, deliberately breaking compatibility for those who were using it (to push them to the new scheme), and replacing it with a new property on MutationType, rather than SLiMSim, called haploidDominanceCoeff. This would allow this dominance coefficient to be varied by MutationType rather than being one global number, and it would now apply to all cases where a mutation sits opposite a null genome: X- when modeling the X, -Y when modeling the Y, and A- (or -A) when modeling autosomes.

Right now I think haplodiploid models and alternation of generations models often control these dominance effects with fitness() callbacks. That would still be possible, for maximum flexibility, but the hope is that the proposed scheme would make such callbacks unnecessary for many such models.

If anybody tagged on this has any thoughts regarding this proposal, now is the time to comment. :->

It's amazing how you tug on one string and it turns out to be connected to all the other strings, isn't it?

@petrelharp
Copy link
Collaborator Author

It's amazing how you tug on one string and it turns out to be connected to all the other strings, isn't it?

lol, yes. I just hope they don't get all tangled up.

@bhaller
Copy link
Contributor

bhaller commented Sep 28, 2021

OK, I've just committed a fix for this. The diffs are quite complex and touch on a bunch of different areas, but I think I got it right, and it passes all tests on my end. :-> @petrelharp, please run your tests against this new version to confirm that all is well.

As mentioned in the discussion above, the fixes for this issue involved some intentional breaks with backward compatibility, and models involving haploidy will likely need to be revised. @petrelharp @elissasoroj @jeanrjc @mam737 @dinhe878 @yannickwurm please be aware of this. I'm happy to help with script revisions for anybody who is unsure how to proceed; the changes are straightforward but may be non-obvious. :->

@bhaller
Copy link
Contributor

bhaller commented Sep 29, 2021

Also @roddypr

@petrelharp
Copy link
Collaborator Author

All the pyslim and treerec/tests tests passed!

@bhaller
Copy link
Contributor

bhaller commented Oct 1, 2021

A summary of what changed from a user's perspective, for those wanting to adjust their models (sorry for all the verbiage, this was quite a complex change!):

  • WF models, which do not use addRecombinant(), will be unaffected by all of this, so if your models are all WF you can stop reading now.

  • nonWF models that do not pass NULL, NULL, NULL to addRecombinant() with the intention of making an unused genome will be unaffected by all of this, so if your model never does that, you can stop reading now.

  • In nonWF models that call addRecombinant() with NULL, NULL, NULL, this now creates a "null genome" instead of an empty genome. The effect is similar but not the same. A null genome is a placeholder that says "conceptually, there is no genome here at all", whereas an empty genome says "there is a genome here, it just doesn't contain any mutations". Using null genomes makes much more sense, then, for models that involve haploids or sex chromosomes, where you want to say "this haploid really only has one genome" or "this XY male really has only one genome since we're only simulating one of the two sex chromosomes". So if you're using addRecombinant() in this way, your model will automatically shift over to using null genomes, which is a good thing but has backward-compatibility consequences.

  • null genomes cannot be used/accessed; trying to do so will cause an error. So if you try to, say, removeMutations() from a null genome, or anything of that sort, you'll get an error, since conceptually that genome does not even exist. If you can run your model and never get such an error, then you are probably fine.

  • If you're getting an error because you're trying to remove fixed mutations from null genomes, you may be able to simply get rid of that code; SLiM will now correctly detect fixation and do substitution for you, whenever "fixation" is defined as "present in all genomes that are not null genomes". For this to work, you need convertToSubstitution=T to be set on the relevant mutation type(s), as usual in nonWF models. If you were doing your own mutation fixation/removal you probably didn't have convertToSubstitution=Tset; so if you add that, you might be able to get rid of your mutation fixation/removal code altogether.

  • If you're accessing null genomes for some other reason, you probably just need to narrow down which genomes you're using in that code. Instead of getting subpop.genomes, for example, you can now get subpop.genomesNonNull; instead of getting individual.genomes, you can now get individual.genomesNonNull; and in other situations you can use the isNullGenome property of Genome to filter down to only the non-null genomes in a sample or whatever.

  • When you create a subpopulation with addSubpop(), you normally get diploids with two empty genomes. For models where the biological intent is to get haploids (or any other type of individuals with their second genome being a null genome), you can now pass haploid=T to addSubpop() to get exactly that. For models that need even more precise control over which genomes in which initial individuals are null genomes versus empty genomes (perhaps a mix of haploids and diploids in the initial population – this might not apply to anyone reading this), the addEmpty() method now allows that to be specified directly on a per-individual basis; so you could then (a) create a new subpopulation with addSubpop() with size 0, and then (b) fill it up with individuals configured as desired with addEmpty() in a reproduction() callback.

  • The way dominance coefficients for haploids and sex-chromosome models work has shifted. The old dominanceCoeffX property of SLiMSim, which specifically governed the fitness effects of X-linked mutations in XY males, has been removed. Now there is a new property of MutationType, haploidDominanceCoeff, that is used whenever a single copy of a mutation is possessed by an individual and the other genome is a null genome – i.e., both the hemizygous case of XY males when a sex chromosome is being modeled, and the haploid case. If you were using dominanceCoeffX you can probably shift over to using haploidDominanceCoeff on all relevant MutationTypes to get the same effect. If you were using a fitness() callback or fitnessScaling to calculate haploid fitness effects in a model that mixes haploids and diploids, you can probably remove that code and use haploidDominanceCoeff instead (as well as the usual dominanceCoeff property that governs heterozygous fitness effects), and SLiM will do the correct calculations for you, which will be faster and simpler.

  • If you wish to distinguish between null genomes and other genomes on the Python side, pyslim provides a flag for whether a genome is null or not; see the pyslim documentation for details. If you have existing code that simplifies down to exclude empty second genomes, it will probably continue to work, I would guess; it will just be simplifying away null genomes instead of empty genomes.

  • Note that this is all now in the current GitHub head, and will be released when SLiM 3.7 is released. Hopefully none of you will need to make your model code run against both old and new SLiM; I'd suggest just making a clean break with the past, and saying "this model now requires SLiM 3.7". But if you really need to, you can check the SLiM version using version()[1] > 3.699, and do different things for before 3.7 versus 3.7 and later, since all of the changes described here will not work in SLiM 3.6. Or you could provide two versions of your model, one for SLiM 3.6 and earlier, the other for SLiM 3.7 and later.

Hopefully that clarifies the issues. I can send a draft of the SLiM 3.7 manual to anyone who wants it; all of these changes are now documented (you can also see the new doc in a current build of SLiMgui, in the help panel). Again, I'm happy to assist in modifications to SLiM scripts as needed; just send me a model script that runs under SLiMgui and I'll tweak it for you. (I can't help with changes to Python analysis code, though, since I'm not fluent in Python.) Sorry for all this; I strive to maintain backward compatibility, but this is a case where breaking models was really unavoidable, and the new design is markedly better. It's faster, too – nonWF models that use addRecombinant() with NULL, NULL, NULL will often see a substantial speedup (as much as ~1/3 faster!) with these changes. So that's the silver lining. :->

@bhaller
Copy link
Contributor

bhaller commented Oct 1, 2021

Ah, one important addendum to the above:

  • Code that looks at mutation frequencies will notice a change. For example, haploid nonWF models used to see a maximum frequency of 0.5, due to the empty second genomes. Such models will now see a maximum frequency of 1.0, because the null genomes are excluded from the frequency calculation. This is obviously better, but represents a change that breaks backward compatibility.

@petrelharp
Copy link
Collaborator Author

Wow, this was pretty big! It all sounds better, though! Note that once #218 goes in the null chromosomes won't in the tree sequence, though.

@bhaller
Copy link
Contributor

bhaller commented Oct 6, 2021

Wow, this was pretty big! It all sounds better, though! Note that once #218 goes in the null chromosomes won't in the tree sequence, though.

Yes; it's not clear to me whether #218 will go in any time soon, though. As you and I realized, there are substantial obstacles there, and the benefit is not that large since if one doesn't want the null genomes in the tree one can just simplify them all away anyway. So, we'll see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug priority trees related to tree-seq, tskit, etc.
Projects
None yet
Development

No branches or pull requests

3 participants