recombination rate and/or tract length too high (error) #373

mshpak76 · 2023-05-08T15:45:54Z

mshpak76
May 8, 2023

I am attempting to simulate selection and migration in large populations (Ne ~ 1e+6) with mutation, recombination/conversion parameters based on estimates for Drosophila. Because of the long run time, I'm using an approximation where I simulate a smaller population size (Ne ~ 1e+4) while multiplying mutation rates, recombination rates, and selection coefficients by 100 to keep the same effective Ns, Nr, Nmu etc.

One problem that I experienced with this is that with a rescaled recombination rate and conversion tract length of:

initializeRecombinationRate(7.34e-6);
initializeGeneConversion(0.8515, 518, 1.0);

I invariably get the following SLiM error:

ERROR (Chromosome::DrawDSBBreakpoints): non-overlapping recombination regions could not be achieved in 100 tries; 
 terminating.  The recombination rate and/or mean gene conversion
 tract length may be too high.

I am not sure what I can do about this. If I reduce the tract size to 100 (e.g. similar to mammalian), I don't get this error, but I would like to simulate a Drosophila-like population as closely as possible. I also would rather not use very large window sizes in order to keep the simulations consistent with some empirical results based on ~ 5kb window size.

What are my options - is there a way to force SLiM to run with overlapping recombination tracts?

bhaller · 2023-05-09T12:58:57Z

bhaller
May 9, 2023
Maintainer

This seems surprising, and may be a bug in SLiM's code. Your parameters – recombination rate 7.34e-6, nonCrossoverFraction 0.8515, and meanLength 518 – seem like they would be unlikely to cause a collision in the generated gene conversion tracts very often at all – maybe sometimes it would have to try 2 or even 3 times, but it says it is trying 100 times and failing. That doesn't make sense to me. What chromosome length are you modeling? Can you post your full model code here, ideally with all extraneous detail pruned away so that it is more or less as simple as possible while still causing this error? Also, what version of SLiM are you using?

0 replies

mshpak76 · 2023-05-09T14:58:11Z

mshpak76
May 9, 2023
Author

Equally surprising is the fact that the problem (largely) goes away with a larger window length, i.e. if I use initializeGenomicElement(g1, 0, 9999) rather than initializeGenomicElement(g1,0,4999) as I have below. Window size shouldn't matter very much in terms of the probability of overlapping conversion tracts, because hile increasing it does lead to conversion events being spaced farther apart on average, it also introduces more of them.

In any case, to answer your questions, I'm running slim version 3.5, and here is an example of the code that throws the overlapping conversion tract errors for me:

initialize() {
initializeTreeSeq();
initializeMutationRate(0);
initializeMutationType("m2", 1.0, "f", 0.0);
initializeGenomicElementType("g1", m2, 1.0);
initializeGenomicElement(g1, 0, 4999);
initializeRecombinationRate(7.34e-6);
initializeGeneConversion(0.851, 518, 1.0);
}

1 early() {defineConstant("simID", getSeed());}
1 early() { sim.addSubpop("p1", 2e+4); }

2000 late()
{
//sim.deregisterScriptBlock(self);
sim.treeSeqOutput("TestConversion.trees");
sim.simulationFinished();
}

0 replies

bhaller · 2023-05-09T16:02:57Z

bhaller
May 9, 2023
Maintainer

OK, here's what's going on. It's not common, but sometimes you get two gene conversion tracts even though the recombination rate is 7.34e-6 and the genome length is 5000 base positions. When you do, SLiM draws lengths for the left and right extent of each of the gene conversion tracts from a geometric distribution with mean 518 (see section 1.5.6 if this is less than completely clear to you). And sometimes those draws just happen to be unusually long. The expected length of each tract is 518x2 = 1036 base positions, so the expected length of two tracts together is 1036x2 = 2072, but for the example I'm looking at in the debugger right now, SLiM drew extents of (1848, 1631) for the first tract and (1412, 97) for the second tract, for a total length of 3479 for the first tract and 1509 for the second tract. This totals to 4988, so it could fit both tracts into the genome, but it would have to draw just the right positions for both; in 100 tries, it fails to do so. In any case, with even slightly worse luck we'd draw lengths that total to more than 5000, and then we have no chance whatsoever of satisfying the constraints. In fact, with sufficiently bad luck we could draw even a single GC tract that is longer than 5000 base positions, so really we don't even need to get two GC tracts for this to happen, it could happen with one.

So. SLiM draws a position for each GC tract, checks whether they are fully inside the genome and whether they overlap, and if there is a problem, it loops back to try again, and if it fails 100 times in a row, it throws the error you're seeing. But here's the key fact: when it loops back, it draws new candidate positions for the two tracts, but it does not redraw their lengths. This is necessary, in order to satisfy the constraints requested by the user. If we draw new lengths whenever we fail to come up with good positions, that will bias the length distribution towards shorter lengths (because longer lengths will tend to have positioning problems more often); so then we will no longer supply GC tracts with a mean length of 518 (per side) or 1036 (total), as requested.

So it's not a bug; it's functioning as designed. In a sense, the problem is simply that you have requested a set of constraints that have a pretty high probability, over the course of a run of this model, of not being satisfied. This is, I think, due to your rescaling. Basically, with your rescaling factor of 100 you're compacting 100 generations worth of gene conversion into a single generation, and have thereby made GC tract collisions much, much more likely to occur. The locus you're simulating (5000 bases) is also pretty short compared to your expected tract length (1036 bases), and so when you get more than one tract in a given gamete, and get unusually high length draws for those tracts, boom.

I can imagine that that answer is less than satisfying to you. :-> I'm not sure how I might change the behavior to fix the problem, though. I'm open to suggestions. I could imagine things like:

Increase the number of tries from 100 to 1000, say; but I think your model will still fail, since it's not that unlikely to draw lengths that are simply impossible to place;
Add a flag to initializeGeneConversion() that says "if you fail 100 times, redraw lengths as well as positions"; this would bias the length distribution downward, as discussed above, so the user would need to understand that and accept the consequences, and it's a rather ugly solution since it's basically a flag saying "violate the constraints I've just given you if they cause a problem" – ugh;
Add some way of limiting the number of GC tracts that are allowed to be drawn – in your model, presumably to 1 – for a given gamete. This would mean that your nonCrossoverFraction of 0.8515 would no longer be satisfied; it would get biased downwards, again, in order for the constraints specified to be satisfied. So, similarly, ugh.

I think the ideal solution is simply for you to choose constraints on the model that are less likely to lead to a violation and an error – probably rescaling less. But I'm open to other ideas. Tagging @philippmesser and @petrelharp in case they have any input on this.

10 replies

bhaller May 9, 2023
Maintainer

Yes, it would appear so. Interesting! :-> Well, let me know if you end up needing me to do something about it.

mshpak76 May 9, 2023
Author

I don't think there's any modification of the code that would avoid the biases we're discussing.
However, I do have a question about the input argument meanLength for initializeGeneConversion(). If I use meanLength = 518 as an input argument, slim will actually generate (on average) a tract length of 518 by taking two pieces with expected length of 259, correct? One of your replies to my query seemed to contradict this and implied that an input argument of 518 corresponds to an expected tract length of 1036 (see below). However, the documentation for initializeGeneConversion() suggests that an input of meanLength = 518 will give me an expected tract length of 518 based on two section of meanLength/2. This is a little confusing, could you please clarify which interpretation is correct?

In other words, does

initializeGeneConversion(0.851, 518, 1.0);

give me an expected tract length of 518 or of 1036?

From your first post on this thread:
And sometimes those draws just happen to be unusually long. The expected length of each tract is 518x2 = 1036 base positions, so the expected length of two tracts together is 1036x2 = 2072

petrelharp May 9, 2023
Collaborator

Ah ha, good sleuthing. Let's see:

This is necessary, in order to satisfy the constraints requested by the user. If we draw new lengths whenever we fail to come up with good positions, that will bias the length distribution towards shorter lengths (because longer lengths will tend to have positioning problems more often); so then we will no longer supply GC tracts with a mean length of 518 (per side) or 1036 (total), as requested.

I agree with this in principle, but I actually think it would be Just Fine to always redraw lengths (and preferable to adding another confusing argument). This is because (a) in many realistic cases (no rescaling, long genome) it should not have any measurable effect; (b) we have such a very vague idea of typical gene conversion lengths anyhow (I know from trying to track down estimates for stdpopsim); (c) it might arguably be more realistic, if maybe GC events that go off the end of a chromosome result in inviable gametes.

Another option might be if the GC tract goes off the end of the chromosome, just truncate it at the end?

bhaller May 9, 2023
Maintainer

...If I use meanLength = 518 as an input argument, slim will actually generate (on average) a tract length of 518 by taking two pieces with expected length of 259, correct?

Ah, yes, this is correct, sorry. The expected total tract length is 518, as stated in the documentation; I was misremembering exactly how this was parameterized.

bhaller May 9, 2023
Maintainer

Ah ha, good sleuthing. Let's see:

This is necessary, in order to satisfy the constraints requested by the user. If we draw new lengths whenever we fail to come up with good positions, that will bias the length distribution towards shorter lengths (because longer lengths will tend to have positioning problems more often); so then we will no longer supply GC tracts with a mean length of 518 (per side) or 1036 (total), as requested.

I agree with this in principle, but I actually think it would be Just Fine to always redraw lengths (and preferable to adding another confusing argument). This is because (a) in many realistic cases (no rescaling, long genome) it should not have any measurable effect; (b) we have such a very vague idea of typical gene conversion lengths anyhow (I know from trying to track down estimates for stdpopsim); (c) it might arguably be more realistic, if maybe GC events that go off the end of a chromosome result in inviable gametes.

Hmm. OK, I'm open to this, I think? @mshpak76 what do you think? I think @petrelharp is right, of course, that for no rescaling and a long genome it would have no measurable effect; for your model, I don't know how large the introduced bias might be. It would be a change in behavior, but I think it's justified to prevent models from erroring. (Right now, with a freakishly large tract length draw, any GC model could error, in principle, although the probability is extremely small.)

Another option might be if the GC tract goes off the end of the chromosome, just truncate it at the end?

I don't think I like that so much; I think that might indeed cause noticeable bias, even for typical models. And it's not clear how you would solve the problem that's happening here, of being unable to draw two tracts that are non-overlapping; you have to make some arbitrary decision (truncate both tracts at the mean of their overlap range, or something). Better to just re-draw.

@mshpak76 if you want me to switch the implementation to redrawing the lengths on failure to lay out, I'm OK with that; Peter has convinced me it's a largely harmless change that would resolve the issue.

mshpak76 · 2023-05-09T19:21:14Z

mshpak76
May 9, 2023
Author

I'm thinking that this would be a useful feature if introduced as an optional input argument, so that users can decide whether they want to use it or to just keep sampling until they find conversion tract boundaries that work (as would be the case in most runs). Thank you, Max

…

On Tue, May 9, 2023 at 1:48 PM Ben Haller ***@***.***> wrote: Ah ha, good sleuthing. Let's see: This is necessary, in order to satisfy the constraints requested by the user. If we draw new lengths whenever we fail to come up with good positions, that will bias the length distribution towards shorter lengths (because longer lengths will tend to have positioning problems more often); so then we will no longer supply GC tracts with a mean length of 518 (per side) or 1036 (total), as requested. I agree with this in principle, but I actually think it would be Just Fine to always redraw lengths (and preferable to adding another confusing argument). This is because (a) in many realistic cases (no rescaling, long genome) it should not have any measurable effect; (b) we have such a very vague idea of typical gene conversion lengths anyhow (I know from trying to track down estimates for stdpopsim); (c) it might arguably be more realistic, if maybe GC events that go off the end of a chromosome result in inviable gametes. Hmm. OK, I'm open to this, I think? @mshpak76 <https://github.com/mshpak76> what do you think? I think @petrelharp <https://github.com/petrelharp> is right, of course, that for no rescaling and a long genome it would have no measurable effect; for *your* model, I don't know how large the introduced bias might be. It would be a change in behavior, but I think it's justified to prevent models from erroring. (Right now, with a freakishly large tract length draw, *any* GC model could error, in principle, although the probability is extremely small.) Another option might be if the GC tract goes off the end of the chromosome, just truncate it at the end? I don't think I like that so much; I think that might indeed cause noticeable bias, even for typical models. And it's not clear how you would solve the problem that's happening here, of being unable to draw two tracts that are non-overlapping; you have to make some arbitrary decision (truncate both tracts at the mean of their overlap range, or something). Better to just re-draw. @mshpak76 <https://github.com/mshpak76> if you want me to switch the implementation to redrawing the lengths on failure to lay out, I'm OK with that; Peter has convinced me it's a largely harmless change that would resolve the issue. — Reply to this email directly, view it on GitHub <#373 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AUZMZCNLT2VCQ6PLVY7OJH3XFKGQFANCNFSM6AAAAAAX2DX35Y> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- ======================= Max Shpak, Ph.D. Department of Genetics University of Wisconsin Madison, WI 53706

7 replies

bhaller May 9, 2023
Maintainer

Might work, yes, at least if you throw away the runs that still errored out. I'm optimistic that I might get to this feature request tomorrow, though. Up to you. :->

petrelharp May 10, 2023
Collaborator

My vote is actually to not complicate things, so have no argument and just change the behavior.

And, I agree that this would be a fine quick fix.

bhaller May 10, 2023
Maintainer

My vote is actually to not complicate things, so have no argument and just change the behavior.

And, I agree that this would be a fine quick fix.

The ship has sailed, the argument was added. It has a default of F, providing the old/conservative behavior, so that avoids breaking backward reproducibility for people's existing models, which is nice. It does complicate things slightly, but nobody will notice it unless they're actually reading the doc for initializeGeneConversion(), so I think it's ok. :->

petrelharp May 10, 2023
Collaborator

okay! I'd be less bothered if the docs in the GUI were easier to parse, but that's another issue...

bhaller May 10, 2023
Maintainer

okay! I'd be less bothered if the docs in the GUI were easier to parse, but that's another issue...

Easier to parse in what sense? It is indeed another issue – have you filed an issue? :->

bhaller · 2023-05-10T18:00:01Z

bhaller
May 10, 2023
Maintainer

OK, the GitHub head now has a new optional parameter to initializeGeneConversion(), [l$ redrawLengthsOnFailure=F]. If you set that to T, it redraws lengths as well as positions on a GC tract layout failure. Your example model above now runs to completion without errors, at least the couple of times I tried it. Of course it is still possible to come up with a model with such stringent parameters for GC that tract layout still fails; at that point, different parameters would need to be chosen. :->

So, closing this discussion now. Thanks for the report, @mshpak76; I think this is a useful, if rather "edge case", addition. Please let me know if you encounter any further difficulties.

0 replies

mshpak76 · 2023-05-10T18:25:04Z

mshpak76
May 10, 2023
Author

Thank you for adding this option, I will test the code with the redrawLengthsOnFailure later today or tomorrow .

…

On Wed, May 10, 2023 at 1:00 PM Ben Haller ***@***.***> wrote: OK, the GitHub head now has a new optional parameter to initializeGeneConversion(), [l$ redrawLengthsOnFailure=F]. If you set that to T, it redraws lengths as well as positions on a GC tract layout failure. Your example model above now runs to completion without errors, at least the couple of times I tried it. Of course it is still possible to come up with a model with such stringent parameters for GC that tract layout still fails; at that point, different parameters would need to be chosen. :-> So, closing this discussion now. Thanks for the report, @mshpak76 <https://github.com/mshpak76>; I think this is a useful, if rather "edge case", addition. Please let me know if you encounter any further difficulties. — Reply to this email directly, view it on GitHub <#373 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AUZMZCP2LGVFIVFRPCVUNFLXFPJSXANCNFSM6AAAAAAX2DX35Y> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- ======================= Max Shpak, Ph.D. Department of Genetics University of Wisconsin Madison, WI 53706

0 replies

mshpak76 · 2023-05-10T22:41:33Z

mshpak76
May 10, 2023
Author

Do I need to reinstall slim on all servers and workstations, or can the new version of the initializeGeneConversion() function be imported?

…

On Wed, May 10, 2023 at 3:02 PM Ben Haller ***@***.***> wrote: My vote is actually to *not* complicate things, so have no argument and just change the behavior. And, I agree that this would be a fine quick fix. The ship has sailed, the argument was added. It has a default of F, providing the old/conservative behavior, so that avoids breaking backward reproducibility for people's existing models, which is nice. It does complicate things slightly, but nobody will notice it unless they're actually reading the doc for initializeGeneConversion(), so I think it's ok. :-> — Reply to this email directly, view it on GitHub <#373 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AUZMZCNK74SZPV34ETR4423XFPX53ANCNFSM6AAAAAAX2DX35Y> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- ======================= Max Shpak, Ph.D. Department of Genetics University of Wisconsin Madison, WI 53706

1 reply

bhaller May 10, 2023
Maintainer

Do I need to reinstall slim on all servers and workstations, or can the new version of the initializeGeneConversion() function be imported?

I'm not sure what you mean by "imported". You will want to build SLiM from the current GitHub head, on any machine where you want this new functionality.

mshpak76 · 2023-05-11T15:16:36Z

mshpak76
May 11, 2023
Author

Just an additional observation: I'm not sure if I understand the reasoning behind not allowing a conversion tract to span regions outside of the simulated genomic window. In practice, if we partition a chromosome into e.g. 5 kb windows, there will often be conversion tracts at edges spanning more than a single window. Conversely, the requirement that conversion tracts are contained in a window creates an artifact of increasing the density of conversion tracts in the center of a window and decreasing them at the periphery, especially for short windows. What problems would relaxing this "edge effect" requirement cause? Thanks, Max

…

On Wed, May 10, 2023 at 5:45 PM Ben Haller ***@***.***> wrote: Do I need to reinstall slim on all servers and workstations, or can the new version of the initializeGeneConversion() function be imported? I'm not sure what you mean by "imported". You will want to build SLiM from the current GitHub head, on any machine where you want this new functionality. — Reply to this email directly, view it on GitHub <#373 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AUZMZCKV6S2XDZZQ5VHRBBDXFQLB5ANCNFSM6AAAAAAX2DX35Y> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- ======================= Max Shpak, Ph.D. Department of Genetics University of Wisconsin Madison, WI 53706

1 reply

bhaller May 11, 2023
Maintainer

Hi @mshpak76. It's a good question. SLiM doesn't know that you're only modeling a window; it thinks you're modeling a whole chromosome of the given length. (Maybe there should be a way to communicate that semantic distinction to SLiM, but at present there is not.) I don't know whether it makes biological sense, mechanistically, for a GC tract to start/end at the start/end of an entire chromosome; maybe it does make sense, I don't know. On top of those concerns, there is also a comment in the existing code that starting a tract at the very start of the chromosome causes a problem with tree-sequence recording, at present; I'm not sure why. So... it's complicated. I absolutely see your point, but changing that behavior would be substantially more complex and would require significant thought.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

recombination rate and/or tract length too high (error) #373

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 8 comments 19 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

recombination rate and/or tract length too high (error) #373

mshpak76 May 8, 2023

Replies: 8 comments · 19 replies

bhaller May 9, 2023 Maintainer

mshpak76 May 9, 2023 Author

bhaller May 9, 2023 Maintainer

bhaller May 9, 2023 Maintainer

mshpak76 May 9, 2023 Author

petrelharp May 9, 2023 Collaborator

bhaller May 9, 2023 Maintainer

bhaller May 9, 2023 Maintainer

mshpak76 May 9, 2023 Author

bhaller May 9, 2023 Maintainer

petrelharp May 10, 2023 Collaborator

bhaller May 10, 2023 Maintainer

petrelharp May 10, 2023 Collaborator

bhaller May 10, 2023 Maintainer

bhaller May 10, 2023 Maintainer

mshpak76 May 10, 2023 Author

mshpak76 May 10, 2023 Author

bhaller May 10, 2023 Maintainer

mshpak76 May 11, 2023 Author

bhaller May 11, 2023 Maintainer

mshpak76
May 8, 2023

Replies: 8 comments 19 replies

bhaller
May 9, 2023
Maintainer

mshpak76
May 9, 2023
Author

bhaller
May 9, 2023
Maintainer

bhaller May 9, 2023
Maintainer

mshpak76 May 9, 2023
Author

petrelharp May 9, 2023
Collaborator

bhaller May 9, 2023
Maintainer

bhaller May 9, 2023
Maintainer

mshpak76
May 9, 2023
Author

bhaller May 9, 2023
Maintainer

petrelharp May 10, 2023
Collaborator

bhaller May 10, 2023
Maintainer

petrelharp May 10, 2023
Collaborator

bhaller May 10, 2023
Maintainer

bhaller
May 10, 2023
Maintainer

mshpak76
May 10, 2023
Author

mshpak76
May 10, 2023
Author

bhaller May 10, 2023
Maintainer

mshpak76
May 11, 2023
Author

bhaller May 11, 2023
Maintainer