Suggestion: simulating species tutorial narrative #1
Comments
I like this idea a lot - it targets more toward utility to (new) users than being about stdpopsim itself. That said, I think it would also require more attention to the "but what if my favorite species doesn't have a chromosome-level assembly?" issue we've been wrangling with, which I've currently addressed with hand-waving and references to stdpopsim being "community-driven, hint hint." That could bring us back around to your yaml idea (popsim-consortium/stdpopsim#1020), but I doubt the implementation timeline for that fits the preprint-by-ProbGen timeline. Thus the answer might be a disappointing "tough luck" or a somewhat less disappointing but still unsatisfying "stay tuned." What do you think?
I think it's fine to say that doing simulations of species that don't have well characterised chromosomes (== good assemblies) is problematic. You just have to make some stuff up that seems plausible to you, ultimately 😄 Put it another way: I don't think we need to apologise for the fact that not every species has a high quality assembly. We provide some help and automation for those that do (stdpopsim) but here's some general guidance on what you can do if that's not possible (do you really need to simulate a whole chromosome? does it actually matter for the question you're asking, or can just some unspecified length of genome do the job just as well?)
Yes, I see how that could work, and might not actually take much restructuring from what we have now. I can give that a go later today or tomorrow unless there are other opinions. @gregorgorjanc @izabelcavassim since you've already weighed in on Slack, thoughts?
Sounds good to me @lauterbur! I think you have a very good start, but taking @jeromekelleher's take into account is sensible.
All this sounds good. I think the narrative around good assemblies is that if there isn't a good assembly then you're either (a) just simulating a bajillion little chunks of genome, or (b) simulating an anonymous chromosome, not specific to your species. Both are fine and useful.

In considering what you miss out on by doing these things, it's helpful to think about what we gain by doing whole-chromosome simulations. One thing we gain is the full effects of linked selection - linked sweeps outside the region would reduce diversity and cause more unpredictability. For another example, suppose we simulate with an unrealistically high genome-wide rate of deleterious mutations; in a whole-genome simulation we could see that this leads to biologically unrealistic genetic load, but using many small chunks we might not. Besides realism, by using real genetic maps we possibly gain information, because we can match positions on the real genome to positions on the simulated genome, thus providing a matched comparison of recombination rate and gene density.

And what if you don't have a whole-genome assembly? Well, by drawing the recombination rate and/or gene density in your little chunks of genome from some distribution, you can mimic the distribution of these things across the real chromosome; that might be important for providing good training data for ML models, for instance. Or you could make up a fake chromosome with a fake genetic map in a way that looks real-ish; if you do a good job then the main thing you're missing out on is the ability to match specific parts of the real genome to specific parts of the simulated genome. But it seems almost as good - it just requires more effort in making things up.

So - what about stdpopsim? We could let stdpopsim simulate the bajillion contigs, but (a) we know that's not right because it's missing out on the wider linked selection (although it may be useful), and more crucially (b) it wouldn't be very useful! There wouldn't be much difference at all between a stdpopsim script and a script that does a bunch of msprime simulations with chromosome lengths drawn from some list. stdpopsim is useful because it compresses the substantial work of putting a lot of layers of complex information into a simulation; but if a species doesn't have those many layers of information, it's not that useful, and we might as well just write more vignettes for msprime or something.

Readers might wonder "what about species with a bajillion contigs, but they're annotated?"; then it'd be a pain to simulate the specific locations of the genes on those bajillion contigs. I think the answer there is that you probably shouldn't be worrying about that level of detail. If you haven't got genes located on a whole genome, then the precise location of the genes on unlinked contigs is false precision, and you're better off comparing to simulations of anonymous genomes.

Ok, that's my thoughts there. In all this I think it might help to hang things on a few concrete examples of what one wants to do simulations for.
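For concreteness, here is a minimal sketch of the "bajillion little chunks" option described above: each chunk gets a length picked from a list and a recombination rate drawn from a distribution. The helper names (`draw_chunk_params`, `simulate_chunks`), the log-normal choice, and every numeric value are assumptions made up for illustration, not anything stdpopsim provides or recommends. The parameter drawing uses only the Python standard library; the simulation step assumes msprime is installed and calls `msprime.sim_ancestry`.

```python
import math
import random


def draw_chunk_params(num_chunks, lengths, median_rate=1e-8, sigma=0.5, seed=1):
    """Draw a length and a recombination rate for each anonymous chunk.

    Lengths are picked uniformly from `lengths`; rates are log-normal with
    median `median_rate`. Both choices are illustrative assumptions.
    """
    rng = random.Random(seed)
    params = []
    for _ in range(num_chunks):
        params.append(
            {
                "sequence_length": rng.choice(lengths),
                "recombination_rate": rng.lognormvariate(math.log(median_rate), sigma),
                "random_seed": rng.randrange(1, 2**31),
            }
        )
    return params


def simulate_chunks(params, num_samples=10, population_size=10_000):
    """Run one independent msprime simulation per chunk (requires msprime)."""
    import msprime  # imported lazily so the parameter drawing works without it

    return [
        msprime.sim_ancestry(
            samples=num_samples, population_size=population_size, **chunk
        )
        for chunk in params
    ]


# Hypothetical chunk lengths in base pairs.
chunk_params = draw_chunk_params(num_chunks=5, lengths=[50_000, 100_000, 250_000])
```

Calling `simulate_chunks(chunk_params)` would then return one tree sequence per chunk. Because the chunks are simulated independently, linkage between them is lost, which is exactly the limitation discussed above.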
This is great @petrelharp - I think getting some guidance like the first two paragraphs into the paper is much more useful to everyone than worrying about how we get species with cruddy assemblies into stdpopsim.
Thanks, @petrelharp, this is great. This is along the lines of what I was imagining in response to @jeromekelleher's comments, just a lot better than I would have come up with! (I think you might have just volunteered yourself for that section ;-) I'll recombobulate the outline this afternoon, have some meetings to run to now.)
i'd argue that there is a bit of a middle ground here between "good genomes" and a bajillion contigs. Indeed part of the appeal of stdpopsim is that if your species of interest does have an assembly that is As an aside, i think a good bit of structure for the narrative could come straight from the workshop materials here.
What does the middle-ground assembly look like, concretely?
maybe on the worse side, but e.g. ensembl serves up the present assembly of Octopus bimaculoides: https://www.ebi.ac.uk/ena/browser/view/GCA_001194135.1?show=blobtoolkit 1.51 x 10^5 contigs...
Reopening this issue because of this morning's conversation about the main narrative and threading the needle between generality (guide to popgen simulations) and stdpopsim specificity. Selected entries from this morning's notes:
I would like to make some edits to the manuscript to make sure the text echoes the messages mentioned above. I am suggesting the following section structure, and I prepared a list of edit suggestions as tasks below. Feel free to chime in and provide input before I move things around (@lauterbur @petrelharp and others). I plan to start with the tasks in sections 2-4.
Section 1: Introduction
Section 2: The utility of standardized genome-wide simulations
Section 3: The updated stdpopsim catalog
Section 4: Guidelines for implementing a population genomic simulation
Section 5: Examples from species added to stdpopsim
Section 6: Conclusion and discussion
This sounds good to me - I'd add one thing (or maybe it naturally fits into one of the points of the Intro): something about the goals of stdpopsim, e.g., writing out what we'd like to enable and why, leading into "therefore, we wanted to add a bunch more species...".
Good point. I think that this will fit naturally as the connecting arguments between the 2nd and 3rd points in the introduction.
I think a good way of structuring this paper would be to couch it as a "So, you want to simulate your favourite species, eh? Here's what you need" narrative. The discussion about how we do things in stdpopsim and the lessons we've learned through the hackathon can be incorporated into this narrative, without making the whole thing about stdpopsim and how we maintain its catalog. We can have a section in there where we discuss the current contents of the catalog and the processes that we have for maintaining it. I think this would generally be quite a useful paper, as it seems to me that it's generally not well understood what the components of a realistic simulation model are.
Apologies if this is going over old ground - we can close this issue if there are already strong ideas on the narrative and I missed the memo!