Suggestion: simulating species tutorial narrative #1

Closed
jeromekelleher opened this issue Feb 24, 2022 · 14 comments

Comments

@jeromekelleher
Member

I think a good way of structuring this paper would be to couch it as a "So, you want to simulate your favourite species, eh? Here's what you need" narrative. The discussion about how we do things in stdpopsim and the lessons we've learned through the hackathon can be incorporated into this narrative, without making the whole thing about stdpopsim and how we maintain its catalog. We can have a section in there where we discuss the current contents of the catalog and the processes that we have for maintaining it. I think this would generally be quite a useful paper, as it seems to me that it's generally not well understood what the components of a realistic simulation model are.

Apologies if this is going over old ground - we can close this issue if there are already strong ideas on the narrative and I missed the memo!

@lauterbur
Collaborator

I like this idea a lot - it targets more toward utility to (new) users than being about stdpopsim itself. That said, I think it would also require more attention to the "but what if my favorite species doesn't have a chromosome-level assembly?" issue we've been wrangling with, that I've currently addressed with hand-waving and references to stdpopsim being "community-driven, hint hint." That could bring us back around to your yaml idea (popsim-consortium/stdpopsim#1020) but I doubt the implementation timeline for that fits the preprint by ProbGen timeline. Thus the answer might be a disappointing "tough luck" or a somewhat less disappointing but still unsatisfying "stay tuned."

What do you think?

@jeromekelleher
Member Author

I think it's fine to say that doing simulations of species that don't have well characterised chromosomes (== good assemblies) is problematic. You just have to make some stuff up that seems plausible to you, ultimately 😄

Put it another way: I don't think we need to apologise for the fact that not every species has a high quality assembly. We provide some help and automation for those that do (stdpopsim) but here's some general guidance on what you can do if that's not possible (do you really need to simulate a whole chromosome? does it actually matter for the question you're asking, or can just some unspecified length of genome do the job just as well?)

@lauterbur
Collaborator

Yes, I see how that could work, and might not actually take much restructuring from what we have now. I can give that a go later today or tomorrow unless there are other opinions.

@gregorgorjanc @izabelcavassim since you've already weighed in on Slack, thoughts?

@gregorgorjanc
Contributor

Sounds good to me @lauterbur! I think you have a very good start, but taking @jeromekelleher's take into account is sensible.

@petrelharp
Contributor

All this sounds good. I think the narrative around good assemblies is that if there isn't a good assembly then you're either (a) just simulating a bajillion little chunks of genome, or (b) simulating an anonymous chromosome, not specific to your species. Both are fine and useful. In considering what you miss out on by doing these things it's helpful to think about what we gain by doing whole-chromosome simulations. One thing we gain is the full effects of linked selection - linked sweeps outside the region would reduce diversity and cause more unpredictability. For another example, suppose we simulate with an unrealistically high genome-wide rate of deleterious mutations; in a whole-genome simulation we could see this leads to biologically unrealistic genetic load, but using many small chunks we might not. Besides realism, by using real genetic maps we possibly gain information, because we can match positions on the real genome to positions on the simulated genome, thus providing a matched comparison of recombination rate and gene density.
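
As a back-of-the-envelope illustration of the genetic load point, one can use the classical mutation-selection balance result (Haldane's principle): assuming multiplicative fitness effects and mutations that are not fully recessive, equilibrium mean fitness is approximately exp(-U), where U is the deleterious mutation rate per diploid genome per generation. The sketch below is not from this thread; the function name and all numeric values (a per-bp deleterious rate of 1e-8, a 100 kb chunk, a 3 Gb genome) are made-up assumptions chosen only to show the contrast.

```python
import math


def equilibrium_mean_fitness(del_rate_per_bp, length_bp):
    """Approximate mean fitness at mutation-selection balance.

    Assumes multiplicative fitness and non-recessive deleterious
    mutations, so mean fitness ~ exp(-U) with U the diploid
    deleterious mutation rate over the simulated region.
    """
    diploid_rate = 2 * del_rate_per_bp * length_bp  # U
    return math.exp(-diploid_rate)


# Hypothetical, deliberately unrealistic rate: every new mutation is deleterious.
DEL_RATE_PER_BP = 1e-8

chunk_fitness = equilibrium_mean_fitness(DEL_RATE_PER_BP, 100_000)         # one 100 kb chunk
genome_fitness = equilibrium_mean_fitness(DEL_RATE_PER_BP, 3_000_000_000)  # a 3 Gb genome

print(f"100 kb chunk: mean fitness ~ {chunk_fitness:.4f}")
print(f"3 Gb genome:  mean fitness ~ {genome_fitness:.3e}")
```

With these made-up values the single chunk looks almost unaffected, while the same per-bp rate applied genome-wide implies a mean fitness that is effectively zero, which is the kind of problem that only shows up at whole-genome scale.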

And, what if you don't have a whole-genome assembly? Well, by drawing the recombination rate and/or gene density in your little chunks of genome from some distribution you can mimic the distribution of these things across the real chromosome; that might be important for providing good training data for ML models, for instance. Or you could make up a fake chromosome with a fake genetic map in a way that looks real-ish; if you do a good job then the main thing you're missing out on is the ability to match specific parts of the real genome to specific parts of the simulated genome. But, it seems almost as good - it just requires more effort in making things up.
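
The "draw it from some distribution" idea for many small chunks can be sketched as follows. This is a minimal illustration using only the Python standard library, not something stdpopsim or msprime provides: the function name, the log-normal choice, the mean rate, the spread, and the list of chunk lengths are all hypothetical assumptions. In practice you would fit or choose the distribution based on whatever is known for your species or a close relative.

```python
import math
import random


def draw_chunk_parameters(num_chunks, mean_rate=1e-8, sigma=0.5,
                          chunk_lengths=(50_000, 100_000, 200_000), seed=1):
    """Draw a length and a recombination rate for each anonymous chunk.

    Rates are log-normal with arithmetic mean `mean_rate`; all default
    values here are placeholders, not estimates for any real species.
    """
    rng = random.Random(seed)
    # For a log-normal, mean = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean_rate) - sigma ** 2 / 2
    chunks = []
    for _ in range(num_chunks):
        chunks.append({
            "length": rng.choice(chunk_lengths),
            "recombination_rate": rng.lognormvariate(mu, sigma),
        })
    return chunks


if __name__ == "__main__":
    for chunk in draw_chunk_parameters(5):
        print(chunk)
```

Each dictionary could then be handed to a simulator, for example as the `sequence_length` and `recombination_rate` arguments of `msprime.sim_ancestry`, with one independent simulation per chunk.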

So - what about stdpopsim? We could let stdpopsim simulate the bajillion contigs, but (a) we know that's not right because it's missing out on the wider linked selection (although it may be useful), and more crucially (b) it wouldn't be very useful! There wouldn't be much difference at all between a stdpopsim script and a script that does a bunch of msprime simulations with chromosome lengths drawn from some list. stdpopsim is useful because it compresses the substantial work of putting a lot of layers of complex information into a simulation; but if a species doesn't have those many layers of information, it's not that useful, and we might as well just write more vignettes for msprime or something.

Readers might wonder "what about species with a bajillion contigs but they're annotated"; then it'd be a pain to simulate the specific locations of the genes on those bajillion contigs. I think the answer there is that you probably shouldn't be worrying about that level of detail. If you haven't got genes located on a whole genome, then the precise location of the genes on unlinked contigs is false precision, and you're better off comparing to simulations of anonymous genomes.

Ok, that's my thoughts there. In all this I think it might help to hang things on a few concrete examples of what one wants to do simulations for.

@jeromekelleher
Member Author

This is great @petrelharp - I think getting some guidance like the first two paragraphs into the paper is much more useful to everyone than worrying about how we get species with cruddy assemblies into stdpopsim.

@lauterbur
Collaborator

lauterbur commented Feb 24, 2022

Thanks, @petrelharp, this is great. This is along the lines of what I was imagining in response to @jeromekelleher's comments, just a lot better than I would have come up with! (I think you might have just volunteered yourself for that section ;-) I'll recombobulate the outline this afternoon; I have some meetings to run to now.)

@andrewkern
Member

I'd argue that there is a bit of a middle ground here between "good genomes" and a bajillion contigs, and that in the coming years we will see an explosion in these sorts of assemblies. For instance the Darwin Tree of Life Project is going to be handing down thousands of genome assemblies and annotations. I imagine those will be served up in a systematic way that we at stdpopsim could in all likelihood easily digest.

Indeed part of the appeal of stdpopsim is that if your species of interest does have an assembly that is hosted by e.g. Ensembl, then we have the tools to semi-automatically include it, and this has ramifications for the data dump that is coming.

As an aside, i think a good bit of structure for the narrative could come straight from the workshop materials here.

@petrelharp
Contributor

What's the middle ground assembly look like, concretely?

@andrewkern
Member

maybe on the worse side, but e.g. ensembl serves up the present assembly of Octopus bimaculoides: https://www.ebi.ac.uk/ena/browser/view/GCA_001194135.1?show=blobtoolkit

1.51 x 10^5 contigs...

@lauterbur
Collaborator

Reopening this issue because of this morning's conversation about the main narrative and threading the needle between generality (guide to popgen simulations) and stdpopsim specificity.

Selected entries from this morning's notes:
Andy: Generally, have to be careful about how we pitch this paper. Tried to set tone via title. His view is paper showcases resource and provides cookbook. The further we get away from stdpopsim, the thinner the ice. A reviewer might not be impressed if they view it as just guidelines for simulation.
Peter: I thought about it as lessons from stdpopsim, but don't want takeaway to be "everybody add to stdpopsim".
Ilan: Lessons learned from stdpopsim as main theme. Get stdpopsim in there, but be broader. Summarize lessons learned, and tie recipe to stdpopsim. Emphasize requirements to be in stdpopsim, so we don't get an explosion of weak entries in catalog.
https://popsimgroup.slack.com/archives/CARMCLPS6/p1651593855025819

@lauterbur reopened this May 3, 2022

@igronau
Contributor

igronau commented May 6, 2022

I would like to make some edits to the manuscript to make sure the text echoes the messages mentioned above. I am suggesting the following section structure, and I prepared a list of edit suggestions as tasks below. Feel free to chime in and provide input before I move things around (@lauterbur @petrelharp and others). I plan to start with the tasks in sections 2-4.

Section 1: Introduction
The structure of the current intro is very good, but there are a few things that I think will improve the flow:

  • Mention simulations early on. This is the main topic here and should be introduced as early as possible. I'll see if I can mention simulations already in the first paragraph.
  • Mention stdpopsim early on. Again, for a central concept to the paper, we probably want this to be mentioned in the 2nd paragraph of the introduction. I'll try to mention this before the detailed discussion about complex considerations in simulations.
  • Add a bit more detail about our active attempts to expand since the initial publication in 2020: workshops + hackathon. Currently, this is only mentioned in passing. We can probably move details from the later sections.
  • End with the final paragraph, which states the objective of this paper:
    1. describing our expansion efforts of stdpopsim
    2. providing a list of guidelines for generating "good" population genomic simulations.
  • Mention that these guidelines can be used when adding a species to stdpopsim, but can also be used in stand-alone simulations.
  • Note: I'm not really sure about the current second-to-last paragraph. It contains long and fairly vague statements and I think we can probably merge it with the last paragraph.

Section 2: The utility of standardized genome-wide simulations
The main purpose of this section is to set the stage by describing what is involved in the process of defining a population genomic simulation, the types of decisions people have to make (high level), common pitfalls, and the benefits of the standardized approach we promote in stdpopsim.

  • Start by introducing the main set of parameters required for simulations. Move the high-level description here from the "Implementing a population genomic simulation" section. I think that this has to be introduced early on to give the appropriate context. This should be covered in one paragraph or a short itemized list, without the low-level details described in "Implementing a population genomic simulation". Those details will be given in Section 4 below.
  • Discuss typical issues with "ad-hoc" simulations people tend to do in their studies. This is mostly based on "On the utility of working together". Cover these issues:
    1. Because the simulation process is complicated, people typically choose to simulate the simplest scenario that addresses their question of interest.
    2. Simulations from different studies are difficult to compare, even when they are based on the same sources, because of specific (arbitrary) choices that different people make.
    3. Simulations are error-prone.
  • Discuss the utility of chromosome-level simulations. Due to the need to simplify simulations, most studies rely on simulations of short sequences, but we would like to promote simulations of whole-chromosome data. This part will be based on "On the utility of simulating the whole genome". The current version looks a bit too long. I'll try to summarize all the main points in 2-3 compelling paragraphs.

Section 3: The updated stdpopsim catalog

  • This section will describe our expansion efforts since the 2020 publication, including workshops and hackathon. It'll describe the species that were added, etc. Most of this is currently covered by the first two paragraphs in the Application section and a bit in the "What about species lacking chromosome-level assemblies?" section. I think that it'll be nice to present this before the simulation guidelines we provide in the next section, to give context as to what led us to these guidelines.

Section 4: Guidelines for implementing a population genomic simulation
This is pretty much what's currently under "Implementing a population genomic simulation" with some adjustments due to the other shifts I suggested above, especially in Section 1.

  • Make sure to strike a good balance between providing general guidelines and promoting expansion of stdpopsim.

Section 5: Examples from species added to stdpopsim
I think that the two examples (cow and mosquito) are best described in the same context, despite their different nature and purpose. Alternatively, we can move the mosquito example to a more technical appendix with step-by-step instructions on adding species.

  • First sub-section describes the considerations made for the cow species. We use this species example to demonstrate some of the decisions made in the process. We don't dive into the technical steps of adding a species to the stdpopsim repository.
  • Second sub-section describes addition of the mosquito. Here, the model is quite simple, and we emphasize the technical steps involved in adding the species to the catalog.

Section 6: Conclusion and discussion

  • Will revisit this section once the rest is in place.

@petrelharp
Contributor

This sounds good to me - I'd add one thing (or maybe it naturally fits into one of the points of the Intro): something about the goals of stdpopsim, e.g., writing out what we'd like to enable and why, leading into "therefore, we wanted to add a bunch more species...".

@igronau
Contributor

igronau commented May 7, 2022

Good point. I think that this will fit naturally as the connecting arguments between the 2nd and 3rd points in the introduction.
