You are now in the regens-analysis repository. Click here if you want to go back to regens.
REGENS repeats the following process for each chromosome. Each chromosome that REGENS simulates begins as a set of SNPs without genotypes, which is demarcated into segments by breakpoints. The user selects the number of breakpoints per chromosome as one of REGENS' input arguments, and then that many breakpoint positions are drawn from the empirical distribution of recombination event positions. This empirical distribution is computed via equation 2 in the REGENS manuscript, where P(Ri = 1) is computed for the ith recombination interval by feeding its recombination rate (computed from the recombination map) into Haldane's map function. Once an empty chromosome is segmented by breakpoints, the row indices of the whole-genome bed file from a real dataset are duplicated so that 1) there is one real individual for each empty segment and 2) every real individual is selected an equal number of times (minus 1 for each remainder sample if the number of segments is not divisible by the number of individuals). Then, for each empty segment, a whole chromosome is randomly selected without replacement from the set of autosomal genotypes that correspond to the duplicated indices, and the empty simulated segment is filled with the homologous segment from the sampled real chromosome. These steps are repeated for every empty simulated segment in every chromosome so that all of the empty simulated genomes are filled with real SNP values. This quasirandom selection of individuals minimizes minor allele frequency (MAF) variation between the simulated and real datasets, while randomizing segment selection maintains normal population-level genetic variability.
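The balanced donor selection and segment filling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not REGENS' actual code: the function names `balanced_donor_indices` and `fill_segments`, the array layouts, and the rule for handing out the remainder slots are all assumptions made for the example, and the breakpoints in the demo are drawn uniformly only as a placeholder.

```python
import numpy as np


def balanced_donor_indices(n_real, n_segments, rng):
    """Return n_segments donor row indices, in random order, in which every
    real individual appears floor(n_segments / n_real) times or once more."""
    full_copies, remainder = divmod(n_segments, n_real)
    pool = np.tile(np.arange(n_real), full_copies)
    # Assumed tie-breaking rule: leftover slots go to distinct random individuals.
    extra = rng.choice(n_real, size=remainder, replace=False)
    pool = np.concatenate([pool, extra])
    # Permuting the pool is the same as drawing all of it without replacement.
    return rng.permutation(pool)


def fill_segments(real_genotypes, breakpoints, donors):
    """Fill one simulated chromosome per row of `donors`.

    real_genotypes: (n_real, n_snps) genotype matrix for one chromosome.
    breakpoints:    (n_sim, k) sorted integer SNP indices per simulated individual.
    donors:         (n_sim, k + 1) real-individual row index for each segment.
    """
    n_sim, n_snps = donors.shape[0], real_genotypes.shape[1]
    simulated = np.empty((n_sim, n_snps), dtype=real_genotypes.dtype)
    for i in range(n_sim):
        edges = np.concatenate(([0], breakpoints[i], [n_snps]))
        for s in range(len(edges) - 1):
            start, stop = edges[s], edges[s + 1]
            # Copy the homologous segment from the sampled real chromosome.
            simulated[i, start:stop] = real_genotypes[donors[i, s], start:stop]
    return simulated


# Toy demo with made-up sizes; real breakpoints would come from the
# recombination-based distribution rather than a uniform draw.
rng = np.random.default_rng(0)
n_real, n_sim, n_snps, n_breakpoints = 5, 4, 12, 2
real = rng.integers(0, 3, size=(n_real, n_snps))
breakpoints = np.sort(rng.integers(1, n_snps, size=(n_sim, n_breakpoints)), axis=1)
donors = balanced_donor_indices(n_real, n_sim * (n_breakpoints + 1), rng)
donors = donors.reshape(n_sim, n_breakpoints + 1)
simulated = fill_segments(real, breakpoints, donors)
```

Because each real individual contributes nearly the same number of segments, allele counts in the simulated data stay close to those in the real data, which is the MAF-preserving property the paragraph describes.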
about the recombination maps (input that we provided) 🦃
REGENS converts output recombination rate maps from pyrho (which correspond one-to-one to the twenty-six 1000 Genomes populations) into probabilities of drawing each simulated breakpoint at a specific genomic location. It is also possible to simulate GWAS data from a custom plink (bed, bim, fam) fileset, a custom recombination rate map, or both. Note that recombination rate maps between populations within a superpopulation (e.g. British and Italian) have Pearson correlation coefficients of roughly 0.9 (see figure 2B of the pyrho paper), so if a genotype dataset has no recombination rate map for the exact population, then the map for a closely related population should suffice.
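As a rough illustration of how a recombination rate map can be turned into breakpoint probabilities, the sketch below applies Haldane's map function, r = 0.5 * (1 - exp(-2d)), to each interval. Several things here are assumptions rather than REGENS' documented behavior: that rates are per base pair per generation, that the per-interval probabilities are simply normalized to sum to one (the exact form of equation 2 is in the manuscript), that breakpoints are drawn with replacement, and the function names and toy map values themselves.

```python
import numpy as np


def haldane(d_morgans):
    """Haldane's map function: genetic distance in Morgans -> recombination probability."""
    return 0.5 * (1.0 - np.exp(-2.0 * np.asarray(d_morgans, dtype=float)))


def breakpoint_probabilities(positions_bp, rates_per_bp):
    """Probability of placing a breakpoint in each interval of the map.

    positions_bp: n + 1 interval boundaries in base pairs.
    rates_per_bp: n recombination rates (assumed per bp per generation).
    """
    lengths = np.diff(np.asarray(positions_bp, dtype=float))
    d = np.asarray(rates_per_bp, dtype=float) * lengths  # genetic distance in Morgans
    p_recomb = haldane(d)
    # Assumed normalization so the interval probabilities form a distribution.
    return p_recomb / p_recomb.sum()


def draw_breakpoints(probs, n_breakpoints, rng):
    """Draw sorted interval indices for the requested number of breakpoints."""
    return np.sort(rng.choice(len(probs), size=n_breakpoints, replace=True, p=probs))


# Made-up three-interval map with a recombination hotspot in the middle.
positions = [0, 1_000, 5_000, 6_000]
rates = [1e-8, 5e-8, 1e-8]
probs = breakpoint_probabilities(positions, rates)
drawn = draw_breakpoints(probs, n_breakpoints=3, rng=np.random.default_rng(0))
```

In this toy map the middle interval is both longer and hotter, so it receives most of the probability mass and most of the drawn breakpoints.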
REGENS can easily simulate GWAS data from any of the 26 populations in the 1000 Genomes Project, and a filtered subset of these populations' genotype data is provided in the GitHub repository in corresponding plink filesets. In summary, I kept a random subset of 500000 quality-control-filtered, biallelic SNPs such that every subpopulation contains at least two instances of the minor allele. Exact thinning methods are in the supplementary analysis.
REGENS simulated 20000 samples with 500000 SNPs per sample ten times. Triadsim simulated 10000 trios with 500000 SNPs per individual ten times. A perfect comparison is not possible because simulating 10000 trios simulates 30000 individuals, of which only 20000 are unrelated (assuming each child's mother and father are not related). REGENS benefits from this comparison by having to read and write only two thirds as many samples, while Triadsim benefits because it only has to draw half as many breakpoints. To clarify the latter, each of Triadsim's breakpoints is applied to a trio, of which 10000 were simulated. On the other hand, each of REGENS' breakpoints is applied to an individual, of which 20000 were simulated. Since this at least roughly amounts to a wash, the fairest comparison was to compare each algorithm's ability to simulate the same number of unrelated individuals, because relatives are generally removed from real GWAS data. A bootstrap confidence interval was computed for the ratio of Triadsim's mean runtime to REGENS' mean runtime, and another one was computed for the ratio of Triadsim's max RAM usage to REGENS' max RAM usage. All replicate runs for both algorithms were run on an Intel(R) Xeon(R) CPU E5-2690 v4 2.60GHz processor. Instructions for how to rerun those tests are here.
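For readers who want to see what a bootstrap confidence interval for a ratio of means looks like, here is a minimal percentile-bootstrap sketch. The text does not say which bootstrap variant was used, so the percentile method, the function name `bootstrap_ratio_ci`, and its defaults are assumptions, and the runtimes below are placeholder numbers rather than the measured results.

```python
import numpy as np


def bootstrap_ratio_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) / mean(b), resampling a and b independently."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Each row is one bootstrap resample of the replicate runs.
    a_means = rng.choice(a, size=(n_boot, a.size), replace=True).mean(axis=1)
    b_means = rng.choice(b, size=(n_boot, b.size), replace=True).mean(axis=1)
    ratios = a_means / b_means
    lower, upper = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Placeholder runtimes (arbitrary units) for ten replicate runs of each tool.
triadsim_runtime = [95.0, 102.0, 98.0, 110.0, 99.0, 101.0, 97.0, 105.0, 100.0, 103.0]
regens_runtime = [4.1, 3.9, 4.0, 4.3, 3.8, 4.2, 4.0, 4.1, 3.9, 4.0]
ci_low, ci_high = bootstrap_ratio_ci(triadsim_runtime, regens_runtime)
```

The same function could be applied to the per-run max RAM measurements to get the second interval.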