Home
Here is a list of (what we believe might be) frequently asked questions. If we are wrong and your questions remain unanswered, please open an issue.
-
Where did you get the whole-genome multiple sequence alignments that you used to compute the tracks?
For the tracks that we computed we took the multiple sequence alignments available on the UCSC FTP server (search for "multiple alignments").
-
Can I use such an alignment to score other species (not just the reference species)?
PhyloCSF++ always expects the first sequence of an alignment to be the reference sequence. In theory, you could use a multiple sequence alignment and choose a different reference sequence. However, many multiple sequence alignments are reference-guided, so other sequences will have significantly lower coverage.
-
Can I use compressed alignment files?
PhyloCSF++ uses memory mapping to process each file in parallel without having to load the entire alignment file into memory. Hence, it is not possible to use compressed input files. For the same reason, you cannot use process substitution to redirect the output of a command as input (e.g., phylocsf++ ... <(gunzip -c aln.maf.gz)). Please decompress your alignments before you pass them to PhyloCSF++.
-
Does it make a difference whether I pass multiple alignment files or a single file that contains all alignments?
For tracks, the output will be exactly the same. For scoring alignments, the only difference is that a file with scores is written for every input file separately. There is, however, an important difference for both building tracks and scoring MSAs: PhyloCSF++ parallelizes over the alignments within each file, not over the files. To get the best speed-up from parallelization, a file should contain a large number of alignments. The worst case is a lot of files with only a single alignment in each: then no parallelization will happen.
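If your alignments are spread over many small MAF files, one way to benefit from parallelization is to merge them into a single file first. The following is a minimal sketch and not part of PhyloCSF++: merge_maf is a hypothetical helper name, and it assumes uncompressed MAF files that each start with a ##maf header line. Repeated header lines are dropped as a precaution, and a blank line is kept between the contents of consecutive files.

```bash
# merge_maf OUTPUT INPUT...
# Hypothetical helper: concatenate MAF files, keeping only the first "##maf" header.
merge_maf() {
    local out="$1"
    shift
    awk '
        # First line of every file except the first: make sure the previous
        # alignment block is terminated by a blank line.
        FNR == 1 && NR > 1 && prev != "" { print "" }
        # Skip "##maf" header lines of every file except the first.
        /^##maf/ && NR > FNR { next }
        { print; prev = $0 }
    ' "$@" > "$out"
}

# Usage (file names are placeholders):
#   merge_maf all.maf alignments/*.maf
```

Keep in mind that for scoring alignments one score file is written per input file, so merging also changes how the output is split.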
-
What is a model?
Computing PhyloCSF scores on an MSA requires a phylogenetic tree as well as codon frequencies and codon substitution rates for both coding and non-coding regions. Together, these make up a model. The following models have been pre-computed and are included in PhyloCSF++: 58mammals, 29mammals, 100vertebrates, 49birds, 53birds, 21mosquitoes, 12flies, 20flies, 23flies, 26worms, 7yeast. Only species from your alignment that are included in the model can be considered. To see which species are included in a model, you can run:
phylocsf++ build-tracks --model-info 29mammals
-
New models have been uploaded to the original repository. Can I use them?
To make the tool easier to use, we included all available models in the program. Nonetheless, you can also specify a path to a model instead of a model name. You can also open an issue and we will include the new models in our program.
-
Do the species names in the alignment file have to match the species names in the model?
They have to match either the names in the model or one of their alternative names, e.g., you can use Human or hg38. To see what (alternative) names you can use, run:
phylocsf++ build-tracks --model-info 29mammals
If your alignment file uses other species identifiers, you can specify a mapping file:
phylocsf++ build-tracks --mapping my_species_names.tsv ...
The mapping file has to be tab-separated with the 1st column being the species name from the model, and the 2nd column the species name from your alignment file.
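As an illustration of this layout, the following sketch writes a mapping file and checks that every line has exactly two non-empty, tab-separated columns. make_mapping and check_mapping are hypothetical helper names, and my_human and my_mouse are made-up identifiers standing in for the names in your alignment. Mouse is assumed to be a name in the model; verify the model names with --model-info.

```bash
# make_mapping OUTPUT MODEL_NAME ALIGNMENT_NAME [MODEL_NAME ALIGNMENT_NAME ...]
# Column 1: species name from the model; column 2: name used in your alignment.
make_mapping() {
    local out="$1"
    shift
    # printf reuses the format string for every pair of arguments.
    printf '%s\t%s\n' "$@" > "$out"
}

# check_mapping FILE
# Returns non-zero if a line does not have exactly two non-empty columns.
check_mapping() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 tab-separated columns\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage (all names on the right-hand side are placeholders):
#   make_mapping my_species_names.tsv Human my_human Mouse my_mouse
#   check_mapping my_species_names.tsv
```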
-
How can I compute models for my own set of species?
Training your own model requires a phylogenetic tree with evolutionary distances, as well as codon frequencies and codon substitution rates for both coding and non-coding regions. At the moment, neither PhyloCSF nor PhyloCSF++ has a publicly available tool to compute such a model, but we are working on it and plan to include it in PhyloCSF++ in the near future. Since models are compatible between PhyloCSF and PhyloCSF++, you will also be able to use models computed with PhyloCSF++ with the original PhyloCSF software.
-
Why do I need the genome length and coding regions to smoothen the scores / compute posterior probabilities?
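Smoothening and posterior probabilities are computed with a hidden Markov model (HMM). The HMM needs the genome length and the coding regions as prior training data, e.g., to derive the length of CDS and the proportion of coding regions in the genome.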
-
How do I get the coding regions for smoothening / posterior probabilities?
PhyloCSF++ needs a tab-separated file in the following format:
chrom strand phase start-coord stop-coord
You can extract this file directly from a gene annotation file (gff/gtf) with the following command:
awk -F'\t' 'BEGIN { OFS="\t" } ($3 == "CDS") { print $1, $7, $8, $4, $5 }' genes.gff > CodingExons.txt
The chromosome names do not have to match with the chromosome names in the alignment file. This file is only needed to extract information such as length of CDS, proportion of coding regions in the genome, etc.
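To get a feeling for what such a file contains, the following sketch summarizes it: the number of CDS entries, the total number of coding bases, and the fraction of the genome they cover. It is only a sanity check under stated assumptions and not the computation PhyloCSF++ performs internally. coding_summary is a hypothetical helper name, the genome length is supplied by you, coordinates are assumed to be 1-based and inclusive as in GFF, and overlapping CDS entries (e.g., from several transcripts of one gene) are counted more than once.

```bash
# coding_summary CODING_EXONS_FILE GENOME_LENGTH
# Prints: <number of CDS entries> TAB <coding bases> TAB <fraction of genome>
coding_summary() {
    local file="$1" genome_length="$2"
    awk -F'\t' -v g="$genome_length" '
        # Columns: chrom, strand, phase, start-coord, stop-coord
        NF == 5 { n++; bases += $5 - $4 + 1 }
        END {
            frac = (g > 0) ? bases / g : 0
            printf "%d\t%d\t%.6f\n", n, bases, frac
        }
    ' "$file"
}

# Usage (the genome length is a placeholder; use the value for your assembly):
#   coding_summary CodingExons.txt GENOME_LENGTH
```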
-
I get different scores than with the original PhyloCSF tool. Is this a bug?
Most parts of the software use randomization (the MLE and OMEGA strategies, as well as the smoothening of the tracks). This will sometimes lead to minor differences in the scores. If you get significant differences, please open an issue and provide us with the alignment and the options that you ran PhyloCSF++ with, and we will investigate the cause.
-
What do I need to do with the track files after I have computed them?
If you want to load your tracks into a genome browser, you usually want to index them first, i.e., convert the wig files into bw files. For this you can use wigToBigWig. wigToBigWig will ask for a chrom.sizes file, a file that contains the sequence length for each chromosome/sequence, which you can also find on the UCSC FTP server. You can parallelize converting all wig files with GNU parallel in the output directory:
find . -name '*.wig' | parallel --will-cite -j10 'wigFile={}; wigToBigWig $wigFile hg38.chrom.sizes "${wigFile:0:-3}bw"'
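If GNU parallel is not available, the same conversion can be done sequentially. The following is a minimal sketch: convert_wigs is a hypothetical helper name, and its optional third argument (a replacement for the wigToBigWig command) exists only so that the loop can be tried out without running the real converter. Unlike the find command, it only looks at the wig files directly inside the given directory.

```bash
# convert_wigs DIR CHROM_SIZES [CONVERTER]
# Converts every DIR/*.wig file to DIR/*.bw, one after the other.
convert_wigs() {
    local dir="$1" sizes="$2" converter="${3:-wigToBigWig}"
    local wig
    for wig in "$dir"/*.wig; do
        # If there are no wig files, the pattern is left unexpanded: skip it.
        [ -e "$wig" ] || continue
        # Strip the .wig suffix and append .bw for the output file name.
        "$converter" "$wig" "$sizes" "${wig%.wig}.bw" || return 1
    done
}

# Usage:
#   convert_wigs . hg38.chrom.sizes          # run wigToBigWig on every file
#   convert_wigs . hg38.chrom.sizes echo     # dry run: only print the arguments
```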
-
How can I get scores for different ORFs?
When you score a single alignment (instead of computing tracks), the original PhyloCSF software has an option to specify which reading frame is scored. PhyloCSF++ only scores the first reading frame on the forward strand. If you are interested in other reading frames, you need to reverse-complement the alignment and/or remove the first one or two bases. We don't see a benefit in this feature and left it out to keep the interface clean. Instead, we will make an extension of our tool available that scores entire transcripts and CDS features from a GFF/GTF file.
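The row-level operation can be sketched as follows. revcomp_row and frame_row are hypothetical helper names that work on one aligned row (a string of bases and gap characters). The same call has to be applied to every row of an alignment so that the columns stay aligned. The sketch only complements A, C, G and T (upper and lower case); gaps, N and other IUPAC codes are left unchanged. If your alignment is in MAF format, the strand and coordinate fields would have to be adjusted as well, which is not shown here.

```bash
# revcomp_row ROW
# Reverse-complement one aligned row; gap characters stay gaps.
revcomp_row() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# frame_row ROW STRAND OFFSET
# STRAND is + or -, OFFSET is 0, 1 or 2 (number of leading columns to drop).
# Prints the row such that the wanted frame becomes the first reading frame
# on the forward strand.
frame_row() {
    local row="$1" strand="$2" offset="$3"
    if [ "$strand" = "-" ]; then
        row=$(revcomp_row "$row")
    fi
    # cut counts columns from 1, so dropping OFFSET columns starts at OFFSET + 1.
    printf '%s\n' "$row" | cut -c "$((offset + 1))-"
}

# Usage:
#   frame_row 'ATGCCC' + 1     # second frame on the forward strand
#   frame_row 'ATG-CC' - 0     # first frame on the reverse strand
```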
So far we hope everything is self-explanatory. 😃