From b38de2ae882855c959ce5520b2959b92b19d235d Mon Sep 17 00:00:00 2001 From: keblat <42971789+keblat@users.noreply.github.com> Date: Thu, 6 Sep 2018 12:59:57 +0100 Subject: [PATCH 1/4] Update paralog, spell correct --- docs/advice/target_capture_data_analysis.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/advice/target_capture_data_analysis.md b/docs/advice/target_capture_data_analysis.md index 1e9794f..df09b64 100644 --- a/docs/advice/target_capture_data_analysis.md +++ b/docs/advice/target_capture_data_analysis.md @@ -3,7 +3,7 @@ [Working environment, pre-requisites and best practices](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/target_capture_data_analysis.md#1-working-environment-pre-requisites-and-best-practices) [Checking data quality](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/target_capture_data_analysis.md#2-checking-data-quality) [Cleaning data](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/target_capture_data_analysis.md#3-cleaning-data) -[Assembly strategy and reference file](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/target_capture_data_analysis.md#4-chosing-a-strategy-to-assemble-the-data-and-formatting-the-reference-file) +[Assembly strategy and reference file](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/target_capture_data_analysis.md#4-choosing-a-strategy-to-assemble-the-data-and-formatting-the-reference-file) [Hybpiper tips](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/target_capture_data_analysis.md#5-tips-when-running-hybpiper) - [Name list](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/target_capture_data_analysis.md#name-list) - [PATH](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/target_capture_data_analysis.md#path) @@ -104,7 +104,7 @@ scp miseq_run/*.fastq user@server_address:raw_data/ Once you know how you should clean your data, you can use a tool such as [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) to do it. -Trimming tools perform various operations, depending on the tool and on the options you chose. +Trimming tools perform various operations, depending on the tool and on the options you choose. - They trim, i.e they cut, the end of the reads if they match adapter sequences (indicated by the user). - They trim the reads at the end and sometimes at the beginning, based on the quality of the base(s) or arbitrarily. - They remove completely some reads, based on their quality and/or their length. @@ -147,7 +147,7 @@ You need to understand the above command and adapt it to our needs/input names! THE ORDER OF THE OUTPUT FILES MATTERS! -## **4. Chosing a strategy to assemble the data, and formatting the reference file** +## **4. Choosing a strategy to assemble the data, and formatting the reference file** ### Assembly strategy There are multiple pipelines to analyze target capture sequencing data and to produce phylogenies from them. @@ -182,7 +182,7 @@ You can download a plastome in Genbank, in format .gb, open it in Geneious, use When you want to do it for multiple plastomes, or if you don't have access to Geneious, you can use/customize one of our [scripts](), to extract only the regions you want based on the annotations, and rename them as you wish. Depending on your input it will be relevant or not to use the "intronerate" option without modifications (see below). -If you chose to use blast, your reference file will have to provide amino-acid sequences, so it may not make sense to use blast for something else than coding sequences. +If you choose to use blast, your reference file will have to provide amino-acid sequences, so it may not make sense to use blast for something else than coding sequences. However, we found out empirically that blast allows to recover less reads, but longer genes, than with bwa, regardless if sequences were coding or not. @@ -349,7 +349,9 @@ done < namelist.txt Depending on the result you may want to discard a gene for a species or for all species. This can also be a first step to identify recent whole genome duplications, or ancient gene duplications. - +Useful command: +parallel "python paralog_retriever.py namelist.txt {} > {}.paralogs.fasta" ::: < gene_list.txt 2> paralog_summary.txt +produces summary table with the number of copies of each gene. gene_list.txt would be the list of target genes, one per line. ### General advice: From 9babd6c91274680d0b9f5d577ab03415d26c62fb Mon Sep 17 00:00:00 2001 From: keblat <42971789+keblat@users.noreply.github.com> Date: Thu, 6 Sep 2018 13:33:54 +0100 Subject: [PATCH 2/4] Added UPP --- docs/advice/from_sequences_to_phylogenies.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/advice/from_sequences_to_phylogenies.md b/docs/advice/from_sequences_to_phylogenies.md index 3385b9a..c27c890 100644 --- a/docs/advice/from_sequences_to_phylogenies.md +++ b/docs/advice/from_sequences_to_phylogenies.md @@ -9,6 +9,7 @@ [Visualize support](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/from_sequences_to_phylogenies.md#7-visualize-support-on-the-species-tree) [Dating divergence times](https://github.com/sidonieB/bioinfo-utils/blob/master/docs/advice/from_sequences_to_phylogenies.md#8-dating-divergence-times) +**FRAG**: Tips for handling highly fragmented DNA (e.g. from ancient herbarium specimens) ## **1. Multiple Sequence Alignments** There are many different programs to align sequences. @@ -42,12 +43,16 @@ There is also a [MPI version](https://mafft.cbrc.jp/alignment/software/mpi.html) ### PASTA See the documentation [here](https://github.com/smirarab/pasta). What follows needs testing and proofreading. Comments welcome! -PASTA aligns sequences following an iterative approach that uses a user-provided guide tree to generate an alignment, generates a new tree from this alignment, use this tree to improve the previous alignment, and does it again a user-defined number of times. +PASTA aligns sequences following an iterative approach that generates an initial alignment and tree, uses this tree to improve the previous alignment, and does it again a user-defined number of times (default == 3). Alternatively, the user can provide an initial tree. PASTA is able to deal with very large numbers of sequences to align because it splits the data in subsets of sequences, align the subsets, and then combine the alignments. PASTA works with different alignment software, including MAFFT. Using MAFFT through PASTA may give better results than using MAFFT alone, especially because PASTA tends to infer gaps instead of forcing alignment between very divergent sequences. Recent work by people in [T. Warnow's group](http://tandy.cs.illinois.edu/) suggest that a combination of PASTA and [BAli-Phy](http://www.bali-phy.org/) could [work well](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3101-8#Sec15) although more testing on real data is needed (pers. com. from Mike Nute, PhyloSynth symposium, Montpellier, France, August 2018). +### UPP +See the documentation [here](https://github.com/smirarab/sepp/blob/master/README.UPP.md). +UPP splits the data set into more fragmented and less fragmented sequences. It then produces a backbone alignment and Hidden Markov Models (HMM) from the less fragmented sequences and attempts to fit the more fragmented sequences into each HMM. The final alignment is then selected from the best supported HMM. It produces slightly more accurate alignments for fragmented DNA. +**FRAG**: Use UPP -M option for highly fragmented DNA. If the range of sequence lengths is highly variable or most of your fragments are much shorter than the target reference sequence, you may need to specify something like "95th percentile of all sequence lengths in data set" as input for this option. ### Concatenate alignments (if needed) From e5d83d106809363078d3f8fea517a69a611ff339 Mon Sep 17 00:00:00 2001 From: keblat <42971789+keblat@users.noreply.github.com> Date: Thu, 6 Sep 2018 15:24:29 +0100 Subject: [PATCH 3/4] Added optrimal (and others) --- docs/advice/from_sequences_to_phylogenies.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/docs/advice/from_sequences_to_phylogenies.md b/docs/advice/from_sequences_to_phylogenies.md index c27c890..094c4a4 100644 --- a/docs/advice/from_sequences_to_phylogenies.md +++ b/docs/advice/from_sequences_to_phylogenies.md @@ -75,9 +75,11 @@ For instance: ``` trimal -in concatenated.out -out concat_trimmalled.fas -automated -resoverlap 0.65 -seqoverlap 0.65 ``` - + Another suite of tools to perform similar tasks and many others is [phyutility](https://github.com/blackrim/phyutility). +**FRAG**: Trimming can sometimes result in loss of informativeness. It may be worthwhile to check that the trimming parameters did not actually make things worse. For example, the **optrimAl** script uses AMAS to explore the effect of different trimAl gap threshold values on the proportion of parsimony informative sites and amount of data loss. + ### Renaming sequences in all alignments We have scripts for that, just ask! @@ -101,7 +103,8 @@ If you have a big cluster, gene trees can be produced in parallel as a simple ar For concatenated alignments RaxML can also be run in MPI mode, or in HYBRID mode with parallelisation of “coarse grain” processes over nodes (e.g. building separate bootstrap trees) and “fine-grain” processes using multithreading of multiple processors on a single machine (e.g. working on a single tree). See [documentation](https://help.rc.ufl.edu/doc/RAxML). The trees to be used for species tree estimation with ASTRAL (see below) are the RAxML_bipartitions.* trees, NOT the RAxML_bipartitionsBranchLabels.* trees. - + +Another relatively fast and robust option is IQ-Tree http://www.iqtree.org/ ## **3. Spotting alignment problems by observing gene trees** @@ -142,6 +145,8 @@ sed 's/\[[^\[]*\]//g'SpeciesTree.tre > SpeciesTree2.tre sed "s/'//g" SpeciesTree2.tre > SpeciesTree3.tre ``` +**FRAG**: ASTRAL option -t 10 does a polytomy test and may help in assessing if a poorly supported branch could be due to insufficient data or reflect a true polytomy. + ## **5. Rooting trees** **WARNING!** If you use phyparts (see below) or more generally if you have to compare gene trees to your species tree, it is important that the same rooting method is applied to all trees. @@ -256,4 +261,4 @@ Look at [S. Mirarab](https://github.com/smirarab/ASTRAL/blob/master/astral-tutor ## 8. Dating divergence times -**DO NOT** use ASTRAL branch lengths (see S. Mirarab [github](https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#branch-length-and-support) for explanations and for coming-soon approach to date phylogeneomic datasets) +**DO NOT** use ASTRAL branch lengths (see S. Mirarab [github](https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#branch-length-and-support) for explanations and for coming-soon approach to date phylogenomic datasets) From eede34ee232539e1406e5baf78c6768dd70194fb Mon Sep 17 00:00:00 2001 From: keblat <42971789+keblat@users.noreply.github.com> Date: Thu, 6 Sep 2018 15:37:10 +0100 Subject: [PATCH 4/4] Added max_overlap --- docs/advice/target_capture_data_analysis.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/advice/target_capture_data_analysis.md b/docs/advice/target_capture_data_analysis.md index df09b64..85958dd 100644 --- a/docs/advice/target_capture_data_analysis.md +++ b/docs/advice/target_capture_data_analysis.md @@ -388,7 +388,13 @@ Replace dna by aa if the reference file contains amino-acids. The table seq_lengths.txt can then be visualized as a heatmap using Matt Johnson's script gene.recovery.heatmap.R, also available in the HybPiper directory. Just open the script in R and follow the instructions at the beginning of the script. - + +Alternative: the **max_overlap** R script estimates capture coverage of target sequences from the output of get_seq_lengths.py by calculating three statistics - +representedness = proportion of species/genes for which sequences were obtained +completeness = proportion of target sequence obtained for each species/gene +evenness = how evenly the sequence lengths are distributed across species/genes, adapted from a measure of species evenness (Pielou EC. 1966. The measurement of diversity in different types of biological collections. J Theor Biol 13: 131-144.) +It also generates a csv which you can then open in a spreadsheet programme of your choice and create a heatmap with appropriate conditonal formatting. + ### Number of reads ON/OFF target See Matt Johnson's script hybpiper_stats.py Example of command: