Releases: oushujun/LTR_retriever
v3.0.1 release
New feature
Add the -stop parameter to stop the program after a user-specified step. For example, if you only want to obtain the .defalse and .pass.list files, you can stop the program after the Major filtering step (i.e., -stop major). By default, it will finish the full pipeline.
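As a rough illustration of calling the program with -stop from a wrapper script, the sketch below assembles an LTR_retriever command line in Python. Only -stop major and -threads come from these notes; the -genome and -inharvest option names, the file names, and the helper build_command are assumptions made for illustration, so check LTR_retriever -h for the options your version accepts.

```python
import shlex


def build_command(genome, harvest_scn, stop_after=None, threads=4):
    """Assemble an LTR_retriever argument list (hypothetical helper)."""
    # -genome and -inharvest are assumed option names; verify with LTR_retriever -h.
    cmd = ["LTR_retriever", "-genome", genome, "-inharvest", harvest_scn,
           "-threads", str(threads)]
    if stop_after is not None:
        # "major" stops after the Major filtering step, as described above.
        cmd += ["-stop", stop_after]
    return cmd


cmd = build_command("genome.fa", "genome.fa.harvest.scn", stop_after="major")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Leaving stop_after unset omits the option, which matches the default behaviour of running the full pipeline.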
v3.0.0 update
Bug fix
- Update get_range.pl: fix the sequence ID recognition issue for LTRharvest inputs #177
- Make sure candidates have sufficient flanking sequence to extend (50bp)
v2.9.9 update
New feature
Enable strand-aware outputs
For LTR candidates found on the negative strand, the locus is now presented 5' -> 3', as it is for candidates on the positive strand. For example, Chr1:7890..3456 indicates that the candidate is on the - strand. This information is shown in the first column of the pass.list, the last column of the gff3 file, and the sequence names of the intact.fa file. If the element is on the - strand, its sequence in the intact.fa file is shown 5' -> 3' from the negative strand. For example, the sequence of Chr1:7890..3456 is the reverse complement of the sequence of Chr1:3456..7890. For candidates without strand information (i.e., those lacking coding sequence), the strand is assumed to be positive for convenience.
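A downstream script can recover the strand from the coordinate order alone. The following is a minimal sketch, not code from LTR_retriever: parse_locus and extract are hypothetical helpers, and coordinates are assumed to be 1-based and inclusive.

```python
import re

# Translation table used to complement a DNA sequence.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_locus(locus):
    """Split 'Chr1:7890..3456' into (seq_id, start, end, strand).

    start <= end always; strand is '-' when the first coordinate is larger.
    """
    match = re.fullmatch(r"(.+):(\d+)\.\.(\d+)", locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    seq_id = match.group(1)
    first, second = int(match.group(2)), int(match.group(3))
    strand = "-" if first > second else "+"
    return seq_id, min(first, second), max(first, second), strand


def extract(chrom_seq, locus):
    """Return the element 5' -> 3' on its own strand (1-based, inclusive)."""
    _, start, end, strand = parse_locus(locus)
    seq = chrom_seq[start - 1:end]
    if strand == "-":
        # Reverse complement for elements on the negative strand.
        return seq.translate(_COMPLEMENT)[::-1]
    return seq


print(parse_locus("Chr1:7890..3456"))  # ('Chr1', 3456, 7890, '-')
```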
Bug fix
- Ensure candidates have sufficient flanking sequences to extend (default 50bp), which is necessary for LTR_retriever to determine whether the candidate is true or false. Candidates that can't satisfy this criterion will be skipped. Such a scenario is most likely found in fragmented genomes. Bug report: oushujun/EDTA#263
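The flanking requirement can be pictured with a small check like the one below. This is only a sketch under stated assumptions: has_sufficient_flank is a hypothetical helper, coordinates are taken as 1-based and inclusive, and requiring the flank on both sides is an assumption about how the criterion is applied.

```python
def has_sufficient_flank(start, end, seq_len, flank=50):
    """True if at least `flank` bases exist on both sides of start..end.

    Coordinates are assumed 1-based and inclusive; seq_len is the length
    of the contig or chromosome that carries the candidate.
    """
    left = start - 1        # bases upstream of the candidate
    right = seq_len - end   # bases downstream of the candidate
    return left >= flank and right >= flank


# A candidate that starts 10 bp into a short contig would be skipped.
print(has_sufficient_flank(11, 5000, 6000))    # False
print(has_sufficient_flank(501, 5000, 6000))   # True
```

Short contigs leave little room on either side of a candidate, which is why fragmented assemblies are the most likely place for candidates to be skipped.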
v2.9.8 update
New features
- Use the same LTR name for parts of INT and LTR from the same element in preparation for solving @edta#251
- Add the yml file for conda installation
Bug fix
Update get_range.pl
- Fix a bug introduced in Aug 2023 (#a375c5e) that output all candidates (both LTR retrotransposons and non-LTR repeats) when generating the library file. This bug caused non-LTR sequences to appear in the library (e.g., LTR/EnSpm-CACTA).
- Fix a bug introduced in May 2023 (#058ce29) that failed to remove masked sequences from the final library.
- Remove the RepeatMasker support to simplify the code since this functionality is never used in the official release.
Bug fix
It just gets better with community efforts!
Major Updates
- Add TEsorter to help identify non-LTR sequences. Candidate LTRs will be classified as "false" if they contain non-LTR HMM profile matches, even if the candidate contains LTR/TSD and the TGCA motif. This purging will remove a small number of structurally intact LTR candidates (5/2304 in rice). This implementation offers slight improvements over older versions and should be more significant for larger genomes.

LTR_retriever-harvest_FINDER | sens | spec | accu | prec | FDR | F1 |
---|---|---|---|---|---|---|
retriever_v2.5 | 0.967 | 0.920 | 0.931 | 0.789 | 0.211 | 0.869 |
retriever_v2.6 | 0.963 | 0.931 | 0.939 | 0.811 | 0.189 | 0.881 |
retriever_v2.9.2 | 0.966 | 0.926 | 0.935 | 0.802 | 0.198 | 0.876 |
retriever_v2.9.4 | 0.967 | 0.928 | 0.937 | 0.804 | 0.196 | 0.878 |

- Add more filtering parameters to identify solo LTRs and improve the solo-intact ratio calculation (#111, #110).
- Resolve RMblast errors when it attempts to overutilize CPUs #137
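For readers unfamiliar with the benchmark columns reported for the TEsorter update (sens, spec, accu, prec, FDR, F1), the sketch below computes them with the standard confusion-matrix definitions. How the true and false positives were counted in the benchmark is not stated in these notes, so the counts used here are made up for illustration.

```python
def benchmark_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics, named after the table columns."""
    sens = tp / (tp + fn)                    # sensitivity (recall)
    spec = tn / (tn + fp)                    # specificity
    accu = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    prec = tp / (tp + fp)                    # precision
    fdr = 1 - prec                           # false discovery rate
    f1 = 2 * prec * sens / (prec + sens)     # harmonic mean of prec and sens
    return {"sens": sens, "spec": spec, "accu": accu,
            "prec": prec, "FDR": fdr, "F1": f1}


# Illustrative counts only; they are not taken from the benchmark.
metrics = benchmark_metrics(tp=8, fp=2, tn=85, fn=5)

# Consistency check on the retriever_v2.5 row: F1 from prec 0.789 and sens 0.967.
f1_v25 = 2 * 0.789 * 0.967 / (0.789 + 0.967)
print(round(f1_v25, 3))  # 0.869
```

The same definitions explain why prec and FDR sum to 1 in every row of the table.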
Other improvements
- Sequence IDs must now be 13 characters or fewer to accommodate huge chromosomes up to 999 Mb in length.
- Add missing TRF parameter (#133)
- Add check to ensure the input genome is writable (LTR_retriever won't overwrite your genome) (#125).
- Remove gap length for genome size calculation.
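Two of the items above, the 13-character limit on sequence IDs and the gap-free genome size, can be checked on an input genome before running the pipeline. The sketch below is an assumption-laden illustration rather than LTR_retriever's own code: summarize_fasta is a hypothetical helper, the sequence ID is taken as the header text up to the first whitespace, and gaps are assumed to be runs of N.

```python
def summarize_fasta(lines, max_id_len=13):
    """Return (over-long sequence IDs, genome size excluding N gaps).

    `lines` is any iterable of FASTA lines, e.g. an open file handle.
    """
    long_ids = []
    ungapped_size = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is assumed to be the header text before any whitespace.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if len(seq_id) > max_id_len:
                long_ids.append(seq_id)
        else:
            # Count every base except N/n, which is assumed to mark gaps.
            ungapped_size += len(line) - line.upper().count("N")
    return long_ids, ungapped_size


demo = [">Chr1 example", "ACGTNNNNAC", ">a_very_long_sequence_name", "acgtnn"]
print(summarize_fasta(demo))  # (['a_very_long_sequence_name'], 10)
```

For a real genome, pass an open file handle, for example summarize_fasta(open("genome.fa")), where genome.fa is a placeholder file name.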
Acknowledgements
Andreas Wallberg, @Shokusei, Evan Ernst, @xie-wei-hh, @with9, and users like YOU!
Version 2.9.0: Polishing outputs
Major updates
This version has many improvements in the downstream outputs including:
- standardized the GFF3 output following these criteria and used the updated TE-related sequence ontologies
- combined structural and homological LTR annotations. Homology-based LTR fragments will be replaced by structural-based LTR annotations wherever applicable.
Other improvements
- allow users to provide paths to dependencies in the command-line.
- updated readme
- fixed a number of minor bugs.
Reformat GFF3 outputs
Recovering 10-20% more intact LTR elements
Major update
I recently identified a bug that dropped intact LTR elements whose two LTRs differ in length by more than 15 bp due to InDels. After manual checks, I determined that these are still high-quality intact elements and now salvage them in the output. This marginally improves sensitivity, especially for genomes with limited LTR sequences (e.g., Arabidopsis, ~7%); the margin decreases for genomes with decent amounts of LTRs, such as rice (~25%) and maize (~75%), because the abundance of intact elements was already sufficient to construct a comprehensive library. However, the number of intact LTR elements could increase by 10-20% compared to the last version (v2.7), which has some positive effects on the calculation of LAI. Some benchmarking results:
Arabidopsis (TAIR10) | v1.x | v2.0 | v2.8 |
---|---|---|---|
Sensitivity | 90.70% | 90.90% | 95.04% |
Specificity | 99.00% | 99.00% | 98.88% |
Accuracy | 98.50% | 98.50% | 98.64% |
Precision | 86.60% | 86.50% | 84.99% |
Rice (MSUv7) | v1.x | v2.0 | v2.5 | v2.8 |
---|---|---|---|---|
Sensitivity | 95.00% | 95.30% | 96.30% | 96.71% |
Specificity | 95.00% | 94.60% | 94.00% | 93.87% |
Accuracy | 95.00% | 94.80% | 94.50% | 94.54% |
Precision | 85.40% | 84.50% | 83.10% | 83.09% |
Minor updates
- Allow for mirrored candidates produced by LTRharvest
- Improve the convert_ltrdetector.pl for the published version (v1.0) of LtrDetector (contributed by @baozg)
- Add a convertor convert_ltr_finder2.pl to convert LTR_FINDER -w 2 table format into LTRharvest screen output format
- For LAI, allow the -all file to contain other TEs (i.e., whole-genome TE annotation)
Releasing a 100% faster version
Major improvement
I am excited to release this much faster version of LTR_retriever. The multithreading module had been slowing down the program, and I finally got the chance to improve it. This update does not change the program's output, since it is just a more efficient implementation of parallel computation.
In a test on the 14.5 Gb bread wheat genome, a total of 941,338 raw LTR candidates were processed and a non-redundant library was generated. This process took only 8 days 3 hours and 31 minutes with the current version (v2.7) using 10 threads (-threads 10), whereas the last version (v2.6) would have required 3 weeks.
Minor changes
- Classification of Copia elements was improved to be more sensitive (#51)
- Print out the program version number on screen.
- Improved genome and sequence reading.