Integrating the LAI index as part of the EDTA pipeline #134

Closed
sanyalab opened this issue Nov 20, 2020 · 10 comments

@sanyalab

Hi Shujun,

I was wondering if we can integrate the LAI index into the EDTA pipeline. We are getting the LTR_retriever results anyway, so we should be able to compute the LAI index.

Thanks
Abhijit

@sanyalab
Author

I found the answer in the outputs. Thanks, this is good.

-Abhijit

@oushujun
Owner

oushujun commented Dec 7, 2020

LAI has been included in the LTR_retriever package. Since EDTA is focusing on TE annotation and LAI is to evaluate TE space completeness, I didn't put them together because these are different topics. As you suggested, the EDTA outputs contain input files to run LAI.
e.g.:
perl ./LTR_retriever/LAI -genome genome.fa.mod -intact genome.fa.mod.EDTA.raw/LTR/genome.fa.mod.pass.list -all genome.fa.mod.EDTA.anno/genome.fa.mod.out
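
As a rough sketch, the three inputs named in that command can be checked before launching LAI. The helper name `missing_lai_inputs` is hypothetical, and the paths simply follow the layout in the example above; it assumes you run it from the EDTA working directory with the bare genome file name.

```shell
# Sketch: list any LAI input files that are missing or empty.
# Assumes the output layout shown in the example command above.
# $1: genome FASTA name as given to EDTA (e.g. genome.fa), no directory part
missing_lai_inputs() {
    local mod="$1.mod"
    local f
    for f in \
        "$mod" \
        "$mod.EDTA.raw/LTR/$mod.pass.list" \
        "$mod.EDTA.anno/$mod.out"
    do
        # -s is true only for files that exist and are non-empty
        [ -s "$f" ] || printf '%s\n' "$f"
    done
}

# Example (not executed here):
#   missing=$(missing_lai_inputs genome.fa)
#   [ -z "$missing" ] || echo "Missing LAI inputs: $missing"
```

An empty result means all three files are in place and the LAI command above can be run as written.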

@frabanal

Hi @oushujun ,

I love the convenience of EDTA, and we use it extensively. For a particular application, I would like to know the minimum set of EDTA steps needed to generate the input files for LAI.

From the "Divide and Conquer" section, I tried to run the LTR part:
perl $EDTA/EDTA_raw.pl --genome $ASSEMBLY --type ltr --threads $CORES

However, when I resume the EDTA.pl script without overwriting like this:
perl $EDTA/EDTA.pl --overwrite 0 --genome $ASSEMBLY --step filter --anno 1 --sensitive 1 --threads $CORES

it complains that other files, such as $ASSEMBLY.mod.EDTA.raw/$ASSEMBLY.mod.TIR.raw.fa, do not exist. I tried to work around this with an empty file, but it still does not run. I would very much appreciate your feedback on this matter. Also, in case it does work, is the --sensitive 1 step necessary? (I'm guessing --anno 1 is, am I correct?)

Thanks,

Fernando

@oushujun
Owner

Hello Fernando,

Thank you for liking EDTA. To get input files for LAI, you need to run the full EDTA pipeline including TIR and Helitrons. So:
perl $EDTA/EDTA_raw.pl --genome $ASSEMBLY --type tir --threads $CORES
perl $EDTA/EDTA_raw.pl --genome $ASSEMBLY --type helitron --threads $CORES

Then you can run the whole-genome annotation to get the files:
perl $EDTA/EDTA.pl --overwrite 0 --genome $ASSEMBLY --step filter --anno 1 --threads $CORES

With or without --sensitive 1 is fine. If you compare genomes of the same species, you may want to specify the same -totLTR, -iden, and -genome_size in LAI to control for fluctuations.
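
Put together, the sequence discussed in this thread (LTR, then TIR and Helitron, then the filter and annotation step) might look like the sketch below. The function name `edta_lai_commands` is hypothetical; it only prints the commands so they can be reviewed first, and it uses only the flags quoted above.

```shell
# Sketch: print (do not run) the EDTA commands needed to produce LAI inputs.
# $1: EDTA install directory, $2: genome FASTA, $3: number of threads
edta_lai_commands() {
    local edta="$1" assembly="$2" cores="$3"
    local type
    # Raw candidate discovery for each TE type, as in the thread
    for type in ltr tir helitron; do
        printf 'perl %s/EDTA_raw.pl --genome %s --type %s --threads %s\n' \
            "$edta" "$assembly" "$type" "$cores"
    done
    # Resume the main pipeline without overwriting the raw results
    printf 'perl %s/EDTA.pl --overwrite 0 --genome %s --step filter --anno 1 --threads %s\n' \
        "$edta" "$assembly" "$cores"
}

# Example (not executed here):
#   edta_lai_commands "$EDTA" "$ASSEMBLY" "$CORES"
```

If the LTR step has already finished, as in the question above, the first printed line can be skipped.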

Best,
Shujun

@frabanal

frabanal commented Apr 21, 2022

Hi @oushujun ,

Thanks a lot for your prompt response, and for the great recommendations on LAI. I've now caught up with these steps and have a follow-up question, so please let me know if you would prefer I ask it in the LTR_retriever GitHub instead.

I'm comparing different assemblies intra-species (sometimes even the same genotype), but they are also different in quality (CLR vs HiFi), so the scaffolded genome size also varies substantially.

Following your advice, I added -totLTR 6.88 (which I took from the $ASS.mod.out.LAI output file of the most complete assembly) and -iden 93.90 (which I took from the $ASS.mod.out.LAI.LTR.ava.age output file). Anyway, these numbers are all way too similar across assemblies. Unfortunately, it does not seem that the constant -genome_size 142000000 I'm providing has been picked up for the analysis. Actually, the -genome_size parameter is not even listed among LAI options. Am I missing something? Currently running LTR_retriever_v2.9.0.

Alternatively, would it be valid or too unfair for the smaller assemblies to scale the Intact and Total percentages to the "real" genome size?
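
For reference, the manual rescaling asked about here is plain arithmetic: a percentage of the assembly is multiplied by assembly size over the reference genome size. The sketch below shows only that calculation; `rescale_pct` is a hypothetical helper, the 120000000 bp assembly size is made up for illustration, and nothing here claims to mirror how LAI handles -genome_size internally.

```shell
# Sketch: re-express a percentage of the assembly as a percentage of a
# fixed reference genome size.
# $1: percentage of the assembly (e.g. Intact or Total LTR, in %)
# $2: assembly size in bp
# $3: reference genome size in bp
rescale_pct() {
    awk -v p="$1" -v a="$2" -v g="$3" 'BEGIN { printf "%.2f\n", p * a / g }'
}

# Example (not executed here), with a hypothetical 120 Mb assembly:
#   rescale_pct 6.88 120000000 142000000
```

Whether comparing such rescaled values across assemblies of different quality is fair is exactly the open question above; the helper does not settle it.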

Best,

Fernando

@oushujun
Owner

The -genome_size feature is updated in 460bb30 now.
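
With that commit or later, an LAI call that pins all three parameters could be assembled as in the sketch below. The function name `lai_command` is hypothetical, the file layout follows the earlier example in this thread, and the values in the usage comment are the ones quoted above rather than recommendations. It prints the command instead of running it.

```shell
# Sketch: print an LAI command with fixed -totLTR, -iden and -genome_size,
# so the same constants can be reused for every assembly being compared.
# $1: genome FASTA name as given to EDTA, $2: totLTR (%), $3: iden (%),
# $4: genome size in bp
lai_command() {
    local genome="$1" tot_ltr="$2" iden="$3" genome_size="$4"
    local mod="$genome.mod"
    printf 'perl ./LTR_retriever/LAI -genome %s -intact %s -all %s -totLTR %s -iden %s -genome_size %s\n' \
        "$mod" \
        "$mod.EDTA.raw/LTR/$mod.pass.list" \
        "$mod.EDTA.anno/$mod.out" \
        "$tot_ltr" "$iden" "$genome_size"
}

# Example (not executed here); assembly names are placeholders:
#   for asm in assemblyA.fa assemblyB.fa; do
#       lai_command "$asm" 6.88 93.90 142000000
#   done
```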

@frabanal

Thanks for implementing it. I can confirm that the feature worked as intended with the above-mentioned commit.
Fernando

@zhangrengang

Hi Shujun, can I use genome.fa.mod.EDTA.final/genome.fa.mod.out instead of genome.fa.mod.EDTA.anno/genome.fa.mod.out to calculate LAI?

@sanyalab
Author

sanyalab commented Jul 4, 2023

Hello Zhan,

Yes. Please use genome.fa.mod.EDTA.final/genome.fa.mod.out for LAI

-Abhijit

@oushujun
Owner

oushujun commented Jul 4, 2023

Hi Ren-Gang,

The EDTA.final folder may contain a "$genome.RM2.fa.out" file when --sensitive 1 is specified. This out file is annotated using an intermediate library which may not represent the final library, thus the whole-genome annotation is also not final.

Best,
Shujun
