too many LTR/unknown and most of LTR/unknown are classified as LTR/Copia #51

zhangrengang · 2019-08-12T11:00:18Z

Thousands of LTR elements in a plant genome are classified as unknown by LTR_retriever. However, most of them are classified as Copia on the basis of GyDB, as shown below:

# *.retriever.scn.extend.fa.aa
  Count LTR_retriever   GyDB
    927 LTR     Copia   LTR     Copia
      2 LTR     Copia   LTR     Gypsy
     41 LTR     Gypsy   -       -
      1 LTR     Gypsy   LTR     Caulimoviridae
      5 LTR     Gypsy   LTR     Copia
   2266 LTR     Gypsy   LTR     Gypsy
      9 LTR     Gypsy   LTR     unknown
      5 LTR     unknown -       -
   1248 LTR     unknown LTR     Copia
     21 LTR     unknown LTR     Gypsy
      5 mixture Copia   -       -
     27 mixture Copia   LTR     Copia
      1 mixture Copia   LTR     Gypsy
      1 mixture Copia   Unknown unknown
     85 mixture Gypsy   LTR     Gypsy
      1 mixture unknown -       -
     14 mixture unknown LTR     Copia
      2 mixture unknown LTR     Gypsy
    352 notLTR  unknown -       -
      1 notLTR  unknown LTR     Caulimoviridae
      8 notLTR  unknown LTR     Copia
     17 notLTR  unknown LTR     Gypsy
     43 -       -       LTR     Copia   
    150 -       -       LTR     Gypsy 
      1 -       -       LTR     unknown 
      2 -       -       Unknown unknown
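
To put a number on the table above, the short Python tally below sums the rows that LTR_retriever labels `LTR unknown` and reports the share that GyDB assigns to Copia. The counts are copied from the table; the variable names are only for illustration.

```python
# Rows from the table where LTR_retriever reports "LTR unknown",
# keyed by the GyDB call ("-" means no GyDB assignment).
ltr_unknown_by_gydb = {
    "-": 5,
    "LTR/Copia": 1248,
    "LTR/Gypsy": 21,
}

total = sum(ltr_unknown_by_gydb.values())
copia_share = ltr_unknown_by_gydb["LTR/Copia"] / total

print(f"LTR/unknown elements: {total}")          # 1274
print(f"GyDB calls Copia:     {copia_share:.1%}")  # 98.0%
```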

I think there is an issue in annotate_TE.pl:

	$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
	$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.3);

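Copia is given the same ratio cutoff (0.3) as Gypsy, but Copia has only 8 PFAMs, roughly 1/3 of the 28 PFAMs available for Gypsy, so a Copia element has far fewer chances to accumulate hits and pass the cutoff.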
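
The effect of the unequal profile counts can be illustrated with a small Python sketch. It assumes that `$gypsy` and `$copia` are counts of matched superfamily-specific PFAM profiles and that the family defaults to "unknown" (neither is confirmed in this thread). `classify_raw` mirrors the two Perl lines above, and the hypothetical `classify_normalized` first divides each count by the number of available profiles (28 and 8, the figures quoted above). The normalization is only one possible adjustment, shown for comparison; it is not what LTR_retriever implements.

```python
GYPSY_PROFILES = 28  # number of Gypsy PFAMs quoted above
COPIA_PROFILES = 8   # number of Copia PFAMs quoted above
CUTOFF = 0.3         # ratio cutoff used in annotate_TE.pl


def classify_raw(gypsy: float, copia: float, cutoff: float = CUTOFF) -> str:
    """Mirror of the two Perl lines: compare the two scores against the cutoff."""
    family = "unknown"  # assumed default when neither rule fires
    if gypsy > copia and copia / gypsy < cutoff:
        family = "Gypsy"
    if copia > gypsy and gypsy / copia < cutoff:
        family = "Copia"
    return family


def classify_normalized(gypsy: int, copia: int, cutoff: float = CUTOFF) -> str:
    """Hypothetical variant: scale each count by its profile-set size first."""
    return classify_raw(gypsy / GYPSY_PROFILES, copia / COPIA_PROFILES, cutoff)


# Example: an element hitting 3 Copia profiles and 1 Gypsy profile.
print(classify_raw(gypsy=1, copia=3))         # unknown (1/3 is not below 0.3)
print(classify_normalized(gypsy=1, copia=3))  # Copia
```

With raw counts, a single stray Gypsy hit is enough to push a three-hit Copia element over the cutoff; after scaling, the same element is called Copia.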


oushujun · 2019-08-23T21:56:10Z

Hello @zhangrengang,

I think this is a very good point, and I agree that the way LTR_retriever classifies copia and gypsy is not the best scheme. I have been using the copia- and gypsy-specific HMMs in rice to assign new LTR elements to these superfamilies. A better way would be to use GyDB to assign superfamilies, as you suggested. Another way I have been thinking of, but have not yet had the time to implement, is to classify by the order of these conserved domains, which is the fundamental difference between gypsy and copia.
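
The domain-order idea can be sketched in a few lines of Python. In Copia elements the integrase (INT) lies upstream of reverse transcriptase (RT), whereas in Gypsy elements it lies downstream of RT and RNase H (RH). The function name `classify_by_domain_order`, the domain labels, and the input format (domain name plus start coordinate along the element's coding strand) are assumptions for illustration; this is not code from LTR_retriever.

```python
from typing import Iterable, Tuple


def classify_by_domain_order(hits: Iterable[Tuple[str, int]]) -> str:
    """Classify an LTR element from the relative order of INT and RT.

    hits: (domain, start) pairs, with start measured along the coding strand.
    Returns "Copia", "Gypsy", or "unknown" when INT or RT is missing.
    """
    # Keep the most upstream start seen for each domain.
    starts = {}
    for domain, start in hits:
        if domain not in starts or start < starts[domain]:
            starts[domain] = start

    if "INT" not in starts or "RT" not in starts:
        return "unknown"
    if starts["INT"] < starts["RT"]:
        return "Copia"   # PR-INT-RT-RH
    if starts["INT"] > starts["RT"]:
        return "Gypsy"   # PR-RT-RH-INT
    return "unknown"     # identical coordinates: order is ambiguous


copia_like = [("PR", 900), ("INT", 1500), ("RT", 2600), ("RH", 3400)]
gypsy_like = [("PR", 900), ("RT", 1500), ("RH", 2300), ("INT", 3000)]
print(classify_by_domain_order(copia_like))  # Copia
print(classify_by_domain_order(gypsy_like))  # Gypsy
```

A real implementation would also need to handle strand orientation, fragmented elements, and overlapping domain hits, which this sketch ignores.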

If you can implement a better scheme, you are welcome to contribute! For benchmarking accuracy, I use the curated rice TE library.

Best,
Shujun

zhangrengang · 2019-08-28T01:27:18Z

Hello Dr. Ou, here is a simple implementation. You may test it and/or integrate it.

oushujun · 2019-09-04T04:32:00Z

Hello @zhangrengang,

Thank you so much for developing this code in such a short time. I will test it soon and let you know.

Best,
Shujun

oushujun added the enhancement label Aug 23, 2019

oushujun mentioned this issue Sep 6, 2019

Benchmark with rice, dmel, and maize zhangrengang/TEsorter#2

Closed

oushujun added a commit that referenced this issue Sep 15, 2019

improved copia classification #51

2425c77

oushujun closed this as completed Sep 15, 2019

oushujun mentioned this issue Aug 10, 2022

too many LTR/unknown and most of LTR/unknown are classified as LTR/Gypsy #132

Closed
