too many LTR/unknown and most of LTR/unknown are classified as LTR/Copia #51

Closed
zhangrengang opened this issue Aug 12, 2019 · 3 comments

Comments

@zhangrengang

Thousands of LTRs in a plant genome are classified as unknown by LTR_retriever. However, most of them are classified as Copia on the basis of GyDB, as shown below:

# *.retriever.scn.extend.fa.aa
  Count LTR_retriever   GyDB
    927 LTR     Copia   LTR     Copia
      2 LTR     Copia   LTR     Gypsy
     41 LTR     Gypsy   -       -
      1 LTR     Gypsy   LTR     Caulimoviridae
      5 LTR     Gypsy   LTR     Copia
   2266 LTR     Gypsy   LTR     Gypsy
      9 LTR     Gypsy   LTR     unknown
      5 LTR     unknown -       -
   1248 LTR     unknown LTR     Copia
     21 LTR     unknown LTR     Gypsy
      5 mixture Copia   -       -
     27 mixture Copia   LTR     Copia
      1 mixture Copia   LTR     Gypsy
      1 mixture Copia   Unknown unknown
     85 mixture Gypsy   LTR     Gypsy
      1 mixture unknown -       -
     14 mixture unknown LTR     Copia
      2 mixture unknown LTR     Gypsy
    352 notLTR  unknown -       -
      1 notLTR  unknown LTR     Caulimoviridae
      8 notLTR  unknown LTR     Copia
     17 notLTR  unknown LTR     Gypsy
     43 -       -       LTR     Copia   
    150 -       -       LTR     Gypsy 
      1 -       -       LTR     unknown 
      2 -       -       Unknown unknown 
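To put a number on "most", the three rows where LTR_retriever reports LTR/unknown can be tallied directly. This is only a small arithmetic sketch in Python; the counts are copied from the table above, and the dictionary name is made up for illustration.

```python
# GyDB calls for elements that LTR_retriever labelled LTR/unknown
# (counts copied from the table above; "-" means no GyDB classification).
ltr_unknown_by_gydb = {"-": 5, "Copia": 1248, "Gypsy": 21}

total = sum(ltr_unknown_by_gydb.values())
shares = {name: count / total for name, count in ltr_unknown_by_gydb.items()}

for name, share in shares.items():
    print(f"{name}\t{ltr_unknown_by_gydb[name]}\t{share:.1%}")
```

So 1248 of the 1274 LTR/unknown elements, about 98%, are called Copia by GyDB.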

I think there is an issue in annotate_TE.pl:

	$family="Gypsy" if ($gypsy>$copia and $copia/$gypsy<0.3);
	$family="Copia" if ($copia>$gypsy and $gypsy/$copia<0.3);

Copia has the same weight (0.3) as Gypsy, but Copia has only 8 PFAMs, roughly 1/3 of the 28 PFAMs of Gypsy.
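The effect can be illustrated with a small Python sketch. It is not the code in annotate_TE.pl: it assumes that `$gypsy` and `$copia` are counts of matched superfamily-specific profiles and that the default family is "unknown". The normalized variant, which divides each count by the number of profiles available for that superfamily, is only one hypothetical way to balance the two and is not something LTR_retriever does.

```python
N_GYPSY_PFAMS = 28  # profile counts quoted above
N_COPIA_PFAMS = 8

def classify_raw(gypsy, copia, cutoff=0.3):
    """Mirror of the two quoted Perl lines; default family is an assumption."""
    family = "unknown"
    if gypsy > copia and copia / gypsy < cutoff:
        family = "Gypsy"
    if copia > gypsy and gypsy / copia < cutoff:
        family = "Copia"
    return family

def classify_normalized(gypsy, copia, cutoff=0.3):
    """Hypothetical variant: compare the fraction of each profile set matched."""
    return classify_raw(gypsy / N_GYPSY_PFAMS, copia / N_COPIA_PFAMS, cutoff)

# An element matching 4 of 8 Copia profiles and 3 of 28 Gypsy profiles:
print(classify_raw(3, 4))         # raw counts: 3/4 = 0.75 is not < 0.3
print(classify_normalized(3, 4))  # fractions: (3/28)/(4/8) is about 0.21
```

With raw counts this element stays unknown, while with the per-set fractions it is called Copia.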

@oushujun
Owner

Hello @zhangrengang,

I think this is a very good point, and I agree that the classification of copia and gypsy in LTR_retriever is not the best scheme. I have been using the copia- and gypsy-specific HMMs in rice to assign new LTR elements to these superfamilies. A better way would be to use GyDB to assign superfamilies, as you suggested. Another way I have been thinking of, but have not yet had time to implement, is to classify by the order of these conserved domains, which is the fundamental difference between gypsy and copia.
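A rough Python sketch of the domain-order idea, under stated assumptions: in Copia elements the integrase domain lies upstream of reverse transcriptase, whereas in Gypsy elements it lies downstream. The function name, the domain labels "INT" and "RT", and the input format (domain name plus start coordinate on the element's coding strand) are all hypothetical and are not taken from LTR_retriever.

```python
from typing import Dict, List, Tuple

def classify_by_domain_order(hits: List[Tuple[str, int]]) -> str:
    """Classify from (domain_name, start) hits given on the coding strand."""
    starts: Dict[str, int] = {}
    for name, start in hits:
        # Keep the 5'-most hit of each domain.
        if name not in starts or start < starts[name]:
            starts[name] = start
    if "INT" not in starts or "RT" not in starts:
        return "unknown"  # order cannot be decided without both domains
    if starts["INT"] < starts["RT"]:
        return "Copia"    # PR-INT-RT-RH
    if starts["INT"] > starts["RT"]:
        return "Gypsy"    # PR-RT-RH-INT
    return "unknown"

print(classify_by_domain_order([("PR", 100), ("INT", 600), ("RT", 1500), ("RH", 2100)]))
print(classify_by_domain_order([("PR", 100), ("RT", 600), ("RH", 1200), ("INT", 1800)]))
```

Elements missing either domain fall back to "unknown" here, so a scheme like this would probably still need the profile-count rule as a fallback.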

If you can implement a better scheme, you are welcome to contribute! For benchmarking accuracy, I use the curated rice TE library.

Best,
Shujun

@zhangrengang
Author

Hello Dr. Ou, here is a simple implementation. You may test it and/or integrate it.

@oushujun
Owner

oushujun commented Sep 4, 2019

Hello @zhangrengang ,

Thank you so much for developing this code in such a short time. I will test it soon and let you know.

Best,
Shujun
