A repo contains historical and updated MTEC libraries.

oushujun/MTEC

Introduction

This repository contains three historical versions of the curated maize TE library from the Maize TE Consortium (MTEC). I combined the three and further cleaned the combined library with the commands and curations described below. If you are looking for a comprehensive, high-quality maize TE library, use the latest library file in the root directory (named like "maizeTE02052020").

Files

  • maizeTE10102014 was downloaded from the official MTEC website (http://maizesequence.org). The website is gone, but I managed to get a mirror of the first page. Please refer to the file history/Maizedatabase_mirror.pdf for more information about the MTEC project.
  • TE_12-Feb-2015_15-35.fa was shared by Nicolas Blavet from https://github.com/mcstitzer/maize_v4_TE_annotation/issues/9.
  • Wessler-Bennetzen_2.fasta, shared by Kapeel Chougule, was used to annotate the initial B73 genome (Schnable et al. 2009). I believe this is an earlier version of the MTEC library.
  • nonTE.repeat.fa, shared by Jianing Liu, contains 5 non-TE repeats in maize (knob180, knob TR-1, rDNA spacer, subtelomere 4-12-1, and CentC).

Combine the three MTEC libraries + nonTE repeats

1. Reformat sequence IDs

for i in history/Wessler-Bennetzen_2.fasta history/maizeTE10102014 history/TE_12-Feb-2015_15-35.fa; do perl -nle 's/\s+$//g; $_=(split)[0]; s/\-/_/g; print $_' $i > $i.mod; done

2. Combine sequences with unique IDs

perl bin/output_by_list.pl 1 <(cat history/*.mod) 1 <(cat history/*.mod|grep \>|sort -u) -FA > history/maizeTE11212019.ori
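The idea behind this step, keeping one record per unique sequence ID, can be sketched with awk on toy data. This is not bin/output_by_list.pl, whose exact behavior is not described here; the sketch assumes that the first record seen for each ID is the one to keep, and the records are made up.

```shell
# Toy library with a repeated ID (famA); all records are hypothetical.
toy=$(mktemp)
printf '>famA\nACGT\n>famB\nGGCC\n>famA\nTTAA\n' > "$toy"

# On each header line, decide whether this ID is new; the header and its
# sequence lines are printed only for the first occurrence of each ID.
unique=$(awk '/^>/ { keep = !seen[$1]++ } keep' "$toy")
printf '%s\n' "$unique"
rm -f "$toy"
```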

3. Split the library into consensus and others

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori 1 <(grep consen history/maizeTE11212019.ori) -FA > history/maizeTE11212019.ori.consensus

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori 1 <(grep consen history/maizeTE11212019.ori) -FA -ex > history/maizeTE11212019.ori.others

4. Remove TEs in others that are represented by consensus TEs

RepeatMasker -pa 36 -div 40 -lib history/maizeTE11212019.ori.consensus -cutoff 225 history/maizeTE11212019.ori.others

perl bin/make_masked.pl -rmout history/maizeTE11212019.ori.others.out -genome history/maizeTE11212019.ori.others -maxdiv 20 -minscore 200 -minlen 80 -t 30

perl bin/cleanup_tandem.pl -nc 1000 -nr 0.5 -minlen 80 -cleanN 1 -cleanT 1 -trf 0 -f history/maizeTE11212019.ori.others.new.masked > history/maizeTE11212019.ori.others.new.masked.cln

cat history/maizeTE11212019.ori.consensus history/maizeTE11212019.ori.others.new.masked.cln > history/maizeTE11212019.ori2

5. Remove redundant sequences

perl bin/cleanup_nested.pl -in history/maizeTE11212019.ori2 -cov 0.98 -minlen 80 -miniden 95 -iter 2 -t 36

6. Remove nonTE repeats and tandem repeats

RepeatMasker -pa 36 -div 40 -no_is -norna -nolow -lib history/nonTE.repeat.fa -cutoff 225 history/maizeTE11212019.ori2.cln

perl bin/cleanup_tandem.pl -nc 1000 -nr 0.5 -minlen 80 -cleanN 1 -cleanT 1 -trf 1 -f history/maizeTE11212019.ori2.cln.masked > history/maizeTE11212019.ori2.cln2

cat history/nonTE.repeat.fa history/maizeTE11212019.ori2.cln2 > history/maizeTE11212019.ori3

Improve the combined library

1. Reclassify unknown TEs

python2 TEsorter.py history/maizeTE11212019.ori3 -p 36

2. Find misclassified entries

The file history/maizeTE11212019.ori3.rexdb.cls.tsv contains the new classifications of the library. Most of them are consistent with the old classifications. The main improvement is in the LTR/unknown classifications.

perl -nle '($info, $cla)=(split)[0,2]; my $oldcla=$1 if $info=~/^([A-Z]+)_/; $cla=~s/EnSpm_CACTA/DTC/; $cla=~s/hAT/DTA/; $cla=~s/PIF_Harbinger/DTH/; $cla=~s/MuDR_Mutator/DTM/; $cla=~s/Tc1_Mariner/DTT/; $cla=~s/Gypsy/RLG/; $cla=~s/Copia/RLC/; print "$oldcla\t$cla\t$info" if $cla ne $oldcla' history/maizeTE11212019.ori3.rexdb.cls.tsv |less

3. Some LTRs appear to have the same name but different classifications (RLG/RLC/RLX)

grep RL history/maizeTE11212019.ori3|perl -nle 's/RL._//; print $_'|sort|uniq -c |sort -k1,1|tac|less

blastn -query list.fa -subject list.fa -outfmt=6 > list.fa.out
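A minimal sketch of the name check on toy headers follows. It uses sed in place of the perl substitution and prints only the names that occur under more than one RL? code; the headers are made up.

```shell
# Toy headers: famA appears as both RLG and RLC, famB only as RLX.
toy=$(mktemp)
printf '>RLG_famA_AC000000_1\nACGT\n>RLC_famA_AC000000_1\nGGCC\n>RLX_famB_AC000000_2\nTTAA\n' > "$toy"

# Strip the 3-letter LTR code, count each remaining name, and keep the names
# that occur more than once (same name under different classifications).
conflicts=$(grep '^>RL' "$toy" | sed 's/^>RL._//' | sort | uniq -c | awk '$1 > 1 { print $2 }')
printf '%s\n' "$conflicts"
rm -f "$toy"
```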

4. Manually check misclassified sequences (list.fa).

These entries are listed in history/removal.list and removed:

Seq_ID Removal reason
RLC_chr3_D_28761151 rDNA-contained
DTM_Zm08959_AC199876_1 LTRcoding-contained
DTM_Zm22805IC_AC207689_1 LINE-contained
RIX_nugimu_AC203843_0 Duplicted_with_RIL_nugimu_AC203843_0
RLX_fageri_AC204875_8470 misclassified_as_LINE
DTA_ZM00171_consensus misclassified_as_CACTA
DTA_ZM00205_consensus misclassified_as_CACTA
DTA_ZM00284_consensus misclassified_as_CACTA
RLX_teki_AC202867-7492 rDNA-contained
RLG_ajajog_AC191578_3186 A_RLG_nested_in_RLC_ajajog_AC191578_3186
RLC_iwim_AC203300_7761 misclassified_RLG_duplicated
RLC_kupu_AC216069_13264 misclassified_RLG_duplicated
RLX_pute_AC197188_5467 duplicated_RLC_pute_AC197188_5467
RLX_votaed_AC215881_13209 duplicated_RLC_votaed_AC215881_13209
RLC_votaed_AC215881_13209 5-6_LTR_nested_together
RLX_bobeg_AC193485_3670 5_LTR_nested_together

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori3 1 history/removal.list -FA -ex > history/maizeTE11212019.ori3.cln

5. PPP_PPO_AC185414 is changed to DTH_PPO_AC185414 manually

6. Update LTR classifications

perl -nle '($info, $cla)=(split)[0,2]; my $oldcla=$1 if $info=~/^([A-Z]+)_/; $cla=~s/EnSpm_CACTA/DTC/; $cla=~s/hAT/DTA/; $cla=~s/PIF_Harbinger/DTH/; $cla=~s/MuDR_Mutator/DTM/; $cla=~s/Tc1_Mariner/DTT/; $cla=~s/Gypsy/RLG/; $cla=~s/Copia/RLC/; next unless /LTR/; my $info_new=$info; $info_new=~s/$oldcla/$cla/; print "$info|$info_new" if $cla ne $oldcla' history/maizeTE11212019.ori3.rexdb.cls.tsv > history/maizeTE11212019.ori3.rexdb.cls.tsv.LTR

for i in $(cat history/maizeTE11212019.ori3.rexdb.cls.tsv.LTR); do perl -i -slane 'my ($old, $new)=(split /\|/, $info); s/$old/$new/; print $_' -- -info="$i" history/maizeTE11212019.ori3.cln; done

7. Convert sequence names to RepeatMasker format

perl -nle 'my $id=(split)[0]; $id=~s/RLC_(.*)/$1#LTR\/Copia/; $id=~s/RLG_(.*)/$1#LTR\/Gypsy/; $id=~s/RLX_(.*)/$1#LTR\/unknown/; $id=~s/DHH_(.*)/$1#DNA\/Helitron/; $id=~s/DTA_(.*)/$1#DNA\/DTA/; $id=~s/DTC_(.*)/$1#DNA\/DTC/; $id=~s/DTH_(.*)/$1#DNA\/DTH/; $id=~s/DTM_(.*)/$1#DNA\/DTM/; $id=~s/DTT_(.*)/$1#DNA\/DTT/; $id=~s/(RIT_.*)/$1#LINE\/RTE/; $id=~s/(RIL_.*)/$1#LINE\/L1/; $id=~s/(RIX_.*)/$1#LINE\/unknown/; $id=~s/(ZM_CACTA_noncoding.*)/$1#MITE\/DTC/; $id=~s/(ZM_Stowaway.*)/$1#DNA\/DTT/; $id=~s/(ZM_Tourist.*)/$1#DNA\/DTH/; $id=~s/(ZM_hAT_noncoding.*)/$1#MITE\/DTA/; $id=~s/(RST_.*)/$1#SINE\/tRNA/; print $id' history/maizeTE11212019.ori3.cln > history/maizeTE11222019.ori

8. Rename short TIR (<= 600bp) to MITE

perl bin/rename_MITE.pl history/maizeTE11222019.ori > history/maizeTE11222019.ori.rename
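The rule in the heading can be sketched in awk as follows. This is not bin/rename_MITE.pl; it assumes that "TIR" means entries labelled DNA/DT? in the previous step, that length is counted over the whole sequence, and that the new label takes the form MITE/DT?. The toy records are made up.

```shell
# Two hypothetical TIR entries: one 8 bp long, one 601 bp long.
toy=$(mktemp)
{
  printf '>famA#DNA/DTA\nACGTACGT\n'
  printf '>famB#DNA/DTA\n'
  awk 'BEGIN { for (i = 0; i < 601; i++) printf "A"; print "" }'
} > "$toy"

# Collect each record, then relabel DNA/DT? as MITE/DT? when the
# sequence is 600 bp or shorter (assumed threshold from the heading).
renamed=$(awk '
  function emit() {
    if (hdr == "") return
    if (length(seq) <= 600) sub("#DNA/DT", "#MITE/DT", hdr)
    print hdr
    print seq
  }
  /^>/ { emit(); hdr = $0; seq = ""; next }
  { seq = seq $0 }
  END { emit() }
' "$toy")
headers=$(printf '%s\n' "$renamed" | grep '^>')
printf '%s\n' "$headers"
rm -f "$toy"
```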

Remove gene sequences

1. Mask gene cds

RepeatMasker -pa 36 -q -no_is -norna -nolow -div 40 -lib history/Zea_mays.AGPv4.cds.all.noTE.fa.mod.cln -cutoff 500 history/maizeTE11222019.ori.rename

perl bin/cleanup_tandem.pl -misschar n -Nscreen 1 -nc 1000 -nr 0.3 -minlen 80 -maxlen 5000000 -cleanN 1 -cleanT 1 -trf 0 -f history/maizeTE11222019.ori.rename.masked > history/maizeTE11222019.ori.rename.nogene

2. Finalize

Manually add the sequence CL569186.1#subtelomere/4-12-1 back to history/maizeTE11222019.ori.rename.nogene

cp history/maizeTE11222019.ori.rename.nogene maizeTE11222019

Updates

01/30/2020

Added four CRM sequences (CRM1-4) contributed by Na Wang from Gernot and Presting (2008).

02/03/2020

Added 3-letter class codes before all consensus sequence IDs. IDs like "ZM00034_consensus" were inherited from the 2014 version of MTEC, so that part of the ID is kept unchanged.

perl -nle 's/>(.*)#(.*)\/(.*)/>$3_$1#$2\/$3/; print $_' maizeTE01302020 > maizeTE02032020

02/05/2020

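Fixed naming: the 3-letter code is added only to consensus IDs, and CRM entries get the LTR/CRM classification.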

perl -nle 's/>(.*)#(.*)\/(.*)/>$3_$1#$2\/$3/ if /consensus/; s/>(.*)/>$1#LTR\/CRM/ if /CRM/; print $_' history/maizeTE01302020 > maizeTE02052020
