SB Purge
Reduce a set of sequences by replacing groups of highly similar sequences with a single representative (inspired by the Purge tool from the MEME suite). The criterion used to determine similarity is the BLAST bit-score, and a maximum threshold value is passed in as a parameter. For reference, recently duplicated paralogs will often have bit-scores around 400, while easily recognizable orthologs are often around 100-250.
The 'purged' sequence set is returned, along with a mapping of all sequences that have been removed (the mapping is omitted if the -q flag is passed in). The mapping lists each retained sequence ID, followed by the sequences that were deleted because they match the retained sequence with a bit-score above the indicated threshold.
NOTE: This tool depends on the blastp or blastn binary (depending on input) being present in your system PATH.
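A quick way to confirm the dependency is available before running the tool is to look the binaries up on PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name `find_blast` is hypothetical and is not part of this tool.

```python
import shutil
from typing import Optional


def find_blast(program: str = "blastp") -> Optional[str]:
    """Return the full path of a BLAST binary found on PATH, or None if it is missing."""
    return shutil.which(program)


# Report the status of both binaries; which one is needed depends on the input type.
for name in ("blastp", "blastn"):
    path = find_blast(name)
    print(f"{name}: {path if path else 'not found on PATH'}")
```

If either binary is reported as missing, install BLAST+ or add its directory to PATH before purging sequences of that type.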
The -prg flag takes one argument: an integer specifying the maximum bit-score threshold for inclusion in the purged sequence set (i.e., smaller values retain fewer sequences).
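The effect of the threshold can be illustrated with a small greedy sketch. This is an assumption about how such a purge could work, not the tool's actual implementation: the function name `purge`, the input order used to pick representatives, and the pairwise scores are all hypothetical, and the real tool obtains its bit-scores from BLAST.

```python
from typing import Dict, FrozenSet, List


def purge(ids: List[str],
          bit_scores: Dict[FrozenSet[str], float],
          threshold: float) -> Dict[str, List[str]]:
    """Greedily keep representatives and map each one to the sequences it replaced.

    Sequences are visited in input order (an assumption). A kept sequence removes
    every remaining sequence that matches it with a bit-score above the threshold.
    """
    kept: Dict[str, List[str]] = {}
    removed = set()
    for seq_id in ids:
        if seq_id in removed:
            continue
        kept[seq_id] = []
        for other in ids:
            if other == seq_id or other in removed or other in kept:
                continue
            # Pairs without a recorded hit are treated as a score of 0.
            score = bit_scores.get(frozenset((seq_id, other)), 0.0)
            if score > threshold:
                kept[seq_id].append(other)
                removed.add(other)
    return kept


# Hypothetical pairwise bit-scores for four sequences.
scores = {
    frozenset(("A", "B")): 420.0,
    frozenset(("A", "C")): 90.0,
    frozenset(("A", "D")): 60.0,
    frozenset(("C", "D")): 260.0,
}

print(purge(["A", "B", "C", "D"], scores, 230))  # {'A': ['B'], 'C': ['D']}
print(purge(["A", "B", "C", "D"], scores, 50))   # {'A': ['B', 'C', 'D']}
```

Lowering the threshold from 230 to 50 collapses the set from two representatives to one, which mirrors the note that smaller values retain fewer sequences.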
#NEXUS
begin data;
dimensions ntax=8 nchar=316;
format datatype=protein missing=? gap=-;
matrix
'Dme-Panxδ1' YKLLGSLKSYLKWQIQTDNAVFRLHNSFTTVLLLTCSLIITATQYVGQPISCIVGVP-HVVNTFCWIHSTFTMPDRREVHPGVDF-KYYTYYQWVCFVLFFQAMACYTPKFLWNKFEGGLMRMIVGLNITRKRDALLDYLIKHVKRHKLY-AYWACEFLCCINIIVQMYLMNRFFDGEFLSYGTNIMKLSDVPQEQRVDPMVYVFPRVTKCTFHKYGPSGSLQKHDSLCILPLNIVNEKTYVFIWFWFWILLVLLGL--VFRCIIFPKFRPRLLNASNRIPMECRLDIGDWWLIYMLGRNLDPVIYKDVMSEFQVP
'Dme-Panxδ2' MDVFGSVKGLLKIDQV-DNNVFRMHYKATVIILIAFSLLVTSRQYIGDPIDCIVEIPLGVMDTYCWIYSTFTVPEGRDVQP--GSEKYHKYYQWVCFVLFFQAILFYVPRYLWKSWEGGRLKMLVDLSVNDKDRKIVDYFG-NLNRHNFYAFFFVCEALNFVNVIGQIYFVDFFLDGEFSTYGSDVLKFTELEPDERIDPMARVFPKVTKCTFHKYGPSGSVQTHDGLCVLPLNIVNEKIYVFLWFWFIILSIMSI-SLIYRIAVAPKLRHLLLRARSRAESEVEVAIGDWFLLYQLGKNIDPLIYKEVISDLEMG
'Dme-Panxδ3' -----GFI---K----IDNMVFRCHYRITAILFTC-CIIVTANNLIGDPISCI--IPMHVINTFCWITYTYTV---A--GPGLE-K--HSYYQWVPFVLFFQGLMFYVPHWVWKM-D-GKIRMITG--VDDRDRIL-KYFVNNT--HNGYSFYFFCELLNFINVIVNIFMVDKFLGGAFMSYGTDVLKFSNMDQ-DRFDPMIEIFPRLTKCTFHKFGPSGSVQKHDTLCVLALNILNEKIYIFLWFWFIILATISGVAVLYSVVI---TR-TIR----------K--EGDFLILHFLSQNLSTRSYSDML-Q----
'Dme-Panxδ4' MAAVKPLSKYLQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPIQCF-G-D-KDMDAFCWIYGAYL-QCAVSK--VVEN--YITYYQWVVLVLLLESFVFYMPAFLWKIWEGGRLKHLCDFKRTHRV--LVNYFETHFR----YFVYVFCEILNLSISILNFLLLDVFFGGFWGRYRNALY-------NQWI-AV---FPKCAKCEYKG-GPSGSSNIYDYLCLLPLNILNEKIFAFLWIWFI-LAMLISLKFLYRLAVLYPMRLQLLRPKKHLQVALNCSFGDWFVLMRVGNNISPELFRKLLEEL---
'Dme-Panxδ5' MSAVKPLSKYLQFKIRIYDSVFTIHSRCTVVILLTCSLLLSARQYFGDPIQCI-S-EEKNIESYCWTMGTYYNEASIAE--GVEIRQYLRYYQWVIILLLFQSFVFYFPSCLWKVWEGRRLKQLCEVDNTRRM--LVKYFDMHFC----YMAYVFCEVLNFLISVVNIIVLEVFLNGFWSKYLRALW-------DRWV-SV---FPKIAKCELKF-GGSGTANVMDNLCILPLNILNEKIFVFLWAWFL-LALMSGLNLLCRLAICSRLREQMIRTKRHVKRALDLTIGDWFLMMKVSVNVNPMLFRDLMQEL---
'Dme-Panxδ6' MAAVKPLSNYLRLKVRIYDPIFTLHSKCTIVILLTCTFLLSAKQYFGEPILCL-S-SERQADSYCWTMGTYWNEQSIAE--GVETRMYLRYYQWVFMILLFQSLLFYFPSFLWKVWEGQRMEQLCEVDRTRQM--LTRYFPIHWC----YSIYAFCELLNVFISILNFWLMDVVFNGFWYKYIHALW-------NLWM-RV---FPKVAKCEFVY-GPSGTPNIMDILCVLPLNILNEKIFAVLYVWFL-FALLAIMNILYRLLICCPLRLQLLNPKSHVREVLSAGYGDWFVLMCVSINVNPTLFRELLEQL--D
'Dme-Panxδ7' --L--SV----R-Q-RIDNIVFKLHYRWTVILLVA-TLLITSRQYIGEHIQCL--VVSPVINTFCFFTPTF-VD--P---PGI--D-RHAYYQWVPFVLFFQALCFYIPHALWKW-EGGRIKALVK--LG-MERVKD---IRDM--RLNWG-HVFAEVLNLINLLLQITWTNRFLGGQFLTLG------HALKN-RSDEVV---FPKITKCKFHKFGDSGSIQMHDALCVMALNIMNEKIYIILWFWYAFLLIVTVLGLLWRLCF---VR-WSL----------P-LASNWMFLFFLRSNLS-----E-L----DN
'Dme-Panxδ8' LDIFRGLKNLVKVSVKTDSIVFRLHYSITVMILMSFSLIITTRQYVGNPIDCVTDIP-DVLNTYCWIQSTYTLKSLVSVYPGIGNKKHYKYYQWVCFCLFFQAILFYTPRWLWKSWEGGKIHALIDLDISEKKKLLLDYLWENLRYHNWW-AYYVCELLALINVIGQMFLMNRFFDGEFITFGLKVIDYMETDQEDRMDPMIYIFPRMTKCTFFKYGSSGEVEKHDAICILPLNVVNEKIYIFLWFWFILLTFLTLLTLIYRVIIFPRMRVYLFRMRFRVRRDIEIKMGDWFLLYLLGENIDTVIFRDVVQDLRL-
;
end;
$: sb Drosophila.nex -prg 230
### Deleted record mapping ###
Dme-Panxδ1
Dme-Panxδ2, Dme-Panxδ8
Dme-Panxδ3
Dme-Panxδ2, Dme-Panxδ8
Dme-Panxδ4
Dme-Panxδ5, Dme-Panxδ6
Dme-Panxδ7
##############################
#NEXUS
begin data;
dimensions ntax=4 nchar=316;
format datatype=protein missing=? gap=-;
matrix
'Dme-Panxδ1' YKLLGSLKSYLKWQIQTDNAVFRLHNSFTTVLLLTCSLIITATQYVGQPISCIVGVP-HVVNTFCWIHSTFTMPDRREVHPGVDF-KYYTYYQWVCFVLFFQAMACYTPKFLWNKFEGGLMRMIVGLNITRKRDALLDYLIKHVKRHKLY-AYWACEFLCCINIIVQMYLMNRFFDGEFLSYGTNIMKLSDVPQEQRVDPMVYVFPRVTKCTFHKYGPSGSLQKHDSLCILPLNIVNEKTYVFIWFWFWILLVLLGL--VFRCIIFPKFRPRLLNASNRIPMECRLDIGDWWLIYMLGRNLDPVIYKDVMSEFQVP
'Dme-Panxδ3' -----GFI---K----IDNMVFRCHYRITAILFTC-CIIVTANNLIGDPISCI--IPMHVINTFCWITYTYTV---A--GPGLE-K--HSYYQWVPFVLFFQGLMFYVPHWVWKM-D-GKIRMITG--VDDRDRIL-KYFVNNT--HNGYSFYFFCELLNFINVIVNIFMVDKFLGGAFMSYGTDVLKFSNMDQ-DRFDPMIEIFPRLTKCTFHKFGPSGSVQKHDTLCVLALNILNEKIYIFLWFWFIILATISGVAVLYSVVI---TR-TIR----------K--EGDFLILHFLSQNLSTRSYSDML-Q----
'Dme-Panxδ4' MAAVKPLSKYLQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPIQCF-G-D-KDMDAFCWIYGAYL-QCAVSK--VVEN--YITYYQWVVLVLLLESFVFYMPAFLWKIWEGGRLKHLCDFKRTHRV--LVNYFETHFR----YFVYVFCEILNLSISILNFLLLDVFFGGFWGRYRNALY-------NQWI-AV---FPKCAKCEYKG-GPSGSSNIYDYLCLLPLNILNEKIFAFLWIWFI-LAMLISLKFLYRLAVLYPMRLQLLRPKKHLQVALNCSFGDWFVLMRVGNNISPELFRKLLEEL---
'Dme-Panxδ7' --L--SV----R-Q-RIDNIVFKLHYRWTVILLVA-TLLITSRQYIGEHIQCL--VVSPVINTFCFFTPTF-VD--P---PGI--D-RHAYYQWVPFVLFFQALCFYIPHALWKW-EGGRIKALVK--LG-MERVKD---IRDM--RLNWG-HVFAEVLNLINLLLQITWTNRFLGGQFLTLG------HALKN-RSDEVV---FPKITKCKFHKFGDSGSIQMHDALCVMALNIMNEKIYIILWFWYAFLLIVTVLGLLWRLCF---VR-WSL----------P-LASNWMFLFFLRSNLS-----E-L----DN
;
end;
$: sb Drosophila.nex -prg 150 -q
#NEXUS
begin data;
dimensions ntax=1 nchar=316;
format datatype=protein missing=? gap=-;
matrix
'Dme-Panxδ1' YKLLGSLKSYLKWQIQTDNAVFRLHNSFTTVLLLTCSLIITATQYVGQPISCIVGVP-HVVNTFCWIHSTFTMPDRREVHPGVDF-KYYTYYQWVCFVLFFQAMACYTPKFLWNKFEGGLMRMIVGLNITRKRDALLDYLIKHVKRHKLY-AYWACEFLCCINIIVQMYLMNRFFDGEFLSYGTNIMKLSDVPQEQRVDPMVYVFPRVTKCTFHKYGPSGSLQKHDSLCILPLNIVNEKTYVFIWFWFWILLVLLGL--VFRCIIFPKFRPRLLNASNRIPMECRLDIGDWWLIYMLGRNLDPVIYKDVMSEFQVP
;
end;