Steve Bond edited this page Oct 28, 2015 · 2 revisions

--clean_seq, -cs

Description

Remove all non-alignment characters from the input. This includes spaces, numbers, stop characters (e.g. '*'), etc., but not gap characters ('-'). As usage example 1 shows, protein stop characters are converted to gaps rather than deleted. Passing the word 'strict' also replaces ambiguous/degenerate characters in nucleotide sequences with 'N'.

Nucleotide sequences: ATGCURYWSMKHBVDNX will be retained. If 'strict' is specified, only ATGCXNU will be retained.

Protein sequences: ACDEFGHIKLMNPQRSTVWXY will be retained. The 'strict' argument has no effect.
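
The rules above can be sketched in Python. This is a minimal illustration, not AlignBuddy's actual implementation: the function name `clean_aligned_seq` and its parameters are hypothetical, the alphabets are copied from the description, and the handling of '*' (converted to a gap, while other disallowed characters are dropped) is an assumption inferred from usage example 1.

```python
# Alphabets copied from the description above.
NUCLEOTIDE_AMBIGUOUS = set("ATGCURYWSMKHBVDNX")
NUCLEOTIDE_STRICT = set("ATGCXNU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWXY")
GAP = "-"


def clean_aligned_seq(seq, alphabet="nucl", strict=False, rep_char="N"):
    """Hypothetical helper: clean one aligned sequence string.

    alphabet is "nucl" or "prot"; strict only affects nucleotides.
    """
    allowed = PROTEIN if alphabet == "prot" else NUCLEOTIDE_AMBIGUOUS
    cleaned = []
    for char in seq:
        upper = char.upper()
        if char == GAP:
            # Gap characters are always kept.
            cleaned.append(char)
        elif char == "*":
            # Assumption: stop characters become gaps (see usage example 1).
            cleaned.append(GAP)
        elif upper not in allowed:
            # Assumption: spaces, numbers, and other characters are dropped.
            continue
        elif strict and alphabet == "nucl" and upper not in NUCLEOTIDE_STRICT:
            # Ambiguous nucleotide codes are replaced in strict mode.
            cleaned.append(rep_char)
        else:
            cleaned.append(char)
    return "".join(cleaned)


if __name__ == "__main__":
    print(clean_aligned_seq("MK 1V*", alphabet="prot"))
    print(clean_aligned_seq("ATRG-Y", strict=True))
    print(clean_aligned_seq("ATRG-Y", strict=True, rep_char="X"))
```

Keeping the strict set separate from the degenerate alphabet means the same loop serves both modes; only the replacement branch depends on 'strict'.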

Arguments

'strict' ( exact string )

Optional. By default, ambiguous nucleotide characters (i.e., the degenerate alphabet) are retained, but these can cause problems for some downstream analyses. Include the word 'strict' to replace ambiguous characters with a single character ('N' by default).

Replacement character ( char )

Optional. To replace degenerate residues with a character other than 'N', specify it after 'strict' (e.g., 'strict X').
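
One way to interpret the two optional arguments is sketched below. The function `parse_clean_seq_args` is hypothetical and is not part of AlignBuddy; it assumes the values following `-cs` arrive as a list of strings, that 'strict' is matched exactly, and that any other value must be a single replacement character.

```python
def parse_clean_seq_args(args):
    """Hypothetical parser for the values that follow -cs.

    Returns (strict, rep_char); rep_char defaults to 'N'.
    """
    strict = "strict" in args  # exact string match, as documented
    rep_char = "N"
    for arg in args:
        if arg == "strict":
            continue
        if len(arg) != 1:
            # Assumption: the replacement must be a single character.
            raise ValueError("Replacement must be one character: %r" % arg)
        rep_char = arg
    return strict, rep_char


if __name__ == "__main__":
    print(parse_clean_seq_args([]))
    print(parse_clean_seq_args(["strict"]))
    print(parse_clean_seq_args(["strict", "X"]))
```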

Examples

Input file: Mle-Panx_align.fa

>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFMFGSNISCIGF
EKLERNFVEEYCWTQGIYTSKAAYNMP-LHTPYPGIAPCVPEYDPVTQKYWLPCG----V
EEEDKAYHLWYQWVPFYFLAVAVGYYLPFLILKGSKLHQVKPLITYLMNQRNLETDPNHL
VGKLSHWIFRQLVYSRFAATSTIRMYWHDWGLVLLVCSVKILYLTVSLIHLFATAKMFHI
GNWFTYGIMFARR---SNSHTTHVKDVFFPKMVACKIETWSFTGKNHLHGMCVLALNVMN
QYLFLIVWYVNVIIIFLNSISCIYTIVKFCSPNIVHHRIVNSSSLDDHHDFTRMFGYVGP
SGRIILAKMSEHMPGYMLKQVAKKVTEKIDIENEKNRGRAPTIKFTKVNGQPSELARQPL
MHLNALMLGMVPQNLPEPKIQNIQRSQKKVRFLV*
>Mle-Panxα11
M--LISSLVQFSRLSPFKEITIDDGWDQLNRSFMFVLMVICGTIVTVRQHTGNIISCNGF
TKYDGSFSEDYCWTQGLYTIREAYHVSDVNVPYPGV---IPEEIPLCLGDNC---DKLAN
SNTTRVYHLWYQWIPFYFWLASAAFFLPYLIYKRYGFGDIKPLIHMLYNPLDGDEGVKAD
SEKASIWLYHRFS-IYMNEHSMYANFMERHGIGILVIAIKVMYLIISVLLMVMTAMMFEL
ADFKQYGIVWAQQWPDPPANVTGIKDLLFPKMVACEIKRWGPTGLEDENGMCVLAPNVIN
QYIFLILWWALVFTIVSNVFNVLAGVIRIVFIYGSYRRMLASAFLRDDPHYKKVYYKIGT
SGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDD----------------
------------PLL*-------------------

Usage example 1

Convert protein stop characters into gaps

$: alb Mle-Panx_align.fa -cs

Output

>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFMFGSNISCIGF
EKLERNFVEEYCWTQGIYTSKAAYNMP-LHTPYPGIAPCVPEYDPVTQKYWLPCG----V
EEEDKAYHLWYQWVPFYFLAVAVGYYLPFLILKGSKLHQVKPLITYLMNQRNLETDPNHL
VGKLSHWIFRQLVYSRFAATSTIRMYWHDWGLVLLVCSVKILYLTVSLIHLFATAKMFHI
GNWFTYGIMFARR---SNSHTTHVKDVFFPKMVACKIETWSFTGKNHLHGMCVLALNVMN
QYLFLIVWYVNVIIIFLNSISCIYTIVKFCSPNIVHHRIVNSSSLDDHHDFTRMFGYVGP
SGRIILAKMSEHMPGYMLKQVAKKVTEKIDIENEKNRGRAPTIKFTKVNGQPSELARQPL
MHLNALMLGMVPQNLPEPKIQNIQRSQKKVRFLV-
>Mle-Panxα11
M--LISSLVQFSRLSPFKEITIDDGWDQLNRSFMFVLMVICGTIVTVRQHTGNIISCNGF
TKYDGSFSEDYCWTQGLYTIREAYHVSDVNVPYPGV---IPEEIPLCLGDNC---DKLAN
SNTTRVYHLWYQWIPFYFWLASAAFFLPYLIYKRYGFGDIKPLIHMLYNPLDGDEGVKAD
SEKASIWLYHRFS-IYMNEHSMYANFMERHGIGILVIAIKVMYLIISVLLMVMTAMMFEL
ADFKQYGIVWAQQWPDPPANVTGIKDLLFPKMVACEIKRWGPTGLEDENGMCVLAPNVIN
QYIFLILWWALVFTIVSNVFNVLAGVIRIVFIYGSYRRMLASAFLRDDPHYKKVYYKIGT
SGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDD----------------
------------PLL--------------------

Input file: ambiguous_cds.fa

>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
RRRRRRRRRRRRCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
YCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
WGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
SATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
MGTACGACGGTACGAGGTTAAAGT------------------CCAGACCCTGATCAGTTG
KTGTCACCGACGCGGATATCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTG
HCGGCTGCTGCCTTCTTCATGCCCTACCTTCTGTACA------TTGGCATGGGAGATATC
BAGCCTCTCGTGAG------ACACAATCCAGTAGAATCAGACCAGGAGTTAAAGAAGATG
VCAGACAAGGCTGCAACATGGCTGTTCTACAAGTTTGACCTGTACATGAGCGAACAGTCG
DTCCTAGCAAGTCTCACCAGAAAACACGGTCTTGGTCTATCCATGGTCTTTGTAAAGATC
NTATACGCCGCAGTGTCGTTCGGGTGTTTCCTCCTGACCGCTGAGATGTTCTCAATTGGA
XATTTTAAAACCTATGGATCAGAATGGATCAAGAAGTTAAAGTTGGAAGATAATCTAGCT
TAG---------------------------------------------------------

Usage example 2

Restrict alignment characters to the unambiguous character set and 'N'

$: alb ambiguous_cds.fa -cs strict

Output

>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
NNNNNNNNNNNNCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
NCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
NGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
NATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
NGTACGACGGTACGAGGTTAAAGT------------------CCAGACCCTGATCAGTTG
NTGTCACCGACGCGGATATCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTG
NCGGCTGCTGCCTTCTTCATGCCCTACCTTCTGTACA------TTGGCATGGGAGATATC
NAGCCTCTCGTGAG------ACACAATCCAGTAGAATCAGACCAGGAGTTAAAGAAGATG
NCAGACAAGGCTGCAACATGGCTGTTCTACAAGTTTGACCTGTACATGAGCGAACAGTCG
NTCCTAGCAAGTCTCACCAGAAAACACGGTCTTGGTCTATCCATGGTCTTTGTAAAGATC
NTATACGCCGCAGTGTCGTTCGGGTGTTTCCTCCTGACCGCTGAGATGTTCTCAATTGGA
NATTTTAAAACCTATGGATCAGAATGGATCAAGAAGTTAAAGTTGGAAGATAATCTAGCT
TAG---------------------------------------------------------

Usage example 3

Replace ambiguous characters with 'X' instead of 'N'

$: alb ambiguous_cds.fa -cs strict X

Output

>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
XXXXXXXXXXXXCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
XCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
XGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
XATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
XGTACGACGGTACGAGGTTAAAGT------------------CCAGACCCTGATCAGTTG
XTGTCACCGACGCGGATATCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTG
XCGGCTGCTGCCTTCTTCATGCCCTACCTTCTGTACA------TTGGCATGGGAGATATC
XAGCCTCTCGTGAG------ACACAATCCAGTAGAATCAGACCAGGAGTTAAAGAAGATG
XCAGACAAGGCTGCAACATGGCTGTTCTACAAGTTTGACCTGTACATGAGCGAACAGTCG
XTCCTAGCAAGTCTCACCAGAAAACACGGTCTTGGTCTATCCATGGTCTTTGTAAAGATC
XTATACGCCGCAGTGTCGTTCGGGTGTTTCCTCCTGACCGCTGAGATGTTCTCAATTGGA
XATTTTAAAACCTATGGATCAGAATGGATCAAGAAGTTAAAGTTGGAAGATAATCTAGCT
TAG---------------------------------------------------------
