Skip to content
Martin Asser Hansen edited this page Oct 1, 2015 · 6 revisions

Biopiece: get_seq

Description

get_seq. The subsequences can be obtained explicitly or from BED/PSL/BLAST entries in the stream.

Usage

get_seq [[options]] <-i index> <-s seq_name>

or

... | get_seq [[options]] <-i index>

Options

[-?          | --help]               #  Print full usage description.
[-s <string> | --seq_name=<string>]  #  Explicit sequence name.
[-b <uint>   | --beg=<uint>]         #  Begin position of subsequence (first residue=1).
[-e <uint>   | --end=<uint>]         #  End position of subsequence.
[-l <uint>   | --len=<uint>]         #  Length of subsequence.
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

Consider the following FASTA entries in the file test.fna:

>test1
CTTAGTACGTTAGCCATGAA
>test2
GAGCTTAGTACGTTAGCCAT
>test3
TGAGGGTTTAGTTCGTTAAC
>test4
ACATGAGAGCTTAGTACGTG
>test5
TAAACATGAGAGCTTAGTAA

First, we need a sequence index which we make with create_seq_index:

read_fasta -i test.fna | create_seq_index -d my_dir -i my_index -x

This creates the index files:

my_dir/my_index.index
my_dir/my_index.seq

Now, we can make a lookup of test2 like this:

get_seq -i my_dir/my_index -s test2

SEQ: GAGCTTAGTACGTTAGCCAT
SEQ_LEN: 20
SEQ_NAME: test2
---

Now, we can specify a begin position (which is 1-based) like this:

get_seq -i my_dir/my_index -s test2 -b 5

SEQ: TTAGTACGTTAGCCAT
SEQ_LEN: 16
SEQ_NAME: test2
---

If the begin position is longer than the sequnce, no record is emitted.

We can also a end position:

get_seq -i my_dir/my_index -s test2 -b 5 -e 10
SEQ: TTAGTA
SEQ_LEN: 6
SEQ_NAME: test2
---

Or a length:

get_seq -i my_dir/my_index -s test2 -b 5 -l 10

SEQ: TTAGTACGTT
SEQ_LEN: 10
SEQ_NAME: test2
---

Finally, it is possible to get multiple sequences from the stream where get_seq uses the value of SEQ_NAME, BEG, END, and LEN to extract subsequences.

Consider the following table in test.tab:

#SEQ_NAME   BEG   LEN
test1       5     5
test2       7     5
test3       9     5
test4       11    5
test5       13    5
read_tab -i test.tab | get_seq -i my_dir/my_index

BEG: 5
SEQ: TACGT
SEQ_LEN: 5
SEQ_NAME: test1
LEN: 5
---
BEG: 7
SEQ: GTACG
SEQ_LEN: 5
SEQ_NAME: test2
LEN: 5
---
BEG: 9
SEQ: AGTTC
SEQ_LEN: 5
SEQ_NAME: test3
LEN: 5
---
BEG: 11
SEQ: TAGTA
SEQ_LEN: 5
SEQ_NAME: test4
LEN: 5
---
BEG: 13
SEQ: TTAGT
SEQ_LEN: 5
SEQ_NAME: test5
LEN: 5
---

See also

read_fasta

read_tab

create_seq_index

get_genome_seq

extract_seq

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

mail@maasha.dk

September 2009

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

get_seq is part of the Biopieces framework.

http://www.biopieces.org

Clone this wiki locally