This repository has been archived by the owner on Jul 19, 2021. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 111
/
demuxbyname.sh
executable file
·152 lines (130 loc) · 7.34 KB
/
demuxbyname.sh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
#!/bin/bash
usage(){
echo "
Written by Brian Bushnell
Last modified Jan 7, 2020
Description: Demultiplexes sequences into multiple files based on their names,
substrings of their names, or prefixes or suffixes of their names.
Allows unlimited output files while maintaining only a small number of open file handles.
Usage:
demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...>
Alternately:
demuxbyname.sh in=<file> out=<outfile> delimiter=whitespace prefixmode=f
This will demultiplex by the substring after the last whitespace.
demuxbyname.sh in=<file> out=<outfile> length=8 prefixmode=t
This will demultiplex by the first 8 characters of read names.
demuxbyname.sh in=<file> out=<outfile> delimiter=: prefixmode=f
This will split on colons, and use the last substring as the name; useful for
demuxing by barcode for Illumina headers in this format:
@A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT
in2 and out2 are for paired reads in twin files and are optional.
If input is paired and there is only one output file, it will be written interleaved.
File Parameters:
in=<file> Input file.
in2=<file> If input reads are paired in twin files, use in2 for the second file.
out=<file> Output files for reads with matched headers (must contain % symbol).
For example, out=out_%.fq with names XX and YY would create out_XX.fq and out_YY.fq.
If twin files for paired reads are desired, use the # symbol. For example,
out=out_%_#.fq in this case would create out_XX_1.fq, out_XX_2.fq, out_YY_1.fq, etc.
outu=<file> Output file for reads with unmatched headers.
stats=<file> Print statistics about how many reads went to each file.
Processing Modes (determines how to convert a read into a name):
prefixmode=t (pm) Match prefix of read header. If false, match suffix of read header.
prefixmode=f is equivalent to suffixmode=t.
barcode=f Parse barcodes from Illumina headers.
chrom=f For mapped sam files, make one file per chromosome (scaffold) using the rname.
header=f Use the entire sequence header.
delimiter= For prefix or suffix mode, specifying a delimiter will allow exact matches even if the length is variable.
This allows demultiplexing based on names that are found without specifying a list of names.
In suffix mode, for example, everything after the last delimiter will be used.
Normally the delimiter will be used as a literal string (a Java regular expression); for example, ':' or 'HISEQ'.
But there are some special delimiters which will be replaced by the symbol they name,
because they are reserved in some operating systems or cause other problems.
These are provided for convenience due to possible OS conflicts:
space, tab, whitespace, pound, greaterthan, lessthan, equals,
colon, semicolon, bang, and, quote, singlequote
These are provided because they interfere with Java regular expression syntax:
backslash, hat, dollar, dot, pipe, questionmark, star,
plus, openparen, closeparen, opensquare, opencurly
In other words, to match '.', you should set 'delimiter=dot'.
substring=f Names can be substrings of read headers. Substring mode is
slow if the list of names is large. Requires a list of names.
Other Processing Parameters:
column=-1 If positive, split the header on a delimiter and match that column (1-based).
For example, using this header:
NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC
You could demux by tile (11101) using 'delimiter=: column=5'
Column is 1-based (first column is 1).
If column is omitted when a delimiter is present, prefixmode
will use the first substring, and suffixmode will use the last substring.
names= List of strings (or files containing strings) to parse from read names.
If the names are in text files, there should be one name per line.
This is optional. If a list of names is provided, files will only be created for those names.
For example, 'prefixmode=t length=5' would create a file for every unique last 5 characters in read names,
and every read would be written to one of those files. But if there was addionally 'names=ABCDE,FGHIJ'
then at most 2 files would be created, and anything not matching those names would go to outu.
length=0 If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names.
For example, you could create files based on the first 8 characters of read names.
hdist=0 Allow a hamming distance for demultiplexing barcodes. This requires a list of names (barcodes).
replace= Replace some characters in the output filenames. For example, replace=+-
would replace the + symbol in headers with the - symbol in filenames. So you could
match the name ACTGAGC+ATTAGAC in the header, but write to a file named ACTGAGC-ATTAGAC.
Buffering Parameters
streams=4 Allow at most this many active streams. The actual number of open files
will be 1 greater than this if outu is set, and doubled if output
is paired and written in twin files instead of interleaved.
minreads=0 Don't create a file for fewer than this many reads; instead, send them to unknown.
This option will incur additional memory usage.
Common parameters:
ow=t (overwrite) Overwrites files that already exist.
zl=4 (ziplevel) Set compression level, 1 (low) to 9 (max).
int=auto (interleaved) Determines whether INPUT file is considered interleaved.
qin=auto ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
qout=auto ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
Java Parameters:
-Xmx This will set Java's memory usage, overriding autodetection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
The max is typically 85% of physical memory.
-eoom This flag will cause the process to exit if an out-of-memory
exception occurs. Requires Java 8u92+.
-da Disable assertions.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}
#This block allows symlinked shellscripts to correctly set classpath.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
cd "$(dirname "$DIR")"
DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null
#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"
z="-Xmx2g"
z2="-Xms2g"
set=0
if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
usage
exit
fi
calcXmx () {
source "$DIR""/calcmem.sh"
setEnvironment
parseXmx "$@"
if [[ $set == 1 ]]; then
return
fi
freeRam 3200m 84
z="-Xmx${RAM}m"
z2="-Xms${RAM}m"
}
calcXmx "$@"
function demuxbyname() {
local CMD="java $EA $EOOM $z $z2 -cp $CP jgi.DemuxByName2 $@"
echo $CMD >&2
eval $CMD
}
demuxbyname "$@"