Skip to content
PragashSiva edited this page Mar 2, 2016 · 3 revisions

BAMQL filters SAM or BAM files using a simple query language that is more expressive than the view options available in samtools.

The language consists of logical connectives, specifically and, or, not, and a ternary if, and predicates that match properties of the read. Expressions may be grouped using parentheses.

 

LOGICAL CONNECTIVES

The logical connectives are presented in order from lowest precedence to highest precedence.

 

TERNARY IF

cond then then_expression else else_expression

If cond is satisfied, this expression will only be satisfied if then_expression is satisfied. If cond is not satisfied, then this expression will only be satisfied if else_expression is satisfied.

 

DISJUNCTION (OR)

expr | expr

Is satisfied if at least one operand is satisfied.

 

EXCLUSIVE DISJUNCTION (XOR)

expr ^ expr

Is satisfied if exactly one operand is satisfied, but not both.

 

CONJUNCTION (AND)

expr & expr

Is satisfied only if both operands are satisfied.

 

NEGATION (NOT)

! expr

Is satisfied if expr is not satisfied.

 

PREDICATES

The predicates are grouped based on function.

 

BAM FLAGS

These predicates match the sequences in the BAM file. For more details, see sam(5).

paired?

The read is paired in sequencing.

proper_pair?

The read is mapped in a proper pair; that is, the aligner thought the insert size implied by the paired-end reads seemed correct.

unmapped?

The query sequence itself is unmapped.

mate_unmapped?

The mate is unmapped.

mapped_to_reverse?

The read is mapped to the reverse strand.

mate_mapped_to_reverse?

The read's mate is mapped to the reverse strand.

raw_flag(number)

If numeric BAM flags have been burned into your brain, you can check them directly by specfiying number.

read1?

The read is the first read in a pair.

read2?

The read is the second read in a pair.

secondary?

The alignment is not primary.

failed_qc?

The read fails platform/vendor quality checks.

duplicate?

The read is either a PCR or an optical duplicate. That is, another read with the same sequence occurs at exactly the same position in the reference genome.

supplementary?

The alignment is supplementary.

 

MAPPING INFORMATION

chr(glob)

Matches the chromosome name to which the read is mapped. Chromosome names should not start with chr, as that will be automatically checked. Moreover, some human chromosome have differing names, so both are checked. The known rules are:

X == 23

Y == 24

MT == M == 25

Also, case is ignored. Additionally, the chromosome is matched using wildcards from glob(7).

mapping_quality(probability)

Matches the read if the probability of error is less than probability. The mapping quality is approximated in the SAM format, so this will be imperfect. For clarity, setting the probability to 0 will be so stringent as to reject all reads, while setting it to 1 will be so liberal as to accept all reads.

mate_chr(glob)

This works identically to chr, but on the chromosome of the mate pair, if one exists. If the mate is unmapped, this returns false.

split_pair?

Checks if both the reads in a mate pair are mapped, but to different chromosomes.

 

OTHER READ INFORMATION

read_group(glob)

Matches the read group, if specified in the input. The read group may be specified using glob(7)

to match multiple read groups.

aux_char(code, value)

aux_dbl(code, value)

aux_int(code, value)

aux_str(code, value)

Matches a piece of auxiliary data, if specified in the input. The code is the two symbol identifier for the auxilary format. The value must be a single character, float point number, integral numer, or glob(7)

for aux_char, aux_dbl, aux_int, and aux_str, respectively.

 

POSITION

All of the position operations are inclusive: that means they take any reads with nucleotides in the desired range. This means that the start or end of a read can extend beyond the desired positions. BAM files allow reads to have position information while still being marked as unmapped. This operations ignore the official mapping status, and work solely on the position information. If this is undesirable, combine with & !unmapped?. Occasionally, the aligner produces reads which have a position, but no detailed mapping information (i.e., no CIGAR string). In this case, the end position of the read is assumed to be mapped with no insertions or deletions.

after(position)

Matches all sequences that cover the specified position or any higher position (more q-ward on the chromosome).

before(position)

Matches all sequences that cover the specified position or any lower position (more p-ward on the chromosome).

position(start, end)

Matches all sequences that cover the range of position from start to end.

 

SEQUENCE

nt(position, n)

Matches a read has nucleotide n at the provided position, relative to the chromosome. The nucleotide can be any IUPAC-style base (ACGTU, KMYR, BDHV, and N). The match is degenerate; that is, if the nucleotide specified is N, any base will match. It will reject unmapped reads and reads which do not contain the required position.

nt_exact(position, n)

Matches a read has nucleotide n at the provided position, relative to the chromosome. The nucleotide can be any IUPAC-style base (ACGTU, KMYR, BDHV, and N). The match is exact; that is, if the nucleotide specified is N, the base in the read must be N too. It will reject unmapped reads and reads which do not contain the required position.

 

MISCELLANEOUS

header ~ /regex/

Matches a PERL-compatible regular expression agains the read's header. For details, see pcrepattern(3).

true

Always satisfied.

false

Never satisfied.

random(probability)

This chooses a uniform pseudo-random variable and is satisfied with frequency probability. This can be used to provide a random sub-sample of reads. The probability must be between 0 and 1 and can be specified using scientific notation. The random number chosen is selected using drand48(3)

if one is inclined to care about such things.

 

EXAMPLES

Match sequences on chromosome 7 which are from the read group labelled RUN3:

chr(7) & read_group(RUN3)

Sub-sample mitochondrial reads and all the reads that have matched to chromosomes starting with ug.

chr(M) & random(0.2) | chr(ug*)