Streaming algorithm for computing k-mer statistics for massive genomics datasets.
To compile, just type make.
To see the usage, just type KmerStream:
KmerStream 1.1
Estimates occurrences of k-mers in fastq or fasta files and saves results
Usage: KmerStream [options] ... FASTQ files
-k, --kmer-size=INT Size of k-mers, either a single value or comma separated list
-q, --quality-cutoff=INT Comma separated list, keep k-mers with bases above quality threshold in PHRED (default 0)
-o, --output=STRING Filename for output
-e, --error-rate=FLOAT Error rate guaranteed (default value 0.01)
-t, --threads=INT Number of threads to use (default value 1)
-s, --seed=INT Seed value for the randomness (default value 0, use time based randomness)
-b, --bam Input is in BAM format (default false)
--binary Output is written in binary format (default false)
--tsv Output is written in TSV format (default false)
--verbose Print lots of messages during run
--online Prints out estimates every 100K reads
--q64 set if PHRED+64 scores are used (@...h) default used PHRED+33
Options:
-k: the k-mer size. This should be an integer or a comma-separated list of integers, e.g. -k 31 or -k 31,47,63. Odd values behave better than even values.
-q: optional quality cutoff values; all k-mers with bases under the q threshold are discarded.
-o: filename where the output should be written.
-e: guarantee on the error of the estimator used. The default value is 1%; lower values increase memory usage.
-t: number of threads to use.
-s: KmerStream uses random hash functions for computing the statistics. To fix the hash values for reproducibility, set the seed to a fixed value, e.g. -s 42.
-b: input is in BAM format.
--binary: write output in binary format. This includes the data necessary for running KmerStreamJoin. The output filename is used as a prefix, and the file containing the output is named PREFIX + _Q_0_k_31.
--tsv: write output in TSV (tab-separated values) format for easier parsing.
--online: print estimates every 100K reads; see https://pmelsted.wordpress.com/2014/07/12/analyzing-data-while-downloading/ for example usage.
--q64: quality values are encoded in PHRED+64 format rather than the default PHRED+33. Use this if your quality values range from @ to h rather than from ! to I.
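As a rough illustration of what the -q and --q64 options mean, the following Python sketch decodes a FASTQ quality string and keeps only the k-mers whose bases all reach the cutoff. It is not KmerStream's implementation: the function names are hypothetical, and treating a base as passing when its score is greater than or equal to the cutoff is an assumption.

```python
def phred_scores(quality, q64=False):
    """Decode a FASTQ quality string into integer PHRED scores.

    PHRED+33 maps '!' to 0 and 'I' to 40; PHRED+64 maps '@' to 0 and 'h' to 40.
    """
    offset = 64 if q64 else 33
    return [ord(c) - offset for c in quality]


def kmers_passing_cutoff(seq, quality, k, cutoff=0, q64=False):
    """Yield every k-mer of seq in which all bases score >= cutoff.

    The >= comparison is an assumption; the README only says that k-mers
    with bases under the threshold are discarded.
    """
    scores = phred_scores(quality, q64)
    for i in range(len(seq) - k + 1):
        if min(scores[i:i + k]) >= cutoff:
            yield seq[i:i + k]
```

For example, with the read ACGTA and quality string IIII! (PHRED+33), a cutoff of 20 and k = 3 keeps ACG and CGT and drops GTA, because the last base has score 0.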
KmerStreamJoin 1.1
Creates union of many stream estimates
Usage: KmerStreamJoin -o output files ...
KmerStreamJoin merged-file
-o, --output=STRING Filename for output
--verbose Print output at the end
KmerStreamJoin, when run with the -o option, takes a list of KmerStream binary output files (created with the --binary option to KmerStream) and creates a single binary output file that is equivalent to the output of a single KmerStream run on all of the files. When the -o option is missing, it outputs the KmerStream result of the binary input file.
This utility is useful when the creation of the binary files is distributed or when the files are computed incrementally.
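A distributed run could be scripted along the following lines. This Python sketch only builds the command lines, using the flags listed in the usage text; the helper names are hypothetical. Two points are assumptions: the binary file name pattern PREFIX_Q_<q>_k_<k> is generalised from the single example PREFIX + _Q_0_k_31, and passing the same -s seed to every run is a precaution, since sketches built with different random hash functions are generally not mergeable and it is not stated here whether KmerStreamJoin requires it.

```python
def kmerstream_cmd(fastq, prefix, k=31, seed=42):
    """Build the argument list for one KmerStream run with binary output."""
    return ["KmerStream", "-k", str(k), "-s", str(seed),
            "--binary", "-o", prefix, fastq]


def binary_output_name(prefix, k=31, q=0):
    """Guess the binary file name (assumed pattern, see the note above)."""
    return f"{prefix}_Q_{q}_k_{k}"


def join_cmd(merged, binary_files):
    """Build the argument list that merges binary files with KmerStreamJoin."""
    return ["KmerStreamJoin", "-o", merged, *binary_files]


def plan(fastq_files, merged="merged"):
    """Return one KmerStream command per input file plus the final join."""
    prefixes = [f"part{i}" for i in range(len(fastq_files))]
    runs = [kmerstream_cmd(f, p) for f, p in zip(fastq_files, prefixes)]
    join = join_cmd(merged, [binary_output_name(p) for p in prefixes])
    return runs, join
```

Each argument list can be handed to a job scheduler or to subprocess.run; running KmerStreamJoin on the merged file without -o then prints the combined result.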
KmerStreamEstimate is a Python script that reads a TSV file as input (generated using --tsv) and estimates the genome size (G), error rate (e), and coverage (lambda).