A lightweight and high-performance (see seqkit benchmark) bioinformatics package.
Its performance is close to that of the well-known C library kseq.h.
To test the performance, three datasets are used:
- dataset_A, bacteria genomes, 2.7G
- dataset_B, human genome, 2.9G
- dataset_C, Illumina reads, 2.2G
Summary by seqkit:
file          seq_format  seq_type  num_seqs   min_len  avg_len       max_len
dataset_A.fa  FASTA       DNA       67,748     56       41,442.5      5,976,145
dataset_B.fa  FASTA       DNA       194        970      15,978,096.5  248,956,422
dataset_C.fq  FASTQ       DNA       9,186,045  100      100           100
seqtk (Version 1.1-r92-dirty, using kseq.h) and seqkit (Version v0.3.1.1, using this package) were used to test.
Note that seqtk does not support wrapped (fixed line width) output, so seqkit uses -w 0 to disable output wrapping.
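To make the wrapping option concrete, here is a minimal sketch of fixed-width line wrapping in which a width of 0 disables wrapping, mirroring the meaning of -w 0 described above. The function name `wrap` is hypothetical, and the code is an illustration rather than seqkit's actual implementation.

```go
package main

import (
	"bytes"
	"fmt"
)

// wrap breaks seq into lines of at most width bytes.
// A width of 0 (or less) disables wrapping and returns seq unchanged.
func wrap(seq []byte, width int) []byte {
	if width <= 0 || len(seq) <= width {
		return seq
	}
	var buf bytes.Buffer
	for i := 0; i < len(seq); i += width {
		end := i + width
		if end > len(seq) {
			end = len(seq)
		}
		if i > 0 {
			buf.WriteByte('\n') // separate lines, no trailing newline
		}
		buf.Write(seq[i:end])
	}
	return buf.Bytes()
}

func main() {
	seq := []byte("ACGTACGTAC")
	fmt.Printf("%s\n", wrap(seq, 4)) // wrapped at 4 bases per line
	fmt.Printf("%s\n", wrap(seq, 0)) // wrapping disabled: a single line
}
```

Skipping the wrapping step avoids the extra copy into the buffer, which is one reason unwrapped output keeps the comparison between the two tools fair.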
The script memusg is used to assess running time and peak memory usage.
Tests were repeated 5 times and average time and memory usage were computed.
Results:
This package is "go-gettable"; just run:
go get -u github.com/shenwei356/bio
See the README of each subpackage.
See documentation on godoc for more detail.
Copyright (c) 2013-2016, Wei Shen (shenwei356@gmail.com)