A toolset for fast and space efficient DNA read set matching and assembly using a simple kmer sampling approach.
Seeding sequence alignment with exact kmer matches is a key component of many bioinformatics methods for DNA sequence matching and data set analysis. Tools exist for rapidly enumerating all kmers in a read set, but representing them all in memory is costly. A natural structure for rapid access is a hash table, but that can take even more space. Recently minhash and minimizer approaches have been introduced, which save space and increase speed by using only a subset of kmers. In these methods kmers are retained based on whether their hash values are low with respect to others in the set (minhash), or minimal with respect to others in a sequence window (minimizer). Here we retain kmers whose hash value is 0 modulo d (an inverse density), and call them modimizers. This is a simple process which controls the density of kmers directly: roughly 1 in d kmers is kept. It also ensures that a kmer which passes the test is kept wherever it occurs, independent of context, so a sampled kmer that is present in two sequences is guaranteed to match.
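The selection rule above can be sketched in a few lines of Python, for illustration only. The hash (a splitmix64-style integer mixer), the 2-bit packing, and the names mix64, pack_kmer and modimizers are assumptions made here. The package itself is written in C, and its actual hash function and any strand handling are not described in this text.

```python
MASK64 = (1 << 64) - 1
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def mix64(x):
    """Illustrative 64-bit mixer (splitmix64 step); NOT the package's hash."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    x ^= x >> 31
    return x


def pack_kmer(kmer):
    """Pack an ACGT string into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def modimizers(seq, k, d):
    """Return (position, kmer) for every kmer whose hash is 0 modulo d.

    Forward strand only; kmers containing a non-ACGT base are skipped.
    """
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CODE for base in kmer):
            continue
        if mix64(pack_kmer(kmer)) % d == 0:
            selected.append((i, kmer))
    return selected
```

Because the test looks only at the kmer itself, a kmer selected in one sequence is selected in every other sequence that contains it, whatever the flanking bases are.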
Modimizers are good for sampling kmer frequency distributions, and as sketches for long reads, large read sets or genomes. They can lose sensitivity as sources of seeds for short reads, because there is then a finite chance that no kmer in the read is a modimizer, whereas the minimizer approach guarantees at least one kmer every w bases. However, the choice of which kmer to use as a minimizer depends on nearby bases, and so is affected by errors in them, which reduces sketch matching for sequences with high error rates of the type currently obtained from PacBio and Oxford Nanopore.
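To get a feel for the short read sensitivity issue, the chance that a read contains no modimizer at all can be estimated with a small calculation. This is a rough sketch that assumes each kmer is selected independently with probability 1/d (an idealised hash, ignoring repeated kmers within the read). The function name and the parameter values used below are illustrative, not package defaults.

```python
def prob_no_modimizer(read_len, k, d):
    """Estimate the probability that a read of read_len bases yields no modimizer.

    Assumes each of the read_len - k + 1 kmers independently passes the
    'hash is 0 modulo d' test with probability 1/d.
    """
    n_kmers = read_len - k + 1
    if n_kmers <= 0:
        return 1.0  # read too short to contain any kmer
    return (1.0 - 1.0 / d) ** n_kmers


if __name__ == "__main__":
    for read_len in (100, 150, 250, 10000):
        print(read_len, prob_no_modimizer(read_len, k=19, d=32))
```

Under these assumptions, with k = 19 and d = 32, a 150 base read has roughly a 1.5% chance of carrying no seed, while for a 10 kb read the chance is negligible.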
The modimizer package also explores another idea to improve specificity. This is to make strong use of kmers that are believed to be single copy in the underlying genome, for example using them exclusively as seeds for matches. We can identify the single copy kmers from a reference (modmap) or from shotgun read sets. One idea is to identify likely single copy kmers from an (accurate short read) Illumina data set (using modutils), then use them to assemble a long read set (modasm). I would also like to rewrite the 10X Genomics processing methods in hash10x using modimizers (mod10x).
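As a rough illustration of the single copy idea in the reference case, the sketch below counts sampled kmers from a reference, keeps those seen exactly once, and then uses only those as seeds for a read. The function names and input shapes are assumptions made for this example and do not reflect how modmap or modasm are implemented. Identifying single copy kmers from a read set would need a depth-based criterion instead, which this sketch does not attempt.

```python
from collections import Counter


def single_copy_kmers(reference_kmers):
    """Return the set of kmers that occur exactly once in the reference sample."""
    counts = Counter(reference_kmers)
    return {kmer for kmer, n in counts.items() if n == 1}


def seed_positions(read_kmers, single_copy):
    """Keep only (position, kmer) pairs whose kmer is single copy.

    read_kmers is an iterable of (position, kmer) pairs sampled from a read.
    """
    return [(pos, kmer) for pos, kmer in read_kmers if kmer in single_copy]


if __name__ == "__main__":
    reference = ["ACGTA", "CGTAC", "ACGTA", "GGTCA"]  # toy sampled kmers
    unique = single_copy_kmers(reference)
    print(seed_positions([(0, "ACGTA"), (7, "GGTCA")], unique))
```

Restricting seeds in this way drops repeated kmers such as ACGTA in the toy input, so every remaining seed points to a single place in the reference.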
This repository is built on a slightly revised version of my standard C framework (utils, array, hash, dict, seqio), all contained in the package along with a couple of utilities such as composition to analyse sequence file composition.