osallou edited this page Sep 20, 2011 · 6 revisions

What is it

Cassiopee is a Ruby module that searches for a string inside another string (or file), either as an exact match or within an allowed distance. An index can optionally be saved for later searches.
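To make the two search modes concrete, here is a minimal plain-Ruby sketch. It does not use the Cassiopee API: `hamming_distance` and `naive_search` are hypothetical helpers, and the sketch assumes the allowed distance is a Hamming distance (substitutions only).

```ruby
# Hypothetical helpers for illustration; not part of the Cassiopee API.

# Number of positions at which two equal-length strings differ.
def hamming_distance(a, b)
  raise ArgumentError, 'strings must have equal length' unless a.length == b.length
  a.chars.zip(b.chars).count { |x, y| x != y }
end

# Start positions where pattern occurs in text with at most
# max_errors substitutions (0 means exact match).
def naive_search(text, pattern, max_errors = 0)
  last_start = text.length - pattern.length
  return [] if last_start < 0
  (0..last_start).select do |pos|
    hamming_distance(text[pos, pattern.length], pattern) <= max_errors
  end
end

exact  = naive_search('acgtacgaacgt', 'acgt')    # [0, 8]
approx = naive_search('acgtacgaacgt', 'acgt', 1) # [0, 4, 8]
```

With one error allowed, the window `acga` at position 4 is also reported, because it differs from `acgt` by a single substitution.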

Exact and approximate searches are used in many fields, including bioinformatics, where they help find patterns in DNA/RNA sequences. The software works on both small and large sequences.

The code is open source.

Install

The gem is available on RubyGems.org. It can also be built from cassiopee.gemspec.

Parsing methods

Two parsing methods are available: DIRECT (the default) and SUFFIX. DIRECT parses the string at every position and returns the results. SUFFIX saves all the suffixes it finds, then checks for a match within them. SUFFIX is RAM intensive for large sequences, but it speeds up the process when several searches are made in the same context because it avoids reparsing the whole sequence.
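The trade-off between the two methods can be sketched in plain Ruby. This is an illustration under assumptions, not the gem's implementation: `direct_search`, `build_suffix_array`, and `suffix_search` are hypothetical names, and a sorted suffix array with binary search stands in for whatever structure SUFFIX actually stores.

```ruby
# Hypothetical sketch; the gem's real data structures may differ.

# DIRECT style: scan every position of the text on each search.
def direct_search(text, pattern)
  (0..text.length - pattern.length).select do |pos|
    text[pos, pattern.length] == pattern
  end
end

# SUFFIX style, step 1: build the index once.
# Returns the start positions of all suffixes, sorted by suffix text.
def build_suffix_array(text)
  (0...text.length).sort_by { |pos| text[pos..-1] }
end

# SUFFIX style, step 2: reuse the index for each search.
# Binary-search the first suffix >= pattern, then collect the
# contiguous run of suffixes that start with the pattern.
def suffix_search(text, suffix_array, pattern)
  first = suffix_array.bsearch_index { |pos| text[pos..-1] >= pattern }
  return [] if first.nil?
  suffix_array[first..-1]
    .take_while { |pos| text[pos..-1].start_with?(pattern) }
    .sort
end

text  = 'sallou sallu'
index = build_suffix_array(text)          # built once, costs memory
direct_search(text, 'all')                # [1, 8]
suffix_search(text, index, 'all')         # [1, 8]
suffix_search(text, index, 'lou')         # [3], no rescan of the text
```

Building the index is the expensive part, which is why this style only pays off when several searches run against the same sequence.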

Position filter

It is possible to define a filter on the start position of matches, which limits them to a window in the indexed string. Setting max to 0 means no maximum. If store is not used, the filter also speeds up the search.
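As a rough illustration of the window semantics, the sketch below filters a list of start positions. `filter_by_position` is a hypothetical helper, and treating both bounds as inclusive is an assumption; only the "0 means no max" rule comes from the description above.

```ruby
# Hypothetical helper; bounds are assumed to be inclusive.
# A max of 0 disables the upper bound.
def filter_by_position(positions, min, max)
  positions.select { |pos| pos >= min && (max.zero? || pos <= max) }
end

matches = [0, 4, 8, 15]
filter_by_position(matches, 2, 10) # [4, 8]
filter_by_position(matches, 2, 0)  # [4, 8, 15]
```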

Comments

Comments is an array of line-start characters. Lines starting with one of those characters are skipped and not indexed.
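The skipping rule can be sketched as follows. `indexable_lines` is a hypothetical helper, and joining the remaining lines into one sequence is an assumption made for the example, not documented behaviour.

```ruby
# Hypothetical helper: drop lines that start with a comment character.
def indexable_lines(text, comments)
  text.each_line.reject do |line|
    comments.any? { |char| line.start_with?(char) }
  end
end

fasta = ">seq1 description\nacgt\n#note\nttga\n"
# Assumption: remaining lines are concatenated before indexing.
sequence = indexable_lines(fasta, ['>', '#']).map(&:chomp).join
# sequence is "acgtttga"
```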

Optimal method

The optimal methods (length or cost) remove some matches from the final result. This is a post-processing step.

For length, only the longest match is kept for a given start position.

For cost, only the lowest-cost match (Hamming or Levenshtein) is kept for a given start position.
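A minimal sketch of this post-processing step, assuming each match carries a start position, a length, and a cost. The `Hit` struct and the `optimal` helper are hypothetical and do not reflect the gem's internal match representation.

```ruby
# Hypothetical match record: start position, match length, edit cost.
Hit = Struct.new(:position, :length, :cost)

# Keep one hit per start position: the longest one for :length,
# the cheapest one for :cost.
def optimal(hits, method)
  hits.group_by(&:position).map do |_position, group|
    case method
    when :length then group.max_by(&:length)
    when :cost   then group.min_by(&:cost)
    else raise ArgumentError, "unknown method: #{method}"
    end
  end
end

hits = [Hit.new(3, 4, 1), Hit.new(3, 5, 2), Hit.new(9, 4, 0)]
optimal(hits, :length).map(&:to_a) # [[3, 5, 2], [9, 4, 0]]
optimal(hits, :cost).map(&:to_a)   # [[3, 4, 1], [9, 4, 0]]
```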

Alphabet ambiguity support

It is possible to define an alphabet ambiguity, that is, to associate multiple character values with a single one. This is common in bioinformatics, for DNA sequences for example.

Such a file is like:

b=c,g,t

r=a,g

When the file is loaded with loadAmbiguityFile, the useAmbiguity variable is set, and searches (exact or ambiguous) use this alphabet transformation. This affects performance, mainly for exact search.
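The file format and its effect on matching can be sketched in plain Ruby. This does not call loadAmbiguityFile: `parse_ambiguity`, `symbols_match?`, and `ambiguous_match?` are hypothetical helpers, and applying the ambiguity codes to the pattern side only is an assumption.

```ruby
# Hypothetical parser for lines such as "b=c,g,t".
# Blank lines and lines without "=" are ignored.
def parse_ambiguity(content)
  content.each_line.with_object({}) do |line, table|
    line = line.strip
    next if line.empty? || !line.include?('=')
    symbol, values = line.split('=', 2)
    table[symbol] = values.split(',')
  end
end

# A pattern character matches a text character if they are equal
# or if the text character is one of the pattern symbol's values.
def symbols_match?(pattern_char, text_char, table)
  pattern_char == text_char || table.fetch(pattern_char, []).include?(text_char)
end

def ambiguous_match?(pattern, candidate, table)
  pattern.length == candidate.length &&
    pattern.chars.zip(candidate.chars).all? { |p, c| symbols_match?(p, c, table) }
end

table = parse_ambiguity("b=c,g,t\n\nr=a,g\n")
# table is {"b"=>["c", "g", "t"], "r"=>["a", "g"]}
ambiguous_match?('arg', 'aag', table) # true, r covers a
ambiguous_match?('arg', 'acg', table) # false, r does not cover c
```

Each character comparison now involves a table lookup, which is consistent with the performance cost noted above.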

Cache

The CrawlerCache class adds (very) basic cache management. If useCache is set in Crawler, the result is saved in a file. If the next request is identical or falls within the same scope (positions, errors), the cached results (or a subset of them) are returned instead of reparsing.
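The idea of a file-backed result cache can be sketched as below. This is not the CrawlerCache implementation: `SimpleResultCache` is a hypothetical class, the JSON file format is an assumption, and the sketch only handles identical requests, not the "same scope" subset case.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical file-backed cache that remembers the last request only.
class SimpleResultCache
  def initialize(path)
    @path = path
  end

  # Returns cached results if the request is identical to the stored
  # one; otherwise runs the block and stores its result in the file.
  def fetch(request)
    cached = read_cache
    return cached['results'] if cached && cached['request'] == request
    results = yield
    File.write(@path, JSON.generate('request' => request, 'results' => results))
    results
  end

  private

  def read_cache
    File.exist?(@path) ? JSON.parse(File.read(@path)) : nil
  end
end

cache   = SimpleResultCache.new(File.join(Dir.mktmpdir, 'results.json'))
request = { 'pattern' => 'acgt', 'errors' => 0 }
calls   = 0
first   = cache.fetch(request) { calls += 1; [0, 8] }
second  = cache.fetch(request) { calls += 1; [0, 8] }
# calls is 1: the second fetch was served from the file
```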
