A leaner, faster backend for Whoosh

Whoosh is great, but it was a bit too slow for my purposes, which require fast access to postings on moderately large data (~200GB). Rather than jump ship to Lucene or another IR framework, I built this backend for Whoosh. In addition to providing fast postings access, it ended up being quite a bit faster than Whoosh's default backend.

Benefits of this backend

over 50% reduction in indexing time
roughly 50-60% reduction in query time (48% to 61% in the benchmarks below)
substantial reduction in index size
supports the default BM25 scoring, and should play nice with much of Whoosh
codebase is small and should be easy to understand/modify
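
To make "BM25 scoring" on a disjunctive (OR) query concrete, here is a minimal sketch of the textbook Okapi BM25 formula applied to in-memory postings. It is not this backend's code and not necessarily Whoosh's exact formula; the function name, the toy postings layout (term -> {doc id: term frequency}), and the K1 and B values are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

# Assumed tuning constants (commonly used BM25 values).
K1 = 1.2
B = 0.75

def bm25_or_scores(query_terms, postings, doc_lengths):
    """Score every document that contains at least one query term (OR semantics)."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query_terms:
        term_postings = postings.get(term, {})
        if not term_postings:
            continue
        df = len(term_postings)  # number of documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in term_postings.items():
            # Length normalisation: longer-than-average documents are penalised.
            norm = 1 - B + B * doc_lengths[doc_id] / avg_len
            scores[doc_id] += idf * tf * (K1 + 1) / (tf + K1 * norm)
    return dict(scores)

# Hypothetical toy index: term -> {doc_id: term frequency}.
postings = {
    "union": {0: 2, 2: 1},
    "liberty": {1: 1, 2: 3},
}
doc_lengths = {0: 10, 1: 5, 2: 12}

scores = bm25_or_scores(["union", "liberty"], postings, doc_lengths)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the query is an OR, each document's score is a sum over the query terms it contains, so every posting of every query term has to be visited. This is why fast postings access matters for query time.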

Limitations of this backend

very lean: only supports a static index. Documents can be added but not deleted, and there is no support for segments, so adding even a single document rewrites the entire index (i.e., you should add documents in bulk).
no block quality (TODO)
bare minimum testing (in the notebook)
the reduction in index size, and a small part of the speed boost, come from using this and this via a Cython wrapper for postings compression. It is very fast, but requires a recent Intel processor (e.g., Haswell). You may need to recompile the Cython extension for your environment (run python setup.py build_ext --inplace in the streamvbyte directory to build a compatible .so file). Without it, the backend falls back on pickle, which is not nearly as good (but still faster than default Whoosh).
built in Python 3.5 with no eye for backward compatibility, and will not work with Python 2 without modification
uses a lot of memory: all stored data is held in memory, and entire postings lists are read into memory, so this backend uses much more memory than Whoosh (fixing this is a TODO)
lots of Whoosh features are not supported (e.g., term vectors, "unique" properties in the schema, etc.)
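
To illustrate why integer compression shrinks postings compared with pickle, here is a minimal pure-Python sketch of delta encoding followed by classic variable-byte (VByte) encoding. This is a simpler relative of the compression used here, not the same format, and it is far slower than a SIMD implementation. The function names and sample doc ids are assumptions made for the example.

```python
import pickle

def encode_postings(doc_ids):
    """Delta-encode sorted doc ids, then write each gap as 1+ bytes (7 bits each)."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # Low 7 bits first; a set high bit means "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data):
    """Inverse of encode_postings: rebuild gaps, then prefix-sum back to doc ids."""
    doc_ids = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap = 0
            shift = 0
    return doc_ids

# Hypothetical sorted doc ids for one term.
sample_doc_ids = [3, 7, 8, 150, 151, 100000]
packed = encode_postings(sample_doc_ids)
pickled = pickle.dumps(sample_doc_ids)
print(len(packed), "bytes packed vs", len(pickled), "bytes pickled")
```

Sorted doc ids have small gaps, and small gaps fit in one byte each, which is where the space saving comes from.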

Todo

add block quality and stop storing all postings in memory
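
One possible direction for the memory part of this TODO, sketched under assumptions and not taken from this codebase: keep only a small term -> (offset, length) table in memory and read each term's encoded postings from disk on demand. The file layout and the function names below are hypothetical.

```python
import os
import tempfile

def write_postings(path, postings_by_term):
    """Write each term's encoded postings back to back; return term -> (offset, length)."""
    offsets = {}
    with open(path, "wb") as f:
        for term, blob in postings_by_term.items():
            offsets[term] = (f.tell(), len(blob))
            f.write(blob)
    return offsets

def read_postings(path, offsets, term):
    """Read only the bytes belonging to one term instead of loading the whole file."""
    offset, length = offsets[term]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postings.bin")
    # The byte strings stand in for already-compressed postings lists.
    offsets = write_postings(path, {"lincoln": b"\x01\x02\x03", "whoosh": b"\x04\x05"})
    whoosh_blob = read_postings(path, offsets, "whoosh")
```

With this layout, memory use grows with the number of terms rather than with the total size of the postings.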

The IPython notebook (PyIndex.ipynb) contains the benchmark calculations and shows how to use this backend with Whoosh.

Benchmarks

Datasets used are text collections from this site.

TCP-ECCO (170mb uncompressed) can be downloaded here
Lincoln (700kb uncompressed) can be downloaded here

Index time

Dataset	Whoosh	Swhoosh	Time saved
Lincoln	~1.03s	~0.32s	69%
TCP-ECCO (single process)	~175.1s	~66.6s	62%
TCP-ECCO (multi process)	~147.7s	~27.7s	81%

Index Size

Dataset	Whoosh	Swhoosh	Space saved
Lincoln	1.5mb	700kb	53%
TCP-ECCO	170mb	102mb	40%

Query Time

All queries are disjunctive (OR) queries on TCP-ECCO, scored with the default BM25 scoring.

Query length	Whoosh	Swhoosh	Time saved
3 words	9.07 ms	3.83 ms	58%
6 words	14.36 ms	5.54 ms	61%
30 words	92.54 ms	48.19 ms	48%
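
The last column of each benchmark table appears to be the relative reduction, 1 - new / old, rounded to a whole percent. The short sketch below recomputes it from the figures in the tables. The helper name is an assumption, and 1.5mb is taken as 1500kb.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` versus `baseline`, as a whole percent."""
    return round((1 - improved / baseline) * 100)

# (Whoosh, Swhoosh) pairs copied from the tables above.
index_time_s = {
    "Lincoln": (1.03, 0.32),
    "TCP-ECCO (single process)": (175.1, 66.6),
    "TCP-ECCO (multi process)": (147.7, 27.7),
}
index_size_kb = {
    "Lincoln": (1500, 700),
    "TCP-ECCO": (170_000, 102_000),
}
query_time_ms = {
    "3 words": (9.07, 3.83),
    "6 words": (14.36, 5.54),
    "30 words": (92.54, 48.19),
}

index_time_pct = {k: percent_reduction(*v) for k, v in index_time_s.items()}
index_size_pct = {k: percent_reduction(*v) for k, v in index_size_kb.items()}
query_time_pct = {k: percent_reduction(*v) for k, v in query_time_ms.items()}
```

Read as a ratio instead, a 69% reduction in time on Lincoln corresponds to roughly a 3.2x speedup (1.03 / 0.32).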
