POC of ColBERT search, compared with vanilla DPR.
- Assumes you have Cassandra (vsearch branch) running locally. Should "just work" with Astra given minor changes to db.py.
- Dataset with DPR ada002 embeddings already computed, this code does not do that (but adding it would just be a few lines)
- Download the ColBERT model from https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz and extract it to the checkpoints/ subdirectory
- cqlsh < create.cqlsh
2a. hack up compute-and-load.py to load your chunks. currently it expects json files that look like this: { 'title': $title, '0': { 'content': $raw_text, 'embedding': $ada002_embedding }, '1': { 'content': $raw_text, 'embedding': $ada002_embedding }, ... }
If you don't have pre-chunked documents, or you don't have or don't want to save a single dense embedding for comparison, then adjust it accordingly.
2b. alternatively, hack up compute.py and load.py instead. compute
computes the colbert embeddings and augments the json file with them, and load
sends those to Cassandra. I did this because I wanted to compute the embeddings on a fast gpu machine.
python serve_httpy.py
and navigate to http://localhost:5000