Hex clusters Discworld's stories.
A clustering and search tool applied to the plots of Discworld novels. Currently, given an input sentence, it finds the most similar passages of Discworld books based on their plot summaries from Wikipedia.
This is just a tiny proof-of-concept of using FAISS with transformer language models that could be easily extended to cover much larger datasets.
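To make the search step concrete, here is a minimal, dependency-free sketch of brute-force cosine-similarity top-k search. It is not the project's code: the `embed` function is a toy bag-of-words stand-in for a transformer sentence embedding, the linear scan stands in for a FAISS index lookup, and the passages are made up for illustration.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a transformer embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, passages: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every passage against the query and return the k best matches."""
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical passages, not taken from the real dataset.
passages = [
    "a wizard travels across the disc with a tourist",
    "the city watch investigates a dragon",
    "witches protect a small kingdom",
]
results = search("a dragon attacks the city", passages, k=2)
for score, passage in results:
    print(f"{score:.2f}  {passage}")
```

A scan like this visits every passage for every query, which is fine for a handful of plot summaries; a dedicated index is what lets the same idea scale to much larger datasets.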
Should work out of the box with bash and a couple of prerequisites:
( cd conda && source bootstrap.sh )
conda activate discworld-hex
poetry install
TL;DR (when poetry is installed and the discworld-hex conda env is activated):
build
search
To only fetch data and build and export the index:
build
# is just a shortcut for:
poetry run build
To use the index to search:
search
# is just a shortcut for:
poetry run search
To run any python script in this project:
poetry run python src/discworld_hex/any_file.py
To run all checks:
poetry run pre-commit
(What the user would notice.)
- Allow custom wikipedia queries on the input (and thus custom libraries)
- Fine-tune (e.g., standard (masked) language modelling) on the specific subdomains
- Aggregate search results per-book
- Allow merging libraries
- Better CLI: allow changing k, passing in multiple sentences, etc.
- Support other (faster, less accurate) indexes
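As a rough idea of what the per-book aggregation item above could look like, the sketch below groups passage-level hits by book and ranks books by their best hit. Everything here is hypothetical: the function name `aggregate_per_book`, the `(book, score)` hit format, and the scores are invented for illustration, and it assumes that a higher score means a closer match (for a distance-based index the ordering would be reversed).

```python
from __future__ import annotations


def aggregate_per_book(
    hits: list[tuple[str, float]], top_n: int = 3
) -> list[tuple[str, float, int]]:
    """Collapse passage-level hits into (book, best_score, hit_count) rows."""
    best: dict[str, float] = {}
    counts: dict[str, int] = {}
    for book, score in hits:
        # Keep the highest score seen for each book.
        if book not in best or score > best[book]:
            best[book] = score
        counts[book] = counts.get(book, 0) + 1
    # Rank by best score, breaking ties by the number of matching passages.
    ranked = sorted(best, key=lambda b: (best[b], counts[b]), reverse=True)
    return [(b, best[b], counts[b]) for b in ranked[:top_n]]


# Made-up similarity scores, assuming higher means more similar.
hits = [
    ("Guards! Guards!", 0.73),
    ("The Colour of Magic", 0.40),
    ("Guards! Guards!", 0.55),
    ("Wyrd Sisters", 0.20),
]
summary = aggregate_per_book(hits, top_n=2)
for book, score, n_hits in summary:
    print(f"{book}: best={score:.2f}, hits={n_hits}")
```

Ranking by the best hit is only one option; averaging or summing scores per book would favour books with many moderately similar passages instead.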
(What the user shouldn't notice.)
- Less redundant library serialization
- More tests
- Rebuilding Library and the FAISS index