This project aims to implement a minimal database that can be used as an embedded search engine.
>>> from konlsearch.search import KonlSearch
>>> from konlsearch.inverted_index import TokenSearchMode
>>> ks = KonlSearch("./test-db")
>>> index = ks.index("animation_title") # similar to creating a DB
>>> # KonlIndex.index() is like inserting into a DB
>>> index.index("귀환자의 마법은 특별해야 합니다")
IndexingResult(status_code=<IndexingStatusCode.SUCCESS: 'SUCCESS'>, document_id=1)
>>> index.index("마법소녀 따위는 이제 됐으니까.")
IndexingResult(status_code=<IndexingStatusCode.SUCCESS: 'SUCCESS'>, document_id=2)
>>> index.index("일찍이 마법소녀와 악은 적대하고 있었다.")
IndexingResult(status_code=<IndexingStatusCode.SUCCESS: 'SUCCESS'>, document_id=3)
>>> index.search(["마법"], TokenSearchMode.OR)
[1] # "마법" is indexed in document 1
>>> index.search(["마법소녀"], TokenSearchMode.OR)
[2, 3] # "마법소녀" is indexed in documents 2 and 3
>>> index.search(["마법소녀", "적대"], TokenSearchMode.AND) # matches only documents that have both "마법소녀" and "적대"
[3]
>>> index.get(2)
IndexGetResponse(status_code=<GetStatusCode.SUCCESS: 'SUCCESS'>, result=IndexGetResult(id=2, document='마법소녀 따위는 이제 됐으니까.'))
>>> index.get_all()
[IndexGetResponse(status_code=<GetStatusCode.SUCCESS: 'SUCCESS'>, result=IndexGetResult(id=1, document='귀환자의 마법은 특별해야 합니다')), IndexGetResponse(status_code=<GetStatusCode.SUCCESS: 'SUCCESS'>, result=IndexGetResult(id=2, document='마법소녀 따위는 이제 됐으니까.')), IndexGetResponse(status_code=<GetStatusCode.SUCCESS: 'SUCCESS'>, result=IndexGetResult(id=3, document='일찍이 마법소녀와 악은 적대하고 있었다.'))]
>>> index.search_suggestions("ㅈ") # searches for all tokens that begin with 'ㅈ', useful for autocomplete
['적대', '적대하고']
>>> index.close()
>>> ks.close()
>>> ks.destroy() # deletes DB
More internal and advanced usage examples can be found in test/test_konlsearch.py.
- Indexing the full set of Korean Wikipedia titles (~30MB) takes about 280s on a 2021 Apple M1 Max with 32GB of RAM (a bulk-indexing sketch follows this list).
- The dump file (kowiki-20240801-all-titles-in-ns0.gz) can be downloaded from the official Wikipedia database dumps page.
- The resulting index grows to about 200MB with the default RocksDB configuration.
- KonlSearch relies on python-mecab-ko for tokenization and on hangul-toolkit for decomposing Hangul syllables into consonants and vowels (a second sketch below shows both in isolation).
- RocksDB is used as the storage engine, with RocksDict as the RocksDB Python binding.
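The Wikipedia-titles benchmark above can be reproduced with a short loop over the dump file. The following is a minimal sketch that relies only on the API shown in the session above; the database path, the index name, and the assumption that the decompressed dump contains one title per line are illustrative rather than part of KonlSearch.

import gzip

from konlsearch.search import KonlSearch

ks = KonlSearch("./wiki-title-db")   # hypothetical DB path
index = ks.index("wiki_title")       # hypothetical index name

# The gzipped dump is assumed to contain one page title per line.
with gzip.open("kowiki-20240801-all-titles-in-ns0.gz", "rt", encoding="utf-8") as f:
    for line in f:
        title = line.strip()
        if title:
            index.index(title)   # returns an IndexingResult as shown above

index.close()
ks.close()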
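For intuition about what gets indexed, the two text-processing dependencies can also be exercised on their own. This is a rough sketch assuming the commonly documented entry points of python-mecab-ko (a MeCab class with a morphs() method) and hangul-toolkit (imported as hgtk); KonlSearch's internal tokenization may combine them differently.

import hgtk               # hangul-toolkit
from mecab import MeCab   # python-mecab-ko

tokenizer = MeCab()

# Morpheme-level tokens; the exact segmentation depends on the installed dictionary.
print(tokenizer.morphs("일찍이 마법소녀와 악은 적대하고 있었다."))

# Decompose a Hangul syllable into its consonants and vowels.
# This kind of decomposition is what lets search_suggestions("ㅈ") match tokens such as "적대".
print(hgtk.letter.decompose("적"))   # ('ㅈ', 'ㅓ', 'ㄱ')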