
Semantic Search


Embeddings

Embeddings are a way to represent text as a vector in a high-dimensional space. The distance between two vectors can be used to measure the similarity between the texts they represent. This powerful principle allows for semantic search, where the meaning of the text is taken into account, not just the words.
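
To make the principle concrete, here is a minimal sketch (not part of bkmr, which is written in Rust) of how two texts can be compared via embeddings. It assumes the official openai Python client, an OPENAI_API_KEY in the environment, and the 1536-dimensional text-embedding-ada-002 model:

# Illustrative sketch only, not bkmr code.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    # text-embedding-ada-002 returns a 1536-dimensional vector
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Texts with related meaning score higher than unrelated ones,
# even if they share no words.
print(cosine_similarity(embed("harden a Flask web service"),
                        embed("python security")))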

However, embeddings only add value when there is enough context, and a bare bookmark URL does not provide much to work with. bkmr therefore uses the title, tags and comments as input to the OpenAI Embeddings API.

The more context you add via comments, the better the results will be.

[Screenshot: semantic search results]

Usage

The --openai flag enables semantic search via OpenAI's Embeddings API. The OPENAI_API_KEY environment variable must be set.

  • Embeddings increase the database size, since each embedding has 1536 dimensions.
  • Semantic search requires the --openai flag.
export OPENAI_API_KEY=<your-key>

# Use semantic search
bkmr --openai sem-search "python security"  # requires OPENAI_API_KEY

# Search DB for bookmarks which do not have embeddings and create them
bkmr --openai backfill

bkmr is smart enough to only reach out to OpenAI if the bookmark has changed: it maintains a content checksum for every bookmark. If you do not want to use this feature, simply do not set --openai; everything works just fine without semantic search.
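
The checksum idea itself is simple. The following is a conceptual sketch only (not bkmr's actual implementation); the fields and hash function are assumptions for illustration:

# Conceptual sketch of a content checksum (illustrative only).
import hashlib

def content_checksum(title: str, tags: str, comment: str) -> str:
    payload = "\x1f".join([title, tags, comment]).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

# Only call the Embeddings API when the checksum differs from the stored one.
stored = "d41d8cd98f00b204e9800998ecf8427e"  # hypothetical value from the DB
current = content_checksum("Flask security guide", "python,security", "Hardening tips")
needs_new_embedding = current != stored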

Semantic Search for all your data

bkmr's data model is flexible enough to generalize semantic search to arbitrary data. By interpreting the URL as a local file path, the file content can be used as input to the OpenAI Embeddings API. After importing the data, the bookmark table holds the embeddings together with their associated URL or file path.

bkmr --openai load-texts tokenized.ndjson

The tokenized.ndjson file must contain one JSON object per line, each with the following keys:

  • id: unique identifier (URL or file path)
  • content: the text to be embedded for semantic search

Example:

{
  "id": "$VIMWIKI_PATH/help/qk/quarkus.md:0",
  "content": "..."
}
{
  "id": "$VIMWIKI_PATH/help/qk/quarkus.md:1",
  "content": "..."
}
{
  "id": "$VIMWIKI_PATH/help/qk/quarkus.md:2",
  "content": "..."
}

Preparation of data

prepembd is a small utility that scans a directory for markdown files and creates a tokenized.ndjson input file for bkmr --openai load-texts. It allows filtering out sensitive data and ensures that the chunk size does not exceed the OpenAI API limits.
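
If you want to roll your own preparation step, a minimal sketch along the same lines might look like this. The directory name, chunk size and id scheme are assumptions for illustration, not prepembd's actual behaviour:

# Produce a tokenized.ndjson file from markdown files (illustrative sketch).
import json
from pathlib import Path

CHUNK_CHARS = 4000  # keep chunks comfortably below the OpenAI API limit (assumption)

def chunks(text: str, size: int = CHUNK_CHARS):
    for i in range(0, len(text), size):
        yield text[i:i + size]

with open("tokenized.ndjson", "w", encoding="utf-8") as out:
    for path in Path("notes").rglob("*.md"):
        content = path.read_text(encoding="utf-8")
        for n, chunk in enumerate(chunks(content)):
            # id mirrors the "<file path>:<chunk index>" scheme shown above
            out.write(json.dumps({"id": f"{path}:{n}", "content": chunk}) + "\n")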
