prepembd


Installation

Install prepembd:

python3 -m pip install prepembd --upgrade

Requirements

Why

I've been using markdown for a long time to take notes in every possible scenario. I even manage my Anki cards with markdown (inka2), so finding relevant information again is paramount. With the advent of semantic search via embeddings, search became much more powerful. However, to create embeddings from markdown, the files have to be prepared in order to reduce noise and produce the correct chunk sizes.

This Python script automates the process and creates a JSON representation of all markdown files, which can then be fed into an embedding model. It is basically a thin wrapper around LangChain, combined with some bespoke filters to eliminate noise.
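To illustrate the idea, here is a minimal sketch of that kind of pipeline (not prepembd's actual implementation): split each markdown file into chunks with a LangChain text splitter and emit one JSON object per chunk. The splitter choice and the chunk_size/chunk_overlap values are assumptions for the example, not prepembd's defaults.

```python
# Sketch only: chunk markdown files and print NDJSON, one JSON object per chunk.
# Assumes the langchain-text-splitters package is installed.
import json
import sys
from pathlib import Path

from langchain_text_splitters import MarkdownTextSplitter


def tokenize(directory: str, prefix: str = "") -> None:
    # Illustrative chunking parameters, not prepembd's defaults.
    splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
    for path in sorted(Path(directory).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for i, chunk in enumerate(splitter.split_text(text)):
            # id format mirrors the example output below: <prefix><relative path>:<chunk index>
            doc = {"id": f"{prefix}{path.relative_to(directory)}:{i}", "content": chunk}
            sys.stdout.write(json.dumps(doc) + "\n")


if __name__ == "__main__":
    tokenize(sys.argv[1], prefix=sys.argv[2] if len(sys.argv) > 2 else "")
```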

Usage

prepembd tokenize --prefix '$VIMWIKI_PATH/' <directory> | tee -a output.ndjson

# cat output.ndjson:
{
  "id": "$VIMWIKI_PATH/help/qk/quarkus.md:0",
  "content": "..."
}
{
  "id": "$VIMWIKI_PATH/help/qk/quarkus.md:1",
  "content": "..."
}
{
  "id": "$VIMWIKI_PATH/help/qk/quarkus.md:2",
  "content": "..."
}

This script integrates particularly well with bkmr.
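The NDJSON output can then be fed straight into an embedding model. A minimal downstream sketch, assuming OpenAI's Python client and the text-embedding-3-small model (any embedding backend works similarly):

```python
# Sketch only: read the NDJSON produced above and attach an embedding to each chunk.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("output.ndjson", encoding="utf-8") as f:
    chunks = [json.loads(line) for line in f if line.strip()]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[c["content"] for c in chunks],
)

embedded = [
    {"id": c["id"], "embedding": d.embedding}
    for c, d in zip(chunks, response.data)
]
print(f"embedded {len(embedded)} chunks")
```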