This is a command-line tool for managing a database of vector embeddings. It uses LangChain and Chroma to handle data ingestion, storage, and retrieval. Features include:
- Creating and managing indices of document embeddings
- Searching documents based on similarity
- Retrieving embedded documents
- Estimating the cost of embedding a document
- and more...
- Clone the repository:
git clone <repository_url>
- Navigate to the repository folder:
cd <repository_folder>
- Install the required Python packages:
pip install -r requirements.txt
- Set the VDB_DIR environment variable to the directory where your vector databases are stored. You can do this in your shell's configuration file (e.g., .bashrc, .bash_profile, or .zshrc), or you can set it in your script before running the Python file.
In Bash:
export VDB_DIR=/path/to/your/directory
In the Python script:
import os
os.environ["VDB_DIR"] = "/path/to/your/directory"
Command:
python main.py list-indices
Description:
Lists all the indices stored in the directory specified by the VDB_DIR environment variable.
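As a rough illustration of what listing indices could involve, the sketch below assumes that each index lives in its own subdirectory of VDB_DIR. That layout, and the function name list_indices, are assumptions made for this example and are not taken from the tool's source.

```python
import os
from pathlib import Path


def list_indices(vdb_dir=None):
    """Return the sorted names of all subdirectories of the VDB directory.

    Assumption: one subdirectory per index. The real tool may store
    its indices differently.
    """
    # Fall back to the VDB_DIR environment variable when no path is given.
    root = Path(vdb_dir or os.environ["VDB_DIR"])
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling list_indices() with no argument raises KeyError if VDB_DIR is not set, which makes a missing configuration easy to spot.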
Command:
python main.py create-index <index_name> <input_file>
Description:
Creates an index named <index_name> from the text file <input_file>. Text from the file is split into chunks, and an embedding is created for each chunk. The embeddings are then stored in the new index.
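The chunking step can be pictured with a minimal fixed-size splitter. This is only a sketch: the tool most likely relies on a LangChain text splitter, and the function name, the default size of 1000 characters, and the overlap parameter below are all hypothetical.

```python
def split_into_chunks(text, chunk_size=1000, overlap=0):
    """Split text into consecutive chunks of at most chunk_size characters.

    Hypothetical helper for illustration. `overlap` is the number of
    characters shared between neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    # Each chunk starts `step` characters after the previous one.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Each chunk would then be passed to the embedding model separately, so smaller chunks mean more embeddings per file.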
Command:
python main.py describe-index <index_name>
Description:
Describes the specified index. Prints the total number of documents in the index and the set of unique sources in the metadata of the indexed documents.
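The summary described above amounts to a count plus a set of distinct metadata values. The sketch below assumes each indexed document is available as a dictionary with a "metadata" entry that may contain a "source" key. That record shape is an assumption for this example, not the tool's actual data model.

```python
def describe(documents):
    """Return (document count, set of unique sources).

    Assumes each document looks like
    {"text": "...", "metadata": {"source": "..."}}.
    """
    sources = set()
    for doc in documents:
        source = doc.get("metadata", {}).get("source")
        if source is not None:
            sources.add(source)
    return len(documents), sources
```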
Command:
python main.py contents <index_name>
Description:
Prints the content of the specified index, including both the text and metadata of each indexed document.
Command:
python main.py search-similarity <index_name> <query>
Description:
Searches the specified index for documents that are similar to the provided query. Prints the content of each found document.
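Similarity search ranks stored embeddings by how close they are to the embedding of the query. The sketch below uses cosine similarity over plain Python lists to show the idea. The distance metric the tool actually uses depends on how its Chroma collection is configured, and the names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries most similar to query_vec.

    `indexed` is a list of (text, embedding) pairs.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

In the real tool, the query vector would come from the same embedding model that was used when the index was built.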
Command:
python main.py chat <index_name> <query> [--temperature=<temperature>] [--model=<model>]
Description:
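Answers the query with a GPT model, using relevant information retrieved from the index as context. The optional --temperature and --model options set the sampling temperature and the model to use.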
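One common way to combine retrieved passages with a question is to place them in a single prompt. The sketch below shows that assembly step only. The prompt wording and the function name build_prompt are assumptions for illustration, and the tool may build its prompt differently, for example through a LangChain chain.

```python
def build_prompt(query, passages):
    """Combine retrieved passages and the user's question into one prompt."""
    # Number the passages so the model can tell them apart.
    context = "\n\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```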
Command:
python main.py search-keyword <index_name> <keyword>
Description:
Searches the specified index for documents that contain the provided keyword. Prints the content of each found document.
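Unlike similarity search, keyword search needs no embeddings: it is a text filter. A minimal sketch follows, assuming documents are dictionaries with a "text" entry and that matching is case-insensitive. Both points are assumptions, since the source does not say how the tool handles case.

```python
def search_keyword(documents, keyword):
    """Return the documents whose text contains the keyword.

    Matching is case-insensitive here by choice, not because the
    tool is known to behave this way.
    """
    needle = keyword.lower()
    return [doc for doc in documents if needle in doc["text"].lower()]
```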
Command:
python main.py insert-text <index_name> <input_file> [--chunk_size=<chunk_size>]
Description:
Inserts text from the file <input_file> into the specified index. The text is split into chunks (with size specified by the chunk_size option), an embedding is created for each chunk, and these embeddings are then added to the index.
Command:
python main.py remove-text <index_name> <id>
Description:
Removes the document with the specified ID from the index.
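Inserting and removing documents both rely on every stored chunk having its own ID. The in-memory stand-in below sketches that bookkeeping. The class InMemoryIndex, its method names, and the use of UUIDs as IDs are hypothetical, and the real index is a Chroma collection, not a Python dictionary.

```python
import uuid


class InMemoryIndex:
    """Toy stand-in for an index that maps document IDs to stored chunks."""

    def __init__(self):
        self.documents = {}

    def insert(self, chunks, source):
        """Store each chunk under a fresh ID and return the new IDs."""
        ids = []
        for chunk in chunks:
            doc_id = str(uuid.uuid4())
            self.documents[doc_id] = {"text": chunk, "metadata": {"source": source}}
            ids.append(doc_id)
        return ids

    def remove(self, doc_id):
        """Delete one document by ID. Return True if it existed."""
        return self.documents.pop(doc_id, None) is not None
```

Returning the IDs from insert gives the caller something to pass to remove later.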
Command:
python main.py estimate-cost <input_file>
Description:
Estimates the cost of embedding the text from the file <input_file>. Prints the estimated cost.
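Embedding cost is usually proportional to the number of tokens in the text. The sketch below approximates the token count with a rule of thumb of about four characters per token and multiplies it by a price that the caller supplies. Both the heuristic and the parameter names are assumptions. The tool itself may count tokens exactly with a tokenizer, and actual prices depend on the embedding model.

```python
import math


def estimate_cost(text, price_per_1k_tokens, chars_per_token=4.0):
    """Return (estimated token count, estimated cost).

    Hypothetical helper: the token count is a rough character-based
    approximation, not a real tokenizer result.
    """
    estimated_tokens = math.ceil(len(text) / chars_per_token)
    cost = estimated_tokens / 1000 * price_per_1k_tokens
    return estimated_tokens, cost
```

Because the cost depends only on the text length and the price, the estimate can be computed before any embedding request is sent.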
Contributions are welcome! Please create a pull request with your changes.