simonw/llm-cluster: LLM plugin for clustering embeddings #839
Snippet
Content
LLM plugin for clustering embeddings
Background on this project: Clustering with llm-cluster.
Installation
Install this plugin in the same environment as LLM.
Usage
The plugin adds a new command, llm cluster. This command takes the name of an embedding collection and the number of clusters to return.
First, use paginate-json and jq to populate a collection. In this case we are embedding the title and body of every issue in the llm repository, and storing the result in an issues.db database:
The --store flag causes the content to be stored in the database along with the embedding vectors.
Now we can cluster those embeddings into 10 groups:
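As a minimal sketch, the same call can be assembled from Python. The collection name llm-issues is a placeholder that is not given above, and the argument order simply follows the description: collection name, then number of clusters, then the -d option.

```python
import shlex


def build_cluster_args(collection, n_clusters, database=None):
    """Assemble an argv list for `llm cluster`.

    Only the pieces described above are used: the collection name, the
    number of clusters, and the optional -d database path.
    """
    args = ["llm", "cluster", collection, str(n_clusters)]
    if database is not None:
        # Omitting -d leaves the choice to LLM's default embeddings database.
        args += ["-d", database]
    return args


# "llm-issues" is a hypothetical collection name used for illustration.
args = build_cluster_args("llm-issues", 10, database="issues.db")
print(shlex.join(args))  # llm cluster llm-issues 10 -d issues.db
```

The resulting list can be handed to subprocess.run(args, capture_output=True, text=True, check=True) in an environment where LLM and this plugin are installed.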
If you omit the -d option the default embeddings database will be used.
The output should look something like this (truncated):
The content displayed is truncated to 100 characters. Pass --truncate 0 to disable truncation, or --truncate X to truncate to X characters.
Generating summaries for each cluster
The --summary flag will cause the plugin to generate a summary for each cluster, by passing the content of the items (truncated according to the --truncate option) through a prompt to a Large Language Model.
This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.
Since this can run a large amount of text through an LLM, it can be expensive, depending on which model you are using.
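To get a rough feel for how much text a summary run might send, the sketch below applies the truncation rule (100 characters by default, 0 to disable) to each item and totals the characters per cluster. It is an illustration only, not the plugin's own code: the "items" and "content" keys are assumptions about the shape of the cluster data, and the prompt text itself is not counted.

```python
def truncate(content, limit=100):
    """Cut content to `limit` characters; a limit of 0 disables truncation."""
    return content if limit == 0 else content[:limit]


def characters_per_cluster(clusters, limit=100):
    """Return the number of item characters each cluster would contribute."""
    return [
        sum(len(truncate(item["content"], limit)) for item in cluster["items"])
        for cluster in clusters
    ]


# Made-up data; the "items" and "content" keys are assumptions.
clusters = [
    {"items": [{"content": "a" * 250}, {"content": "b" * 40}]},
    {"items": [{"content": "c" * 120}]},
]
print(characters_per_cluster(clusters))           # [140, 100]
print(characters_per_cluster(clusters, limit=0))  # [290, 120]
```

Character counts are only a proxy for cost, since models are typically billed by tokens rather than characters.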
This feature only works for embeddings that have had their associated content stored in the database using the --store flag.
You can use it like this:
This uses the default prompt and the default model.
Partial example output:
To use a different model, e.g. GPT-4, pass the --model option:
The default prompt used is:
To use a custom prompt, pass --prompt:
A "summary" key will be added to each cluster, containing the generated summary.
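The summary options described above can be sketched from Python as follows. This is a hedged illustration: the collection name, the model ID gpt-4, and the prompt text are placeholders, and the parsing step assumes the command prints a JSON array of cluster objects, which the "summary" key suggests but this text does not show.

```python
import json


def with_summary(args, model=None, prompt=None):
    """Return a copy of an `llm cluster` argv list with summary options added."""
    extended = list(args) + ["--summary"]
    if model is not None:
        extended += ["--model", model]
    if prompt is not None:
        extended += ["--prompt", prompt]
    return extended


def extract_summaries(output_text):
    """Collect the "summary" value of each cluster, in output order.

    Assumes the command printed a JSON array of cluster objects.
    """
    return [cluster.get("summary") for cluster in json.loads(output_text)]


# Hypothetical collection name, model ID, and prompt text.
args = with_summary(
    ["llm", "cluster", "llm-issues", "10", "-d", "issues.db"],
    model="gpt-4",
    prompt="One short title for these related issues",
)

# Hand-written stand-in for real output, so the sketch runs without the plugin.
sample_output = '[{"summary": "Plugin ideas"}, {"summary": "Embedding bugs"}]'
print(extract_summaries(sample_output))  # ['Plugin ideas', 'Embedding bugs']
```

Clusters without a "summary" key, for example from a run without --summary, come back as None rather than raising an error.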