Project Description

llm-cluster-optimizer

A drop-in replacement for sklearn clustering models, this package optimizes sklearn clusters using LLMs

Installation

By default, this package relies on OpenAI for LLM interactions. This requires having copying your API key from here and putting it in the OPENAI_API_KEY envvar.

Otherwise, you can use any other LLM provider, as described below.

To install llm-cluster-optimizer, run the following command in your terminal:

pip install llm-cluster-optimizer

Usage

Here is a simple example of usage that provides two cluster models as options to be optimized. labels will be an np.ndarray of shape (len(input_texts),) representing which cluster each input text falls into. Use verbose=True in LLMClusterOptimizer.__init__() to get more information on progress.

from llm_cluster_optimizer import LLMClusterOptimizer
from sklearn.cluster import AffinityPropagation, HDBSCAN
import numpy as np
from langchain_openai import OpenAIEmbeddings

def embed_text_openai(text):
    return np.array(OpenAIEmbeddings(model="text-embedding-3-large").embed_query(text))

options = [
    HDBSCAN(min_cluster_size=2, min_samples=1),
    AffinityPropagation(damping=0.7),
]
optimizer = LLMClusterOptimizer(options, embed_text_openai)

input_texts = []  # your list of input texts
labels = optimizer.fit_predict_text(input_texts)

To use a clustering algorithm that takes the number of clusters as an input, you likely want to optimize n using the silhouette score or a similar metric. This requires access to the embeddings. You can pass the model wrapped in a function to achieve this:

from sklearn.cluster import AgglomerativeClustering
from util import guess_optimal_n_clusters

def get_agglomerative_model(embeddings: np.ndarray):
    n_clusters = guess_optimal_n_clusters(embeddings, lambda n: AgglomerativeClustering(n_clusters=n))
    return AgglomerativeClustering(n_clusters=n_clusters)


options = [
    HDBSCAN(min_cluster_size=2, min_samples=1),
    AffinityPropagation(damping=0.7),
    get_agglomerative_model,
]
optimizer = LLMClusterOptimizer(options, embed_text_openai)

Docs

init()

Param	Description	Default	Type
`cluster_models`	List of SKlearn cluster model instances (`ClusterMixin`) to choose from. Each item can be either a `ClusterMixin` or a `Callable` that takes a list of embeddings and returns a `ClusterMixin`.	N/A	`Sequence[Union[ClusterMixin, Callable[[ndarray], ClusterMixin]]]`
`embedding_fn`	Function that takes a string and returns its embedding.	N/A	`Callable[[str], ndarray]`
`recluster_model`	`ClusterMixin` used for reclustering results during optimization stages. Returns clusters that are too small. If not passed, `AffinityPropagation(damping=0.7)` is used.	`AffinityPropagation(damping=0.7)`	`ClusterMixin`
`get_cluster_summary`	Function that takes a list of strings which are samples from a cluster and returns a summary of them. See `default` for default functionality. Requires OpenAI API key in envvars if using default.	Uses GPT-4o-mini to summarize strings sampled from the cluster. Summary will be no more than 10 words.	`Callable[[list[str]], str]`
`calculate_clustering_score`	Function that takes a list of `SummarizedCluster` and returns a float representing the clustering quality. Higher is better. See `default` for default functionality.	Calculates the average minimum pairwise cosine distance between cluster summary embeddings.	`Callable[[list[SummarizedCluster]], float]`
`cosine_distance_thresholds_to_combine`	List of floats representing maximum cosine distance between cluster summary embeddings where clusters should be combined. The best threshold is chosen based on `calculate_clustering_score()`.	N/A	`Sequence[float]`
`summary_sample_size`	Number of items to sample from a cluster.	`5`	`int`
`num_summarization_workers`	Number of workers to use when running `get_cluster_summary()`. Be aware of rate limits depending on the LLM provider being used.	`25`	`int`
`num_embed_workers`	Number of workers to use when running `embedding_fn()`. Be aware of rate limits depending on the LLM provider being used.	`50`	`int`
`verbose`	Output logs to console.	`True`	`bool`

fit_predict_text()

Mimics sklearn.base.ClusterMixin.fit_predict(). Given a sequence of texts, returns the cluster labels for each text. Chooses best of cluster_models passed in __init__() based on calculate_clustering_score() Post-processes clusters so that:

large clusters are broken up
small clusters are combined based on similarity

Param	Description	Default	Type
`texts`	List of strings to cluster	N/A	`list[str]`

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
src/llm_cluster_optimizer		src/llm_cluster_optimizer
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project Description

llm-cluster-optimizer

Installation

Usage

Docs

init()

fit_predict_text()

About

Releases

Packages

Languages

License

nsantacruz/LLMClusterOptimizer

Folders and files

Latest commit

History

Repository files navigation

Project Description

llm-cluster-optimizer

Installation

Usage

Docs

init()

fit_predict_text()

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages