SimSort

Using dimensionality reduction and graphs to group similar texts

Purpose

This project provides two implementations (a command-line Python script and a configurable notebook) of a simple sort function that groups semantically similar texts together. The algorithm uses common text embedding and clustering techniques to arrange texts by similarity along a single dimension, producing a sorted list of texts much like sorting alphabetically or by text length. Simsort specifically targets cases where it is useful to group similar texts together (as in traditional higher-dimensional clustering), but where the groups should also be arranged in relation to one another (i.e. sorted in some interpretable way).

Approach

The simsort algorithm takes a series of unordered texts and orders them so that similar texts sit together in the resulting output. Traditional sort methods are based on some absolute metric (alphabetical order, number of tokens/characters, presence of a given token/character), and have predictable endpoints based on the sort criterion (a -> z, low numbers -> high numbers, short texts -> long texts). In this way, we can think of sorting techniques as placing all data observations on a 1-dimensional line, with the position of an observation corresponding to its relative similarity to one of the two endpoints of the line.
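As a small illustration of this framing (the example texts and variable names here are illustrative, not taken from the project), each traditional sort can be read as mapping every text to one comparable value and ordering by that value:

```python
# Each traditional sort maps a text to a single comparable value
# (its position on a 1-dimensional line) and orders texts by that value.
texts = ["pear", "fig", "banana", "apple"]

# Position = the string itself, so the endpoints run from a to z.
alphabetical = sorted(texts)

# Position = number of characters, so the endpoints run from short to long.
by_length = sorted(texts, key=len)

# The 1-D position assigned to each text under the length criterion.
length_positions = {text: len(text) for text in texts}
```

Simsort keeps this shape (one position per text) but derives the position from embedding similarity rather than from a fixed property of the string.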

Using this framework, simsort defines the endpoints of the sorting axis as the two most dissimilar observations, with every other observation lying somewhere between the two endpoints. Similarity is defined using the distance between the embedding vectors for each text, and these embedding vectors are projected onto a 1-dimensional space via 1) dimensionality reduction or 2) solving for the shortest path visiting all nodes of a graph based on distances between embeddings.
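The graph-based route can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, Euclidean distance is an assumed choice of metric, and a greedy nearest-neighbour walk is swapped in as a simple heuristic in place of a true shortest-path solver (it does not guarantee the shortest path, nor that the walk ends at the other endpoint).

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_dissimilar_pair(vectors):
    """Return indices (i, j) of the two vectors that are farthest apart."""
    return max(
        combinations(range(len(vectors)), 2),
        key=lambda pair: euclidean(vectors[pair[0]], vectors[pair[1]]),
    )

def greedy_path(vectors, start):
    """Visit every vector once, always stepping to the nearest unvisited one."""
    order = [start]
    remaining = set(range(len(vectors))) - {start}
    while remaining:
        last = vectors[order[-1]]
        nearest = min(remaining, key=lambda i: euclidean(last, vectors[i]))
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Toy 2-D "embeddings": two tight groups that are far from each other.
vectors = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.8), (0.1, 0.3)]

start, end = most_dissimilar_pair(vectors)  # candidate endpoints of the axis
order = greedy_path(vectors, start)         # visiting order = sort order
```

In this toy case the walk starts at one endpoint, exhausts the nearby group, then crosses to the far group, so the visiting order can be used directly as a sort index for the original texts.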

Simsort aims to provide a more powerful sort utility for working with text data, one that is especially useful for manual data analysis in graphical tools such as spreadsheets.

Algorithm

Steps for the simsort algorithm are as follows:

Embed the unsorted texts using the model of choice
Pass the embedding array into either:
- A dimensionality reduction algorithm, bringing the dimensionality of the embeddings to n = 1, or
- A weighted graph of pairwise distances between embedding vectors, using a shortest path algorithm to find the shortest path that visits all nodes in the graph
Sort the original texts by either the sorted embedding values or the shortest path indices
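The steps above, following the dimensionality reduction branch, might look something like the sketch below. Everything here is an assumption made for illustration: `embed` is a bag-of-words stand-in for a real embedding model, `first_component_scores` reduces to one dimension by projecting onto the first principal component (found by power iteration), and `simsort_1d` is a hypothetical name rather than a function from this project.

```python
import math

def embed(texts):
    """Stand-in for an embedding model (assumption): bag-of-words counts."""
    vocab = sorted({word for t in texts for word in t.lower().split()})
    return [[t.lower().split().count(word) for word in vocab] for t in texts]

def first_component_scores(vectors, iterations=50):
    """Reduce vectors to 1-D by projecting onto their first principal component."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[k] for v in vectors) / n for k in range(d)]
    centred = [[v[k] - means[k] for k in range(d)] for v in vectors]

    def project(direction):
        return [sum(x * w for x, w in zip(row, direction)) for row in centred]

    # Start from the largest centred row so the start lies in the data's span.
    direction = max(centred, key=lambda row: sum(x * x for x in row))
    for _ in range(iterations):
        scores = project(direction)
        # Multiply by the scatter matrix (X^T X) without forming it explicitly.
        updated = [sum(s * row[k] for s, row in zip(scores, centred)) for k in range(d)]
        norm = math.sqrt(sum(x * x for x in updated))
        if norm == 0.0:
            break  # all vectors identical, so every score is 0
        direction = [x / norm for x in updated]
    return project(direction)

def simsort_1d(texts):
    """Embed, reduce to one dimension, and sort texts by the 1-D value."""
    scores = first_component_scores(embed(texts))
    order = sorted(range(len(texts)), key=scores.__getitem__)
    return [texts[i] for i in order]

texts = [
    "the cat sat on the mat",
    "stock prices fell sharply",
    "the cat slept on the mat",
    "stock prices rose sharply",
]
result = simsort_1d(texts)
```

Note that the direction of the resulting axis is arbitrary: which group comes first depends on the sign of the principal component, so only the grouping and relative order are meaningful, not which end is the "start".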

Embedding Models

Most embedding approaches will work in the simsort algorithm, but the present implementation uses the following models:

Order Solvers

The field of applicable dimensionality reduction techniques/shortest path algorithms is more limited, and most are used in the present project:

Outputs

- Simsort script (Note: if you're running this on your local machine, you may need to run download_models.py to populate the embedding model files)
- Simsort notebook

ryancahildebrandt/simsort
