Commit

Updated docs
Pringled committed Feb 9, 2025
1 parent eaf840c commit 515bf8a
Showing 3 changed files with 17 additions and 18 deletions.
23 changes: 5 additions & 18 deletions README.md
@@ -54,12 +54,7 @@ Model2Vec is a technique to turn any sentence transformer into a really small st
- [Quickstart](#quickstart)
- [Main Features](#main-features)
- [What is Model2Vec?](#what-is-model2vec)
- [Usage](#usage)
- [Inference](#inference)
- [Distillation](#distillation)
- [Training](#training)
- [Evaluation](#evaluation)
- [Integrations](#integrations)
- [Documentation](#documentation)
- [Model List](#model-list)
- [Results](#results)

@@ -121,18 +116,10 @@ For advanced usage, such as using Model2Vec in the [Sentence Transformers librar

## What is Model2Vec?

Model2vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need _any_ data, just a vocabulary and a model.

The base model2vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence.

Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) are pre-trained using [tokenlearn](https://github.com/MinishLab/tokenlearn), a technique to pre-train model2vec distillation models. These models are created with the following steps:
- **Distillation**: We distill a Model2Vec model from a Sentence Transformer model, using the method described above.
- **Sentence Transformer inference**: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus.
- **Training**: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model.
- **Post-training re-regularization**: We re-regularize the trained embeddings by first performing PCA, and then weighting the embeddings using `smooth inverse frequency (SIF)` weighting using the following formula: `w = 1e-3 / (1e-3 + proba)`. Here, `proba` is the probability of the token in the corpus we used for training.


For a much more extensive deepdive, please refer to our [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec) and our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/).
Model2Vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need _any_ data, just a vocabulary and a model. The core idea is to forward pass a vocabulary through a sentence transformer model, creating static embeddings for the individual tokens. After this, we apply a number of post-processing steps that result in our best models. For a more extensive deep dive, please refer to the resources below; a minimal code sketch of the core idea follows the list.
- Our initial [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec)
- Our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/)
- Our official [documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md)
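
As a rough illustration, the sketch below distills a static model and then uses it for inference. The model name and `pca_dims` value are placeholders, and the exact `distill`/`encode` signatures should be taken as assumptions; see the usage documentation (`docs/usage.md`) for the authoritative API.

```python
# Minimal sketch of distillation + inference; the model name and pca_dims are
# illustrative, and the distill()/encode() signatures are assumed here.
from model2vec.distill import distill

# Forward-pass the tokenizer's vocabulary through a Sentence Transformer and
# turn the post-processed token embeddings into a static model.
m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

# Inference: a sentence embedding is the mean of its token embeddings.
embeddings = m2v_model.encode(["Distillation doesn't need any training data."])
m2v_model.save_pretrained("m2v_model")
```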


## Model List
1 change: 1 addition & 0 deletions docs/README.md
@@ -3,3 +3,4 @@
This directory contains the documentation for Model2Vec. The documentation is formatted in Markdown and organized as follows:
- [usage.md](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): This document provides a technical overview of how to use Model2Vec.
- [integrations.md](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): This document provides examples of how to use Model2Vec in various downstream libraries.
- [what_is_model2vec.md](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): This document provides a high-level overview of how Model2Vec works.
11 changes: 11 additions & 0 deletions docs/what_is_model2vec.md
@@ -0,0 +1,11 @@
# What is Model2Vec?

This document provides a high-level overview of how Model2Vec works.

The base Model2Vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence.
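
For intuition, here is a from-scratch sketch of that pipeline. It is a simplification rather than the library's implementation: the toy vocabulary, the PCA dimensionality, and the particular Zipf-style weight are assumptions made purely for illustration.

```python
# Conceptual sketch of the base technique; not the actual Model2Vec code.
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vocab = ["hello", "world", "static", "embedding"]  # toy, frequency-ordered vocabulary

# 1. Pass every token through the sentence transformer (here each token is
#    simply encoded as a tiny "sentence", which is a simplification).
token_embeddings = st_model.encode(vocab)

# 2. Reduce the dimensionality of the token embeddings with PCA.
token_embeddings = PCA(n_components=2).fit_transform(token_embeddings)

# 3. Apply a Zipf-style weight: tokens that appear early in a frequency-ordered
#    vocabulary (i.e. frequent tokens) get a lower weight.
ranks = np.arange(1, len(vocab) + 1)
token_embeddings *= np.log1p(ranks)[:, None]

lookup = dict(zip(vocab, token_embeddings))

# 4. Inference: the sentence embedding is the mean of its token embeddings.
def embed(sentence: str) -> np.ndarray:
    tokens = [t for t in sentence.lower().split() if t in lookup]
    return np.mean([lookup[t] for t in tokens], axis=0)

print(embed("hello world"))
```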

Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) are pre-trained using [tokenlearn](https://github.com/MinishLab/tokenlearn), a technique to pre-train model2vec distillation models. These models are created with the following steps:
- **Distillation**: We distill a Model2Vec model from a Sentence Transformer model, using the method described above.
- **Sentence Transformer inference**: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus.
- **Training**: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model.
- **Post-training re-regularization**: We re-regularize the trained embeddings by first performing PCA, and then applying smooth inverse frequency (SIF) weighting with the formula `w = 1e-3 / (1e-3 + proba)`, where `proba` is the probability of the token in the corpus we used for training. A short sketch of the training objective and this weighting follows this list.
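
The sketch below spells out the training objective and the SIF formula from the steps above; the surrounding training loop, the corpus, and how token probabilities are estimated are deliberately omitted and should be treated as assumptions.

```python
# Sketch of the training objective and the SIF re-weighting described above.
import numpy as np
import torch

def cosine_distance_loss(m2v_mean: torch.Tensor, st_mean: torch.Tensor) -> torch.Tensor:
    # Training minimizes the cosine distance between the mean embedding produced
    # by the Model2Vec model and the one produced by the Sentence Transformer.
    return (1 - torch.nn.functional.cosine_similarity(m2v_mean, st_mean, dim=-1)).mean()

def sif_weights(token_probas: np.ndarray) -> np.ndarray:
    # Post-training re-regularization: w = 1e-3 / (1e-3 + proba), where proba is
    # the probability of each token in the training corpus.
    return 1e-3 / (1e-3 + token_probas)
```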
