The structure of this folder mirrors that of `contrastive`. Here, however, supervised learning is used to train a classifier that determines whether a pair of sentences constitutes a paraphrase (label=1) or not (label=0). To train a supervised model, simply call the main script with `--mode=supervised`:
```shell
python main.py --mode=supervised --config=Supervised_SGD
```
File | Description
---|---
learning_manager.py | Learning Manager class that defines the training process
predictor.py | Performs inference on a dataset of sentence pairs
model_configs.py | Script that writes model_configs.json
models.py | Model definition and optimizer selection
Note: The weights are not released publicly; please contact us with your desired use case via ss56pupo(at)studserv.uni-leipzig.de.
The models are sentence-transformers built on the encoder all-MiniLM-L6-v2. The encoder maps each sentence pair to a single 384-dimensional embedding, and a single linear layer estimates the paraphrase probability from that embedding.
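As a rough illustration of the classification head described above, the sketch below applies a linear layer followed by a sigmoid to a 384-dimensional embedding. The weights and the embedding are made up for illustration; in the real model the embedding comes from all-MiniLM-L6-v2 and the layer weights are learned.

```python
import math
import random

DIM = 384  # embedding size produced by all-MiniLM-L6-v2

# Hypothetical weights standing in for the trained linear layer.
random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(DIM)]
bias = 0.0

def paraphrase_probability(embedding):
    """Apply the single linear layer, then a sigmoid, to get P(paraphrase)."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

pair_embedding = [0.01] * DIM  # stand-in for an encoder output
prob = paraphrase_probability(pair_embedding)
```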
To use the model, install the packages listed in requirements.txt:

```shell
pip install -r requirements.txt
```
To apply the model, use the Predictor class provided in predictor.py. You need to provide two inputs:
- A Hugging Face dataset with the columns "idx", "sentence1", "sentence2", and "labels" (e.g. GLUE MRPC)
- A valid model name; refer to model_configs.json for the available models
```python
import predictor as p
from datasets import load_dataset

# Load the validation split of GLUE MRPC
dataset = load_dataset(path="glue", name="mrpc")["validation"]

# Pick any model name listed in model_configs.json
predictor = p.Predictor(model_name="Supervised_SGD")
predictor.tokenize_dataset(dataset)
logits, labels = predictor.predict(return_logits=True, batch_size=32)
```
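To turn the returned logits into hard labels and evaluation metrics, something like the following works. The logits and labels below are made up, and the sketch assumes `predict` returns one row of two class logits per sentence pair; verify that shape against predictor.py before relying on it.

```python
# Hypothetical outputs; real values come from Predictor.predict above.
logits = [[2.1, -1.0], [-0.5, 0.8], [0.2, 1.4], [1.7, -0.3]]
labels = [0, 1, 0, 0]

# Argmax over the two classes: index 1 means "paraphrase".
preds = [0 if a >= b else 1 for a, b in logits]

tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```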
The F1 scores, precision, and recall values for each model can be found in the evaluation folder. The columns correspond to the following datasets, which are available on request via Hugging Face:
- Val = Custom validation dataset
- Test = Custom test dataset
- noObf = No obfuscation subset of PAN-13
- randomObf = Random obfuscation subset of PAN-13
- translationObf = Translation obfuscation subset of PAN-13
The models were developed as part of a student research project to compare the performance of contrastive learning on text alignment with that of traditional supervised learning.
The models are intended to be used for paraphrase detection, for instance in the text alignment subtask of text reuse identification. By default, input text longer than 256 word pieces is truncated.
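The truncation behaviour amounts to keeping only the first 256 word pieces of each input, as sketched below with stand-in tokens (in the real pipeline the tokenizer performs this truncation):

```python
MAX_LENGTH = 256  # word pieces kept per input

tokens = [f"piece{i}" for i in range(300)]  # stand-in word-piece sequence
truncated = tokens[:MAX_LENGTH]  # everything beyond 256 pieces is dropped
```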
The models were trained on a custom dataset derived from ParaBank and PAWS. All models were trained for at most ten epochs; training stopped early when validation performance did not improve. The name of each model reflects the optimizer used to train it.
- Supervised_SGD: Optimizer = SGD
The values used in training are summarized in model_configs.json.
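The exact keys are defined by model_configs.py. As an illustration only, a config entry might be written and read back like this; the layout below is an assumption, and only the values stated in this README (SGD, ten epochs, 256 word pieces) are grounded.

```python
import json

# Assumed config layout -- check model_configs.py for the real keys.
configs = {
    "Supervised_SGD": {
        "optimizer": "SGD",
        "max_epochs": 10,
        "max_length": 256,
    }
}

with open("model_configs.json", "w") as f:
    json.dump(configs, f, indent=2)

with open("model_configs.json") as f:
    loaded = json.load(f)

optimizer = loaded["Supervised_SGD"]["optimizer"]
```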