The structure of this folder mirrors that of `contrastive`. Here, however, supervised learning is used to train a classifier that determines whether a pair of sentences constitutes a paraphrase (label=1) or not (label=0). To train a supervised model, simply call the main script with `--mode=supervised`:
```shell
python main.py --mode=supervised --config=Supervised_SGD
```
File | Description
---|---
learning_manager.py | Learning Manager class that defines the training process
predictor.py | Performs inference on a dataset of sentence pairs
model_configs.py | Script that writes model_configs.json
models.py | Model definition and optimizer selection
Note: The weights are not released publicly; please contact us with your desired use case via ss56pupo(at)studserv.uni-leipzig.de.
The models are sentence-transformers built on the encoder all-MiniLM-L6-v2. The encoder maps each sentence pair to a single 384-dimensional embedding, and a single linear layer estimates the paraphrase probability from that embedding.
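As a rough illustration of the classification head described above, the sketch below applies a linear layer followed by a sigmoid to a 384-dimensional embedding. The weights and the embedding are made up for illustration; in the real model the embedding comes from all-MiniLM-L6-v2 and the layer weights are learned.

```python
import math
import random

DIM = 384  # embedding size produced by all-MiniLM-L6-v2

# Hypothetical weights standing in for the trained linear layer.
random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(DIM)]
bias = 0.0

def paraphrase_probability(embedding):
    """Apply the single linear layer, then a sigmoid, to get P(paraphrase)."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

pair_embedding = [0.01] * DIM  # stand-in for an encoder output
prob = paraphrase_probability(pair_embedding)
```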
To use the model, install the packages listed in requirements.txt:

```shell
pip install -r requirements.txt
```
To apply the model, use the Predictor class provided in predictor.py. You need to provide two inputs:
- A Hugging Face dataset with the columns "idx", "sentence1", "sentence2", and "labels" (e.g. GLUE MRPC)
- A valid model name; refer to model_configs.json for the available models
```python
import predictor as p
from datasets import load_dataset

# Load the validation split of GLUE MRPC
dataset = load_dataset(path="glue", name="mrpc")["validation"]

# Pick any model name listed in model_configs.json
predictor = p.Predictor(model_name="Supervised_SGD")
predictor.tokenize_dataset(dataset)
logits, labels = predictor.predict(return_logits=True, batch_size=32)
```
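To turn the returned logits into hard labels and evaluation metrics, something like the following works. The logits and labels below are made up, and the sketch assumes `predict` returns one row of two class logits per sentence pair; verify that shape against predictor.py before relying on it.

```python
# Hypothetical outputs; real values come from Predictor.predict above.
logits = [[2.1, -1.0], [-0.5, 0.8], [0.2, 1.4], [1.7, -0.3]]
labels = [0, 1, 0, 0]

# Argmax over the two classes: index 1 means "paraphrase".
preds = [0 if a >= b else 1 for a, b in logits]

tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```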
The F1 scores, precision, and recall values for each model can be found in the evaluation folder. The columns correspond to the following datasets, which are available on request via Hugging Face:
- Val = Custom validation dataset
- Test = Custom test dataset
- noObf = No obfuscation subset of PAN-13
- randomObf = Random obfuscation subset of PAN-13
- translationObf = Translation obfuscation subset of PAN-13
The models were developed as part of a student research project to compare the performance of contrastive learning on text alignment with that of traditional supervised learning.
The models are intended to be used for paraphrase detection, for instance in the text alignment subtask of text reuse identification. By default, input text longer than 256 word pieces is truncated.
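The truncation behaviour amounts to keeping only the first 256 word pieces of each input, as sketched below with stand-in tokens (in the real pipeline the tokenizer performs this truncation):

```python
MAX_LENGTH = 256  # word pieces kept per input

tokens = [f"piece{i}" for i in range(300)]  # stand-in word-piece sequence
truncated = tokens[:MAX_LENGTH]  # everything beyond 256 pieces is dropped
```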
The models were trained on a custom dataset derived from ParaBank and PAWS. All models were trained for at most ten epochs; training stopped early when validation performance did not improve. The name of each model reflects the optimizer used to train it.
- Supervised_SGD: Optimizer = SGD
The values used in training are summarized in model_configs.json.
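The exact keys are defined by model_configs.py. As an illustration only, a config entry might be written and read back like this; the layout below is an assumption, and only the values stated in this README (SGD, ten epochs, 256 word pieces) are grounded.

```python
import json

# Assumed config layout -- check model_configs.py for the real keys.
configs = {
    "Supervised_SGD": {
        "optimizer": "SGD",
        "max_epochs": 10,
        "max_length": 256,
    }
}

with open("model_configs.json", "w") as f:
    json.dump(configs, f, indent=2)

with open("model_configs.json") as f:
    loaded = json.load(f)

optimizer = loaded["Supervised_SGD"]["optimizer"]
```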