Paraphrase detection with contrastive pre-training

The goal of this repository is to evaluate the effect of contrastive pre-training on classification performance in paraphrase detection. The folder contrastive contains all the code to pretrain an encoder. The folder supervised follows as similar structure and is used to train the encoder and classifier at once. This is meant to create a baseline model against which to compare the performance of the contrastive model. Finally, pretrained is used to train a linear layer on top of the contrastive models after freezing the encoder weights.

Setup

In order to use the repository, install all packages from requirements.txt

pip install -r 'requirements.txt'

Training

The training (contrastive, supervised, and pretrained) can be done using the main.py script by passing the required flags.

The mode-flag is used to switch between contrastive, supervised, and pretrained
The config flag determines the baseline configuration
- Refer to the respective model_configs.json for all available configurations. They can be found in the models folder, e.g. contrastive/models/model_configs.json

Example

python main.py --mode=contrastive --config=Pairwise_LARS

If parameters from the baseline configuration should be adapted, simply pass their name as a flag. For instance, if the epochs should be set to 15 but everything else should remain the same, use the following command:

python main.py --mode=contrastive --config=Pairwise_LARS --epochs=15

Sweeping

In order to obtain optimal hyperparameters, the script in sweeping runs sweeps with Weights and Biases. The usage is similar to the main.py script. Inside the sweeping-folder execute the following command:

python sweeping.py --mode=contrastive

Model cards

The model cards are part of the README in each folder and give an overview on the respective models.

Evaluation

The F1-scores, precision and recall values for each model can be found in the evaluation folder. The columns relate to the follow datasets which are available on request via HuggingFace:

Val = Custom validation dataset
Test = Custom test dataset
noObf = No obfuscation subset of PAN-13
randomObf = Random obfuscation subset of PAN-13
translationObf = Translation obfuscation subset of PAN-13

Name		Name	Last commit message	Last commit date
Latest commit History 93 Commits
.github/workflows		.github/workflows
contrastive		contrastive
dataset		dataset
evaluation		evaluation
img		img
pan_preparation		pan_preparation
pretrained		pretrained
supervised		supervised
sweeping		sweeping
visualization		visualization
Dockerfile		Dockerfile
LICENSE		LICENSE
Paraphrase_Detection.ipynb		Paraphrase_Detection.ipynb
README.MD		README.MD
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Paraphrase detection with contrastive pre-training

Setup

Training

Sweeping

Model cards

Evaluation

About

Releases

Packages

Contributors 2

Languages

License

sheineking/bdlt_contrastive

Folders and files

Latest commit

History

Repository files navigation

Paraphrase detection with contrastive pre-training

Setup

Training

Sweeping

Model cards

Evaluation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages