The goal of this repository is to evaluate the effect of contrastive pre-training on classification performance in paraphrase detection. The folder contrastive contains all the code to pretrain an encoder. The folder supervised follows a similar structure and is used to train the encoder and classifier together. It provides a baseline model against which to compare the performance of the contrastive model. Finally, pretrained is used to train a linear layer on top of the contrastive models after freezing the encoder weights.
To use the repository, install all packages from requirements.txt:
pip install -r 'requirements.txt'
All three kinds of training (contrastive, supervised, and pretrained) are run through the main.py script by passing the required flags.
- The --mode flag switches between contrastive, supervised, and pretrained
- The --config flag selects the baseline configuration
- Refer to the respective model_configs.json for all available configurations. They can be found in the models folder, e.g. contrastive/models/model_configs.json
Example
python main.py --mode=contrastive --config=Pairwise_LARS
To override a parameter of the baseline configuration, pass its name as a flag. For instance, to set the number of epochs to 15 while leaving everything else unchanged, use the following command:
python main.py --mode=contrastive --config=Pairwise_LARS --epochs=15
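The override behaviour described above can be pictured with a small sketch. It is not the code in main.py: the configuration values, the `batch_size` key, and the function name `resolve_config` are assumptions made for illustration, and only the flag names --mode, --config, and --epochs come from this README.

```python
import argparse
import json

# Hypothetical stand-in for one entry of a model_configs.json file.
# The real keys and values in the repository may differ.
EXAMPLE_CONFIGS = json.loads('{"Pairwise_LARS": {"epochs": 10, "batch_size": 256}}')


def resolve_config(argv, configs=EXAMPLE_CONFIGS):
    """Return (mode, settings) with command-line overrides applied."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mode",
        choices=["contrastive", "supervised", "pretrained"],
        required=True,
    )
    parser.add_argument("--config", required=True)
    # Flags the parser does not know are collected in `extra`.
    known, extra = parser.parse_known_args(argv)

    # Start from a copy of the baseline configuration.
    settings = dict(configs[known.config])
    for item in extra:
        name, _, raw = item.lstrip("-").partition("=")
        if name not in settings:
            raise KeyError(f"unknown parameter: {name}")
        # Cast to the type of the baseline value (fine for int, float, str;
        # booleans would need explicit handling).
        settings[name] = type(settings[name])(raw)
    return known.mode, settings


if __name__ == "__main__":
    print(resolve_config(
        ["--mode=contrastive", "--config=Pairwise_LARS", "--epochs=15"]
    ))
```

In this sketch only `epochs` changes, and every other value keeps its baseline setting.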
To search for good hyperparameters, the script in the sweeping folder runs sweeps with Weights and Biases. Its usage is similar to that of the main.py script. Inside the sweeping folder, execute the following command:
python sweeping.py --mode=contrastive
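For readers unfamiliar with Weights and Biases sweeps, a sweep is described by a configuration dictionary with a search method, a metric to optimise, and a parameter space. The sketch below only builds such a dictionary. The metric name, the parameter names, their ranges, and the function name `build_sweep_config` are assumptions for illustration and are not taken from sweeping.py.

```python
def build_sweep_config(mode):
    """Build an illustrative Weights and Biases sweep configuration."""
    return {
        "name": f"{mode}-sweep",
        "method": "bayes",  # other supported methods are "grid" and "random"
        "metric": {"name": "val_loss", "goal": "minimize"},  # assumed metric
        "parameters": {
            # Hypothetical hyperparameters; the real sweep may tune others.
            "learning_rate": {"min": 1e-5, "max": 1e-2},
            "epochs": {"values": [5, 10, 15]},
        },
    }


if __name__ == "__main__":
    config = build_sweep_config("contrastive")
    print(config)
    # With wandb installed and a training function `train` defined,
    # the sweep would typically be started like this:
    #   import wandb
    #   sweep_id = wandb.sweep(sweep=config, project="my-project")
    #   wandb.agent(sweep_id, function=train, count=20)
```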
The model cards are part of the README in each folder and give an overview of the respective models.
The F1-scores, precision, and recall values for each model can be found in the evaluation folder. The columns correspond to the following datasets, which are available on request via HuggingFace:
- Val = Custom validation dataset
- Test = Custom test dataset
- noObf = No obfuscation subset of PAN-13
- randomObf = Random obfuscation subset of PAN-13
- translationObf = Translation obfuscation subset of PAN-13
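As a reminder of how the reported metrics relate to each other, the following sketch computes precision, recall, and F1 for a binary task such as paraphrase detection. It assumes that label 1 marks the positive (paraphrase) class. The function name is illustrative, and the repository's own evaluation code may compute these values differently, for example with a library.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # One true positive, one false positive, one false negative.
    print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```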