- [14/04/2024] ⭐ We revised the data structure and updated the Hugging Face version of our dataset.
- [15/03/2024] 🔥 MixSet is accepted to NAACL'24 Findings!
- [11/01/2024] 🌊 Our paper and dataset are released!
The MixSet dataset is a comprehensive collection designed for advanced Machine Learning experiments. It's structured to support a variety of tasks including MGT classification in the era of LLMs, natural language understanding, and more.
The dataset is located in the ./data/MixSet/ directory relative to the project's root. Ensure that this path exists and contains the necessary data files before running any scripts that depend on the MixSet dataset.
Please refer to ./data/MixSet/README.md for our MixSet data structure and how to leverage our dataset with ease.
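The path check described above can be sketched in a few lines of Python. This is a minimal sketch, not part of the repository: the function name `check_mixset_dir` is hypothetical, and it only verifies that the directory exists and holds at least one file, since the exact file names are documented in ./data/MixSet/README.md rather than here.

```python
from pathlib import Path


def check_mixset_dir(project_root="."):
    """Return the files found in <project_root>/data/MixSet/.

    Hypothetical helper: raises FileNotFoundError if the directory is
    missing or contains no files. It does not validate file contents.
    """
    data_dir = Path(project_root) / "data" / "MixSet"
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Expected MixSet directory at {data_dir}")
    files = sorted(p for p in data_dir.iterdir() if p.is_file())
    if not files:
        raise FileNotFoundError(f"{data_dir} exists but contains no files")
    return files
```

Calling `check_mixset_dir()` from the project root before launching an experiment script surfaces a missing dataset early instead of partway through a run.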
- Python = 3.9
- Other dependencies specified in requirements.txt
To set up your environment to run the code, follow these steps:
- Clone the Repository:
git clone https://github.com/Dongping-Chen/MixSet.git
cd MixSet
- Create and Activate a Virtual Environment (optional but recommended) and Install the Required Packages:
conda create --name mixset python=3.9
conda activate mixset
pip install -r requirements.txt
- Download Datasets: To download the pure MGT and HWT datasets, please refer to this link, then move the dataset folders to <YOUR PATH>/MixSet/data/MGT_datasets/ and <YOUR PATH>/MixSet/data/pure_processed_HWT/.
- Download Checkpoints of GPT-Sentinel: Download the pre-trained GPT-Sentinel t5-small checkpoint by following the instructions here: download t5-small.0422.pt and put it in <YOUR PATH>/MixSet/.
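Once the downloads are in place, the layout from the setup steps can be verified with a short script. The following is a sketch under stated assumptions: the helper name `missing_prerequisites` is hypothetical, and the relative paths are copied from the steps above, with the repository root standing in for <YOUR PATH>/MixSet/.

```python
from pathlib import Path

# Relative paths taken from the setup steps above.
EXPECTED_PATHS = (
    "data/MGT_datasets",
    "data/pure_processed_HWT",
    "t5-small.0422.pt",
)


def missing_prerequisites(repo_root="."):
    """Return the expected paths that do not exist under repo_root.

    Hypothetical helper: it checks existence only, not whether the
    downloaded folders or the checkpoint are complete.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

An empty list from `missing_prerequisites()` means all three items were found; otherwise the returned entries name what still needs to be downloaded or moved.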
To reproduce the first experiment, run:
./Ex1_run.sh
To run GPT-Zero, use:
./Ex1_run_GPTzero
As for Ghostbuster, we will update the code as soon as possible.
To reproduce the second experiment for binary classification, run:
./Ex2_binary_run
To reproduce the second experiment for three-class classification, run:
./Ex2_three_class_run
To reproduce the third experiment for operation-wise transfer learning, run:
./Ex3_operation_train.sh
./Ex3_operation_test.sh
To reproduce the third experiment for LLM-wise transfer learning, run:
./Ex3_LLM_transfer.sh
Please be aware that the scripts for Experiments 3 and 4 save trained checkpoints to disk, which may occupy more than 20GB of space. Check that you have enough free storage before running these scripts; running out of space may interrupt execution.
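Free space can be checked programmatically before starting a long run. This is a minimal sketch using the standard library's `shutil.disk_usage`; the helper name `has_enough_space` is hypothetical, and the 20GB figure is the lower bound quoted above, so real usage may exceed it.

```python
import shutil

REQUIRED_GB = 20  # lower bound quoted above; actual usage may be higher


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return (ok, free_gb) for the filesystem that contains `path`.

    Hypothetical helper: `ok` is True when at least `required_gb`
    gigabytes (GiB, 1024**3 bytes) are free.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb, free_gb
```

Pass the directory where checkpoints will be written as `path`, since it may live on a different filesystem than the repository.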
To reproduce the fourth experiment for the ablation study, run:
./Ex4_auto_train.sh
./Ex4_auto_test.sh
Below are the parameters used in the script along with their descriptions:
- --Mixcase_filename: Specifies the filename for the MixText data. Default is None.
- --MGT_only_GPT: If set, the script will only use MGT (Model Generated Text) from GPT-family models.
- --test_only: If set, the script will only perform testing, skipping any training procedures.
- --train_threshold: Specifies the threshold for training. Default is 10000.
- --no_auc: If set, the script will only calculate the MixText scenarios, which means no Area Under the ROC Curve (AUC) metrics.
- --only_supervised: If set, the script will perform only supervised learning without any unsupervised techniques.
- --train_with_mixcase: If set, the script will include MixText data in the training process.
- --seed: Sets the seed for random number generation to ensure reproducibility. Default is 0.
- --ckpt_dir: Specifies the directory to save checkpoints. Default is "./ckpt".
- --log_name: Specifies the name of the log file. Default is 'Log'.
- --mixcase_threshold: Sets the threshold for considering data as MixText. Default is 0.8.
- --transfer_filename: Specifies the filename for transfer learning. Default is None.
- --three_classes: If set, the script will use a three-class classification scheme instead of binary classification.
- --finetune: If set, the script will fine-tune the supervised model.
- --mixcase_as_mgt: If set, MixText data will be treated as Model Generated Text (MGT).
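To make the flag semantics concrete, the parameters above could be declared with `argparse` roughly as follows. This is an illustrative sketch, not the repository's actual parser: the flag names and defaults come from the list above, while the argument types and the use of `store_true` for the "If set" flags are assumptions, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (types are assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of the experiment options")
    # Options with documented defaults.
    parser.add_argument("--Mixcase_filename", type=str, default=None)
    parser.add_argument("--train_threshold", type=int, default=10000)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--ckpt_dir", type=str, default="./ckpt")
    parser.add_argument("--log_name", type=str, default="Log")
    parser.add_argument("--mixcase_threshold", type=float, default=0.8)
    parser.add_argument("--transfer_filename", type=str, default=None)
    # "If set" switches, assumed to be boolean flags that default to False.
    for flag in (
        "--MGT_only_GPT",
        "--test_only",
        "--no_auc",
        "--only_supervised",
        "--train_with_mixcase",
        "--three_classes",
        "--finetune",
        "--mixcase_as_mgt",
    ):
        parser.add_argument(flag, action="store_true")
    return parser
```

For example, `build_parser().parse_args(["--three_classes", "--seed", "42"])` yields a namespace with `three_classes` set to True and `seed` set to 42, while every other option keeps its documented default.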
For any issues, questions, or suggestions related to the MixSet dataset, feel free to contact me or open an issue in the project's repository.
Part of the code is borrowed from MGTBench. The corresponding author Lichao Sun is supported by the National Science Foundation Grant CRII-2246067.
@misc{zhang2024llmasacoauthor,
title={LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?},
author={Qihui Zhang and Chujie Gao and Dongping Chen and Yue Huang and Yixin Huang and Zhenyang Sun and Shilin Zhang and Weiye Li and Zhengyan Fu and Yao Wan and Lichao Sun},
year={2024},
eprint={2401.05952},
archivePrefix={arXiv},
primaryClass={cs.CL}
}