A privacy policy serves as an online internet protocol crafted by service providers, which details how service providers collect, process, store, manage, and use personal information when users engage with applications.
However, these privacy policies are often filled with technobabble and legalese, making them 'incomprehensible'.
As a result, users often agree to all terms unknowingly, even some terms may conflict with the law, thereby posing a considerable risk to personal privacy information.
To tackle these challenges, we introduce a fine-grained CAPP-130 corpus and a TCSI-pp framework.
CAPP-130 contains
Project dependencies can be installed in the following ways:
pip install -r requirements.txt
Equipment: A100 *2
CAPP-130 contains
The guide for Paper and Annotation Guidelines (Chinese version, English version) explains the tags and the process of annotation, which can be found in the Documents. Currently, the Annotation Guidelines are available only in Chinese, but we are working on translating them into English. Table 1 shows the basic statistical information of CAPP-130, and Table 2 shows the pre-sliced data information used for TCSI-pp. They are stored in the CAPP-130 Corpus.
Table 1: Basic Statistics of Corpus CAPP-130.
Data Practice Categories | Quantity | Percentage (%) | Median | Mea |
---|---|---|---|---|
Information Collection | 6967 | 17.9 | 58 | 70 |
Permission Acquisition | 1852 | 4.8 | 54 | 62 |
Sharing and Disclosure | 4740 | 12.2 | 52 | 63 |
Usage | 3589 | 9.2 | 64 | 75 |
Storage | 1360 | 3.5 | 41 | 46 |
Security Measures | 3000 | 7.7 | 53 | 60 |
Special Audiences | 1416 | 3.6 | 54 | 60 |
Management | 5324 | 13.7 | 43 | 49 |
Contact Information | 712 | 1.8 | 41 | 54 |
Authorization and Revisions | 1049 | 2.7 | 35 | 43 |
Cessation of Operations | 110 | 0.3 | 64 | 68 |
Important | 20555 | 52.8 | 52 | 61 |
Risks | 1815 | 4.7 | 40 | 46 |
Table 2: The pre-sliced data from CAPP-130 is used to train TCSI-pp.
sub dataset | train samples | validation samples | test samples |
---|---|---|---|
important_identification_dataset | 27222 | 5833 | 5834 |
risk_identification_dataset | 14338 | 3083 | 3084 |
topic_identification_dataset | 14190 | 3043 | 3035 |
rewritten_sentences | 15656 | 1957 | 1957 |
we provide a Topic-Controlled Framework for Summarization and Interpretation of Privacy Policy (TCSI-pp). Unlike previous methods that only extract specific sentences, TCSI-pp first retrieves relevant sentences based on the topics chosen from data practice categories by users using a classification model. Then, a generative model is used to rewrite these sentences clearly and concisely for the understanding of the general public, with potentially risky sentences emphasized.
These are specifically utilized for binary classification models such as "Important Identification" and "Risk Identification", as well as multi-classification models like "Topic Identification".
The model is placed in the XXX_pretain (where XXX is the model name) directory and each directory contains three files:
- pytorch_model.bin
- bert_config.json
- vocab.txt
Pre-trained model download address from here.
After decompression, put it in the corresponding directory according to the above, and confirm the file name is correct.
We have independently acquired three sets of classification benchmarks from six different models: RoBERTa, BERT, mBERT, sBERT, Pert, and ERNIE.
You can be used in the following ways:
# Train and test binary classification model:
python run.py --model 'model_name' --data 'data_name'
# Train and test multi-classification model:
python run_multi.py --model 'model_name' --data 'data_name'
Please note that the above code examples are for illustrative purposes only and you may need to make appropriate adjustments based on your specific situation.
We provide classification baselines for "Important Identification", "Risk Identification", and "Topic Identification". They are respectively trained and tested on the 'important_identification_dataset', 'risk_identification_dataset', and 'topics_identification_dataset' in the sub-dataset. Table 3 displays the evaluation metrics of six models.
Table 3: Evaluation Metrics with F1 for Classification Models.
Methods | topic-Micro | topic-Macro | important-Micro | important-Macro | risk-Micro | risk-Macro |
---|---|---|---|---|---|---|
RoBERTa | 0.819 | 0.841 | 0.897 | 0.899 | 0.920 | 0.711 |
Bert | 0.802 | 0.820 | 0.895 | 0.896 | 0.921 | 0.719 |
mBERT | 0.809 | 0.821 | 0.889 | 0.889 | 0.918 | 0.709 |
SBERT | 0.781 | 0.794 | 0.875 | 0.874 | 0.917 | 0.689 |
PERT | 0.801 | 0.812 | 0.895 | 0.897 | 0.922 | 0.716 |
ERNIE | 0.807 | 0.821 | 0.895 | 0.896 | 0.921 | 0.702 |
(New) We will upload all model parameters to here.
A generative model is used to rewrite these sentences clearly and concisely for the understanding of the general public, with potentially risky sentences emphasized.
For rewriting sentences, we fine-tuned the following models based on the transformer encoder-decoder architecture: mT5, Bert2Bert, Bert2gpt, RoBerta2gpt, and ERNIE2gpt. These models were initialized with parameters from publicly available models, such as mT5-small, Bert-base-Chinese, ernie-3.0-base-zh, chinese-roberta-wwm-ext, and gpt2-base-chinese. These models can be found on Hugging Face model repository.
You can be used in the following ways:
# train and test:
python model_name.py
#The model_name needs to be changed to mT5, Bert2Bert, Bert2gpt, RoBerta2gpt, or ERNIE2gpt.
Please note that the above code examples are for illustrative purposes only and you may need to make appropriate adjustments based on your specific situation.
Table 4 displays the ROUGE, Bert-score, Bart-score, and Carburacy evaluation metrics for these models:
Table 4: Evaluation metrics for the rewrite models.
Methods | rouge-1 | rouge-2 | rouge-l | Bert-score | Bart-score | Carburacy |
---|---|---|---|---|---|---|
mT5 | 0.753 | 0.609 | 0.733 | 0.888 | -4.577 | 0.833 |
RoBERTa2gpt | 0.749 | 0.577 | 0.719 | 0.872 | -4.975 | 0.755 |
Bert2bert | 0.718 | 0.535 | 0.689 | 0.864 | -5.020 | 0.747 |
Bert2gpt | 0.751 | 0.574 | 0.720 | 0.872 | -4.964 | 0.764 |
ERNIE2gpt | 0.623 | 0.406 | 0.581 | 0.809 | -5.716 | 0.715 |
(New) We will upload all model parameters to here.
we select the most effective RoBERTa and mT5 to implement the Chinese application privacy policy summary tool (TCSI-pp-zh). Experiments on real privacy policies show that TCSI-pp-zh is superior over GPT-4 and other models, demonstrating higher readability and reliability in the task of summarizing Chinese application privacy policies.
You can be used in the following ways:
# train and test:
python ./TCSI_pp_zh/TCSI_pp_zh.py --binary_model 'binary_model_name' --multi_model 'multi_model_name' --rewrite_model 'rewrite_model_name' --topic_list 'choose_a_topic_list' --data 'a_privacy_policy'
Please note that the above code examples are for illustrative purposes only and you may need to make appropriate adjustments based on your specific situation.
Figure 1 displays the summarization of GPT-4 and TCSI-pp-zh in a privacy policy, where text having the same background color represents descriptions of the same part of the privacy policy generated by different algorithms; red text emphasizes incorrect content produced in the summary.
Figure 1: Summarization of GPT-4 and TCSI-pp-zh.
If you use the data or code of this project, or if our work is helpful to you, please state the citation
@inproceedings{zhu2023capp,
title={CAPP-130: a corpus of chinese application privacy policy summarization and interpretation},
author={Zhu, Pengyun and Wen, Long and Liu, Jinfei and Xue, Feng and Lou, Jian and Wang, Zhibo and Ren, Kui},
booktitle={Proceedings of the 37th International Conference on Neural Information Processing Systems},
pages={46773--46785},
year={2023}
}
We will continue to update this repository on GitHub.