Code for KDD 2020 paper Robust Spammer Detection by Nash Reinforcement Learning.
Yingtong Dou, Guixiang Ma, Philip S. Yu, Sihong Xie.
[Paper][Slides][Video][Toolbox][Chinese Blog]
Nash-Detect is an algorithm proposed by the above paper to train a robust spam review detector using reinforcement learning. The robust detector is composed of five base detectors and is trained through playing a minimax game between the spammer and the defender. There are five base spamming strategies used by the spammer to synthesize the mixed spamming strategy.
This repo includes the spamming attack implementation and generation code, the detector implementation code, and the training & testing code for Nash-Detect and all baselines.
Note that we only investigate the shallow graph and behavior-based spam detectors in this paper; there is no text or deep neural network involved. Nonetheless, there is no hurdle to apply Nash-Detect to train robust neural networks or text-based spam detectors.
To run the code, you need the Yelp Spam Review Datasets. Please send email with the title Yelp Dataset Request
to ytongdou@gmail.com to download the file with metadata and ground truth. You can unzip the dataset file under the root directory of the project.
You can download the project and install required packages using following commands:
git clone https://github.com/YingtongDou/Nash-Detect.git
cd Nash-Detect
pip3 install -r requirements.txt
To run the code, you need to have Python 3.6 or later version.
- Run
attack_generation.py
withmode = "Training"
to generate fake reviews for training - Run
worst_case.py
to compute the worst-case performance of single attacks vs. single detectors - Run
training.py
to train a robust detector configuration using Nash-Detect - Run
attack_generation.py
withmode = "Testing"
to generate fake reviews for testing - Run
testing.py
to test the performance of the optimal detector trained by Nash-Detect and other baselines
To facilitate the training and testing, we have stored all generated fake reviews in directories /Training
and /Testing
. So you can skip Steps 1 and 4 to play the game and evaluation code directly. Moreover, you can play each single detector using the eval_XXX.py
under the /Detector
repository or using our UGFraud toolbox.
The experimental settings and model parameters can be found at the beginning of the main
functions of training.py
and testing.py
.
The repository is organized as follows:
Attack/
contains the implementations of four spamming attack strategies, theSingleton
attack is implemented inattack_generation.py
;Detector/
contains the implementations and evaluations of five spam detectors;Testing/
contains generated fake reviews for testing;Training/
contains generated fake reviews for training;Utils/
contains:- functions for loading graphs/features from dataset/manifest files (
iohelper.py
); - utility functions for training and testing (
eval_helper.py
); - functions for extracting and updating features and prior beliefs (
yelpFeatureExtraction.py
); - the manifest file for features (
feature_configuration.py
).
- functions for loading graphs/features from dataset/manifest files (
@inproceedings{dou2020robust,
title={Robust Spammer Detection by Nash Reinforcement Learning},
author={Dou, Yingtong and Ma, Guixiang and Yu, Philip S and Xie, Sihong},
booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
year={2020}
}