Text -> word embedding (word2vec) -> convolution -> max pooling -> sentence feature + extra feature layer -> softmax
- word embedding: trained with word2vec on ~50 million (5kw) Weibo messages, filtered by a minimum length of 10 words (gensim sketch below)
- conv-net: multiple filter sizes; each filter yields one feature through max-pooling (model sketch after this list)
- keyphrase extraction: for each filter size, take the phrase most often selected by max-pooling
- feature combination: 300 sentence-level features concatenated with sentiment and word-entity features
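The components above follow the standard convolution + max-over-time-pooling text-CNN pattern. As a hedged illustration only (the repository's actual model code is not reproduced here), a minimal PyTorch sketch of the forward pass; all layer names and sizes are assumptions, chosen so that 100 feature maps x 3 filter widths give the 300 sentence-level features mentioned above:

```python
# Sketch of the described architecture (not the repository's code).
# Assumed sizes: 300-d word2vec embeddings, filter widths 3/4/5,
# 100 feature maps per width, plus an extra sentiment/entity feature vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RumorCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, filter_widths=(3, 4, 5),
                 n_maps=100, n_extra=20, n_classes=2, dropout=0.5):
        super().__init__()
        # index 0 is the NULL/padding word
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_maps, w) for w in filter_widths])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(n_maps * len(filter_widths) + n_extra, n_classes)

    def forward(self, word_ids, extra_feats):
        # word_ids: (batch, max_length), extra_feats: (batch, n_extra)
        x = self.emb(word_ids).transpose(1, 2)        # (batch, emb_dim, len)
        # each filter contributes one feature via max-over-time pooling
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        sent_feat = torch.cat(pooled, dim=1)          # 300 sentence-level features
        feats = torch.cat([sent_feat, extra_feats], dim=1)
        return self.fc(self.dropout(feats))           # softmax applied in the loss
```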
preprocessing: CNNPreprocess.java; extra features: WeiboFeature/WeiboFeatureExtrator.java
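For reference, the pre-trained word2vec file consumed by the preprocessing script below could be produced as in the following gensim sketch. This is an assumed setup, not the tool actually used for the Weibo embeddings; corpus path, dimensionality, and the length filter are illustrative:

```python
# Sketch: train word2vec on segmented Weibo text and save a vector file.
# Paths and parameters are assumptions.
from gensim.models import Word2Vec

def load_corpus(path, min_len=10):
    # one segmented message per line; drop messages shorter than min_len words
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            if len(words) >= min_len:
                yield words

sentences = list(load_corpus("weibo_segmented.txt"))
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5,
                 sg=1, workers=8)
model.wv.save_word2vec_format("weibo_word2vec.bin", binary=True)
```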
- process_data_rumor.py
- input: word2vec file (pre-trained on a large-scale dataset), pkfile, nfold
- data_folder: Weibo messages, already word-segmented
- extra_fea: features selected by information gain (IG), stored as (mid, feature) pairs
- words not in word2vec are initialized from uniform(-0.25, 0.25)
- vocabulary: index 0 is reserved for NULL (also used to pad null words); other words start from 1
- output pkfile: sentences, word2vec, random_vectors, word->id, vocab, id->word (construction sketched below)
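A minimal sketch of the vocabulary and embedding-matrix construction described above: index 0 reserved for NULL/padding, pre-trained vectors where available, and uniform(-0.25, 0.25) vectors for out-of-vocabulary words. Function and variable names are illustrative, not the actual ones in process_data_rumor.py:

```python
# Sketch of the vocabulary / embedding-matrix construction (names illustrative).
import numpy as np

def build_embeddings(vocab, w2v, dim=300):
    """vocab: iterable of words; w2v: dict mapping word -> pre-trained vector."""
    word_to_id = {"<NULL>": 0}               # 0 reserved for NULL / padding
    vectors = [np.zeros(dim)]                # padding row
    for i, word in enumerate(sorted(vocab), start=1):
        word_to_id[word] = i
        if word in w2v:
            vectors.append(w2v[word])
        else:
            # words missing from word2vec get uniform(-0.25, 0.25) vectors
            vectors.append(np.random.uniform(-0.25, 0.25, dim))
    id_to_word = {i: w for w, i in word_to_id.items()}
    return np.asarray(vectors, dtype="float32"), word_to_id, id_to_word
```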
- the first pass must compute the maximum sentence length (max_length) and pass it to the CNN; sentences are padded to this length (see the sketch below)
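Padding to max_length can be as simple as the following (illustrative only; the padding index 0 matches the NULL word above):

```python
# Sketch: pad word-id sequences with the NULL index (0) up to max_length.
def pad_sentences(id_seqs):
    max_length = max(len(s) for s in id_seqs)
    return [s + [0] * (max_length - len(s)) for s in id_seqs], max_length
```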
- mini-batch training; each epoch uses only 90% of the data for training
- mini-batches are shuffled
- weight initialization
- Adadelta for weight updates
- dropout on the hidden layer (training-loop sketch below)
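Putting the training notes together, a hedged PyTorch sketch of one possible training loop: per-epoch shuffling, a 90/10 split, mini-batches, Adadelta updates, and dropout (already inside the model sketch above). Batch size, epoch count, optimizer settings, and the exact split strategy are assumptions:

```python
# Sketch of the training loop described above (hyperparameters are assumptions).
# word_ids: LongTensor (N, max_length); extra_feats: FloatTensor (N, n_extra);
# labels: LongTensor (N,).
import torch
import torch.nn.functional as F

def train(model, word_ids, extra_feats, labels, epochs=25, batch_size=50):
    opt = torch.optim.Adadelta(model.parameters(), rho=0.95, eps=1e-6)
    n = word_ids.size(0)
    n_train = int(0.9 * n)                      # 90% used for training each epoch
    for epoch in range(epochs):
        perm = torch.randperm(n)                # shuffle before batching
        train_idx, dev_idx = perm[:n_train], perm[n_train:]
        model.train()                           # enables dropout
        for start in range(0, n_train, batch_size):
            batch = train_idx[start:start + batch_size]
            logits = model(word_ids[batch], extra_feats[batch])
            loss = F.cross_entropy(logits, labels[batch])
            opt.zero_grad()
            loss.backward()
            opt.step()
        model.eval()                            # dropout off for evaluation
        with torch.no_grad():
            dev_pred = model(word_ids[dev_idx], extra_feats[dev_idx]).argmax(dim=1)
            dev_acc = (dev_pred == labels[dev_idx]).float().mean().item()
        print(f"epoch {epoch}: dev acc {dev_acc:.3f}")
```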