Commit 826d0ba

Merge pull request #1 from brightmart/master: merge update from upstream
2 parents 68e2fcf + a01c5ab, commit 826d0ba

37 files changed: +91,648 / -180 lines

README.md

+64 -13
@@ -1,6 +1,15 @@
Text Classification
-------------------------------------------------------------------------
The purpose of this repository is to explore text classification methods in NLP with deep learning.

UPDATE:

1. <a href='https://github.com/brightmart/ai_law'>Apply AI to law cases (AI_LAW): predict the names of crimes (accusations) and the relevant articles given the facts of law cases</a> has been released.

2. <a href='https://github.com/brightmart/nlu_sim'>The sentence similarity project has been released</a>; check it out if you like.

3. If you want to try a model now, go to the folder 'a02_TextCNN' and run 'python -u p7_TextCNN_train.py'; it will use sample data to train a model and print the loss and F1 score periodically.

It has all kinds of baseline models for text classification.

@@ -20,7 +29,9 @@ we implement two memory network. one is dynamic memory network. previously it re

The second memory network we implemented is the Recurrent Entity Network (tracking the state of the world). It keeps blocks of key-value pairs as memory and runs them in parallel, which achieves a new state of the art. It can be used to model question answering with context (or history): for example, you can let the model read some sentences (as context) and ask a question (as query), then ask the model to predict an answer; if you feed the same story as the query, it can do a classification task.

If you need some sample data and word embeddings pretrained with word2vec, you can find them in the closed issues, such as <a href="https://github.com/brightmart/text_classification/issues/3">issue 3</a>.

You can also find some sample data in the folder "data". It contains two files: 'sample_single_label.txt' with 50k examples with a single label, and 'sample_multiple_label.txt' with 20k examples with multiple labels. The input and label(s) are separated by " __label__".
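
For illustration, here is a minimal sketch of splitting one line of this format into tokens and labels (the line itself is a made-up example):

```python
# the text and its label(s) are separated by "__label__"
line = "w1 w2 w3 w4 w5 __label__19 __label__7"           # made-up example line
parts = line.strip().split("__label__")
tokens = parts[0].split()                                # ['w1', 'w2', 'w3', 'w4', 'w5']
labels = [p.strip() for p in parts[1:] if p.strip()]     # ['19', '7']
```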

If you want to know more about text classification datasets, or the tasks these models can be used for, one option is:
https://biendata.com/competition/zhihu/
@@ -39,9 +50,10 @@ Models:

8) Dynamic Memory Network
9) EntityNetwork: tracking the state of the world
10) Ensemble models
11) Boosting:

For a single model, stack identical models together: each layer is a model, and the result is based on the logits added together. The only connection between layers is the labels' weights: the front layer's prediction error rate for each label becomes the weight for the next layer, so labels with a high error rate get a big weight. Later layers therefore pay more attention to those mis-predicted labels and try to fix the mistakes of the former layer. As a result, we get a much stronger model.

check a00_boosting/boosting.py

and other models:

@@ -70,21 +82,21 @@ Training| 10m | 2h |10h | 2h | 2h |3h |3h |5h

Notice:

`m` stands for **minutes**; `h` stands for **hours**.

`HierAtteNet` means Hierarchical Attention Network;

`Seq2seqAttn` means Seq2seq with attention;

`DynamicMemory` means DynamicMemoryNetwork;

`Transformer` stands for the model from 'Attention Is All You Need'.

Usage:
-------------------------------------------------------------------------------------------------------
1) the model is in `xxx_model.py`
2) run `python xxx_train.py` to train the model
3) run `python xxx_predict.py` to do inference (test)

Each model has a test method under the model class. You can run the test method first to check whether the model works properly.

@@ -94,7 +106,9 @@ Environment:

-------------------------------------------------------------------------------------------------------
python 2.7 + tensorflow 1.1

(tensorflow 1.2, 1.3 and 1.4 also work; most models should also work fine with other tensorflow versions, since we use very few features bound to a specific version; if you use python 3.5, it will be fine as long as you change the print / try-except syntax)

The TextCNN model has already been ported to python 3.6.

-------------------------------------------------------------------------

@@ -104,6 +118,19 @@ Some util function is in data_util.py;

typical input is like: "x1 x2 x3 x4 x5 __label__ 323434", where 'x1, x2, ...' are words and '323434' is the label;
it has a function to load a pretrained word embedding and assign it to the model, where the word embedding is pretrained with word2vec or fastText.

Pretrained Word Embedding:
-------------------------------------------------------------------------------------------------------
if word2vec.load does not work, you can load a pretrained word embedding (especially a Chinese word embedding) with the following lines:

    import gensim
    from gensim.models import KeyedVectors

    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=True, unicode_errors='ignore')

or you can set the "use pretrained word embedding" flag to False to disable loading the word embedding.
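
For context, here is a minimal sketch of how an embedding matrix built from the loaded gensim model could be assigned to a TensorFlow 1.x embedding variable; the function name, the index-to-word dict and the embedding size are assumptions for illustration, not the exact code used in this repository:

```python
import numpy as np
import tensorflow as tf

def assign_pretrained_embedding(sess, embedding_var, index2word, word2vec_model, embed_size=100):
    """Build an embedding matrix from word2vec_model and assign it to embedding_var (hypothetical helper)."""
    vocab_size = len(index2word)
    embedding_matrix = np.random.uniform(-0.1, 0.1, (vocab_size, embed_size)).astype(np.float32)
    for i, word in index2word.items():
        if word in word2vec_model:                 # keep the random init for out-of-vocabulary words
            embedding_matrix[i] = word2vec_model[word]
    sess.run(tf.assign(embedding_var, embedding_matrix))  # overwrite the model's embedding variable
```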

Models Detail:
-------------------------------------------------------------------------

@@ -119,6 +146,7 @@ result: performance is as good as paper, speed also very fast.

check: p5_fastTextB_model.py

![alt text](https://github.com/brightmart/text_classification/blob/master/images/fastText.JPG)
-------------------------------------------------------------------------

2.TextCNN:
@@ -141,15 +169,25 @@ Thirdly, we will concatenate scalars to form final features. It is a fixed-size

Finally, we use a linear layer to project these features onto the pre-defined labels.

![alt text](https://github.com/brightmart/text_classification/blob/master/images/TextCNN.JPG)
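
As a rough illustration of this structure (convolutions with several filter sizes, max-pooling over time, concatenation, then a linear projection to the labels), here is a minimal TensorFlow 1.x sketch; the names and hyper-parameters are assumptions, not the exact code in the TextCNN model file:

```python
import tensorflow as tf

def text_cnn_logits(embedded_words, filter_sizes, num_filters, num_classes):
    """embedded_words: [batch, sentence_len, embed_size]; returns logits: [batch, num_classes]."""
    sentence_len = embedded_words.get_shape()[1].value
    embed_size = embedded_words.get_shape()[2].value
    x = tf.expand_dims(embedded_words, -1)                        # [batch, len, embed, 1]
    pooled = []
    for filter_size in filter_sizes:
        with tf.variable_scope("conv-%d" % filter_size):
            w = tf.get_variable("W", [filter_size, embed_size, 1, num_filters])
            conv = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="VALID")
            h = tf.nn.relu(conv)                                  # [batch, len-f+1, 1, num_filters]
            p = tf.nn.max_pool(h, ksize=[1, sentence_len - filter_size + 1, 1, 1],
                               strides=[1, 1, 1, 1], padding="VALID")
            pooled.append(p)                                      # each: [batch, 1, 1, num_filters]
    features = tf.reshape(tf.concat(pooled, 3), [-1, num_filters * len(filter_sizes)])
    logits = tf.layers.dense(features, num_classes)               # linear projection to the labels
    return logits
```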

-------------------------------------------------------------------------

3.TextRNN
-------------
Structure v1: embedding ---> bi-directional lstm ---> concat output ---> average ---> softmax layer

check: p8_TextRNN_model.py

![alt text](https://github.com/brightmart/text_classification/blob/master/images/bi-directionalRNN.JPG)

Structure v2: embedding --> bi-directional lstm --> dropout --> concat output --> lstm --> dropout --> FC layer --> softmax layer

check: p8_TextRNN_model_multilayer.py

![alt text](https://github.com/brightmart/text_classification/blob/master/images/emojifier-v2.png)
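
Here is a minimal TensorFlow 1.x sketch of structure v1 (bi-directional LSTM, concatenate the two directions, average over time, then a linear layer whose logits feed a softmax); names and sizes are assumptions, not the exact code in p8_TextRNN_model.py:

```python
import tensorflow as tf

def text_rnn_logits(embedded_words, hidden_size, num_classes):
    """embedded_words: [batch, sentence_len, embed_size]; returns logits: [batch, num_classes]."""
    cell_fw = tf.contrib.rnn.BasicLSTMCell(hidden_size)
    cell_bw = tf.contrib.rnn.BasicLSTMCell(hidden_size)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, embedded_words, dtype=tf.float32)
    output = tf.concat([out_fw, out_bw], axis=2)     # [batch, sentence_len, 2*hidden_size]
    feature = tf.reduce_mean(output, axis=1)         # average over time steps
    logits = tf.layers.dense(feature, num_classes)   # softmax is applied later, inside the loss
    return logits
```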

-------------------------------------------------------------------------

@@ -205,6 +243,7 @@ for left side context, it use a recurrent structure, a no-linearity transfrom of

check: p71_TextRCNN_model.py

![alt text](https://github.com/brightmart/text_classification/blob/master/images/RCNN.JPG)

-------------------------------------------------------------------------

@@ -226,6 +265,8 @@ Structure:

5) FC+Softmax

![alt text](https://github.com/brightmart/text_classification/blob/master/images/HAN.JPG)

In NLP, text classification can be done for a single sentence, but it can also be applied to multiple sentences; we may call that document classification. Words form sentences, and sentences form documents; in this situation, there may be an intrinsic hierarchical structure. So how can we model this kind of task? Are all parts of a document equally relevant? And how do we determine which parts are more important than others?

It has two unique features:
@@ -254,6 +295,8 @@ In my training data, for each example, i have four parts. each part has same len

check: p1_HierarchicalAttention_model.py

for attentive attention you can check <a href='https://github.com/brightmart/text_classification/issues/55'>attentive attention</a>

-------------------------------------------------------------------------

9.Seq2seq with attention
@@ -264,6 +307,8 @@ I.Structure:

1) embedding 2) bi-GRU to get a rich representation of the source sentence (forward & backward) 3) decoder with attention.

![alt text](https://github.com/brightmart/text_classification/blob/master/images/seq2seqAttention.JPG)

II.Input of data:

there are three kinds of inputs: 1) encoder inputs, which is a sentence; 2) decoder inputs, which is a list of labels with fixed length; 3) target labels, which is also a list of labels.
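
A made-up example of these three inputs for a tiny batch; the token ids and the GO/EOS convention are assumptions for illustration, not necessarily the exact convention used in this repository:

```python
# hypothetical ids for the special GO / EOS tokens, batch of size 2, sentence_len=5, label length=3
GO_ID, EOS_ID = 1, 2
encoder_inputs = [[4, 9, 2, 7, 0],        # a sentence, as word indices (0 = padding)
                  [5, 5, 1, 0, 0]]
decoder_inputs = [[GO_ID, 12, 33],        # labels list with fixed length, shifted right by a GO token
                  [GO_ID, 40, 8]]
target_labels  = [[12, 33, EOS_ID],       # the same labels, ending with an EOS token
                  [40, 8, EOS_ID]]
```
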
@@ -308,6 +353,8 @@ For every building blocks, we include a test function in the each file below, an

Sequence to sequence with attention is a typical model for sequence generation problems such as translation and dialogue systems. Most of the time it uses an RNN as the building block for these tasks. Until recently, people also applied convolutional neural networks to sequence-to-sequence problems. The Transformer, however, performs these tasks relying solely on the attention mechanism; it is fast and achieves a new state-of-the-art result.

![alt text](https://github.com/brightmart/text_classification/blob/master/images/attention_is_all_you_need.JPG)

It also has two main parts: encoder and decoder. Below is the description from the paper:

Encoder:
@@ -365,6 +412,8 @@ b. get weighted sum of hidden state using possibility distribution.

c. non-linearity transform of the query and the hidden states to get the predicted label.

![alt text](https://github.com/brightmart/text_classification/blob/master/images/EntityNet.JPG)
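
A minimal sketch of the answer module described in steps b and c (a probability distribution over the hidden states, their weighted sum, then a non-linear transform together with the query); the names and the exact way query and memory are combined are assumptions:

```python
import tensorflow as tf

def answer_module(query, hidden_states, num_classes):
    """query: [batch, dim]; hidden_states: [batch, num_blocks, dim]; returns logits: [batch, num_classes]."""
    scores = tf.reduce_sum(hidden_states * tf.expand_dims(query, 1), axis=2)   # [batch, num_blocks]
    p = tf.nn.softmax(scores)                                                  # possibility distribution over blocks
    u = tf.reduce_sum(hidden_states * tf.expand_dims(p, 2), axis=1)            # weighted sum of hidden states
    dim = query.get_shape()[-1].value
    hidden = tf.nn.relu(tf.layers.dense(tf.concat([query, u], axis=1), dim))   # non-linear transform of query + memory
    logits = tf.layers.dense(hidden, num_classes)                              # project to the labels
    return logits
```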

Main takeaways from this model:

1) use blocks of keys and values, which are independent from each other, so the model can be run in parallel.
@@ -391,6 +440,8 @@ Outlook of Model:

4.Answer Module: generate an answer from the final memory vector.

![alt text](https://github.com/brightmart/text_classification/blob/master/images/DMN.JPG)

Detail:

1.Input Module:

a00_boosting/a08_boosting.py

+69
@@ -0,0 +1,69 @@
# -*- coding: utf-8 -*-
import sys
reload(sys)                      # python 2: reset the default encoding
sys.setdefaultencoding('utf8')
import numpy as np               # needed for np.argmax / np.ones below
import tensorflow as tf

# main process for boosting:
# 1.compute label weights after each epoch using validation data.
# 2.get weights for each batch during the training process.
# 3.compute loss using cross entropy with weights.

# 1.compute label weights after each epoch using validation data.
def compute_labels_weights(weights_label, logits, labels):
    """
    compute weights for labels in the current batch, and update weights_label (a dict)
    :param weights_label: a dict mapping label -> (times_seen, times_correct)
    :param logits: [None, label_size]
    :param labels: [None,]
    :return: updated weights_label
    """
    labels_predict = np.argmax(logits, axis=1)  # predicted label for each example
    for i in range(len(labels)):
        label = labels[i]
        label_predict = labels_predict[i]
        weight = weights_label.get(label, None)
        if weight is None:                      # first time this label is seen
            if label_predict == label:
                weights_label[label] = (1, 1)
            else:
                weights_label[label] = (1, 0)
        else:
            number, correct = weight
            number = number + 1
            if label_predict == label:
                correct = correct + 1
            weights_label[label] = (number, correct)
    return weights_label

# 2.get weights for each batch during the training process
def get_weights_for_current_batch(answer_list, weights_dict):
    """
    get weights for the current batch
    :param answer_list: a numpy array containing the labels of a batch
    :param weights_dict: a dict that contains the accuracy of each label
    :return: a list of weights, one per example in the batch
    """
    weights_list_batch = list(np.ones((len(answer_list))))
    answer_list = list(answer_list)
    for i, label in enumerate(answer_list):
        acc = weights_dict[label]                               # validation accuracy of this label
        weights_list_batch[i] = min(1.5, 1.0 / (acc + 0.001))   # low accuracy --> big weight, capped at 1.5
    #if np.random.choice(200) == 0:  # print something from time to time
    #    print("weights_list_batch:", weights_list_batch)
    return weights_list_batch

# 3.compute loss using cross entropy with weights
def loss(logits, labels, weights):
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits, weights=weights)
    return loss

#######################################################################
# util function
def get_weights_label_as_standard_dict(weights_label):
    weights_dict = {}
    for k, v in weights_label.items():
        count, correct = v
        weights_dict[k] = float(correct) / float(count)
    return weights_dict
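
A minimal usage sketch of how the three steps above fit together in a training loop; `sess`, `model`, `train_batches` and `valid_batches` are hypothetical placeholders, not objects defined in this file:

```python
import numpy as np

num_epochs = 10                                        # hypothetical config value
weights_label = {}                                     # label -> (times_seen, times_correct), built on validation data
for epoch in range(num_epochs):
    weights_dict = get_weights_label_as_standard_dict(weights_label)   # label -> accuracy
    for batch_x, batch_y in train_batches:             # hypothetical iterator over training batches
        if weights_dict:                               # 2. weights for the current batch
            batch_weights = get_weights_for_current_batch(batch_y, weights_dict)
        else:
            batch_weights = np.ones(len(batch_y))      # first epoch: uniform weights
        feed = {model.input_x: batch_x, model.input_y: batch_y, model.weights: batch_weights}
        sess.run(model.train_op, feed_dict=feed)       # 3. the model's loss uses the weighted cross entropy
    weights_label = {}                                 # 1. recompute label weights on validation data after each epoch
    for valid_x, valid_y in valid_batches:
        logits = sess.run(model.logits, feed_dict={model.input_x: valid_x})
        weights_label = compute_labels_weights(weights_label, logits, valid_y)
```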

a02_TextCNN/__init__.py

Whitespace-only changes.
Binary files not shown.

a02_TextCNN/data_util.py

+129
@@ -0,0 +1,129 @@
# -*- coding: utf-8 -*-
import codecs
import random
import numpy as np
from tflearn.data_utils import pad_sequences
from collections import Counter
import os
import pickle

PAD_ID = 0
UNK_ID = 1
_PAD = "_PAD"
_UNK = "UNK"


def load_data_multilabel(training_data_path, vocab_word2index, vocab_label2index, sentence_len, training_portion=0.95):
    """
    convert data to indices using the word2index dicts.
    :param training_data_path: path to the training file
    :param vocab_word2index: dict mapping word -> index
    :param vocab_label2index: dict mapping label -> index
    :param sentence_len: maximum sentence length (shorter sentences are padded)
    :param training_portion: fraction of the data used for training
    :return: (train, test), each a tuple of (X, Y)
    """
    file_object = codecs.open(training_data_path, mode='r', encoding='utf-8')
    lines = file_object.readlines()
    random.shuffle(lines)
    label_size = len(vocab_label2index)
    X = []
    Y = []
    for i, line in enumerate(lines):
        raw_list = line.strip().split("__label__")
        input_list = raw_list[0].strip().split(" ")
        input_list = [x.strip().replace(" ", "") for x in input_list if x != '']
        x = [vocab_word2index.get(x, UNK_ID) for x in input_list]
        label_list = raw_list[1:]
        label_list = [l.strip().replace(" ", "") for l in label_list if l != '']
        label_list = [vocab_label2index[label] for label in label_list]
        y = transform_multilabel_as_multihot(label_list, label_size)
        X.append(x)
        Y.append(y)
    X = pad_sequences(X, maxlen=sentence_len, value=0.)  # padding to max length
    number_examples = len(lines)
    training_number = int(training_portion * number_examples)
    train = (X[0:training_number], Y[0:training_number])
    valid_number = min(1000, number_examples - training_number)
    test = (X[training_number + 1:training_number + valid_number + 1], Y[training_number + 1:training_number + valid_number + 1])
    return train, test


def transform_multilabel_as_multihot(label_list, label_size):
    """
    convert to multi-hot style
    :param label_list: e.g. [0,1,4]; 4 means the 4th position holds a true value (indicated by '1')
    :param label_size: e.g. 199
    :return: e.g. [1,1,0,0,1,0,...]
    """
    result = np.zeros(label_size)
    # set those locations to 1, everything else stays 0.
    result[label_list] = 1
    return result


# build word and label vocabularies from the training data, together with their index mappings
def create_vocabulary(training_data_path, vocab_size, name_scope='cnn'):
    """
    create vocabulary
    :param training_data_path: path to the training file
    :param vocab_size: maximum number of words to keep
    :param name_scope: used to name the cache folder
    :return: (vocabulary_word2index, vocabulary_index2word, vocabulary_label2index, vocabulary_index2label)
    """
    cache_vocabulary_label_pik = 'cache' + "_" + name_scope  # path to save the cache
    if not os.path.isdir(cache_vocabulary_label_pik):  # create the folder if it does not exist.
        os.makedirs(cache_vocabulary_label_pik)

    # if the cache exists, load it; otherwise create it.
    cache_path = cache_vocabulary_label_pik + "/" + 'vocab_label.pik'
    print("cache_path:", cache_path, "file_exists:", os.path.exists(cache_path))
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as data_f:
            return pickle.load(data_f)
    else:
        vocabulary_word2index = {}
        vocabulary_index2word = {}
        vocabulary_word2index[_PAD] = PAD_ID
        vocabulary_index2word[PAD_ID] = _PAD
        vocabulary_word2index[_UNK] = UNK_ID
        vocabulary_index2word[UNK_ID] = _UNK

        vocabulary_label2index = {}
        vocabulary_index2label = {}

        # 1.load raw data
        file_object = codecs.open(training_data_path, mode='r', encoding='utf-8')
        lines = file_object.readlines()
        # 2.loop over each line, update the counters
        c_inputs = Counter()
        c_labels = Counter()
        for line in lines:
            raw_list = line.strip().split("__label__")

            input_list = raw_list[0].strip().split(" ")
            input_list = [x.strip().replace(" ", "") for x in input_list if x != '']
            label_list = [l.strip().replace(" ", "") for l in raw_list[1:] if l != '']
            c_inputs.update(input_list)
            c_labels.update(label_list)
        # keep the most frequent words
        vocab_list = c_inputs.most_common(vocab_size)
        label_list = c_labels.most_common()
        # put those words into the dicts
        for i, tuplee in enumerate(vocab_list):
            word, _ = tuplee
            vocabulary_word2index[word] = i + 2
            vocabulary_index2word[i + 2] = word

        for i, tuplee in enumerate(label_list):
            label, _ = tuplee
            label = str(label)
            vocabulary_label2index[label] = i
            vocabulary_index2label[i] = label

        # save to the file system if the vocabulary cache does not exist yet.
        if not os.path.exists(cache_path):
            with open(cache_path, 'ab') as data_f:
                pickle.dump((vocabulary_word2index, vocabulary_index2word, vocabulary_label2index, vocabulary_index2label), data_f)
    return vocabulary_word2index, vocabulary_index2word, vocabulary_label2index, vocabulary_index2label

#training_data_path = '../data/sample_multiple_label3.txt'
#vocab_size = 100
#create_vocabulary(training_data_path, vocab_size)
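
A minimal usage sketch of the two main functions above; the file path and the sizes are placeholders:

```python
# build the vocabularies (cached under ./cache_cnn), then load the multi-label data
word2index, index2word, label2index, index2label = create_vocabulary(
    "../data/sample_multiple_label.txt", vocab_size=50000, name_scope='cnn')

train, test = load_data_multilabel(
    "../data/sample_multiple_label.txt", word2index, label2index, sentence_len=200)
train_X, train_Y = train
print("train examples:", len(train_X), "label size:", len(label2index))
```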

a02_TextCNN/other_experiement/__init__.py

Whitespace-only changes.
