
Feature Request: Linear Chain Conditional Random Field #4090

Closed
phipleg opened this issue Oct 17, 2016 · 60 comments

Comments

@phipleg
Contributor

phipleg commented Oct 17, 2016

I implemented a Linear Chain CRF layer for sequence tagging tasks inspired by the paper:

Lample et al., Neural Architectures for Named Entity Recognition

The layer is the identity function during training and applies the forward-backward algorithm during inference. For that, it holds a set of trainable parameters which are accessed by a specific loss function.

You can see the API in the short gist for POS tagging on the Penn Treebank (provided by NLTK):
https://gist.github.com/phipleg/adfccb0ad96b777eecc9bb0f16ab54fc

Currently, it is only implemented in Theano and supports fixed length sequences (no masking).

Is anybody interested in seeing the layer in Keras? The need was raised in issue #824, but that issue is closed.

I could refactor my code and make a pull request in a few days. For that I would need to add a few functions to the Theano backend because I make use of Theano's scan function. I could also provide an example.
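
In outline, the intended usage looks roughly like this (a sketch with hypothetical layer sizes; it mirrors the example code in the gist above and in the discussion below):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense, ChainCRF

n_words, maxlen, n_classes = 10000, 32, 45   # hypothetical sizes

model = Sequential()
model.add(Embedding(n_words, 128, input_length=maxlen))
model.add(LSTM(64, return_sequences=True))
model.add(TimeDistributed(Dense(n_classes)))
crf = ChainCRF()              # identity during training, forward-backward at inference
model.add(crf)
# The transition parameters live in the layer but are trained through crf.loss.
model.compile(loss=crf.loss, optimizer='rmsprop', metrics=['accuracy'])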

@tmills

tmills commented Oct 21, 2016

I would find this useful.

@phipleg
Contributor Author

phipleg commented Oct 23, 2016

OK, great! Before I make the pull request, I'll try to figure out the implementation in TensorFlow in order to get the ChainCRF going with both backends.

@kaya27

kaya27 commented Oct 31, 2016

I'm interested; I would find this very useful.

@efosler

efosler commented Nov 2, 2016

Following as well. I was about to embark on an implementation myself. (I may still, just for the exercise, but if you have something I would be interested.)

@matteotosi

Interested as well! ;)

@c4n

c4n commented Nov 3, 2016

This is awesome!

@sonalgupta

This will be great!

@phipleg
Contributor Author

phipleg commented Nov 11, 2016

Thanks for your support! I am almost done, except for a bug in my TensorFlow implementation. I hope to resolve this issue in a few days.

@danche354

danche354 commented Nov 12, 2016

It will be an awesome function!

@xiangyanchao

Great, waiting for update!

@phipleg
Contributor Author

phipleg commented Nov 28, 2016

Sorry for the delay. I am still working on it, but in my spare time. The layer is complete, but the example is not finished yet.

@nreimers

nreimers commented Dec 1, 2016

I'm interested as well. I have the problem that RNNs are not able to capture e.g. BIO encoding correctly and produce ill-formatted BIO tags (e.g. starting an I-tag without a previous B-tag).

Thanks for contributing and looking forward to your implementation.

@sunbohit

sunbohit commented Dec 7, 2016

python3 conll2000_bi_lstm_crf.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
File "conll2000_bi_lstm_crf.py", line 16, in
from keras.layers import Dense, Embedding, ChainCRF, LSTM, Bidirectional, Dropout
ImportError: cannot import name 'ChainCRF'

I have run the setup.py in https://github.com/fchollet/keras/tree/bba6b521abc462261dd65883be59c94e1467b7cf
and I can see 'crf.py' in keras/layers/, but when I ran examples/conll2000_bi_lstm_crf.py, I got the ImportError.

What is the right way to run this file?

@phipleg
Contributor Author

phipleg commented Dec 7, 2016

This should work if the library is properly installed. I guess you had a previous Keras version in your conda environment. In that case your install didn't update existing files but just added the new ones, for example keras/layers/__init__.py, which is likely the source of the error.

Try again:

python setup.py install --force

You can check the installation by running

python3 -c "from keras.layers import ChainCRF"

If this doesn't throw an ImportError, the example should work.

@sunbohit

sunbohit commented Dec 7, 2016

Thanks.
After uninstalling my previous Keras version, I successfully imported the package.

@lemmonation

Hi, I'm running a BLSTM-CRF model, but before the training begins I get the following error:
Train on 1860255 samples, validate on 206696 samples
Epoch 1/5
Traceback (most recent call last):
File "train_keras_model.py", line 125, in
args.batchsize, args.maxlen, args.maxepochs, args.hiddenunits, args.dropout)
File "train_keras_model.py", line 96, in train
nb_epoch = maxepochs, validation_data = (test_X,Y_test))
File "/home/junliangguo/keras-b/keras/models.py", line 652, in fit
sample_weight=sample_weight)
File "/home/junliangguo/keras-b/keras/engine/training.py", line 1111, in fit
initial_epoch=initial_epoch)
File "/home/junliangguo/keras-b/keras/engine/training.py", line 826, in _fit_loop
outs = f(ins_batch)
File "/home/junliangguo/keras-b/keras/backend/tensorflow_backend.py", line 1096, in call
updated = session.run(self.outputs + [self.updates_op], feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 894, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (20, 4) for Tensor u'chaincrf_1_target:0', which has shape '(?, ?, ?)'

and my model is as follows:
model = Sequential()
model.add(Embedding(output_dim=embeddingDim, input_dim=vocabSize + 1,
                    input_length=maxlen, mask_zero=False, weights=[embeddingWeights]))
model.add(Bidirectional(LSTM(output_dim=hiddenDims, return_sequences=True), merge_mode='concat'))
model.add(Dropout(dropout))
model.add(TimeDistributed(Dense(outputDims)))
crf = ChainCRF()
model.add(crf)
model.compile(loss=crf.loss, optimizer='adam', metrics=["accuracy"])

Before that I added a TimeDistributed wrapper to make the input dimension of the CRF correct, but I don't know what this error means. Could somebody help me?

@phipleg
Contributor Author

phipleg commented Dec 7, 2016

In your setting, the targets must be one-hot encoded and hence of dimension 3 (and not 2), i.e.:

Y_test.shape = (nb_samples, timesteps, nb_classes)
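
For example, with integer tags you could one-hot encode per timestep roughly like this (plain NumPy; the sizes are hypothetical):

import numpy as np

nb_samples, timesteps, nb_classes = 32, 80, 24                    # hypothetical sizes
y = np.random.randint(nb_classes, size=(nb_samples, timesteps))   # one integer tag per timestep
Y = np.eye(nb_classes)[y]                                         # one-hot, shape (nb_samples, timesteps, nb_classes)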

@lemmonation

Thanks for the reply but I'm not very clear about the shape.

After my preprocessing I use
Y_test = np_utils.to_categorical(test_y, LabelDims)
so one batch of Y has the size (batchsize, LabelDims), i.e. (nb_samples, nb_classes).

So how can I make the dimension transition from (nb_samples, nb_classes) to (nb_samples, timesteps, nb_classes)? Are there any functions for this, or do I need to change the preprocessing step?

Thanks a lot.

@phipleg
Contributor Author

phipleg commented Dec 8, 2016

You cannot make the desired dimension transition. The model works only for temporal data, but your preprocessing shows that this is not true in your case. Why are you trying to use a ChainCRF?

@lemmonation

lemmonation commented Dec 8, 2016

I use LSTM and CRF for Chinese sequence segmentation. In my preprocessing I slide a window over the sentence, so the training data has X.size = (total_count, window_size) and Y.size = (total_count,).

It seems that I should set a sequence length as the timesteps, so that the data shapes become X.size = (batch_size, seq_len, window_size) and Y.size = (batch_size, seq_len).

But I have used an embedding layer, which makes the output fed to the LSTM layer become (batch_size, window_size, embedding_dim). And the network works well and has a good result because it takes window_size as the time steps, with return_sequences = False in the last LSTM layer, which makes the dimensions match between the network's output and Y.

And I notice that the ChainCRF layer doesn't support the mask_zero argument yet. So does it mean I should discard the embedding layer and reshape the data dimensions to X.size = (batch_size, seq_len, window_size) and Y.size = (batch_size, seq_len, n_classes) (after converting to categorical)?

@lemmonation

lemmonation commented Dec 9, 2016

Thanks for the advice. I have already fixed the problem by discarding the embedding layer and reshaping the data dimensions.

@LopezGG

LopezGG commented Dec 30, 2016

Can I use this ChainCRF to implement a BiLSTM with CRF for NER tagging, as shown in the code here: https://github.com/glample/tagger?

@SamihYounes

I got this error, any help?

Loading data...
Unique words: 17260
Unique pos_tags: 45
Unique chunk tags: 24
X_words_train shape: (8936, 80)
X_words_test shape: (2012, 80)
y_train shape: (8936, 80, 1)
y_test shape: (2012, 80, 1)
Build model...
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 569, in call
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 632, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 164, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/crf.py", line 122, in call
y_pred = K.crf_inference(x, self.U, self.b)
AttributeError: 'module' object has no attribute 'crf_inference'

@phipleg
Contributor Author

phipleg commented Jan 27, 2017

Hi @SamihYounes,

Please update to the latest version of the pull request #4621.

@phipleg
Contributor Author

phipleg commented Jan 27, 2017

Hi @LopezGG,

Yes, this will be possible. You can start with the chunking example.
Once the CRF layer is finally merged, I will provide an example.

@NianzuMa

I have a question about the example code

n_words = 10000
maxlen = 32
(X_train, y_train), (X_test, y_test) = load_treebank(nb_words=n_words, maxlen=maxlen)

n_samples, n_steps, n_classes = y_train.shape

model = Sequential()
model.add(Embedding(n_words, 128, input_length=maxlen, dropout=0.2))
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes)))
model.add(Dropout(0.2))
crf = ChainCRF()
model.add(crf)
model.compile(loss=crf.loss, optimizer='rmsprop', metrics=['accuracy'])

After the LSTM layer, the Dense layer converts each timestep of the LSTM output to n_classes dimensions. For instance, the output of the LSTM is (batch_size, timestep, lstm_output), which is (64, 3, 100). Why should this output be converted to (64, 3, nb_class) using a Dense layer? Couldn't data of shape (64, 3, 100) be the direct input to the CRF layer, with the CRF layer's output then being (64, 3, nb_class)?

The CRF could take features of size 100 for each timestep and then output tags of size 8, right?

@phipleg
Contributor Author

phipleg commented Mar 26, 2017

Dear @JacobIsrael123,

Of course, we could integrate a dense layer for input dimension conversion, but this is not always necessary (for example, if the preceding layer is a recurrent layer with output dimension nb_classes). While designing the ChainCRF layer, I decided to keep it as simple as possible.

@rvadaga

rvadaga commented May 13, 2017

@phipleg Can you share the document/tutorial you used for implementing ChainCRF module? Thanks.

@sujanucsc

The above issue with the loss value is only observed with the Theano backend; TensorFlow works fine.

@liyzhang

New to GitHub; how can I run the setup in https://github.com/fchollet/keras/tree/bba6b521abc462261dd65883be59c94e1467b7cf?

If I use CNTK, can I still use the CRF layer?

@FredRodrigues

@liyzhang

python setup.py install --force

@dfalci

dfalci commented Jul 27, 2017

I am working on a sequence labeling task based on a bidirectional LSTM architecture with variable sequence lengths (I'm not padding sentences). Thus, during training, I have a lot of mini-batches, including some of size 1. @phipleg said in a previous post that "mini batches of size 1 are problematic". Does this mean that this implementation won't work in such a situation?

@nreimers

@dfalci This was fixed in a later commit; the CRF implementation now works fine for mini-batches of size 1.

A hint on speeding things up: what I do is group sentences by sentence length and then create mini-batches of sentences with the same length, as sketched below. If your training data is large enough and the sentences are of approximately the same length, you will only have a few mini-batches containing a single sentence.
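
A minimal sketch of that bucketing idea (plain Python; sentences as a list of token lists and the batch size are placeholder names, not from the original comment):

from collections import defaultdict

def batches_by_length(sentences, batch_size):
    # Group sentences by their length so each mini-batch needs no padding.
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[len(sent)].append(sent)
    # Yield mini-batches from each bucket; only the last batch per bucket may be small.
    for length, bucket in buckets.items():
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]

# Example: four sentences of mixed lengths, batch size 2.
sents = [["a", "b"], ["c", "d"], ["e"], ["f", "g", "h"]]
for batch in batches_by_length(sents, 2):
    print([len(s) for s in batch])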

@enewe101

enewe101 commented Aug 1, 2017

@phipleg I'm interested in using your implementation, but am wondering if you could elaborate on what this means:

The layer is the identity function during training and applies the forward-backward algorithm during inference. For that, it holds a set of trainable parameters which are accessed by a specific loss function.

What I think this means is that the second-from-last layer, the one just before the CRF, is actually trying to predict the target directly, and learns to do so based on a loss function that is distinct from the CRF (e.g. cross-entropy). Meanwhile, the CRF learns a set of transition probabilities, based on its own loss function --- the log likelihood calculated from the forward-backward algorithm.

So, training of the CRF could be decoupled from training of the rest of the network, since the CRF's parameters do not affect the loss function as seen by the rest of the network ("The layer is the identity function during training").

Have I understood correctly?

While this seems reasonable, my reading of Bidirectional LSTM-CRF Models for Sequence Tagging (Huang et al 2015) and Neural Architectures for Named Entity Recognition (Lample 2016) is that their Bi-LSTM-CRF implementations are trained by back-propagating the CRF's log-likelihood loss function through the entire network. (I could be mistaken of course.)

@nreimers

nreimers commented Aug 7, 2017

@enewe101 I can maybe answer that.

The CRF layer is updated using back-propagation during training to learn the transition probabilities. However, during training we already know the correct labels, so what a CRF (or hidden Markov model) layer in a neural network computes at training time is distinct from what it computes at inference time. The error function of the CRF layer is used for training, and the transitions are updated with each epoch.

You could of course also decouple the training of the network from the training of the CRF / HMM, but this is seldom done as it introduces further complexity.

The paper by Collobert et al., 'NLP (Almost) from Scratch', explains well how to add an HMM to a network and how training and inference must be modified.

In my implementation (https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf) I achieve on-par results with Huang et al., Lample et al., and Ma & Hovy for various tasks using this CRF implementation. So it appears that this CRF implementation works well.

@phipleg
Contributor Author

phipleg commented Aug 8, 2017

Hi @enewe101,

as @nreimers pointed out, the CRF layer only applies the costly inference at prediction time, not during training, because the target labels are already known. During training it acts as the identity but at the same time holds the parameters for the CRF loss. You need to use this loss in your model (and not some cross-entropy like you said); otherwise, when taking gradients, the CRF parameters won't get any updates.
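
Conceptually, the CRF loss is the negative log-likelihood of the gold tag sequence, as in Lample et al. (2016). A plain NumPy sketch of that quantity (start/end boundary terms omitted; an illustration only, not the layer's actual implementation):

import numpy as np
from scipy.special import logsumexp

def crf_neg_log_likelihood(emissions, transitions, tags):
    """Negative log-likelihood of one gold tag sequence under a linear-chain CRF.

    emissions:   (timesteps, n_classes) per-timestep scores (e.g. LSTM + Dense output)
    transitions: (n_classes, n_classes), transitions[i, j] = score of tag i -> tag j
    tags:        (timesteps,) integer gold tags
    """
    timesteps = len(tags)

    # Score of the gold path: emission scores plus transition scores along the path.
    gold_score = emissions[np.arange(timesteps), tags].sum()
    gold_score += transitions[tags[:-1], tags[1:]].sum()

    # Log of the partition function (sum over all tag sequences) via the forward algorithm.
    alpha = emissions[0]
    for t in range(1, timesteps):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    log_partition = logsumexp(alpha)

    # Minimising this w.r.t. both emissions and transitions trains the whole stack.
    return log_partition - gold_score

# Tiny usage example with random scores: 5 timesteps, 4 tags.
emissions = np.random.randn(5, 4)
transitions = np.random.randn(4, 4)
tags = np.array([0, 1, 1, 3, 2])
print(crf_neg_log_likelihood(emissions, transitions, tags))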

@enewe101

enewe101 commented Aug 8, 2017

thanks @nreimers and @phipleg !

@chaxor

chaxor commented Aug 12, 2017

@phipleg I see that typically the CRF layer loss is applied to a single output layer. Is it possible to have two outputs, as in the functional API demo, while using the CRF loss from two different CRFs?

@phipleg
Contributor Author

phipleg commented Aug 12, 2017

Hi @chaxor,

have you already tried something like this?

input_for_crf1 = ...
input_for_crf2 = ...
crf1 = ChainCRF(params_for_crf1)
crf2 = ChainCRF(params_for_crf2)
out1 = crf1(input_for_crf1)
out2 = crf2(input_for_crf2)

model = Model(inp, [out1, out2])
model.compile(optimizer=..., loss=[crf1.loss, crf2.loss])
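
A fuller, self-contained sketch of that two-output pattern (hypothetical layer sizes, a shared encoder, and ChainCRF constructor arguments omitted):

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, TimeDistributed, Dense, ChainCRF

maxlen, n_words, n_tags1, n_tags2 = 32, 10000, 45, 24   # hypothetical sizes

inp = Input(shape=(maxlen,), dtype='int32')
shared = LSTM(64, return_sequences=True)(Embedding(n_words, 128, input_length=maxlen)(inp))

# One CRF per output; each output uses the loss of its own CRF.
crf1 = ChainCRF()
crf2 = ChainCRF()
out1 = crf1(TimeDistributed(Dense(n_tags1))(shared))
out2 = crf2(TimeDistributed(Dense(n_tags2))(shared))

model = Model(inp, [out1, out2])
model.compile(optimizer='adam', loss=[crf1.loss, crf2.loss])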

@chaxor

chaxor commented Aug 13, 2017

@phipleg Well, that was a simple fix. Thank you so much for your help!

@Peydon

Peydon commented Aug 15, 2017

Hey, I just used the newest ChainCRF layer, but the result is strange. The accuracy on the training set first increases and then decreases, while the accuracy on the validation set increases continuously. I am not clear about that, can you explain it? @phipleg
Here is the training process.

[screenshot of the training process]

@phipleg
Contributor Author

phipleg commented Aug 17, 2017

Dear @KARABAER,

it is hard to say without knowing your complete model, the data and training code. Please give more details.

@Ethan1214

Hi, @phipleg

I think the transitions (a square matrix) should be updated after every training batch and should be stored if I want to save the model, right?

@jinfengr

jinfengr commented Oct 19, 2017

Hi @phipleg

In my problem setting, the input tensor has 4 dimensions (batch_size, session_len, query_len, num_classes), so I built a TimeDistributed layer on top of the CRF, like below:

def create_model(session_len, maxlen, lstm_size, n_classes):
    crf = ChainCRF()
    input = Input(shape=(session_len, maxlen, lstm_size))
    ivec = TimeDistributed(TimeDistributed(Dense(n_classes)))(input)
    output = TimeDistributed(crf, name="output1")(ivec)
    sequential_model = Model(input, output)
    return sequential_model, crf

sequential_model, crf = create_model(session_len, maxlen, lstm_size, n_classes)
sequential_model.compile(loss=crf.loss, optimizer='sgd')
print(sequential_model.summary())

x = np.random.random((batch_size, session_len, maxlen, lstm_size))
y = np.random.randint(n_classes, size=(batch_size, session_len, maxlen))
y = np.eye(n_classes)[y]
sequential_model.fit(x, y, nb_epoch=10)

But I found it always throws an error when I try to fit the sequential model. Does the current CRF implementation not support adding a TimeDistributed layer on top? If that's the case, is there any alternative to support a 4D input?

Thanks!

@jinfengr

jinfengr commented Oct 19, 2017

Hi @phipleg
A followup to previous question:

I found it's okay to define the create_model function as above; the error comes from the CRF loss function, which only accepts 3D predictions, while my predictions are 4D tensors. So I extended the loss function to accept 4D predictions as below:

def keras_extract_dim(x, i):
    return x[:, i, :, :]

def time_distributed_loss(self, y_true, y_pred):
    '''CRF Time Distributed Loss.'''
    mask = self._fetch_mask()
    _, session_len, _, _ = K.int_shape(y_pred)
    loss = 0.0
    for i in range(session_len):
        extract_layer = Lambda(lambda x: keras_extract_dim(x, i),
                               output_shape=lambda inp_shp: (inp_shp[0], inp_shp[2], inp_shp[3]))
        sub_y_true = extract_layer(y_true)
        sub_y_pred = extract_layer(y_pred)
        loss = loss + chain_crf_loss(sub_y_true, sub_y_pred, self.U, self.b_start, self.b_end, mask)
    return loss

sequential_model.compile(loss=crf.time_distributed_loss, optimizer='sgd')
sequential_model.fit(x, y, epochs=10)

I finally found the above solution worked! So for anyone having the same question as me, maybe you can take a look at it.

One more question: I found the model is training properly and the test results also make sense. However, the validation loss is about 100x larger than the training loss. Does anyone have any idea what may cause this issue? Thanks!

@ersinyar

ersinyar commented Jan 31, 2018

Although @phipleg and @nreimers explained it, I am still confused about the joint training of the CRF and LSTM. What I'd expect is that first we obtain the LSTM outputs, and the CRF layer makes a decision using these outputs and a randomly initialized transition matrix by calculating a score function. Then a loss function is calculated from the true and predicted labels, the gradient of this function with respect to the parameters we are trying to learn is computed, and finally the parameters are updated accordingly.

I don't understand how to back-propagate the errors from the output of the CRF layer. In an LSTM without a CRF layer, one may use cross-entropy for sequence-to-sequence learning by calculating a loss from the predicted and true (maybe one-hot) labels at each time step. But how is the error back-propagated in the case of a CRF?

@lzfelix

lzfelix commented Mar 20, 2018

Is there any preference or major difference between this implementation and the ChainCrf in keras-contrib?

@nreimers

@lzfelix This implementation worked better for me (see https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf). "Worked better" means: higher F1-score for standard sequence tagging tasks like NER, chunking, etc.

The ChainCrf implementation in keras-contrib has the issue that it produced several invalid BIO tags, i.e., it starts an I-tag without a previous B-tag. This is not the case for this implementation.

However: this implementation only works for Keras 1, while the ChainCrf in keras-contrib also works for Keras 2.

@lzfelix

lzfelix commented Mar 20, 2018

Hi @nreimers, thank you for the feedback and your careful evaluation of both models. I'm quite new to CRFs and coincidentally I was reading your code before reaching this thread.

Regarding the limitation on the Keras version that you mentioned, I was able to overcome it through minor code modifications. On this repository I show such changes and evaluate the CRF on the POS task using a very simple model; maybe you can use this as well to upgrade your code to Keras 2, as pointed out in your repository's readme.

@nreimers

Hi @lzfelix , that is great, I will have a look.

@lzfelix

lzfelix commented Mar 21, 2018

@nreimers, thank you! Please let me know if you have any comments or find anything strange in my code. As I mentioned previously, I'm still learning CRFs.

@Jeffyrao, I have observed that my loss on dev is much higher than on training as well, but not as much as you reported. This behaviour can be seen in my demo code here.

@dynamicwebpaige

Closing this issue, as linear chain CRF is supported as a Keras layer in TensorFlow Addons. Thank you for the feature request!
