Linear Chain CRF layer and a text chunking example #4621
Conversation
keras/layers/crf.py (outdated diff):

        '''
        return chain_crf_loss(y_true, y_pred, self.U, self.b)

    def sparse_loss(self, y_true, y_pred):
Interesting trick.
Hmm, interesting. I also have an implementation, and I was thinking of making a pull request when I saw your PR. My implementation supports masking and Viterbi decoding, computes marginal probabilities, and allows the CRF to be used as an intermediate layer (instead of only as the last layer) when using marginal mode (instead of joint distribution mode). My PR is here: #4646
Sorry for the late reply. I'm lying in bed with a bad cold. Good idea to join forces @linxihui! It is very nice that you don't need any further extension of the Keras backend. I guess I could refactor my code to make use of the K.rnn method instead of using theano.scan and tensorflow.scan. Then no extension of the backend would be necessary. However, this is not so clear to me, since I do a lot of workarounds to avoid converting the sparse targets to one-hot encoded vectors. Moreover, I could make masking work if that is really necessary for the first release. In my experience masking makes everything very slow without much benefit. That's why I didn't put much effort into it. Then, if I understand your implementation correctly, using your CRF as a last layer with learn_mode=join and test_mode=viterbi is almost the same as a dense layer followed by my ChainCRF. However, the latter supports a boundary condition on the left, by providing a bias weight to learn the transition between a virtual start label and the first label of the target sequence. I've seen implementations of linear-chain CRFs where the input and target sequences are embedded in sequences of length plus two in order to deal with the left and right boundary conditions. I didn't handle the end boundary condition since in my application (sentence tagging), the sequences have variable length and are always padded on the right, so that the padding elements act as a virtual end label. You can see the benefit of handling these boundary conditions in my integration tests. There, the test data has the property that the input x_t and the target y_t are independent, except when t=0, and y_t = y_{t-1} + 1 (modulo nb_classes). Nonetheless, the network is able to learn the correct tag sequence (with accuracy >= 0.94), which is only possible when the left boundary condition is handled correctly. Admittedly, this data is quite artificial and doesn't resemble real data like the one seen in text chunking. What are the applications of using your CRF layer as an intermediate layer that returns marginal probabilities? Should this functionality be in the same class, or do you see a way to move it to another CRF class?
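(For readers who want to reproduce the integration-test setup described above, here is a rough sketch of synthetic data with that property; the function name and sizes are assumptions, not the PR's actual test code.)

import numpy as np

def make_sequence(maxlen, nb_classes, nb_features):
    # Labels cycle deterministically: y_t = y_{t-1} + 1 (mod nb_classes).
    y0 = np.random.randint(nb_classes)
    y = (y0 + np.arange(maxlen)) % nb_classes
    # Inputs are noise, independent of the labels, except at t = 0,
    # where the input reveals the first label.
    assert nb_features >= nb_classes
    x = np.random.randn(maxlen, nb_features)
    x[0, :] = 0.0
    x[0, y0] = 1.0
    return x, y

# Example: X, Y = zip(*(make_sequence(10, 5, 5) for _ in range(1000)))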
That is a better approach. Note that …
@phipleg Do you agree that we should merge our code? Use my implementation (if you think it could be the better option), but with your unit tests and example? I have some code for handling boundary energies and I will make a commit after work.
Hey @linxihui! I've completed my rewrite over the past few days, so that the CRF makes use of K.rnn.
If it is ok for you, I would be happy to leave it as it is for now. We can add more functionality with another pull request, ok?
- For masking, please see my comments in your code. Yes, you can just have right padding, but then you have to highlight in your documentation that left padding is not accepted. People will take it for granted, since RNNs support either.
- I apologize for mentioning too much about using the CRF as an intermediate layer. Indeed, this is just a side effect I discovered after implementing the marginal mode. The marginal mode was originally for computing the marginal probabilities, since Viterbi only gives labels. I don't encourage people to use it, but just mention it in case anyone is interested.
keras/layers/crf.py (outdated diff):

    def _forward_step(x_t, states):
        alpha_tm1, U_shared = states
        B = K.expand_dims(alpha_tm1, 2) + K.expand_dims(x_t, 1) + K.expand_dims(U_shared, 0)
        alpha_t = K.logsumexp(B, axis=1)
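(For readers following the diff: a minimal NumPy sketch of the log-space forward recursion that this step implements, alpha_t[j] = logsumexp_i(alpha_{t-1}[i] + x_t[j] + U[i, j]). Names and shapes are assumptions for illustration, not the PR's code.)

import numpy as np

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True)), axis=axis)

def forward_log_partition(x, U, b_start):
    # x: (maxlen, n_classes) unary scores, U: (n_classes, n_classes) transition scores.
    alpha = b_start + x[0]                       # left boundary energy plus first unary score
    for x_t in x[1:]:
        B = alpha[:, None] + x_t[None, :] + U    # B[i, j] = alpha[i] + x_t[j] + U[i, j]
        alpha = logsumexp(B, axis=0)             # marginalise over the previous label i
    return logsumexp(alpha, axis=0)              # log of the partition function Z

# Example: forward_log_partition(np.random.randn(4, 3), np.random.randn(3, 3), np.zeros(3))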
Without a mask, how do you handle a batch with different sequence lengths? Say a batch of 2, where the first input has length 2 and the second has length 4. Then you would right-pad the first sequence to make it length 4. In your implementation without a mask, the true energy for the first input should be

b_start + x1' y1 + x2' y2 + y1' U y2 + b_end

but in your implementation there is no b_end; instead there is an additional

x3' y3 + x4' y4 + y2' U y3 + y3' U y4,

where y3 = y4 = 0.

The consequence is that the two formulations are not equivalent: at least when you take the derivative with respect to U_00 (the top-left element of the matrix U), the derivatives are not the same. Right? (Also, U_00 and U_11 are not exchangeable, so why should we treat label 0 and label 1 differently?)

Also, when you compute the normalization constant (the free energy in your code), you have to integrate over y3 and y4 (which are padding). I guess that's what you mean by "padding elements act as a virtual end label". However, when you take the derivative with respect to U or b_end, your approach is not equivalent to a real CRF.

One very obvious observation is that y3 and y4, the padding, affect the derivative with respect to U, and therefore the padding plays a role in the final outcome. The more padding you have, the more it affects the outcome. This is unexpected from my point of view.

Lastly, another simple observation: a model with the end energy (b_end) and a model without it do not have the same number of trainable parameters, so the two models are not the same.
Thanks a lot for these remarks. You are completely right that the models are not the same.
I don't handle batches of different lengths explicitly. They are embedded in sequences of fixed length by design, so the user has to do the conversion themselves.
Hi, I had a problem when saving the model with Keras model.save(). It threw an exception that an h5py object could not be initialized. The bug is located in crf.py on lines 267 & 268. Change it like this: …
I think deserialization is broken. When I load the model, I get an error during the model.compile() call that 'self.U' does not exist in the sparse_loss function. The reason is that when the model is reconstructed from the saved file, a new CRF object is created. However, the passed 'sparse_loss' function belongs to a different CRF object, not the one that the reconstructed model is using, and therefore the parameters U, b_start and b_end are not set. Any ideas how to fix this / how to pass the custom_objects properly so that I can load a stored model?
Hi @nreimers, thank you very much for pointing out the bug in the definition of … Regarding your question about model loading, I wrote a method …
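(For anyone hitting the same loading problem before the fix lands: one hedged workaround, not necessarily the method referred to above, is to rebuild the architecture in code and load only the weights, so that the loss stays bound to the CRF layer that is actually in the model. The file name and sizes below are hypothetical.)

from keras.models import Sequential
from keras.layers import Dense, TimeDistributed, ChainCRF

def build_model(nb_classes=3, maxlen=5, nb_features=11):
    model = Sequential()
    model.add(TimeDistributed(Dense(nb_classes), input_shape=(maxlen, nb_features)))
    crf = ChainCRF()
    model.add(crf)
    # The loss is bound to the CRF instance inside this model, so self.U etc. exist.
    model.compile(loss=crf.sparse_loss, optimizer='sgd')
    return model

model = build_model()
model.load_weights('crf_model_weights.h5')   # saved earlier with model.save_weights()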
Thanks for the quick reply / quick fix. Just as a general note: …
Hi @phipleg, with this implementation I achieve an F1 score of about 0.89 on CoNLL 2003 NER. This implementation produces between 0 and 3 wrong tags on the dev/test data, i.e. an I-tag starting without a previous B-tag. With the implementation of #4646, the number of wrong BIO tags is between 20 and 60 on the dev/test set. In both cases I use the word embeddings by Levy et al., a 100-dim Bi-LSTM, a dense hidden layer with linear activation function, and the CRF.
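(For context, a rough Keras 1.x sketch of the architecture described above; vocabulary size, tag count and other hyperparameters are assumptions, not the exact setup used in that experiment.)

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, Dense, TimeDistributed, ChainCRF

nb_words, embedding_dim, maxlen, nb_tags = 100000, 100, 50, 9   # assumed sizes

model = Sequential()
model.add(Embedding(nb_words, embedding_dim, input_length=maxlen))  # e.g. initialised with pre-trained embeddings
model.add(Bidirectional(LSTM(100, return_sequences=True)))          # Bi-LSTM over the sentence
model.add(TimeDistributed(Dense(nb_tags)))                          # dense layer with linear activation, per-timestep tag scores
crf = ChainCRF()
model.add(crf)
model.compile(loss=crf.sparse_loss, optimizer='adam')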
@phipleg you could consider opening another PR (in parallel) in the contrib repo 🎉! There are more official reviewers there, so you could get more feedback.
Hi @nreimers, thanks a lot for this detailed test. You probably used the English dataset, right?
Hi @tboquet, thanks a lot for your suggestion. I will do that tomorrow evening.
Hi @phipleg, …
Hi @phipleg, thanks for adding the ChainCRF layer. However, I think it is breaking functionality of Keras when using temporal sample weights. The following code, which uses per-sample weights, works:

import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, ChainCRF, TimeDistributed

nb_classes = 3
X = np.random.randn(10, 5, 11)
y = np.random.randint(0, nb_classes, size=(10, 5))
y = np.expand_dims(y, -1)
y_mask = np.random.randint(0, 2, size=(10,))
X.shape, y.shape, y_mask.shape

model = Sequential()
model.add(TimeDistributed(Dense(nb_classes),
                          input_shape=(5, 11), name="temporal_dense"))
crf = ChainCRF()
model.add(crf)
model.summary()
model.compile(loss=crf.sparse_loss, optimizer='sgd')
model.fit(X, y, sample_weight=y_mask, nb_epoch=1)

However, the following modification of the above code, which uses temporal sample weights, fails:

y_mask = np.random.randint(0, 2, size=(10, 5))

model = Sequential()
model.add(TimeDistributed(Dense(nb_classes),  # activation='softmax'
                          input_shape=(5, 11), name="temporal_dense"))
crf = ChainCRF()
model.add(crf)
model.summary()
model.compile(loss=crf.sparse_loss, optimizer='sgd', sample_weight_mode='temporal')
model.fit(X, y, sample_weight=y_mask, nb_epoch=1)

I get the following error: …

Will it be appropriate to consider the sample weight per time step in the CRF loss, or should the sample weighting be done at the output level?
@phipleg also it might be helpful to utilize the TensorFlow API for the CRF loss, as described in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/crf/python/ops/crf.py. The TensorFlow API also uses masks per time step (the masks are created based on sequence lengths, currently). I am not sure how to go about it, but in the current approach there is a requirement to hold on to the CRF object of the ChainCRF layer in order to define the loss function using crf.sparse_loss.
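(For reference, a rough sketch of that TensorFlow contrib API as it looked around TF 0.12/1.x; the exact signatures may differ, so check the linked source. The shapes below are assumptions.)

import tensorflow as tf

batch_size, maxlen, nb_tags = 32, 50, 9                     # assumed sizes
unary_scores = tf.placeholder(tf.float32, [batch_size, maxlen, nb_tags])
gold_tags = tf.placeholder(tf.int32, [batch_size, maxlen])
sequence_lengths = tf.placeholder(tf.int32, [batch_size])   # per-example lengths act as the mask

log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    unary_scores, gold_tags, sequence_lengths)
loss = tf.reduce_mean(-log_likelihood)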
keras/layers/crf.py (outdated diff):

        '''
        y_true = K.cast(y_true, 'int32')
        y_true = K.squeeze(y_true, 2)
        return sparse_chain_crf_loss(y_true, y_pred, self.U, self.b_start, self.b_end)
You should fetch the mask in the sparse_loss function as well, the same way you do in the loss function.
Also, every loss should be the mean loss per batch; right now you are simply summing the losses. This makes the losses for different batch sizes have different scales.
Hi @napsternxg! Thank you very much for your review! I added the missing mask in crf.sparse_loss. However, I couldn't fix the problem with the temporal sample weights. The problem is that usually the loss functions return a value for each batch element and each time step, i.e. a tensor of shape (batch_size, maxlen, ...). The total loss, including taking means and applying sample weights, is computed later on inside Keras' training code. Also, thank you for pointing out the CRF implementation in TensorFlow. Unfortunately, I don't see a way to integrate it in Keras. The main problem is that the loss functions in Keras are usually pure functions depending only on the predicted and the true values: they cannot depend on trainable weights. Currently, I have no better answer.
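(A small illustration of the shape mismatch described above; the numbers are arbitrary and the code is not Keras internals.)

import numpy as np

batch_size, maxlen = 10, 5

# A standard time-distributed loss yields one value per token, so temporal
# sample weights of shape (batch_size, maxlen) can be applied element-wise:
per_token_loss = np.random.rand(batch_size, maxlen)
temporal_weights = np.random.randint(0, 2, size=(batch_size, maxlen))
weighted = per_token_loss * temporal_weights

# A sequence-level CRF negative log-likelihood yields one value per sequence,
# so there is no time axis left for the temporal weights to multiply against:
per_sequence_loss = np.random.rand(batch_size)
# per_sequence_loss * temporal_weights  ->  shape mismatch: (10,) vs (10, 5)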
@phipleg thanks for the update. I think in that case masking is the best option for getting the loss without padding.
Hi, the first issue is a minor one: in line 289 of crf.py there is the following code: … This assertion fails when the number of steps is not defined a priori. Removing this assertion solves the issue. The second issue is TensorFlow-specific (i.e. the CRF layer works perfectly with Theano). I get the following message when switching to TensorFlow as the backend: …
I use the current TensorFlow 0.12.1 and Keras version 1.2.1. My model looks something like this: …
As the length of my sentences and mini-batches varies, I omit the input_length=maxlen parameter for the Embedding layer. With Theano it works perfectly; however, with TensorFlow I get the above error.
…emantic (method add_weight) in order to remove deprecated regularizers property.
Hi @nreimers, I'm very glad that you are testing the layer so thoroughly! If you need any help, please let me know. The problems when working with mini-batches of size 1 should be fixed (by upgrading K.logsumexp in the TensorFlow backend). Happy testing!
Approximately when will we see the final version of the CRF layer in the master branch?
As suggested by @tboquet, I opened the parallel PR keras-team/keras-contrib#25 in order to speed up the reviewing process! 🎉
Hello @nreimers, @phipleg, thanks for the code! The error value is: … Regarding …, I've replaced n_steps >= 2. Thanks in advance.
I have tried it with 1000 outputs and got a memory error: … Why does it try to allocate so much more memory?
Closing outdated PR. If you still care about the content of the PR, please submit a new PR to …
I would love to see this layer included in Keras 2.0. I use it for several NLP tasks, and in all of them it shows a strong performance increase in comparison to a softmax classifier. For me this layer is a must-have if you do sequence tagging in NLP.
Sorry for the late reply, life got in the way. Thanks for testing! I will work on another PR soon, in order to give the CRF a chance to get into Keras 2.0. @kaya27: I introduced an error while handling the step size problem; consider this fixed. @djstrong: I haven't investigated your problem yet. At some point the current implementation converts sparse outputs to dense ones as an intermediate step. This might be related to your problem.
Is there a paper about your sparse loss definition? I can't really understand it…
@phipleg …
Dear @harryhaos, the sparse loss is defined in https://arxiv.org/pdf/1603.01360.pdf.
Dear @mtmvu, the update is almost done (see https://github.com/phipleg/keras/tree/crf), but I can't fix the failing tests and the persistence workaround doesn't work anymore. If your time permits, I would be happy if you could join me! Best, and happy weekend.
@phipleg Or, is this version ok?
Hi @zhhongzhi, …
To those of you for whom it was not so obvious how to install this: it must be the crf branch!

$ git checkout crf

Any chance to get this merged into master Keras? It would really be great.

Is it possible to use the CRF objective with a FFNN instead of an RNN? Just wondering if there is some inherent limitation, because all the examples I've ever seen use an RNN. Will just adding the normal Sequence layer and then ChainCRF work?
Does this only work with the Theano backend, or with TensorFlow as well?
@utkrist I've been using ChainCRF in a project and I've switched between both backends; both seem to work well.
Hello @nreimers, I am new to NER. Can you share your NER code with the CRF layer on the CoNLL 2003 dataset to help my understanding? Thanks.
@jxwb088047 I will publish my code soon on GitHub (beginning to mid July). I will let you know as soon as I push it to public GitHub.
@nreimers I am waiting for your great code and following your updates.
@jxwb088047 Hi, I uploaded my BiLSTM-(CNN)-CRF code here: … It can be used to train the models from Huang et al. (BiLSTM-CRF), from Ma & Hovy (BiLSTM-CNN-CRF) and from Lample et al. (BiLSTM-LSTM-CRF). I hope the documentation and the code are helpful enough to give you a good start into this topic. The code uses the CRF code by @phipleg, thank you again for contributing this to the community. However, it currently works only with Keras 1.x. Maybe we can update it at some point to work with Keras 2.x.
@nreimers Thank you for sharing your code and the relevant materials. I will dig into everything you shared and hope to gain a comprehensive understanding of NER.
This pull request relates to issue #4090. It adds a new layer, ChainCRF, with a dedicated loss function and Viterbi-style decoding for inferring the best tag sequence. To demonstrate its use, an example for text chunking is given as well.
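(To illustrate the Viterbi-style decoding mentioned above, here is a minimal NumPy sketch for a linear-chain model with unary scores x and transition scores U. The names are illustrative; the PR's own decoding lives in keras/layers/crf.py.)

import numpy as np

def viterbi_decode(x, U, b_start):
    # x: (maxlen, n_classes) unary scores, U: (n_classes, n_classes) transition scores.
    maxlen, n = x.shape
    delta = b_start + x[0]                           # best score ending in each label at t = 0
    backpointers = np.zeros((maxlen, n), dtype=int)
    for t in range(1, maxlen):
        scores = delta[:, None] + U + x[t][None, :]  # scores[i, j]: come from label i, move to label j
        backpointers[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    best = [int(delta.argmax())]
    for t in range(maxlen - 1, 0, -1):                # follow the backpointers from the end
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]

# Example: viterbi_decode(np.random.randn(6, 4), np.random.randn(4, 4), np.zeros(4))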