
Question about the crf model #101

Closed

HaoDreamlong opened this issue Jan 4, 2021 · 10 comments

@HaoDreamlong

In the CRF model of version v0.3.2, the encoder ends with a Tanh layer and a Scale layer. Why is it necessary to add these two layers?

@davidcpage
Contributor

This constrains the output scores to lie in a range given by the scale factor - e.g. for Scale(5.0) this is a soft clipping function to the range (-5.0, 5.0). Scores are in log space and this should allow plenty of dynamic range whilst improving training stability, but it's possible the Tanh layer could be removed.
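
A minimal sketch of this soft clipping, with a hypothetical `Scale` module standing in for the one in the repo:

```python
import torch
from torch import nn

class Scale(nn.Module):
    """Multiply activations by a fixed factor (hypothetical stand-in)."""
    def __init__(self, scale):
        super().__init__()
        self.scale = scale

    def forward(self, x):
        return self.scale * x

# Tanh squashes raw scores into (-1, 1); Scale(5.0) then stretches them
# to (-5.0, 5.0), acting as a soft clip on the log-space scores.
clip = nn.Sequential(nn.Tanh(), Scale(5.0))
print(clip(torch.tensor([-100.0, 0.0, 0.5, 100.0])))  # ~[-5.0, 0.0, 2.3, 5.0]
```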

@HaoDreamlong
Author

Thank you for your reply. I have one more question, about the GlobalNorm layer. I read the PyTorch version of the logZ calculation in seqdist.sparse, and I guess it is used for some sort of normalization. Does it have something to do with the Scale layer, and what exactly is the GlobalNorm layer for?

@HaoDreamlong
Author

And about the first question: if the outputs are log-space probabilities, shouldn't the probabilities be in the range (0, 1), and shouldn't log_p be less than 0?

@davidcpage
Contributor

The outputs of the network represent scores in a linear-chain CRF. You can use them to compute the log probability of a particular (aligned) output sequence by adding the log scores for the transitions at each timestep and subtracting the log of the global sum over (aligned) sequence scores, logZ. Scale() controls the dynamic range of the log scores, but these do not lie in (0,1) as they are not log probs.
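
To make this concrete, here is a toy linear-chain CRF in PyTorch (layout and shapes are illustrative, not the repo's). The score of an aligned path is the sum of its transition scores, logZ is the logsumexp over all path scores, and their difference is a log probability, so it is always at most 0 even though the raw scores themselves are unbounded:

```python
import torch

# Toy linear-chain CRF: scores[t, i, j] is the log score of the transition
# from state i to state j at timestep t (illustrative layout only).
T, S = 4, 3
scores = torch.randn(T, S, S)

# logZ via the forward algorithm: a logsumexp reduction over all paths.
alpha = torch.zeros(S)
for t in range(T):
    alpha = torch.logsumexp(alpha[:, None] + scores[t], dim=0)
logZ = torch.logsumexp(alpha, dim=0)

# Log probability of one particular aligned state sequence.
path = [0, 1, 2, 1, 0]  # length T + 1
path_score = sum(scores[t, path[t], path[t + 1]] for t in range(T))
log_p = path_score - logZ  # always <= 0: exp(log_p) is a probability
```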

@HaoDreamlong
Author

Oh, I get it. So since the outputs represent log scores, the loss of the model is the sum of the correct (aligned?) paths' scores, and the backward pass makes -loss smaller while pushing the correct paths toward the highest scores. And the decoder should work by finding the path with the highest score. Is that a proper description?

@iiSeymour
Member

Yes, that is right @HaoDreamlong

@HaoDreamlong
Author

Thank you very much. I have a little trouble understanding the variables named stay_indices/scores and move_indices/scores. As I understand it, stay_indices represents a 5-digit base-4 number, and move_indices is stay_indices plus the previous step's value. In some extreme situations, such as stay_indices=341 (1 1 1 1 1) and move_indices=342 (1 | 1 1 1 1 1), don't they represent the same situation?

@davidcpage
Contributor

We distinguish between being in state 1 1 1 1 1 and emitting a blank symbol (stay_index/score) and being in state 1 1 1 1 1 and emitting a 1 symbol (move_index/score). This leads to the same pair of before and after states, but a different emitted sequence. The inclusion of a blank symbol makes this a kind of CTC model, except that here the conditional independence assumption is replaced with a CRF.
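
As a sanity check on the indexing (assuming 5-mer states over 4 bases; the exact layout in the repo may differ), decoding 341 as a 5-digit base-4 number does give the state 1 1 1 1 1, and the stay/move distinction is about the emitted symbol, not the state pair:

```python
def decode_state(index, k=5, n_base=4):
    """Decode a flat state index into k base-4 digits, most significant first."""
    digits = []
    for _ in range(k):
        digits.append(index % n_base)
        index //= n_base
    return digits[::-1]

# A stay at this state emits a blank; a move ending in this state emits a 1.
# Same before/after state pair, different emitted sequence.
print(decode_state(341))  # [1, 1, 1, 1, 1]
```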

@HaoDreamlong
Author

The model's decode function simply calculates logZ twice (for different S) and obtains the gradient via autograd. It is hard to understand how this works as a Viterbi decoder. Could you tell me why such a delicate algorithm can produce the right answer?
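
One plausible reading of "logZ (for different S)" is that S is the semiring: computing logZ with max in place of logsumexp gives the score of the single best path, and because the gradient of a max is one-hot at its argmax, autograd recovers that path. A toy sketch of the identity, with made-up shapes:

```python
import torch

# Viterbi via autograd, in miniature: replace logsumexp with max in the
# forward recursion and "logZ" becomes the best path's score. The gradient
# of a max is one-hot at its argmax, so backprop through the whole
# recursion leaves a one-hot mask tracing the best path.
T, S = 4, 3
scores = torch.randn(T, S, S, requires_grad=True)

alpha = scores.new_zeros(S)
for t in range(T):
    alpha = torch.max(alpha[:, None] + scores[t], dim=0).values
best_score = alpha.max()

best_score.backward()
print(scores.grad.nonzero())  # one transition per timestep: the Viterbi path
```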

@HaoDreamlong
Author

@davidcpage The RNN model doesn't need chunk_lengths. Is that because the RNN can deal with the blank padding at the end of the input, or do I have to make the input entirely useful information?
