
CPU gradient definitions #69

Open
xkr1 opened this issue May 23, 2020 · 3 comments

Hi,

Firstly, thanks so much for making this code available. It's a really useful resource for learning more about RNN-T.

I have two questions, if I may:

i.) In cpu_rnnt.h the gradients are defined as:

ProbT g = alphas[idx(t, u)] + betas[idx(t+1, u)];
grad[idx(t, u, blank_)] = -std::exp(log_probs[idx(t, u) * 2] + g - loglike);

and

ProbT g = alphas[idx(t, u)] + betas[idx(t, u+1)];
grad[idx(t, u, labels[u])] = -std::exp(log_probs[idx(t, u) * 2 + 1] + g - loglike); 

Following rnnt_notes.pdf, as far as I understand, the formula used here is equation 9. From this, I can account for the following part of the above code:

ProbT g = alphas[idx(t, u)] + betas[idx(t+1, u)];
grad[idx(t, u, blank_)] = -std::exp(g - loglike);

but I'm not sure where the log_probs[] term comes in here?
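
To make the question concrete, here is how I read those two lines for a single (t, u) cell, as a standalone snippet with made-up numbers (my own simplification with hypothetical variable names, not the library code); the logp_* terms are the part I don't follow:

#include <cmath>
#include <cstdio>

int main() {
    // Made-up values for one (t, u) cell, all in the log domain.
    double alpha_t_u  = -1.2;  // alphas[idx(t, u)]
    double beta_t1_u  = -0.8;  // betas[idx(t+1, u)]
    double beta_t_u1  = -0.9;  // betas[idx(t, u+1)]
    double logp_blank = -0.4;  // log_probs[idx(t, u) * 2]     (blank)
    double logp_label = -1.1;  // log_probs[idx(t, u) * 2 + 1] (labels[u])
    double loglike    = -2.0;  // total log-likelihood of the target sequence

    // Same structure as the two lines quoted above.
    double grad_blank = -std::exp(logp_blank + alpha_t_u + beta_t1_u - loglike);
    double grad_label = -std::exp(logp_label + alpha_t_u + beta_t_u1 - loglike);

    std::printf("grad_blank = %f\ngrad_label = %f\n", grad_blank, grad_label);
    return 0;
}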

ii.) Following on from my first question: it's probably obvious that I'm not an expert in calculus, so I was wondering if you could point me to any resources, or give any hints, on how to derive the gradient of the loss from eq. 5, 7 and 8 in rnnt_loss.pdf. Even the smallest hint would be greatly appreciated!

Thanks again!

HawkAaron (Owner) commented

Well, the gradients here are with respect to the log probability: \frac{\partial L}{\partial \log p} = \frac{\partial L}{\partial p} \frac{\partial \exp(\log p)}{\partial \log p} = p \frac{\partial L}{\partial p}.
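
Spelled out for the blank case (my shorthand, using the note's forward/backward variables \alpha, \beta and writing P for P(\mathbf{y}^* \mid \mathbf{x})), eq. 9 gives the derivative with respect to the raw probability, and multiplying by the probability moves it to the log probability:

\frac{\partial L}{\partial \log p(\varnothing \mid t, u)} = p(\varnothing \mid t, u)\, \frac{\partial L}{\partial p(\varnothing \mid t, u)} = -\frac{p(\varnothing \mid t, u)\, \alpha(t, u)\, \beta(t+1, u)}{P}

In the log domain that product and quotient become the sum inside the exp, i.e. log_probs[idx(t, u) * 2] + alphas[idx(t, u)] + betas[idx(t+1, u)] - loglike, which is exactly the first grad line; the label case is the same with log_probs[idx(t, u) * 2 + 1] and betas[idx(t, u+1)].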

xkr1 (Author) commented May 24, 2020

Thank you! That makes a lot more sense to me now :-)

Referring now to the component g - loglike (in the log domain): I understand how g is derived, but it has so far been a mystery to me why the -loglike is there.

Would I be right in saying that this is because the loss, L, is defined in the log domain, i.e. L = -ln P(y*|x), and so we end up dividing through by this P(y*|x) term (loglike in the log domain) as a result of applying the chain rule when taking the derivative of the log?
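
Writing out just that step to check myself (notation as above, with P the total likelihood, and using \partial P / \partial p(\varnothing \mid t, u) = \alpha(t, u)\, \beta(t+1, u) from the forward-backward recursions): since L = -\ln P,

\frac{\partial L}{\partial p(\varnothing \mid t, u)} = -\frac{1}{P}\, \frac{\partial P}{\partial p(\varnothing \mid t, u)} = -\frac{\alpha(t, u)\, \beta(t+1, u)}{P}

so the 1/P produced by differentiating the log would be what shows up as the - loglike inside the exp. Please correct me if that reasoning is off.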

q121q commented May 30, 2020

How is this different from the CTC loss?
