KL div v/s xentropy #12
The Hinton distillation paper states:

"The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross entropy with the correct labels."

In https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L13-L15, the first objective function is computed using `kl_div`, which is different from `cross_entropy`: `kl_div` computes (- \sum t log(x/t)), while `cross_entropy` computes (- \sum t log(x)). In general, `cross_entropy` is `kl_div` + `entropy(t)`.

Did I misunderstand something, or did you use a slightly different loss in your implementation?
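For reference, the two-term objective the quote describes is commonly written along these lines in PyTorch. This is only a sketch under assumptions, not the repository's actual code: the function name `distillation_loss` and the default values of `T` and `alpha` are illustrative, and the `T*T` factor follows the paper's suggestion to keep the soft- and hard-target terms on a comparable scale.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Hypothetical sketch of a Hinton-style distillation objective;
    # names and defaults are illustrative, not the repo's exact code.
    # Soft-target term: KL between the temperature-T softened distributions.
    # The T*T factor compensates for the 1/T^2 scaling that the temperature
    # introduces into the soft-target gradients.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * (T * T)
    # Hard-label term: ordinary cross entropy with the correct labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```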
Comments

Technically, the loss would be different, but the gradients would be correct, as the target probabilities are not changing. These two losses are equivalent.
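A quick numerical check of both claims (a minimal sketch; the tensor shapes and seed are arbitrary): with the targets `t` held fixed, `cross_entropy` and `kl_div` differ only by the constant `entropy(t)`, so their gradients with respect to the student logits coincide.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup (shapes arbitrary): student logits and fixed soft targets t.
logits = torch.randn(4, 10, requires_grad=True)
t = F.softmax(torch.randn(4, 10), dim=1)  # fixed targets, no grad

log_x = F.log_softmax(logits, dim=1)

# kl_div expects log-probabilities as input and probabilities as target:
# KL(t || x) = sum_i t_i * (log t_i - log x_i)
kl = F.kl_div(log_x, t, reduction='sum')

# Cross entropy with soft targets: H(t, x) = - sum_i t_i * log x_i
ce = -(t * log_x).sum()

# Entropy of the targets: H(t) = - sum_i t_i * log t_i
ent = -(t * t.log()).sum()

# The identity from the question: cross_entropy = kl_div + entropy(t)
print(torch.allclose(ce, kl + ent))  # True

# entropy(t) does not depend on the logits, so the gradients match.
g_kl, = torch.autograd.grad(kl, logits, retain_graph=True)
g_ce, = torch.autograd.grad(ce, logits)
print(torch.allclose(g_kl, g_ce))  # True
```

Because `entropy(t)` contributes nothing to the gradient, training with either loss produces identical updates whenever the teacher's soft targets are held fixed.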