KL div v/s xentropy #12
The Hinton distillation paper states:

"The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross entropy with the correct labels."

In https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L13-L15, the first objective function is computed using `kl_div`, which is different from `cross_entropy`: `kl_div` computes (- \sum t log(x/t)), while `cross_entropy` computes (- \sum t log(x)). In general, `cross_entropy` is `kl_div` + `entropy(t)`.

Did I misunderstand something, or did you use a slightly different loss in your implementation?
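For reference, the two-term objective the quote describes is commonly written along these lines in PyTorch. This is only a sketch under assumptions, not the repository's actual code: the function name `distillation_loss` and the default values of `T` and `alpha` are illustrative, and the `T*T` factor follows the paper's suggestion to keep the soft- and hard-target terms on a comparable scale.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Hypothetical sketch of a Hinton-style distillation objective;
    # names and defaults are illustrative, not the repo's exact code.
    # Soft-target term: KL between the temperature-T softened distributions.
    # The T*T factor compensates for the 1/T^2 scaling that the temperature
    # introduces into the soft-target gradients.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * (T * T)
    # Hard-label term: ordinary cross entropy with the correct labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```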
Comments

Technically, the loss would be different, but the gradients would be correct, as the target probabilities are not changing. These two losses are equivalent.
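A quick numerical check of both claims (a minimal sketch; the tensor shapes and seed are arbitrary): with the targets `t` held fixed, `cross_entropy` and `kl_div` differ only by the constant `entropy(t)`, so their gradients with respect to the student logits coincide.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup (shapes arbitrary): student logits and fixed soft targets t.
logits = torch.randn(4, 10, requires_grad=True)
t = F.softmax(torch.randn(4, 10), dim=1)  # fixed targets, no grad

log_x = F.log_softmax(logits, dim=1)

# kl_div expects log-probabilities as input and probabilities as target:
# KL(t || x) = sum_i t_i * (log t_i - log x_i)
kl = F.kl_div(log_x, t, reduction='sum')

# Cross entropy with soft targets: H(t, x) = - sum_i t_i * log x_i
ce = -(t * log_x).sum()

# Entropy of the targets: H(t) = - sum_i t_i * log t_i
ent = -(t * t.log()).sum()

# The identity from the question: cross_entropy = kl_div + entropy(t)
print(torch.allclose(ce, kl + ent))  # True

# entropy(t) does not depend on the logits, so the gradients match.
g_kl, = torch.autograd.grad(kl, logits, retain_graph=True)
g_ce, = torch.autograd.grad(ce, logits)
print(torch.allclose(g_kl, g_ce))  # True
```

Because `entropy(t)` contributes nothing to the gradient, training with either loss produces identical updates whenever the teacher's soft targets are held fixed.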