Adjusting softmax function #8
Thanks for commenting!
I think my initial intuition for how to implement it was something similar, but in this case the cross-entropy loss is only being applied over the predicted preference probability. (I was about to post this reply, then thought "Wait, what about the softmax?" - but the softmax has already been applied in order to get to that probability.)
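For reference, the quantities I mean (quoting the paper from memory, so the exact form may be slightly off) are the preference probability and the cross-entropy loss over it:

$$\hat{P}\big[\sigma^1 \succ \sigma^2\big] = \frac{\exp \sum_t \hat{r}(o_t^1, a_t^1)}{\exp \sum_t \hat{r}(o_t^1, a_t^1) + \exp \sum_t \hat{r}(o_t^2, a_t^2)}$$

$$\mathrm{loss}(\hat{r}) = -\sum_{(\sigma^1, \sigma^2, \mu) \in D} \mu(1) \log \hat{P}\big[\sigma^1 \succ \sigma^2\big] + \mu(2) \log \hat{P}\big[\sigma^2 \succ \sigma^1\big]$$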
Hm... There is this mention in the article:
So I'm thinking, does this have a significant impact on the loss from each entry's estimate? Also, the article actually reads "rather than applying a softmax directly...". Does that imply that they adjust it before the softmax? I couldn't get that to make sense in my mind, so that's why I assumed it was like we both seem to have thought.
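To make concrete what I mean by adjusting before vs. after the softmax, here's a toy comparison (NumPy, made-up reward values; the "before" variant is only my guess at what that could even mean):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

epsilon = 0.1
rewards = np.array([3.0, -1.0])  # made-up predicted rewards for the two segments

# Interpretation A: apply the softmax, then mix with a uniform distribution
p_after = (1 - epsilon) * softmax(rewards) + epsilon * 0.5

# Interpretation B: adjust *before* the softmax (one guess: shrink the logits)
p_before = softmax((1 - epsilon) * rewards)

print(softmax(rewards))  # ~[0.982, 0.018]
print(p_after)           # ~[0.934, 0.066]  -> always stays inside [0.05, 0.95]
print(p_before)          # ~[0.973, 0.027]  -> can still get arbitrarily close to 0/1
```

Mixing after the softmax bounds the probability away from 0 and 1; shrinking the logits beforehand does not.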
Sooooo I asked Jan Leike nicely and it turns out he still had a copy of the original code for the paper lying around :) Searching for '0.1' and '0.9' in the codebase, the only relevant thing I could find is:

```python
def __init__(self, *args, epsilon=0.1, **kwargs):
    self.epsilon = epsilon
    ...

def p_preferred(self, obs1=None, act1=None, obs2=None, act2=None):
    reward = [prediction1, prediction2]
    mean_rewards = tf.reduce_mean(tf.stack(reward, axis=1), axis=2)
    p_preferred_raw = tf.nn.softmax(mean_rewards)
    return (1 - self.epsilon) * p_preferred_raw + self.epsilon * 0.5
```

This is the approach you originally suggested, which, yeah, I'm pretty sure doesn't affect gradients... so either a) this just isn't necessary and they never found out because they didn't do an ablation, or b) it only makes a difference at inference time in situations where, as you say, the reward predictor is particularly sure. (Though... considering that the most important inference function is predicted reward, rather than predicted preference, I lean towards a). *shrug* Everyone makes mistakes. Or it serves some completely different purpose that's not occurring to me right now...)
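To put a rough number on (b): with epsilon = 0.1 the returned probability can never leave [0.05, 0.95], so the mix only really changes anything once the raw softmax is already very confident. A quick NumPy stand-in for the snippet above (made-up reward values):

```python
import numpy as np

epsilon = 0.1

def p_preferred(mean_rewards):
    # Same arithmetic as the TF snippet above, minus the TF plumbing:
    # softmax over the two mean rewards, then mix with a uniform distribution.
    e = np.exp(mean_rewards - np.max(mean_rewards))
    p_raw = e / e.sum()
    return (1 - epsilon) * p_raw + epsilon * 0.5

print(p_preferred(np.array([0.5, 0.0])))   # ~[0.61, 0.39]  (raw softmax ~[0.62, 0.38]: barely moves)
print(p_preferred(np.array([10.0, 0.0])))  # ~[0.95, 0.05]  (raw softmax ~[0.99995, 0.00005]: clamped)
```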
Okay! Nice that you took the time. I'm playing around with this a bit now, so I guess I'll give you a comment if I find something substantial related to this. I love Sherlock btw. Made me laugh when I saw your meme.
After talking about this with a friend who has a stronger background in statistics, my understanding is this:
And the adjusted probability should then be:
I'm not completely sure this is the correct way, but I think so. Great work implementing this btw!
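(For reference, if it mirrors the snippet above, the adjusted probability would presumably be

$$\hat{P}\big[\sigma^1 \succ \sigma^2\big] = (1 - \epsilon)\,\frac{\exp \sum_t \hat{r}(o_t^1, a_t^1)}{\exp \sum_t \hat{r}(o_t^1, a_t^1) + \exp \sum_t \hat{r}(o_t^2, a_t^2)} + \frac{\epsilon}{2}$$

with epsilon = 0.1, i.e. the "10% chance that the human responds uniformly at random" from the paper. That's just my reading of the code above, though, not something I've verified against the original implementation.)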