StereoSet benchmark for GPT2 #8

Lj1ang · 2024-08-24T05:05:11Z

I noticed that GPT2Tokenizer, which doesn't have a mask_token, is used when evaluating GPT2. Will this impact the evaluation results?
I think I should add one manually, but I'm unsure which token to add.

jacqueline-he · 2024-08-25T00:30:40Z

Hi, thanks for your interest in our work! :)

That’s a great question! GPT-2 is trained auto-regressively and therefore cannot be evaluated in the same manner as a masked language model. Instead of evaluating as a fill-in-the-blank problem, it's recommended that you compute the probability of the sentence when the blank is filled with a stereotypical term, and then with an anti-stereotypical term, and score based on whichever is more likely.

I would defer to Section 6.2 in the original StereoSet paper for more details.
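
A minimal sketch of the comparison described above, not the repository's or the paper's actual implementation. It assumes Hugging Face `transformers` and `torch` are installed; `gpt2_log_likelihood` and `pick_more_likely` are hypothetical helper names. Summing token log-probabilities without length normalization is also an assumption, so check Section 6.2 for the exact scoring used there.

```python
from typing import Callable, Dict


def gpt2_log_likelihood(sentence: str, model_name: str = "gpt2") -> float:
    """Total log-probability of `sentence` under a causal LM.

    Hypothetical helper: needs `torch` and `transformers`, imported lazily
    so the comparison logic below can be used without them.
    """
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).eval()

    # Prepend the BOS token so the first real token is also scored.
    # No mask_token is needed: the whole filled-in sentence is scored.
    ids = tokenizer.encode(tokenizer.bos_token + sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits

    # The logits at position t predict token t + 1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)

    # Assumption: plain sum, no length normalization.
    return token_log_probs.sum().item()


def pick_more_likely(candidates: Dict[str, str],
                     score: Callable[[str], float]) -> str:
    """Return the label whose filled-in sentence gets the highest score."""
    return max(candidates, key=lambda label: score(candidates[label]))
```

For one example, build a dict such as `{"stereotype": <sentence with the stereotypical term>, "anti-stereotype": <sentence with the anti-stereotypical term>}` and call `pick_more_likely(candidates, gpt2_log_likelihood)`. In practice you would load the model once rather than on every call; it is reloaded here only to keep the sketch short.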

Lj1ang · 2024-08-27T21:54:25Z

Thanks for your reply!
