rescale and specify certain model #46
Comments
Hi @areejokaili, sorry for the confusion. The code below should meet your use case.
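A sketch of the kind of call meant here, assuming the roberta-large model and layer 17 implied by the hash `roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)`:

```python
from bert_score import score

# Pass the model name and layer explicitly; the hash string is only an
# identifier of the configuration, not something you pass back in.
(P, R, F), hash_code = score(
    preds, golds,
    lang="en",
    model_type="roberta-large",   # model named in the hash
    num_layers=17,                # the "L17" part of the hash
    rescale_with_baseline=True,
    return_hash=True,
)
print(hash_code)
```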
Hi @Tiiiger, thanks for the quick reply.
It works now, but I'm getting different scores than before. I was doing my own multi-reference scoring previously, so maybe that explains it.
Were you using baseline rescaling before? According to the hash, you were not.
This is what I used before.
Cool, that looks correct. Let me know if you have any further questions.
Hi @Tiiiger, sorry for asking again, but I ran a dummy test computing the similarity between 'server' and 'cloud computing' in two different environments. The first environment has bert-score 0.3.0 and transformers 2.5.0 and gives scores 0.379, 0.209, 0.289. The second environment has bert-score 0.3.2 and transformers 2.8.0 and gives -0.092, -0.167, -0.128.
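A minimal sketch of that dummy test, assuming a single candidate/reference pair with baseline rescaling:

```python
from bert_score import score

# Rescaled scores are not bounded to [0, 1] and can be negative,
# which is why the second environment reports values below zero.
P, R, F = score(["server"], ["cloud computing"], lang="en", rescale_with_baseline=True)
print(P.item(), R.item(), F.item())
```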
Hi @areejokaili, thank you for letting me know. I suspect that there could be some bugs in the newer version, and I would love to fix those. I am looking into this.
Hi, I quickly tried a couple of environments and compared the results.
I believe this is due to an update in the RoBERTa tokenizer: running the tokenizer on the same input under the two transformers versions produces different results, as shown in the sketch below.
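A minimal sketch of that check, assuming the Hugging Face `RobertaTokenizer` (run it once under each transformers version and compare):

```python
from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-large")
# The tokenization of the same input changed between transformers
# releases, which shifts the embeddings that BERTScore compares.
print(tok.tokenize("cloud computing"))
print(tok.encode("cloud computing", add_special_tokens=True))
```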
I encourage you to check out issue 2778 to understand this change. So, as I understand it, this is not a change in our software. If you want to keep the same results as before, you should downgrade transformers. Again, thank you for the heads-up. I'll add a warning to our README.
Hi @Tiiiger,

```python
from bert_score import score

cands = ['I like lemons.']
refs = [['I am proud of you.', 'I love lemons.', 'Go go go.']]

(P, R, F), hash_code = score(cands, refs, lang="en", rescale_with_baseline=True, return_hash=True)
P, R, F = P.mean().item(), R.mean().item(), F.mean().item()
print(">", P, R, F)
print("manual F score:", (2 * P * R / (P + R)))
```

Output:

```
> 0.9023454785346985 0.9023522734642029 0.9025075435638428
manual F score: 0.9023488759866588
```

Do you know why the F score directly from the method is different than when I compute it manually?
Hi @areejokaili, the reason is that you are using `rescale_with_baseline=True`: P, R, and F are each rescaled with their own baseline, so after rescaling F is no longer exactly the harmonic mean of P and R.
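A small illustration with made-up baseline numbers (not the real BERTScore baselines) of why separately rescaled P, R, and F no longer satisfy F = 2PR / (P + R):

```python
# Hypothetical raw scores and baselines, purely for illustration
P_raw, R_raw = 0.902, 0.903
F_raw = 2 * P_raw * R_raw / (P_raw + R_raw)   # harmonic mean holds for the raw scores

bP, bR, bF = 0.85, 0.84, 0.845                # made-up baseline values

def rescale(x, b):
    return (x - b) / (1 - b)

P_res, R_res, F_res = rescale(P_raw, bP), rescale(R_raw, bR), rescale(F_raw, bF)

print(F_res)                                # rescaled F, as reported
print(2 * P_res * R_res / (P_res + R_res))  # F recomputed from rescaled P and R -- differs
```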
Thanks @felixgwu,

```python
from bert_score import score

cands = ['I like lemons.', 'cloud computing']
refs = [['I am proud of you.', 'I love lemons.', 'Go go go.'],
        ['calculate this.', 'I love lemons.', 'Go go go.']]
print("number of cands and ref are", len(cands), len(refs))

(P, R, F), hash_code = score(cands, refs, lang="en", rescale_with_baseline=False, return_hash=True)
P, R, F = P.mean().item(), R.mean().item(), F.mean().item()
print(">", P, R, F)
print("manual F score:", (2 * P * R / (P + R)))
```

Output:

```
> 0.9152767062187195 0.9415446519851685 0.9280155897140503
manual F score: 0.9282248763666026
```

Appreciate the help,
Hi
Thank you for making your code available.
I have used your score before the last update (before multi-refs were possible and before the scorer object). I used to record the hash of the model to make sure I always get the same results.
With the new update, I'm struggling to find how to set a specific model and also rescale.
For example, I would like to do something like this:

```python
out, hash_code = score(preds, golds, model_type="roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)", rescale_with_baseline=True, return_hash=True)
```

`roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)` is the hash I got from my earlier runs a couple of months ago.
Appreciate your help
Areej