Small bugfix for semantic similarity evaluation #1079
Handling of non-Latin OOV pairs logging in evaluate_word_pairs function.
Please ignore the Python 2.6 and Python 3.3 build errors. They are due to the latest PyEMD update. The 3.4 and 3.5 errors are due to this PR, though.
Fixed.
This doesn't look right. What "OOV pairs"? What exactly fails and how?
@akutuzov Python 2.6 and 3.3 failures should be fixed now. Please kindly merge in the changes from develop to this PR.
@piskvorky The problem arises when the dataset contains non-Latin word pairs that are absent from the model. The logger prints these pairs at debug level, and that fails in Python 2. To fix it, I explicitly state that the whole string is Unicode. For Python 3 it's irrelevant, as far as I can see.
Ah, I see what you mean -- you're talking about non-ASCII (not non-Latin) characters in [...] I'm not very familiar with the [...] I'd prefer the fix to use lazy argument formatting: [...] Independently: why [...]
As per @piskvorky's comment. Also added information about an example dataset included in the Gensim distribution.
@tmylk please review before merging...
Reviewed and tested simple formatting
@akutuzov can you please fix the comments above?
@akutuzov In particular @piskvorky is referring to lazy formatting. Lazy: [...] Eager: [...]
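For readers unfamiliar with the distinction, here is a minimal sketch of lazy versus eager log formatting with Python's standard `logging` module. It is not the code from this PR: the logger name, the message text, and the `CountingStr` helper are hypothetical and exist only to make the difference observable. The sketch assumes the logger is set to INFO, so debug messages are filtered out.

```python
import logging


class CountingStr:
    """Hypothetical helper: counts how often it is rendered to text."""

    def __init__(self, value):
        self.value = value
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return self.value


logger = logging.getLogger("lazy_formatting_demo")  # illustrative name
logger.setLevel(logging.INFO)  # DEBUG records are discarded
logger.addHandler(logging.NullHandler())

a, b = CountingStr("кошка"), CountingStr("собака")

# Eager: the % operator builds the message before logger.debug() is even
# called, so both arguments are rendered although the record is dropped.
logger.debug("Skipping pair with OOV words: %s, %s" % (a, b))
eager_renders = a.renders + b.renders

# Lazy: the arguments are passed separately; logging only formats the
# message if a record is actually emitted, which does not happen here.
logger.debug("Skipping pair with OOV words: %s, %s", a, b)
lazy_renders = a.renders + b.renders - eager_renders
```

With the lazy form, the cost of formatting (and any encoding problem that formatting might trigger) is only incurred when the debug level is actually enabled.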
As requested in piskvorky#1079.
Fixed in #1084
Handling of non-Latin OOV pairs logging in evaluate_word_pairs function (they were not converted to UTF-8, which caused logger to fail).