Skip to content

Small bugfix for semantic similarity evaluation #1079

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 8, 2017

Conversation

akutuzov
Copy link
Contributor

@akutuzov akutuzov commented Jan 6, 2017

Handling of non-Latin OOV pairs logging in evaluate_word_pairs function (they were not converted to UTF-8, which caused logger to fail).

Handling of non-Latin OOV pairs logging in evaluate_word_pairs function.
@tmylk
Copy link
Contributor

tmylk commented Jan 6, 2017

Please ignore Python 2.6 and Python 3.3 build errors. They are due to latest PyEMD update.

The 3.4 and 3.5 errors are due to this PR though.

@akutuzov
Copy link
Contributor Author

akutuzov commented Jan 7, 2017

Fixed.

@piskvorky
Copy link
Owner

piskvorky commented Jan 8, 2017

This doesn't look right. What "OOV pairs"? pairs is a filesystem path (bad variable name, but that's a separate issue).

What exactly fails and how?

@tmylk
Copy link
Contributor

tmylk commented Jan 8, 2017

@akutuzov Python 2.6 and 3.3 failures should be fixed now. Please kindly merge in the changes from develop to this PR.

@akutuzov
Copy link
Contributor Author

akutuzov commented Jan 8, 2017

@piskvorky The problem arises when we have non-Latin word pairs in the dataset, and they are absent in the model. The logger prints these pairs on debug level here, and it fails in Python 2. To fix it, I explicitly state that the whole string is Unicode. For Python 3 it's irrelevant, as far as I can see.
The pairs variable indeed is the path to the dataset file (I agree that's not the best variable name). The logger outputs this path in case there are incorrect lines in the dataset. Again, in this PR I explicitly state that this warning is Unicode, to handle rare cases when the path contains non-Latin characters.

@piskvorky
Copy link
Owner

Ah, I see what you mean -- you're talking about non-ASCII (not non-Latin) characters in line, which is a unicode string.

I'm not very familiar with the .format fancy formatting, but normal "%s" should output them (upcast to unicode) just fine.

I'd prefer the fix to use lazy argument formatting: logger.info("%s", unicode_string). It's easier to read and clearer (and doesn't fail).

Independently: why pairs (the filesystem path) was being converted with .encode('utf8') I don't know. Unless there's an explicit comment that documents this (and there isn't), I'd consider it a bug.

As per @piskvorky comment.
Also, added information about an example dataset included in Gensim distribution.
@tmylk tmylk merged commit 8ae570b into piskvorky:develop Jan 8, 2017
@piskvorky
Copy link
Owner

@tmylk please review before merging...

@tmylk
Copy link
Contributor

tmylk commented Jan 8, 2017

Reviewed and tested simple formatting

@piskvorky
Copy link
Owner

@akutuzov can you please fix the comments above?

@tmylk
Copy link
Contributor

tmylk commented Jan 9, 2017

@akutuzov In particular @piskvorky is referring to lazy formatting.

Lazy: logger.info("%s", unicode_string).

Eager: logger.info("%s" % unicode_string).

akutuzov added a commit to akutuzov/gensim that referenced this pull request Jan 10, 2017
@akutuzov
Copy link
Contributor Author

Fixed in #1084

tmylk pushed a commit that referenced this pull request Jan 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants