Small bugfix for semantic similarity evaluation #1079

akutuzov · 2017-01-06T22:12:39Z

Fixes logging of non-Latin OOV word pairs in the evaluate_word_pairs function: the pairs were not converted to UTF-8, which caused the logger to fail.

tmylk · 2017-01-06T23:28:52Z

Please ignore the Python 2.6 and Python 3.3 build errors. They are due to the latest PyEMD update.

The 3.4 and 3.5 errors are due to this PR though.

akutuzov · 2017-01-07T13:32:28Z

Fixed.

piskvorky · 2017-01-08T03:41:07Z

This doesn't look right. What "OOV pairs"? pairs is a filesystem path (bad variable name, but that's a separate issue).

What exactly fails and how?

tmylk · 2017-01-08T06:11:04Z

@akutuzov Python 2.6 and 3.3 failures should be fixed now. Please kindly merge in the changes from develop to this PR.

akutuzov · 2017-01-08T10:51:16Z

@piskvorky The problem arises when the dataset contains non-Latin word pairs that are absent from the model. The logger prints these pairs at debug level here, and that call fails in Python 2. To fix it, I explicitly state that the whole string is Unicode. For Python 3 it's irrelevant, as far as I can see.
The pairs variable indeed is the path to the dataset file (I agree that's not the best variable name). The logger outputs this path in case there are incorrect lines in the dataset. Again, in this PR I explicitly state that this warning is Unicode, to handle rare cases when the path contains non-Latin characters.

piskvorky · 2017-01-08T13:03:35Z

Ah, I see what you mean -- you're talking about non-ASCII (not non-Latin) characters in line, which is a unicode string.
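To make the non-ASCII versus non-Latin distinction concrete, here is a minimal sketch (not gensim code; the is_ascii helper and the sample words are hypothetical). It assumes the relevant question is whether a string survives the ASCII codec, which is the default codec Python 2 uses for implicit str/unicode conversion:

```python
def is_ascii(text):
    """Return True if text can be encoded with the ASCII codec."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# "café" is written in Latin script but is still not ASCII,
# so it is rejected by the ASCII codec just like the Cyrillic word.
results = {word: is_ascii(word) for word in ["cat", "café", "кошка"]}
print(results)
```

Only the plain English word passes, which is why "non-ASCII" is the more accurate description of the affected input.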

I'm not very familiar with the .format fancy formatting, but normal "%s" should output them (upcast to unicode) just fine.

I'd prefer the fix to use lazy argument formatting: logger.info("%s", unicode_string). It's easier to read and clearer (and doesn't fail).

Independently: why pairs (the filesystem path) was being converted with .encode('utf8') I don't know. Unless there's an explicit comment that documents this (and there isn't), I'd consider it a bug.
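A short illustration of why pre-encoding the path is suspicious, as a sketch only: the logger name, message text, and dataset_path below are hypothetical and not taken from gensim, and the behaviour shown is that of Python 3, where encode() returns bytes.

```python
import io
import logging

# Hypothetical logger writing to an in-memory stream so the output can be inspected.
stream = io.StringIO()
logger = logging.getLogger("encode_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

dataset_path = "данные/wordsim.tsv"  # hypothetical non-ASCII filesystem path

# Passing the text string directly: the path is logged as readable text.
logger.warning("skipping invalid line in %s", dataset_path)

# Pre-encoding to bytes: Python 3 logs the bytes repr (b'...') instead.
logger.warning("skipping invalid line in %s", dataset_path.encode("utf8"))

lines = stream.getvalue().splitlines()
print(lines)
```

The first message contains the path as written; the second contains an escaped bytes literal, which is harder to read and gains nothing.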

piskvorky · 2017-01-08T15:01:26Z

@tmylk please review before merging...

tmylk · 2017-01-08T15:26:30Z

Reviewed and tested simple formatting

piskvorky · 2017-01-09T02:37:04Z

@akutuzov can you please fix the comments above?

tmylk · 2017-01-09T09:32:53Z

@akutuzov In particular @piskvorky is referring to lazy formatting.

Lazy: logger.info("%s", unicode_string).

Eager: logger.info("%s" % unicode_string).
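The difference can be observed directly. The following is a minimal sketch, not the gensim code: Counted is a hypothetical helper that counts how often it is rendered to text, and the logger name and message are made up for the demonstration.

```python
import logging


class Counted:
    """Hypothetical helper that counts how often it is converted to a string."""

    def __init__(self, text):
        self.text = text
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return self.text


logger = logging.getLogger("lazy_demo")
logger.setLevel(logging.WARNING)  # debug messages are suppressed
logger.propagate = False
logger.addHandler(logging.NullHandler())

lazy = Counted("кошка\tсобака")
eager = Counted("кошка\tсобака")

# Lazy: the argument is only formatted if the record is actually emitted.
logger.debug("skipping OOV pair: %s", lazy)

# Eager: the % operator formats the string before logger.debug is even called.
logger.debug("skipping OOV pair: %s" % eager)

print(lazy.calls, eager.calls)
```

With the level set to WARNING, the lazy call never formats its argument, while the eager call does the formatting work and then discards the result.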

akutuzov · 2017-01-10T01:47:21Z

Fixed in #1084

Update keyedvectors.py

a161932

Handling of non-Latin OOV pairs logging in evaluate_word_pairs function.

Encoding fixes for different Python versions

1cd81ed

Merge remote-tracking branch 'upstream/develop' into patch-1

2c0f6b8

Lazy formatting in evaluate_word_pairs logging

1345952

As per @piskvorky comment. Also, added information about an example dataset included in Gensim distribution.

tmylk merged commit 8ae570b into piskvorky:develop Jan 8, 2017

akutuzov added a commit to akutuzov/gensim that referenced this pull request Jan 10, 2017

Lazy formatting in evaluate_word_pairs

da623e7

As requested in piskvorky#1079.

akutuzov mentioned this pull request Jan 10, 2017

Lazy formatting in evaluate_word_pairs #1084

Merged

tmylk pushed a commit that referenced this pull request Jan 10, 2017

Lazy formatting in evaluate_word_pairs (#1084)

9112ee7

As requested in #1079.
