Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement shrink_windows argument for Word2Vec. #3169

Merged
merged 14 commits into from
Jun 29, 2021

Conversation

M-Demay
Copy link
Contributor

@M-Demay M-Demay commented Jun 9, 2021

Fixes #3159.
The idea is to allow the user to use a window with fixed-size, contrary to the current default behaviour which is to pick the window size in [1, window] uniformly at random. We want this alternative behaviour for a project.
We did it by adding a boolean parameter reduced_windows. For compatibility, we put it at the end of the arguments list, default to False.
The user can set it directly in the constructor of the class Word2Vec, or at any further step (for example, to retrain the model in a second step), since we propagated it to the downstream calls.

Co-Authored-By: Mathis Demay <mathis.demay.md@gmail.com>
gensim/models/word2vec_inner.pyx Outdated Show resolved Hide resolved
gensim/models/word2vec_inner.pyx Outdated Show resolved Hide resolved
gensim/models/word2vec_corpusfile.pyx Show resolved Hide resolved
gensim/models/word2vec.py Outdated Show resolved Hide resolved
gensim/models/word2vec.py Outdated Show resolved Hide resolved
gensim/models/word2vec.py Outdated Show resolved Hide resolved
gensim/models/word2vec.py Outdated Show resolved Hide resolved
@pandrey-fr
Copy link
Contributor

Tests are failing on old models' reloading due to the added shrink_windows attribute. I will try to understand where we need to edit the code to add the attribute, although I do not quite understand how old models' reloading work - apparently it does not run things through the current version's object instantiation process? Otherwise the newly-introduced parameter should be added to the loaded models (with a default value of True that implements the same behaviour as the legacy one).

@gojomo
Copy link
Collaborator

gojomo commented Jun 15, 2021

In general the right place to path older models, upon load, to ensure they have all currently-required properties is _load_specials() - and a new parameter should be assigned the value that would preserve the prior behavior. See for example all the current check-and-add-if-missing logic:

https://github.com/RaRe-Technologies/gensim/blob/7b6e73a7dd2bf6e74c6b7099c02fe49d282156ea/gensim/models/word2vec.py#L1955-L1960

Separately: it occurs to me that it should be checked that the new option will work in Word2Vec's closely-related subclasses, Doc2Vec and FastText, as well. (It will likely require similar changes in their .pyx files, and perhaps some attention to their __init__() methods.)

@pandrey-fr
Copy link
Contributor

Thank you for clarifying things about old models' reloading; it is now fixed.

I also deployed the parameter to Doc2Vec and FastText (both exposing it as part of their Python API and implementing similar tweaks to the backend pyx files as for word2vec). Otherwise the behaviour would have stayed coherent with the legacy shrink_windows=True, but it does feel right to have it properly implemented by children models!

On my Linux machine all tests passed, although I am wondering whether it would be appropriate to tweak those as well, so that some test models may also run with the non-default shrink_windows=False parameter? Apart from the code compiling I did check manually that it worked for Word2Vec but not for the other models. Not that I have actual doubts at this point about the implementation working (although I may of course be mistaken), but perhaps adding tests would be suited for future maintainability.
Do you have any opinion on that matter, and if you do want tests to be added, could you please tell me what you would expect? The simplest seems to be to add specific tests just to check that instantiation/training runs with the non-default parameter value, but I can't think of a simple way to additionally assert that the expected backend behaviour is indeed implemented.

@gojomo gojomo changed the title Implemented reduced_windows argument for Word2Vec. Implement shrink_windows argument for Word2Vec. Jun 16, 2021
@gojomo
Copy link
Collaborator

gojomo commented Jun 16, 2021

Yes, it'd be good to ensure at least one test method, for each class, tries the non-default shrink_windows=False code path.

If there were an easy way to test that it's having the desired effect, that'd be great - but it's a subtle thing to test. Its effects on final vectors aren't easily characterizable. The tangible external effect will be: training runs about twice as long. But, running some training twice, once each way, timing it, and asserting one was "around" twice as long seems a bit flaky & low-value to me, compared to the amount of runtime it'd add to the automated testing.

Because of that, I think careful review of the source for intent, along with the old path definitely still working as before, and the new path not erroring, is enough for the unit-testing of this.

It might make sense to mark the doc-comment for such a new, niche feature as: "New in 4.1. Experimental."

* `c.reduced_windows[:] = 0` syntax is not supported;
  as a consequence, all zero-value assignments due to
  `shrink_windows=False` have been rewritten as `for`
  loops.
* Those changes have now been tested (cf. next commit).
* NOTE: another way to proceed could be to initialize
  the `reduced_windows` array with zeros as values; it
  would then only be altered when `shrink_windows=True`.
@pandrey-fr
Copy link
Contributor

Alright, I just pushed three additional commits:

(1) Backend changes; as it turns out, the previously-suggested use of c.reduced_windows[:] = 0 instead of a for loop is not working (it compiled but failed); I also tried using c.reduced_windows.fill(0) but it turns out that in the pyx files, the arrays are treated as lists instead of numpy arrays, so either approach failed. I switched back to for loops; it adds a few lines of (readable) code but it does work.

(2) Additional tests, following @gojomo's recommendations. I actually added 4 tests to both Word2Vec, Doc2Vec and FastText so as to test all underlying code branches (cbow and skipgram, on an iterable corpus or on a corpus file). If you would rather remove some of these to limit the extra testing time, feel free to trim some off.

(3) As you suggested, I added the "experimental" status line to the parameter's docstring in all three models.

On the side, I note that:

  • Most Doc2vec tests labeled to be running from a corpus file look like they actually use list_corpus instead. I might be wrong, but it might be worth taking a look and fixing things (in a separate PR) if needed.
  • FastText tests could use some refactoring. To avoid copy/pasting long blocks of redundant code, I tweaked some of the existing ones to simply call them a second time passing shrink_window=False in the ones that this PR adds.

@piskvorky
Copy link
Owner

I switched back to for loops

This is desirable, thanks. The for-loop is clearer and I see you also made it more efficient (less looping).

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I'm late to the party, so I see most of the issues have been ironed out. I did manage to find a few nitpicks and left you some comments. Please have a look.

@@ -365,8 +365,12 @@ def train_document_dbow(model, doc_words, doctag_indexes, alpha, work=None,

if c.train_words:
# single randint() call avoids a big thread-synchronization slowdown
for i, item in enumerate(model.random.randint(0, c.window, c.document_len)):
c.reduced_windows[i] = item
if model.shrink_windows:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit repetitive.

Wouldn't it be better to define a model.initialize_reduced_windows method that does this, and call it when needed?

def initialize_reduced_windows(self, ...):
  if model.shrink_windows:
    for i, item in ...:
        ...
  else:
    ...

That way, if we find a logic error in the initialization code, we don't have to remember to fix it in a half a dozen other places (e.g. https://github.com/RaRe-Technologies/gensim/pull/3169/files#r649243075).

We should probably also do a bounds check on c.reduced_windows to ensure that it's at least document_len elements long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you on the potential for refactored code that would ease future maintenance and updates. That being said, I actually built on the existing, repetitive code, so in my humble opinion this could be handled as a distinct PR (this one would add a feature, while a second one would enhance the validated, working code).

I also think that your suggestion of a model.initialize_reduced_windows is not entirely suitable here:
If I am not mistaken, in this code, c is not model: it is a special C structure (of a different class depending on the trained model class: word2vec, doc2vec or fasttext), and the reduced_windows PyArray is initialized to have the proper size, so there should not be a need to do a bounds check.
I do not know how we could implement a common method (or function) that would also fill the array with either deterministic or random values; but I do agree with you that it could be worth it.

To be honest I lack experience with cython, so that I do not feel entirely at ease with diving into this refactoring. I am willing to give it a try at some point if you want, but would prefer it to be a distinct effort (and PR) from the current one - partly because Mathis and I are pushing the shrink_windows feature due to our needing it for a current research project at work, while contributing to refactoring the code base would be something I would (willingly) do on my time off.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, there's a lot of pre-existing cut & paste duplication between all these related algorithms & modes that could be refactored. I think that'd be OK to note-and-defer-for-later, as either an issue or a FIXME comment in the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, we can deal with this separately.

gensim/models/word2vec.py Outdated Show resolved Hide resolved
gensim/models/word2vec.py Outdated Show resolved Hide resolved
gensim/test/test_fasttext.py Show resolved Hide resolved
@@ -497,12 +497,12 @@ def test_sg_hs_training(self):
overlap_count, 2,
"only %i overlap in expected %s & actual %s" % (overlap_count, expected_sims_words, sims_gensim_words))

def test_sg_hs_training_fromfile(self):
def test_sg_hs_training_fromfile(self, shrink_windows=True):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.pytest.org/en/6.2.x/example/parametrize.html

Suggested change
def test_sg_hs_training_fromfile(self, shrink_windows=True):
@pytest.mark.parametrize('shrink_windows', [True, False])
def test_sg_hs_training_fromfile(self, shrink_windows):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed two commits making use of that feature. The first one only applied to the FastText tests, as they were the one you attached this review to. The second one applies to the other two test sets (wor2vec and doc2vec); if you feel like the latter should not have been changed, we can discard this commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: actually, pytest.mark.parametrize does not work as expected on methods (although it does on functions). I will try to find a way to make it work, otherwise I will undo the last two commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, in fact it probably will not work due to the test classes being unittest.TestCase subclasses.

https://docs.pytest.org/en/6.2.x/unittest.html

The following pytest features do not work [in unittest.TestCase subclasses], and probably never will due to different design philosophies: Fixtures (except for autouse fixtures, see below); Parametrization; Custom hooks;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am revoking the last two commits as a consequence.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. I didn't know pytest parameterization is incompatible with unittest.TestCase.

Luckily, in this particular case, the unittest.TestCase class isn't doing anything. I'll get parameterization working in a separate commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we both learned something :)

@mpenkov mpenkov merged commit 2b9b1b3 into piskvorky:develop Jun 29, 2021
@mpenkov
Copy link
Collaborator

mpenkov commented Jun 29, 2021

Merged. Thank you @M-Demay !

mpenkov added a commit that referenced this pull request Jun 29, 2021
The repo wasn't accepting maintainer commits, so I'm taking care of this
here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable to master the window size variations during training
5 participants