-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining #2944
[MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining #2944
Conversation
cd2e326
to
52adad8
Compare
The paths for first, or post-vocab-expansion, allocation/initialization of There's a chance this also resolves very-old Doc2Vec-segfault #1019 - the test code there doesn't trigger a segfault anymore, but maybe it was other recent changes that fixed it, or maybe it's just become harder to trigger. Making the But, at the moment the prior support for memmapped vector-arrays hasn't yet been maintained in this refactor, and the |
When the larger #2939 is merged, I propose rebasing this on that & applying - it fixes the original goal, and then some. But, mmap functionality is currently broken in |
#2939 is merged now. @gojomo will you be able to finish this & the broken I'd love to release this week. Only the documentation items in #2960 (for me) and this ticket (for you, hopefully) are left now. Plus the decision around supporting pre-3.8.3 models, but that's best discussed live – see email. |
2387b5d
to
a21cad6
Compare
Subject to the provisos...
...this is ready-for-final-review & merging! |
|
||
def prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=REAL): | ||
"""Return a numpy array of the given shape. Reuse prior_vectors values instance or values | ||
to extent possible. Initialize new values randomly if requested.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be a public API function. We should document the parameters so they appear in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It ultimately may not be, pending the mmap work & other initialization clean-up, which might also jostle the internal names a bit. (At the moment, this is only called from resize_vectors()
which may be the preferable public-entry point, because outside callers are less-likely to have a prior_vectors
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but this stuff will show up in the docs, right?
I can think of several better ways forward:
- Make the docstring a code-comment so it doesn't show up in the docs
- Mark the function as internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the PR description up to date? W.r.t. functionality contained within, and issues fixed by it.
PR titles and descriptions is what we have to go with for release notes. Plus that's what interested users see when they click through for more details on a change.
267594d
to
dda970e
Compare
Title & top-comment updated to describe what's addressed. Any lingering TODOs/doc-comment/naming stuff will be addressed in a followup PR related to #2955 or others. |
So, that one distant test-failing is a real head-scratcher, as noted in the code-comment I've added in 1edbb4c: rng = np.random.default_rng(seed=seed) # use new instance of numpy's recommended generator/algorithm
# FIXME: `uniform` passes all tests, but generates temporary double-sized np.float64 array,
# then cast-down ito right-sized np.float32, which means momentary 3x RAM usage on the model's
# largest structure (often GB in size)
new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(dtype)
# Meanwhile, this alternative, which by docs/reasoning/visual-inspection should be equivalent
# while never creating the unneeded oversized np.float64 array, passes all *2Vec class
# functional tests, but mysteriously (but reliably!) fails one obscure barely-sensible test
# of a fringe downstream functionality: `TestBackMappingTranslationMatric.test_infer_vector`.
# I'd adjust or jettison that test entirely *except* that the failure is *so* reliable, and
# *so* mysterious, that it may be warning of something very subtle. So for now, very briefly,
# sticking with the RAM-wasteful-but-all-tests-passing approach above, TODO debug/fix ASAP.
# new_vectors = rng.random(target_shape, dtype=dtype) # [0.0, 1.0)
# new_vectors *= 2.0 # [0.0, 2.0)
# new_vectors -= 1.0 # [-1.0, 1.0)
new_vectors /= vector_size So, in order to get this PR mergeable, I've reverted the code to what works-but-is-inefficient, with that blaring comment commitment to remedy as part of the next PR. (The inefficiency is what |
4de243e
to
1edbb4c
Compare
On further investigation, the flimsiness of the failing test is the real problem, and it was only passing in the 1st place via lucky seeding. (Over a range of seeds, either code alternative makes that test fail 20-30% of the time.) So I'll disable that test, & created #2977 to eventually – not at all urgently – clean up the related classes/docs. |
Thanks! Merge will have to wait for @mpenkov's review, but I already branched off here for my tickets, so no more rebasing / force-pushing please (normal merges are OK). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor comments. Please have a look @gojomo .
gensim/models/keyedvectors.py
Outdated
Vector dimensions will default to `np.float32` (AKA `REAL` in some Gensim code) unless | ||
another type is provided here. | ||
mapfile_path : string, optional | ||
TODO: UNDER CONSTRUCTION / SUBJECT TO CHANGE - pending mmap work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should we do about this parameter? I don't think leaving a TODO in publicly visible documentation is a good look.
TODO: UNDER CONSTRUCTION / SUBJECT TO CHANGE - pending mmap work | |
Currently unimplemented. |
But really, if we're not currently using this parameter, let's just get rid of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my comments 4 days ago, I'm not proposing any of these comments become part of the permanent codebase - they're all just temporary until #2955 is addressed, to prevent this PR from growing to become a diverse grab-bag of things unrelated to its title and initial focus. As @piskvorky has already started building on this in another PR, and I've got other functional things that depend on this, polishing comments here just risks making the tangle of PRs-on-PRs worse.
|
||
def prep_vectors(target_shape, prior_vectors=None, seed=0, dtype=REAL): | ||
"""Return a numpy array of the given shape. Reuse prior_vectors values instance or values | ||
to extent possible. Initialize new values randomly if requested.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but this stuff will show up in the docs, right?
I can think of several better ways forward:
- Make the docstring a code-comment so it doesn't show up in the docs
- Mark the function as internal
Co-authored-by: Michael Penkov <m@penkov.dev>
For some reason, github isn't showing any option to reply to @mpenkov's comment about
So, I'd say: let's see how it settles in that other work. |
Does the approval mean it's OK for me to push the Github 'rebase & merge' button? |
OK from me. Don't know about @mpenkov . Definitely no rebase though – just merge! |
I'm just looking at the default button in the Github interface (which I suspect results in the cleanest trunk history), and might not foul any history elsewhere... nor foul other similar merges). But: if you or @mpenkov can just merge this as soon as it's OK, that'll be quickest - in my opinion, this PR has long been functionally done, and I'd prefer for simplicity to base the followup PR off |
Fixes #2853 & fixes #2943.
Refactors a bunch of
FastText
/Word2Vec
/Doc2Vec
initialization & update code for clarity & less-duplication. Each of these classes now share the faster bulk style of vector-random-initalization previously onlyFastText
used, a slight change in prior per-vector seeding behavior.Seems to fix the test code for triggering ancient
Doc2Vec
update crash in #1019 - but there may still be problems there, and there's been no design/testing forbuild_vocab(..., update=True)
onDoc2Vec
, so current status there is still "not supported".There was some purported (but possibly not useful, and not covered by unit-tests)
mmap_path
-specifying code inKeyedVectors
that's been stubbed out here with FIXMEs/comments, pending intended (#2955) repair/rationalization of that functionality before 4.0.0 final release.