[GSoC 2018] Multistream API for vocabulary building in *2vec #2078
Merged: menshikh-iv merged 43 commits into piskvorky:develop from persiyanov:feature/gsoc-multistream-vocab on Jul 12, 2018.
Commits
92e6e22 multistream scan vocab for doc2vec, word2vec & fastText
2618a2e fixes
7960af8 fix tags for doc2vec
b8da97a fix tests
16be716 removed benchmark vocab
c2d674a addressing comments
85e689c make interfaces and documentation more pretty
0d5ae38 add word2vec multistream tests
df3ae5f fix pep8
49357cb iteritems -> items
0365eea more precise test
812ab8c add doc2vec tests
f11f44d add fasttext tests
941dfd8 remove prints
36e7238 fix seed=42
fa57f7a fixed tests
9ea007d add build_vocab test for fasttext
aec68ea fix
07f3fd4 change size from 10 to 5 in fasttext test because of appveyor memory …
8b49fb8 another test with memory error
d0c11d9 fix py3 tests
5974448 fix iteritems for py3
1419847 fix functools reduce
280e826 addressing comments
7d489f4 addressing @jayantj comments
49a1ee6 fix language
1cbad7f add final vocab pruning in multistream modes
d024625 keys -> iterkeys
5e4de19 use heapq.nlargest
74e7b02 fix
0d12d8b multistream flag to input_streams param
25d00cd fix tests
2281265 fix flake 8
543a9e0 fix doc2vec docstrings
d520d68 fix merging streams
d11a0b8 fix doc2vec
ecd8f39 max_vocab_size -> max_vocab_size / workers
a96d5a4 fixed
0a327b0 / -> // (py3 division)
62873fb fix
5f61219 Merge branch 'develop' into feature/gsoc-multistream-vocab
c67f964 fix docstring
a16cec0 Merge branch 'develop' into feature/gsoc-multistream-vocab
Conversations
gojomo:
Not a comment about the doc improvements here, which all look good, but a side observation about this method, which I only noticed during related reviews last week: its strategy of re-launching a fresh 'producer' thread and fresh 'worker' threads for each epoch, which I believe was introduced in #1777, likely drives down overall throughput and CPU utilization compared to the prior strategy. The one potential advantage I'd see for adopting such a full-sync teardown-and-restart between epochs would be allowing the user to specify some callback for mid-training reporting at each epoch's end, but that hasn't yet been added.
@gojomo why would that drive down throughput & CPU utilization?
gojomo:
When some threads have finished an epoch, but others haven't, cores will be idle, not because of the GIL etc. but because there's no thread even trying to move forward onto the next epoch's data. Plus any overhead of re-launching threads (magnitude unclear). The old strategy launched exactly `workers + 1` threads; this one launches `epochs * (workers + 1)` threads.
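To make the thread-count difference concrete, here is a minimal toy sketch of the two launch strategies, not gensim's actual training loop: the `produce`/`work` bodies, the sentinel protocol, and the queue sizing are all illustrative assumptions.

```python
import threading
from queue import Queue

SENTINEL = None  # tells a worker to exit

def produce(corpus, jobs, workers):
    # Toy producer: push each job, then one sentinel per worker.
    for job in corpus:
        jobs.put(job)
    for _ in range(workers):
        jobs.put(SENTINEL)

def work(jobs):
    # Toy worker: consume jobs until a sentinel arrives.
    while True:
        job = jobs.get()
        if job is SENTINEL:
            break
        sum(job)  # stand-in for actual training on one job

def train_relaunch(corpus, epochs, workers):
    # Per-epoch relaunch, as described above: a fresh producer and fresh
    # workers each epoch, epochs * (workers + 1) threads in total, with a
    # full join() barrier at every epoch boundary.
    for _ in range(epochs):
        jobs = Queue(maxsize=2 * workers)
        threads = [threading.Thread(target=produce, args=(corpus, jobs, workers))]
        threads += [threading.Thread(target=work, args=(jobs,)) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # cores sit idle here while the last stragglers finish

def train_persistent(corpus, epochs, workers):
    # Prior single-launch strategy: exactly workers + 1 threads for the
    # whole run; one producer streams every epoch into the same queue, so
    # workers roll straight into the next epoch's data with no barrier.
    jobs = Queue(maxsize=2 * workers)

    def produce_all_epochs():
        for _ in range(epochs - 1):
            for job in corpus:
                jobs.put(job)
        produce(corpus, jobs, workers)  # final epoch also emits sentinels

    threads = [threading.Thread(target=produce_all_epochs)]
    threads += [threading.Thread(target=work, args=(jobs,)) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    corpus = [list(range(1000))] * 200  # toy jobs
    train_relaunch(corpus, epochs=3, workers=4)
    train_persistent(corpus, epochs=3, workers=4)
```

The `join()` barrier in `train_relaunch` is where the idle-core concern arises; `train_persistent` has no such barrier until the very end of training.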
If I understand correctly, you're worried that at the end of each epoch some threads may be idle (while other threads finish their last job) until the next epoch starts.
Isn't that idleness infinitesimal, since any single job takes almost no time at all? I may be missing something, but this type of delay shouldn't even be measurable.
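For rough scale, a back-of-the-envelope version of that claim, with purely illustrative numbers (gensim's word2vec batches jobs of roughly 10k words by default; the per-thread throughput here is an assumption, not a measurement):

```python
# Illustrative numbers only: ~10k-word jobs (gensim's default batch size
# is on this order) and an assumed effective per-thread throughput.
job_words = 10_000
words_per_sec_per_thread = 100_000                  # assumption, not measured
job_seconds = job_words / words_per_sec_per_thread  # ~0.1 s per job

# Worst case at an epoch boundary: all but one worker sit idle for about
# one job duration while the last straggler finishes its final job.
workers = 4
idle_core_seconds = (workers - 1) * job_seconds
print("worst-case idle core-seconds per epoch rollover: %.2f" % idle_core_seconds)
```

Whether that is negligible then depends on how long a full epoch takes relative to this per-rollover cost, plus the thread-relaunch overhead itself.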
gojomo:
I'm not sure of the magnitude, only the direction: this means more idle cores every time an epoch rolls over. Of course only measuring could tell for sure, and the proportionate impact becomes smaller with larger corpora.
Since it's not yet clear to me how the existing GIL/queue-sync bottlenecks interact to cap throughput (or else I would have tried harder to fix them), adding yet more thread launches, waits, and syncs against a not-yet-filled queue is something I'd have been wary of doing without measurement at the time. Even the coarse once-a-second progress logging tended to show slower throughput at the beginning of training; that slow-start might now be repeated at each epoch, for example via GIL contention between the first few new worker threads getting a job and the new producer thread trying to get ahead of the workers again.
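If one did want to measure, a minimal sketch that times only the fixed cost of launching and joining one per-epoch batch of `workers + 1` threads; it deliberately ignores the GIL contention and queue warm-up effects described above, which may well dominate:

```python
import threading
import time

def noop():
    # Stand-in for a producer/worker body; we time only thread creation,
    # start-up, and join, not any training work.
    pass

def launch_cost_per_epoch(workers=4, trials=50):
    # Average wall-clock cost of creating, starting, and joining one
    # batch of (workers + 1) threads, i.e. the extra fixed cost paid at
    # every epoch rollover under the relaunch strategy.
    start = time.perf_counter()
    for _ in range(trials):
        threads = [threading.Thread(target=noop) for _ in range(workers + 1)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return (time.perf_counter() - start) / trials

print("per-epoch relaunch cost: %.3f ms" % (launch_cost_per_epoch() * 1e3))
```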