LdaModel trains beyond size of corpus when using an iterable #2553
Comments
Hm, that's pretty weird. Thanks for the detailed report. I'll try to have a look at what's going on, but probably not very soon, sorry :( |
Hi Radim, thanks a lot -- looking forward to your take on it! |
Is this still available? |
@Hiyorimi yes! Sorry to @gochristoph , I never got around to this. |
@mpenkov I believe that just getting the list won't do the trick, since it will slow down the training? |
Perhaps you want to iterate and return the first element instead of converting to a list? |
@Hiyorimi the issue here is that LdaModel is reported to train indefinitely on the given corpus iterable. I can't imagine why it would do that – LdaModel only uses the iterable via standard Python iteration. It has nothing to do with converting to a list, that's just a way of debugging the iterable. |
I added a simple test case in a separate branch. The problem is that here we get the wrapped chunk as:

```python
wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]
```

And `islice`, called on this corpus, returns the same first `chunksize` documents every time. It happens because `islice` internally calls `iter()` on its argument, and every `iter()` call resets the corpus cursor back to the start. I am not sure if that is a valid case for LdaModel, probably @gochristoph can clarify if that resolves the matter. I can think of only one simple check, which is suggested in PR #2633 |
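To see that failure mode in isolation, here is a minimal sketch (my own reconstruction, not code from the thread), assuming an iterable that rewinds a shared cursor on every `iter()` call:

```python
import itertools

class ResettingIterable:
    """Iterable that returns itself from __iter__ and rewinds a shared cursor."""
    def __init__(self, n):
        self.n = n
        self.cursor = 0

    def __iter__(self):
        self.cursor = 0  # every iter() call rewinds the shared cursor
        return self

    def __next__(self):
        if self.cursor >= self.n:
            raise StopIteration
        self.cursor += 1
        return self.cursor

corpus = ResettingIterable(5)
it = iter(corpus)  # called once up front, as in gensim's chunking helper
for _ in range(3):
    # islice() calls iter() on its argument internally, rewinding the cursor,
    # so every "chunk" is the same first two items:
    print(list(itertools.islice(it, 2)))  # [1, 2] each time, forever
```

With such an object the chunking loop never sees an empty chunk, which matches the endless "pass 0" progress reported below.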
@piskvorky any news?) |
Why does it behave like that? Is it an issue with how the iterable is constructed, or with how Gensim consumes it? |
It is an issue with how the iterable is constructed: it gets its index reset every time `iter()` is called. |
I think that's desirable: each iteration = each `iter()` call should start over from the beginning of the corpus. |
Because on line 1163 we have: `it = iter(iterable)`. I can't say I'm an expert with iterables, but that is supposed to prevent the iterator index from resetting, isn't it? |
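For reference, a short illustration of the relevant iterator-protocol semantics (my addition): `iter()` on a container returns a fresh, independent iterator, while `iter()` on an iterator returns the same object without rewinding it.

```python
lst = [10, 20, 30]

i1 = iter(lst)
assert next(i1) == 10
i2 = iter(lst)           # independent iterator over the same list
assert next(i2) == 10
assert next(i1) == 20    # i1 was not disturbed by creating i2

g = (x for x in lst)     # generators are iterators
assert iter(g) is g      # iter() returns the generator itself...
next(g)
assert next(iter(g)) == 20  # ...without rewinding its position
```

The upfront `it = iter(iterable)` only helps if the object honours these semantics; a corpus that rewinds itself on every `iter()` call does not.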
OK, I had a look. The issue is indeed that this `TestIterable` acts as its own iterator and rewinds its cursor on every `iter()` call, and `itertools.islice` calls `iter()` on its argument each time it slices off a chunk. So we're iterating over the same chunk of initial 2000 documents = result of `islice(it, 2000)`, over and over. @gochristoph the solution is to make `__iter__` a true generator, so that each call returns a fresh, independent iterator:

```python
import logging

from gensim.models import LdaModel

logging.basicConfig(format='%(asctime)s %(message)s', datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)


class TestIterable:
    def __iter__(self):
        logging.info('TestIterable() __iter__ was called')
        cursor = 0
        while cursor < 50000:
            cursor += 1
            yield [(0, 2), (3, 1), (6, 1), (100, 2)]


corpus = TestIterable()
# uncommenting this part will make a list out of the corpus
# corpus = [document for document in corpus]

logging.info('performing lda training')
trained_model = LdaModel(corpus, num_topics=2)
```

This code works and is also simpler. Does that make sense? |
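As a quick sanity check (my addition, reusing the `TestIterable` defined just above): the generator-based version hands out a fresh iterator per `iter()` call, so repeated and concurrent iteration both behave:

```python
corpus = TestIterable()
assert sum(1 for _ in corpus) == 50000   # first full pass
assert sum(1 for _ in corpus) == 50000   # a second pass works too

i1, i2 = iter(corpus), iter(corpus)      # two independent generators
next(i1)
assert sum(1 for _ in i1) == 49999       # finish the first iterator
assert sum(1 for _ in i2) == 50000       # i2 is unaffected by consuming i1
```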
@Hiyorimi, @piskvorky Thanks for looking into it. The issue that I see is that this leads to inconsistent behavior, and that the workaround is implemented on the outside (so not in LdaModel, but in the user code). Word2Vec works perfectly with the iterable, but LdaModel does not. And I guess the behavior should be consistent. @piskvorky We did not choose the way of wrapping a generator as an iterable because it is more transparent to directly construct the iterable for our purposes (fetching data from a database with a server-side cursor). Can you point to why the TestIterable in the initial example is not a proper iterable? As far as I can see, iter() and next() are doing the jobs they are supposed to do. Thanks a lot! |
Perhaps. This behaviour was surprising to me too. I guess one solution would be to replace the built-in `islice` grouping with something that calls `iter()` on the corpus only once. Since your `TestIterable` shares a single cursor among all its iterators and rewinds it on every `iter()` call, it is not a proper iterable: each `iter()` call should return an independent iterator. Where you fetch data from (database or otherwise) is not relevant, you can do that with a generator too. |
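A sketch of what a generator-backed, database-fed corpus could look like (my illustration; the connection factory, the `name='corpus_stream'` server-side cursor, and the row layout are hypothetical, psycopg2-style placeholders, not code from this thread):

```python
class DatabaseCorpus:
    """Each iter() call opens its own server-side cursor, so iterators are independent."""

    def __init__(self, make_connection, query):
        self.make_connection = make_connection  # callable returning a new DB connection
        self.query = query

    def __iter__(self):
        conn = self.make_connection()
        try:
            # named cursor = server-side cursor in psycopg2, streaming rows lazily
            cursor = conn.cursor(name='corpus_stream')
            cursor.execute(self.query)
            for row in cursor:
                yield row[0]  # assume each row carries one bag-of-words document
        finally:
            conn.close()
```

Because `__iter__` is a generator function, every `iter(corpus)` call starts a fresh query instead of rewinding shared state.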
IMO, the supplied iterable is simply not a conformant one. That an iterator works fine with one particular client (in this case Word2Vec) doesn't make it a proper iterable. For example, I would expect any iterable to be able to pass the following tests:

```python
def test_serial_iterators(iterable):
    i1 = iter(iterable)
    len1 = sum(1 for _ in i1)
    i2 = iter(iterable)
    len2 = sum(1 for _ in i2)
    assert len1 == len2  # repeated iterations should work & report same number of items


def test_concurrent_iterators(iterable):
    assert sum(1 for _ in iter(iterable)) > 1  # ensure at least 2 items present
    i1 = iter(iterable)
    _ = next(i1)
    i2 = iter(iterable)
    len1 = 1 + sum(1 for _ in i1)
    len2 = sum(1 for _ in i2)
    assert len1 == len2  # concurrent iterations should work & report same number of items
```

The supplied `TestIterable` fails the concurrent test, because all of its iterators share (and rewind) a single cursor. (That first `assert` just guards against an empty iterable passing trivially.) |
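To make the failure concrete, here is a sketch of running those tests against a cursor-sharing iterable (my own construction, mirroring the shape of the reported corpus):

```python
class SharedCursorIterable:
    """Assumed shape of the problematic corpus: iter() rewinds one shared cursor."""
    def __init__(self, n=5):
        self.n = n
        self.cursor = 0

    def __iter__(self):
        self.cursor = 0
        return self

    def __next__(self):
        if self.cursor >= self.n:
            raise StopIteration
        self.cursor += 1
        return self.cursor

test_serial_iterators(SharedCursorIterable())      # passes: serial re-iteration looks fine
try:
    test_concurrent_iterators(SharedCursorIterable())
except AssertionError:
    print('broken: creating i2 rewound the cursor that i1 was using')
```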
Looks like we're in agreement here. I still think getting rid of the repeated implicit `iter()` calls during chunking would be worth it. |
It's a little odd to be calling `islice()` over and over on something that's already an iterator. To that end, this implementation has always seemed a bit clunky to me. It has always "felt like" there should be a more Pythonic, compact idiom for doing this sort of grouping – but I've not come across one. (The closest standard idiom, the `grouper` recipe from the `itertools` docs, pads the final group with a fill value, so it's not a drop-in replacement.) |
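For reference (my addition), the `grouper` recipe from the `itertools` documentation looks like this; the padding behaviour is what makes it awkward for corpus chunking:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length groups: grouper('ABCDEFG', 3, 'x') -> ABC DEF Gxx."""
    args = [iter(iterable)] * n  # n references to ONE shared iterator
    return zip_longest(*args, fillvalue=fillvalue)

print(list(grouper('ABCDEFG', 3, 'x')))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]  <- last group is padded
```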
Maybe replace the existing gensim grouping helper, then? |
AFAIK what we have now is the "standard", idiomatic way of grouping an iterable. Replacing that with another convoluted helper doesn't seem worth it. To be clear, this discussion is no longer about the original issue (where we agree the user-supplied "shared-iterators-iterable" was the problem). It's about a cosmetic change where we could group an iterable while calling `iter()` on it only once. |
Looking at how others solve this, I googled up `more_itertools.chunked`. But it's basically what we have in Gensim now (i.e. repeated `islice` calls against a single upfront iterator). |
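For concreteness (my addition), the core of that idiom can be written in two lines with the two-argument form of `iter()`; this is essentially how `more_itertools.chunked` behaves:

```python
from itertools import islice

def chunked(iterable, n):
    it = iter(iterable)  # iter() is called exactly once on the corpus
    # call the lambda until it returns the sentinel [] (an exhausted slice)
    return iter(lambda: list(islice(it, n)), [])

print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Note that `islice` still calls `iter()` on `it` internally, which is harmless for well-behaved iterators but was exactly the trap for the cursor-resetting corpus above.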
I don't particularly see any 1-to-2-line application of the standard `itertools` building blocks that does this cleanly. That I see the gensim helper as clunky is just an aesthetic judgment; it works. Also, I'm not sure about the theory supporting the "memory opt: wrap the chunk and then pop(), to avoid leaving behind a dangling reference" wrap/pop. (I don't see how that would effectively break any "dangling reference".) In fact, for generalizability/composability, the utility function could yield generators in the style of `itertools.groupby`, rather than fully materialized lists. |
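A sketch of such a generator-yielding chunker (my own, offered as an illustration of the idea rather than the actual proposal), with the same caveat `itertools.groupby` has: each chunk must be consumed before advancing to the next.

```python
import itertools

def chunk_generators(iterable, chunksize):
    """Yield each chunk lazily as a generator instead of a materialized list."""
    it = iter(iterable)
    while True:
        try:
            head = next(it)  # peek one item to detect exhaustion
        except StopIteration:
            return
        # chunk = the peeked item plus up to chunksize-1 more, produced lazily
        yield itertools.chain([head], itertools.islice(it, chunksize - 1))

for chunk in chunk_generators(range(7), 3):
    print(list(chunk))  # [0, 1, 2] then [3, 4, 5] then [6]
```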
I see what you mean, but -1 on that, unless profiling shows some tremendous performance benefit. Way too clever, unreadable. My preference is a plain:

```python
def chunkize_serial(iterable, chunksize):
    current_chunk = [[]]
    for value in iterable:
        current_chunk[0].append(value)
        if len(current_chunk[0]) >= chunksize:
            yield current_chunk.pop()
            current_chunk = [[]]
    if current_chunk[0]:
        yield current_chunk.pop()
```

Or maybe even without the "memory trick" of avoiding a named-variable refcount via the wrap-and-`pop()`. |
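A quick behaviour check of this variant (my addition):

```python
print(list(chunkize_serial(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]  -- full chunks of 3, plus the short remainder
```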
Problem description
When streaming documents/bags of words to LdaModel via a custom iterable, LdaModel will train beyond the size of the corpus, with output like

19-07-05 22:53:43 PROGRESS: pass 0, at document #178000/50000

-- where the number left of the / is higher than the number right of it.

Steps/code/corpus to reproduce
Using the TestIterable() will result in LdaModel training indefinitely. Converting the TestIterable() corpus to a list will lead to the expected result of a proper training.
I have not written too many iterables so far, and of course there could be a problem there. But as far as I could infer from the LdaModel documentation, all that is required is an iterable -- and to the best of my knowledge,

corpus = TestIterable()

is a proper iterable, and an iterator as well. Thanks a lot!
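The full reproduction script does not survive in this copy of the issue. Based on the behaviour dissected in the comments above (one shared cursor, rewound by every `iter()` call), it presumably looked something like the following hypothetical reconstruction:

```python
import logging

from gensim.models import LdaModel

logging.basicConfig(format='%(asctime)s %(message)s', datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)


class TestIterable:
    # Hypothetical reconstruction: __iter__ returns self and rewinds a shared
    # cursor, the "shared-iterators" flaw identified in the discussion.
    def __init__(self):
        self.cursor = 0

    def __iter__(self):
        logging.info('TestIterable() __iter__ was called')
        self.cursor = 0
        return self

    def __next__(self):
        if self.cursor < 50000:
            self.cursor += 1
            return [(0, 2), (3, 1), (6, 1), (100, 2)]
        raise StopIteration

corpus = TestIterable()
logging.info('performing lda training')
trained_model = LdaModel(corpus, num_topics=2)  # trains past #50000 as reported
```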
Versions
Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core
Python 3.6.4 (default, Apr 10 2018, 07:54:00)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
NumPy 1.14.2
SciPy 1.0.1
gensim 3.7.3
FAST_VERSION 0