LdaModel trains beyond size of corpus when using an iterable #2553

gochristoph · 2019-07-05T21:11:42Z

Problem description

When streaming documents/bag of words to LdaModel via a custom iterable, LdaModel will train beyond the size of the corpus, with output like 19-07-05 22:53:43 PROGRESS: pass 0, at document #178000/50000 -- where the number left to the / is higher than the number right to it.

Steps/code/corpus to reproduce

from gensim.models import LdaModel
import logging
logging.basicConfig(format='%(asctime)s  %(message)s', \
    datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

class TestIterable:
    def __init__(self):
        self.bag_of_words = [(0,2), (3,1), (6,1), (100,2)]
        self.cursor = 0

    def __iter__(self):
        self.cursor = 0
        logging.info('TestIterable() __iter__ was called')
        return self

    def __next__(self):
        if self.cursor < 50000:
            self.cursor += 1
            return self.bag_of_words
        else:
            logging.info('TestIterable() returned StopIteration')
            raise StopIteration


corpus = TestIterable()
# uncommenting this part will make a list out of the corpus
# corpus = [document for document in corpus]

logging.info('performing lda training')
trained_model = LdaModel(corpus, num_topics=2)

Using the TestIterable() will result in LdaModel training indefinitively. Converting the TestIterable() corpus to a list will lead to the expected result of a proper training.

I have not written too many iterables so far, and of course there could be a problem there. But as far as I could infer from the LdaModel documentation, all that is required is an interable -- and to the best of my knowledge, corpus = TestIterable() is a proper iterable, and iterator as well.

Thanks a lot!

Versions

Linux-3.10.0-862.14.4.el7.x86_64-x86_64-with-centos-7.5.1804-Core
Python 3.6.4 (default, Apr 10 2018, 07:54:00)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
NumPy 1.14.2
SciPy 1.0.1
gensim 3.7.3
FAST_VERSION 0

The text was updated successfully, but these errors were encountered:

piskvorky · 2019-07-05T23:08:52Z

Hm, that's pretty weird. Thanks for the detailed report.

I'll try to have a look what's going on, but probably not very soon, sorry :(

gochristoph · 2019-07-09T14:06:41Z

Hi Radim, thanks a lot -- looking forward to your take on it!

Hiyorimi · 2019-10-09T11:16:43Z

Is this still available?

piskvorky · 2019-10-09T12:02:36Z

@Hiyorimi yes! Sorry to @gochristoph , I never got around to this.

Hiyorimi · 2019-10-10T12:55:07Z

@mpenkov I believe that just getting the list won't do the trick, since it will slow down the training ?

mpenkov · 2019-10-11T02:26:15Z

Perhaps you want to iterate and return the first element instead of converting to a list?

piskvorky · 2019-10-11T08:54:51Z

@Hiyorimi the issue here is that LdaModel is reported to train indefinitely, on the given TestIterable object.

I can't imagine why it would do that – LdaModel only uses the iterable via the standard Python for x in iterable: … construct, IIRC. So I suspect the issue is with the iterable, not with LdaModel. But I can't see anything wrong at a glance, so the next step will be to replicate the issue and investigate.

It has nothing to do with converting to a list, that's just a way of debugging the iterable.

…problematic iterable

Hiyorimi · 2019-10-14T19:12:14Z

I added simple test case in separate branch.

Problem is that here we get wrapped chunk as :

wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]

And called on TestIterable (or ProblematicIterable in tests), it will iterate forever.

It happens because TestIterable sets self.cursor on iter call. Being perfectly valid for getting sequence length, this setting triggers infinite chunk generation, once islice() being called on iterator.

I am not sure if that it is the valid case for LDAModel, probably @gochristoph can clarify, if that resolves the matter.

I can think only on one simple check, which is suggested in PR #2633

…length. Closes piskvorky#2553

Hiyorimi · 2019-10-16T00:21:20Z

@piskvorky any news?)

piskvorky · 2019-10-16T07:38:28Z

Why does islice() create infinite recursion? I still don't get it, sorry.

Is it an issue with how the iterable is constructed, or with how Gensim consumes it?

Hiyorimi · 2019-10-16T07:55:39Z

It is an issue with how the iterable is constructed. It gets index reset every time iter is called.

piskvorky · 2019-10-16T10:12:34Z

I think that's desirable: each iteration = each for x in iterable: should start from the beginning. Why is that an issue?

Hiyorimi · 2019-10-16T12:13:18Z

Because on line 1163 we have:

it = iter(iterable)

I can't say I'm professional with iterables, but that is supposed to prevent iterator index from resetting, isn't it?

piskvorky · 2019-11-03T12:28:47Z

OK, I had a look. The issue is indeed that islice calls __iter__(self), which in turn resets self.cursor with each islice call.

So we're iterating over the same chunk of initial 2000 documents = result of islice(corpus, 2000), over and over, instead of advancing through the corpus.

@gochristoph the solution is to make TestIterable a proper iterable, returning a fresh iterator (in the example below, a generator) from each call to __iter__:

import logging
from gensim.models import LdaModel

logging.basicConfig(format='%(asctime)s  %(message)s', datefmt='%y-%m-%d %H:%M:%S', level=logging.INFO)

class TestIterable:
    def __iter__(self):
        logging.info('TestIterable() __iter__ was called')
        cursor = 0
        while cursor < 50000:
            cursor += 1
            yield [(0, 2), (3, 1), (6, 1), (100, 2)]

corpus = TestIterable()
# uncommenting this part will make a list out of the corpus
# corpus = [document for document in corpus]

logging.info('performing lda training')
trained_model = LdaModel(corpus, num_topics=2)

This code works and is also simpler.

Does that make sense?

gochristoph · 2019-11-03T13:18:36Z

@Hiyorimi, @piskvorky Thanks for looking into it.

The issue that I see is that this leads to inconsistent behavior, and that the workaround is implemented on the outside (so not in LdaModel, but in the user code). Word2Vec works perfectly with the iterable, but LdaModel does not. And I guess the behavior should be consistent.

@piskvorky We did not choose they way of wrapping a generator as an iterable because it is more transparent to directly construct the iterable for our purposes (fetching data from a database with a server-side cursor). Can you point to why the TestIterable in the initial expample is not a proper iterable? As far as I can see it iter() and next() are doing the jobs they are supposed to do.

Thanks a lot!

piskvorky · 2019-11-03T18:27:33Z

Perhaps. This behaviour was surprising to me too.

I guess one solution would be to replace the built-in itertools.islice() in utils.grouper() (used in LdaModel to read chunks of documents from your corpus) by something that treats its input as an iterator, without calling iter() on it again.

Since your TestIterable is both an iterable and its own iterator, sharing the same state, calling islice behaves weirdly. But even if we avoid islice, it's possible you'll hit issues elsewhere. Returning a new iterator from your __iter__ looks both saner and safer.

Where you fetch data from (database or otherwise) is not relevant, you can do that with a generator too.

CC @mpenkov @gojomo thoughts?

gojomo · 2019-11-03T19:44:01Z

IMO, __iter__() should, by Python's conventions, return a new iterator, and any call to __iter__() should have no effect on any prior returned iterators from __iter__() (no matter their state/progress/exhaustion).

That an iterator works fine with one particular client (in this case Word2Vec) isn't proof it's a proper iterator for all clients who are counting on usual behavior.

For example, I would expect any iterable to be able to pass the following tests:

def test_serial_iterators(iterable):
    i1 = iter(iterable)
    len1 = sum(1 for _ in i1)
    i2 = iter(iterable)
    len2 = sum(1 for _ in i2)
    assert len1 == len2  # repeated iterations should work & report same number of items

def test_concurrent_iterators(iterable):
    assert sum(1 for _ in iter(iterable)) > 1  # ensure at least 2 items present
    i1 = iter(iterable)
    _ = next(i1)
    i2 = iter(iterable)
    len1 = 1 + sum(1 for _ in i1)
    len2 = sum(1 for _ in i2)
    assert len1 == len2  # concurrent iterations should work & report same number of items

The supplied TestIterable won't pass the second test (while all properly-behaving Python iterables will), thus it is the code that needs fixing, not islice() of gensim, which is just expecting the usual behavior.

(That first test_serial_iterators() method is what demonstrates an iterator itself, capable of only a single iteration, is not a full iterable.)

piskvorky · 2019-11-04T08:51:21Z

Looks like we're in agreement here.

I still think getting rid of islice() in utils.grouper() is desirable. I see no reason for such weird behaviour, it's confusing – we should simply advance the it variable inside the loop, without calling iter on it each time (treat it as an iterator, not iterable).

gojomo · 2019-11-04T18:16:59Z

It's a little odd to be calling iter() on something that all subsequent paths (via islice()) also effectively call iter() on again, but I'm not sure there's any weird/surprising externally-observable behavior as long as a proper iterable-sequence is provided. (The parameter's name is, after all, iterable.) So I'm not sure there's anything needing fixing, though perhaps the code could be clearer/prettier.

To that end, this implementation has always seemed a bit clunky to me. It has always "felt like" there should be a more Pythonic, compact idiom for doing this sort of grouping – but I've not come across one. (The grouper() in the itertools docs recipes is almost, but not quite, right due to the filling Nones.) It now occurs to me that groupby() with a synthetic grouping-key could do the trick. EG, roughly (not yet considering the numpy-optimization option):

from itertools import count, groupby
def grouper_no_filling(iterable, n): 
    groups = (i for i in count() for _ in range(n))  # infinite generator
    for k, g in groupby(iterable, key=lambda item: next(groups)):
        yield list(g)

Maybe replace existing gensim grouper()/chunkize_serial() with a call to this (or some numpy-value-optimized-version when that's elected)?

piskvorky · 2019-11-04T18:31:43Z

AFAIK what we have now is the "standard", idiomatic way of grouping an iterable.

Replacing that with another convoluted itertools construct seems an overkill to me. Iterating over a sequence is trivial enough. We just need a simple loop: explicit and clear. Performance is likely no bottleneck here.

To be clear, this discussion is no longer about the original issue (where we agree the user-supplied "shared-iterators-iterable" was the problem). It's about a cosmetic change where we could group an iterable while calling iter() on it only once, which seems more correct to me than the current version… even if there's no external difference for well-behaved iterables.

piskvorky · 2019-11-04T18:45:48Z

Looking how others solve this, I googled up bolton's version:
https://boltons.readthedocs.io/en/latest/_modules/boltons/iterutils.html#chunked_iter

But it's basically what we have in Gensim now (i.e. iter + islice), sans our memory optimization.

gojomo · 2019-11-04T19:35:26Z

I don't particularly see any 1-to-2-line application of itertools as being "convoluted" – even where its operation isn't maximally readable, if it reliably does whatever simple thing it says it does, the compactness & reliance on standard-library constructs can justify its use.

That boltons implementation strikes me as overkill – but if that library is considered a rugged, reliable source of functions, should we just bring it in and re-use its implementation?

I see the gensim numpy optimization as something that would have been just as efficient, and clearer, to do as a map-wrapping step totally outside the utility function – in only those places where the caller knows that np.array-compactness & cross-thread communication is important. (It's not like the function can surprise unaware callers with a more-efficient return-value, since callers have to request the optimization with a non-default as_numpy=True argument.)

Also, I'm not sure about the theory supporting the "memory opt: wrap the chunk and then pop(), to avoid leaving behind a dangling reference" wrap/pop. (I don't see how that would effectively break any "dangling reference".)

In fact, for generalizability/composability, the utility function could yield generators in the style of itertools, and only callers who need lists would convert those generators. For example, replacing the existing chunkize_serial() with:

def grouper_unfilled(iterable, n): 
    groups = (i for i in count() for _ in range(n))  # infinite generator
    for k, g in groupby(iterable, key=lambda item: next(groups)):
        yield g

# for backward-compatibility until all callers expecting old lists/arrays are coaxed to use above directly
def chunkize_serial(iterable, chunksize, as_numpy=False, dtype=np.float32):
    mapfn = list
    if as_numpy:
        mapfn = lambda item: np.array(item, dtype=dtype)
    return map(mapfn, grouper_unfilled(iterable, chunksize))

grouper = chunkize_serial  # more backward-compatibility

piskvorky · 2019-11-04T20:45:13Z

I see what you mean, but -1 on that, unless profiling shows some tremendous performance benefit. Way too clever, unreadable.

My preference is a plain:

def chunkize_serial(iterable, chunksize):
    current_chunk = [[]]
    for value in iterable:
        current_chunk[0].append(value)
        if len(current_chunk[0]) >= chunksize:
            yield current_chunk.pop()
            current_chunk = [[]]
    if current_chunk[0]:
        yield current_chunk.pop()

Or maybe even without the "memory trick" of avoiding a named variable refcount via pop(). That optimization is probably not necessary any more, who cares about a bit of extra memory.

piskvorky self-assigned this Jul 9, 2019

mpenkov added Hacktoberfest Issues marked for hacktoberfest help wanted difficulty medium Medium issue: required good gensim understanding & python skills labels Sep 28, 2019

Hiyorimi mentioned this issue Oct 10, 2019

Handling for iterables without 0-th element, fixes #2556 #2629

Merged

Hiyorimi added a commit to Hiyorimi/gensim that referenced this issue Oct 14, 2019

WIP fixing piskvorky#2553: added TestLdaIterableTraining for testing …

8284ad2

…problematic iterable

Hiyorimi added a commit to Hiyorimi/gensim that referenced this issue Oct 14, 2019

Added check for number of processed documents being less than corpus …

9d7f17e

…length. Closes piskvorky#2553

piskvorky closed this as completed Nov 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LdaModel trains beyond size of corpus when using an iterable #2553

LdaModel trains beyond size of corpus when using an iterable #2553

gochristoph commented Jul 5, 2019 •

edited

Loading

piskvorky commented Jul 5, 2019

gochristoph commented Jul 9, 2019

Hiyorimi commented Oct 9, 2019

piskvorky commented Oct 9, 2019

Hiyorimi commented Oct 10, 2019

mpenkov commented Oct 11, 2019

piskvorky commented Oct 11, 2019 •

edited

Loading

Hiyorimi commented Oct 14, 2019 •

edited

Loading

Hiyorimi commented Oct 16, 2019

piskvorky commented Oct 16, 2019 •

edited

Loading

Hiyorimi commented Oct 16, 2019 •

edited

Loading

piskvorky commented Oct 16, 2019 •

edited

Loading

Hiyorimi commented Oct 16, 2019

piskvorky commented Nov 3, 2019 •

edited

Loading

gochristoph commented Nov 3, 2019

piskvorky commented Nov 3, 2019 •

edited

Loading

gojomo commented Nov 3, 2019 •

edited by piskvorky

Loading

piskvorky commented Nov 4, 2019 •

edited

Loading

gojomo commented Nov 4, 2019 •

edited

Loading

piskvorky commented Nov 4, 2019 •

edited

Loading

piskvorky commented Nov 4, 2019 •

edited

Loading

gojomo commented Nov 4, 2019

piskvorky commented Nov 4, 2019 •

edited

Loading

LdaModel trains beyond size of corpus when using an iterable #2553

LdaModel trains beyond size of corpus when using an iterable #2553

Comments

gochristoph commented Jul 5, 2019 • edited Loading

Problem description

Steps/code/corpus to reproduce

Versions

piskvorky commented Jul 5, 2019

gochristoph commented Jul 9, 2019

Hiyorimi commented Oct 9, 2019

piskvorky commented Oct 9, 2019

Hiyorimi commented Oct 10, 2019

mpenkov commented Oct 11, 2019

piskvorky commented Oct 11, 2019 • edited Loading

Hiyorimi commented Oct 14, 2019 • edited Loading

Hiyorimi commented Oct 16, 2019

piskvorky commented Oct 16, 2019 • edited Loading

Hiyorimi commented Oct 16, 2019 • edited Loading

piskvorky commented Oct 16, 2019 • edited Loading

Hiyorimi commented Oct 16, 2019

piskvorky commented Nov 3, 2019 • edited Loading

gochristoph commented Nov 3, 2019

piskvorky commented Nov 3, 2019 • edited Loading

gojomo commented Nov 3, 2019 • edited by piskvorky Loading

piskvorky commented Nov 4, 2019 • edited Loading

gojomo commented Nov 4, 2019 • edited Loading

piskvorky commented Nov 4, 2019 • edited Loading

piskvorky commented Nov 4, 2019 • edited Loading

gojomo commented Nov 4, 2019

piskvorky commented Nov 4, 2019 • edited Loading

gochristoph commented Jul 5, 2019 •

edited

Loading

piskvorky commented Oct 11, 2019 •

edited

Loading

Hiyorimi commented Oct 14, 2019 •

edited

Loading

piskvorky commented Oct 16, 2019 •

edited

Loading

Hiyorimi commented Oct 16, 2019 •

edited

Loading

piskvorky commented Oct 16, 2019 •

edited

Loading

piskvorky commented Nov 3, 2019 •

edited

Loading

piskvorky commented Nov 3, 2019 •

edited

Loading

gojomo commented Nov 3, 2019 •

edited by piskvorky

Loading

piskvorky commented Nov 4, 2019 •

edited

Loading

gojomo commented Nov 4, 2019 •

edited

Loading

piskvorky commented Nov 4, 2019 •

edited

Loading

piskvorky commented Nov 4, 2019 •

edited

Loading

piskvorky commented Nov 4, 2019 •

edited

Loading