LDA model gives Segmentation Fault #92

Closed
buma opened this issue Aug 9, 2012 · 12 comments

@buma
Contributor

buma commented Aug 9, 2012

I would like to use the LDA model on my corpus, but I get a segmentation fault when trying to transform the corpus.

Versions used for the valgrind run:
gensim 0.8.5
numpy 1.8.0.dev-fcdbcac
scipy 0.12.0.dev-e75a945

With the stable versions the problem is the same; only the file names are missing from the valgrind output because there are no debug symbols:
gensim 0.8.5
numpy 1.6.2
scipy 0.10.1

I am running Arch Linux (kernel 3.4.7-1-PAE, i686) with 8 GB of memory and python 2.7.3-2.

All the files needed to reproduce this are here.

Interestingly, I have the same problem in scikit-learn, with a similar stack trace.

I would appreciate any help.

from gensim import corpora, models
from gensim.corpora.dictionary import Dictionary
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


moj_korpus = corpora.mmcorpus.MmCorpus("podatki_del.mm")
print moj_korpus

dictionary = Dictionary.load_from_text("dictionary.txt")
print dictionary
rpmodel = models.ldamodel.LdaModel(moj_korpus, id2word=dictionary, num_topics=97, update_every=1, passes=20, chunksize=10)

print rpmodel

corpus_rpmodel = rpmodel[moj_korpus]

corpora.MmCorpus.serialize('lda_corpus.mm', corpus_rpmodel)

The segmentation fault happens when merging changes:
2012-08-09 22:17:47,234 : INFO : PROGRESS: iteration 0, at document #10/100
2012-08-09 22:17:47,354 : INFO : 2/10 documents converged within 50 iterations
2012-08-09 22:17:48,899 : INFO : merging changes from 10 documents into a model of 100 documents
Traceback (most recent call last):
  File "example.py", line 13, in <module>
    rpmodel = models.ldamodel.LdaModel(moj_korpus, id2word=dictionary, num_topics=97, update_every=1, passes=20, chunksize=10)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 265, in __init__
    self.update(corpus)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 455, in update
    self.do_mstep(rho(), other)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 489, in do_mstep
    diff -= self.state.get_Elogbeta()
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 153, in get_Elogbeta
    return dirichlet_expectation(self.get_lambda())
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 60, in dirichlet_expectation
    result = psi(alpha) - psi(numpy.sum(alpha, 1))[:, numpy.newaxis]
MemoryError

Valgrind gives this:
2012-08-09 21:51:36,556 : INFO : loaded corpus index from podatki_del.mm.index
2012-08-09 21:51:36,576 : INFO : initializing corpus reader from podatki_del.mm
2012-08-09 21:51:36,579 : INFO : accepted corpus with 100 documents, 591333 features, 11109 non-zero entries
MmCorpus(100 documents, 591333 features, 11109 non-zero entries)
Dictionary(592158 unique tokens)
2012-08-09 21:52:38,601 : INFO : using serial LDA version on this node
==25177== Warning: set address range perms: large range [0xc84f028, 0x27e89318) (undefined)
==25177== Warning: set address range perms: large range [0x38d41028, 0x5437b318) (undefined)
==25177== Warning: set address range perms: large range [0xc84f018, 0x27e89328) (noaccess)
==25177== Warning: set address range perms: large range [0xc84f028, 0x27e89318) (undefined)
==25177== Warning: set address range perms: large range [0x75710028, 0x90d4a318) (undefined)
==25177== Warning: set address range perms: large range [0x97adf028, 0xb3119318) (undefined)
==25177== Warning: set address range perms: large range [0x75710018, 0x90d4a328) (noaccess)
==25177== Warning: set address range perms: large range [0x75710028, 0x90d4a318) (undefined)
==25177== Warning: set address range perms: large range [0x97adf018, 0xb3119328) (noaccess)
==25177== Warning: set address range perms: large range [0xc84f018, 0x27e89328) (noaccess)
==25177== Warning: set address range perms: large range [0xc84f028, 0x27e89318) (undefined)
==25177== Warning: set address range perms: large range [0x75710018, 0x90d4a328) (noaccess)
2012-08-09 21:56:07,067 : INFO : running online LDA training, 97 topics, 20 passes over the supplied corpus of 100 documents, updating model once every 10 documents
==25177== Warning: set address range perms: large range [0x75710028, 0x90d4a318) (undefined)
2012-08-09 21:56:07,978 : INFO : PROGRESS: iteration 0, at document #10/100
==25177== Warning: set address range perms: large range [0x97adf028, 0xb3119318) (undefined)
==25177== Conditional jump or move depends on uninitialised value(s)
==25177== at 0x4A2B7F3: PyArray_MapIterReset (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x4AA5E97: array_subscript (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x4AA691A: array_subscript_nice (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x408F4FF: PyObject_GetItem (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4124478: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== Conditional jump or move depends on uninitialised value(s)
==25177== at 0x4A2B7F3: PyArray_MapIterReset (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x4A9F10C: array_ass_sub (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x408F823: PyObject_SetItem (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412439F: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x40B6B20: function_call (in /usr/lib/libpython2.7.so.1.0)
==25177==
2012-08-09 21:56:09,730 : INFO : 2/10 documents converged within 50 iterations
==25177== Warning: set address range perms: large range [0x97adf018, 0xb3119328) (noaccess)
==25177== Warning: set address range perms: large range [0x97adf028, 0xb3119318) (undefined)
2012-08-09 21:56:34,273 : INFO : merging changes from 10 documents into a model of 100 documents
==25177== Invalid read of size 4
==25177== at 0x4B3960B: trivial_three_operand_loop (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x4B4EEF7: PyUFunc_GenericFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x4B4F1BA: ufunc_generic_call (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4090347: call_function_tail (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x409045F: _PyObject_CallFunction_SizeT (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4A67B80: PyArray_GenericBinaryFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x408BE96: binary_op1 (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x408E685: PyNumber_Multiply (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412587C: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== Address 0x1c is not stack'd, malloc'd or (recently) free'd
==25177==
==25177==
==25177== Process terminating with default action of signal 11 (SIGSEGV)
==25177== Access not within mapped region at address 0x1C
==25177== at 0x4B3960B: trivial_three_operand_loop (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x4B4EEF7: PyUFunc_GenericFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x4B4F1BA: ufunc_generic_call (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4090347: call_function_tail (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x409045F: _PyObject_CallFunction_SizeT (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4A67B80: PyArray_GenericBinaryFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x408BE96: binary_op1 (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x408E685: PyNumber_Multiply (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412587C: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== If you believe this happened as a result of a stack
==25177== overflow in your program's main thread (unlikely but
==25177== possible), you can try to increase the size of the
==25177== main thread stack using the --main-stacksize= flag.
==25177== The main thread stack size used in this run was 8388608.
==25177==
==25177== HEAP SUMMARY:
==25177== in use at exit: 1,899,201,128 bytes in 13,757 blocks
==25177== total heap usage: 719,362 allocs, 705,604 frees, 4,800,996,361 bytes allocated
==25177==
==25177== LEAK SUMMARY:
==25177== definitely lost: 24 bytes in 1 blocks
==25177== indirectly lost: 0 bytes in 0 blocks
==25177== possibly lost: 308,652 bytes in 541 blocks
==25177== still reachable: 1,898,892,436 bytes in 13,214 blocks
==25177== suppressed: 16 bytes in 1 blocks
==25177== Rerun with --leak-check=full to see details of leaked memory
==25177==
==25177== For counts of detected and suppressed errors, rerun with: -v
==25177== Use --track-origins=yes to see where uninitialised values come from
==25177== ERROR SUMMARY: 4 errors from 3 contexts (suppressed: 5120 from 514)
Segmentation fault

@buma
Contributor Author

buma commented Aug 10, 2012

I ran memtest overnight: 7 passes, no errors.

@buma
Contributor Author

buma commented Aug 10, 2012

Interestingly, I get the same error on PiCloud with an f2 instance; an m1 instance seems to work.

You first have to create an environment, install gensim and upload the dictionary and data files. After you sign up you get 20 hours free each month, so you don't need to pay to test this.

Then you can run this:

from gensim import corpora, models
from gensim.corpora.dictionary import Dictionary
import logging
import cloud

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def run():
    moj_korpus = corpora.mmcorpus.MmCorpus("podatki_del.mm")
    print moj_korpus

    dictionary = Dictionary.load_from_text("dictionary.txt")
    print dictionary
    rpmodel = models.ldamodel.LdaModel(moj_korpus, id2word=dictionary, num_topics=97, update_every=1, chunksize=10)

    print rpmodel

    corpus_rpmodel = rpmodel[moj_korpus]

    corpora.MmCorpus.serialize('lda_corpus.mm', corpus_rpmodel)

    cloud.files.put('lda_corpus.mm')

cloud.call(run, _type='f2', _env='zagensi')

@piskvorky
Owner

There is a mismatch between the ids in the dictionary and in the corpus -- one has 592,158 items, the other 591,333. This leads to a segfault when numpy tries to access memory for ids that do not exist.

The corpus must be created with the same dictionary (the number of items has to match exactly). See also here: https://groups.google.com/group/gensim/browse_thread/thread/d42ab19f60a228db
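
A minimal sketch of that consistency check, assuming the files from the report above and MmCorpus's num_terms attribute for the feature count:

from gensim import corpora
from gensim.corpora.dictionary import Dictionary

# The corpus feature count must equal the number of tokens in the dictionary,
# otherwise LDA ends up indexing term ids its arrays do not have.
corpus = corpora.MmCorpus("podatki_del.mm")
dictionary = Dictionary.load_from_text("dictionary.txt")
print corpus.num_terms, len(dictionary)   # here: 591333 vs 592158 -- mismatch
assert corpus.num_terms == len(dictionary), "corpus and dictionary do not match"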

@piskvorky
Owner

Also, 500k features is probably too many -- check out the Dictionary.filter_extremes() method, or use stemming, to bring the count down to something like 50k.
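
A hedged sketch of that pruning step -- the file names and thresholds are only illustrative, and it assumes a Dictionary built the standard way, so that dfs and num_docs are populated (see the discussion further down):

from gensim.corpora.dictionary import Dictionary

dictionary = Dictionary.load("full_dictionary.dict")   # hypothetical dictionary saved with save()
# Drop tokens that appear in fewer than 5 documents or in more than half of them.
dictionary.filter_extremes(no_below=5, no_above=0.5)
print dictionary    # should now report far fewer unique tokens
dictionary.save("pruned_dictionary.dict")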

@buma
Contributor Author

buma commented Aug 11, 2012

Thanks for trying to help.

I tested this on the full example, where there is no mismatch, and the error is the same:
2012-08-11 12:39:23,228 : INFO : initializing corpus reader from podatkimm.mm.mtx
2012-08-11 12:39:23,241 : INFO : accepted corpus with 175315 documents, 592158 features, 24024123 non-zero entries
MmCorpus(175315 documents, 592158 features, 24024123 non-zero entries)
Dictionary(592158 unique tokens)
2012-08-11 12:39:24,575 : INFO : using serial LDA version on this node
2012-08-11 12:39:36,696 : INFO : running online LDA training, 97 topics, 1 passes over the supplied corpus of 175315 documents, updating model once every 10 documents
2012-08-11 12:39:36,754 : INFO : PROGRESS: iteration 0, at document #10/175315
2012-08-11 12:39:36,927 : INFO : 3/10 documents converged within 50 iterations
2012-08-11 12:39:38,437 : INFO : merging changes from 10 documents into a model of 175315 documents
Traceback (most recent call last):
  File "example.py", line 32, in <module>
    run()
  File "example.py", line 15, in run
    rpmodel = models.ldamodel.LdaModel(moj_korpus, id2word=dictionary, num_topics=97, update_every=1, chunksize=10)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 265, in __init__
    self.update(corpus)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 455, in update
    self.do_mstep(rho(), other)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 489, in do_mstep
    diff -= self.state.get_Elogbeta()
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 153, in get_Elogbeta
    return dirichlet_expectation(self.get_lambda())
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 60, in dirichlet_expectation
    result = psi(alpha) - psi(numpy.sum(alpha, 1))[:, numpy.newaxis]
MemoryError

I'll try to make a small example with no mismatch and I'll look into that thread.

I will look into my code that creates the corpus.

I know this is too many features; I tried to eliminate some with PCA (doesn't work in scikit-learn, chi2 also doesn't work).

Stemming is useless in my case because I don't have the text, only a count matrix.

I used filter_extremes -- I don't remember the exact parameters, but the result was an empty dictionary.

I have to eliminate some features somehow, because there is too much data and I have problems with the classifiers.

@buma
Contributor Author

buma commented Aug 11, 2012

It seems the problem is that my dataset is too big.

I have now run LdaModel training without a dictionary and I still get the same error:

rpmodel = models.ldamodel.LdaModel(moj_korpus, num_topics=97, update_every=1, chunksize=10)

There were fewer features in the corpus than in the dictionary because I created the dictionary from the full data.

I tested with 10 documents and 50,000 features (only 35 non-zero entries) and LDA worked. Those were just the first 50,000 features.

But I don't know how to prune the dictionary. If I run dictionary.filter_extremes() I get an empty dictionary, and the same happens if I set the no_above parameter to 0.9. And how do I apply the new dictionary to the corpus?

All the examples are for converting text to word counts. Is it possible to update the corpus with a pruned dictionary if I only have a word-count matrix?

@piskvorky
Owner

Yes, it looks like the out-of-memory exception (not a segfault) is caused by the excessive number of features.

d.filter_extremes() works if your d.dfs dictionary contains the "document frequency counts" (it normally does, if you created your Dictionary the standard way). filter_extremes will simply look at which tokens have a document frequency below/above the specified thresholds and remove those from the dictionary.
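
A tiny illustration of that, assuming a dictionary d whose statistics are populated (the thresholds are only examples):

# dfs maps token id -> number of documents that contain the token.
print len(d.dfs), d.num_docs
rare = sum(1 for docfreq in d.dfs.itervalues() if docfreq < 5)
print "%d tokens appear in fewer than 5 documents" % rare
d.filter_extremes(no_below=5, no_above=0.5)   # removes both very rare and very common tokens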

@buma
Contributor Author

buma commented Aug 11, 2012

I create the dictionary as a tab-separated text file with the id (the column number), the word (the same column number as a string) and a count of how many times it appears in the whole corpus, then I open this text file with load_from_text.

The dfs attribute has 592,158 entries -- the same as the number of features -- and it is a dictionary.

@piskvorky
Owner

Aha, filtering ought to work then. How exactly do you call the filter_extremes method?

If you post your dictionary file, maybe I can have a look.

@buma
Contributor Author

buma commented Aug 12, 2012

Thanks for trying. This is the dictionary file for the full example.

Dictionary for the previously provided corpus.

@piskvorky
Owner

I checked your file; you are right, dfs is not enough to run the filtering. You'd need the full dictionary, as saved with save() and loaded with load() (not just save_as_text()).

The text format is only meant for debugging; some information is not stored there. But this missing information is needed by filter_extremes().

Fix1: if you have the dictionary stored with save, load it and run filter_extremes() from there.

Fix2: if not, you can do d.num_docs = total_number_of_documents_that_were_used_to_build_the_dictionary and then call d.filter_extremes()

Fix3: or use a script (bash/python) to find the lines in the text file where the document frequency (third column) is between 5 and 1,000,000 (or whatever, depending on how much you want to filter). Collect their ids (first column) and after that call d.filter_tokens(good_ids=list_of_these_ids); d.compactify().
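
A rough sketch of Fix3 (with Fix2's num_docs assignment thrown in), assuming the tab-separated id/word/document-frequency layout described earlier in the thread; the thresholds and file names are only examples:

from gensim.corpora.dictionary import Dictionary

d = Dictionary.load_from_text("dictionary.txt")
d.num_docs = 175315    # Fix2: total documents behind the dictionary (from the full-corpus log above)

# Fix3: keep only the ids whose document frequency (third column) is in the wanted range.
good_ids = []
with open("dictionary.txt") as f:
    for line in f:
        tokenid, token, docfreq = line.rstrip("\n").split("\t")
        if 5 <= int(docfreq) <= 1000000:
            good_ids.append(int(tokenid))

d.filter_tokens(good_ids=good_ids)   # drop everything not in good_ids
d.compactify()                       # reassign ids to remove the gaps
print d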

@buma
Contributor Author

buma commented Aug 24, 2012

I'm sorry, I was away. I added num_docs to the dictionary and filter_extremes worked.

I added num_pos (the number of words in the documents) and num_nnz (the number of unique words in the documents, i.e. all the non-zero values in the document x term matrix) to my dictionary, and LDA worked on a 64-bit machine with 8 GB.
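
Roughly, a sketch of that workaround -- num_docs and num_nnz can be read off the Matrix Market corpus itself, while num_pos (the total token count) has to be summed up by streaming the corpus, so this is only an approximation of what was actually done:

from gensim import corpora
from gensim.corpora.dictionary import Dictionary

mm = corpora.MmCorpus("podatkimm.mm.mtx")
d = Dictionary.load_from_text("dictionary.txt")

d.num_docs = mm.num_docs    # documents the dictionary was built from
d.num_nnz = mm.num_nnz      # non-zero entries in the document x term matrix
d.num_pos = int(sum(cnt for doc in mm for _, cnt in doc))    # total word count; streams the whole corpus

d.filter_extremes(no_below=5, no_above=0.5)    # thresholds are only examples
print d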

On my 8 GB PAE 32-bit machine it still doesn't work.

I think one idea would be to add a function which creates a dictionary from a document x term matrix. Maybe I'll try to add it myself.

This issue could be closed.

buma closed this as completed Aug 24, 2012