LDA model gives Segmentation Fault #92
I ran memtest overnight: 7 passes, no errors.
Interestingly, I get the same error on PiCloud with an f2 instance; an m1 instance seems to work. You have to first create an environment to install gensim and upload the dictionary and data files. After you sign up you get 20 free hours each month, so you don't need to pay to test this.

```python
from gensim import corpora, models
from gensim.corpora.dictionary import Dictionary
import logging
import cloud

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def run():
    moj_korpus = corpora.mmcorpus.MmCorpus("podatki_del.mm")
    print moj_korpus
    dictionary = Dictionary.load_from_text("dictionary.txt")
    print dictionary
    rpmodel = models.ldamodel.LdaModel(moj_korpus, id2word=dictionary, num_topics=97, update_every=1, chunksize=10)
    print rpmodel
    corpus_rpmodel = rpmodel[moj_korpus]
    corpora.MmCorpus.serialize('lda_corpus.mm', corpus_rpmodel)
    cloud.files.put('lda_corpus.mm')

cloud.call(run, _type='f2', _env='zagensi')
```
There is a mismatch between the ids in your corpus and in your dictionary. The corpus must be created with the same dictionary (the number of items has to match exactly). See also here: https://groups.google.com/group/gensim/browse_thread/thread/d42ab19f60a228db
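A quick way to check for such a mismatch (a minimal sketch, using the file names from the example above):

```python
from gensim import corpora
from gensim.corpora.dictionary import Dictionary

corpus = corpora.MmCorpus("podatki_del.mm")
dictionary = Dictionary.load_from_text("dictionary.txt")

# For a matching pair these must be equal; here 591333 != 592158,
# which is exactly the mismatch.
print corpus.num_terms
print len(dictionary)
```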
Also, 500k features is probably too many; check out the `Dictionary.filter_extremes()` method.
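For reference, a typical call looks like this (the thresholds are only illustrative):

```python
# Drop tokens that appear in fewer than 5 documents or in more than
# 50% of all documents; the remaining ids are compactified afterwards.
dictionary.filter_extremes(no_below=5, no_above=0.5)
```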
Thanks for trying to help. I tested this on the full example, where there is no mismatch, and the error is the same. I'll try to make a small example with no mismatch, and I'll look into that thread. I will also look into my code that creates the corpus. I know this is too many features; I tried to eliminate some with PCA (it doesn't work in scikit-learn; chi2 also doesn't work). Stemming is useless in my case because I don't have text, only a count matrix. I used filter_extremes (I don't remember the parameters exactly) but the result was an empty dictionary. I have to eliminate some features in some way, because there is too much data and I also have problems with classifiers.
It seems that the problem is that my dataset is too big. I have now run LdaModel generation without a dictionary and I still get the same error: `rpmodel = models.ldamodel.LdaModel(moj_korpus, num_topics=97, update_every=1, chunksize=10)`. There were fewer features in the corpus than in the dictionary because I created the dictionary from the full data. I tested with 10 documents and 50,000 features (only 35 non-zero entries) and LDA worked; those were just the first 50,000 features. But I don't know how to prune the dictionary: if I run dictionary.filter_extremes() I get an empty dictionary, and the same if I set the parameter no_above to 0.9. And how do I apply the new dictionary to the corpus? All the examples are for converting text to word counts. Is it possible to update the corpus with the pruned dictionary if I only have a word-count matrix?
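One way to reuse a pruned dictionary on an already-vectorized corpus is to remap the ids document by document. A minimal sketch, assuming the file names from above; the old-to-new id map is built from the token strings, because filter_extremes() compactifies the ids:

```python
import copy
from gensim import corpora
from gensim.corpora.dictionary import Dictionary

corpus = corpora.MmCorpus("podatki_del.mm")
old_dict = Dictionary.load_from_text("dictionary.txt")

# Prune a copy, so that the original id mapping stays available.
new_dict = copy.deepcopy(old_dict)
new_dict.filter_extremes(no_below=2, no_above=0.9)

# old id -> new (compactified) id, matched via the token strings
old2new = dict((old_dict.token2id[token], new_id)
               for token, new_id in new_dict.token2id.iteritems())

def remap(corpus):
    # drop filtered-out features, translate the ids of the rest
    for doc in corpus:
        yield sorted((old2new[wid], cnt) for wid, cnt in doc if wid in old2new)

corpora.MmCorpus.serialize("podatki_pruned.mm", remap(corpus))
```

gensim also ships a `models.VocabTransform` wrapper for this kind of id remapping, if your version includes it.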
Yes, it looks like the out-of-memory exception (not a segfault) is caused by the excessive number of features.
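For scale: the model's lambda matrix is num_topics × num_terms in float64, and the m-step holds several arrays of roughly that shape at once, which quickly exhausts a 32-bit address space. A rough back-of-the-envelope check:

```python
num_topics, num_terms = 97, 592158
mb_per_copy = num_topics * num_terms * 8 / 2.0 ** 20  # 8 bytes per float64
print mb_per_copy  # ~438 MB per copy; a handful of copies won't fit in 32 bits
```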
I create the dictionary as a tab-separated text file with id (column number), word, and document frequency. The dfs parameter has 592158 entries, the same as the number of features.
Aha, filtering ought to work then. How exactly do you call the `filter_extremes()` method? If you post your dictionary file, maybe I can have a look.
Thanks for trying. This is the dictionary file for the full example. And this is the dictionary for the previously provided corpus.
I checked your file; you are right, the document frequencies are all there. The text format is only meant for debugging; some information is not stored there. But this missing information (such as num_docs, the total number of documents) is needed by `filter_extremes()`.

Fix 1: if you also have the dictionary stored with `Dictionary.save()` (not only the text format), load that one with `Dictionary.load()` and filter it.

Fix 2: if not, you can set the missing attributes (e.g. `dictionary.num_docs`) manually after loading the text file.

Fix 3: or use a script (bash/python) to find lines in the text file where the document frequency (third column) is between 5 and 1,000,000 (or whatever, depends how much you want to filter). Collect their ids (first column) and after that call `dictionary.filter_tokens(good_ids=your_collected_ids)`.
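A sketch of Fix 3, assuming the tab-separated layout described above (id, word, document frequency per line) and illustrative thresholds:

```python
from gensim.corpora.dictionary import Dictionary

dictionary = Dictionary.load_from_text("dictionary.txt")

# collect the ids whose document frequency (third column) is in range
good_ids = []
with open("dictionary.txt") as fin:
    for line in fin:
        wordid, word, df = line.rstrip("\n").split("\t")
        if 5 <= int(df) <= 1000000:
            good_ids.append(int(wordid))

# keep only those ids (this also compactifies the id range)
dictionary.filter_tokens(good_ids=good_ids)
```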
I'm sorry, I was away. I added num_docs to the dictionary and filter_extremes worked. I then added num_pos (the total number of words in the documents) and num_nnz (the number of unique words per document, i.e. the number of non-zero values in the document × term matrix) to my dictionary, and LDA worked on a 64-bit, 8 GB machine. On my 8 GB PAE 32-bit machine it still doesn't work. One idea is to add a function which creates a dictionary from a document × term matrix; maybe I'll try to add it myself. This issue can be closed.
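Regarding the proposed helper: a minimal sketch of building a Dictionary straight from a scipy sparse document × term count matrix, filling in the statistics that filter_extremes() and LDA need (the function itself is hypothetical, not part of gensim):

```python
import numpy
import scipy.sparse
from gensim.corpora.dictionary import Dictionary

def dictionary_from_matrix(mat, id2word=None):
    """Build a gensim Dictionary from a documents x terms count matrix."""
    mat = scipy.sparse.csc_matrix(mat)
    docfreqs = numpy.diff(mat.indptr)  # non-zeros per column = document frequency
    d = Dictionary()
    d.num_docs = mat.shape[0]      # total number of documents
    d.num_pos = int(mat.sum())     # total number of corpus positions (token count)
    d.num_nnz = int(mat.nnz)       # total number of non-zero matrix entries
    for termid in xrange(mat.shape[1]):
        token = id2word[termid] if id2word is not None else str(termid)
        d.token2id[token] = termid
        d.dfs[termid] = int(docfreqs[termid])
    return d
```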
I would like to use the LDA model on my corpus, but I get a segmentation fault when trying to transform the corpus.
Versions for running valgrind:
gensim 0.8.5
numpy 1.8.0.dev-fcdbcac
scipy 0.12.0.dev-e75a945
On the stable versions the problem is the same; only the file names in the valgrind output are missing, because there are no debug symbols:
gensim 0.8.5
numpy 1.6.2
scipy 0.10.1
I am running Arch Linux (kernel 3.4.7-1 PAE, i686) with 8 GB of memory and python 2.7.3-2.
All the files needed to run this are here.
Interestingly, in scikit-learn I have the same problem, with a similar stack trace.
I would be grateful for any help.
The segmentation fault happens when merging changes:

```
2012-08-09 22:17:47,234 : INFO : PROGRESS: iteration 0, at document #10/100
2012-08-09 22:17:47,354 : INFO : 2/10 documents converged within 50 iterations
2012-08-09 22:17:48,899 : INFO : merging changes from 10 documents into a model of 100 documents
Traceback (most recent call last):
  File "example.py", line 13, in <module>
    rpmodel = models.ldamodel.LdaModel(moj_korpus, id2word=dictionary, num_topics=97, update_every=1, passes=20, chunksize=10)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 265, in __init__
    self.update(corpus)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 455, in update
    self.do_mstep(rho(), other)
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 489, in do_mstep
    diff -= self.state.get_Elogbeta()
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 153, in get_Elogbeta
    return dirichlet_expectation(self.get_lambda())
  File "/home/mabu/kaggle/zascikit/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 60, in dirichlet_expectation
    result = psi(alpha) - psi(numpy.sum(alpha, 1))[:, numpy.newaxis]
MemoryError
```
Valgrind gives this:

```
2012-08-09 21:51:36,556 : INFO : loaded corpus index from podatki_del.mm.index
2012-08-09 21:51:36,576 : INFO : initializing corpus reader from podatki_del.mm
2012-08-09 21:51:36,579 : INFO : accepted corpus with 100 documents, 591333 features, 11109 non-zero entries
MmCorpus(100 documents, 591333 features, 11109 non-zero entries)
Dictionary(592158 unique tokens)
2012-08-09 21:52:38,601 : INFO : using serial LDA version on this node
==25177== Warning: set address range perms: large range [0xc84f028, 0x27e89318) (undefined)
==25177== Warning: set address range perms: large range [0x38d41028, 0x5437b318) (undefined)
==25177== Warning: set address range perms: large range [0xc84f018, 0x27e89328) (noaccess)
==25177== Warning: set address range perms: large range [0xc84f028, 0x27e89318) (undefined)
==25177== Warning: set address range perms: large range [0x75710028, 0x90d4a318) (undefined)
==25177== Warning: set address range perms: large range [0x97adf028, 0xb3119318) (undefined)
==25177== Warning: set address range perms: large range [0x75710018, 0x90d4a328) (noaccess)
==25177== Warning: set address range perms: large range [0x75710028, 0x90d4a318) (undefined)
==25177== Warning: set address range perms: large range [0x97adf018, 0xb3119328) (noaccess)
==25177== Warning: set address range perms: large range [0xc84f018, 0x27e89328) (noaccess)
==25177== Warning: set address range perms: large range [0xc84f028, 0x27e89318) (undefined)
==25177== Warning: set address range perms: large range [0x75710018, 0x90d4a328) (noaccess)
2012-08-09 21:56:07,067 : INFO : running online LDA training, 97 topics, 20 passes over the supplied corpus of 100 documents, updating model once every 10 documents
==25177== Warning: set address range perms: large range [0x75710028, 0x90d4a318) (undefined)
2012-08-09 21:56:07,978 : INFO : PROGRESS: iteration 0, at document #10/100
==25177== Warning: set address range perms: large range [0x97adf028, 0xb3119318) (undefined)
==25177== Conditional jump or move depends on uninitialised value(s)
==25177== at 0x4A2B7F3: PyArray_MapIterReset (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x4AA5E97: array_subscript (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x4AA691A: array_subscript_nice (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x408F4FF: PyObject_GetItem (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4124478: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== Conditional jump or move depends on uninitialised value(s)
==25177== at 0x4A2B7F3: PyArray_MapIterReset (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x4A9F10C: array_ass_sub (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x408F823: PyObject_SetItem (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412439F: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x40B6B20: function_call (in /usr/lib/libpython2.7.so.1.0)
==25177==
2012-08-09 21:56:09,730 : INFO : 2/10 documents converged within 50 iterations
==25177== Warning: set address range perms: large range [0x97adf018, 0xb3119328) (noaccess)
==25177== Warning: set address range perms: large range [0x97adf028, 0xb3119318) (undefined)
2012-08-09 21:56:34,273 : INFO : merging changes from 10 documents into a model of 100 documents
==25177== Invalid read of size 4
==25177== at 0x4B3960B: trivial_three_operand_loop (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x4B4EEF7: PyUFunc_GenericFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x4B4F1BA: ufunc_generic_call (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4090347: call_function_tail (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x409045F: _PyObject_CallFunction_SizeT (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4A67B80: PyArray_GenericBinaryFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x408BE96: binary_op1 (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x408E685: PyNumber_Multiply (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412587C: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== Address 0x1c is not stack'd, malloc'd or (recently) free'd
==25177==
==25177==
==25177== Process terminating with default action of signal 11 (SIGSEGV)
==25177== Access not within mapped region at address 0x1C
==25177== at 0x4B3960B: trivial_three_operand_loop (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x4B4EEF7: PyUFunc_GenericFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x4B4F1BA: ufunc_generic_call (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/umath.so)
==25177== by 0x409025F: PyObject_Call (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4090347: call_function_tail (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x409045F: _PyObject_CallFunction_SizeT (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4A67B80: PyArray_GenericBinaryFunction (in /home/mabu/kaggle/zascikit/lib/python2.7/site-packages/numpy/core/multiarray.so)
==25177== by 0x408BE96: binary_op1 (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x408E685: PyNumber_Multiply (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412587C: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x412A03C: PyEval_EvalCodeEx (in /usr/lib/libpython2.7.so.1.0)
==25177== by 0x4128233: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)
==25177== If you believe this happened as a result of a stack
==25177== overflow in your program's main thread (unlikely but
==25177== possible), you can try to increase the size of the
==25177== main thread stack using the --main-stacksize= flag.
==25177== The main thread stack size used in this run was 8388608.
==25177==
==25177== HEAP SUMMARY:
==25177== in use at exit: 1,899,201,128 bytes in 13,757 blocks
==25177== total heap usage: 719,362 allocs, 705,604 frees, 4,800,996,361 bytes allocated
==25177==
==25177== LEAK SUMMARY:
==25177== definitely lost: 24 bytes in 1 blocks
==25177== indirectly lost: 0 bytes in 0 blocks
==25177== possibly lost: 308,652 bytes in 541 blocks
==25177== still reachable: 1,898,892,436 bytes in 13,214 blocks
==25177== suppressed: 16 bytes in 1 blocks
==25177== Rerun with --leak-check=full to see details of leaked memory
==25177==
==25177== For counts of detected and suppressed errors, rerun with: -v
==25177== Use --track-origins=yes to see where uninitialised values come from
==25177== ERROR SUMMARY: 4 errors from 3 contexts (suppressed: 5120 from 514)
Segmentation fault
```