Segmentation fault using build_vocab(..., update=True) for Doc2Vec #1019
Vocab expansion for doc2vec is not supported yet, so I have labelled this as a new feature.
I ran into this also. I was taking a look at how vocabulary updating works in the online-training path for word2vec and tried to replicate the update for doc2vec's doctags. It seems to work: I can train the model with a few examples, then load it, train it more, and it will return the new doctags and vocabulary in the similarity functions. When storing the updated model I do have to give it a different filename, otherwise the segmentation fault still happens, but the weights look like they get updated too. Here are my edits to the original doc2vec.py: I added a function to store new doctags from new training in a new property.
I also added an update-weights function, which I call at the end when training resumes. Here is a link to the full file: https://gist.github.com/korostelevm/d48c80f296516deef045e5aa5dca1518 Disclaimer: I may not know what I'm doing at all, which is why I'm posting here for someone to hopefully verify.
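A minimal sketch of what such helpers might look like, assuming gensim 0.13-era internals (`docvecs.doctag_syn0`, `docvecs.count`, `doctag_syn0_lockf`); this is an illustration, not the gist's actual code:

```python
import numpy as np

def update_doctag_weights(model):
    """Grow the doctag vector array to cover tags added after the
    initial training round (illustrative helper, not gensim API)."""
    docvecs = model.docvecs
    old_count = len(docvecs.doctag_syn0)   # rows allocated so far
    new_count = docvecs.count              # doctags known after the vocab update
    if new_count > old_count:
        # Seed the new rows with small random values, mirroring how
        # reset_weights initializes fresh vectors.
        extra = np.random.uniform(
            -0.5 / model.vector_size, 0.5 / model.vector_size,
            (new_count - old_count, model.vector_size),
        ).astype(np.float32)
        docvecs.doctag_syn0 = np.vstack([docvecs.doctag_syn0, extra])
        # Learning-rate locks: 1.0 means a row is free to train.
        docvecs.doctag_syn0_lockf = np.ones(new_count, dtype=np.float32)
```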
As @tmylk notes, the existing vocab-expansion feature (`build_vocab(..., update=True)`) doesn't yet support Doc2Vec. The times that it's not SegFaulting, there may still be silent corruption – just no memory accesses so bad that they trigger the fault. Perhaps something in the Doc2Vec paths is still using lengths/references to data that wasn't refreshed by the vocabulary update.
That's what it seemed like to me. I forced it into the slow mode to debug it, then replaced the relevant code. After this, instead of getting a segmentation fault, I get an index error in the traceback, which I think was trying to tell me that index 10 of my doctags is beyond the 3 I had in there in the first round of training. So I did the stuff I mentioned above and it seemed to fix the issue. Put back the fast-mode flags and it still works.
I used ddd to debug the Cython code, and it seems that the segmentation fault appears at line 123 of doc2vec_inner.pyx:

```
if hs:
    codelens[i] = <int>len(predict_word.code)
    codes[i] = <np.uint8_t *>np.PyArray_DATA(predict_word.code)
    points[i] = <np.uint32_t *>np.PyArray_DATA(predict_word.point)
```

With the model's hs parameter set to 0 there are no errors (verified with ddd on both Python 2 and 3). So the proposed hotfix is to turn off hs mode when the model is updated.
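A hedged illustration of that hotfix at model-setup time (parameter values are examples, not prescriptions): build the model with hierarchical softmax disabled and negative sampling enabled, since the crash was traced to the hs-only branch above.

```python
from gensim.models.doc2vec import Doc2Vec

# hs=0 skips the hierarchical-softmax branch that segfaults after a
# vocabulary update; negative>0 keeps training functional without it.
model = Doc2Vec(hs=0, negative=5, min_count=1, workers=4)
```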
An appropriate hotfix would be to disable vocabulary expansion for doc2vec models, but a proper fix would be better.
Yes, and the proper fix will require figuring out why the model, post-vocab-update, is using some older or incorrect arrays or sizes, and thus making an improper/illegal memory access.
Current status: only works for hs=0.
Looks like I'm still getting a segfault when hs=0. (Based on doc2vec.py:590, it looks like hs=0 is the default, though the docs say it's 1.)
Apologies if my code is unclear, but essentially I'm doing the same thing as others above. Any help would be much appreciated. On a side note, I'm sure I'm using total_examples wrong: when I put in the real total_examples count across all training calls, it says something like the expected count doesn't match the count of sentences in my current call.
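For what it's worth, a minimal sketch of the `total_examples` convention being described (`model` and `new_sentences` are assumed from the surrounding discussion): each `train()` call should be told how many examples are in that call, not the cumulative total.

```python
# Pass the size of *this* batch; passing the cumulative count across all
# calls triggers the "expected count doesn't match" warning mentioned above.
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)
```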
Is it useful to call the train() function repeatedly on a Doc2Vec model without adding new vocabulary? Will the model get better for new data?
@rajivgrover009 Maybe. Whether it helps or hurts is probably dependent on your dataset, choice of parameters, and the relative contrast between your new texts and the earlier texts. The best-grounded course would be to mix new texts with old to make a new all-inclusive corpus, and continue training with that.
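A minimal sketch of that best-grounded course (names like `old_docs`/`new_docs` are illustrative): combine the corpora and build one model over the full set, so no vocabulary expansion is needed.

```python
from gensim.models.doc2vec import Doc2Vec

combined = old_docs + new_docs          # both lists of TaggedDocument
model = Doc2Vec(min_count=1, workers=4)
model.build_vocab(combined)             # single vocab pass, no update=True
model.train(combined, total_examples=model.corpus_count, epochs=model.epochs)
```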
There's another report from @mullenba in #1578, which includes a minimal triggering case.
I'm trying to look into this. Here is a status update. Previously, @tmylk reported that doc2vec's document expansion works as long as hs=0. To debug and iterate quickly, I used a small reproduce-and-inspect workflow:
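A minimal sketch of one such quick loop, assuming a standard-library `faulthandler` approach (not necessarily the exact workflow used here):

```python
import faulthandler

# Print the Python-level traceback of every thread if the process
# receives SIGSEGV, before it dies, so the crashing call is visible.
faulthandler.enable()

# ... build the model, call build_vocab(..., update=True), retrain,
# and watch which line is active when the fault fires ...
```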
The coredump points at this line; apparently the index is out of the bounds of the doctag array. The equivalent piece of code for word2vec is here. I've read that vocab expansion is supposed to work for word2vec, so I was planning to use that as a guide to check the differences. Anyone want to join me in this debugging adventure? 😄 PS: by the way, I tried to deliberately run the "slow" pure-python implementation of doc2vec to see if vocab expansion works. Same problem: it crashes here because the index falls outside the array.
The pure-python path isn't actually core-dump 'crashing', is it? (I'd think it'd have to be a printed exception, instead.) Note that segfault crashes are often caused by earlier memory-corruption, rather than the exact line where they're triggered.
Thanks, but in this case it seems that indeed the index is pointing outside of the doctag array.
Yes, it's not coredumping. As I said, it goes out of bounds when it reaches the first new doctag (i.e., "animals" at line 29 of this minimal code).
Please note that I had to add one extra line for it to get that far.
Pushed this fix for the "slow" version. Regarding the cythonized version... I'd need more time (and help).
Sure, but why would the index be out of the expected, functioning range? Often because of some (arbitrarily-)earlier memory-corruption.
@gojomo I received one more report of this problem; maybe we should raise an exception for this case (when `build_vocab(..., update=True)` is called on a Doc2Vec model)?
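A minimal sketch of such a guard, written as a wrapper rather than a patch to gensim itself (the function name and message are illustrative):

```python
from gensim.models.doc2vec import Doc2Vec

def safe_build_vocab(model, documents, update=False):
    # Refuse vocabulary expansion on Doc2Vec until it is supported,
    # instead of letting training segfault later (see this issue).
    if update and isinstance(model, Doc2Vec):
        raise ValueError(
            "build_vocab(..., update=True) is not supported for Doc2Vec "
            "models and can lead to a segmentation fault during train()"
        )
    model.build_vocab(documents, update=update)
```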
Hi, any update on this issue? I am able to train a doc2vec model with new documents on 32-bit Python (on 64-bit Python it still crashes), but I cannot query `model.docvecs.most_similar(["XXX"])` for newly added documents; it shows an index out of range error. An online approach for doc2vec would be very helpful.
@khulasaandh as far as I know, you can use `infer_vector` to get vectors for new documents instead.
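A minimal sketch of that `infer_vector` route (`model` is assumed to be the already-trained Doc2Vec; the tokens are placeholders):

```python
# Infer a vector for unseen text without touching the trained weights,
# then query the existing document vectors with it.
token_list = "the successful prediction of a stock future price".split()
infer_vector = model.infer_vector(token_list)
print(model.docvecs.most_similar(positive=[infer_vector]))
```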
Hi @menshikh-iv, thanks for the reply. I am using the same example posted by @danoneata, but have added a few more documents/lines to sentences_1 and sentences_2. As you mentioned, I am computing the inferred vector for the new document, as in the code I post below.
It returns the most similar documents but gives nan values in place of the similarity coefficients. Am I doing this wrong?
@khulasaandh looks really suspicious (your code is correct). Can you share data (trained model & token_list) for reproducing this error?
@khulasaandh @menshikh-iv A separate non-segfault anomaly, with `most_similar()` returning nan similarities, probably deserves its own report.
Hi @menshikh-iv and @gojomo, even on the 32-bit Python I am using, the segmentation fault still occurs sometimes, but most of the time the code runs. Please find the code below to replicate the issue.

```python
import logging

from gensim.models.doc2vec import (
    Doc2Vec,
    TaggedDocument,
)

logging.basicConfig(
    format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s',
    level=logging.DEBUG,
)


def to_str(d):
    return ", ".join(d.keys())


SENTS = [
    "anecdotal using a personal experience or an isolated example instead of a sound argument or compelling evidence",
    "plausible thinking that just because something is plausible means that it is true",
    "occam razor is used as a heuristic technique discovery tool to guide scientists in the development of theoretical models rather than as an arbiter between published models",
    "karl popper argues that a preference for simple theories need not appeal to practical or aesthetic considerations",
    "the successful prediction of a stock future price could yield significant profit",
]
SENTS = [s.split() for s in SENTS]


def main():
    sentences_1 = [
        TaggedDocument(SENTS[0], tags=['SENT_0']),
        TaggedDocument(SENTS[1], tags=['SENT_1']),
        TaggedDocument(SENTS[2], tags=['SENT_2']),
    ]
    sentences_2 = [
        TaggedDocument(SENTS[3], tags=['SENT_3']),
        TaggedDocument(SENTS[4], tags=['SENT_4']),
    ]

    model = Doc2Vec(min_count=1, workers=4)
    model.build_vocab(sentences_1)
    model.train(sentences_1, total_examples=model.corpus_count, epochs=model.iter)

    print("-- Base model")
    print("Vocabulary:", to_str(model.wv.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

    model.build_vocab(sentences_2, update=True)
    model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)

    print("-- Updated model")
    print("Vocabulary:", to_str(model.wv.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

    token_list = "the successful prediction of a stock future price could yield significant profit".split()
    infer_vector = model.infer_vector(token_list)
    print(model.docvecs.most_similar(positive=[infer_vector]))


if __name__ == '__main__':
    main()
```
Big thanks @khulasaandh, reproduced. The segfault moment:

```
In [6]: model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)
/home/ivan/.virtualenvs/math/bin/ipython:1: DeprecationWarning: Call to deprecated `iter` (Attribute will be removed in 4.0.0, use self.epochs instead).
  #!/home/ivan/.virtualenvs/math/bin/python
2018-03-28 02:18:17,204 : MainThread : INFO : training model with 4 workers on 68 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2018-03-28 02:18:17,207 : Thread-79 : DEBUG : job loop exiting, total 1 jobs
Segmentation fault (core dumped)
```
Does anyone have a workaround until this gets fixed?
Hello, I'm currently trying to get gensim to train on a couple of TaggedDocument objects which originate from a non-static source of input data. It's gensim 3.8.0 on Linux Debian Buster, 64-bit. The workaround offered by nsfinkelstein didn't work at all (besides, I do not know the size of my dictionary), which is sad, and is probably down to my poor Python experience (about... two weeks?). But I noticed something. If you add new content to your dictionary the way one would expect – put a new TaggedDocument into the model with model.build_vocab(documents=newTD, update=True) and then call model.train(newTD) – it goes straight into a segmentation fault.

Consider two documents, td1 and td2, where the second is a kind of logical extension of the first. As you might have observed, the dictionary adds one entry for every word, in roughly the order the words appear. But if you instead offer a separate document, td3, to the vocabulary update, you can actually train on td2 after adding td3 to the vocab. It gets a bit harder when you need to insert repeated words into the vocab: td1's vocabulary representation would omit the second 'and', for example. Offering this build_vocab(td3, update=True) will allow you to train the existing model with td2 (see the sketch below).

But... yes, there is always a but... while this does work with text (documents/words), as soon as you try to add tags to the whole thing, it goes right back to segfaulting itself to death. Not even the "offer a special TaggedDocument" trick can solve this :( And this brought me to a dead end, because I really need those tags... Any chance someone might find a solution for this?
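A minimal sketch of that trick, with made-up sentences and tags standing in for the original td1/td2/td3 (an assumption-laden illustration, not the commenter's actual data):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# td2 extends td1's text; td3 is a synthetic document holding only the
# words of td2 that td1's vocabulary lacks. All reuse the same tag,
# since adding a *new* tag is what still segfaults.
td1 = [TaggedDocument("the cat sat and sat and slept".split(), tags=['TD'])]
td2 = [TaggedDocument("the cat sat and slept then purred loudly".split(), tags=['TD'])]
td3 = [TaggedDocument("then purred loudly".split(), tags=['TD'])]

model = Doc2Vec(min_count=1)
model.build_vocab(td1)
model.train(td1, total_examples=1, epochs=model.epochs)

model.build_vocab(td3, update=True)   # expand the vocab via the synthetic doc
model.train(td2, total_examples=1, epochs=model.epochs)  # now trainable on td2
```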
Hello! I'm performing online learning for Doc2Vec: I learn an initial model on a set of tagged documents and try to update the model on a new set of tagged documents. If the second set contains new tags (tags that were not present in the initial set of documents), then I usually get a segmentation fault (this behavior is not deterministic, but it happens most of the time). Below you can find a toy example that reproduces the issue, and here is the output of that code. I'm using Python 3.4.3 and Gensim 0.13.3.
I've debugged with gdb and got the following output:
I'm willing to help fix this issue if someone can provide me some guidance. Thanks!
Sample code that reproduces the issue:
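A hedged reconstruction of such a toy example, modeled on the extended variant @khulasaandh posted above and written against the modern train() signature (the sentences and tag names here are placeholders; the shape – a new tag arriving in the second build_vocab/train round – is what triggers the fault):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences_1 = [
    TaggedDocument("the quick brown fox jumps".split(), tags=['pets']),
    TaggedDocument("over the lazy sleeping dog".split(), tags=['naps']),
]
sentences_2 = [
    # 'animals' is a tag unseen in round one; reaching it is where the
    # out-of-bounds doctag index (and often the segfault) shows up.
    TaggedDocument("cats and dogs and other beasts".split(), tags=['animals']),
]

model = Doc2Vec(min_count=1)
model.build_vocab(sentences_1)
model.train(sentences_1, total_examples=model.corpus_count, epochs=model.epochs)

model.build_vocab(sentences_2, update=True)   # expand vocab + doctags
model.train(sentences_2, total_examples=model.corpus_count, epochs=model.epochs)
```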