Segmentation fault training NER with large number of training examples #1757 #1335 #1969
Are you running out of memory? Large batch sizes can get expensive.
@honnibal, my apologies for the confusion. My earlier post should have referred to a large number of examples rather than batch size. The segmentation fault occurs even for small batch sizes if the number of total training examples is large. Thanks!
Interesting, thanks. Are you able to share your data? The pre-trained models are built on a few million words (a few tens of thousands of sentences), so I doubt it's strictly the size of the dataset that's causing the problem.
@honnibal I've tried multiple batch sizes (as small as 6), but get the following error mid-training (each time after an unpredictable number of iterations):
Any guidance or suggestions would be greatly appreciated. Thanks! Example training item:
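The actual training item is not shown above; a hypothetical example in the spaCy 2.x training format (text plus character-offset entities) might look like this, with illustrative text and labels:

```python
# Hypothetical training item in spaCy 2.x format: (text, {"entities": [...]}).
# Each entity is a (start_char, end_char, label) tuple; values are made up.
TRAIN_ITEM = (
    "Apple is looking at buying U.K. startup for $1 billion",
    {"entities": [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]},
)

# Sanity check: each span should slice cleanly out of the text.
text, annotations = TRAIN_ITEM
for start, end, label in annotations["entities"]:
    assert 0 <= start < end <= len(text)
```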
Hi all, I left the same comment on this issue, which is significantly older: #1335 I would just like to mention that I am experiencing the same error - I am getting a Segmentation Fault when I am training a new NER model starting from a blank language model. The weird thing is that it happens kind of randomly; when I execute the same python script many times, sometimes I get it, sometimes not. I am following the exact tutorial from https://spacy.io/usage/training , except I use my own dataset rather than the animal example. I am using spacy 2.0.9 and my code is pretty much the same as the tutorial. The only difference is that my training snippets are large paragraphs, and not sentences, and the "entities" dictionary contains multiple annotations, e.g.
If I can provide more info, please let me know.
Alright, I might have found a workaround. I ran into this issue under similar circumstances to the other guys: ~500 examples of text (with a word count of about 2000 each) plus entities. Training on 100 of those examples would work fine, but when I added more examples I would get the segmentation fault.

My thought was that it might be memory related due to the error type, so I tried loading the examples with a generator so the whole thing wasn't stored in memory. No dice, still getting the error. My next thought was that maybe a particular example was causing the error (maybe a very large one), so I decided to print the length of each example in the update loop of the model. And to my surprise, no more errors! I did some googling and it looks like there is a bug in a previous version of macOS and Python that would cause a segmentation fault with readline; it may be related, but I'm not sure.

So as a hot fix for anyone looking: just print inside the update loop. I decided to print the status of the training like this:
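The exact print statement was lost from the thread; a minimal sketch of the kind of update loop being described, with a status print on every batch, could look like this (names such as `nlp` and the print format are assumptions, not the poster's exact code):

```python
import random

def train_with_status(nlp, train_data, n_iter=10, batch_size=8):
    """Update loop that prints progress before each update; the print was
    reported above as a (probably coincidental) segfault workaround."""
    losses = {}
    for epoch in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        for i in range(0, len(train_data), batch_size):
            batch = train_data[i:i + batch_size]
            texts = [text for text, _ in batch]
            annots = [ann for _, ann in batch]
            # Print before each update, as in the workaround above.
            print("epoch %d, batch %d, max len %d"
                  % (epoch, i // batch_size, max(len(t) for t in texts)))
            nlp.update(texts, annots, losses=losses)
        print("epoch %d losses: %s" % (epoch, losses))
    return losses
```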
Python version: 3.6.3

Edit: Maybe not; it worked 3 times in a row, so I thought it might have fixed it, but I guess not.
@mitchemmc unfortunately, I'm still getting the segmentation fault 11 error even when printing within the update loop.
Small follow-up on my end: at the beginning it looked like the neural network was getting the segmentation fault and crashing on initialization, because I was training it for 20 epochs. But today something inconsistent happened; I got the segmentation fault on the 3rd epoch. It just happened once. By the way @mitchemmc, printing didn't help, although thanks for the recommendation.

What I am trying now is splitting the training data into smaller chunks (e.g. paragraphs -> sentences) and adjusting the entity offsets accordingly. If this eliminates the error, I hypothesize it has to do with the length of the sentences (and therefore something goes wrong in the memory allocation when initializing the embedding layer). Another thing I want to try is fixing the random seed. I will come back to you after I investigate further. Thanks.

Edit: It seems that splitting the training data into smaller chunks resolved the error. I implemented an algorithm that, given an annotated document like:
And annotations like:
Splits it into sentences:
and then trained my system on the sentences. So, for each training data point where I normally had a text snippet of ~5000 characters, now each data point has ~200 characters, and I have a lot more of them. This seems to have resolved the issue, although it probably still needs to be addressed as a bug.

Edit 2: False alarm, the solution above has not resolved the issue; fixing the random seed did not work either.
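The splitting algorithm itself was not posted; a rough sketch of the idea, using naive period-based sentence splitting (purely illustrative, a real implementation would use a proper sentence segmenter) and rebasing entity offsets onto each sentence, might look like:

```python
def split_annotated_doc(text, entities):
    """Split an annotated document into sentence-level training items,
    keeping only entities that fall entirely inside one sentence and
    rebasing their character offsets onto that sentence."""
    items = []
    start = 0
    while start < len(text):
        end = text.find(". ", start)
        end = len(text) if end == -1 else end + 1  # keep the period
        sent = text[start:end].strip()
        if sent:
            offset = text.index(sent, start)
            sent_ents = [
                (s - offset, e - offset, label)
                for s, e, label in entities
                if s >= offset and e <= offset + len(sent)
            ]
            items.append((sent, {"entities": sent_ents}))
        start = end + 1
    return items
```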
I am facing exactly the same issue. Any updates on this? If not, I will investigate tomorrow, and of course I'll share my findings. Thanks!
@peustr @nikeqiang just out of curiosity, does your dataset contain any special characters like ë, á, etc.? It might help me tomorrow when investigating, although I doubt these characters are the issue.

Edit: it might have something to do with the encoding of the dataset.
@peustr @nikeqiang do any of you read the training data from an external source before starting the training? Say, a JSON file or a database? I might have something, but I'm not sure yet... Care to elaborate on how your data is loaded into the Python script? The code would be helpful to test some things.
Hi @ruudschuurmans. My dataset does contain special characters, not only UTF-8 letters but symbols and such as well. However, I don't think that's the problem, because then I would expect it to crash more deterministically, e.g. whenever it tried to parse the sentences with the symbols.

Something I failed to mention previously: all the segfaults I had occurred on an OSX operating system. While investigating, I thought let's move the training to an Ubuntu server we had available. And no more crashes; it trains flawlessly end-to-end 100% of the time. There are 2 main differences between my work laptop and the Ubuntu server: the operating system and the Python version. The default python3 version in Ubuntu 16.04 is 3.5, while I have the latest, 3.6. So maybe this is part of the cause? Is it easy for you to try and train your system with Python 3.5, or on an Ubuntu server, to see if that guess is any good?
Hey @peustr, thanks for elaborating! :) I'll downgrade to Python 3.5 on Windows first to see if that works. I can try Ubuntu as well if it doesn't.

Btw, I have noticed something funny. When I define my dataset in code as a Python variable (copy-paste, it's just JSON) and format it into the tuples spaCy consumes using some basic Python, it works 100% of the time. Yet if I load my dataset from a file, parse the JSON, and then format it into the tuples spaCy consumes, it gives me segmentation errors. Are you doing something similar with your data? Do you recognise this behaviour?

Update: Python 3.5 leaves me with the same problem, so that did not work :(
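For reference, the file-loading path being described could be sketched like this; the JSON field names (`text`, `entities`) are hypothetical and would need adjusting to the actual annotation schema:

```python
import json

def load_train_data(path):
    """Load annotations from a JSON file and convert them into the
    (text, {"entities": [(start, end, label), ...]}) tuples that
    spaCy 2.x consumes. Assumes records shaped like
    {"text": ..., "entities": [[start, end, label], ...]}."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    train_data = []
    for rec in records:
        ents = [(int(s), int(e), label) for s, e, label in rec["entities"]]
        train_data.append((rec["text"], {"entities": ents}))
    return train_data
```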
Hey @ruudschuurmans, I also load my annotations from files, namely BRAT-annotated documents. I parse them, convert them to the format spaCy requires, and feed them to the classifier.
Hey @peustr, if you find the time, can you try defining a file's content directly in your Python script? That seems to work for me (tested quite a few times), but it might also just be luck. Gonna test in a Linux VM this afternoon.

Edit: tried another 5 times on Win10 with my data defined in the Python script, running the script from the terminal (python test.py), and that really seems to work every time. Will try another 5 times this afternoon. Could it be an issue with data loaded from disk, since @peustr and I are both loading our data from a file and converting it? @honnibal do you maybe have an insight to point me in the right direction for researching this problem?
I can confirm that Python 3.5.2 on Ubuntu 16.04.4 LTS is working fine. So according to the information @peustr provided and my own problems and findings, this issue only occurs on macOS and Windows.
@ruudschuurmans, my data is loaded from a JSON file into a Python dictionary and then used in training the model. One thought I had for testing purposes was to start with a single training example and then simply duplicate this example X times with random strings substituted for a single named entity, and see if there is a specific threshold (number of training examples) above which the segfault starts to occur. This would help confirm that the problem is not related to special characters or encoding, and would suggest it is a memory issue.
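That threshold experiment could be sketched like this; the template sentence and the ORG label are made up for illustration:

```python
import random
import string

def make_synthetic_data(n_examples, seed=0):
    """Duplicate one template example n times, substituting a random
    string for the single named entity, to probe whether a segfault
    appears above some example-count threshold."""
    rng = random.Random(seed)
    template = "The company {} reported strong earnings."
    start = template.index("{}")
    data = []
    for _ in range(n_examples):
        name = "".join(rng.choice(string.ascii_uppercase) for _ in range(8))
        text = template.format(name)
        data.append((text, {"entities": [(start, start + len(name), "ORG")]}))
    return data
```

Training runs over `make_synthetic_data(100)`, `make_synthetic_data(200)`, and so on would then bracket the failure point while holding content constant.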
@ruudschuurmans @peustr @nikeqiang @mitchemmc Thanks all for your patience with this, and for keeping the thread updated. There are a couple of issues that might be going on here. I suspect different people are getting the errors for different reasons, which makes things super difficult to tease apart.

The first type of problem is that spaCy currently requires quite a lot of internal memory per token of the input. I've been using a length limit of 5000.

The second type of problem that might be occurring relates to concurrency issues within the matrix multiplication library that numpy might be calling. In particular, if training is called from a subprocess on OSX, and numpy is linked against the Accelerate framework (the default), that causes an error. I've been working on this: the next version of Thinc brings its own matrix multiplication, and makes it easy to link a different library on installation, to more easily customise this. There may also be issues around concurrency on Windows. I'm not sure what numpy links to on Windows, or what problems might be caused. Again, I've been working on this: I hate relying on the system state at all, which is why I've gone to a lot of trouble to ship the matrix multiplication within Thinc.

Finally, there can be bugs in the NER. I'm especially thinking the reports by @mitchemmc and @nikeqiang might be pointing to bugs within spaCy. I've recently fixed two problems with the NER that might be relevant, and another potentially relevant bug I fixed recently related to an interaction with the entity recognizer.

Also, check whether any of your documents are empty, or consist only of whitespace. This shouldn't be an issue, but it's one possibility. And of course, also check for extremely long documents. Another possibility: is your gold-standard data all correct? Errors should be raised if the gold-standard entities are out-of-bounds or if the tag sequence is invalid, but these are "shoulds". Are any labels being added during training? This should be handled correctly, but could also cause problems. Some things that are unlikely to be at fault:
Thanks again for your help figuring this out!
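The data checks suggested above (empty or whitespace-only documents, extreme lengths, out-of-bounds entity spans) can be automated with a small validator; the 5000-character cutoff mirrors the limit mentioned and is otherwise arbitrary:

```python
def check_train_data(train_data, max_len=5000):
    """Return a list of (index, problem) pairs for suspicious examples:
    empty or whitespace-only texts, very long texts, and entity spans
    that are out of bounds or inverted."""
    problems = []
    for i, (text, annotations) in enumerate(train_data):
        if not text.strip():
            problems.append((i, "empty or whitespace-only text"))
        if len(text) > max_len:
            problems.append((i, "text longer than %d chars" % max_len))
        for start, end, label in annotations.get("entities", []):
            if not (0 <= start < end <= len(text)):
                problems.append((i, "bad span (%d, %d, %r)" % (start, end, label)))
    return problems
```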
Hey @nikeqiang, thanks for elaborating! After my tests today I am pretty sure it's a memory issue as well. Hmm, it seems the three of us have one thing in common: we load the data from a filesystem and process it in some way in Python before we use it to train the spaCy NER. Maybe that's a good starting point. If you find the time, can you try to define (a part of) your JSON in Python as a variable (so nothing is loaded from the filesystem) and see if that works? For me it does, so that's strange...
@honnibal thanks for the extended answer! Really appreciate it, will look at it tomorrow :)
@ruudschuurmans Unfortunately, even after defining the training data explicitly as a (very long) variable within the body of my Python training script, I still get a segfault error. I note that the error is more likely to occur as (a) the number of iterations increases and/or (b) the number of training examples increases.

@honnibal Thanks very much for the in-depth reply. One question I had was whether any of the potential causes you mention would throw an error inconsistently on the same training data. I'm encountering a scenario where the same data (e.g. the same 300+ examples) will not throw an error if the number of iterations is low (say 20) and the batch size is small (say 12), but throws an error when either of these is substantially increased (unless I decrease the number of training examples). I just thought I would mention this in case it helps rule out any potential sources of the error (e.g. pre-existing annotations, new labels being added, etc.). The training script I'm using is substantially similar to the one provided on the spaCy website (train_new_entity_type.py). I'll keep looking into the other points you raised, and also look forward to trying 2.1 to see if the improvements you mentioned already solve the issue!
@honnibal thank you for being on top of this, let us know if you need any further information to assist you.
Thanks for raising this again, it's definitely relevant. I forgot to add an important piece of context behind the error.

The core of the NER (and dependency parser) is a state machine. The statistical model tries to predict transition actions, which the state machine uses to transition to the next state. The error is almost certainly specific to arriving at some particular state that's invalid (or, almost equivalently, getting into a valid state and then having some invalid operation performed). During training, we're asking which action or actions are optimal to take next given the current state; that's the supervision signal. We then follow the predicted action, because we want to train from realistic states that result from previous errors. We don't just want to train from optimal histories.

That's why the training history matters. The batch size, shuffling, epochs etc. all change the parser's model weights. Those model weights change the predicted actions, changing the states we encounter.

Probably the best way to debug this is to set the batch size to 1, print the text being trained from, and leave it training to try to hit the error. We then need to hack in an additional print statement within the relevant transition code. This print-based process is how I usually debug these things. I know many others would be shaking their heads and thinking "just use a debugger". Honestly, I really just never learned to do that; but if you're familiar with that workflow, maybe it's actually better here.
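A sketch of that debugging recipe: batch size 1, with each text printed before its update, so the last line printed before a crash identifies the offending example (`nlp` is assumed to be an existing pipeline):

```python
def debug_train(nlp, train_data):
    """Train one example at a time, printing each text first; when the
    process segfaults, the last printed text is the one that drove the
    parser into the bad state."""
    losses = {}
    for i, (text, annotations) in enumerate(train_data):
        print("example %d (%d chars): %r" % (i, len(text), text[:80]))
        nlp.update([text], [annotations], losses=losses)
    return losses
```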
@ruudschuurmans I doubt reading the data in from disk would be a likely factor; I stream data when I train as well, and haven't hit the same error (which is definitely something that puzzles me...). I think it's just that the error is a bit exceptional, and for non-trivial training, streaming data is a pretty sensible thing to do.
@honnibal I agree that it's not likely, but I have tested another 20 times with 10 iterations, and when streaming it failed 19/20 times, while when I defined the data in code it succeeded 20/20 times. So that's what's really confusing me, haha...
Wow, weird! Could your data be loading differently? That might get you into different states.
I'm having this issue when adding NEs from matcher-derived rules before running statistical NER. Sometimes the error is a segfault; sometimes it's a "bus error":
Sometimes I get a "Segmentation fault: 11". It seems random. |
But it appears to be tied to adding new entity types, so maybe I'm doing something wrong here:
Never mind, just needed to call "add_label" on the "entity" member of nlp.
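In spaCy 2.x that fix looks roughly like the following; the label names here are illustrative, and `nlp.entity` is the same entity recognizer you get from `nlp.get_pipe("ner")`:

```python
# Hypothetical new entity labels; each must be registered with the
# entity recognizer before training so they become valid output classes.
NEW_LABELS = ["PRODUCT_CODE", "DEPT"]

def register_labels(ner, labels):
    """Call add_label for each new entity type on the NER pipe
    (nlp.entity / nlp.get_pipe('ner') in spaCy 2.x)."""
    for label in labels:
        ner.add_label(label)
```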
It works, nominally (the forced assignment of a new entity type seems to have thrown the model off considerably: Suzy Jones = ORG). But it works. Should I delete my comments to avoid throwing the search engines off?
I am encountering the same error, which appears randomly during training of NER from a blank Chinese model. The training data size is around 200k. I am thinking of saving the model after every training iteration; when the error appears, it will be captured (still figuring out how to capture a segmentation fault), and then the training starts again by reading the model saved on the last iteration. From my understanding of neural networks, this should work and won't violate the theory behind it :)
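One way to realise that save-and-resume idea: since a segfault kills the Python interpreter itself and cannot be caught with try/except, drive the training from a wrapper process. The child script name (`train_ner.py`) and its convention of saving a checkpoint after every iteration are hypothetical here:

```python
import subprocess
import sys

def train_with_restarts(max_restarts=10):
    """Run a (hypothetical) training script in a child process; if the
    child dies abnormally (e.g. SIGSEGV, returncode -11 on POSIX),
    restart it so it can resume from the model checkpoint it saved
    after the last completed iteration."""
    for attempt in range(max_restarts):
        # train_ner.py is assumed to load ./checkpoint if present and
        # to save ./checkpoint after every iteration.
        result = subprocess.run([sys.executable, "train_ner.py"])
        if result.returncode == 0:
            print("training finished after %d restart(s)" % attempt)
            return True
        print("training crashed (code %d), restarting" % result.returncode)
    return False
```

Since plain SGD-style training resumes cleanly from saved weights, this is sound in principle, though any optimizer state (e.g. momentum terms) is lost unless the child script saves that too.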
I'm also facing the same issue, running over training data with two samples of ~10k characters and 100 entity annotations.
@fatalhck can you try Ubuntu with the same dataset and code? I'm curious if that works for you.
I would like to share that after running into the same problem on macOS (10.13.5), I was able to run my datasets on an Ubuntu machine without any problem. On macOS, if I skip those texts with long length (10k), my training works without the segmentation fault 11. Based on these tests, I guess it is a memory-related and OS-related issue. So if you want to train your spaCy model, run it on Ubuntu systems.
To elaborate on just how accurate the suggestion to train on Linux systems is: on native macOS I was encountering segfaults on the 1st-2nd iteration with 20 text blocks (avg ~1-1.5k chars) and 20 labels. Running the same code in a Docker container in a VM finishes training successfully.
@ryan2x @shrikrishnaholla Thanks for the analysis. Could you check whether the same occurs with the development build?
@shrikrishnaholla any chance you can share your Dockerfile?
@honnibal tried with the development build; it still crashes, but less predictably. What I mean by that is, the stable build fails predictably: it always trains at least for half an iteration and fails somewhere between the 1st and 5th iteration (for the same dataset I mentioned earlier).

An unrelated issue, which you might already be aware of: I saw that the NER model that was built was significantly worse at predicting entities. I don't know whether this is because of some API change; let me know if I should test further. It "favors" a particular label and predicts it as the entity for nearly all the spans. The label that will be favoured is unpredictable too, and is seemingly random.
I'm experiencing this exact same issue with the training data provided by this tutorial: https://github.com/DataTurks-Engg/Entity-Recognition-In-Resumes-SpaCy The segmentation fault appears to occur randomly within the first iteration; printing within the update loop and fixing the random seed do not fix the issue.
This issue goes away if you are not using macOS or Windows. Simply use Docker to run the program; it works. For example, you can do: docker run --rm -it -v $PWD:/app -w /app python:3 bash, then install spaCy inside this container and run your Python code.
+1 to @ryan2x: got this working with Docker on macOS High Sierra.
I believe the main underlying issue is fixed now: ad068f5. Thanks for your patience with this.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Re-opening this as a new issue specifically related to NER training with many examples. Relates to #1757 #1335 (which appear to be closed).

Training NER on 500+ examples throws a segmentation fault error:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Anybody found a solution/workaround for this?
Thanks!
Info about spaCy:
Python version: 3.6.3
spaCy version: 2.0.5
Models: en, en_core_sm
Platform: macOS