Memory leak of MorphAnalysis object. #13684
Comments
We have observed similar issues in our pipeline. As this minimal example with the da_core_news_md model shows, the vocab keeps growing:

import spacy

nlp = spacy.load("da_core_news_md")
test_texts = [
    "Varmere vintre: Flere trækfugle forurener søerne",
    "De højere vintertemperaturer giver problemer for landets søer.",
    "Blandt andet fordi flere trækfugle sover på vandet.",
    "I 1980'erne var der omkring 200 grågæs i Danmark om vinteren.",
    "I dag kan der være helt op mod 100.000.",
]
for text in test_texts:
    print("Vocab size before nlp:", len(nlp.vocab))
    with nlp.memory_zone():
        doc = nlp(text)
        print("Vocab size after nlp:", len(nlp.vocab))
    print("Vocab size out of memory zone:", len(nlp.vocab))

Output:

Vocab size before nlp: 2269
Vocab size after nlp: 2275
Vocab size out of memory zone: 2275
Vocab size before nlp: 2275
Vocab size after nlp: 2283
Vocab size out of memory zone: 2283
Vocab size before nlp: 2283
Vocab size after nlp: 2291
Vocab size out of memory zone: 2291
Vocab size before nlp: 2291
Vocab size after nlp: 2300
Vocab size out of memory zone: 2300
Vocab size before nlp: 2300
Vocab size after nlp: 2308
Vocab size out of memory zone: 2308

When trying to modify a MorphAnalysis and then access it, a hash lookup in the StringStore raises an error:

for text in test_texts:
    with nlp.memory_zone():
        doc = nlp(text)
        for token in doc:
            morph_str = str(token.morph)
            if "Definite" in morph_str:
                definite = token.morph.get("Definite")[0]
                new_morph_str = morph_str.replace(definite, "foo")
                token.set_morph(new_morph_str)
                token.morph.get("Definite")

Output:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[24], line 20
     18 new_morph_str = morph_str.replace(definite, "foo")
     19 token.set_morph(new_morph_str)
---> 20 token.morph.get("Definite")
File ~/.venv/lib/python3.11/site-packages/spacy/tokens/morphanalysis.pyx:71, in spacy.tokens.morphanalysis.MorphAnalysis.get()
File ~/.venv/lib/python3.11/site-packages/spacy/strings.pyx:162, in spacy.strings.StringStore.__getitem__()
KeyError: "[E018] Can't retrieve string for hash '6324204924076910789'. This usually refers to an issue with the `Vocab` or `StringStore`."
@hynky1999 Are the Japanese morphological tags open-class, or are they a closed set? I've assumed that the morphology tags are a closed set and can be added to the string-store without problems. Regarding deallocation, the

@lise-brinck Thanks for the example code. I've found a bug in the memory zone handling that causes this. I'll release a patch shortly.
* Fix bug in memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors during memory zones if new labels were added to the model.
* Fix handling of new morphological labels within memory zones. Addresses second issue reported in Memory leak of MorphAnalysis object. #13684
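The difference between transient and non-transient strings mentioned in the fix can be pictured with a small toy model. This is only a sketch and not spaCy's implementation: `ToyStringStore`, `toy_hash`, and the `allow_transient` flag as used here are made up for illustration. It assumes that strings added inside a memory zone are dropped when the zone exits, unless they are explicitly stored as permanent (as a new label would need to be).

```python
import hashlib
from contextlib import contextmanager


def toy_hash(text: str) -> int:
    """Deterministic 64-bit hash (a stand-in, not spaCy's hash function)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Toy hash-keyed string table. NOT spaCy's StringStore."""

    def __init__(self):
        self._permanent = {}  # survives memory zones
        self._transient = {}  # dropped when the zone exits
        self._in_zone = False

    def add(self, text: str, allow_transient: bool = True) -> int:
        key = toy_hash(text)
        if key in self._permanent:
            return key
        if self._in_zone and allow_transient:
            self._transient[key] = text
        else:
            # Store permanently, e.g. for a newly added label.
            self._transient.pop(key, None)
            self._permanent[key] = text
        return key

    def __getitem__(self, key: int) -> str:
        for table in (self._permanent, self._transient):
            if key in table:
                return table[key]
        raise KeyError(f"Can't retrieve string for hash '{key}'")

    def __len__(self) -> int:
        return len(self._permanent) + len(self._transient)

    @contextmanager
    def memory_zone(self):
        self._in_zone = True
        try:
            yield self
        finally:
            # Everything transient is released on exit.
            self._transient.clear()
            self._in_zone = False


store = ToyStringStore()
with store.memory_zone():
    word = store.add("trækfugle")                             # transient
    label = store.add("Definite=foo", allow_transient=False)  # permanent
    size_inside = len(store)
size_after = len(store)

print(size_inside, size_after)  # 2 1
print(store[label])             # Definite=foo
```

In this toy, looking up `store[word]` after the zone has exited raises a `KeyError`, because the hash outlived the string it referred to. The permanent label is still retrievable, but it also means the store is one entry larger after every zone that introduces a new label.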
Hi @honnibal, I expressed myself incorrectly. This results in new tags being allocated here without any deallocation. They would only be deallocated if the self.vocab.morphology object were deleted, but I don't think that ever happens, and certainly not in connection with memory zones. https://github.com/explosion/spaCy/blob/master/spacy/morphology.pyx#L135-L136
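Why the open-class versus closed-set question matters for a table that never frees entries can be shown with a rough sketch. `ToyMorphology` below is a hypothetical stand-in, not spaCy's `Morphology`, and the `Reading=...` values are invented example features. The only assumption is the one described above: each distinct feature string gets an entry that lives as long as the table does.

```python
class ToyMorphology:
    """Hypothetical stand-in for a tag table. NOT spaCy's Morphology."""

    def __init__(self):
        # Normalized feature string -> tag id. Nothing is ever evicted.
        self.tags = {}

    def add(self, features: str) -> int:
        # Sort features so "B=2|A=1" and "A=1|B=2" share one entry.
        normalized = "|".join(sorted(features.split("|")))
        if normalized not in self.tags:
            self.tags[normalized] = len(self.tags)
        return self.tags[normalized]


morphology = ToyMorphology()

# Closed set: the same few analyses repeat, so the table stops growing.
closed_set = [
    "Definite=Def|Number=Sing",
    "Number=Sing|Definite=Def",  # same analysis, different feature order
    "Definite=Ind|Number=Plur",
]
for _ in range(1000):
    for features in closed_set:
        morphology.add(features)
size_closed = len(morphology.tags)

# Open class: every new value adds an entry that is never released.
for i in range(1000):
    morphology.add(f"Reading=value{i}")
size_open = len(morphology.tags)

print(size_closed, size_open)  # 2 1002
```

With a closed tag set the table size is bounded no matter how much text is processed. With open-class values it grows with the number of distinct values seen, which over a long-running process looks like a leak.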
I have encountered a critical bug that makes continuous tokenization with the Japanese tokenizer close to impossible. It is caused by a memory leak of MorphAnalysis objects.
How to reproduce the behaviour
Run this script and observe how memory keeps growing:
It all happens because of this line:
token.morph = MorphAnalysis(self.vocab, morph)
I have checked the implementation itself: no deallocation code is implemented, and it does not support memory_zone.
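One generic way to watch for this kind of growth is to sample traced memory while repeatedly calling the processing function. The harness below is a hypothetical sketch and not the reproduction script referred to above: `sample_memory` and the `leaky` workload are made-up names, and the workload is a stand-in for a real pipeline call. Note that `tracemalloc` only reports memory allocated through Python's memory allocators, so whether it would capture the allocations discussed in this issue is an assumption that would need checking.

```python
import tracemalloc


def sample_memory(process, inputs, every=100):
    """Call process(item) per input; record traced bytes every `every` calls."""
    samples = []
    tracemalloc.start()
    try:
        for count, item in enumerate(inputs, start=1):
            process(item)
            if count % every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


# Hypothetical leaky workload: keeps one new object alive per input.
retained = []


def leaky(text):
    retained.append(bytearray(1024))


samples = sample_memory(leaky, ["text"] * 500, every=100)
print(len(samples))              # 5
print(samples[-1] > samples[0])  # True
```

A flat series of samples suggests the workload releases what it allocates, while a steadily rising one points to something being retained between calls.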