Memory leak of MorphAnalysis object. #13684

Open
hynky1999 opened this issue Nov 4, 2024 · 3 comments

@hynky1999

I have encountered a critical bug that makes continuous tokenization with the Japanese tokenizer close to impossible. It is caused by a memory leak in MorphAnalysis.

How to reproduce the behaviour

import gc
import tracemalloc

import spacy

tracemalloc.start()
tokenizer = spacy.blank("ja")
tokenizer.add_pipe("sentencizer")

text = " ".join(["a"] * 1000)
# Raise the length limit before tokenizing, not after
tokenizer.max_length = len(text) + 10

for _ in range(1000):
    snapshot = tracemalloc.take_snapshot()
    with tokenizer.memory_zone():
        doc = tokenizer(text)
    gc.collect()
    snapshot2 = tracemalloc.take_snapshot()
    # Compare the two snapshots, grouped by source line
    p_stats = snapshot2.compare_to(snapshot, "lineno")
    # Print those of the top 10 differences that grew
    print("[ Top 10 ]")
    # Stop here with pdb
    for stat in p_stats[:10]:
        if stat.size_diff > 0:
            print(stat)

Run this script and observe how memory keeps growing:
[image]
It all happens due to this line:
token.morph = MorphAnalysis(self.vocab, morph). I have checked the implementation itself: no deallocation code is implemented, and it does not support memory_zone.

@lise-brinck
Contributor

lise-brinck commented Nov 15, 2024

We have observed similar issues in our pipeline. As you can see in this minimal example with the da_core_news_md model, the vocab keeps growing:

import spacy

nlp = spacy.load("da_core_news_md")

test_texts = [
    "Varmere vintre: Flere trækfugle forurener søerne",
    "De højere vintertemperaturer giver problemer for landets søer.",
    "Blandt andet fordi flere trækfugle sover på vandet.",
    "I 1980'erne var der omkring 200 grågæs i Danmark om vinteren.",
    "I dag kan der være helt op mod 100.000.",
]

for text in test_texts:
    print("Vocab size before nlp:", len(nlp.vocab))
    with nlp.memory_zone():
        doc = nlp(text)
        print("Vocab size after nlp:", len(nlp.vocab))
    print("Vocab size out of memory zone:", len(nlp.vocab))

Output:

Vocab size before nlp: 2269
Vocab size after nlp: 2275
Vocab size out of memory zone: 2275
Vocab size before nlp: 2275
Vocab size after nlp: 2283
Vocab size out of memory zone: 2283
Vocab size before nlp: 2283
Vocab size after nlp: 2291
Vocab size out of memory zone: 2291
Vocab size before nlp: 2291
Vocab size after nlp: 2300
Vocab size out of memory zone: 2300
Vocab size before nlp: 2300
Vocab size after nlp: 2308
Vocab size out of memory zone: 2308
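
The per-text growth can be read off the logged sizes. A small sketch that recomputes it from the numbers printed above (the list is copied from that output; nothing is re-measured here):

```python
# Vocab sizes copied from the output above: the first "before" value,
# then the "out of memory zone" value after each of the five texts.
sizes = [2269, 2275, 2283, 2291, 2300, 2308]

# Entries added while processing each text that survived the zone exit.
deltas = [after - before for before, after in zip(sizes, sizes[1:])]
print(deltas)       # [6, 8, 8, 9, 8]
print(sum(deltas))  # 39
```

If the memory zone released everything that was added inside it, each of these deltas would be expected to be 0.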

When trying to modify and then access a MorphAnalysis, a hash lookup fails in StringStore:

for text in test_texts:
    with nlp.memory_zone():
        doc = nlp(text)
        for token in doc:
            morph_str = str(token.morph)
            if "Definite" in morph_str:
                definite = token.morph.get("Definite")[0]
                new_morph_str = morph_str.replace(definite, "foo")
                token.set_morph(new_morph_str)
            token.morph.get("Definite")

Output:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[24], line 20
     18     new_morph_str = morph_str.replace(definite, "foo")
     19     token.set_morph(new_morph_str)
---> 20 token.morph.get("Definite")

File ~/.venv/lib/python3.11/site-packages/spacy/tokens/morphanalysis.pyx:71, in spacy.tokens.morphanalysis.MorphAnalysis.get()

File ~/.venv/lib/python3.11/site-packages/spacy/strings.pyx:162, in spacy.strings.StringStore.__getitem__()

KeyError: "[E018] Can't retrieve string for hash '6324204924076910789'. This usually refers to an issue with the `Vocab` or `StringStore`."

@honnibal
Member

@hynky1999 Are the Japanese morphological tags open-class, or are they a closed set? I've assumed that the morphology tags are a closed set and can be added to the string-store without problems.

Regarding deallocation, the MorphAnalysis object doesn't need deallocation code. It's a Python object with a C struct, and the C struct doesn't make any heap allocations. So the memory is freed as normal by Python's reference counting.
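
The reference-counting point can be illustrated without spaCy. This is a minimal sketch that assumes CPython, where an object is freed as soon as its last strong reference goes away. `Analysis` is a hypothetical stand-in class, not spaCy's MorphAnalysis.

```python
import weakref


class Analysis:
    """Hypothetical stand-in for a small wrapper object with no extra heap data."""

    def __init__(self, key):
        self.key = key


obj = Analysis(key=123)
ref = weakref.ref(obj)  # a weak reference does not keep the object alive
alive_before = ref() is not None

del obj  # drop the only strong reference
alive_after = ref() is not None

print(alive_before, alive_after)  # True False on CPython
```

This only shows that the wrapper object itself is reclaimed; it says nothing about entries the object may have caused to be added to other, longer-lived structures.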

@lise-brinck Thanks for the example code. I've found a bug in the memory zone handling that causes this. I'll release a patch shortly.

honnibal added a commit that referenced this issue Dec 11, 2024
* Fix bug in memory-zone code when adding non-transient strings. The error could result in segmentation faults or other memory errors during memory zones if new labels were added to the model.
* Fix handling of new morphological labels within memory zones. Addresses second issue reported in Memory leak of MorphAnalysis object. #13684
@hynky1999
Author

hynky1999 commented Dec 28, 2024

Hi @honnibal, I expressed myself incorrectly.
You are right, the MorphAnalysis object is indeed a struct. The issue is rather with its creation, as it calls self.vocab.morphology.add(features).

This results in allocating new tags without any deallocation here. They would only get deallocated if the self.vocab.morphology object were deleted, but I don't think that ever happens, and certainly not with respect to memory zones.

https://github.com/explosion/spaCy/blob/master/spacy/morphology.pyx#L135-L136
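
The difference between freeing the wrapper object and freeing the table entry it created can be sketched with a toy model. `ToyMorphology` and `ToyAnalysis` below are hypothetical and are not spaCy's implementation. They only mimic the pattern described above, where constructing an analysis adds an entry to a long-lived table. The feature strings are made up for illustration.

```python
class ToyMorphology:
    """Hypothetical long-lived table that interns feature strings."""

    def __init__(self):
        self.ids = {}

    def add(self, features):
        # A new entry is stored only the first time a feature string is seen.
        if features not in self.ids:
            self.ids[features] = len(self.ids)
        return self.ids[features]


class ToyAnalysis:
    """Hypothetical wrapper: holds only a key, but creating it calls add()."""

    def __init__(self, morphology, features):
        self.key = morphology.add(features)


morphology = ToyMorphology()

# Closed set: the same two feature strings repeated many times.
for i in range(1000):
    ToyAnalysis(morphology, "Definite=Def" if i % 2 else "Definite=Ind")
closed_size = len(morphology.ids)

# Open class: every feature string is new, so the table keeps growing
# even though each ToyAnalysis wrapper is discarded immediately.
for i in range(1000):
    ToyAnalysis(morphology, f"Feature=v{i}")
open_size = len(morphology.ids)

print(closed_size, open_size)  # 2 1002
```

In this toy model the table stays bounded when the feature strings form a closed set and grows without limit when they are open-class, which is why the question of whether the Japanese tags are open-class matters for the reported growth.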
