End-to-end neural coref in spaCy #11585

svlandeg · 2022-10-06T14:39:08Z

svlandeg
Oct 6, 2022
Maintainer

EDIT 2023-02-14: As of today, new coref-related discussions should open new threads. This thread was created to contain a burst of threads immediately following release, but it's run its course, and now coref can have its own threads like other components. Thanks to everyone who has contributed to this thread and tried coref.

We've released an end-to-end neural coref component as part of spacy-experimental 0.6.0. This release contains a pretrained pipeline for you to play with:

# Commandline
pip install spacy-experimental==0.6.2
pip install https://github.com/explosion/spacy-experimental/releases/download/v0.6.1/en_coreference_web_trf-3.4.0a2-py3-none-any.whl

# Python
import spacy
nlp = spacy.load("en_coreference_web_trf")
doc = nlp("The cats were startled by the dog as it growled at them.") 
print(doc.spans)

Which will print the coref clusters:

{'coref_clusters_1': [the dog, it], 'coref_clusters_2': [The cats, them]}

FYI: If you're interested in training a coref pipeline yourself, check out this project we've assembled: https://github.com/explosion/projects/tree/v3/experimental/coref. We've also published a blog with many details on this architecture: https://explosion.ai/blog/coref

We'd love for you to try this out, and any feedback is very welcome on this thread!

regstuff · 2022-10-07T13:15:45Z

regstuff
Oct 7, 2022

Hi,
Any suggestions on how I can get the start and end indices of the spans? Trying to figure out how I can actually replace the items in the cluster with the cluster head in the original text. Plain text cluster items isn't all that helpful as there might be a whole lot of 'its' and 'thems' in the original text.
Also, kind of struggling to integrate this with the other models, such as 'en_core_web_sm'
When adding both models into a pipe, it seems only one of them actually works. For eg. getting the pos_ of a token gives me an empty string.

11 replies

polm Oct 27, 2022

@phil-scholarcy Hey, thanks for posting this, but it looks like you deleted your initial post explaining what you were doing. That makes this thread really confusing for anyone coming in later or just reading it on the web.

To be clear, it looks like your code is intended to replace every item in a coreference cluster with the first mention in the cluster, which will usually replace pronouns with their referrents. We're actually including code for exactly that in an upcoming video about the coref component, so you'll see our solution for that soon.

iz2late Nov 7, 2022

@regstuff hi there, did you figure out how to identify the head mention of a cluster?

polm Nov 7, 2022

@iz2late The head mention of a cluster is usually going to just be the first one. If you find that doesn't work well for you, I guess you could check for the first one that is a named entity, and fall back to just the first one.

jaskhalsa Nov 22, 2022

Hi @polm to your point with the above, how should I go about combining this with the ner component? I have tried to do nlp.add_pipe('ner') but it throws an error when I actually do nlp(doc).

adrianeboyd Nov 24, 2022

nlp.add_pipe("ner") adds a new uninitialize, untrained ner component rather than an existing one. For existing components, you need to "source" them from an existing pipeline instead to get the full trained component: https://spacy.io/usage/processing-pipelines#sourced-components

I'd suggest a slightly modified version of Paul's code above for non-trf base models, which doesn't duplicate the underlying transformer model:

import spacy
nlp = spacy.load("en_core_web_sm")
assert "transformer" not in nlp.pipe_names
nlp_coref = spacy.load("en_coreference_web_trf")
nlp.add_pipe("transformer", source=nlp_coref)
nlp.add_pipe("coref", source=nlp_coref)
nlp.add_pipe("span_resolver", source=nlp_coref)
nlp.add_pipe("span_cleaner", source=nlp_coref)

Combining components from multiple pipelines can be tricky and Paul is working on general-purpose scripts to make this easier.

VaibhaveS · 2022-10-09T07:58:46Z

VaibhaveS
Oct 9, 2022

Getting this error.

RegistryError: [E893] Could not find function 'spacy-experimental.Coref.v1' in function registry 'architectures'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

3 replies

svlandeg Oct 9, 2022
Maintainer Author

Hi @VaibhaveS! That sounds like the installation of spacy-experimental didn't go through properly. Can you paste the commands you used to install this (preferably in a clean virtual environment)?

VaibhaveS Oct 9, 2022

Hey @svlandeg, I used python -m pip install spacy-experimental

svlandeg Oct 9, 2022
Maintainer Author

Can you copy/paste the output of the commandline when you run this command? Can you verify that indeed version 0.6.0 has been installed?

I'm asking because if you already had an older version of spacy-experimental, it will not have been updated. In that case, you want to run specifically

python -m pip install spacy-experimental==0.6.0

nickchomey · 2022-10-27T02:47:32Z

nickchomey
Oct 27, 2022

Thanks very much for this! It is very cool!

Do you want feedback for when it fails to properly identify coref spans? I've noticed a variety of errors with the sample texts from this page.

https://www.netowl.com/what-is-coreference-resolution

10 replies

nickchomey Oct 27, 2022

@phil-scholarcy Nice one! Though, I see that it doesn't handle other apostrophes, like I'm

As you said, I think it would be better if this could make use of the POS tagging. I wonder if lemmatization could be at all useful here - maybe after the coref detection but before doing the coref replacements?

@polm Perhaps a coref replacement function could/should be officially done by the spaCy team? I think it would be something that many (most?) people would make use of, and surely you guys are able to do a far better job of it than our adhoc and redundnant efforts...

phil-scholarcy Oct 27, 2022

@phil-scholarcy Nice one! Though, I see that it doesn't handle other apostrophes, like I'm

Agree, needs a complete solution, to include I'm, I'll, we'll, we've, she'll, he'll, he's, she's ... and more.

It would be great to have an official replacement function in spaCy - happy to contribute to it.

@polm such a replacement is useful for many downstream tasks, for example, extractive summarisation after coreference has been resolved.

regstuff Oct 28, 2022

It's fine to continue this here, it would also be OK (and maybe easier to follow) to open a new thread. Please do not post in #3052, that would just be hard to follow.

In your example, the output looks like it's consistent with OntoNotes conventions to me.

If we resolve both the spans with their proper nouns, the sentence grammar goes off. Is this kind of resolution intentional?

When you say "resolve the spans", you mean rewriting the sentence? That's come up several times in this thread, but resolving coreference in order to rewrite spans is not the intended use of coreference in OntoNotes - it's more a layer of information you can integrate in other ways. Can you give me some more perspective on how you intend to use the rewriting and why that's helpful?

Yup I mean rewriting the sentence. That improves results quite a bit for me in Question-Answering and Topic Modelling. My primary use for Spacy's Coref pipeline is to rewrite the sentence.

polm Oct 28, 2022

Thanks for the iteration on the code and the examples of how you would use, I think I understand your use case better now.

We are going to have a version of code that does this released with an upcoming intro video for coref. However, it's still going to be relatively simple, and not a core library feature. In complex cases like overlapping references, it can be hard to figure out what the right way to rewrite things is in the first place, even if you were to do it manually. There are other problems too - in particular the current component is highly general, and it's not designed to work only on English, or only for certain kinds of coreference.

For example, in the OntoNotes dataset we used for training the sample pipeline, it's possible to have nouns resolve to verbs, in a sentence like "I thought [going to the mall] was [the best thing] to do". In that case it doesn't seem possible to rewrite the sentence meaningfully at all. You could restrict rewriting to only involve nouns, but the current model actually doesn't even assume POS information is available in the first place.

Separately, once you get into issues like resolving inflection to make the sentence read correctly, that's really more an NLG problem, which is out of scope for spaCy.

While developing this, our typical imagined used case is using coreference information in, say, information extraction, where the information about what spans are the same item is helpful for building representations, but there's no expectation that those representations would need to be readable. I think making them readable sounds like an interesting problem, but relatively involved. If you'd like to make a package focused on getting that right, that'd be a great addition to the Universe.

polm Nov 8, 2022

Here's the video with some example code to do rewriting:

https://www.youtube.com/watch?v=fio3BejnRsM

And here's the example code in a gist:

https://gist.github.com/thomashacker/b5dd6042c092e0a22c2b9243a64a2466

This is similar to the code already in this thread in that it doesn't handle every case, particularly nested coreference or non-noun coreferences, but it should be a good start.

BenGriffithsPEP · 2022-10-28T09:13:36Z

BenGriffithsPEP
Oct 28, 2022

Loving the new coref component, can't wait for the accompanying video.

I had a question about coref resolution in general, not sure if this is the best place for it. I've been trying to label up some examples in prodigy and am not sure how best to resolve these.

"The doctor and nurse saw the patient. They did a great job." Should you assign a reference from 'doctor' to 'they' and 'nurse' to 'they' or would you assign 'doctor and nurse' to they?
"The doctor and nurse saw the patient. They did a great job. But the doctor was rude and he didn't say bye." Extending out the first example, would you still assign 'doctor and nurse' to they and then assign the mention of 'doctor' by himself to 'he'?
"The doctor and nurse saw the patient. They did a great job. But the doctor was rude and he didn't say bye. But the doctor did have a great shirt, I wonder where he bought it?" Finally, extended out the example once more. Would you link the first mention of 'doctor' to 'he', then the second mention of 'doctor' to the first mention of doctor, and the final mention of 'he' back to the first mention of 'doctor'? Basically, does every proceeding coreference, link back to the first mention of the coreferee? Also, in this example, which coreference in the cluster is the head and which is the child? Would the first mention of doctor be the head and the he's and the second 'doctor' be the children? This would be useful to know for labeling data in prodigy. Thanks!

3 replies

polm Oct 28, 2022

This seems to be a followup to your question on the Prodigy forums?

To repeat some of the info there:

Some coreference systems care about directed links between mentions, but the new coref model only models clusters, so the question of whether A links to B or B to A is irrelevant - only the resulting clusters matter. This is also true of many coreference systems - parent/child relationships between mentions simply aren't present in the OntoNotes data we used for training this, or other popular coreference datasets like LitBank. So I would recommend you just don't worry about the order of these too much.

About how to annotate a specific span - there is no one universal standard. In the context of Ontonotes, if you have "The doctor and the nurse said they ...", then "The doctor and the nurse" is one mention that corefers with "they". If you have other mentions that refer to only one of them, then those are annotated with the corresponding spans ("The doctor" or "the nurse"). This does seem like a case that would be annotated the same in most systems.

However, note that, as mentioned in the coref blog post, split antecedents can't be recovered. Specifically, because "nurse" is the head of both "the nurse" and "the doctor and the nurse", there's no way for our system to differentiate references to those two spans. So while you will still get some coreferences if you give that sentence as input, the results may be inconsistent or mixed up for that particular pattern.

BenGriffithsPEP Oct 28, 2022

Yes, this is a follow up question to my prodigy question, Ryan recommend that I move here for more coref specific questions.

Thanks for hammering home the point on my A to B or B to A question. I understand now, I just wanted clarity before going ahead and labeling lots of data.

The span annotation of 'the doctor and nurse' makes sense too. After testing this example with the new model it resolved as I thought it should.

Regarding the issue with "nurse" being the head of "the doctor and the nurse" and "the nurse", so far the model has been working really well but it'll be interesting to see what happens when I test it on some of our more challenging data.

I just have 1 last question to clarify my understanding of how each reference in a cluster relate and how this impacts labeling. When labeling coreferences in the sentence below, is it unimportant if I were to link 'the doctor' to 'he', he' to 'the doc', then 'the doc' back to 'the doctor' and finally 'the doctor' to the final 'he' just as long as they are all linked with a 'coref' label in some way? Ryan said in my prodigy question 'You can think of this as undirected graph' so as I understand it, as long as they are all 'connected' they contribute to the cluster.

"The doctor was rude and he didn't say bye. But the doc did have a great shirt, I wonder where he bought it?"

polm Oct 28, 2022

When labeling coreferences in the sentence below, is it unimportant if I were to link ...

The current model doesn't have any concept of parent/child links in coreferences in input or output data at all. So only the final groupings of mentions, not the individual links or their directions, matter.

A simple way to convert your links to groupings is to group all mentions with their furthest ancestor (by following all "parent" links until you can't).

itssimon · 2022-10-31T11:36:44Z

itssimon
Oct 31, 2022

I've been testing out this new coref component with the view to replace neuralcoref in our production pipeline and finally update to spaCy v3 (yay!). The results look great so far, seems to be a big step up in accuracy for our health care use case. Awesome work guys!!

I do notice that running this component on CPU is a bit slower than what we were able to achieve with neuralcoref, which also wasn't as accurate of course. So I wonder if we could retrain the coref model using distilroberta-base rather than roberta-base as the Transformer model, which I hope could run a bit faster, while hopefully not sacrificing too much accuracy.

With the provided example project (thanks!), this should hopefully be relatively straigtforward. Apart from changing the model name in this line, is there anything else that would need to be considered/updated for this to work?

8 replies

regstuff Nov 2, 2022

@polm I noticed coref resolution can take a minute or two on longer texts of about 2000 words. Is it possible to use a GPU to speed up the process?

polm Nov 2, 2022

If you are not already using a GPU then yes, a GPU should make it much faster. Since the pretrained coref model uses transformers, while it can run on CPU, it will be very slow.

itssimon Nov 2, 2022

That seems to work well, thanks so much!

Unfortunately I'm not having as much luck getting the training to work though, hitting a PyTorch RuntimeError. I've created an issue for it here.

itssimon Nov 7, 2022

Downgrading PyTorch to v1.12.1 has resolved the RuntimeError described above and training now actually starts (yay!), however I'm getting a RuntimeError: CUDA out of memory. after training for a few minutes ☹️ Been running this on a single Tesla V100 GPU so far. Which GPU(s) have you guys used for training? Any tips on resolving this?

polm Nov 7, 2022

This model was trained on an RTX 3090 with 24GB of GPU RAM. The model is pretty memory hungry, but you should be able to reduce memory usage significantly by tweaking the hyperparameters. It sounds like you're having issues with the clustering component, so the main parameters to tweak would be the hidden size and antecedent limit.

We haven't tried this, but you could also look at smaller transformers - I think DistillBERT is one that comes up often.

itssimon · 2022-11-07T06:34:11Z

itssimon
Nov 7, 2022

I'm experiencing some issues with extra tokens in the spans created by the coref pipeline.

>>> import spacy
>>> nlp = spacy.load("en_coreference_web_trf")
>>> doc = nlp("John Smith called from New York, he says it's raining in the city.")
>>> doc.spans
{'coref_clusters_1': [John Smith called, he says], 'coref_clusters_2': [New York,, the city.]}

The verbs "called", "says" as well as the punctuation at the end of "New York," and "the city." shouldn't be included in the spans.

Interestingly, this only happens in new conda environments, while a previous installation in a different conda environment does not include the extra tokens. All relevant versions match though, I cannot figure out what the difference is. Will keep digging, but thought maybe someone here has an idea? @polm?

1 reply

itssimon Nov 7, 2022

Found it, looks like when installing the en_coreference_web_trf model from the spacy-experimental v0.6.0 release assets on GitHub it still installs v0.6.1 as a dependency (it probably shouldn't) and this bug must've been introduced in that latest version. I've created an issue for it: #11759.

BenGriffithsPEP · 2022-11-09T12:53:03Z

BenGriffithsPEP
Nov 9, 2022

I'm trying to create a dataset to test the coref predictions and I have a question about spacy_experimental.coref.coref_scorer . I've invented some simple coreference examples, annotated them using the coref recipe in prodigy and converted those annotations to the same format as the output from "en_coreference_web_trf". Then I'm assigning this doc (gold_doc) and the doc output from the coreference model (pred_doc) to a collection of Examples to score using the score_coref_clusters function from spacy_experimental.coref.coref_scorer.

My question is, what happens if the number of labelled clusters is not the same as the number of predicted clusters? For example, there are 6 clusters in the gold_doc and 5 clusters in the pred_doc, where the extra cluster is the second antecedent in the doc (coref_clusters_2) and the remaining clusters, although may be correct, are no longer aligned? Does the link-based entity-aware evaluation metric (LEA) account for this or are there some other functions in spacy_experimental.coref I should be using to align the gold and predicted clusters?

import spacy
from spacy.tokens import SpanGroup
from spacy.training import Example
from spacy_experimental.coref.coref_scorer import *
from prodigy.components.loaders import JSONL

def create_labeled_coref_clusters(labelled_doc, spacy_doc):
    coref_cluster_pool = {}
    for rel in labelled_doc['relations']:
        coref_id = 'coref_id_' + str(rel['child'])
        if coref_id not in coref_cluster_pool:
            coref_cluster_pool[coref_id] = {'antecedent_idx': rel['child_span']}
            coref_cluster_pool[coref_id]['mention_idx'] = [rel['head_span']]
        else:
            coref_cluster_pool[coref_id]['mention_idx'].append(rel['head_span'])

    coref_clusters = {}
    for i, cluster in enumerate(coref_cluster_pool):
        # Create cluster
        cluster = coref_cluster_pool[cluster]
        coref_clusters[f'coref_clusters_{i+1}'] = SpanGroup(spacy_doc, name=f'coref_clusters_{i+1}')

        # Assign antecedent
        antecedent_idx = cluster['antecedent_idx']
        antecedent = spacy_doc[antecedent_idx['token_start']:antecedent_idx['token_end']+1]
        coref_clusters[f'coref_clusters_{i+1}'].append(antecedent)

        # Assign mentions
        for mention_idx in cluster['mention_idx']:
            mention = spacy_doc[mention_idx['token_start']:mention_idx['token_end']+1]
            coref_clusters[f'coref_clusters_{i+1}'].append(mention)
    return coref_clusters

nlp_coref = spacy.load("en_coreference_web_trf")
nlp_sm = spacy.load("en_core_web_sm")

labelled_docs = list(JSONL(filename))

ex = []
gold_docs = []
pred_docs = []

for labelled_doc in labelled_docs:
    gold_doc = nlp_sm(labelled_doc['text'])
    labelled_clusters = create_labeled_coref_clusters(labelled_doc, gold_doc)
    # Add clusters to document span
    for cluster_name in labelled_clusters:
        gold_doc.spans[cluster_name] = labelled_clusters[cluster_name]
    pred_doc = nlp_coref(gold_doc.text)
    ex.append(Example(predicted=pred_doc, reference=gold_doc))
    gold_docs.append(gold_doc)
    pred_docs.append(pred_doc)

score_coref_clusters(ex)

{'coref_f': 0.720164609053498,
 'coref_p': 0.6862745098039216,
 'coref_r': 0.7575757575757576}

1 reply

polm Nov 10, 2022

LEA just uses logical equivalence, so cluster order shouldn't matter. You can check this by re-ordering the clusters in your test data.

Tanmay98 · 2022-12-23T11:40:43Z

Tanmay98
Dec 23, 2022

Hello everyone,
How can I finetune the coreference model for my own coref annotations, my annotations are annotated according to the litbank data format. I want to include coref pipeline into my NER spacy model.

Link to sample annotations: https://drive.google.com/drive/folders/1WzRogtvg81TMCHmVR0Kw4iqrbVWCFgO7?usp=share_link

1 reply

polm Dec 26, 2022

Just a note that this discussion is followed up on in #12021.

sohail-jogo · 2023-01-11T16:02:34Z

sohail-jogo
Jan 11, 2023

Hey guys! Thanks for this awesome work! Its really great.
I just wanted to know if something similar is possible for this as well.

https://github.com/huggingface/neuralcoref#using-the-conversion-dictionary-parameter-to-help-resolve-rare-words

1 reply

polm Jan 12, 2023

No, we don't have the conversion dictionary feature. I guess you could hack it in by modifying the tok2vecs, though it might be a little involved.

sdspieg · 2023-01-19T01:45:21Z

sdspieg
Jan 19, 2023

Is it possible for mere mortals to apply this component 'out-of-the-box' to a 'text' column in a df and to then put the coref-resolved text in a 'text-coref' column? If so, would somebody please be so kind as to provide some ipynb for this? I guess that for those of you 'in the know', this would only take a few minutes. I, on the other hand, have just spent over an hour trying to figure this out with the help of all of our new best friend chatGPT, who is usually great(ish) at this stuff, but who clearly just doesn't know enough about about it (yet). And even feeding it all of the documentation didn't do the trick. Please? Thanks much...

1 reply

BenGriffithsPEP Jan 19, 2023

Try this. I've used the function suggested in the coref youtube video, but for more control, you might want to alter it to better define the resolution of the coreferences.

import spacy
from spacy.tokens import Doc
import pandas as pd

# Define lightweight function for resolving references in text
def resolve_references(doc: Doc) -> str:
    """Function for resolving references with the coref ouput
    doc (Doc): The Doc object processed by the coref pipeline
    RETURNS (str): The Doc string with resolved references
    """
    # token.idx : token.text
    token_mention_mapper = {}
    output_string = ""
    clusters = [
        val for key, val in doc.spans.items() if key.startswith("coref_cluster")
    ]

    # Iterate through every found cluster
    for cluster in clusters:
        first_mention = cluster[0]
        # Iterate through every other span in the cluster
        for mention_span in list(cluster)[1:]:
            # Set first_mention as value for the first token in mention_span in the token_mention_mapper
            token_mention_mapper[mention_span[0].idx] = first_mention.text + mention_span[0].whitespace_
            
            for token in mention_span[1:]:
                # Set empty string for all the other tokens in mention_span
                token_mention_mapper[token.idx] = ""

    # Iterate through every token in the Doc
    for token in doc:
        # Check if token exists in token_mention_mapper
        if token.idx in token_mention_mapper:
            output_string += token_mention_mapper[token.idx]
        # Else add original token text
        else:
            output_string += token.text + token.whitespace_

    return output_string

nlp = spacy.load("en_coreference_web_trf")

text = ["Philip plays the bass because he loves it.",
"Sam thanked the doctor for helping him.",
"Tina drover the car to the shops because they were about to close."]

df = pd.DataFrame(text, columns=['text'])

df['text-coref'] = [resolve_references(coref_doc) for coref_doc in nlp.pipe(df['text'])]

for txt in df['text-coref']:
    print(txt)

AnishDg13 · 2023-01-27T06:40:59Z

AnishDg13
Jan 27, 2023

Hi,
I've been working on coreference resolution as part of a project and was wondering if the en_coreference_web_trf model addresses plural pronouns ?
For a text like, "Sam and Monica are here. Both of them arrived late",
'them' is resolved to only Sam

Is there any way to solve this ?

1 reply

polm Jan 27, 2023

We mentioned it briefly in the coref blog post, but that's a split antecedent and it's a hard problem in coref generally. With the spacy-experimental coref model itself there's not really much you can do because the actual coreference is operating on single words.

If you run into it frequently, what you can do as a workaround is check if you have a plural pronoun ("them") and if one of the referrents is in a coordinating structure (like "X and Y" or "X or Y"). In that case you can modify the coref spans to include the entire coordination. You could do a literal token match, but I would recommend using the dependency parse to look for conj relations.

Do note that the problem is not that the pronoun is plural, but that it refers to multiple distinct antecedents, so something like "I saw a bunch of people on the beach, they were having fun" should be handled correctly, normal accuracy issues allowing.

yofayed · 2023-02-10T19:52:53Z

yofayed
Feb 10, 2023

Hello Everyone,

Not sure if I should have created a new discussion instead of adding to this one!
I know just enough about NLP and machine learning to get myself in trouble :)

It might be easier to state the problem I am trying to solve by first showing you what I want to extract from the text. Here is a manually crafted example of a JSON output.

{
  "text": "Johnny Bravo is from Italy. Johnny is the one-man army and he works as an actor with Warner Bros (WB). WB is based in the United States.",
  "resolved-relations": [
    {
      "entity-type": "person",
      "primary-mention": "Johnny Bravo",
      "secondary-mentions": ["Johnny", "he"],
      "attributes": {"citizenship": "Italy", "occupation": "actor"}
    },
    {
      "entity-type": "Organization",
      "primary-mention": "Warner Bros",
      "secondary-mentions": ["WB","WB"],
      "attributes": {"location": "United States"}
    }
  ]
}

Here is what I thought of doing to get the extracted results above:

Extract named entities like Person, Org, Occupation, Money, etc.
Resolve the mentions of only Person and Org entities to a primary one
link all other named entities like Occupation, Money, etc. to the primary-mention Person/Org

What have I done so far:

Created a spaCy custom NER model with my own data and that's working perfectly!
Tried to use the projects/tutorials/rel_component to solve both the coref and relation problems in item 2 and 3 above. I used relation types coref-pri-sec, coref-sec-sec to label those coref relations and other types to label relations from these corefs to the other NER extracted types. The model did well on names that it has been trained on, but whenever I tested new names it would not perform well. I am not sure if that's because it's a simple classification layer and needs another layer or two to make it more robust, or if it just needs more training data.

Here is a sample annotation and an example output for this same example text above from the rel_component model that was trained with a small annotated sample.

Labeled for training

output from model testing

From-Entity                  Relation               To-Entity             Score
==================================================================================
Johnny Bravo (1)     ----->  coref_pri_sec  ----->  Johnny (3)            (99.93%)
Johnny Bravo (1)     ----->  citizenship    ----->  Italy (2)             (99.92%)
Johnny (3)           ----->  employed_by    ----->  Warner Brothers (5)   (99.54%)
Johnny (3)           ----->  occupation     ----->  actor (4)             (99.99%)
Warner Brothers (5)  ----->  org_alias      ----->  WB (6)                (99.99%)
Warner Brothers (5)  ----->  org_alias      ----->  WB (7)                (99.98%)
Warner Brothers (5)  ----->  located_in_at  ----->  United States (8)     (99.43%)

Of course, the model does well on this sample because it's a training one, but if I change the name of the person or org even if I keep the same context the results are not very good.

If this single model to solve both will work but needs more layers, then what do you think I should add to it? do you have a more complex example of the relation model? or is this not the right way to solve both coref and the relation problems? or should I use 2 separate models as mentioned below?

Now, I am trying to test with two models instead by creating a model to solve the coref problem using projects/experimental/coref and another one to extract the relations between the coref model results and the NER results. I still think that the relation model that is based on the rel_component tutorial needs more layers to effectively train and predict accurate results for this particular problem. I am not sure though what to add the rel_component model to be able to make it take the input of the coref model.

I really appreciate any help you can provide me to point me in the right direction with this challenge. Thanks!

2 replies

polm Feb 13, 2023

This is a pretty complicated application, but I would definitely expect better results using separate coref and relation extraction models, rather than trying to cover everything in a single relation extraction model.

The relation extraction model we provide is a demo designed to show how to make your own component, so it's not designed to be state of the art. That said, I don't have a lot of advice about improving it - you'd need to look at research or other models. The solution is unlikely to be as simple as adding more layers.

I am not sure though what to add the rel_component model to be able to make it take the input of the coref model.

The coref and relation extraction models shouldn't have to be aware of each other. You should be able to take what the relation extraction model finds about each mention in the text, and then combine the information about mentions that are put in the same cluster by the coref model.

yofayed Feb 13, 2023

This pretty much confirms my conclusion on this effort. Thanks!

sudarshansivakumar · 2023-02-13T07:38:49Z

sudarshansivakumar
Feb 13, 2023

Hello, I'm using the pretrained coreference resolution model using the following steps

nlp = spacy.load("en_coreference_web_trf")
doc = nlp(example_text)
print(doc.spans)

Now I'm pretty happy with the predictions of the model, however for my use case I want to only use those coreferences which have a very strong confidence. From the release blogpost I can see that this kind of scoring is in fact under the hood. However, I am not sure how to access these scores using the method above?
I also tried initializing an object of the CoreferenceResolver class using the steps mentioned in the docs here. However when I use this for predictions, the clusters returned are of really poor quality.
Would appreciate help with this, thanks!

cc : @polm

2 replies

polm Feb 13, 2023

Please do not @ individual maintainers.

About confidence. The way the model picks antecedents for each token is that it calculates scores for candidates and then uses softmax and picks the best one, and the null antecedent is hard-coded to have a score of 0. These confidence scores are not exposed, but if you wanted to add a threshold that worked on them it probably wouldn't require too much modification to the code - the max element is picked here, for example (though that's the post-softmax value).

I am not sure a threshold like that would be useful though - the main thing is that it's very much a comparative score with an unstable set of alternatives, so I would expect a high score to correlate with lack of other good candidates as much as correctness, and it might be noisy. Also in the frequent case where you have multiple candidates and any of them would be correct, it would be unsurprising for the scores to be lower for any individual candidate. You could definitely try it though.

Regarding the CoreferenceResolver you initialized - if you initialize a component like that, it has randomly initialized weights since it hasn't been trained, so it will just make random predictions. It's not equivalent to a trained component.

polm Feb 14, 2023

@sudarshansivakumar Sorry this happened in the middle of your question, but this thread has been locked in favor of opening new individual threads for coref-related questions. If you'd like to follow up on your question, please feel free to open a new Discussion and refer back here with a link.

yofayed · 2023-02-14T02:37:53Z

yofayed
Feb 14, 2023

Hello, what could be causing this RuntimeError? Thanks

# python3 -m spacy train configs/cluster.cfg -g 0 --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy -o training/cluster --training.max_epochs 20
ℹ Saving to output directory: training/cluster
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2023-02-14 02:32:51,018] [INFO] Set up nlp object from config
[2023-02-14 02:32:51,030] [INFO] Pipeline: ['transformer', 'coref']
[2023-02-14 02:32:51,035] [INFO] Created vocabulary
[2023-02-14 02:32:51,036] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib64/python3.7/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib64/python3.7/site-packages/spacy/cli/_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/typer/core.py", line 785, in main
    **extra,
  File "/usr/local/lib/python3.7/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib64/python3.7/site-packages/spacy/cli/train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "/usr/local/lib64/python3.7/site-packages/spacy/cli/train.py", line 72, in train
    nlp = init_nlp(config, use_gpu=use_gpu)
  File "/usr/local/lib64/python3.7/site-packages/spacy/training/initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "/usr/local/lib64/python3.7/site-packages/spacy/language.py", line 1323, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "/usr/local/lib64/python3.7/site-packages/spacy_experimental/coref/coref_component.py", line 365, in initialize
    self.model.initialize(X=X, Y=Y)
  File "/usr/local/lib64/python3.7/site-packages/thinc/model.py", line 299, in initialize
    self.init(self, X=X, Y=Y)
  File "/usr/local/lib64/python3.7/site-packages/thinc/layers/chain.py", line 92, in init
    curr_input = layer.predict(curr_input)
  File "/usr/local/lib64/python3.7/site-packages/thinc/model.py", line 315, in predict
    return self._func(self, X, is_train=False)[0]
  File "/usr/local/lib64/python3.7/site-packages/spacy_experimental/coref/coref_model.py", line 85, in coref_forward
    return model.layers[0](X, is_train)
  File "/usr/local/lib64/python3.7/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/usr/local/lib64/python3.7/site-packages/thinc/layers/pytorchwrapper.py", line 219, in forward
    Ytorch, torch_backprop = model.shims[0](Xtorch, is_train)
  File "/usr/local/lib64/python3.7/site-packages/thinc/shims/pytorch.py", line 92, in __call__
    return self.predict(inputs), lambda a: ...
  File "/usr/local/lib64/python3.7/site-packages/thinc/shims/pytorch.py", line 110, in predict
    outputs = self._model(*inputs.args, **inputs.kwargs)
  File "/usr/local/lib64/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib64/python3.7/site-packages/spacy_experimental/coref/pytorch_coref_model.py", line 77, in forward
    pairwise = self.pairwise(top_indices)
  File "/usr/local/lib64/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib64/python3.7/site-packages/spacy_experimental/coref/pytorch_coref_model.py", line 269, in forward
    distance = (word_ids.unsqueeze(1) - word_ids[top_indices]).clamp_min_(min=1)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

7 replies

yofayed Feb 14, 2023

Should I be concerned with this sequence length notice in training? thanks

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'coref']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS COREF  COREF_F  COREF_P  COREF_R  SCORE
---  ------  -------------  ----------  -------  -------  -------  ------
  0       0          39.11      197.38     0.37     0.19    63.76    0.00
Token indices sequence length is longer than the specified maximum sequence length for this model (514 > 512). Running this sequence through the model will result in indexing errors

polm Feb 14, 2023

No, that's not an issue. This is mentioned in the sample project docs.

If I recall correctly, the error comes from code we don't control, and spacy-transformers has code to prevent the issue it describes by segmenting input. But if it happens a lot (like hundreds of times, not just a few), that can indicate something weird, so that's another reason we can't quite get rid of it.

yofayed Feb 14, 2023

I am glad I asked, I was going to start debugging the issue. Thanks!

yofayed Feb 14, 2023

The training actually stopped after this message!

⚠ Aborting and saving the final best model. Encountered exception:
OutOfMemoryError('CUDA out of memory. Tried to allocate 454.00 MiB (GPU 0; 14.61
GiB total capacity; 12.35 GiB already allocated; 361.19 MiB free; 12.97 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation.  See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF')

polm Feb 14, 2023

This model does take a lot of memory to train. We trained on a GTX 3090 with 24GB of RAM. 15GB is going to be kind of tough, but you might be able to make that work by reducing parameters like the hidden size.

polm · 2023-02-14T08:39:37Z

polm
Feb 14, 2023

Howdy everyone, thanks for posting in this thread. This was originally created to contain a burst of posts immediately following the release of coref, but since some time has passed, that's no longer necessary, and going forward coref related discussions can have their own threads, like any other component. This should make it easier to group discussions by specific issue and keep them searchable for the future.

If you need to refer to anything in this thread, feel free to link to it from a new Discussion.

0 replies

End-to-end neural coref in spaCy #11585

svlandeg Oct 6, 2022 Maintainer

Replies: 15 comments · 52 replies

svlandeg Oct 9, 2022 Maintainer Author

svlandeg Oct 9, 2022 Maintainer Author

svlandeg
Oct 6, 2022
Maintainer

Replies: 15 comments 52 replies

svlandeg Oct 9, 2022
Maintainer Author

svlandeg Oct 9, 2022
Maintainer Author