Finetuning coreference model for custom spacy model #12021
Replies: 4 comments 48 replies
-
I see you linked to sample annotations in #11585, so let me link that post here: #11585 (comment) To clarify one thing, there is no such thing as "LitBank format" - LitBank distributes coref annotations in BRAT, CoNLL, and TSV formats. We actually use LitBank CoNLL data in our tests. It looks like what you have is BRAT data. I haven't used it before, but it looks like there's enough information in the BRAT files to convert it to spaCy coref annotations on Docs, or to the CoNLL format (though that is kind of complex). It looks like the T lines are mentions and the R lines are references that connect mentions. Can you clarify how you used coreferee? What you refer to in this part:
Note that …
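For what it's worth, BRAT standoff `.ann` files mark text-bound mentions on T lines (id, label, character offsets, surface text) and relations on R lines. A minimal sketch of parsing those two line types into mention spans and coreference pairs — the `MENTION`/`COREF` labels and the sample offsets here are assumptions for illustration, not your actual annotation scheme:

```python
# Sketch: parse BRAT .ann lines into mentions and coref pairs.
# Assumes T lines look like "T1\tMENTION 4 26\tsection ABC of XYZ Act"
# and R lines like "R1\tCOREF Arg1:T1 Arg2:T2" (label names are assumptions).

def parse_brat(ann_lines):
    mentions = {}   # id -> (start_char, end_char, surface_text)
    pairs = []      # (Arg1 mention id, Arg2 mention id)
    for line in ann_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith("T"):
            # fields[1] is "LABEL start end"; fields[2] is the surface text
            label_and_offsets = fields[1].split()
            start, end = int(label_and_offsets[1]), int(label_and_offsets[2])
            mentions[fields[0]] = (start, end, fields[2])
        elif fields[0].startswith("R"):
            # fields[1] is "LABEL Arg1:Tx Arg2:Ty"
            args = dict(part.split(":") for part in fields[1].split()[1:])
            pairs.append((args["Arg1"], args["Arg2"]))
    return mentions, pairs

lines = [
    "T1\tMENTION 4 26\tsection ABC of XYZ Act",
    "T2\tMENTION 50 72\tsection ABC of the act",
    "R1\tCOREF Arg1:T1 Arg2:T2",
]
mentions, pairs = parse_brat(lines)
```

From there the character offsets can be aligned to token spans on a spaCy `Doc` (e.g. via `doc.char_span`) to build training annotations.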
-
Hi @polm, thanks for replying. Let me try to lay out everything I have done so far toward fine-tuning the coreference model.
How should I proceed with training? I am not sure whether converting the training data to TSV, CoNLL, or the spaCy coref annotation format would help.
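As a sketch of what the target format involves: the antecedent/anaphor pairs from the annotations need to be merged into clusters, since the experimental coref component stores each cluster as a span group under keys like `doc.spans["coref_clusters_1"]`. The mention ids and sample pairs below are placeholders, not real data:

```python
# Sketch: merge antecedent/anaphor pairs into clusters, keyed the way
# spacy-experimental's coref component stores them in doc.spans
# ("coref_clusters_1", "coref_clusters_2", ...). Mention ids here are
# BRAT-style T-ids; offsets would come from the parsed .ann file.

def pairs_to_clusters(pairs):
    parent = {}

    def find(x):
        # Union-find with path compression
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for m in parent:
        clusters.setdefault(find(m), []).append(m)

    return {
        f"coref_clusters_{i}": sorted(members)
        for i, members in enumerate(clusters.values(), start=1)
    }

print(pairs_to_clusters([("T1", "T2"), ("T2", "T5"), ("T3", "T4")]))
```

Each cluster's mention ids would then be resolved to token spans when building the training `Doc` objects.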
-
Hi @polm, this is really unfortunate: after upgrading the Colab GPU I have around 90 GB of RAM, but with PyTorch 1.12, spaCy throws a compatibility error for the high-RAM GPU that gets assigned.
With PyTorch 1.13 the GPU seems compatible, but spacy-experimental won't work with it. Any workarounds or solutions?
-
Hi @polm, I curated 192 small documents out of the big chunk, but I am still getting terrible results, and training for both models ends after 1 epoch. Can you look at the repo? Some interesting observations:
Will the output even get better, given that with 200 sentences it is the same as with 10? I am not sure whether I am doing this correctly. It also takes a lot of time and effort to find raw sentences and annotate them, so doing it for 500 sentences seems like a bad idea to me.
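One thing worth checking when training stops after a single epoch is spaCy's early-stopping behavior, which is controlled by the `[training]` block of the config rather than by the corpus size. A sketch of the relevant keys (the values shown are spaCy's usual defaults, given for illustration, not tuned recommendations):

```ini
[training]
# 0 means no epoch limit; stopping is then driven by patience/max_steps
max_epochs = 0
# Number of evaluation steps with no score improvement before stopping early;
# with a tiny corpus and frequent evaluations this can trigger very quickly
patience = 1600
# Hard cap on optimizer steps
max_steps = 20000
# How often (in steps) the dev set is evaluated
eval_frequency = 200
```

On a very small corpus, one "epoch" may amount to only a handful of steps, so raising `patience` or lowering `eval_frequency` can change when training stops.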
-
Hello everyone,
I trained my spaCy pipeline just for NER and sourced the other pipeline components from the pretrained en_core_web_sm model.
I wanted to implement coreference resolution just for a single entity type detected by my custom spaCy model.
E.g., "The section ABC of XYZ Act states that... Section ABC of the act also proves..."
Here, I wanted the model to tag ["section ABC of XYZ Act", "section ABC of the act"].
I thought of implementing it as follows:
To implement the coreference part, I first came across the coreferee model and annotated a dataset following the LitBank format, but during training the rules written were not tagging anything as true (maybe because of the span.root value).
For example, when training the coreferee model I saw that my span.root value is not what I want:
In LitBank data: "Her father" -> "father"
For my data: "The Copyright Act" -> "act" (but I want "copyright act")
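That behavior is expected: a span's root is the single token whose dependency head falls outside the span, so "The Copyright Act" roots to "Act", never to the multi-token phrase. A toy illustration of that rule, with hand-written head indices standing in for a real parse (the heads below are assumptions, not parser output):

```python
# Toy illustration of how a span "root" is found: the token in the span
# whose dependency head lies outside the span. The head indices mimic a
# parse of "The Copyright Act states that ..."; they are hand-written
# assumptions, not real parser output.

tokens = ["The", "Copyright", "Act", "states", "that"]
heads  = [2, 2, 3, 3, 3]   # "The"->"Act", "Copyright"->"Act", "Act"->"states"

def span_root(start, end, heads):
    """Index of the token in [start, end) whose head is outside the span
    (or which is its own head), i.e. the span's syntactic root."""
    for i in range(start, end):
        if not (start <= heads[i] < end) or heads[i] == i:
            return i
    return start

root = span_root(0, 3, heads)   # span "The Copyright Act"
print(tokens[root])             # -> "Act"
```

Because rule-based systems like coreferee key their logic on this single root token, matching a whole phrase like "copyright act" generally needs span-level matching rather than the root alone.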
Is there a simpler way of fine-tuning the coreference model in spacy-experimental? Or could someone help with training the coreferee model, since I have already annotated data in the LitBank format?
Thanks in advance !