
added .pipe() method to spaCy integration #16

Merged
merged 15 commits into tomaarsen:main on Aug 24, 2023

Conversation

davidberenstein1957
Contributor

#15

@tomaarsen
Owner

tomaarsen commented Jun 27, 2023

Do we need __call__ to be implemented if pipe is defined?

Also, I'm not entirely sure where to apply self.batch_size. If the batch size is 128, the inputs will be 128 sentences, but the model will internally likely expand that into something like 150 samples, which requires two forward passes. So, in the end, to process e.g. 200 samples with a batch_size of 128, this approach would do 3 forward passes when only 2 are likely needed.

This could be reduced to 2 if we remove the minibatch looping and just take all inputs in one go, but then this pipe isn't a great "generator", as it will fully convert the input generator into a list before it even starts doing any predictions. In the end, I do think that this is the most efficient approach, though.

I'd love to hear your thoughts on this.
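For reference, a minimal sketch of the component shape being discussed, where __call__ simply delegates to pipe and pipe batches with spaCy's minibatch utility. The class and the _predict_batch helper are hypothetical illustrations, not the actual SpanMarker integration:

from spacy.util import minibatch

class SpanMarkerComponentSketch:
    """Hypothetical component shape; not the actual SpanMarker integration."""

    def __init__(self, model, batch_size: int = 4):
        self.model = model  # assumed: a loaded SpanMarker model
        self.batch_size = batch_size

    def __call__(self, doc):
        # spaCy invokes __call__ for single Docs; delegating to pipe keeps
        # the batching logic in one place.
        return next(iter(self.pipe([doc])))

    def pipe(self, docs, batch_size=None):
        # Loop over minibatches of Docs and yield predictions lazily.
        for batch in minibatch(docs, size=batch_size or self.batch_size):
            yield from self._predict_batch(batch)

    def _predict_batch(self, docs):
        # Placeholder: the real component would run the SpanMarker model here
        # and write the predicted entity spans onto each Doc.
        return docs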

@davidberenstein1957
Contributor Author

I will also tackle this in #17.

@tomaarsen
Owner

#17 also requires changes here:

# Remove the existing NER component, if it exists,
# to allow for SpanMarker to act as a drop-in replacement
try:
nlp.remove_pipe("ner")
except ValueError:
# The `ner` pipeline component was not found
pass
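
For context, that removal would sit next to adding the SpanMarker component. A rough sketch of what the calling code could look like; the "span_marker" factory name and the config key are assumptions, not necessarily the exact registered API:

import spacy

nlp = spacy.load("en_core_web_sm")

# Remove the existing NER component, if it exists,
# to allow for SpanMarker to act as a drop-in replacement
try:
    nlp.remove_pipe("ner")
except ValueError:
    # The `ner` pipeline component was not found
    pass

# Assumed factory name and config key; the actual registration may differ.
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

doc = nlp("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic.")
print(doc.ents)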

@davidberenstein1957
Contributor Author

@tomaarsen, do you have any indication of SpanMarker's memory usage? A batch_size of 4 seems quite low in most settings, and I think that, especially for a single call at the sentence level, setting the batch size to n_sentences will mostly be fine. Also, I am not sure whether it makes sense to call SpanMarker on CPU, which would also allow for higher batch sizes and faster inference. Given your expertise, you can probably give some more grounded pointers.

@tomaarsen
Owner

That's a great question. I put the batch size at 4 by default to be on the safe side, but I'll try the following: Load a ...-large SpanMarker model, forcibly limit the VRAM that torch can use to maybe 2 or 4GB, and then find the highest batch size that still runs. As for CPU, I think these are really slow on the larger models, but likely feasible for e.g. a BERT-tiny model.
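
A rough sketch of how that experiment could be set up, assuming torch's per-process memory fraction cap and an example OntoNotes RoBERTa-large model; treat the exact numbers and the predict call details as approximate:

import torch
from span_marker import SpanMarkerModel

# Cap this process to roughly 4GB of VRAM (assuming a 16GB card), then probe batch sizes.
torch.cuda.set_per_process_memory_fraction(4 / 16, device=0)

model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-roberta-large-ontonotes5"  # example "...-large" model
).cuda()

sentences = ["Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic."] * 512

for batch_size in (8, 16, 32, 64, 128):
    try:
        model.predict(sentences, batch_size=batch_size)
        print(f"batch_size={batch_size}: OK")
    except torch.cuda.OutOfMemoryError:  # RuntimeError on older torch versions
        print(f"batch_size={batch_size}: out of memory")
        break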

Also, do you have an opinion on this? Or should I try to experiment to come up with the optimal solution?

> Also, I'm not entirely sure where to apply self.batch_size. If the batch size is 128, the inputs will be 128 sentences, but the model will internally likely expand that into something like 150 samples, which requires two forward passes. So, in the end, to process e.g. 200 samples with a batch_size of 128, this approach would do 3 forward passes when only 2 are likely needed.
>
> This could be reduced to 2 if we remove the minibatch looping and just take all inputs in one go, but then this pipe isn't a great "generator", as it will fully convert the input generator into a list before it even starts doing any predictions. In the end, I do think that this is the most efficient approach, though.

Beyond that, I'd like to make #17 toggleable. If you're using e.g. a FewNERD model, you don't want the outputs to include OntoNotes labels like PERSON or WORK_OF_ART.

@davidberenstein1957
Contributor Author

> That's a great question. I put the batch size at 4 by default to be on the safe side, but I'll try the following: Load a ...-large SpanMarker model, forcibly limit the VRAM that torch can use to maybe 2 or 4GB, and then find the highest batch size that still runs. As for CPU, I think these are really slow on the larger models, but likely feasible for e.g. a BERT-tiny model.
>
> Also, do you have an opinion on this? Or should I try to experiment to come up with the optimal solution?

Normally, I would go with a batch size that is as large as reasonable on both CPU and GPU, and let people adjust it themselves when it breaks. So forcibly limiting the VRAM and experimenting to find a reasonable batch size sounds great.

What do you think about the misaligned batch size warning?

> Beyond that, I'd like to make #17 toggleable. If you're using e.g. a FewNERD model, you don't want the outputs to include OntoNotes labels like PERSON or WORK_OF_ART.

That makes sense, I will include that.
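
As a rough sketch of what such a toggle could look like; the overwrite_entities name, the "span_marker" factory name, and the config key are illustrative, not the final API:

import spacy

def add_span_marker(nlp, model_name: str, overwrite_entities: bool = True):
    """Hypothetical helper: optionally replace spaCy's own NER with SpanMarker."""
    if overwrite_entities:
        # Drop spaCy's built-in NER so e.g. a FewNERD model doesn't get mixed
        # with OntoNotes labels like PERSON or WORK_OF_ART.
        try:
            nlp.remove_pipe("ner")
        except ValueError:
            pass
    # Assumed factory name and config key; the actual registration may differ.
    return nlp.add_pipe("span_marker", config={"model": model_name})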

@tomaarsen
Owner

> What do you think about the misaligned batch size warning?

Oh, oops, I missed that. I think it misses the point a little bit: the issue exists especially when the batch sizes are the same. SpanMarker may, for larger sentences, create multiple samples per sentence that need to be passed to the embedding model, so 128 sentences may result in 135 samples for the embedding model. If the spaCy minibatch hands over 128 sentences, we then need 2 forward passes in SpanMarker, which is a bit inefficient.
Does that make sense?
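
Purely illustrative arithmetic with the numbers above:

import math

sentences_in_minibatch = 128   # what the spaCy minibatch hands over
samples_after_expansion = 135  # long sentences split into multiple SpanMarker samples
model_batch_size = 128

# The expansion forces a second, mostly-empty forward pass even though the
# minibatch size matches the model batch size.
print(math.ceil(samples_after_expansion / model_batch_size))  # 2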

@davidberenstein1957
Contributor Author

Ah yes, I understand now; I did not before. I removed the warning. For now it makes sense to stick with the initial proposal, then.

@davidberenstein1957
Contributor Author

@tomaarsen should I make any more changes?

@tomaarsen tomaarsen linked an issue Aug 24, 2023 that may be closed by this pull request
@tomaarsen tomaarsen merged commit 870ccd6 into tomaarsen:main Aug 24, 2023
4 checks passed
@tomaarsen
Owner

Thanks a bunch @davidberenstein1957
Some quick tests show a 2x speedup: with little optimization/tuning, I can now process about 42 sentences per second using a RoBERTa-large SpanMarker model.
