
added .pipe() method to spaCy integration #16

Merged
merged 15 commits into tomaarsen:main on Aug 24, 2023

Conversation

davidberenstein1957
Contributor

#15

@tomaarsen
Owner

tomaarsen commented Jun 27, 2023

Do we need __call__ to be implemented if pipe is defined?

Also, I'm not entirely sure where to apply self.batch_size. If the batch size is 128, the inputs will be 128 sentences, but the model will internally likely expand that into something like 150 samples, which requires two forward passes. So, in the end, to process e.g. 200 samples with a batch_size of 128, this approach would do 3 forward passes when only 2 are likely needed.

This could be reduced to 2 if we remove the minibatch looping and just take all inputs in one go, but then this pipe isn't a great "generator", as it will fully convert the input generator into a list before it even starts doing any predictions. In the end, I do think that this is the most efficient approach, though.

I'd love to hear your thoughts on this.
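For reference, a minimal sketch of the component shape being discussed, where __call__ simply delegates to pipe and pipe batches with spaCy's minibatch utility. The class and the _predict_batch helper are hypothetical illustrations, not the actual SpanMarker integration:

from spacy.util import minibatch

class SpanMarkerComponentSketch:
    """Hypothetical component shape; not the actual SpanMarker integration."""

    def __init__(self, model, batch_size: int = 4):
        self.model = model  # assumed: a loaded SpanMarker model
        self.batch_size = batch_size

    def __call__(self, doc):
        # spaCy invokes __call__ for single Docs; delegating to pipe keeps
        # the batching logic in one place.
        return next(iter(self.pipe([doc])))

    def pipe(self, docs, batch_size=None):
        # Loop over minibatches of Docs and yield predictions lazily.
        for batch in minibatch(docs, size=batch_size or self.batch_size):
            yield from self._predict_batch(batch)

    def _predict_batch(self, docs):
        # Placeholder: the real component would run the SpanMarker model here
        # and write the predicted entity spans onto each Doc.
        return docs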

@davidberenstein1957
Contributor Author

I will also tackle this in #17.

@tomaarsen
Owner

#17 also requires changes here:

# Remove the existing NER component, if it exists,
# to allow for SpanMarker to act as a drop-in replacement
try:
nlp.remove_pipe("ner")
except ValueError:
# The `ner` pipeline component was not found
pass
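
For context, that removal would sit next to adding the SpanMarker component. A rough sketch of what the calling code could look like; the "span_marker" factory name and the config key are assumptions, not necessarily the exact registered API:

import spacy

nlp = spacy.load("en_core_web_sm")

# Remove the existing NER component, if it exists,
# to allow for SpanMarker to act as a drop-in replacement
try:
    nlp.remove_pipe("ner")
except ValueError:
    # The `ner` pipeline component was not found
    pass

# Assumed factory name and config key; the actual registration may differ.
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

doc = nlp("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic.")
print(doc.ents)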

@davidberenstein1957
Contributor Author

@tomaarsen, do you have any indication of SpanMarker's memory usage? A batch_size of 4 seems quite low in most settings, and I think that, especially for a single call at the sentence level, setting the batch size to n_sentences will mostly be fine. Also, I am not sure whether it makes sense to call SpanMarker on CPU, which would also allow for higher batch sizes and faster inference. Given your expertise, you can probably give some more grounded pointers.

@tomaarsen
Owner

That's a great question. I put the batch size at 4 by default to be on the safe side, but I'll try the following: Load a ...-large SpanMarker model, forcibly limit the VRAM that torch can use to maybe 2 or 4GB, and then find the highest batch size that still runs. As for CPU, I think these are really slow on the larger models, but likely feasible for e.g. a BERT-tiny model.
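
A rough sketch of how that experiment could be set up, assuming torch's per-process memory fraction cap and an example OntoNotes RoBERTa-large model; treat the exact numbers and the predict call details as approximate:

import torch
from span_marker import SpanMarkerModel

# Cap this process to roughly 4GB of VRAM (assuming a 16GB card), then probe batch sizes.
torch.cuda.set_per_process_memory_fraction(4 / 16, device=0)

model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-roberta-large-ontonotes5"  # example "...-large" model
).cuda()

sentences = ["Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic."] * 512

for batch_size in (8, 16, 32, 64, 128):
    try:
        model.predict(sentences, batch_size=batch_size)
        print(f"batch_size={batch_size}: OK")
    except torch.cuda.OutOfMemoryError:  # RuntimeError on older torch versions
        print(f"batch_size={batch_size}: out of memory")
        break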

Also, do you have an opinion on this? Or should I try to experiment to come up with the optimal solution?

> Also, I'm not entirely sure where to apply self.batch_size. If the batch size is 128, the inputs will be 128 sentences, but the model will internally likely expand that into something like 150 samples, which requires two forward passes. So, in the end, to process e.g. 200 samples with a batch_size of 128, this approach would do 3 forward passes when only 2 are likely needed.
>
> This could be reduced to 2 if we remove the minibatch looping and just take all inputs in one go, but then this pipe isn't a great "generator", as it will fully convert the input generator into a list before it even starts doing any predictions. In the end, I do think that this is the most efficient approach, though.

Beyond that, I'd like to make #17 toggleable. If you're using e.g. a FewNERD model, you don't want the outputs to include OntoNotes labels like PERSON or WORK_OF_ART.

@davidberenstein1957
Contributor Author

> That's a great question. I put the batch size at 4 by default to be on the safe side, but I'll try the following: Load a ...-large SpanMarker model, forcibly limit the VRAM that torch can use to maybe 2 or 4GB, and then find the highest batch size that still runs. As for CPU, I think these are really slow on the larger models, but likely feasible for e.g. a BERT-tiny model.
>
> Also, do you have an opinion on this? Or should I try to experiment to come up with the optimal solution?

Normally, I would go with a batch size that is as large as reasonable on both CPU and GPU, and let people adjust it themselves when it breaks. So forcibly limiting the VRAM and experimenting to find a reasonable batch size sounds great.

What do you think about the misaligned batch size warning?

> Beyond that, I'd like to make #17 toggleable. If you're using e.g. a FewNERD model, you don't want the outputs to include OntoNotes labels like PERSON or WORK_OF_ART.

That makes sense, I will include that.
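
As a rough sketch of what such a toggle could look like; the overwrite_entities name, the "span_marker" factory name, and the config key are illustrative, not the final API:

import spacy

def add_span_marker(nlp, model_name: str, overwrite_entities: bool = True):
    """Hypothetical helper: optionally replace spaCy's own NER with SpanMarker."""
    if overwrite_entities:
        # Drop spaCy's built-in NER so e.g. a FewNERD model doesn't get mixed
        # with OntoNotes labels like PERSON or WORK_OF_ART.
        try:
            nlp.remove_pipe("ner")
        except ValueError:
            pass
    # Assumed factory name and config key; the actual registration may differ.
    return nlp.add_pipe("span_marker", config={"model": model_name})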

@tomaarsen
Owner

> What do you think about the misaligned batch size warning?

Oh, oops, I missed that. I think it misses the point a little bit: the issue exists especially when the batch sizes are the same. SpanMarker may, for larger sentences, create multiple samples per sentence that need to be passed to the embedding model, so 128 sentences may result in 135 samples for the embedding model. If the spaCy minibatch hands over 128 sentences, we then need 2 forward passes in SpanMarker, which is a bit inefficient.
Does that make sense?
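
Purely illustrative arithmetic with the numbers above:

import math

sentences_in_minibatch = 128   # what the spaCy minibatch hands over
samples_after_expansion = 135  # long sentences split into multiple SpanMarker samples
model_batch_size = 128

# The expansion forces a second, mostly-empty forward pass even though the
# minibatch size matches the model batch size.
print(math.ceil(samples_after_expansion / model_batch_size))  # 2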

@davidberenstein1957
Contributor Author

Ah yes, I understand now; I did not before. I removed the warning. For now it makes sense to stick with the initial proposal, then.

@davidberenstein1957
Contributor Author

@tomaarsen should I make any more changes?

@tomaarsen tomaarsen linked an issue Aug 24, 2023 that may be closed by this pull request
@tomaarsen tomaarsen merged commit 870ccd6 into tomaarsen:main Aug 24, 2023
4 checks passed
@tomaarsen
Owner

Thanks a bunch @davidberenstein1957
Some quick tests show a 2x speedup: with little optimization/tuning, I can now process about 42 sentences per second using a RoBERTa-large SpanMarker model.
