added `.pipe()` method to spaCy integration #16
Conversation
Do we need `__call__` to be implemented if […]? Also, I'm not entirely sure where to apply the […]. This could be reduced down to 2 if we remove the minibatch looping and just take all inputs in one go, but then this pipe isn't a great "generator", as it will fully convert the input generator into a list before it even starts doing any predictions. In the end, I do think that this is the most efficient approach, though. I'd love to hear your thoughts on this.
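For reference, a minimal sketch of the minibatch-looping `.pipe()` shape being discussed; the `batch_size` default, the `self.model.predict` call, and the entity-attachment step are assumptions for illustration, not the final implementation:

```python
# Sketch only: a spaCy component .pipe() that stays a generator by
# materializing one minibatch at a time instead of the whole stream.
from spacy.util import minibatch

def pipe(self, stream, batch_size: int = 4):
    """Process a stream of Docs in minibatches rather than one by one."""
    for docs in minibatch(stream, size=batch_size):
        docs = list(docs)  # only this minibatch is pulled into memory
        # One batched SpanMarker prediction per minibatch.
        batch_predictions = self.model.predict([doc.text for doc in docs])
        for doc, entities in zip(docs, batch_predictions):
            # Attaching `entities` to doc.ents is omitted in this sketch.
            yield doc
```

Dropping the minibatch loop would mean one `list(stream)` over the entire input before any prediction happens, which is exactly the generator trade-off described above.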
I will also tackle this. #17
#17 also requires changes here: `SpanMarkerNER/span_marker/__init__.py`, lines 43 to 49 in 41fdda8.
@tomaarsen, do you have any indication w.r.t. SpanMarker's memory usage? A batch_size of 4 seems quite low in most settings, and I think that, especially for a single call at the sentence level, setting the batch size to n_sentences will mostly be fine. Also, I am not sure if it makes sense to call SpanMarker on CPU, which would also allow for higher batch sizes and faster inference. Given your expertise, you can probably give some more grounded pointers.
That's a great question. I put the batch size at 4 by default to be on the safe side, but I'll try the following: load a `...-large` SpanMarker model, forcibly limit the VRAM that torch can use to maybe 2 or 4GB, and then find the highest batch size that still runs. As for CPU, I think the larger models are really slow there, but it's likely feasible for e.g. a BERT-tiny model. Also, do you have an opinion on this, or should I experiment to come up with the optimal solution?
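A sketch of that experiment, assuming a CUDA device and that `predict` exposes a `batch_size` argument (its default of 4 suggests so); the checkpoint name, memory fraction, and candidate batch sizes are placeholders, and `torch.cuda.OutOfMemoryError` needs a recent PyTorch (older versions raise a plain `RuntimeError`):

```python
import torch
from span_marker import SpanMarkerModel

# Cap torch's usable VRAM to simulate a ~2-4GB budget (fraction of total).
torch.cuda.set_per_process_memory_fraction(0.25)

# Placeholder checkpoint; any ...-large SpanMarker model would do.
model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-roberta-large-ontonotes5"
).cuda()

sentences = ["Amelia Earhart flew her Lockheed Vega 5B across the Atlantic."] * 512
for batch_size in (128, 64, 32, 16, 8, 4):
    try:
        model.predict(sentences, batch_size=batch_size)
        print(f"batch_size={batch_size} fits")
        break
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # free what we can and try a smaller batch
```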
Beyond that, I'd like to make #17 toggleable. If you're using e.g. a FewNERD model, then you don't want the outputs to include OntoNotes labels like PERSON or WORK_OF_ART.
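From the user side, such a toggle could look roughly like this; the `overwrite_entities` config key follows the parameter named later in this thread, and the factory and model names are assumptions, not the released API:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "span_marker",
    config={
        # Placeholder FewNERD checkpoint.
        "model": "tomaarsen/span-marker-bert-base-fewnerd-fine-super",
        # Hypothetical toggle: don't let SpanMarker labels replace doc.ents.
        "overwrite_entities": False,
    },
)
doc = nlp("Amelia Earhart flew her Lockheed Vega 5B across the Atlantic.")
```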
Normally, I would go with a batch size as large as is reasonable on both CPU and GPU, and let people fix it themselves when it breaks. So forcibly limiting the VRAM and experimenting with a reasonable batch size sounds great. What do you think about the misaligned batch size warning?
That makes sense, I will include that.
Oh, oops. I missed that. I think it misses the point a little bit: the issue exists especially when the batch sizes are the same. For larger sentences, SpanMarker may create multiple samples per sentence that need to be passed to the embedding model, so 128 sentences may result in, say, 135 samples for the embedding model. If the spaCy minibatch gets 128 sentences, then we need 2 forward passes for SpanMarker, which is a bit inefficient.
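As a toy illustration of that mismatch (the numbers are invented for the example):

```python
import math

spacy_minibatch = 128    # sentences handed over by spaCy's minibatch
model_samples = 135      # long sentences expanded into extra marker samples
span_marker_batch = 128  # SpanMarker's batch size, deliberately equal

forward_passes = math.ceil(model_samples / span_marker_batch)
print(forward_passes)    # 2: the second pass runs for only 7 samples
```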
Ah yes, now I understand; before I did not. I removed the warning. For now, it makes sense to leave it at the initial proposal, then.
chore: removed warning
chore: updated changelog
@tomaarsen should I make any more changes?
* Removed `Optional` from `overwrite_entities`
* Introduced a `set_ents` method to prevent duplicate code (sketched below)
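A minimal sketch of what a deduplicated `set_ents` helper might look like; the overlap handling via `spacy.util.filter_spans` is an assumption, not the merged implementation:

```python
from typing import List
from spacy.tokens import Doc, Span
from spacy.util import filter_spans

def set_ents(self, doc: Doc, ents: List[Span]) -> None:
    """Write predicted entities to the Doc, honoring overwrite_entities."""
    if self.overwrite_entities:
        doc.set_ents(ents)  # replace whatever spaCy had set before
    else:
        # Merge with existing entities; filter_spans resolves overlaps,
        # preferring longer (then earlier) spans.
        doc.set_ents(filter_spans(list(doc.ents) + ents))
```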
Thanks a bunch @davidberenstein1957
#15