[Feature]: Allow sentences longer than the token limit for sequence tagger training #3519
Comments
Hi @MattGPT-ai If that's not the case, I'd like to have a reproducible example of that limitation.
I did just confirm that I could successfully use

I will try some more things to see if I can reproduce it or narrow down whether there is a particular issue, perhaps a memory leak, or maybe there is just a particularly large batch that it's failing on.
https://gist.github.com/MattGPT-ai/80327ab5854cb0d978d23f205eeae882

Linking to a gist with notebooks that demonstrate success using
Would it be possible to refactor the training script such that batching is based on the chunked inputs? It seems like it may not currently get a consistent batch size after chunking. Could you offer any insight here? I also see there is a
The problem with longer sentences is that they will inevitably require more RAM, since the gradient information of all tokens is required. With sentences that long, you will want to have only 1 sentence computed at a time. For this, you can use the

Notice that you can split the batch by sentences without loss of quality, but you cannot split sentences, as the token embeddings/gradients are dependent on each other.

In general I would recommend setting the
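The parameter name is cut off above; as a rough sketch, assuming this refers to the mini_batch_chunk_size argument of Flair's ModelTrainer.train (which keeps the optimizer's effective batch size but runs only a few sentences through the forward/backward pass at a time), the setup might look like this. Corpus path, column format, and model name are placeholders:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# placeholder corpus in CoNLL-style column format
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=TransformerWordEmbeddings("bert-base-cased", fine_tune=True),
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "resources/taggers/long-sentence-ner",
    mini_batch_size=8,        # sentences per optimizer step
    mini_batch_chunk_size=1,  # sentences computed at a time, reducing peak GPU RAM
    max_epochs=10,
)
```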
I am giving the

I think at the very least, if the chunking function isn't useful, this could be reduced to a function to create a labeled sentence from a text with character-indexed entities.

As far as sentence chunking, I'm still a little unclear on whether there is truly no use case, perhaps because I'm confused by the multiple uses of the word "sentence." Let's say in our case we have very long texts, such as a resume, that contain many actual sentences. If some of the full resumes do not fit into memory, when is it invalid to split one into multiple
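For the "labeled sentence from character-indexed entities" part, a rough sketch of what such a helper could look like; the function name and tuple layout are made up here, and the token offset attributes follow recent flair releases (older versions use start_pos/end_pos):

```python
from flair.data import Sentence

def sentence_from_char_spans(text: str, entities):
    """Build a labeled flair Sentence from (start_char, end_char, label) entities.

    Hypothetical helper, not part of flair's API.
    """
    sentence = Sentence(text)
    for start_char, end_char, label in entities:
        # collect the 0-based positions of tokens that fall inside the character span
        positions = [
            token.idx - 1  # token.idx is 1-based
            for token in sentence
            if token.start_position >= start_char and token.end_position <= end_char
        ]
        if positions:
            # label the span covering those tokens
            sentence[positions[0]: positions[-1] + 1].add_label("ner", label)
    return sentence

sent = sentence_from_char_spans(
    "Jane Doe worked at Acme Corp for five years.",
    [(0, 8, "PER"), (19, 28, "ORG")],
)
print(sent.get_spans("ner"))
```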
I agree that

So I hope I can clarify how I meant this; I will refer to

You could split your Resumes into literal sentences using the SentenceSplitter. That means you won't have 1 Sentence-object per resume but multiple smaller ones. For those, the SentenceSplitter adds the next & previous objects as context, so you can use a FLERT-Transformer

And yes, this way the actual sentence boundaries would always be valid, presumably making it easier for the model to learn.

About the labels from char-indices:
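A minimal sketch of the splitting approach described here, assuming flair's SegtokSentenceSplitter (found in flair.tokenization in older releases) together with a FLERT-style transformer using use_context=True; the document text and model name are placeholders:

```python
from flair.embeddings import TransformerWordEmbeddings
from flair.splitter import SegtokSentenceSplitter  # flair.tokenization in older releases

# one long document, e.g. a full resume (placeholder text)
resume_text = "John worked at Acme Corp. He led the data team. ..."

# split the document into many smaller Sentence objects; the splitter also
# links each sentence to its previous/next neighbours as context
splitter = SegtokSentenceSplitter()
sentences = splitter.split(resume_text)

# a FLERT-style transformer can then attend to that context across boundaries
embeddings = TransformerWordEmbeddings(
    "xlm-roberta-large",  # placeholder model
    use_context=True,
    fine_tune=True,
)
```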
Problem statement
Currently, we are not able to train SequenceTagger models with tagged Sentence objects exceeding the token limit (typically 512). It does seem there is some support for long sentences in embeddings via the allow_long_sentences option, but it does not appear that this applies to sequence tagging, where the labels still need to be applied at the token level.

We have tried doing this, but if we don't limit the sentences to the token limit, we get an out-of-memory error. Not sure if this is a bug specifically, or just a lack of support for this feature.
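For reference, a minimal sketch of the embedding-level support mentioned above, assuming the allow_long_sentences flag on TransformerWordEmbeddings (model name and toy text are placeholders); embedding a long sentence this way works, which is separate from tagging it:

```python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    "bert-base-cased",          # placeholder model
    allow_long_sentences=True,  # stride over the 512-subtoken window when embedding
)

# a toy sentence far beyond 512 tokens
long_sentence = Sentence("some very long resume text " * 200)
embeddings.embed(long_sentence)  # embedding succeeds via striding
print(len(long_sentence), long_sentence[0].embedding.shape)
```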
Solution
Not sure if there is a more ideal way, but one solution for training is to split a sentence into "chunks" of 512 tokens or fewer and apply the labels to these chunks. It is important to avoid splitting a chunk across a labeled entity boundary.
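As an illustration only (not the actual PR implementation; the function name and the 512-token threshold are placeholders), a sketch of such a chunking function that moves the split point back so it never lands inside a labeled span:

```python
from flair.data import Sentence

def chunk_tagged_sentence(sentence: Sentence, max_tokens: int = 512, label_type: str = "ner"):
    """Split one labeled Sentence into shorter Sentences without cutting entities."""
    spans = sentence.get_spans(label_type)

    # map each labeled token position (0-based) to the span it belongs to
    span_of_token = {}
    for span in spans:
        for token in span.tokens:
            span_of_token[token.idx - 1] = span  # token.idx is 1-based

    chunks, start, n = [], 0, len(sentence)
    while start < n:
        end = min(start + max_tokens, n)
        # if the cut would fall inside an entity, move it back to that entity's start
        if end < n:
            span_at_cut = span_of_token.get(end)
            if span_at_cut is not None and span_of_token.get(end - 1) is span_at_cut:
                entity_start = span_at_cut.tokens[0].idx - 1
                if entity_start > start:  # guard against entities longer than max_tokens
                    end = entity_start

        chunk = Sentence([token.text for token in sentence.tokens[start:end]])
        # re-apply the labels of spans fully contained in this chunk
        for span in spans:
            first, last = span.tokens[0].idx - 1, span.tokens[-1].idx - 1
            if start <= first and last < end:
                chunk[first - start: last - start + 1].add_label(
                    label_type, span.get_label(label_type).value
                )
        chunks.append(chunk)
        start = end
    return chunks
```

Chunks produced this way could then be passed to the trainer as independent Sentence objects of bounded length.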
Additional Context
We have used this approach in training successfully, so I will be introducing our specific solution in a PR.