Implement sliding window #90
Comments
The TF 2.0 RNN has a …
I have now implemented something in my version: elifesciences/sciencebeam-trainer-delft#179. I tried the … (I also later realised that the segmentation model isn't using tokens, therefore the embedding won't make much sense.)
Maybe we could imagine two segmentation models, one for the initial window and one for the follow-up window(s), with some sort of overlap. About the segmentation model: it works line-based, and as lexical features it uses the first two words of the line (I tried 1 word, 3 words, adding the 1 or 2 last words of the line... just the first 2 words appeared to be enough with the current limited training data). So the embeddings still make sense for these "full" lexical features. But it also means concatenating the embeddings of the 2 tokens vertically, which negatively impacts memory usage and the feasible window size (see the sketch below).
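A minimal sketch of that per-line feature construction, assuming a generic `embedding_lookup` function and a fixed `EMBEDDING_DIM` (illustrative names only, not GROBID's actual code):

```python
import numpy as np

EMBEDDING_DIM = 300  # assumed embedding size for illustration

def embedding_lookup(token: str) -> np.ndarray:
    # Placeholder: in practice this would query pre-trained word embeddings.
    rng = np.random.default_rng(abs(hash(token)) % (2**32))
    return rng.standard_normal(EMBEDDING_DIM)

def line_lexical_features(line: str) -> np.ndarray:
    # Use only the first two tokens of the line as lexical features.
    tokens = line.split()[:2]
    vectors = [embedding_lookup(t) for t in tokens]
    # Pad with zero vectors if the line has fewer than two tokens.
    while len(vectors) < 2:
        vectors.append(np.zeros(EMBEDDING_DIM))
    # Concatenating doubles the per-line embedding size, which is the
    # memory cost mentioned above.
    return np.concatenate(vectors)  # shape: (2 * EMBEDDING_DIM,)
```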
Thank you for explaining that. I didn't realise that it was the first two tokens. In any case, I am now training a segmentation model without word embeddings at all, to see whether characters + features might be enough. It will be worth looking at the very long sequences to see whether that is actually good data (I haven't done that yet). I suspect that the sliding windows will come in more handy for the fulltext model, as I believe that is token-based like the header model (and should therefore have longer sequences).
One example is 025544v1, where the PDF currently results in many tokens (…).
Perhaps interesting on this topic: I have now also implemented sliding windows at prediction / tagging time, as it was running out of memory for some documents (I am not using it for training because it is much slower). Here is an evaluation of three options, over 200 validation documents. All of the models use the same trained DL models for the segmentation, header, reference-segmentation and citation models (with a max sequence length of 3000). Something that this chart might show: …
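A minimal sketch of sliding-window tagging under stated assumptions: `model.predict` is assumed to return per-token label scores, and the names and merge strategy are illustrative, not the sciencebeam-trainer-delft API:

```python
import numpy as np

def predict_with_sliding_window(model, tokens, window_size=3000, stride=1500):
    """Tag an arbitrarily long token sequence in overlapping windows so
    memory stays bounded, then stitch the window predictions together."""
    num_tokens = len(tokens)
    scores = None
    counts = np.zeros(num_tokens)
    for start in range(0, num_tokens, stride):
        end = min(start + window_size, num_tokens)
        # Assumed to return an array of shape (end - start, num_labels).
        window_scores = model.predict([tokens[start:end]])[0]
        if scores is None:
            scores = np.zeros((num_tokens, window_scores.shape[-1]))
        # Average scores over overlapping windows; a fancier merge could
        # weight positions near the window centre more heavily, since
        # predictions near window borders see less context.
        scores[start:end] += window_scores
        counts[start:end] += 1
        if end == num_tokens:
            break
    scores /= counts[:, None]
    return scores.argmax(axis=-1)
```

With `stride` at half the window size, every token is covered by at least one window and most by two, which smooths the discontinuity at window boundaries.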
I implemented the sliding windows by making the model … EDIT: I haven't tested it properly and now realised that …
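The truncated comment above does not say how the model was changed. One common way to carry context between consecutive windows in Keras is to make the recurrent layer stateful, so the hidden state from one window seeds the next; a minimal sketch of that idea (purely an assumption, not necessarily what the linked PR does):

```python
from tensorflow import keras

# Assumption: a stateful LSTM tagger. Batch size and window length must be
# fixed so Keras can carry hidden state between successive predict() calls.
BATCH_SIZE, WINDOW_LEN, NUM_FEATURES, NUM_LABELS = 1, 1500, 64, 10

model = keras.Sequential([
    keras.layers.LSTM(
        100, return_sequences=True, stateful=True,
        batch_input_shape=(BATCH_SIZE, WINDOW_LEN, NUM_FEATURES)),
    keras.layers.TimeDistributed(
        keras.layers.Dense(NUM_LABELS, activation="softmax")),
])

def tag_document(windows):
    # Reset state between documents so state does not leak across them,
    # then feed the consecutive windows of one document in order.
    model.reset_states()
    return [model.predict(w, batch_size=BATCH_SIZE) for w in windows]
```

Note that statefulness only carries context forward; a bidirectional layer's backward direction cannot see future windows, which limits this approach for BiLSTM taggers.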
I thought it might be better to discuss the sliding window in a separate issue.
#44 (comment)
#44 (comment)
/cc @kermitt2 @lfoppiano