Truncating long documents #46
Comments
I also think this is an issue, because YAKE has the T_position feature, which is based on the indices of the sentences a term was found in, under the hypothesis that the most important words appear near the top of the document. So terms such as "metrics", "accuracy", and "precision", which in an ML research paper mainly appear towards the end, will be rated as less important. But how do you plan to merge the lists of keywords we get from the segmented documents?
Hi @juhoinkinen and @prateekkrjain.
Hi @juhoinkinen. Providing the parameter for truncating is something to consider. Would you be willing to suggest a PR for that?
In this case I would probably split the document into sections and process them separately.
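One possible way to merge the per-section results is sketched below. It is only an illustration, not something YAKE provides: `merge_keyword_lists` is a hypothetical helper, and it assumes each section yields a list of `(keyword, score)` tuples where a lower score means a more relevant keyword. It also assumes that scores from different sections are comparable, which may not hold in practice.

```python
from typing import Iterable, List, Tuple

Keyword = Tuple[str, float]


def merge_keyword_lists(keyword_lists: Iterable[List[Keyword]], top: int = 5) -> List[Keyword]:
    """Merge per-section keyword lists into a single ranked list.

    Hypothetical helper. Assumes (keyword, score) tuples where a LOWER
    score means a more relevant keyword.
    """
    best = {}
    for keywords in keyword_lists:
        for keyword, score in keywords:
            key = keyword.lower()  # treat "Accuracy" and "accuracy" as one term
            # Keep the best (lowest) score seen for this term in any section.
            if key not in best or score < best[key][1]:
                best[key] = (keyword, score)
    # Rank the merged terms, most relevant (lowest score) first.
    return sorted(best.values(), key=lambda item: item[1])[:top]
```

Keeping the minimum score favors a term that is prominent in at least one section; averaging the scores over sections would instead favor terms that recur throughout the document.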
At the moment I can't, but if I have more time at some point I could take a look at this.
Hi,
I found out that when using YAKE for long documents, it can be advantageous to truncate them in advance.
We have a test set of theses and dissertations (766 documents averaging 196k characters and 22k words). When YAKE (or its integration in our application) is evaluated against this gold standard, it reaches an F1@5 score of 0.29. However, if the documents are first truncated to a fixed length of 15000 characters, the score improves to 0.33.
Since this is such a simple way to possibly improve results, maybe a parameter or option for truncating the input text could be added directly to YAKE? Or, better yet, could the term position feature be tuned to suit long texts better, so that it somehow gives even more importance to the beginning of the document?
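For reference, the truncation I mean can be done outside YAKE with a few lines. This is a minimal sketch, not existing YAKE functionality: `truncate_text` and `extract_keywords_truncated` are hypothetical names, and backing up to the last whitespace (so that no word is cut in half) is my own assumption, not something the evaluation above depended on.

```python
def truncate_text(text: str, max_chars: int = 15000) -> str:
    """Return at most max_chars characters of text, avoiding a cut mid-word."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the limit falls inside a word, back up to the last whitespace.
    if not text[max_chars].isspace():
        idx = max(cut.rfind(" "), cut.rfind("\n"), cut.rfind("\t"))
        if idx > 0:
            cut = cut[:idx]
    return cut.rstrip()


def extract_keywords_truncated(text: str, extractor, max_chars: int = 15000):
    """Truncate text, then pass it to any object with an extract_keywords method."""
    return extractor.extract_keywords(truncate_text(text, max_chars))
```

With YAKE installed, `extractor` could be a `yake.KeywordExtractor(top=5)` instance; the best value for `max_chars` probably depends on the document collection.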