Truncating long documents #46

juhoinkinen · 2021-05-17T14:03:32Z

Hi,
I found that when using YAKE on long documents, it can be advantageous to truncate them in advance.

We have a test set of theses and dissertations (766 documents averaging 196k characters and 22k words), and when those documents are used as a gold standard for evaluating YAKE (or its integration in our application), an F1@5 score of 0.29 is reached. However, if the documents are first truncated to a fixed length of 15000 characters, a better score of 0.33 is reached.
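A minimal sketch of the truncation step described above, done before the text is handed to the extractor. The helper name `truncate_text` and the choice to back up to the last whitespace (so that no word is cut in half) are assumptions for illustration only; the evaluation above only states that documents were cut to a fixed 15000 characters.

```python
def truncate_text(text: str, max_chars: int = 15000) -> str:
    """Return at most max_chars characters of text, ending on a word boundary.

    Hypothetical helper: not part of YAKE. If the cut would land inside a
    word, the partial word is dropped; a single word longer than max_chars
    is cut hard.
    """
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    if len(text) <= max_chars:
        return text

    cut = text[:max_chars]
    # The cut already falls on a word boundary: just trim trailing whitespace.
    if text[max_chars].isspace() or cut[-1:].isspace():
        return cut.rstrip()

    # Otherwise drop the partial last word, if there is an earlier boundary.
    parts = cut.rsplit(None, 1)
    return parts[0] if len(parts) == 2 else cut


if __name__ == "__main__":
    sample = "alpha beta gamma delta"
    print(truncate_text(sample, 8))
```

With the `yake` package installed, the truncated text could then be passed on as usual, for example `yake.KeywordExtractor(top=5).extract_keywords(truncate_text(document))`.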

Since this is such a simple way to possibly improve results, maybe a parameter/option for truncating the input text could be added directly to YAKE? Or, better yet, could the term position feature be tuned to suit long texts better, for example by giving even more importance to the beginning of the document?

prateekkrjain · 2021-07-27T11:45:33Z

@juhoinkinen,

I also think this is an issue. YAKE has the T_position feature, which is based on the indices of the sentences in which a term was found, on the hypothesis that the most important words appear near the top of the document.

So any term that appears more frequently towards the end of the document will be ranked lower. In an ML-based research paper, for example, terms like "metrics", "accuracy", and "precision" mainly appear towards the end.
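To illustrate the effect, here is a small sketch of the position feature, assuming the formula given in the YAKE paper, T_position = ln(ln(3 + median(sentence indices))). The function name `t_position` is hypothetical and this is not the library's code. A larger value is worse for the term, since YAKE treats lower final scores as more relevant.

```python
import math
from statistics import median
from typing import Sequence


def t_position(sentence_indices: Sequence[int]) -> float:
    """Position feature for one term (assumed formula from the YAKE paper).

    sentence_indices: zero-based indices of the sentences the term occurs in.
    """
    if not sentence_indices:
        raise ValueError("the term must occur in at least one sentence")
    return math.log(math.log(3.0 + median(sentence_indices)))


if __name__ == "__main__":
    # A term seen only in the opening sentences vs. one seen only near the end.
    print(round(t_position([0, 1, 2]), 3))
    print(round(t_position([900, 950, 1000]), 3))
```

Under this assumption, a term concentrated in the evaluation section of a long paper gets a clearly higher (worse) position value than one introduced in the abstract.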

But how do you plan to merge the lists of keywords we get from the segmented documents?

arianpasquali · 2022-01-13T03:01:04Z

Hi @juhoinkinen and @prateekkrjain.
This is an interesting topic to discuss, @rncampos.

arianpasquali · 2022-01-13T03:02:12Z

Hi @juhoinkinen.

Providing the parameter for truncating is something to consider. Would you be willing to suggest a PR for that?

arianpasquali · 2022-01-13T03:04:37Z

@prateekkrjain

In this case I would probably break the document into sections and handle the sections separately.
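One possible way to do this, sketched under stated assumptions: split the text into fixed-size chunks, run the extractor on each chunk, and merge the per-chunk lists by keeping each keyword's best score. The names `split_into_chunks` and `merge_keywords` are hypothetical, and keeping the minimum score is only one merging strategy; it relies on YAKE returning `(keyword, score)` pairs where a lower score means a more relevant keyword.

```python
from typing import Dict, Iterable, List, Tuple

Keyword = Tuple[str, float]  # (keyword, score); lower score = more relevant


def split_into_chunks(text: str, chunk_chars: int = 15000) -> List[str]:
    """Cut text into consecutive chunks of at most chunk_chars characters.

    Simplification: chunks may split a word or sentence at the boundary.
    """
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def merge_keywords(per_chunk: Iterable[List[Keyword]], top: int = 5) -> List[Keyword]:
    """Merge per-chunk keyword lists, keeping the lowest score per keyword."""
    best: Dict[str, float] = {}
    for keywords in per_chunk:
        for keyword, score in keywords:
            key = keyword.lower()  # treat case variants as the same keyword
            if key not in best or score < best[key]:
                best[key] = score
    # Sort ascending because lower scores are better.
    return sorted(best.items(), key=lambda item: item[1])[:top]


if __name__ == "__main__":
    chunk_results = [
        [("Neural networks", 0.02), ("accuracy", 0.10)],
        [("accuracy", 0.04), ("precision", 0.07)],
    ]
    print(merge_keywords(chunk_results, top=2))
```

Scores from different chunks are not strictly comparable, because each chunk is scored against its own statistics, so the merged ranking should be read as a heuristic.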

juhoinkinen · 2022-01-14T08:12:26Z

> Providing the parameter for truncating is something to consider. Would you be willing to suggest a PR for that?

At the moment I can't, but if I have more time at some point I could take a look at this.
