TFIDF requires a corpus to compare #27

AbhiPawar5 · 2021-04-26T12:15:07Z

Hi Andrew,
I was trying the Keyword Extraction API with TF-IDF, the code is:
bert_kws = extract_kws(
    method="TFIDF",  # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=5,
)

This returns the error:
AssertionError: TFIDF requires another text corpus to be passed to the corpuses_to_compare argument.

I wonder why we require a corpus to compare for keyword extraction. Thanks!


andrewtavis · 2021-04-26T13:57:21Z

Hi Abhishek,

TFIDF needs a corpus to compare against because of the "IDF" part: Inverse Document Frequency. kwx treats everything you pass in via your dataframe or other input as a single "document": topics are derived from it for LDA and BERT, and term frequencies are computed from it for TFIDF. Without something to compare against, TFIDF has no reference for deciding which words are more relevant to the text that was passed. If your inputs are large, you could treat each one as its own document and compare across them.
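
To make the IDF point concrete, here is a minimal sketch in plain Python. It is not how kwx computes TFIDF internally: the function name `tfidf_scores`, the example sentences, and the classic log(N / df) weighting are all assumptions chosen for illustration. With only one document every term appears in 1 of 1 documents, so the IDF factor is log(1) = 0 and every score collapses to zero.

```python
import math
from collections import Counter


def tfidf_scores(target_tokens, other_docs_tokens):
    """Score each term of the target document against the other documents.

    Hypothetical helper for illustration only; it uses the textbook
    tf * log(N / df) weighting, which may differ from what kwx does.
    """
    docs = [target_tokens] + list(other_docs_tokens)
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_counts = Counter(target_tokens)
    total = len(target_tokens)
    return {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in term_counts.items()
    }


# Made-up example texts (assumptions, not data from the issue).
survey = "the delivery was late and the delivery box was damaged".split()
other = "the staff was friendly and the store was clean".split()

alone = tfidf_scores(survey, [])          # nothing to compare against
compared = tfidf_scores(survey, [other])  # one comparison document

print(alone)
print(max(compared, key=compared.get))
```

In the first call all scores are zero, so no term can be ranked above another. In the second call, words shared by both texts (such as "the") still score zero, while words that only occur in the target text get positive scores.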

I'd be happy to chat a bit more on this if you wanted to send along a better description of what your inputs are :)

The wiki for kwx also has a resources for models page that has some good links for TFIDF and the other models, if you're interested!

Thanks again for writing :)

andrewtavis · 2021-04-26T16:15:14Z

A further explanation on this: in my package wikirec, TFIDF is used to find the terms that appear more frequently in a given Wikipedia article than in other articles. There we have different documents to compare, but for keyword extraction the likely goal is to find the keywords for the whole corpus, i.e. all the individual parts should be combined.

The use case for this comes from the freelance work I did that originally produced this. There, the task was finding keywords from surveys: TFIDF can be used to derive which words are relevant to the respondents of one survey by comparing its responses to those of other surveys. Also, as seen in examples/kw_extraction, we could segment the original corpus and use TFIDF to find keywords that are more relevant to one segment in comparison to the rest :)
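
The segment-versus-rest idea could look roughly like the sketch below. This is plain Python with a hypothetical helper, `keywords_per_segment`, and made-up survey texts; it does not use the kwx API, and the tf * log(N / df) weighting is an assumption. Each segment is merged into one "document" and scored against the others.

```python
import math
from collections import Counter


def keywords_per_segment(segments, top_n=2):
    """Return the top_n TF-IDF terms of each segment versus the others.

    Hypothetical helper for illustration; `segments` maps a segment
    name to a list of raw texts belonging to that segment.
    """
    # Merge each segment's texts into a single token list ("document").
    docs = {
        name: " ".join(texts).lower().split()
        for name, texts in segments.items()
    }
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

    keywords = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[name] = ranked[:top_n]
    return keywords


# Made-up survey responses (assumptions, not data from the issue).
segments = {
    "survey_a": ["the delivery was late", "late delivery again"],
    "survey_b": ["the staff was friendly", "friendly and helpful staff"],
}
result = keywords_per_segment(segments)
print(result)
```

Terms that occur in every segment score zero, so only the words that set one segment apart from the rest end up as its keywords.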

andrewtavis added the "question" label (Further information is requested) on Apr 26, 2021
