TFIDF requires a corpus to compare #27

Open
AbhiPawar5 opened this issue Apr 26, 2021 · 2 comments
Labels
question Further information is requested

Comments

@AbhiPawar5

Hi Andrew,
I was trying the keyword extraction API with TF-IDF. The code is:
bert_kws = extract_kws(
    method="TFIDF",  # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=5,
)

This returns the error:
AssertionError: TFIDF requires another text corpus to be passed to the corpuses_to_compare argument.

Why is a corpus to compare required for keyword extraction? Thanks!

@andrewtavis andrewtavis added the question Further information is requested label Apr 26, 2021
@andrewtavis
Owner

Hi Abhishek,

The need for a comparison corpus for TFIDF comes from the "IDF" part: Inverse Document Frequency. kwx treats everything you pass in via your dataframe or other input as a single "document". Topics are derived from that document for LDA and BERT, and term frequencies are found for TFIDF. Without something to compare against, TFIDF has no reference for deciding which words are more relevant to what has been passed. If your inputs are large, you could treat each one as its own document and compare across them.
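To make the point about the missing reference concrete, here is a minimal sketch in plain Python. It is not how kwx computes TFIDF internally; the function name `tfidf_scores`, the toy texts, and the unsmoothed `log(N / df)` weighting are all assumptions chosen for illustration. With only one document every term has the same document frequency, so the IDF factor carries no information.

```python
import math
from collections import Counter


def tfidf_scores(target_tokens, comparison_docs):
    """Score the terms of one target document against comparison documents.

    Hypothetical helper for illustration only, using idf = log(N / df).
    """
    docs = [target_tokens] + list(comparison_docs)
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    tf = Counter(target_tokens)
    total = len(target_tokens)
    return {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }


target = "battery life battery screen price".split()

# No comparison corpus: N = 1 and df = 1 for every term, so idf = log(1) = 0.
alone = tfidf_scores(target, [])

# With other documents, terms shared by all of them drop out and the
# terms specific to the target document stand out.
others = ["screen price shipping".split(), "price delivery screen".split()]
compared = tfidf_scores(target, others)

print(alone)
print(max(compared, key=compared.get))
```

Smoothed IDF variants avoid the exact zeros, but with a single document they still give every term the same IDF, so the ranking falls back to plain term frequency.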

I'd be happy to chat a bit more on this if you wanted to send along a better description of what your inputs are :)

The wiki for kwx also has a resources page for models with some good links for TFIDF and the other models, if you're interested!

Thanks again for writing :)

@andrewtavis
Owner

To explain this further: in my package wikirec, we use TFIDF to find the terms that appear more frequently in a given Wikipedia article than in other articles. In that case we have different documents to compare, but for keyword extraction the likely use is that we want the keywords for the whole corpus, i.e. all the individual parts should be combined.

The use case comes from the freelance work that originally produced this package. The task there was finding keywords from surveys, where TFIDF can be used to derive which words are relevant to respondents of one survey by comparing its responses to those of other surveys. As seen in examples/kw_extraction, we could also segment the original corpus and use TFIDF to find keywords that are more relevant to one segment than to the rest :)
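The segment idea can be sketched roughly as follows. This is an illustration in plain Python rather than the code in examples/kw_extraction; the function `keywords_per_segment`, the `(label, text)` input format, and the survey texts are all made up for the example. Each segment is merged into one document and scored against the other segments.

```python
import math
from collections import Counter, defaultdict


def keywords_per_segment(responses, num_keywords=2):
    """Return the top TFIDF terms for each segment of a corpus.

    `responses` is an iterable of (segment_label, text) pairs, which is a
    hypothetical input format assumed for this sketch.
    """
    # Merge all responses of a segment into a single "document".
    segments = defaultdict(list)
    for label, text in responses:
        segments[label].extend(text.lower().split())

    n_docs = len(segments)

    # Document frequency across segments.
    df = Counter()
    for tokens in segments.values():
        df.update(set(tokens))

    result = {}
    for label, tokens in segments.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[label] = ranked[:num_keywords]
    return result


responses = [
    ("survey_a", "the staff were friendly"),
    ("survey_a", "friendly staff and quick service"),
    ("survey_b", "the delivery was late"),
    ("survey_b", "late delivery and damaged box"),
]

top = keywords_per_segment(responses)
print(top)
```

Words that occur in both surveys, such as "the" and "and", score zero, while words concentrated in one survey rise to the top for that survey.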
