Corpus function - AttributeError: 'DocumentSet' object has no attribute 'title' #63

SS159 · 2023-10-09T12:19:57Z

AttributeError: 'DocumentSet' object has no attribute 'title' is displayed, even after changing title within relevant CSV file (docs_springer) to read 'title'.

Thanks in advance! :)

Sam


stijnh (Maintainer) · 2023-10-09T12:37:35Z

Thanks for using LitStudy!

Looks like build_corpus expects a DocumentSet and it seems that docs_springer is not a DocumentSet but something else.

Could you maybe provide the rest of the notebook, or do you have the line that creates docs_springer?

SS159 (Author) · 2023-10-09T12:47:28Z

Hi stijnh thanks for the quick response.

Sure, here we go:

SS159 (Author) · 2023-10-09T13:01:08Z

I have defined DocumentSet as docs_springer in my case, and that seems to have resolved the error: the output is no longer an AttributeError, but is instead as below:

Does this look correct to you?

stijnh (Maintainer) · 2023-10-09T14:06:57Z

refine_scopus returns two document sets: one for the documents found on Scopus and one for the documents not found on Scopus.

You would need to do something like this:

docs_springer, docs_not_found = litstudy.refine_scopus(docs_springer)
print(len(docs_springer), "papers found on Scopus")
print(len(docs_not_found), "papers NOT found on Scopus")
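
To illustrate why the original error shows up, here is a minimal sketch using stand-in classes. Nothing in it is LitStudy's real code: Document, DocumentSet, refine and collect_titles are hypothetical, and the sketch assumes that the corpus builder loops over its input and reads each item's title. Under that assumption, passing the un-unpacked pair hands it DocumentSet objects instead of documents.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a single paper."""
    title: str


class DocumentSet:
    """Hypothetical stand-in for a collection of documents."""

    def __init__(self, docs):
        self.docs = list(docs)

    def __iter__(self):
        return iter(self.docs)

    def __len__(self):
        return len(self.docs)


def refine(docs):
    """Stand-in for a refine step that returns a (found, not_found) pair."""
    found = DocumentSet(d for d in docs if d.title)
    not_found = DocumentSet(d for d in docs if not d.title)
    return found, not_found


def collect_titles(docs):
    """Stand-in for a corpus builder that reads the title of every item."""
    return [doc.title for doc in docs]


docs = DocumentSet([Document("Nature-based solutions"), Document("")])

# Wrong: keeping the whole tuple means iteration yields DocumentSet objects.
result = refine(docs)
try:
    collect_titles(result)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Right: unpack the pair first, then pass only the documents that were found.
docs_found, docs_not_found = refine(docs)
titles = collect_titles(docs_found)
print(titles)
```

Unpacking into two names, as in the snippet above, is what keeps the "not found" set out of the later steps.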

SS159 (Author) · 2023-10-16T10:58:07Z

Great, thanks stijnh.

Separately, I was wondering whether the full results from the word distribution can somehow be viewed, as the table output seems to provide only a snapshot?

Thanks, as always,

S

SS159 (Author) · 2023-10-16T11:16:12Z

@stijnh I'm also a bit confused about the ngram_threshold, even after reading the guidance documents. An ngram_threshold of 0.8 does what exactly? Classifies something as agreeing with/matching that ngram if 80% of its characters are the same as the reference ngram (included in the corpus)?

Sorry for the question, but I can't seem to clarify this on my own, and it would be good to know how LitStudy works here.

Thanks,

S

stijnh (Maintainer) · 2023-10-16T18:03:55Z

Hi,

> Great, thanks stijnh.
>
> Separately, I was wondering whether the full results from the word distribution can somehow be viewed, as the table output seems to provide only a snapshot?
>
> Thanks, as always,
>
> S

This is the complete table of all ngrams, that means all the words that contain a _ after processing (that is what the .filter(like="_") does).

Remove the .filter(...) part to see the complete word distribution.

> @stijnh I'm also a bit confused about the ngram_threshold, even after reading the guidance documents. An ngram_threshold of 0.8 does what exactly? Classifies something as agreeing with/matching that ngram if 80% of its characters are the same as the reference ngram (included in the corpus)?
>
> Sorry for the question, but I can't seem to clarify this on my own, and it would be good to know how LitStudy works here.
>
> Thanks,
>
> S

The parameter ngram_threshold determines how sensitive the preprocessing is when detecting n-grams, which in practice are mostly bigrams. It is not a character-similarity measure but a score threshold on how strongly two words co-occur: the higher the value, the fewer bigrams will be detected, because a word pair must score above the threshold to be merged. A bigram is a pair of words that frequently appear next to each other (for example, "data processing", "social media", "human rights", "United States").

The actual processing is done by gensim; see the threshold parameter in its documentation: https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases
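
For intuition about what such a threshold does, here is a simplified pure-Python sketch. It is not gensim's implementation: it assumes NPMI-style scoring (one of the scoring options gensim offers, with scores between -1 and 1), ignores gensim's min_count filter, and the function name detect_bigrams and the toy sentences are made up for illustration. The gensim documentation describes threshold as a score cut-off where higher means fewer phrases, which is the behaviour the sketch reproduces.

```python
from collections import Counter
from math import log


def detect_bigrams(sentences, threshold):
    """Return the adjacent word pairs whose NPMI score exceeds `threshold`."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
    total = sum(word_counts.values())

    detected = set()
    for (a, b), n_ab in pair_counts.items():
        p_a = word_counts[a] / total
        p_b = word_counts[b] / total
        p_ab = n_ab / total
        # NPMI: close to 1 when the two words almost always appear together,
        # close to 0 when they co-occur about as often as chance predicts.
        score = log(p_ab / (p_a * p_b)) / -log(p_ab)
        if score > threshold:
            detected.add((a, b))
    return detected


# Toy corpus: "social media" always appears as a pair, other pairs are incidental.
sentences = [
    ["social", "media", "data"],
    ["social", "media", "study"],
    ["data", "study", "social", "media"],
    ["study", "data"],
]

strict = detect_bigrams(sentences, threshold=0.8)
loose = detect_bigrams(sentences, threshold=0.1)
print(sorted(strict))
print(len(loose), "pairs pass the lower threshold")
```

With the strict threshold only the pair that consistently occurs together survives; lowering the threshold lets incidental pairs through as well.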

SS159 (Author) · 2023-10-17T14:14:00Z

Great, thanks for your help. I have removed the .filter(like="_") and am obviously presented with a larger list. My question is how I can view/export/download this list in its entirety?

Thanks again,

Sam

SS159 (Author) · 2023-10-17T14:49:40Z

Hi @stijnh another quick question from me which might have a simple answer, hence why I am not opening it as a new issue:

In the word distribution plot which has been produced below, is the highest result saying that the word 'nature' only appears across 35% of the documents? I am asking because it was a key search term used in the original Scopus search, so in theory all of the documents (that is, 100%) should include the word 'nature'.

Thanks, as always, for your patience and advice,

Sam

stijnh (Maintainer) · 2023-10-23T09:09:49Z

> Great, thanks for your help. I have removed the .filter(like="_") and am obviously presented with a larger list. My question is how I can view/export/download this list in its entirety?
>
> Thanks again,
>
> Sam

The object returned by compute_word_distribution is a regular pandas DataFrame. You can use the pandas I/O functions to export it to a file: https://pandas.pydata.org/docs/reference/io.html

For example, you can add ...sort_index().to_csv("word_distribution.csv")
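
A minimal sketch of both viewing and exporting the full table with pandas. The DataFrame here is a made-up stand-in for the result of compute_word_distribution (the column name and counts are assumptions), and the temporary output directory is only there to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Stand-in for litstudy.compute_word_distribution(corpus).sort_index();
# the column name and the values are invented for illustration.
word_dist = pd.DataFrame(
    {"count": [12, 7, 3]},
    index=["nature", "nature_solutions", "urban"],
).sort_index()

# Show every row instead of the truncated notebook preview.
with pd.option_context("display.max_rows", None):
    print(word_dist)

# Write the full table to a CSV file, then read it back to confirm its contents.
out_path = os.path.join(tempfile.mkdtemp(), "word_distribution.csv")
word_dist.to_csv(out_path)
round_trip = pd.read_csv(out_path, index_col=0)
print(round_trip)
```

In a notebook, writing to a plain relative path such as "word_distribution.csv" places the file next to the notebook.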

> Hi @stijnh another quick question from me which might have a simple answer, hence why I am not opening it as a new issue:
>
> In the word distribution plot which has been produced below, is the highest result saying that the word 'nature' only appears across 35% of the documents? I am asking because it was a key search term used in the original Scopus search, so in theory all of the documents (that is, 100%) should include the word 'nature'.
>
> Thanks, as always, for your patience and advice,
>
> Sam

Not sure about this one. Maybe sometimes nature is followed by solutions and it is interpreted as the bigram nature_solutions. You can disable bigram detection by removing the ngram_threshold= options from build_corpus.
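
The effect described here can be sketched in plain Python. This is only an illustration under the assumption that a detected bigram replaces its two words with a single joined token; merge_bigrams, document_share and the toy documents are hypothetical and not part of LitStudy.

```python
def merge_bigrams(tokens, bigrams):
    """Join adjacent tokens listed in `bigrams` into one token with '_'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def document_share(docs, word):
    """Fraction of documents that contain `word` as a standalone token."""
    return sum(word in doc for doc in docs) / len(docs)


# Every toy document mentions "nature", mostly as part of "nature solutions".
docs = [
    ["nature", "solutions", "urban"],
    ["nature", "solutions", "policy"],
    ["nature", "conservation"],
    ["nature", "solutions", "flood"],
]

before = document_share(docs, "nature")
merged_docs = [merge_bigrams(d, {("nature", "solutions")}) for d in docs]
after = document_share(merged_docs, "nature")
print(before, after)
```

Once the pair is merged, the standalone token "nature" is counted only in documents where it appears outside the bigram, so its share drops even though every document mentions it.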

Good luck!

SS159 (Author) · 2023-10-23T13:13:53Z

Thanks for sharing this @stijnh - one (final) question which isn't clear to me from the guidance, how can we change the parameters to search for trigrams? I have a feeling that the top scoring bigram below "nature_solutions" is actually "nature-based solutions" or "nature based solutions", and would like to capture this in the word distribution output.

SS159 (Author) · 2023-10-23T13:48:08Z

Thanks @stijnh , although I can't seem to get pandas to write the DataFrame to a .csv, here's what I'm doing:

There's no error returned, but nothing being written to the .csv either...

stijnh (Maintainer) · 2023-10-25T18:29:13Z

> There's no error returned, but nothing being written to the .csv either...

Replace

DataFrame = pd.DataFrame()

by

DataFrame = litstudy.compute_word_distribution(corpus).sort_index()

You were creating an empty DataFrame and then calling to_excel on that one.
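
As a small illustration of the difference, assuming only pandas: a freshly constructed DataFrame has no rows or columns, so exporting it produces a file with no data. The computed frame below is a made-up stand-in for the word distribution.

```python
import pandas as pd

# Stand-in for litstudy.compute_word_distribution(corpus); values are made up.
computed = pd.DataFrame({"count": [12, 7]}, index=["nature", "urban"])

# A fresh DataFrame has no link to earlier results: no rows, no columns.
blank = pd.DataFrame()

print(blank.empty, blank.shape)
print(computed.empty, computed.shape)
```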

SS159 (Author) · 2023-10-30T09:16:31Z

Great, thanks @stijnh

I've now instead encountered the issue of the exported .xlsx from DataFrame being unopenable, due to an invalid extension of file pathway, but this seems to be a known issue that requires a workaround so I've posted elsewhere. If you are curious, here's the issue
