Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed multiple iterations of corpus in p_boolean_document. #1325

Merged
merged 2 commits into from
May 15, 2017

Conversation

danielchamberlain
Copy link
Contributor

In the previous version of p_boolean_document, the entire
corpus was iterated through for each top_id. In this new
version, the corpus is iterated through a single time
and the matching top_ids for each document are extracted
simultaneously.

In the previous version of p_boolean_document, the entire
corpus was iterated through for each top_id. In this new
version, the corpus is iterated through a single time
and the matching docs for each top_id are extracted
simultaneously.
@danielchamberlain
Copy link
Contributor Author

The build failed because of trailing whitespace after a colon. Should I submit a new pull request?

@menshikh-iv
Copy link
Contributor

@danielchamberlain No, you can fix it in current PR

@danielchamberlain
Copy link
Contributor Author

Thanks. All set now.

@tmylk
Copy link
Contributor

tmylk commented May 15, 2017

Thanks for the improvement. Confirm regression test testPBooleanDocument passe.

@tmylk tmylk merged commit 2996884 into piskvorky:develop May 15, 2017
for n, document in enumerate(corpus):
doc_words = frozenset(x[0] for x in document)
top_ids_in_doc = top_ids.intersection(doc_words)
if len(top_ids_in_doc) > 0:
Copy link
Owner

@piskvorky piskvorky May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if top_ids_in doc: in Python.

In fact, better remove the condition completely, it doesn't seem to be needed: for word_id in top_ids.intersection(doc_words): ...

top_ids_in_doc = top_ids.intersection(doc_words)
if len(top_ids_in_doc) > 0:
for id in top_ids_in_doc:
per_topic_postings[id].add(n)
Copy link
Owner

@piskvorky piskvorky May 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id is a reserved keyword, best use something else (like word_id).

@danielchamberlain danielchamberlain deleted the pboolean_speedup branch May 18, 2017 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants