Removed multiple iterations of corpus in p_boolean_document. #1325
Conversation
In the previous version of p_boolean_document, the entire corpus was iterated through for each top_id. In this new version, the corpus is iterated through a single time and the matching docs for each top_id are extracted simultaneously.
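As an illustration of the change, here is a minimal sketch of the two strategies. It assumes a bag-of-words corpus of (word_id, count) pairs; the function names postings_multi_pass and postings_single_pass are made up for this sketch, while corpus, top_ids, and per_topic_postings follow the diff hunks quoted below. This is not the exact patch.

from collections import defaultdict

# Old strategy (sketch): one full pass over the corpus per top word id.
def postings_multi_pass(corpus, top_ids):
    per_topic_postings = {}
    for word_id in top_ids:
        per_topic_postings[word_id] = set(
            n for n, document in enumerate(corpus)
            if word_id in frozenset(x[0] for x in document))
    return per_topic_postings

# New strategy (sketch): a single pass over the corpus, collecting the
# matching documents for every top word id simultaneously.
def postings_single_pass(corpus, top_ids):
    per_topic_postings = defaultdict(set)
    for n, document in enumerate(corpus):
        doc_words = frozenset(x[0] for x in document)
        for word_id in top_ids.intersection(doc_words):
            per_topic_postings[word_id].add(n)
    return per_topic_postings

The single pass replaces len(top_ids) scans of the corpus with one.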
The build failed because of trailing whitespace after a colon. Should I submit a new pull request?
@danielchamberlain No, you can fix it in the current PR.
Thanks. All set now.
Thanks for the improvement. Confirmed by the regression test.
for n, document in enumerate(corpus):
    doc_words = frozenset(x[0] for x in document)
    top_ids_in_doc = top_ids.intersection(doc_words)
    if len(top_ids_in_doc) > 0:
The emptiness check reads better as "if top_ids_in_doc:", which is more idiomatic in Python.
In fact, better to remove the condition completely, it doesn't seem to be needed: for word_id in top_ids.intersection(doc_words): ...
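Applied to the hunk above, the suggestion collapses the loop body to something like this sketch (assuming per_topic_postings maps each word id to a set, e.g. a defaultdict(set)):

for n, document in enumerate(corpus):
    doc_words = frozenset(x[0] for x in document)
    # Iterating the intersection directly: an empty set simply yields
    # nothing, so no explicit emptiness check is needed.
    for word_id in top_ids.intersection(doc_words):
        per_topic_postings[word_id].add(n)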
top_ids_in_doc = top_ids.intersection(doc_words)
if len(top_ids_in_doc) > 0:
    for id in top_ids_in_doc:
        per_topic_postings[id].add(n)
id shadows a Python built-in function, best use something else (like word_id).
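A quick, hypothetical illustration (not from the patch) of why shadowing the built-in is risky:

for id in [1, 2, 3]:
    pass
# The name id is now bound to the int 3, hiding the built-in function:
id(object())  # TypeError: 'int' object is not callable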