Fix zero norm bug #376

Merged (7 commits) on Aug 7, 2018

Conversation

Victor0118
Member

@lintool @Peilin-Yang Could you take a quick look at this PR?

The bug details are in #374.

@lintool
Member

lintool commented Aug 3, 2018

@Victor0118 I don’t think this is the right fix... this will impact the quality of the estimated RM. We should actually remove those documents...

@Victor0118
Member Author

Victor0118 commented Aug 3, 2018

According to the source code, EnglishAnalyzer does not remove words that contain special characters.

But the RM3 implementation in Anserini removes words that contain any character not in [a-z0-9]. Is there a particular reason for this?

This is what causes the empty documents. Deleting this line would also fix the bug.
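
To illustrate the point above, here is a minimal sketch of how a character-class filter can leave a feedback document with no terms at all. It is not Anserini's actual code: the class name `TermFilterSketch`, the method `keepFeedbackTerms`, and the exact pattern `[a-z0-9]+` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not taken from Anserini.
final class TermFilterSketch {
  // Assumed filter: keep a term only if every character is in [a-z0-9].
  private static final Pattern ALLOWED = Pattern.compile("[a-z0-9]+");

  // Returns the analyzed terms that would survive the feedback filter.
  static List<String> keepFeedbackTerms(List<String> analyzedTerms) {
    List<String> kept = new ArrayList<>();
    for (String term : analyzedTerms) {
      if (ALLOWED.matcher(term).matches()) {
        kept.add(term);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Every term carries an accented character: the document is indexed,
    // but no term survives the filter, so the feedback document is empty.
    System.out.println(keepFeedbackTerms(Arrays.asList("caf\u00e9", "na\u00efve")));
    // A mixed document keeps only the plain ASCII term.
    System.out.println(keepFeedbackTerms(Arrays.asList("caf\u00e9", "rm3")));
  }
}
```

The analyzer and the feedback filter disagree about what counts as a term, which is why a document can be non-empty in the index and still empty as a feedback candidate.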

@Victor0118 Victor0118 changed the title fix zero norm bug Fix zero norm bug Aug 3, 2018
@lintool
Member

lintool commented Aug 3, 2018

The [a-z0-9] filter throws away terms that are likely to be junk... if we remove that line, then all the RM3 results will change, and we would need to verify all the effectiveness numbers, etc. I'd rather not do that...

@Victor0118
Member Author

Another option is to add a condition like this:

if (len(doc) < threshold){
   // do not add doc to the RM3 candidate
}

This would also filter out the junk documents in the Wikipedia corpus. Does that sound good?

@lintool
Member

lintool commented Aug 3, 2018

Yes, although that might also change regression results...

Why don't we just throw away all feature vectors with norm of 0?
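
A rough sketch of that idea, under stated assumptions: feature vectors are modelled here as plain `Map<String, Double>` objects and the L1 norm is used, neither of which is confirmed by this thread. `ZeroNormSketch`, `l1Norm`, and `normalizeNonEmpty` are hypothetical names, not Anserini APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real feature-vector class and norm may differ.
final class ZeroNormSketch {
  // Sum of absolute term weights (assumed norm for this example).
  static double l1Norm(Map<String, Double> vector) {
    double sum = 0.0;
    for (double weight : vector.values()) {
      sum += Math.abs(weight);
    }
    return sum;
  }

  // Normalizes each feedback vector, dropping any vector whose norm is 0
  // so that no weight is ever divided by zero.
  static List<Map<String, Double>> normalizeNonEmpty(List<Map<String, Double>> docs) {
    List<Map<String, Double>> normalized = new ArrayList<>();
    for (Map<String, Double> doc : docs) {
      double norm = l1Norm(doc);
      if (norm == 0.0) {
        continue; // zero-length feedback document: skip it entirely
      }
      Map<String, Double> scaled = new HashMap<>();
      for (Map.Entry<String, Double> entry : doc.entrySet()) {
        scaled.put(entry.getKey(), entry.getValue() / norm);
      }
      normalized.add(scaled);
    }
    return normalized;
  }
}
```

Because the check only removes vectors that contribute no terms, it should leave results unchanged for collections that never produce an empty feedback document, which is presumably why it is less likely to disturb existing regression numbers than a length threshold.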

@Victor0118
Member Author

Okay. That makes sense!

@lintool
Copy link
Member

lintool commented Aug 3, 2018

Can you run the regression experiments to make sure everything still works?
https://github.com/castorini/Anserini/blob/master/docs/regression-tuna.md

Should be a matter of cut-and-paste commands on tuna and waiting for the runs to finish...

@Victor0118
Member Author

Okay!

@Victor0118
Member Author

no segments* file found in MMapDirectory@/tuna1/indexes/lucene-index.aquaint.pos+docvectors+rawdocs
no segments* file found in MMapDirectory@/tuna1/indexes/lucene-index.nyt.pos+docvectors+rawdocs
no segments* file found in MMapDirectory@/tuna1/indexes/lucene-index.tweets2011.pos+docvectors+rawdocs
no segments* file found in MMapDirectory@/tuna1/indexes/lucene-index.tweets2013.pos+docvectors+rawdocs

@lintool All regression tests passed except for the four indexes listed above, which could not be found.

@Victor0118
Member Author

@lintool All regression tests passed!

@lintool
Member

lintool commented Aug 7, 2018

@Victor0118 great!
One final thing - can you add a comment that explains why we do that check in the code? Then we'll be good for merging.

@Victor0118
Member Author

@lintool Done.

@lintool
Member

lintool commented Aug 7, 2018

How about a comment that is more descriptive? Like:

Avoids zero-length feedback documents, which cause division by zero when computing term weights.
Zero-length feedback documents occur (e.g., with CAR17) when a document has only terms that contain accents (which are indexed, but not selected for feedback).

@Victor0118
Member Author

@lintool Thanks for your kind suggestion. I have updated the comment.

@lintool lintool merged commit 135d08c into castorini:master Aug 7, 2018
@Victor0118 Victor0118 deleted the fix-rm3-empty-doc-bug branch August 9, 2018 15:47
crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022