-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix zero norm bug #376
Fix zero norm bug #376
Conversation
@Victor0118 I don’t think this is the right fix... this will impact the quality of the estimated RM. we should actually remove those documents... |
According to the source code, Englishanalyzer does not remove the word containing special character. But in RM3 implementation in Anserini, you remove words that contain character not in [a-z0-9]. Is there any special reason to do this? This is the cause of the empty doc. Deleting this line can also fix this bug. |
The |
Another choice is adding a condition like this: if (len(doc) < threshold){
// do not add doc to the RM3 candidate
} This can also avoid the junk document in Wikipedia corpus. Good? |
Yes, although that might also change regression results... Why don't we just throw away all feature vectors with norm of 0? |
Okay. That makes sense! |
Can you run the regression experiments to make sure everything still works? Should be a matter of cut-and-paste commands on tuna and waiting for the runs to finish... |
OKay! |
@lintool All regression tests passed except four index files not found above. |
@lintool All regression test passed! |
@Victor0118 great! |
add comment
@lintool Done. |
How about a comment that is more descriptive? Like: Avoids zero-length feedback documents, which causes division by zero when computing term weights. |
Edit comment.
@lintool Thanks for your kind suggestion. I have updated the comment. |
@lintool @Peilin-Yang Could you take a quick look at this PR?
The bug detail is in #374 .