Fix zero norm bug #376

Victor0118 · 2018-08-02T17:41:34Z

@lintool @Peilin-Yang Could you take a quick look at this PR?

The bug details are in #374.

lintool · 2018-08-03T16:09:07Z

@Victor0118 I don't think this is the right fix... This will affect the quality of the estimated relevance model (RM). We should actually remove those documents...

Victor0118 · 2018-08-03T16:29:09Z

According to the source code, EnglishAnalyzer does not remove words that contain special characters.

But the RM3 implementation in Anserini removes words that contain any character outside [a-z0-9]. Is there a particular reason for this?

This is what causes the empty document. Deleting that line would also fix the bug.
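To make the cause concrete, here is a minimal sketch of how such a filter can leave a feedback document with no terms at all. The class name, the method name, and the exact pattern `[a-z0-9]+` are assumptions for illustration only; this is not the code in Rm3Reranker.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Anserini code.
class TermFilterSketch {
  // Keeps only terms made up entirely of characters in [a-z0-9].
  // The pattern is an assumption based on the character class quoted above.
  static Map<String, Integer> filterTerms(Map<String, Integer> termFreqs) {
    Map<String, Integer> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      // String.matches requires the whole term to match the pattern.
      if (e.getKey().matches("[a-z0-9]+")) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // A document whose only indexed terms contain accented characters.
    Map<String, Integer> doc = new LinkedHashMap<>();
    doc.put("caf\u00e9", 2);
    doc.put("r\u00e9sum\u00e9", 1);
    // Every term is rejected, so the feedback vector ends up empty.
    System.out.println(filterTerms(doc).isEmpty()); // prints: true
  }
}
```

The analyzer indexes both terms, but none survives the filter, so the resulting vector has no entries and a norm of zero.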

lintool · 2018-08-03T20:29:30Z

The [a-z0-9] filter throws away terms that are likely to be junk... If we remove that line, all the RM3 results will change, and we would then need to re-verify all the effectiveness numbers, etc. I'd rather not do that...

Victor0118 · 2018-08-03T20:36:13Z

Another choice is adding a condition like this:

if (len(doc) < threshold){
   // do not add doc to the RM3 candidate
}

This would also keep junk documents in the Wikipedia corpus out of the feedback set. Does that work?

lintool · 2018-08-03T20:38:55Z

Yes, although that might also change regression results...

Why don't we just throw away all feature vectors with norm of 0?
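A rough sketch of what that could look like, assuming feature vectors are plain term-to-weight maps and that the norm in question is the L1 norm. Both are assumptions, and the class and method names are hypothetical rather than Anserini's:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Anserini code.
class ZeroNormSketch {
  // L1 norm: sum of the absolute term weights.
  static double l1Norm(Map<String, Double> vector) {
    double sum = 0.0;
    for (double w : vector.values()) {
      sum += Math.abs(w);
    }
    return sum;
  }

  // Returns L1-normalized copies of the input vectors,
  // skipping any vector whose norm is zero.
  static List<Map<String, Double>> normalizeNonZero(List<Map<String, Double>> docs) {
    List<Map<String, Double>> out = new ArrayList<>();
    for (Map<String, Double> vector : docs) {
      double norm = l1Norm(vector);
      if (norm == 0.0) {
        continue; // empty feedback document: nothing to contribute
      }
      Map<String, Double> normalized = new HashMap<>();
      for (Map.Entry<String, Double> e : vector.entrySet()) {
        normalized.put(e.getKey(), e.getValue() / norm);
      }
      out.add(normalized);
    }
    return out;
  }
}
```

Because the check only drops vectors that would otherwise produce invalid weights, documents with at least one surviving term are normalized exactly as before, which is why this option is less likely to shift existing regression numbers than a length threshold.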

Victor0118 · 2018-08-03T20:39:59Z

Okay. That makes sense!

lintool · 2018-08-03T21:26:17Z

Can you run the regression experiments to make sure everything still works?
https://github.com/castorini/Anserini/blob/master/docs/regression-tuna.md

Should be a matter of cut-and-paste commands on tuna and waiting for the runs to finish...

Victor0118 · 2018-08-03T21:48:34Z

Okay!

Victor0118 · 2018-08-05T20:30:39Z

no segments* file found in MMapDirectory@/tuna1/indexes/lucene-index.aquaint.pos+docvectors+rawdocs
no segments* file found in MMapDirectory@/tuna1/indexes/lucene-index.nyt.pos+docvectors+rawdocs
no segments* file found in MMapDirectory@/tuna1/indexes/lucene-index.tweets2011.pos+docvectors+rawdocs
no segments* file found in MMapDirectory@/tuna1/indexes/lucene-index.tweets2013.pos+docvectors+rawdocs

@lintool All regression tests passed except for the four listed above, whose index files were not found.

Victor0118 · 2018-08-06T17:43:06Z

@lintool All regression tests passed!

lintool · 2018-08-07T11:44:02Z

@Victor0118 great!
One final thing - can you add a comment that explains why we do that check in the code? Then we'll be good for merging.

add comment

Victor0118 · 2018-08-07T14:46:36Z

@lintool Done.

lintool · 2018-08-07T14:50:12Z

How about a comment that is more descriptive? Like:

Avoids zero-length feedback documents, which cause division by zero when computing term weights.
Zero-length feedback documents occur (e.g., with CAR17) when a document has only terms that contain accents (which are indexed, but not selected for feedback).
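For context on why this shows up as NaN scores (as reported in #374) rather than as an exception: Java floating-point division by zero does not throw, and 0.0 / 0.0 evaluates to NaN, which then propagates through later arithmetic. The small sketch below illustrates only that language behavior; the class and method names are hypothetical and the weight formula is a stand-in, not the RM3 computation.

```java
// Hypothetical illustration of Java floating-point behavior, not Anserini code.
class NanPropagationSketch {
  // Stand-in for normalizing a term weight by the document's norm.
  static double normalizedWeight(double termWeight, double docNorm) {
    return termWeight / docNorm;
  }

  public static void main(String[] args) {
    // An empty feedback document has weight 0.0 and norm 0.0.
    double w = normalizedWeight(0.0, 0.0);
    System.out.println(Double.isNaN(w));       // prints: true
    // Once a NaN enters a sum, the total is NaN as well.
    System.out.println(Double.isNaN(0.5 + w)); // prints: true
  }
}
```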

Edit comment.

Victor0118 · 2018-08-07T14:57:17Z

@lintool Thanks for your kind suggestion. I have updated the comment.

fix zero norm bug

58254fe

Victor0118 changed the title ~~fix zero norm bug~~ Fix zero norm bug Aug 3, 2018

Victor0118 added 2 commits August 3, 2018 17:07

remove wrong fix

bd769f2

change norm check location

435aed5

Merge branch 'master' into fix-rm3-empty-doc-bug

57d8384

Update Rm3Reranker.java

636d1ca

add comment

Update Rm3Reranker.java

6b36f3c

Edit comment.

Merge branch 'master' into fix-rm3-empty-doc-bug

95eab04

lintool approved these changes Aug 7, 2018

lintool merged commit 135d08c into castorini:master Aug 7, 2018

lintool mentioned this pull request Aug 8, 2018

Nan score in RM3 term weight on CAR corpus #374

Closed

Victor0118 deleted the fix-rm3-empty-doc-bug branch August 9, 2018 15:47

crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022

Fix environment (castorini#376)

00dd97a
