-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring of WashingtonPost collection for Core18 #1092
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1092 +/- ##
============================================
+ Coverage 44.81% 45.29% +0.47%
- Complexity 664 673 +9
============================================
Files 141 141
Lines 8184 8188 +4
Branches 1166 1167 +1
============================================
+ Hits 3668 3709 +41
+ Misses 4187 4147 -40
- Partials 329 332 +3
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Surprised the regression numbers dropped from such a small change
protected WashingtonPostObject obj; | ||
|
||
protected String fullCaption = null; | ||
protected String kicker = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's a kicker?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's one of the fields in WashingtonPost corpus: http://trec-news.org/guidelines-2020.pdf
... we decree that articles from the "Opinion", "Letters to the Editor", or "The Post's View" sections, as labeled in the "kicker" field, are not relevant
Refactoring of WashingtonPost collection based on the clarified definition of
contents()
andraw()
inSourceDocument
, per #1048.Before, both
contents()
andraw()
returned the raw JSON, andWashingtonPostGenerator
extracted the article text for indexing.Now,
raw()
returns the raw JSON, andcontents()
returns the extracted document contents. In other words, the logic for parsing the JSON has been moved fromWashingtonPostGenerator
into the collection itself. As a result, the generator has been simplified. It inherits from theDefaultLuceneDocumentGenerator
, which creates the default fields, and then adds additional specific ones.Regression values went down slightly for
Ax
, which seems very sensitive to tiny differences. The difference here is that, before, the "empty document" check was performed on the JSON, so it never triggered (since the JSON was never empty). With the new processing logic, the "empty document" check is performed oncontents()
, and so the number of empty documents is now accurate (there are 6). Apparently, this is enough to change regression numbers forAx
.