Refactoring of WashingtonPost collection for Core18 #1092

Merged 5 commits on Apr 12, 2020

@lintool lintool commented Apr 11, 2020

Refactoring of WashingtonPost collection based on the clarified definition of contents() and raw() in SourceDocument, per #1048.

Before, both contents() and raw() returned the raw JSON, and WashingtonPostGenerator extracted the article text for indexing.

Now, raw() returns the raw JSON, and contents() returns the extracted document contents. In other words, the logic for parsing the JSON has been moved from WashingtonPostGenerator into the collection itself. As a result, the generator has been simplified. It inherits from the DefaultLuceneDocumentGenerator, which creates the default fields, and then adds additional specific ones.
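The split described above can be illustrated with a minimal, self-contained sketch. Every name in it (`SketchSourceDocument`, `SketchPostDocument`, `SketchDefaultGenerator`, `SketchPostGenerator`, the `source` field) is hypothetical and is not taken from the Anserini code. A `Map` stands in for a Lucene document, a naive regular expression stands in for real JSON parsing, and the JSON layout (a `contents` array of objects with a `content` string) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the SourceDocument contract described in this PR.
interface SketchSourceDocument {
  String id();
  String contents(); // extracted text, ready for indexing
  String raw();      // original record, returned untouched
}

// Hypothetical document class: the parsing logic lives on the collection side.
class SketchPostDocument implements SketchSourceDocument {
  // Naive pattern standing in for real JSON parsing; it ignores escaped quotes.
  private static final Pattern CONTENT =
      Pattern.compile("\"content\"\\s*:\\s*\"([^\"]*)\"");

  private final String id;
  private final String json;

  SketchPostDocument(String id, String json) {
    this.id = id;
    this.json = json;
  }

  @Override public String id() { return id; }

  @Override public String raw() { return json; }

  @Override public String contents() {
    // Join every "content" string value with newlines.
    StringBuilder sb = new StringBuilder();
    Matcher m = CONTENT.matcher(json);
    while (m.find()) {
      if (sb.length() > 0) sb.append('\n');
      sb.append(m.group(1));
    }
    return sb.toString();
  }
}

// Hypothetical base generator: builds only the default fields.
class SketchDefaultGenerator {
  Map<String, String> createDocument(SketchSourceDocument doc) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("id", doc.id());
    fields.put("contents", doc.contents());
    fields.put("raw", doc.raw());
    return fields;
  }
}

// Hypothetical collection-specific generator: inherits the defaults, then adds one field.
class SketchPostGenerator extends SketchDefaultGenerator {
  @Override
  Map<String, String> createDocument(SketchSourceDocument doc) {
    Map<String, String> fields = super.createDocument(doc);
    fields.put("source", "washington-post"); // illustrative extra field
    return fields;
  }
}

class RawVsContentsSketch {
  public static void main(String[] args) {
    String json = "{\"id\":\"a1\",\"contents\":["
        + "{\"content\":\"Headline\"},{\"content\":\"Body text.\"}]}";
    SketchSourceDocument doc = new SketchPostDocument("a1", json);
    System.out.println(doc.contents());
    System.out.println(new SketchPostGenerator().createDocument(doc).keySet());
  }
}
```

In this sketch the generator never touches the JSON; it only reads `contents()` and `raw()`, which is the simplification the description attributes to the refactoring.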

Regression values went down slightly for Ax, which seems very sensitive to tiny differences. The difference here is that, before, the "empty document" check was performed on the JSON, so it never triggered (since the JSON was never empty). With the new processing logic, the "empty document" check is performed on contents(), and so the number of empty documents is now accurate (there are 6). Apparently, this is enough to change regression numbers for Ax.
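To make the empty-document effect concrete, here is a small sketch under stated assumptions: `SketchDoc`, `EmptyCheckSketch`, and the `isEmpty` rule (null or whitespace-only) are hypothetical and only approximate whatever check the indexer actually applies. It shows why a check on the raw JSON never fires while the same check on the extracted contents does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical pair of views onto one document.
class SketchDoc {
  final String raw;
  final String contents;

  SketchDoc(String raw, String contents) {
    this.raw = raw;
    this.contents = contents;
  }
}

class EmptyCheckSketch {
  // Assumed emptiness rule: null or whitespace-only.
  static boolean isEmpty(String s) {
    return s == null || s.trim().isEmpty();
  }

  // Old behaviour as described: the check sees the raw JSON.
  static long countEmptyByRaw(List<SketchDoc> docs) {
    return docs.stream().filter(d -> isEmpty(d.raw)).count();
  }

  // New behaviour as described: the check sees the extracted contents.
  static long countEmptyByContents(List<SketchDoc> docs) {
    return docs.stream().filter(d -> isEmpty(d.contents)).count();
  }

  public static void main(String[] args) {
    List<SketchDoc> docs = Arrays.asList(
        new SketchDoc("{\"contents\":[{\"content\":\"Some text\"}]}", "Some text"),
        new SketchDoc("{\"contents\":[]}", "")); // valid JSON, but no article text
    System.out.println(countEmptyByRaw(docs));      // prints 0
    System.out.println(countEmptyByContents(docs)); // prints 1
  }
}
```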

@lintool lintool requested review from edwinzhng and nikhilro April 11, 2020 20:43

codecov bot commented Apr 11, 2020

Codecov Report

Merging #1092 into master will increase coverage by 0.47%.
The diff coverage is 51.78%.

@@             Coverage Diff              @@
##             master    #1092      +/-   ##
============================================
+ Coverage     44.81%   45.29%   +0.47%     
- Complexity      664      673       +9     
============================================
  Files           141      141              
  Lines          8184     8188       +4     
  Branches       1166     1167       +1     
============================================
+ Hits           3668     3709      +41     
+ Misses         4187     4147      -40     
- Partials        329      332       +3     
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...ndex/generator/DefaultLuceneDocumentGenerator.java | 54.83% <0.00%> (-3.79%) | 2.00 <0.00> (ø) |
| ...erini/index/generator/WashingtonPostGenerator.java | 34.48% <0.00%> (+18.60%) | 0.00 <0.00> (ø) |
| ...arch/topicreader/BackgroundLinkingTopicReader.java | 19.08% <0.00%> (-0.15%) | 9.00 <0.00> (ø) |
| .../main/java/io/anserini/index/IndexReaderUtils.java | 45.94% <55.55%> (+0.49%) | 21.00 <1.00> (+1.00) |
| .../anserini/collection/WashingtonPostCollection.java | 78.40% <76.66%> (+6.40%) | 2.00 <0.00> (ø) |
| ...c/main/java/io/anserini/search/SimpleSearcher.java | 53.44% <100.00%> (+0.26%) | 26.00 <1.00> (+1.00) |
| ...anserini/ltr/feature/base/PMIFeatureExtractor.java | 86.53% <0.00%> (+1.92%) | 13.00% <0.00%> (+1.00%) |
| ...java/io/anserini/ltr/feature/CountBigramPairs.java | 93.50% <0.00%> (+10.38%) | 35.00% <0.00%> (+6.00%) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 585229f...e2e9df8.

@edwinzhng edwinzhng left a comment

LGTM! Surprised the regression numbers dropped from such a small change

protected WashingtonPostObject obj;

protected String fullCaption = null;
protected String kicker = null;

What's a kicker?

It's one of the fields in the WashingtonPost corpus; see the TREC News guidelines: http://trec-news.org/guidelines-2020.pdf

... we decree that articles from the "Opinion", "Letters to the Editor", or "The Post's View" sections, as labeled in the "kicker" field, are not relevant
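As a rough illustration of how the quoted rule could be applied to the `kicker` field, the sketch below filters on the three section labels named in the guidelines. `KickerFilterSketch` and `isExcluded` are hypothetical names, and exact string matching after trimming is an assumption about how the labels appear in the corpus.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class KickerFilterSketch {
  // Section labels taken from the guideline text quoted above.
  private static final Set<String> EXCLUDED = new HashSet<>(
      Arrays.asList("Opinion", "Letters to the Editor", "The Post's View"));

  // Assumption: kicker values match these labels exactly after trimming.
  static boolean isExcluded(String kicker) {
    return kicker != null && EXCLUDED.contains(kicker.trim());
  }

  public static void main(String[] args) {
    System.out.println(isExcluded("Opinion"));  // prints true
    System.out.println(isExcluded("Politics")); // prints false
    System.out.println(isExcluded(null));       // prints false
  }
}
```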

@lintool lintool merged commit 35f9f82 into master Apr 12, 2020
@lintool lintool deleted the refactoring branch April 12, 2020 11:35
crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022