
Add new test cases including more global stories #29

Merged
merged 13 commits into adbar:master on Aug 3, 2021

Conversation

@rahulbot (Contributor) commented Jul 7, 2021

This adds a new set of test cases based on a global random sample of 200 articles from the Media Cloud dataset (related to #8). We currently use our own date_guesser library and are evaluating switching to htmldate.

This new corpus includes 200 articles discovered via a search of stories from 2020 in the Media Cloud corpus. The mix of countries of origin and languages is representative of the ~60k sources we ingest from every day.

The htmldate code still performs well against this new test corpus:

Name                    Precision    Recall    Accuracy    F-score
--------------------  -----------  --------  ----------  ---------
htmldate extensive       0.755102  0.973684       0.74    0.850575
htmldate fast            0.769663  0.861635       0.685   0.813056
newspaper                0.671141  0.662252       0.5     0.666667
newsplease               0.736527  0.788462       0.615   0.76161
articledateextractor     0.72973   0.675          0.54    0.701299
date_guesser             0.686567  0.582278       0.46    0.630137
goose                    0.75      0.508772       0.435   0.606272

A few notes:

  • We changed comparison.py to load test data from .json files so the test data is isolated from the code itself.
  • The new set of stories and dates is in tests/eval_mediacloud_2020.json, with the HTML cached in tests/eval.
  • The evaluation results are now printed via the tabulate module and saved to the file system (see the sketch after this list).
  • Perhaps the two evaluation sets should be merged into one larger one? Or should their scores be combined? We weren't sure how to approach this.
  • It is interesting to note that overall the precision scores were lower against this corpus (more false positives), while recall was actually slightly better against this set (fewer false negatives).
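For illustration, a minimal sketch of the JSON-based loading and tabulate output described above. The file path and record structure are assumptions, the scores are simply the numbers reported in the table, and this is not the actual comparison.py code:

import json
from tabulate import tabulate

# Load the cached test cases from the new JSON file (structure assumed:
# records keyed by URL with the expected publication date).
with open("tests/eval_mediacloud_2020.json", "r", encoding="utf-8") as infile:
    EVAL_PAGES = json.load(infile)

# Scores would normally be computed by running each extractor over EVAL_PAGES;
# the rows below just reuse two of the results reported above.
results = [
    ("htmldate extensive", 0.755, 0.974, 0.740, 0.851),
    ("htmldate fast", 0.770, 0.862, 0.685, 0.813),
]
print(tabulate(results, headers=["Name", "Precision", "Recall", "Accuracy", "F-score"]))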

We hope this contribution helps document the performance of the various libraries against a more global dataset.

@codecov-commenter commented Jul 7, 2021

Codecov Report

Merging #29 (d85f893) into master (d6d34d3) will not change coverage.
The diff coverage is n/a.

❗ Current head d85f893 differs from pull request most recent head 628d623. Consider uploading reports for the commit 628d623 to get more accurate results

@@           Coverage Diff           @@
##           master      #29   +/-   ##
=======================================
  Coverage   92.36%   92.36%           
=======================================
  Files           7        7           
  Lines         943      943           
=======================================
  Hits          871      871           
  Misses         72       72           


@adbar (Owner) commented Jul 9, 2021

Hi @rahulbot, thank you for your work!
I'll go through the code first to see if the pull request can be accepted as is.

I'm glad you tried date extraction on a more diverse dataset and still consider switching to htmldate!
Is there a reason why you didn't measure execution speed?

@adbar (Owner) commented Jul 13, 2021

Hi @rahulbot @coreydockser,

Just to be clear: I'd like to accept your pull request, but it would be better to keep a rough comparison of execution speed. You deleted it in the new version of comparison.py.

Could you please time the execution and add a ratio as originally implemented, or could you keep the old comparison as a legacy file?

@rahulbot (Contributor, Author)

Yes - @coreydockser is working on adding the execution speed measurement back in. We should be able to update this PR once he has fixed that.
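A minimal sketch of one way the timing could be measured and reported as a ratio; the helper and variable names here are placeholders, not the actual comparison.py functions:

import time

def time_extractor(extract, pages):
    # Run one extraction function over every cached HTML page and return elapsed seconds.
    start = time.perf_counter()
    for html in pages:
        extract(html)
    return time.perf_counter() - start

# Hypothetical usage: `extractors` maps library names to extraction callables
# and `cached_pages` holds the cached HTML documents.
# timings = {name: time_extractor(func, cached_pages) for name, func in extractors.items()}
# baseline = timings["htmldate extensive"]
# ratios = {name: duration / baseline for name, duration in timings.items()}  # e.g. 1.00x, 2.02x, ...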

Commit: "… comparison for htmldate extensive, since that's the one we'll be using"
@rahulbot (Contributor, Author)

New results, with timing against our global story test set:

Name                    Precision    Recall    Accuracy    F-score  Time (Relative to htmldate extensive)    Time (Relative to htmldate fast)
--------------------  -----------  --------  ----------  ---------  ---------------------------------------  ----------------------------------
htmldate extensive          0.753     0.935       0.715   0.833819  1.00x                                    2.02x
htmldate fast               0.763     0.830       0.660   0.795181  0.50x                                    1.00x
newspaper                   0.671     0.662       0.500   0.666667  4.61x                                    9.30x
newsplease                  0.737     0.788       0.615   0.76161   8.40x                                    16.93x
articledateextractor        0.730     0.675       0.540   0.701299  1.29x                                    2.61x
date_guesser                0.687     0.582       0.460   0.630137  5.29x                                    10.67x
goose                       0.746     0.497       0.425   0.596491  3.20x                                    6.45x

@adbar (Owner) commented Jul 15, 2021

Hi @rahulbot @coreydockser,

Thanks for the changes! There is only one question left: should the two datasets be merged into one?

I believe so; what do you think? We could do the merge in this PR as well, so feel free to implement it :)

@rahulbot (Contributor, Author)

Sure - we wanted to get your opinion on that, so we left it open as a possible path forward. @coreydockser, perhaps you could have an array of files, then load and merge all of them before running tests? That would be a solution that easily allows for multiple test sets but still acknowledges that they came from different sources. I.e., something like this sketch:

import json

eval_files = [
    "eval_mediacloud_2020.json",  # 200 random stories from Media Cloud 2020
    "eval_default.json",          # original, mostly German test set (exact file name may differ)
]

EVAL_PAGES = []
for file_name in eval_files:
    with open(file_name, "r", encoding="utf-8") as infile:
        # assuming each file holds a JSON array of test records;
        # use dict.update() instead if the files are URL-keyed objects
        EVAL_PAGES.extend(json.load(infile))
# now EVAL_PAGES combines the test story data from all files in eval_files

Commit: "…f a single JSON file. Also commented out newspaper3k because it does not work with the original dataset (though it does work with the mediacloud dataset)."
@adbar (Owner) commented Aug 3, 2021

@coreydockser @rahulbot Alright, thanks!

@adbar merged commit e0f2e2b into adbar:master on Aug 3, 2021