
Add new test cases including more global stories #29

Merged
merged 13 commits into adbar:master on Aug 3, 2021

Conversation

@rahulbot (Contributor) commented Jul 7, 2021

This adds a new set of test cases based on a global random sample of 200 articles from the Media Cloud dataset (related to #8). We currently use our own date_guesser library and are evaluating switching to htmldate.

This new corpus includes 200 articles discovered via a search of stories from 2020 in the Media Cloud corpus. The mix of countries of origin and languages is representative of the ~60k sources we ingest from every day.

The htmldate code still performs well against this new test corpus:

Name                    Precision    Recall    Accuracy    F-score
--------------------  -----------  --------  ----------  ---------
htmldate extensive       0.755102  0.973684       0.74    0.850575
htmldate fast            0.769663  0.861635       0.685   0.813056
newspaper                0.671141  0.662252       0.5     0.666667
newsplease               0.736527  0.788462       0.615   0.76161
articledateextractor     0.72973   0.675          0.54    0.701299
date_guesser             0.686567  0.582278       0.46    0.630137
goose                    0.75      0.508772       0.435   0.606272

A few notes:

  • We changed comparison.py to load test data from .json files so the test data is isolated from the code itself.
  • The new set of stories and dates is in tests/eval_mediacloud_2020.json, with the HTML cached in tests/eval.
  • The evaluation results are now printed via the tabulate module and saved to the file system (see the sketch after this list).
  • Perhaps the two evaluation sets should be merged into one larger one? Or should their scores be combined? We weren't sure how to approach this.
  • It is interesting to note that overall the precision scores were lower against this corpus (more false positives), while recall was actually slightly better against this set (fewer false negatives).
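For illustration, a minimal sketch of the JSON-based loading and tabulate output described above. The file path and record structure are assumptions, the scores are simply the numbers reported in the table, and this is not the actual comparison.py code:

import json
from tabulate import tabulate

# Load the cached test cases from the new JSON file (structure assumed:
# records keyed by URL with the expected publication date).
with open("tests/eval_mediacloud_2020.json", "r", encoding="utf-8") as infile:
    EVAL_PAGES = json.load(infile)

# Scores would normally be computed by running each extractor over EVAL_PAGES;
# the rows below just reuse two of the results reported above.
results = [
    ("htmldate extensive", 0.755, 0.974, 0.740, 0.851),
    ("htmldate fast", 0.770, 0.862, 0.685, 0.813),
]
print(tabulate(results, headers=["Name", "Precision", "Recall", "Accuracy", "F-score"]))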

We hope this contribution helps document the performance of the various libraries against a more global dataset.

@codecov-commenter commented Jul 7, 2021

Codecov Report

Merging #29 (d85f893) into master (d6d34d3) will not change coverage.
The diff coverage is n/a.

❗ Current head d85f893 differs from pull request most recent head 628d623. Consider uploading reports for the commit 628d623 to get more accurate results

@@           Coverage Diff           @@
##           master      #29   +/-   ##
=======================================
  Coverage   92.36%   92.36%           
=======================================
  Files           7        7           
  Lines         943      943           
=======================================
  Hits          871      871           
  Misses         72       72           


@adbar (Owner) commented Jul 9, 2021

Hi @rahulbot, thank you for your work!
I'll go through the code first to see if the pull request can be accepted as is.

I'm glad you tried date extraction on a more diverse dataset and still consider switching to htmldate!
Is there a reason why you didn't measure execution speed?

@adbar (Owner) commented Jul 13, 2021

Hi @rahulbot @coreydockser,

Just to be clear: I'd like to accept your pull request, but it would be better to keep a rough comparison of execution speed. You deleted it in the new version of comparison.py.

Could you please time the execution and add a ratio as originally implemented, or could you keep the old comparison as a legacy file?

@rahulbot (Contributor, Author)

Yes - @coreydockser is working on adding the execution speed measurement back in. We should be able to update this PR once he has fixed that.
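A minimal sketch of one way the timing could be measured and reported as a ratio; the helper and variable names here are placeholders, not the actual comparison.py functions:

import time

def time_extractor(extract, pages):
    # Run one extraction function over every cached HTML page and return elapsed seconds.
    start = time.perf_counter()
    for html in pages:
        extract(html)
    return time.perf_counter() - start

# Hypothetical usage: `extractors` maps library names to extraction callables
# and `cached_pages` holds the cached HTML documents.
# timings = {name: time_extractor(func, cached_pages) for name, func in extractors.items()}
# baseline = timings["htmldate extensive"]
# ratios = {name: duration / baseline for name, duration in timings.items()}  # e.g. 1.00x, 2.02x, ...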

Commit: "… comparison for htmldate extensive, since that's the one we'll be using"
@rahulbot (Contributor, Author)

New results, with timing against our global story test set:

Name                    Precision    Recall    Accuracy    F-score  Time (Relative to htmldate extensive)    Time (Relative to htmldate fast)
--------------------  -----------  --------  ----------  ---------  ---------------------------------------  ----------------------------------
htmldate extensive          0.753     0.935       0.715   0.833819  1.00x                                    2.02x
htmldate fast               0.763     0.830       0.660   0.795181  0.50x                                    1.00x
newspaper                   0.671     0.662       0.500   0.666667  4.61x                                    9.30x
newsplease                  0.737     0.788       0.615   0.76161   8.40x                                    16.93x
articledateextractor        0.730     0.675       0.540   0.701299  1.29x                                    2.61x
date_guesser                0.687     0.582       0.460   0.630137  5.29x                                    10.67x
goose                       0.746     0.497       0.425   0.596491  3.20x                                    6.45x

@adbar (Owner) commented Jul 15, 2021

Hi @rahulbot @coreydockser,

Thanks for the changes! There is only one question left: should the two datasets be merged into one?

I believe so; what do you think? We could do the merge in this PR as well, so feel free to implement it :)

@rahulbot (Contributor, Author)

Sure - we wanted to get your opinion on that, so we left it open as a possible path forward. @coreydockser, perhaps you could have an array of files, then load and merge all of them before running tests? That would be a solution that easily allows for multiple test sets but still acknowledges that they came from different sources. I.e., something like this sketch:

import json

eval_files = [
    "eval_mediacloud_2020.json",  # 200 random stories from Media Cloud 2020
    "eval_default.json",          # original, mostly German test set (exact file name may differ)
]

EVAL_PAGES = []
for file_name in eval_files:
    with open(file_name, "r", encoding="utf-8") as infile:
        # assuming each file holds a JSON array of test records;
        # use dict.update() instead if the files are URL-keyed objects
        EVAL_PAGES.extend(json.load(infile))
# now EVAL_PAGES combines the test story data from all files in eval_files

Commit: "…f a single JSON file. Also commented out newspaper3k because it does not work with the original dataset (though it does work with the mediacloud dataset)."
@adbar (Owner) commented Aug 3, 2021

@coreydockser @rahulbot Alright, thanks!

@adbar merged commit e0f2e2b into adbar:master on Aug 3, 2021