Add new test cases including more global stories #29
Conversation
…ts.txt from the repo, and removed the comment saying newspaper doesn't work.
Codecov Report
@@           Coverage Diff           @@
##           master      #29   +/-   ##
=======================================
  Coverage   92.36%   92.36%
=======================================
  Files           7        7
  Lines         943      943
=======================================
  Hits          871      871
  Misses         72       72

Continue to review full report at Codecov.
Hi @rahulbot, thank you for your work! I'm glad you tried date extraction on a more diverse dataset and still consider switching to htmldate!
Just to be clear: I'd like to accept your pull request, but it would be better to keep a rough comparison of execution speed. You deleted it in the new version. Could you please time the execution and add a ratio as originally implemented, or could you keep the old comparison as a legacy file?
Yes - @coreydockser is working on adding the execution speed measurement back in. We should be able to update this PR once he has fixed that.
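As a rough illustration of the kind of timing being discussed, here is a small sketch that times htmldate's documented `find_date` function over a corpus of pages; the tiny sample corpus and the ratio note at the end are illustrative only, not code from this repository:

```python
import time
from htmldate import find_date  # htmldate's documented extraction function

# A tiny stand-in corpus; the real comparison would run over the cached test pages.
EVAL_PAGES = [
    "<html><head><meta name='date' content='2020-05-01'/></head><body></body></html>",
]

def time_extractor(extract, pages):
    """Run a date-extraction callable over every page and return elapsed seconds."""
    start = time.perf_counter()
    for html in pages:
        extract(html)
    return time.perf_counter() - start

htmldate_seconds = time_extractor(find_date, EVAL_PAGES)
print("htmldate: %.4f s over %d pages" % (htmldate_seconds, len(EVAL_PAGES)))

# Timing a second library the same way and dividing the two numbers gives the
# execution-speed ratio being requested, e.g.:
#   ratio = other_library_seconds / htmldate_seconds
```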
… comparison for htmldate extensive, since that's the one we'll be using
New results, with timing against our global story test set:
…ed work on writing unit_tests
Thanks for the changes! There is only one question left: should the two datasets be merged into one? I believe so; what do you think? We could do the merge in this PR as well; feel free to implement it :)
Sure - we wanted to get your opinion on that so we left it open as a possible path forward. @coreydockser perhaps you could have an array of files, then load and merge all of them before running tests? That would be a solution that easily allows for multiple test sets but still acknowledges that they came from different sources. I.e. something like this pseudo-code:

import json

eval_files = [
    "eval_mediacloud_2020.json",  # 200 random stories from Media Cloud 2020
    "eval_default",               # original, mostly German test set
]

EVAL_PAGES = []
for f in eval_files:
    # load the file's stories
    with open(f) as infile:
        stories = json.load(infile)
    # merge with `EVAL_PAGES` (use a dict and .update() instead if the JSON maps URLs to expected dates)
    EVAL_PAGES.extend(stories)

# now EVAL_PAGES is one big list of all the test story data from `eval_files` combined
…f a single JSON file. Also commented out newspaper3k because it does not work with the original dataset (though it does work with the mediacloud dataset).
@coreydockser @rahulbot Alright, thanks!
This adds a new set of test cases based on a global random sample of 200 articles from the Media Cloud dataset (related to #8). We currently use our own `date_guesser` library and are evaluating switching to the `htmldate` library. This new corpus includes 200 articles discovered via a search of stories from 2020 in the Media Cloud corpus. The set of countries of origin, and languages, is representative of the ~60k sources we ingest from every day.

The `htmldate` code still performs well against this new test corpus.

A few notes:

- We refactored `comparison.py` to load test data from `.json` files so the test data is isolated from the code itself.
- The new test data is in `test/eval_mediacloud_2020.json`, with HTML cached in `tests/eval`.
- Results are formatted with the `tabulate` module, and saved to the file system.

We hope this contribution helps document the performance of the various libraries against a more global dataset.
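The notes above mention loading the evaluation data from JSON and writing the comparison results out with the `tabulate` module. Below is a minimal sketch of that flow, assuming the file path from the notes; the result fields and output filename are illustrative assumptions, not the PR's actual code:

```python
import json
from tabulate import tabulate  # third-party package used for the results table

# Load the new evaluation set; the path comes from the notes above.
with open("test/eval_mediacloud_2020.json") as infile:
    eval_pages = json.load(infile)

# Hypothetical summary rows; the real comparison script would fill these in by
# running each date-extraction library over the cached HTML in tests/eval.
rows = [
    # (library, pages evaluated, errors, seconds) -- placeholder values
    ("htmldate", len(eval_pages), None, None),
    ("date_guesser", len(eval_pages), None, None),
]

table = tabulate(rows, headers=["library", "pages", "errors", "time (s)"])
print(table)

# Save the formatted results to the file system; the filename is illustrative.
with open("comparison_results.txt", "w") as outfile:
    outfile.write(table)
```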