This repository contains the evaluation code and dataset to reproduce the results from the paper "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions".
Fundus is a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code.
In the following sections, we provide instructions to reproduce the comparative evaluation of Fundus against prominent scraping libraries. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than comparable news scrapers. For a more in-depth overview of Fundus, the evaluation practices, and its results, consult the result summary and our paper.
Fundus and this evaluation repository require Python 3.8 or later and Java for the Boilerpipe scraper. (Note: The evaluation was tested and performed using Python 3.8 and Java JDK 17.0.10.)
To install the fundus-evaluation Python package, including the reference scraper dependencies, clone this GitHub repository and install the package using pip:
git clone https://github.com/dobbersc/fundus-evaluation.git
pip install ./fundus-evaluation
This installation also contains the dataset and evaluation results.
If you are only interested in the Python package (without the dataset and evaluation results), install the fundus-evaluation package directly from GitHub using pip:
pip install git+https://github.com/dobbersc/fundus-evaluation.git@master
Verify the installation by running evaluate --version, with the expected output of evaluate <version>, where <version> specifies the current version of the evaluation package.
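A minimal sketch of how this check could be scripted, assuming the output has exactly the form evaluate <version> described above. The helper names parse_version_output and installed_version are hypothetical and not part of the package:

```python
import re
import subprocess
from typing import Optional


def parse_version_output(output: str) -> Optional[str]:
    """Return the version from text shaped like 'evaluate <version>', else None."""
    match = re.fullmatch(r"evaluate\s+(\S+)", output.strip())
    return match.group(1) if match else None


def installed_version() -> Optional[str]:
    """Run 'evaluate --version' and parse its output (hypothetical helper).

    Raises FileNotFoundError if the 'evaluate' command is not on the PATH.
    """
    result = subprocess.run(
        ["evaluate", "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_version_output(result.stdout)
```

Calling installed_version() after installation should return a version string. A return value of None would indicate that the output did not match the assumed format.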
For development, install the package, including the development dependencies:
git clone https://github.com/dobbersc/fundus-evaluation.git
pip install -e "./fundus-evaluation[dev]"
In the following steps, we assume that the current working directory is the root of the repository.
To fully reproduce the evaluation results, only the dataset is required.
Each step in the evaluation pipeline requires the outputs from the previous step (dataset -> scrape -> score -> analysis).
To ease reproducibility, we also provide the artifacts of intermediate steps in the dataset folder.
Therefore, the pipeline may be started from any step.
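The dependency chain above can be sketched as a small lookup. This is only an illustration of the ordering. The names PIPELINE, remaining_steps, and required_input are hypothetical and do not exist in the package:

```python
from typing import List, Optional

# Order taken from the description above: dataset -> scrape -> score -> analysis.
PIPELINE: List[str] = ["dataset", "scrape", "score", "analysis"]


def remaining_steps(start: str) -> List[str]:
    """Return the steps that still have to run when starting at `start`."""
    if start not in PIPELINE:
        raise ValueError(f"unknown pipeline step: {start!r}")
    return PIPELINE[PIPELINE.index(start):]


def required_input(step: str) -> Optional[str]:
    """Return the step whose output `step` consumes, or None for the first step."""
    index = PIPELINE.index(step)
    return PIPELINE[index - 1] if index > 0 else None
```

For example, starting at the score step needs only the scrape artifacts, which are already provided in the dataset folder.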
The evaluation results may be reproduced using the package's command line interface (CLI), representing the evaluation pipeline steps:
$ evaluate --help
usage: evaluate [-h] [--version] {complexity,scrape,score,analysis} ...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Fundus News Scraper Evaluation:
  select evaluation pipeline step

  {complexity,scrape,score,analysis}
    complexity          calculate page complexity scores
    scrape              scrape extractions on the evaluation dataset
    score               calculate evaluation scores
    analysis            generate tables and plots
Each entry point also provides its own help page, e.g. with evaluate scrape --help.
As an alternative to the CLI, we provide direct Python entry points in fundus_evaluation.entry_points.
In the following steps, we will use the CLI.
We selected the 16 English-language publishers Fundus currently supports as the data source, and retrieved five articles for each publisher from the respective RSS feeds/sitemaps. The selection process yielded an evaluation corpus of 80 news articles. From it, we manually extracted the plain text from each article and stored it together with information on the original paragraph structure.
The resulting evaluation dataset is included in this repository and consists of the (compressed) HTML article files and their ground truth extractions as JSON.
Execute the following command to let all supported scrapers extract the plain text of the evaluation dataset's articles:
evaluate scrape \
--ground-truth-path dataset/ground_truth.json \
--html-directory dataset/html/ \
--output-directory dataset/extractions/
To restrict the scrapers that are part of the evaluation,
- use the --scrapers option to explicitly specify a list of evaluation scrapers,
- or use the --exclude-scrapers option to exclude scrapers from the evaluation.
E.g. to exclude BoilerNet, as this scraper is very resource intensive, add the --exclude-scrapers boilernet argument to the command above.
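When driving the scrape step from a script, the command line above can be assembled programmatically. The following is a sketch, not part of the package: build_scrape_command is a hypothetical helper, and passing several scraper names as space-separated values after --exclude-scrapers is an assumption (only the single-value form is shown above).

```python
from typing import List, Optional, Sequence


def build_scrape_command(
    ground_truth_path: str = "dataset/ground_truth.json",
    html_directory: str = "dataset/html/",
    output_directory: str = "dataset/extractions/",
    exclude_scrapers: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble the 'evaluate scrape' argument list (hypothetical helper)."""
    command = [
        "evaluate",
        "scrape",
        "--ground-truth-path", ground_truth_path,
        "--html-directory", html_directory,
        "--output-directory", output_directory,
    ]
    if exclude_scrapers:
        # Assumption: multiple names are accepted as separate values.
        command += ["--exclude-scrapers", *exclude_scrapers]
    return command
```

The resulting list can be handed to subprocess.run(build_scrape_command(exclude_scrapers=["boilernet"]), check=True) from the repository root.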
To evaluate the extraction results with the three supported metrics (paragraph match, ROUGE-LSum and WER), run the following command:
evaluate score \
--ground-truth-path dataset/ground_truth.json \
--extractions-directory dataset/extractions/ \
--output-directory dataset/scores/
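To give an intuition for two of these metrics, here is a simplified sketch of an LCS-based ROUGE-L score and of the word error rate (WER) on whitespace tokens. It is not the implementation used by this package: ROUGE-LSum additionally works on the sentence level, and the tokenization and normalization used in the evaluation may differ. The function names are hypothetical.

```python
from typing import Dict, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference: str, hypothesis: str) -> Dict[str, float]:
    """Simplified ROUGE-L precision, recall and F1 on whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    precision = lcs / len(hyp) if hyp else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed
    # reference prefix and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

An extraction that drops the second half of an article keeps a high precision but loses recall, while an extraction that includes boilerplate shows the opposite pattern.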
This step is not part of the evaluation in our paper and is thus optional.
Execute the following command to calculate the page complexity scores established in "An Empirical Comparison of Web Content Extraction Algorithms" (Bevendorff et al., 2023):
evaluate complexity \
--ground-truth-path dataset/ground_truth.json \
--html-directory dataset/html/ \
--output-path dataset/complexity.tsv
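Roughly speaking, page complexity in the sense of Bevendorff et al. (2023) measures how much of a page is not main content. The sketch below assumes the score is one minus the fraction of page tokens that belong to the ground truth. This reading, the whitespace tokenization, and the function name page_complexity are assumptions and have not been checked against the implementation in this package.

```python
def page_complexity(page_text: str, ground_truth_text: str) -> float:
    """Assumed definition: share of page tokens outside the ground truth."""
    total_tokens = len(page_text.split())
    if total_tokens == 0:
        raise ValueError("page_text must contain at least one token")
    # Clamp so that a ground truth longer than the page cannot go below 0.
    truth_tokens = min(len(ground_truth_text.split()), total_tokens)
    return 1.0 - truth_tokens / total_tokens
```

Under this reading, a page consisting only of article text scores 0, and a page dominated by navigation, ads, and comments approaches 1.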
Run the following command to produce the paper's tables and plots for the ROUGE-LSum score:
evaluate analysis --rouge-lsum-path dataset/scores/rouge_lsum.tsv --output-directory dataset/analysis/
To also produce a boxplot of the page complexity, execute:
evaluate analysis --complexity-path dataset/complexity.tsv --output-directory dataset/analysis/
The following table summarizes the overall performance of Fundus and evaluated scrapers in terms of averaged ROUGE-LSum precision, recall and F1-score and their standard deviation. In addition, we provide the scrapers' versions at their evaluation time. The table is sorted in descending order over the F1-score:
Scraper | Precision | Recall | F1-Score | Version |
---|---|---|---|---|
Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 | 0.4.1 |
Trafilatura | 93.91±12.89 | 96.85±15.69 | 93.62±16.73 | 1.12.0 |
news-please | 97.95±10.08 | 91.89±16.15 | 93.39±14.52 | 1.6.13 |
BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.1 |
BoilerNet | 85.96±18.55 | 91.21±19.15 | 86.52±18.03 | / |
Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
Previous Results
Scraper | Precision | Recall | F1-Score | Version |
---|---|---|---|---|
Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 | 0.2.2 |
Trafilatura | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 | 1.7.0 |
BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.0 |
news-please | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 | 1.5.44 |
BoilerNet | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 | / |
Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
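The mean±standard deviation entries and the F1 ordering in the tables above can be reproduced from per-article scores along the following lines. This is a sketch under assumptions: per-article scores are taken to lie in [0, 1], the population standard deviation is used (the evaluation may use the sample standard deviation instead), and the names summarize and rank_by_f1 are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def summarize(scores: Sequence[float]) -> str:
    """Format scores in [0, 1] as 'mean±std' in percent with two decimals."""
    return f"{100 * mean(scores):.2f}±{100 * pstdev(scores):.2f}"


def rank_by_f1(f1_scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return scraper names sorted by mean F1-score, best first."""
    return sorted(f1_scores, key=lambda name: mean(f1_scores[name]), reverse=True)
```

The large standard deviations in the tables reflect that a scraper can fail completely on individual articles while performing well on most others.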
We encourage contributions, particularly those involving competitive news scrapers. For example, you can contribute by:
- Submitting a New Scraper: Open an issue or submit a pull request to incorporate your scraper into our evaluation pipeline. We will review and integrate new submissions as appropriate.
- Updating an Existing Scraper: Please inform us if a supported scraper has undergone significant updates. We are open to re-evaluating our results accordingly. (Previous evaluation results are available on our Release Page.)
Note: We also appreciate contributions to the Fundus library!
Please open an issue for unresolved questions about our paper or the evaluation in this repository. For questions about the general functionality of Fundus or for bug reports, please refer to our main repository and submit an issue there.
Please cite the following paper when using Fundus or building upon our work:
@inproceedings{dallabetta-etal-2024-fundus,
title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
author = "Dallabetta, Max and
Dobberstein, Conrad and
Breiding, Adrian and
Akbik, Alan",
editor = "Cao, Yixin and
Feng, Yang and
Xiong, Deyi",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-demos.29",
pages = "305--314",
abstract = "This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.",
}
- This repository's architecture has been inspired by the web content extraction benchmark (Bevendorff et al., 2023).
- Since BoilerNet has no Python package on PyPI, we adopted a stripped-down version of the upstream BoilerNet provided by Bevendorff et al. from their web content extraction benchmark.
- Similarly, BTE has no Python package on PyPI. Here, we used the implementation by Jan Pomikalek found from this and this source.