Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update our Paper Citation to ACL #12

Merged
merged 1 commit into from
Aug 14, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 20 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
# Fundus News Scraper Evaluation

This repository contains the evaluation code and dataset to reproduce the results from the [paper](https://arxiv.org/abs/2403.15279) "FUNDUS: A Simple-to-Use News Scraper Optimized for High Quality Extractions".
This repository contains the evaluation code and dataset to reproduce the results from the [paper](https://aclanthology.org/2024.acl-demos.29/) "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions".

[Fundus](https://github.com/flairNLP/fundus) is a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code.

In the following sections, we provide instructions to reproduce the comparative evaluation of Fundus against prominent scraping libraries.
Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than comparable news scrapers.
For a more in-depth overview of Fundus, the evaluation practises, and its results, consult the [result summary](https://github.com/dobbersc/fundus-evaluation/tree/master?tab=readme-ov-file#results) and our [paper](https://arxiv.org/abs/2403.15279).
For a more in-depth overview of Fundus, the evaluation practises, and its results, consult the [result summary](https://github.com/dobbersc/fundus-evaluation/tree/master?tab=readme-ov-file#results) and our [paper](https://aclanthology.org/2024.acl-demos.29/).

## Prerequisites
Fundus and this evaluation repository require Python 3.8 or later and Java for the Boilerpipe scraper.
Expand Down Expand Up @@ -153,16 +153,26 @@ The table is sorted in descending order over the F1-score:

## Cite

Please cite the following [paper](https://arxiv.org/abs/2403.15279) when using Fundus or building upon our work:
Please cite the following [paper](https://aclanthology.org/2024.acl-demos.29/) when using Fundus or building upon our work:

```bibtex
@misc{dallabetta2024fundus,
title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions},
author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
year={2024},
eprint={2403.15279},
archivePrefix={arXiv},
primaryClass={cs.CL}
@inproceedings{dallabetta-etal-2024-fundus,
title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
author = "Dallabetta, Max and
Dobberstein, Conrad and
Breiding, Adrian and
Akbik, Alan",
editor = "Cao, Yixin and
Feng, Yang and
Xiong, Deyi",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-demos.29",
pages = "305--314",
abstract = "This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work.The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.",
}
```

Expand Down