Releases: flairNLP/fundus
v0.4.6
🚨 Hotfix release for CCNewsCrawler 🚨
With the newly added xpath_search in version 0.4.5, some parsers generated unpicklable extractions, crashing the CCNewsCrawler when piping back to the main thread and thus rendering the crawler unusable. This issue is now fixed with #655.
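As a general illustration of this failure mode: when Python objects are handed between processes, they are serialized with pickle, and an object holding an unpicklable member fails at that point. The following is a minimal sketch using only the standard library. `Extraction`, `SafeExtraction`, and `can_pickle` are hypothetical names chosen for this example, and the approach shown is not the actual fix from #655.

```python
import pickle


class Extraction:
    """Hypothetical result object holding a non-picklable helper."""

    def __init__(self, title):
        self.title = title
        # Lambdas cannot be pickled because they have no importable name.
        self.search = lambda key: key == "title"


class SafeExtraction(Extraction):
    """Same data, but the helper is dropped before pickling."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("search", None)  # keep only plain, picklable data
        return state


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


broken = can_pickle(Extraction("Headline"))
fixed = can_pickle(SafeExtraction("Headline"))
restored = pickle.loads(pickle.dumps(SafeExtraction("Headline")))
```

Dropping the helper in `__getstate__` is only one possible design. Replacing the lambda with a module-level function would keep the behaviour available after unpickling.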
Updated parsers
- Fix paragraph and subheadline selectors for MDR by @MaxDall in #648
- Fix BoersenZeitung by @addie9800 in #647
- Fix Merkur by @addie9800 in #654
- Fix Frankfurter Rundschau by @addie9800 in #652
- Update Stern parser by @MaxDall in #658
- Add RSSFeed to LeFigaro by @addie9800 in #657
Bug fixes
- Fix a bug with attribute defaults and add default_factory parameter by @MaxDall in #649
- Fix pickling problem in LinkedDataMapping by @addie9800 in #655
QoL
- Add additional space characters to normalize_whitespace by @MaxDall in #646
- Improve encoding detection by @MaxDall in #650
Full Changelog: v0.4.5...v0.4.6
v0.4.5
Important
This is a re-release of version 0.4.5 from 10/21/2024, as the package couldn't be published on PyPI.
New publishers for Japan and Spain and some maintenance 🔧
Publishers
New
We added two new publishers located in Japan (The Japan News/Yomiuri Shimbun) and one from Spain (El Pais)
- Add The Japan News by @addie9800 in #627
- Add Yomiuri Shimbun by @addie9800 in #628
- Add El Pais by @addie9800 in #632
Fixes
- Fix bug in author parsing in TheNamibian by @addie9800 in #619
- Fix Hessenschau by @addie9800 in #624
- Fix Focus by @addie9800 in #623
- Update Taz parser by @MaxDall in #642
- Handle author dict Bug by @addie9800 in #641
for DEVs
JSON+LD
We refactored our JSON and JSON-LD parser to be more robust and support multi-type LDs
- Cleaner code for LD and JSON parsing by @MaxDall in #625
- Handle multiple ld types by @addie9800 in #631
- Fix trailing whitespace issue by @addie9800 in #635
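For context on the multi-type case: in JSON-LD, the `@type` of a node may be either a single string or a list of strings. A minimal sketch of indexing nodes under every type they declare, using only the standard library, could look like the following. `index_by_type` and the sample data are hypothetical and do not reflect Fundus's actual parser.

```python
import json
from collections import defaultdict


def index_by_type(raw):
    """Group JSON-LD nodes by each type they declare."""
    data = json.loads(raw)
    nodes = data if isinstance(data, list) else [data]
    index = defaultdict(list)
    for node in nodes:
        types = node.get("@type", [])
        # Normalize the single-string form to a list.
        if isinstance(types, str):
            types = [types]
        for ld_type in types:
            index[ld_type].append(node)
    return dict(index)


raw_ld = """
[
  {"@type": ["NewsArticle", "Article"], "headline": "Example headline"},
  {"@type": "Person", "name": "Jane Doe"}
]
"""
index = index_by_type(raw_ld)
```

With this layout, a node that declares several types is reachable under each of them, so a lookup by `NewsArticle` and a lookup by `Article` return the same object.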
Deprecation
Bug fixes
Fixed a bug where using suppress as error handling would result in articles being skipped.
Full Changelog: v0.4.4...v0.4.5
v0.4.4
New publishers for India, Switzerland, and Australia
With this release, we added 3 new publishers, updated several existing ones, and added some QoL functionality for DEVs
Publishers
New
- IND: Bhaskar (@MaxDall in #605)
- CH: TagesAnzeiger (@MaxDall in #608)
- AU: TheWestAustralian (@MaxDall in #615)
Updates
- DE: SportSchau (@addie9800 in #611)
- FR: LesEchos is now deprecated (@MaxDall in #617)
- UK: TheTelegraph (@MaxDall in #616)
What's new?
We implemented XPath queries for LinkedDataMapping to search through the data at a finer granularity (@MaxDall in #614). Further, we now parse crawl-delays from publisher-given robots.txt files, which can be omitted through the crawler (@MaxDall in #609). Additionally, we ...
- Ignore robots.txt in coverage script by @MaxDall in #610
- Adjust generic_topic_parsing to return only unique topics by @MaxDall in #620
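Returning only unique topics while keeping their original order can be done with an insertion-ordered dict. This is a minimal sketch of that idea. `unique_topics` is a hypothetical helper and not the actual generic_topic_parsing implementation.

```python
def unique_topics(topics):
    """Drop duplicate topics while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(topics))


topics = unique_topics(["Politics", "Economy", "Politics", "Sports", "Economy"])
```

Unlike `set(topics)`, this keeps the order in which the topics first appeared.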
Bug fixes
Full Changelog: v0.4.3...v0.4.4
v0.4.3
Introducing New Publishers from Canada, Germany, and India 🚀
This release includes:
- Support for five new publishers (three from Canada, one from India, and one from Germany)
- Article filtering based on robots.txt
New Features
With this update, we've implemented article filtering using robots.txt. Each URL fetched is now evaluated against the path and user-agent restrictions specified by publishers in their robots.txt files. This feature is enabled by default, but users can disable it by setting ignore_robots=True in the Crawler constructor.
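To make the kind of check described above concrete, here is a minimal sketch that evaluates URLs against robots.txt rules with Python's standard `urllib.robotparser`. The robots.txt content, the user agent `example-bot`, and the URLs are made up for illustration, and whether Fundus uses this module internally is not stated in these notes.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; real files are served by the publisher.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Path restrictions are evaluated per user agent and URL.
allowed = parser.can_fetch("example-bot", "https://example.com/news/article.html")
blocked = parser.can_fetch("example-bot", "https://example.com/private/draft.html")
# Crawl-delay in seconds, or None if the file does not specify one.
delay = parser.crawl_delay("example-bot")
```

A crawler that respects these rules would skip the blocked URL and wait `delay` seconds between requests to the same publisher.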
New Publishers
Canada (CA)
- Introduced CBC as the first Canadian publisher by @addie9800 in #583
- Added NationalPost by @addie9800 in #584
- Included The Globe and Mail by @addie9800 in #587
India (IND)
- Added Times Of India by @addie9800 in #569
Germany (DE)
Updates
We've updated our APNews parser to accurately parse authors once more and applied additional fixes.
Bug Fixes
- Protected key access for RSSFeed entries by @MaxDall in #599
- Fixed an issue in test file generation by @addie9800 in #597
Full Changelog: v0.4.2...v0.4.3
v0.4.2
Maintenance release 🔧
- update existing parsers to address layout changes
- update evaluation results
- update the paper citation
Bug fixes
- Add mandatory set_name to logging documentation by @addie9800 in #586
- Catch ConnectionError when fetching RSS-Feeds by @MaxDall in #589
- Fix typing regarding new fastWARC version by @MaxDall in #591
Full Changelog: v0.4.1...v0.4.2
v0.4.1
New support for Norwegian publishers 🚀
Thanks to our community, we added 9 new publishers to Fundus:
NO
- VerdensGang
- Dagbladet
- Nettavisen
- NRK
UK
- Metro
- Express
- BBC
FR
- Les Echos
CH
- NZZ
New Publisher
- Add UK newspaper Metro by @MaxDall in #562
- Add French newspaper Les Echos by @MaxDall in #564
- Add Swiss newspaper NZZ by @MaxDall in #565
- Add UK newspaper Express by @MaxDall in #566
- Add The BBC as a UK publisher by @olaughter in #572
- Norwegian publishers by @mhaugestad in #574
Update existing publisher
- Fix NTV parser by @MaxDall in #577
- Set WashingtonPost deprecated by @MaxDall in #578
- Fix TheMirror by @MaxDall in #579
Bug fixes
- Fix a bug with HTMLTestFile path generation by @MaxDall in #563
- Bump Version of Dynamic Badges Action by @dobbersc in #567
- GH573: Fixes #573 by @MaxDall in #580
What else
New Contributors
- @mhaugestad made their first contribution in #574
Full Changelog: v0.4.0...v0.4.1
v0.4.0
🚀 Crawl 1 million articles in 7 hours on local hardware*
With this release, we made stability improvements to our CC-NEWS pipeline and are introducing some QoL features:
- a timeout parameter for the crawler
- article serialization
- improved logging
- redesign of the PublisherCollection
- redesign of the Article class
Further, we added two new publishers (golem, Heise) and made several updates to existing publishers and general bug fixes.
*Testing involved crawling 100,000 articles, which took 41.5 minutes, and scaling timings up by 10. This was done on a machine using 1000 Mbit/s bandwidth, Core i9-13905H, 64GB RAM, Windows 11, and the complete PublisherCollection. Results may vary based on the use case and bandwidth.
CC-NEWS pipeline and documentation
- Slow down WARC path requests by @MaxDall in #538
- Guard download and streaming of WARC files by @MaxDall in #537
- Spread parallel requests for CCNewsCrawler by @MaxDall in #539
- Fix upper bound for retries and catch urllib3.exceptions.HTTPError by @MaxDall in #541
- Add progress bar for WARC file processing by @MaxDall in #542
- Rework examples and tutorials regarding CC-NEWS by @MaxDall in #560
QoL
New timeout parameter for crawl method
- Add crawl timeout functionality by @olaughter in #536
New article serialization
- Add export feature for Articles by @addie9800 in #530
Improved logging
- Expose loggers and update documentation by @MaxDall in #540
- Rework logging and fix overwritten config by @MaxDall in #553
Redesigned PublisherCollection class
- Publisher Collection Rework by @addie9800 in #526
Redesigned Article class
Publishers
New Publishers
- Adds new publisher Heise by @addie9800 in #426
- added golem as publisher by @Feyrbrand in #484
Fixes
- Fix author parsing for BSZ by @MaxDall in #518
- Update TechCrunch by @MaxDall in #522
- Remove unreachable source for FreeBeacon by @MaxDall in #521
- Add sitemap filter to BusinessInsiderDE by @MaxDall in #520
- Fix sitemap_filter for FreeBeacon by @MaxDall in #527
- Mark Occupy Democrats as deprecated by @addie9800 in #543
- Fix The Mirror by @addie9800 in #547
- Fix Heise by @addie9800 in #545
- Update EveningStandard parser by @MaxDall in #549
- Fix Freie Presse by @addie9800 in #554
- Fix haberturk selectors by @MaxDall in #551
- Fix Funke topics by @addie9800 in #555
Misc
- Add timeout to publisher_coverage.py by @MaxDall in #508
- Remove _parser from file names by @addie9800 in #516
- Catch errors in coverage only if no complete articles were received by @MaxDall in #515
- Remove previous file when using -o option in test case script by @MaxDall in #517
- Set PYTHONPATH to the Root of the Repository for the Publisher Coverage Actions by @dobbersc in #519
- Refactor metadata parsing to include multiple values using the same key by @MaxDall in #523
- Deprecated Flag for Uncrawlable Publishers by @addie9800 in #534
- Show details about incomplete articles in Publisher Coverage by @addie9800 in #531
- Use timeout parameter in coverage script instead of wrapper by @MaxDall in #548
Bug Fixes
- Update LD Selector by @addie9800 in #514
- Documentation Fix Requires by @addie9800 in #535
- Fix an error message related to summary parsing by @MaxDall in #552
New Contributors
- @Feyrbrand made their first contribution in #484
- @olaughter made their first contribution in #536
Full Changelog: v0.3.1...v0.4.0
v0.3.1
🚀 New Release: Version 0.3.1
This release is mostly about content! With the help of many new contributors, we added ~40 new publishers.
New publisher
- added zdf.de by @LethalSnake1337 in #432
- added Bayerischer Rundfunk as publisher by @MK2112 in #435
- add-publisher-wdr by @jannispoltier in #439
- Add Hessenschau by @martinknz in #472
- adding new publisher Berliner Morgenpost by @myoncee in #493
- add people daily publisher by @screw-44 in #444
- add daily star support by @Jabbawukis in #443
- Added UK Publisher The Sun by @BorisKalika in #445
- Added Euronews to fundus by @dkm1006 in #478
- Add Tagesspiegel by @fabianhenning in #488
- Add netzpolitik.org publisher by @SebChmie in #486
- added derstandard.at by @felixvonberlin in #431
- Added US publisher Rolling Stone by @brandjakHU in #453
- Feature/add motorsport magazin by @lsch0lz in #456
- Sportschau added by @JuSteif in #457
- Add PublisherSpec for "Der Postillon" by @buschd-nlp in #458
- RBB24 added by @henrikkirchmann in #460
- add junge welt by @zhengxueying in #462
- Add support for newspaper "Kicker" by @Benjamin2107 in #464
- Add Support for "Washington Post" by @areinicke in #467
- Add new publisher "Frankfurter Rundschau" by @SunitaSi in #468
- Add new publisher “The Mirror” by @TingC99 in #466
- add au publisher nine news australia by @desicochrane in #469
- Added Daily Mail as a publisher by @BanoMarvey in #473
- Added BoersenZeitung by @SugarTzu in #474
- Add Voice Of America by @dlrow18 in #476
- Add new publisher Vogue by @Oreo7985 in #477
- Add new publisher Haberturk by @ozelalisen in #479
- Add new publisher TechCrunch.com by @jannichorst in #480
- add mitteldeutsche zeitung to publisher selection by @ThoBenMeHU in #481
- adding new publisher "wired" by @dilara1919 in #482
- Add Hamburger Abendblatt by @alwanyah in #483
- Added turkish publisher ntv by @umutyesildal in #491
- added french publisher le figaro by @mornningstar in #492
- add_freiepresse by @Frank10969 in #494
- add new publisher the standards by @Merchants-11 in #495
- Add Ruhr Nachrichten by @Junq1 in #496
- Add MalayMail publisher by @syafrahman in #497
- Added winfuture publisher by @lindhork in #487
What else
- Add a part about overwriting unit tests in how_to_add_a_publisher.md by @MaxDall in #442
- Catch an error regarding RSSFeed and change user-agent for Focus by @MaxDall in #447
- Use !r instead of single quotes for strings in f-strings by @MaxDall in #454
- Fix URL Filter Bug by @addie9800 in #459
- Fix some issues regarding encoding detection by @MaxDall in #455
- Add section about known issues to documentation by @MaxDall in #489
- Raise Exception if summary is not first node by @MaxDall in #485
- Discard empty strings when parsing topics by @MaxDall in #502
- Require at least one paragraph to evaluate ArticleBody to true by @MaxDall in #500
- Rearrange If Statement by @addie9800 in #506
- Add test case for attributes that are required to be covered by @MaxDall in #501
- Clean URLs before inserting into response_cache by @addie9800 in #504
- Addition of parser for the German newspaper "Gamestar" by @Opticus10 in #475
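As a short illustration of the !r item in the list above: the `!r` conversion applies `repr()` to the value, so quoting and escaping are handled by Python rather than by hand-written quotes. The message text and values below are made up for this example.

```python
name = "Focus"

# Manual quoting works for simple values ...
manual = f"Unknown publisher '{name}'"
# ... and !r produces the same text by applying repr().
with_repr = f"Unknown publisher {name!r}"

# When the value itself contains a single quote, repr() switches
# to double quotes, whereas manual quoting would become ambiguous.
tricky = "L'Equipe"
tricky_repr = f"Unknown publisher {tricky!r}"
```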
New Contributors
- @LethalSnake1337 made their first contribution in #432
- @MK2112 made their first contribution in #435
- @jannispoltier made their first contribution in #439
- @martinknz made their first contribution in #472
- @myoncee made their first contribution in #493
- @screw-44 made their first contribution in #444
- @Jabbawukis made their first contribution in #443
- @BorisKalika made their first contribution in #445
- @dkm1006 made their first contribution in #478
- @fabianhenning made their first contribution in #488
- @SebChmie made their first contribution in #486
- @felixvonberlin made their first contribution in #431
- @brandjakHU made their first contribution in #453
- @lsch0lz made their first contribution in #456
- @JuSteif made their first contribution in #457
- @buschd-nlp made their first contribution in #458
- @henrikkirchmann made their first contribution in #460
- @zhengxueying made their first contribution in #462
- @Benjamin2107 made their first contribution in #464
- @TingC99 made their first contribution in #466
- @areinicke made their first contribution in #467
- @SunitaSi made their first contribution in #468
- @desicochrane made their first contribution in #469
- @BanoMarvey made their first contribution in #473
- @SugarTzu made their first contribution in #474
- @dlrow18 made their first contribution in #476
- @Oreo7985 made their first contribution in #477
- @ozelalisen made their first contribution in #479
- @jannichorst made their first contribution in #480
- @ThoBenMeHU made their first contribution in #481
- @dilara1919 made their first contribution in #482
- @alwanyah made their first contribution in #483
- @umutyesildal made their first contribution in #491
- @mornningstar made their first contribution in #492
- @Frank10969 made their first contribution in #494
- @Merchants-11 made their first contribution in #495
- @Junq1 made their first contribution in #496
- @syafrahman made their first contribution in https://github.com/flairNLP/fundus/pul...
v0.3.0
🚀 New Release: Version 0.3.0
We're excited to announce a significant overhaul of the Fundus crawling core logic in this release! We've transitioned from using asyncio to a ThreadPool-based solution, resulting in a more robust and performant system. Now, each publisher operates on its own thread, synchronized seamlessly through a queue.
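The general pattern of one worker thread per publisher feeding a shared queue can be sketched as follows. This is only an illustration built on the standard library's `ThreadPoolExecutor` and `queue.Queue`. `PUBLISHERS`, `crawl_publisher`, and `crawl_all` are hypothetical names, and the code does not reproduce Fundus's actual implementation.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for publishers and their articles.
PUBLISHERS = {
    "publisher_a": ["a1", "a2"],
    "publisher_b": ["b1", "b2", "b3"],
}
_DONE = object()  # sentinel marking that one worker has finished


def crawl_publisher(name, articles, out):
    """Worker: push every article of one publisher into the shared queue."""
    for article in articles:
        out.put((name, article))
    out.put(_DONE)


def crawl_all(publishers):
    """Run one thread per publisher and yield results as they arrive."""
    out = queue.Queue()
    with ThreadPoolExecutor(max_workers=len(publishers)) as pool:
        for name, articles in publishers.items():
            pool.submit(crawl_publisher, name, articles, out)
        finished = 0
        # Consume until every worker has sent its sentinel.
        while finished < len(publishers):
            item = out.get()
            if item is _DONE:
                finished += 1
            else:
                yield item


results = sorted(crawl_all(PUBLISHERS))
```

The queue is the only point of synchronization, so the consumer sees articles from all publishers interleaved in arrival order. The example sorts them only to get a deterministic result.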
Breaking changes
To provide a more streamlined experience, we've relocated every crawler-type specific parameter to its respective constructor. As a result, these parameters are no longer accessible through the crawl method:
- delay -> Crawler
- start, end -> CCNewsCrawler
Furthermore, since we removed asyncio, the crawl_async method is no longer available.
What's new
- Unbatch Fundus by @MaxDall in #357
- Add free_access as attribute to Article by @MaxDall in #421
- Add query parameter [Based on #357] by @addie9800 in #403
- Rework ExtractionFilter to adapt to boolean values by @MaxDall in #423
New publisher
- Add Lithuanian News Source by @addie9800 in #393
- Add US version of business insider by @MaxDall in #356
- Adding a swiss publisher (SRF) by @addie9800 in #410
- Add Rheinische Post as publisher by @MaxDall in #416
Updating existing publisher
- This is a renewed PR for BI Germany, that keeps the mostly Test files unmodified by @addie9800 in #402
- Bump WAZ to version V1_1 by @MaxDall in #388
- Update FAZ parser by @MaxDall in #419
- bi authentication bug workaround by @addie9800 in #406
Bug fixes
- Fix domains for several publishers by @MaxDall in #398
- Restrict typing-extensions version to >= 4.6 by @MaxDall in #405
- Bump mypy to version 1.9.0 by @MaxDall in #412
- Fixed a bug in documentation.yaml by @MaxDall in #415
- Fix a bug in generate_parser_test_files.py by @MaxDall in #418
- Fix a bug in bf_search regarding boolean values by @MaxDall in #422
QoL
- Adds Pretty Print for PublisherCollection and PublisherSpec by @addie9800 in #399
- Add custom filter for publisher_coverage to skip boolean values by @MaxDall in #408
- Documentation Update: Explain Addition of New Countries by @addie9800 in #413
- Attributes Parameter in Test Generation Script by @addie9800 in #411
- Add body to unit tests by @MaxDall in #338
- Adds a part about generate_tables script to the documentation by @MaxDall in #424
Maintenance
- Update relevant actions to versions utilizing node 20 by @MaxDall in #417
- Disable strict_query parsing for URL validation. by @MaxDall in #407
Full Changelog: v0.2.2...v0.3.0
v0.2.2
What's Changed
This removes some leftover code used for testing that would write a file called test_nation.html to the project root when using us.TheNation's parser.
Full Changelog: v0.2.1...v0.2.2