Releases: flairNLP/fundus

v0.4.6

05 Nov 18:24
f06969f

🚨 Hotfix release for CCNewsCrawler 🚨

With the xpath_search added in version 0.4.5, some parsers generated unpicklable extractions. This crashed the CCNewsCrawler when piping results back to the main thread and rendered the crawler unusable. This issue is now fixed with #655.
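As a rough illustration of this class of failure (not the actual Fundus code), the sketch below checks whether an object survives pickling, which is what sending a result between processes requires. The helper name `can_cross_process_boundary` and the sample dictionaries are hypothetical, and the lambda is only a generic example of something pickle cannot serialize.

```python
import pickle


def can_cross_process_boundary(obj) -> bool:
    """Return True if obj can be pickled, as needed to pass it between processes."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # pickle reports unsupported objects with one of these exception types
        return False


# Plain data pickles fine (hypothetical extraction result).
plain_extraction = {"headline": "Example", "topics": ["a", "b"]}

# A lambda stored inside the result cannot be pickled.
broken_extraction = {"headline": "Example", "lookup": lambda key: key}

print(can_cross_process_boundary(plain_extraction))
print(can_cross_process_boundary(broken_extraction))
```

A check like this in a unit test can catch unpicklable results before they reach a multiprocessing pipeline.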

Updated parsers

Bug fixes

  • Fix a bug with attribute defaults and add default_factory parameter by @MaxDall in #649
  • Fix pickling problem in LinkedDataMapping by @addie9800 in #655

QoL

  • Add additional space characters to normalize_whitespace by @MaxDall in #646
  • Improve encoding detection by @MaxDall in #650

Full Changelog: v0.4.5...v0.4.6

v0.4.5

22 Oct 18:30
5d3f301

Important

This is a re-release of version 0.4.5 from 10/21/2024, as the package couldn't be published on PyPI.

New publishers for Japan and Spain and some maintenance 🔧

Publishers

New

We added two new publishers from Japan (The Japan News/Yomiuri Shimbun) and one from Spain (El Pais).

Fixes

for DEVs

JSON+LD

We refactored our JSON and JSON-LD parser to be more robust and to support multi-type LDs.

Deprecation

  • Deprecate get_value_by_key_path and replace with xpath_search by @MaxDall in #626

Bug fixes

Fixed a bug where using suppress for error handling would result in articles being skipped.

  • Add default return values for attributes by @MaxDall in #633

Full Changelog: v0.4.4...v0.4.5

v0.4.4

30 Sep 15:46
cf5b17f

New publishers for India, Switzerland, and Australia

With this release, we added three new publishers, updated several existing ones, and added some QoL functionality for DEVs.

Publishers

New

Updates

What's new?

We implemented XPath queries for LinkedDataMapping to allow more fine-grained searches through the data (@MaxDall in #614). Further, we now parse crawl-delays from publisher-given robots.txt files, which can be omitted through the crawler (@MaxDall in #609). Additionally, we ...

  • Ignore robots.txt in coverage script by @MaxDall in #610
  • Adjust generic_topic_parsing to return only unique topics by @MaxDall in #620

Bug fixes

  • Fix a bug with the plaintext property of Article by @MaxDall in #612

Full Changelog: v0.4.3...v0.4.4

v0.4.3

04 Sep 12:07
ccf5a80

Introducing New Publishers from Canada, Germany, and India 🚀

This release includes:

  • Support for five new publishers (three from Canada, one from India, and one from Germany)
  • Article filtering based on robots.txt

New Features

With this update, we've implemented article filtering using robots.txt. Each URL fetched is now evaluated against the path and user-agent restrictions specified by publishers in their robots.txt files. This feature is enabled by default, but users can disable it by setting ignore_robots=True in the Crawler constructor.
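The idea of evaluating each URL against a publisher's robots.txt can be sketched with the standard library's `urllib.robotparser`. This is only an illustration under stated assumptions, not the Fundus code: the function `allowed_urls`, its `ignore_robots` argument (named after the constructor parameter mentioned above), and the sample URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="FundusBot", ignore_robots=False):
    """Return the URLs that user_agent may fetch according to robots_txt."""
    if ignore_robots:
        # Filtering disabled: keep every URL.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules and URLs for demonstration.
rules = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "https://example.com/news/article-1",
    "https://example.com/private/draft",
]

print(allowed_urls(rules, candidates))
print(allowed_urls(rules, candidates, ignore_robots=True))
```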

New Publishers

Canada (CA)

India (IND)

Germany (DE)

Updates

We've updated our APNews parser to accurately parse authors once more and applied additional fixes.

Bug Fixes

Full Changelog: v0.4.2...v0.4.3

v0.4.2

28 Aug 13:30
8972903

Maintenance release 🔧

  • update existing parsers to address layout changes
  • update evaluation results
  • update the paper citation

Bug fixes

  • Add mandatory set_name to logging documentation by @addie9800 in #586
  • Catch ConnectionError when fetching RSS-Feeds by @MaxDall in #589
  • Fix typing regarding new fastWARC version by @MaxDall in #591

Full Changelog: v0.4.1...v0.4.2

v0.4.1

06 Aug 14:56
351546d

New support for Norwegian publishers 🚀

Thanks to our community, we added 9 new publishers to Fundus, namely:

NO

  • VerdensGang
  • Dagbladet
  • Nettavisen
  • NRK

UK

  • Metro
  • Express
  • BBC

FR

  • Les Echos

CH

  • NZZ

New Publisher

Update existing publisher

Bug fixes

What else

New Contributors

Full Changelog: v0.4.0...v0.4.1

v0.4.0

17 Jul 15:51
72e7ff0

🚀 Crawl 1 million articles in 7 hours on local hardware*

With this release, we made stability improvements to our CC-NEWS pipeline and are introducing some QoL features, namely:

  • a timeout parameter for the crawler
  • article serialization
  • improved logging
  • redesign of the PublisherCollection
  • redesign of the Article class

Further, we added two new publishers (golem, Heise) and made several updates to existing publishers and general bug fixes.

*Testing involved crawling 100,000 articles, which took 41.5 minutes, and scaling the timings up by a factor of 10. This was done on a machine using 1000 Mbit/s bandwidth, Core i9-13905H, 64GB RAM, Windows 11, and the complete PublisherCollection. Results may vary based on the use case and bandwidth.
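The extrapolation behind the headline figure can be reproduced with a few lines of arithmetic. The variable names are chosen for this example, and the linear scaling is the assumption stated in the footnote, not a measured result.

```python
# Measured run, as stated in the footnote.
measured_articles = 100_000
measured_minutes = 41.5

# Target size from the headline.
target_articles = 1_000_000

# Assume crawl time scales linearly with the number of articles.
scale = target_articles / measured_articles
projected_hours = measured_minutes * scale / 60

print(f"{projected_hours:.2f} hours")  # 6.92 hours
```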

CC-NEWS pipeline and documentation

  • Slow down WARC path requests by @MaxDall in #538
  • Guard download and streaming of WARC files by @MaxDall in #537
  • Spread parallel requests for CCNewsCrawler by @MaxDall in #539
  • Fix upper bound for retries and catch urllib3.exceptions.HTTPError by @MaxDall in #541
  • Add progress bar for WARC file processing by @MaxDall in #542
  • Rework examples and tutorials regarding CC-NEWS by @MaxDall in #560

QoL

New timeout parameter for crawl method

New article serialization

Improved logging

  • Expose loggers and update documentation by @MaxDall in #540
  • Rework logging and fix overwritten config by @MaxDall in #553

Redesigned PublisherCollection class

Redesigned Article class

Publishers

New Publishers

Fixes

Misc

  • Add timeout to publisher_coverage.py by @MaxDall in #508
  • Remove _parser from file names by @addie9800 in #516
  • Catch errors in coverage only if no complete articles were received by @MaxDall in #515
  • Remove previous file when using -o option in test case script by @MaxDall in #517
  • Set PYTHONPATH to the Root of the Repository for the Publisher Coverage Actions by @dobbersc in #519
  • Refactor metadata parsing to include multiple values using the same key by @MaxDall in #523
  • Deprecated Flag for Uncrawlable Publishers by @addie9800 in #534
  • Show details about incomplete articles in Publisher Coverage by @addie9800 in #531
  • Use timeout parameter in coverage script instead of wrapper by @MaxDall in #548

Bug Fixes

New Contributors

Full Changelog: v0.3.1...v0.4.0

v0.3.1

13 May 11:28
4fa7bd5

🚀 New Release: Version 0.3.1

This release is mostly about content! With the help of many new contributors, we added ~40 new publishers.

New publisher

What else

  • Change default user-agent -> FundusBot by @MaxDall in #440
  • Add a part about overwriting unit tests in how_to_add_a_publisher.md by @MaxDall in #442
  • Catch an error regarding RSSFeed and change user-agent for Focus by @MaxDall in #447
  • Pin validators >= 0.24.0 by @MaxDall in #448
  • Implement module scoped logging by @MaxDall in #361
  • Add meta-info file automatically to git by @MaxDall in #451
  • gh446: predict encodings for responses by @MaxDall in #450
  • Use !r instead of single quotes for strings in f-strings by @MaxDall in #454
  • Fix URL Filter Bug by @addie9800 in #459
  • Add part about un-crawl-able publishers by @MaxDall in #463
  • Fix some issues regarding encoding detection by @MaxDall in #455
  • Add section about known issues to documentation by @MaxDall in #489
  • Raise Exception if summary is not first node by @MaxDall in #485
  • Add .DS_Store to gitignore by @MaxDall in #503
  • Discard empty strings when parsing topics by @MaxDall in #502
  • Require at least one paragraph to evaluate ArticleBody to true by @MaxDall in #500
  • Fix a bug in topic parsing by @MaxDall in #499
  • Rearrange If Statement by @addie9800 in #506
  • Add test case for attributes that are required to be covered by @MaxDall in #501
  • Clean URLs before inserting into response_cache by @addie9800 in #504
  • Add custom user-agent to SpiegelOnline by @MaxDall in #507
  • Add header to bot protection snippet by @MaxDall in #512
  • Add custom user agent to DerStandard by @MaxDall in #511
  • Fix docstring for RequiresAll by @MaxDall in #509
  • Addition of parser for the German newspaper "Gamestar" by @Opticus10 in #475
  • Bump version to 0.3.1 by @MaxDall in #513

New Contributors

v0.3.0

21 Apr 19:41
28a74d0

🚀 New Release: Version 0.3.0

We're excited to announce a significant overhaul of the Fundus crawling core logic in this release! We've transitioned from using asyncio to a ThreadPool-based solution, resulting in a more robust and performant system. Now, each publisher operates on its own thread, synchronized seamlessly through a queue.
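The pattern described here, one thread per publisher feeding a shared queue, can be sketched roughly as follows. This is an illustration only and not the Fundus implementation: it uses plain `threading.Thread` objects rather than a thread pool, and the names `crawl_publisher`, `crawl_all`, and the sample publishers are hypothetical.

```python
import queue
import threading

# Sentinel a worker puts on the queue when it has no more items.
_DONE = object()


def crawl_publisher(name, urls, out):
    """Worker: push one result per URL, then signal completion."""
    for url in urls:
        # A real crawler would download and parse the article here.
        out.put((name, url))
    out.put(_DONE)


def crawl_all(publishers):
    """Run one thread per publisher and yield results as they arrive."""
    out = queue.Queue()
    threads = [
        threading.Thread(target=crawl_publisher, args=(name, urls, out), daemon=True)
        for name, urls in publishers.items()
    ]
    for thread in threads:
        thread.start()

    remaining = len(threads)
    while remaining:
        item = out.get()  # blocks until some worker produces something
        if item is _DONE:
            remaining -= 1
        else:
            yield item

    for thread in threads:
        thread.join()


# Hypothetical publishers and URLs for demonstration.
sample = {"publisher_a": ["a/1", "a/2"], "publisher_b": ["b/1"]}
for publisher, url in crawl_all(sample):
    print(publisher, url)
```

Because the queue is the only shared state, the consumer needs no locks, and a slow publisher does not block results from the others.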

Breaking changes

To provide a more streamlined experience, we've relocated every crawler-type specific parameter to its respective constructor. As a result, these parameters are no longer accessible through the crawl method:

  • delay -> Crawler
  • start, end -> CCNewsCrawler

Furthermore, since we removed asyncio, the crawl_async method is no longer available.

What's new

New publisher

Updating existing publisher

Bug fixes

QoL

  • Adds Pretty Print for PublisherCollection and PublisherSpec by @addie9800 in #399
  • Add custom filter for publisher_coverage to skip boolean values by @MaxDall in #408
  • Documentation Update: Explain Addition of New Countries by @addie9800 in #413
  • Attributes Parameter in Test Generation Script by @addie9800 in #411
  • Add body to unit tests by @MaxDall in #338
  • Adds a part about generate_tables script to the documentation by @MaxDall in #424

Maintenance

  • Update relevant actions to versions utilizing node 20 by @MaxDall in #417
  • Disable strict_query parsing for URL validation. by @MaxDall in #407

Full Changelog: v0.2.2...v0.3.0

v0.2.2

04 Apr 11:44
f01be36

What's Changed

This removes some leftover code used for testing that would write a file called test_nation.html to the project root when using us.TheNation's parser.

Full Changelog: v0.2.1...v0.2.2