Introducing New Publishers from Canada, Germany, and India
This release includes:
- Support for five new publishers (three from Canada, one from India, and one from Germany)
- Article filtering based on robots.txt
New Features
With this update, we've implemented article filtering using robots.txt. Each fetched URL is now evaluated against the path and user-agent restrictions that publishers specify in their robots.txt files. This feature is enabled by default; to disable it, set ignore_robots=True in the Crawler constructor.
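To illustrate what evaluating a URL against path and user-agent restrictions can look like, here is a minimal sketch built on Python's standard urllib.robotparser module. It is not the library's actual implementation: the robots.txt content, the example.com URLs, and the helpers build_parser and filter_urls are all hypothetical, and the ignore_robots parameter only mirrors the name of the constructor option described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a standard-library parser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_urls(urls, user_agent, parser, ignore_robots=False):
    """Keep only the URLs that the given user agent may fetch.

    ignore_robots mirrors the option named in the release note; when it is
    True, no filtering is applied. This function is an assumption made for
    illustration, not part of the library.
    """
    if ignore_robots:
        return list(urls)
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
urls = [
    "https://example.com/news/article-1",
    "https://example.com/private/draft",
]

# A generic crawler falls under the "*" rules: only /private/ is blocked.
allowed = filter_urls(urls, "SomeCrawler", parser)

# ExampleBot is blocked from every path by its own rule group.
blocked_agent = filter_urls(urls, "ExampleBot", parser)

# With filtering disabled, every URL passes through unchanged.
unfiltered = filter_urls(urls, "ExampleBot", parser, ignore_robots=True)
```

In this sketch the same URL list yields different results depending on the user agent, which is why both the path and the user-agent rules have to be checked for every fetched URL.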
New Publishers
Canada (CA)
- Introduced CBC as the first Canadian publisher by @addie9800 in #583
- Added NationalPost by @addie9800 in #584
- Included The Globe and Mail by @addie9800 in #587
India (IND)
- Added Times Of India by @addie9800 in #569
Germany (DE)
Updates
We've updated our APNews parser so that it parses authors correctly again, and applied additional fixes.
Bug Fixes
- Protected key access for RSSFeed entries by @MaxDall in #599
- Fixed an issue in test file generation by @addie9800 in #597
Full Changelog: v0.4.2...v0.4.3