Releases: scrapinghub/frontera
Bug fix release
Some bugs are fixed by updating dependencies.
Examples and documentation updates
- the general-spider example is fixed,
- strategy worker (SW) crashes with ZeroMQ are fixed (stats output is wiped),
- documentation updates.
Crawling strategy is a cornerstone of Frontera
This is a major release containing many architectural changes. The goal of these changes is to make developing and debugging crawling strategies easier. From now on, there is an extensive guide in the documentation on how to write a custom crawling strategy, a single-process mode that makes it much easier to debug a crawling strategy locally, and the old distributed mode for production systems. Starting from this version there is no requirement to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.
We also removed unnecessary, rarely used features (the distributed spiders run mode and the prioritisation logic in backends) to make Frontera easier to use and understand.
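For illustration, here is a minimal custom crawling strategy in the spirit of the new guide. It is a sketch assuming the v0.8-era API (read_seeds, filter_extracted_links, links_extracted, page_crawled, request_error, plus the schedule/create_request helpers); see the guide in the documentation for the authoritative interface.

```python
from frontera.core.components import States
from frontera.strategy import BaseCrawlingStrategy


class MyCrawlingStrategy(BaseCrawlingStrategy):
    def read_seeds(self, stream):
        # schedule every seed URL read from the seeds stream
        for url in stream:
            url = url.strip()
            if url:
                self.schedule(self.create_request(url=url))

    def filter_extracted_links(self, request, links):
        # only links surviving this filter trigger backend calls for state data
        return [link for link in links if link.url.startswith('https://')]

    def links_extracted(self, request, links):
        # schedule links not seen before and mark them as queued
        for link in links:
            if link.meta[b'state'] == States.NOT_CRAWLED:
                self.schedule(link)
                link.meta[b'state'] = States.QUEUED

    def page_crawled(self, response):
        response.meta[b'state'] = States.CRAWLED

    def request_error(self, request, error):
        request.meta[b'state'] = States.ERROR
```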
Here is a (somewhat) full change log:
- PyPy (2.7.*) support,
- Redis backend (kudos to @khellan),
- LRU cache and two cache generations for HBaseStates,
- Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
- breadth-first and depth-first crawling strategies,
- a new mandatory backend component: DomainMetadata,
- a filter_extracted_links method in the crawling strategy API to optimise calls to backends for state data,
- create_request in the crawling strategy now uses FronteraManager middlewares,
- support for running multiple batch generation instances,
- support for the latest kafka-python,
- statistics are sent to the message bus from all parts of Frontera,
- overall reliability improvements,
- settings for OverusedBuffer,
- DBWorker was refactored and split into components (kudos to @vshlapakov),
- seeds can now be added via S3,
- Python 3.7 compatibility.
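A couple of the new pieces can be switched on from settings; a sketch, where the module paths are assumptions based on this release line (check the settings reference for the exact values):

```python
# settings.py sketch -- module paths below are assumptions, not verified here
CRAWLING_STRATEGY = 'frontera.strategy.discovery.Discovery'  # robots.txt- and sitemap-aware strategy
BACKEND = 'frontera.contrib.backends.redis_backend.RedisBackend'  # the new Redis backend
```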
Codecs now serialize string types, and other improvements
Thanks to @voith, a problem introduced at the beginning of Python 3 support, when Frontera handled only keys and values stored as bytes in .meta fields, is now solved. Because of it, many Scrapy middlewares weren't working, or were working incorrectly. This is still not tested properly, so please report any bugs.
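To illustrate the scope of the fix: previously only bytes survived a codec round-trip in .meta, so string-typed keys and values set by Scrapy middlewares were lost or mangled. A hypothetical example of what now works:

```python
# hypothetical illustration: string keys and values inside .meta now survive
# message bus serialization; previously only bytes were handled reliably
request.meta[b'scrapy_meta'] = {
    'depth': 3,                                  # int value
    'redirect_urls': ['https://example.com/a'],  # list of str values
}
```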
Other improvements include:
- batched state refresh in the crawling strategy,
- proper access to redirects in Scrapy converters,
- a simpler, more readable OverusedBuffer implementation,
- fixes to examples, tests and docs.
Thank you all for your contributions!
Support of new Kafka API and other minor improvements
Long-awaited support for the kafka-python 1.x.x client. Frontera is now much more resistant to physical connectivity loss and uses the new asynchronous Kafka API.
Other improvements:
- SW consumes less CPU (because of less frequent state flushing),
- the request creation API in BaseCrawlingStrategy has changed and is now batch oriented (see the sketch after this list),
- a new article in the docs on cluster setup,
- an option to disable scoring log consumption in the DB worker,
- a fix for HBase table dropping,
- improved test coverage.
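The batch-oriented flavour might look roughly like this. A sketch only, since the note doesn't spell out the new signatures; the helper names follow the later 0.8-era API and are assumptions here:

```python
# sketch: create requests as a batch, refresh their state data in one
# backend call, then schedule them; helper names are assumptions
urls = ['https://example.com/a', 'https://example.com/b']
requests = [self.create_request(url=url) for url in urls]
self.refresh_states(requests)  # one batched lookup instead of per-request calls
for request in requests:
    self.schedule(request)
```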
Python 3 support and many more
- Full Python 3 support 👏 👍 🍻 (#106); all the thanks go to @Preetwinder.
- `canonicalize_url` method removed in favor of the w3lib implementation.
- The whole `Request` (incl. meta) is propagated to the DB Worker by means of the scoring log (fixes #131).
- Crc32 is generated from the hostname the same way on both platforms, Python 2 and 3.
- `HBaseQueue` now supports delayed requests: a `crawl_at` field in meta, holding a timestamp, makes the request available to spiders only after that moment has passed. An important feature for revisiting (see the sketch after this list).
- The `Request` object is now persisted in `HBaseQueue`, allowing requests to be scheduled with specific meta, headers, body and cookies parameters.
- New `MESSAGE_BUS_CODEC` option allowing a message bus codec other than the default to be chosen.
- Strategy worker refactored to simplify its customization from subclasses.
- Fixed a bug with extracted links distribution over spider log partitions (#129).
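A sketch of the delayed-request feature, assuming a Unix-timestamp value and a bytes meta key (the note doesn't pin down the exact encoding):

```python
import time

# revisit this page in roughly six hours: the request stays in HBaseQueue
# and is handed to spiders only after the timestamp has passed
request.meta[b'crawl_at'] = int(time.time()) + 6 * 3600
```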
Fixed kafka message bus crash with default codec, new options
New options for managing the queue fetch algorithm used for broad crawling, and improved logging in the manager and strategy worker.
Fixing import crash when kafka-python isn't installed
Options, proper finishing by crawling strategy and traceback on SIGUSR1
- `CONSUMER_BATCH_SIZE` is removed and two new options are introduced: `SPIDER_LOG_CONSUMER_BATCH_SIZE` and `SCORING_LOG_CONSUMER_BATCH_SIZE` (see the sketch after this list).
- A traceback is written to the log when SIGUSR1 is received in the DB worker or strategy worker.
- Finishing in the strategy worker is fixed for the case when the crawling strategy reports it has finished.
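A settings sketch for the two new options; the values below are illustrative, not necessarily the shipped defaults:

```python
# settings.py sketch -- replacements for the removed CONSUMER_BATCH_SIZE;
# values are illustrative
SPIDER_LOG_CONSUMER_BATCH_SIZE = 512    # spider log batches (SW and DB worker)
SCORING_LOG_CONSUMER_BATCH_SIZE = 1024  # scoring log batches (DB worker)
```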
Kafka codec option
Before this release the default compression codec was Snappy. We found out that Snappy support is broken in certain Kafka versions, so we issued this release. The latest version has no compression codec enabled by default, and allows the compression codec to be chosen with the `KAFKA_CODEC_LEGACY` option.
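If compression is still wanted, it can be re-enabled explicitly. A sketch, with the value format as an assumption (the release note only names the option):

```python
# settings.py sketch -- opt back into compression explicitly;
# 'snappy' as the value format is an assumption, check the settings docs
KAFKA_CODEC_LEGACY = 'snappy'
```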