Releases: scrapinghub/frontera
Bug fix release
Some bugs are fixed by updating dependencies.
Examples and documentation updates
- the general-spider example is fixed,
- strategy worker (SW) crashes with ZeroMQ are fixed (stats output is wiped),
- documentation updates.
Crawling strategy is a cornerstone of Frontera
This is a major release containing many architectural changes. The goal of these changes is to make developing and debugging crawling strategies easier. From now on, there is an extensive guide in the documentation on how to write a custom crawling strategy, a single-process mode that makes it much easier to debug a crawling strategy locally, and the old distributed mode for production systems. Starting from this version there is no requirement to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.
We also removed unnecessary, rarely used features (the distributed spiders run mode and the prioritisation logic in backends) to make Frontera easier to use and understand.
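For illustration, here is a minimal custom crawling strategy in the spirit of the new guide. It is a sketch assuming the v0.8-era API (read_seeds, filter_extracted_links, links_extracted, page_crawled, request_error, plus the schedule/create_request helpers); see the guide in the documentation for the authoritative interface.

```python
from frontera.core.components import States
from frontera.strategy import BaseCrawlingStrategy


class MyCrawlingStrategy(BaseCrawlingStrategy):
    def read_seeds(self, stream):
        # schedule every seed URL read from the seeds stream
        for url in stream:
            url = url.strip()
            if url:
                self.schedule(self.create_request(url=url))

    def filter_extracted_links(self, request, links):
        # only links surviving this filter trigger backend calls for state data
        return [link for link in links if link.url.startswith('https://')]

    def links_extracted(self, request, links):
        # schedule links not seen before and mark them as queued
        for link in links:
            if link.meta[b'state'] == States.NOT_CRAWLED:
                self.schedule(link)
                link.meta[b'state'] = States.QUEUED

    def page_crawled(self, response):
        response.meta[b'state'] = States.CRAWLED

    def request_error(self, request, error):
        request.meta[b'state'] = States.ERROR
```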
Here is a (somewhat) full change log:
- PyPy (2.7.*) support,
- Redis backend (kudos to @khellan),
- LRU cache and two cache generations for HBaseStates,
- Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
- breadth-first and depth-first crawling strategies,
- a new mandatory backend component: DomainMetadata,
- a filter_extracted_links method in the crawling strategy API to optimise calls to backends for state data,
- create_request in the crawling strategy now uses FronteraManager middlewares,
- support for running multiple batch generation instances,
- support for the latest kafka-python,
- statistics are sent to the message bus from all parts of Frontera,
- overall reliability improvements,
- settings for OverusedBuffer,
- DBWorker was refactored and split into components (kudos to @vshlapakov),
- seeds can now be added via S3,
- Python 3.7 compatibility.
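A couple of the new pieces can be switched on from settings; a sketch, where the module paths are assumptions based on this release line (check the settings reference for the exact values):

```python
# settings.py sketch -- module paths below are assumptions, not verified here
CRAWLING_STRATEGY = 'frontera.strategy.discovery.Discovery'  # robots.txt- and sitemap-aware strategy
BACKEND = 'frontera.contrib.backends.redis_backend.RedisBackend'  # the new Redis backend
```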
Codecs now serialize string types, and other improvements
Thanks to @voith, a problem introduced at the beginning of Python 3 support, when Frontera handled only keys and values stored as bytes in .meta fields, is now solved. Because of it, many Scrapy middlewares weren't working, or were working incorrectly. This is still not tested properly, so please report any bugs.
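To illustrate the scope of the fix: previously only bytes survived a codec round-trip in .meta, so string-typed keys and values set by Scrapy middlewares were lost or mangled. A hypothetical example of what now works:

```python
# hypothetical illustration: string keys and values inside .meta now survive
# message bus serialization; previously only bytes were handled reliably
request.meta[b'scrapy_meta'] = {
    'depth': 3,                                  # int value
    'redirect_urls': ['https://example.com/a'],  # list of str values
}
```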
Other improvements include:
- batched state refresh in the crawling strategy,
- proper access to redirects in Scrapy converters,
- a simpler, more readable OverusedBuffer implementation,
- fixes to examples, tests and docs.
Thank you all for your contributions!
Support of new Kafka API and other minor improvements
Long-awaited support for the kafka-python 1.x.x client. Frontera is now much more resistant to physical connectivity loss and uses the new asynchronous Kafka API.
Other improvements:
- SW consumes less CPU (because of less frequent state flushing),
- the request creation API in BaseCrawlingStrategy has changed and is now batch oriented (see the sketch after this list),
- a new article in the docs on cluster setup,
- an option to disable scoring log consumption in the DB worker,
- a fix for HBase table dropping,
- improved test coverage.
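The batch-oriented flavour might look roughly like this. A sketch only, since the note doesn't spell out the new signatures; the helper names follow the later 0.8-era API and are assumptions here:

```python
# sketch: create requests as a batch, refresh their state data in one
# backend call, then schedule them; helper names are assumptions
urls = ['https://example.com/a', 'https://example.com/b']
requests = [self.create_request(url=url) for url in urls]
self.refresh_states(requests)  # one batched lookup instead of per-request calls
for request in requests:
    self.schedule(request)
```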
Python 3 support and many more
- Full Python 3 support 👏 👍 🍻 (#106); all the thanks go to @Preetwinder.
- `canonicalize_url` method removed in favor of the w3lib implementation.
- The whole `Request` (incl. meta) is propagated to the DB Worker by means of the scoring log (fixes #131).
- Crc32 is generated from the hostname the same way on both platforms, Python 2 and 3.
- `HBaseQueue` now supports delayed requests: a `crawl_at` field in meta, holding a timestamp, makes the request available to spiders only after that moment has passed. An important feature for revisiting (see the sketch after this list).
- The `Request` object is now persisted in `HBaseQueue`, allowing requests to be scheduled with specific meta, headers, body and cookies parameters.
- New `MESSAGE_BUS_CODEC` option allowing a message bus codec other than the default to be chosen.
- Strategy worker refactored to simplify its customization from subclasses.
- Fixed a bug with extracted links distribution over spider log partitions (#129).
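A sketch of the delayed-request feature, assuming a Unix-timestamp value and a bytes meta key (the note doesn't pin down the exact encoding):

```python
import time

# revisit this page in roughly six hours: the request stays in HBaseQueue
# and is handed to spiders only after the timestamp has passed
request.meta[b'crawl_at'] = int(time.time()) + 6 * 3600
```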
Fixed kafka message bus crash with default codec, new options
New options for managing the queue fetch algorithm used for broad crawling, and improved logging in the manager and strategy worker.
Fixing import crash when kafka-python isn't installed
Options, proper finishing by crawling strategy and traceback on SIGUSR1
- `CONSUMER_BATCH_SIZE` is removed and two new options are introduced: `SPIDER_LOG_CONSUMER_BATCH_SIZE` and `SCORING_LOG_CONSUMER_BATCH_SIZE` (see the sketch after this list).
- A traceback is written to the log when SIGUSR1 is received in the DB worker or strategy worker.
- Finishing in the strategy worker is fixed for the case when the crawling strategy reports it has finished.
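A settings sketch for the two new options; the values below are illustrative, not necessarily the shipped defaults:

```python
# settings.py sketch -- replacements for the removed CONSUMER_BATCH_SIZE;
# values are illustrative
SPIDER_LOG_CONSUMER_BATCH_SIZE = 512    # spider log batches (SW and DB worker)
SCORING_LOG_CONSUMER_BATCH_SIZE = 1024  # scoring log batches (DB worker)
```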
Kafka codec option
Before this release the default compression codec was Snappy. We found out that Snappy support is broken in certain Kafka versions, so we issued this release. The latest version has no compression codec enabled by default, and allows the compression codec to be chosen with the `KAFKA_CODEC_LEGACY` option.
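If compression is still wanted, it can be re-enabled explicitly. A sketch, with the value format as an assumption (the release note only names the option):

```python
# settings.py sketch -- opt back into compression explicitly;
# 'snappy' as the value format is an assumption, check the settings docs
KAFKA_CODEC_LEGACY = 'snappy'
```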