The scrapy-kafka-export package provides a Scrapy extension to export items
to Kafka.
License is MIT.
The extension requires Python 2.7 or 3.4+.
pip install scrapy-kafka-export
To use KafkaItemExporterExtension, enable and configure it in settings.py:
EXTENSIONS = {
    'scrapy_kafka_export.KafkaItemExporterExtension': 1,
}
KAFKA_EXPORT_ENABLED = True
KAFKA_BROKERS = [
    'kafka1:9093',
    'kafka2:9093',
    'kafka3:9093'
]
KAFKA_TOPIC = 'test-topic'
After that, all scraped items are sent to the configured Kafka topic.
If an item has an _id field, _id is used as a message key.
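The key rule above can be illustrated with a small sketch. The helper name message_key and the UTF-8 encoding of the key are assumptions made for illustration (Python 3 shown); the extension's own serialization may differ.

```python
def message_key(item):
    """Return a Kafka message key for an item, or None if it has no _id.

    Hypothetical helper: the name and the UTF-8 encoding are assumptions,
    not part of scrapy-kafka-export's documented API.
    """
    key = item.get('_id')
    if key is None:
        # Items without an _id field are sent without a message key.
        return None
    return str(key).encode('utf-8')


print(message_key({'_id': 'abc123', 'title': 'Example'}))  # b'abc123'
print(message_key({'title': 'No id'}))                      # None
```

Kafka's default partitioner sends messages with the same key to the same partition, so a stable _id keeps updates to one item in order.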
If your Kafka uses SSL, configure SSL-based auth:
KAFKA_SSL_CONFIG_MODULE = 'myproject'
KAFKA_SSL_CACERT_FILE = 'certificates/ca-cert.pem'
KAFKA_SSL_CLIENTCERT_FILE = 'certificates/client-cert.pem'
KAFKA_SSL_CLIENTKEY_FILE = 'certificates/client-key.pem'
These settings assume the following layout for the certificates in the 'myproject' project:
myproject_repo/
myproject_repo/myproject/
myproject_repo/myproject/__init__.py
myproject_repo/myproject/certificates/ca-cert.pem
myproject_repo/myproject/certificates/client-cert.pem
myproject_repo/myproject/certificates/client-key.pem
...
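Because the certificate paths are resource paths relative to the module named in KAFKA_SSL_CONFIG_MODULE, it can help to check that they resolve before starting a crawl. The following is a minimal sketch of such a check for Python 3; resolve_resource and missing_resources are hypothetical helper names, and this is not necessarily how the extension itself locates the files.

```python
import importlib.util
import os


def resolve_resource(module_name, resource_path):
    """Return the on-disk path of a resource inside an importable package.

    Hypothetical helper for checking settings; not the extension's own code.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.submodule_search_locations:
        raise ValueError('%r is not an importable package' % module_name)
    # Use the first directory that makes up the package.
    package_dir = list(spec.submodule_search_locations)[0]
    return os.path.join(package_dir, *resource_path.split('/'))


def missing_resources(module_name, resource_paths):
    """Return the resource paths that do not exist as files on disk."""
    return [
        path for path in resource_paths
        if not os.path.isfile(resolve_resource(module_name, path))
    ]
```

Run inside the project environment, missing_resources('myproject', [...]) with the three KAFKA_SSL_*_FILE values should return an empty list if the certificates are where the settings say they are.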
If you're using setup.py to deploy the project (using scrapyd or Scrapy Cloud),
certificates should be added to package data. Modify setup.py like this:
from setuptools import setup, find_packages
setup(
    name='myproject',
    ...
    package_data={
        'myproject': ['certificates/*.pem'],
    },
    ...
)
KAFKA_EXPORT_ENABLED - Flag that enables the extension; it is False by default.
KAFKA_BROKERS - List of Kafka brokers in format host:port
KAFKA_TOPIC - Kafka topic where items are going to be sent
KAFKA_BATCH_SIZE - Kafka batch size (100 by default).
KAFKA_SSL_CONFIG_MODULE - name of the project module
KAFKA_SSL_CACERT_FILE - resource path of the Certificate Authority certificate
KAFKA_SSL_CLIENTCERT_FILE - resource path of the client certificate
KAFKA_SSL_CLIENTKEY_FILE - resource path of the client key
If KAFKA_SSL_CONFIG_MODULE is not set, no certificate will be loaded.
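The host:port format expected by KAFKA_BROKERS can be checked in settings.py before a crawl starts. This is a minimal sketch; parse_broker is a hypothetical helper, not part of the package, and it only validates the shape of the string, not whether the broker is reachable.

```python
def parse_broker(address):
    """Split a 'host:port' string into (host, port).

    Hypothetical helper; raises ValueError if the address is malformed.
    """
    host, sep, port = address.rpartition(':')
    if not sep or not host or not port.isdigit():
        raise ValueError('expected host:port, got %r' % address)
    port_number = int(port)
    if not 0 < port_number < 65536:
        raise ValueError('port out of range in %r' % address)
    return host, port_number


KAFKA_BROKERS = ['kafka1:9093', 'kafka2:9093', 'kafka3:9093']
print([parse_broker(broker) for broker in KAFKA_BROKERS])
# [('kafka1', 9093), ('kafka2', 9093), ('kafka3', 9093)]
```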
If you want to push Scrapy items to Kafka from a script, use
scrapy_kafka_export.writer.ScrapyKafkaTopicWriter instead of
scrapy_kafka_export.KafkaItemExporterExtension; see the writer's docstring
for details.