Hello,
I'm trying to figure out how this works.
So far, I've connected my spider to Redis with three test domains.
When I start the spider, I can see the first requests hit the websites.
What I don't understand now is:
How are the URLs that the LinkExtractor finds fed back into Redis?
I also assume my crawler is being "stopped" at `domain = kwargs.pop('domain', '')`:
`kwargs` is always an empty dict.
Where is it supposed to come from?
It seems like I initialize `self.allowed_domains` with an empty list of domains, so the crawler can't follow any links.
How do I do it right?
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawlerSpider(RedisCrawlSpider):
    """Spider that reads URLs from a Redis queue (mycrawler:start_urls)."""
    name = "redis_my_crawler"
    redis_key = 'mycrawler:start_urls'

    rules = (
        Rule(LinkExtractor(), follow=True, process_links="filter_links"),
        Rule(LinkExtractor(), callback='parse_page', follow=True,
             process_links="filter_links"),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        print('Init')
        print(args)
        print(kwargs)
        domain = kwargs.pop('domain', '')
        print(domain)
        # list(...) so the result can be iterated more than once on Python 3,
        # where filter() returns a one-shot iterator.
        self.allowed_domains = list(filter(None, domain.split(',')))
        print(self.allowed_domains)
        super(MyCrawlerSpider, self).__init__(*args, **kwargs)

    def filter_links(self, links):
        # The trailing comma matters: ('news') is just the string 'news', and
        # any(s in ... for s in 'news') would test single characters.
        allowed_strings = ('news',)
        allowed_links = []
        for link in links:
            if (any(s in link.url.lower() for s in allowed_strings)
                    and any(domain in link.url for domain in self.allowed_domains)):
                print(link)
                allowed_links.append(link)
        return allowed_links

    def parse_page(self, response):
        print(response.url)
        return None
```
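On the first question: as far as I can tell, scrapy-redis does not push extracted links back into `mycrawler:start_urls`. Once the Redis-backed scheduler is enabled, every request the spider yields (including those produced by the `LinkExtractor` rules) is queued in Redis, by default under a `<spider>:requests` key. A minimal sketch of that wiring, using the setting names from the scrapy-redis README (the `REDIS_URL` is a placeholder for your own instance):

```python
# settings.py: minimal scrapy-redis wiring (setting names from the
# scrapy-redis README; REDIS_URL is a placeholder)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"        # queue requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                              # keep the queue between runs
REDIS_URL = 'redis://localhost:6379'
```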
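On the `kwargs` question: spider keyword arguments only reach `__init__` when they are passed explicitly, e.g. `scrapy crawl redis_my_crawler -a domain=example.com,example.org` on the command line; nothing supplies a `domain` argument on its own. A sketch of the programmatic equivalent (the domain values are placeholders):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# MyCrawlerSpider as defined above; extra keyword arguments to crawl()
# end up in the spider's **kwargs.
process = CrawlerProcess(get_project_settings())
process.crawl(MyCrawlerSpider, domain='example.com,example.org')
process.start()
```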