How does the CrawlSpider work? #284

Open
bch80 opened this issue Jun 26, 2023 · 0 comments
bch80 commented Jun 26, 2023

Description

Hello,

I'm trying to figure out how this works.
So far, I've connected my spider to redis with 3 test-domains.
When I start the spider, I can see the first hit to the websites.
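For reference, I seed the queue roughly like this (a sketch; the exact URLs and connection details differ in my setup, and the list name matches the `redis_key` of the spider further below):

```python
import redis

# Sketch of how the start URLs end up in Redis (my real setup differs in detail).
# The list name has to match the spider's redis_key.
r = redis.Redis(host='localhost', port=6379)
r.lpush('mycrawler:start_urls', 'https://www.example.com')
```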

What I don't understand now is:
How are the URLs that the LinkExtractor finds fed back into Redis?
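My understanding so far is that the only Redis-specific wiring is in settings.py, roughly like this (a sketch based on the scrapy-redis README; my actual values may differ):

```python
# settings.py (sketch, setting names taken from the scrapy-redis README)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"                # requests are queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"    # dedup via Redis as well
SCHEDULER_PERSIST = True                                      # keep the queue between runs
REDIS_URL = 'redis://localhost:6379'
```

So I can't tell whether the links the LinkExtractor finds go into that scheduler queue or back into `mycrawler:start_urls`.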

And I assume my crawler is being "stopped" at:
domain = kwargs.pop('domain', '')
kwargs is always an empty dict.
Where does it come from?

It seems like I initialize self.allowed_domains with an empty list of domains, so the crawler can't start.
How to do it right?
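As far as I can tell, kwargs would only be non-empty if the spider were started with explicit arguments, something like this (a sketch; I normally just run `scrapy crawl redis_my_crawler` without any -a arguments):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical way to populate kwargs: pass the domain list as a spider argument.
# On the command line this would correspond to:
#   scrapy crawl redis_my_crawler -a domain=example.com,example.org
process = CrawlerProcess(get_project_settings())
process.crawl(MyCrawlerSpider, domain="example.com,example.org")
process.start()
```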

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class MyCrawlerSpider(RedisCrawlSpider):
    """Spider that reads urls from redis queue (mycrawler:start_urls)."""

    name = "redis_my_crawler"
    redis_key = 'mycrawler:start_urls'

    rules = (
        Rule(LinkExtractor(), follow=True, process_links="filter_links"),
        Rule(LinkExtractor(), callback='parse_page', follow=True, process_links="filter_links"),
    )

    def __init__(self, *args, **kwargs):
        # Dynamically define the allowed domains list.
        print('Init')
        print(args)
        print(kwargs)
        domain = kwargs.pop('domain', '')
        print(domain)
        self.allowed_domains = filter(None, domain.split(','))
        print(self.allowed_domains)
        super(MyCrawlerSpider, self).__init__(*args, **kwargs)

    def filter_links(self, links):
        allowed_strings = ('news')
        allowed_links = []
        for link in links:
            if (any(s in link.url.lower() for s in allowed_strings)
                    and any(domain in link.url for domain in self.allowed_domains)):
                print(link)
                allowed_links.append(link)

        return allowed_links

    def parse_page(self, response):
        print(response.url)
        return None
```
