Crawler doesn't extract any links from Google Cloud documentation website #680

Closed
Guthman opened this issue Aug 20, 2024 · 6 comments · Fixed by #687
Labels: bug

Guthman commented Aug 20, 2024

from trafilatura.spider import focused_crawler
crawl_start_url = 'https://cloud.google.com/docs'
to_visit, known_links = focused_crawler(homepage=crawl_start_url, max_seen_urls=1000, max_known_urls=1000)

to_visit is empty and known_links only contains the input URL.

Ignoring robots.txt (using the rule below) doesn't seem to help...

import urllib.robotparser

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read method to do nothing, avoiding parsing the robots.txt
        pass
adbar added the bug label on Aug 22, 2024
adbar (Owner) commented Aug 22, 2024

That's correct; something is wrong with relative link processing here.
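
(For reference, correct relative-link handling just resolves each href against the page URL; a minimal sketch with the standard library, using a hypothetical href:)

from urllib.parse import urljoin

base = "https://cloud.google.com/docs"
href = "/docs/tutorials"    # hypothetical relative link as it might appear in the page HTML
print(urljoin(base, href))  # -> https://cloud.google.com/docs/tutorials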

@adbar
Copy link
Owner

adbar commented Aug 30, 2024

Google is blacklisted by the underlying courlan package; this can simply be bypassed by passing the strict=False parameter to the extract_links() function in the spider module.
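
(To see the filter in action: a minimal sketch, assuming courlan's public check_url helper, which returns None for rejected URLs and a (url, domain) tuple otherwise:)

from courlan import check_url

url = "https://cloud.google.com/docs"
print(check_url(url, strict=True))   # expected: None, the domain is filtered out in strict mode
print(check_url(url, strict=False))  # expected: a (url, domain) tuple, the URL passes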

adbar linked a pull request on Aug 30, 2024 that will close this issue

cjgalvin commented Nov 1, 2024

Sorry to comment on a closed issue, but I wanted to check whether this solution still works. I ran into the same result as the original poster on several different websites, which led me to this issue.

It looks like the PR set the default to strict=False for extract_links, so I would expect the Google Cloud docs from the original post to work. However, I get the same result as the original post: to_visit is empty and known_links only contains the input URL. That's the same result I see with the other websites.

To be clear, my other websites may have different issues, and this question is focused on why I cannot crawl https://cloud.google.com/docs. The scraper works for other websites designed to be scraped. I am also able to download https://cloud.google.com/docs using bare_extraction.
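
(For reference, the bare_extraction check mentioned here; a minimal sketch assuming the v1.x API, where bare_extraction takes the downloaded HTML and returns a dict:)

from trafilatura import bare_extraction, fetch_url

downloaded = fetch_url("https://cloud.google.com/docs")
if downloaded is not None:
    doc = bare_extraction(downloaded, url="https://cloud.google.com/docs")
    print(doc["title"])  # extraction works; only the crawl yields no links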

I am on trafilatura v1.12.2. Here is my code (I tried with and without the original post's IgnoreRobotFileParser rules):

import urllib.robotparser

from trafilatura.spider import focused_crawler

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read method to do nothing, avoiding parsing the robots.txt
        pass

url = "https://cloud.google.com/docs"
to_visit, known_links = focused_crawler(url, max_seen_urls=10, max_known_urls=10, rules=IgnoreRobotFileParser())

Thank you in advance.

Guthman (Author) commented Nov 4, 2024

I've moved on from trafilatura, as my use case requires more capabilities than this library can offer (such as JavaScript support), so I don't know, sorry.

adbar (Owner) commented Nov 6, 2024

@cjgalvin There might be a problem with the urllib3 dependency on this page. Try installing the optional pycurl package (which Trafilatura supports seamlessly); it is often better and faster.
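
(A minimal way to test that suggestion; assuming trafilatura picks up pycurl automatically once it is importable, no code change is needed beyond the install:)

# pip install pycurl  (on Linux this may require the libcurl headers, e.g. libcurl4-openssl-dev)
from trafilatura.spider import focused_crawler

# re-run the same crawl; downloads should now go through pycurl instead of urllib3
to_visit, known_links = focused_crawler(
    "https://cloud.google.com/docs", max_seen_urls=10, max_known_urls=10
)
print(len(to_visit), len(known_links))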


cjgalvin commented Nov 7, 2024

@Guthman no worries, thank you for the response.

@adbar okay, will give it a test.
