Crawler doesn't extract any links from Google Cloud documentation website #680

Closed
Guthman opened this issue Aug 20, 2024 · 6 comments · Fixed by #687
Labels: bug

Guthman commented Aug 20, 2024

from trafilatura.spider import focused_crawler
crawl_start_url = 'https://cloud.google.com/docs'
to_visit, known_links = focused_crawler(homepage=crawl_start_url, max_seen_urls=1000, max_known_urls=1000)

to_visit is empty and known_links only contains the input URL.

Ignoring robots.txt (using the rule below) doesn't seem to help...

import urllib.robotparser

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read method to do nothing, avoiding parsing the robots.txt
        pass
adbar added the bug label on Aug 22, 2024
adbar (Owner) commented Aug 22, 2024

That's correct; something is wrong with relative link processing here.
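
(For reference, correct relative-link handling just resolves each href against the page URL; a minimal sketch with the standard library, using a hypothetical href:)

from urllib.parse import urljoin

base = "https://cloud.google.com/docs"
href = "/docs/tutorials"    # hypothetical relative link as it might appear in the page HTML
print(urljoin(base, href))  # -> https://cloud.google.com/docs/tutorials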

@adbar
Copy link
Owner

adbar commented Aug 30, 2024

Google is blacklisted by the underlying courlan package; this can simply be bypassed by passing the strict=False parameter to the extract_links() function in the spider module.
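
(To see the filter in action: a minimal sketch, assuming courlan's public check_url helper, which returns None for rejected URLs and a (url, domain) tuple otherwise:)

from courlan import check_url

url = "https://cloud.google.com/docs"
print(check_url(url, strict=True))   # expected: None, the domain is filtered out in strict mode
print(check_url(url, strict=False))  # expected: a (url, domain) tuple, the URL passes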

adbar linked a pull request on Aug 30, 2024 that will close this issue

cjgalvin commented Nov 1, 2024

Sorry to comment on a closed issue, but I wanted to check whether this solution still works. I ran into the same result as the original poster on several different websites, which led me to this issue.

It looks like the PR set the default to strict=False for extract_links, so I would expect the Google Cloud docs from the original post to work. However, I get the same result as the original post: to_visit is empty and known_links only contains the input URL. That's the same result I see with the other websites.

To be clear, my other websites may have different issues, and this question is focused on why I cannot crawl https://cloud.google.com/docs. The scraper works for other websites designed to be scraped. I am also able to download https://cloud.google.com/docs using bare_extraction.
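
(For reference, the bare_extraction check mentioned here; a minimal sketch assuming the v1.x API, where bare_extraction takes the downloaded HTML and returns a dict:)

from trafilatura import bare_extraction, fetch_url

downloaded = fetch_url("https://cloud.google.com/docs")
if downloaded is not None:
    doc = bare_extraction(downloaded, url="https://cloud.google.com/docs")
    print(doc["title"])  # extraction works; only the crawl yields no links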

I am on trafilatura v1.12.2. Here is my code (I tried with and without the original post's IgnoreRobotFileParser rules):

import urllib.robotparser

from trafilatura.spider import focused_crawler

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read method to do nothing, avoiding parsing the robots.txt
        pass

url = "https://cloud.google.com/docs"
to_visit, known_links = focused_crawler(url, max_seen_urls=10, max_known_urls=10, rules=IgnoreRobotFileParser())

Thank you in advance.

Guthman (Author) commented Nov 4, 2024

I've moved on from trafilatura, as my use case requires more capabilities than this library can offer (such as JavaScript support), so I don't know, sorry.

adbar (Owner) commented Nov 6, 2024

@cjgalvin There might be a problem with the urllib3 dependency on this page. Try installing the optional pycurl package (which Trafilatura supports seamlessly); it is often better and faster.
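
(A minimal way to test that suggestion; assuming trafilatura picks up pycurl automatically once it is importable, no code change is needed beyond the install:)

# pip install pycurl  (on Linux this may require the libcurl headers, e.g. libcurl4-openssl-dev)
from trafilatura.spider import focused_crawler

# re-run the same crawl; downloads should now go through pycurl instead of urllib3
to_visit, known_links = focused_crawler(
    "https://cloud.google.com/docs", max_seen_urls=10, max_known_urls=10
)
print(len(to_visit), len(known_links))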


cjgalvin commented Nov 7, 2024

@Guthman no worries, thank you for the response.

@adbar okay, will give it a test.
