Crawler doesn't extract any links from Google Cloud documentation website #680
Comments
That's correct; there is something wrong with relative link processing here.
Google is blacklisted by the underlying courlan package; this can simply be bypassed by passing strict=False.
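For illustration, a hedged sketch of how this blacklisting can surface when calling courlan directly (the exact strict-mode behaviour depends on the installed courlan version):

```python
from courlan import check_url

url = "https://cloud.google.com/docs"

# With strict filtering enabled, well-known platform domains such as
# google.com may be rejected, in which case check_url returns None
# and the crawler ends up with no usable link.
print(check_url(url, strict=True))

# With strict=False the URL should pass validation and come back as a
# (cleaned_url, domain) tuple.
print(check_url(url, strict=False))
```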
Sorry to comment on a closed issue, but I wanted to check whether this solution still works. I ran into a result similar to the original poster's on different websites, which led me to this issue.

It looks like the PR set the default to strict=False for extract_links, so I would expect the Google Cloud docs from the original post to work. However, I get the same result as the original post: to_visit is empty and known_links only contains the input website. That's the same result I see with the other websites. To be clear, my other websites may have different issues; this question is focused on why I cannot crawl https://cloud.google.com/docs. The scraper works for other websites designed to be scraped, and I am also able to download https://cloud.google.com/docs using bare_extraction. I am on trafilatura v1.12.2.

Here is my code (I tried with and without the original post's IgnoreRobotFileParser rules):
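A minimal sketch of this kind of crawl, assuming the focused_crawler entry point from trafilatura.spider and illustrative limits (the robots.txt override mentioned above is sketched after the original report further down):

```python
from trafilatura.spider import focused_crawler

# Crawl the Google Cloud docs start page; the limits below are
# illustrative values, not necessarily the ones from the original comment.
to_visit, known_links = focused_crawler(
    "https://cloud.google.com/docs",
    max_seen_urls=10,
    max_known_urls=100,
)

# Reported outcome: to_visit is empty and known_links holds only the
# start URL, with or without the robots.txt override passed in.
print(to_visit)
print(known_links)
```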
Thank you in advance.
I've moved on from trafilatura, as my use case requires more capabilities than this library can offer (like JavaScript support), so I don't know, sorry.
@cjgalvin There might be a problem with the …
From the original report: to_visit is empty and known_links only contains the input URL. Ignoring robots.txt (using the rule below) doesn't seem to help.
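A plausible sketch of such a rule, assuming a urllib.robotparser.RobotFileParser subclass whose can_fetch always allows fetching, handed to focused_crawler through its rules parameter:

```python
import urllib.robotparser

from trafilatura.spider import focused_crawler


class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    """A robots.txt parser that permits every URL, i.e. ignores robots.txt."""

    def can_fetch(self, useragent, url):
        return True


# Passing the permissive rules object did not change the reported outcome.
to_visit, known_links = focused_crawler(
    "https://cloud.google.com/docs",
    max_seen_urls=10,
    rules=IgnoreRobotFileParser(),
)
```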