Empty Results When Using Spider Function with Category URL #696

Open
felipehertzer opened this issue Sep 9, 2024 · 5 comments
Labels
question Further information is requested

Comments

@felipehertzer
Contributor

felipehertzer commented Sep 9, 2024

Hey @adbar,

I am currently testing the spider function, and I have encountered an issue when attempting to use a category URL to fetch posts specifically from that category.

Here is the code snippet that I am working with:

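from trafilatura.spider import focused_crawler

# Crawl starting from a category page rather than the site root
spider_results, _ = focused_crawler(
    homepage="https://www.australiandefence.com.au/news/news",
    max_seen_urls=1,
    max_known_urls=50,
    prune_xpath="//header | //footer",
)
print(spider_results)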

The function returns empty results. After investigating, I believe the problem may lie in this line of code. I modified the line to:

if response.url not in homepage and response.url != "/":

This change resolved the issue, but it breaks the redirect handling.
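For reference, when homepage is a string, `not in` performs a substring test, which is not the same as an exact comparison. The sketch below contrasts the modified condition with a tuple membership test. The tuple form is an assumption, shown only for comparison and not quoted from the library; the function names and the sample URLs are hypothetical.

```python
homepage = "https://www.australiandefence.com.au/news/news"


def substring_check(response_url: str, homepage: str) -> bool:
    """The modified condition: `not in` on a str is a substring test."""
    return response_url not in homepage and response_url != "/"


def membership_check(response_url: str, homepage: str) -> bool:
    """Comparison form (assumed): exact match against a tuple of values."""
    return response_url not in (homepage, "/")


# Illustrative response URLs: identical, root, path-only, and a redirect
for candidate in (homepage, "/", "/news/news", "https://example.org/"):
    print(
        candidate,
        substring_check(candidate, homepage),
        membership_check(candidate, homepage),
    )
```

The two checks agree on the first two and the last candidate and differ only on the path-only value, which is a substring of the homepage but not equal to it.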

Thank you.

@adbar adbar added the question Further information is requested label Sep 9, 2024
@adbar
Owner

adbar commented Sep 9, 2024

Hi @felipehertzer, I cannot reproduce the issue: I get results for your example with the latest version of the code (from the GitHub repository). Did you make other changes?

@felipehertzer
Contributor Author

Hey @adbar,

I have reinstalled it, but the issue persists.

When I run the following code, the variable new_base_url appears to be missing a value. Is this the same result you are getting?

htmlstring, homepage, new_base_url = probe_alternative_homepage(url)
print(homepage, new_base_url)  # result = /news/news ''
if htmlstring and homepage and new_base_url:  # evaluates to False here because new_base_url is ''

@adbar
Owner

adbar commented Sep 9, 2024

I still cannot reproduce it. probe_alternative_homepage() works as expected: it returns the HTML code, https://www.australiandefence.com.au/news/news and https://www.australiandefence.com.au.

Besides, the line you're suggesting, if response.url not in homepage and response.url != "/":, is equivalent to the one in the code.

I guess the probe_alternative_homepage() check could be skipped if the input is not a homepage but a subsection of a website, but that is a different issue.
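A minimal sketch of what such a skip condition might look like, assuming "homepage" means a URL whose path is empty or "/" and which carries no query string. is_homepage is a hypothetical helper for illustration, not an existing trafilatura function.

```python
from urllib.parse import urlsplit


def is_homepage(url: str) -> bool:
    """Return True when the URL points at a site root (hypothetical rule)."""
    parts = urlsplit(url)
    # A root URL has no path beyond "/" and no query string
    return parts.path in ("", "/") and not parts.query


for url in (
    "https://www.australiandefence.com.au",
    "https://www.australiandefence.com.au/",
    "https://www.australiandefence.com.au/news/news",
):
    print(url, is_homepage(url))
```

Under this rule, only the first two URLs would go through the homepage probe, and the category URL would be crawled as given.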

@felipehertzer
Contributor Author

felipehertzer commented Sep 16, 2024

Hello @adbar,

I apologise for the delayed response. I had some additional time to conduct further testing and identified the issue in the line below. I was able to fix it on my side by installing pycurl; I had been using urllib3 2.2.3 instead. PyCurl works correctly, while urllib3 does not.

Specifically, it seems that the geturl() method is not returning the complete URL; it only returns the path, such as /news/news. In contrast, PyCurl correctly returns the full URL: https://www.australiandefence.com.au/news/news.

Here is the line of code in question:

resp = Response(response.data, response.status, response.geturl())
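One possible workaround, sketched with the standard library only: resolve the reported URL against the URL that was requested, so that a path-only value becomes absolute again. The helpers base_of and absolutize are hypothetical names for illustration, not part of trafilatura or urllib3, and this sketch makes no claim about where such a fix would belong.

```python
from urllib.parse import urljoin, urlsplit


def base_of(url: str) -> str:
    """Return scheme://host, or '' when the URL has no scheme or host."""
    parts = urlsplit(url)
    if parts.scheme and parts.netloc:
        return f"{parts.scheme}://{parts.netloc}"
    return ""


def absolutize(requested_url: str, reported_url: str) -> str:
    """Resolve a possibly relative reported URL against the requested one."""
    return urljoin(requested_url, reported_url)


requested = "https://www.australiandefence.com.au/news/news"
reported = "/news/news"  # path-only value, as described above

# A path-only URL has no scheme or host, so no base URL can be derived
print(repr(base_of(reported)))
# After resolving against the requested URL, the base is recoverable
print(base_of(absolutize(requested, reported)))
```

urljoin leaves an already absolute URL unchanged, so a full redirect target would still replace the requested URL.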

@adbar
Owner

adbar commented Oct 1, 2024

Thanks for the details. This is tricky; it may be a bug in urllib3. How do you think we can solve this?
