Empty Results When Using Spider Function with Category URL #696

felipehertzer · 2024-09-09T01:04:45Z

I am currently testing the spider function, and I have encountered an issue when attempting to use a category URL to fetch posts specifically from that category.

Here is the code snippet that I am working with:

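from trafilatura.spider import focused_crawler

# focused_crawler returns a tuple: (URLs still to visit, known URLs)
spider_results, _ = focused_crawler(
  homepage="https://www.australiandefence.com.au/news/news",
  max_seen_urls=1,
  max_known_urls=50,
  prune_xpath="//header | //footer",
)
print(spider_results)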

The function returns empty results. After investigating, I believe the problem may lie in this line of code. I modified the line to:

if response.url not in homepage and response.url != "/":

This change resolved the issue, but it breaks the redirect function.

Thank you.

adbar · 2024-09-09T10:03:51Z

Hi @felipehertzer, I cannot reproduce the issue: I get results for your example with the latest version of the code (from the GitHub repository). Did you make any other changes?

felipehertzer · 2024-09-09T10:29:55Z

Hey @adbar,

I have reinstalled it, but the issue persists.

When I run the following code, the variable new_base_url appears to be missing a value. Is this the same result you are getting?

htmlstring, homepage, new_base_url = probe_alternative_homepage(url)
print(homepage, new_base_url)  # result = /news/news ''
if htmlstring and homepage and new_base_url:

adbar · 2024-09-09T11:32:05Z

I still cannot reproduce it. probe_alternative_homepage() works as expected: it returns the HTML code, https://www.australiandefence.com.au/news/news and https://www.australiandefence.com.au.

Besides, the line if response.url not in homepage and response.url != "/": that you're suggesting is equivalent to the one in the code.

I guess the probe_alternative_homepage() check could be skipped if the input is a subsection of a website rather than a homepage, but this is a different issue.
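
A minimal sketch of how such a "homepage or subsection" test could look, using only the standard library. The helper name looks_like_homepage is hypothetical and not part of trafilatura; treating an empty path or "/" with no query string as a homepage is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def looks_like_homepage(url: str) -> bool:
    """Hypothetical helper (not part of trafilatura): treat a URL as a
    homepage only if it has no path beyond "/" and no query string."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


# A category URL would not count as a homepage under this assumption,
# so a homepage probe could be skipped for it.
for candidate in (
    "https://www.australiandefence.com.au",
    "https://www.australiandefence.com.au/news/news",
):
    print(candidate, looks_like_homepage(candidate))
```

Whether skipping the probe for subsections is desirable depends on how redirects should be handled, which this sketch does not address.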

felipehertzer · 2024-09-16T05:03:26Z

Hello @adbar,

I apologise for the delayed response. I had some additional time to conduct further testing and identified the issue in the line below. I was able to work around it on my side by installing pycurl; before that I was using urllib3 2.2.3. PyCurl works correctly, whereas urllib3 does not.

Specifically, it seems that the geturl() method is not returning the complete URL; it only returns the path, such as /news/news. In contrast, PyCurl correctly returns the full URL: https://www.australiandefence.com.au/news/news.

Here is the line of code in question:

trafilatura/trafilatura/downloads.py, line 205 in f57ef0b:

resp = Response(response.data, response.status, response.geturl())
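
One possible workaround, sketched under assumptions: if the HTTP client reports only a path such as /news/news, it can be resolved against the URL that was originally requested. The function name absolute_final_url is hypothetical and not part of trafilatura or urllib3; the sketch relies only on urllib.parse.urljoin from the standard library and has not been tested against the library itself.

```python
from typing import Optional
from urllib.parse import urljoin


def absolute_final_url(requested_url: str, reported_url: Optional[str]) -> str:
    """Hypothetical helper: return an absolute URL even when the client
    reported only a path (or nothing) as the final URL."""
    if not reported_url:
        # Nothing reported: fall back to the URL that was requested.
        return requested_url
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the requested URL.
    return urljoin(requested_url, reported_url)


requested = "https://www.australiandefence.com.au/news/news"
print(absolute_final_url(requested, "/news/news"))
print(absolute_final_url(requested, requested))
```

With this approach a full URL returned after a redirect is kept as reported, and only relative values are completed.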

adbar · 2024-10-01T11:18:55Z

Thanks for the details. This is tricky; it may be a bug in urllib3. How do you think we can solve this?

adbar added the "question" label (Further information is requested) on Sep 9, 2024
