
robots.allowed returns false for other sites (and domains) #110

Open
gk1544 opened this issue Mar 23, 2019 · 2 comments

gk1544 commented Mar 23, 2019

Hi,

Let's take a look at the following example from Google's robots.txt documentation:

robots.txt location:
http://example.com/robots.txt

Valid for:
http://example.com/
http://example.com/folder/file
Not valid for:
http://other.example.com/
https://example.com/
http://example.com:8181/

For instance, when asked if any page on http://other.example.com/ is allowed, reppy returns False.

It should either return True or potentially throw an exception, but definitely not False.
Returning False is incorrect because robots.txt is not a whitelist.

Here is an example:

from reppy.robots import Robots

# A robots.txt for example.com that disallows /abc for every agent
robots_content = 'User-agent: *\nDisallow: /abc'
robots = Robots.parse('http://example.com/robots.txt', robots_content)

print(robots.allowed('http://example.com/', '*'))
# True (**correct**)
print(robots.allowed('http://other.example.com/', '*'))
# False (**incorrect** - this robots.txt says nothing about other.example.com)
print(robots.allowed('http://apple.com/', '*'))
# False (**incorrect** - nor anything about apple.com)
dlecocq (Contributor) commented Mar 25, 2019

I can certainly understand the argument for not wanting it to return False - it is somewhat misleading. Ultimately this traces down into rep-cpp's agent.cpp at https://github.com/seomoz/rep-cpp/blob/master/src/agent.cpp#L69.

I have mixed feelings about what the behavior should be. On the one hand, False doesn't really capture the truth of it, but it is the safer alternative - better to incorrectly report False than to risk incorrectly reporting True; ideally we'd have a way to convey "it's not clear whether this is allowed or not based on this robots.txt." On the other hand, throwing an exception doesn't feel quite right either, because the situation isn't particularly exceptional. Perhaps a different return type that conveys more of the nuance would work, but that also seems a little clunky.
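As a purely hypothetical sketch (not reppy's actual API), a richer return type could look something like this, reporting "not applicable" when the robots.txt was parsed for a different origin than the URL being checked:

from enum import Enum
from urllib.parse import urlsplit

class Permission(Enum):
    ALLOWED = 'allowed'
    DISALLOWED = 'disallowed'
    NOT_APPLICABLE = 'not applicable'  # the robots.txt governs a different origin

def origin(url):
    # Note: default ports are not normalized in this sketch, so
    # http://example.com and http://example.com:80 compare unequal.
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)

def check(robots, robots_url, url, agent):
    # robots_url is the URL the robots.txt was fetched/parsed from.
    if origin(robots_url) != origin(url):
        return Permission.NOT_APPLICABLE
    return Permission.ALLOWED if robots.allowed(url, agent) else Permission.DISALLOWED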

Whenever we've used this ourselves, we've generally gone through the cache, which takes care of finding the appropriate Robots or Agent based on the domain.
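For reference, the cache-based usage looks roughly like this (a sketch based on reppy's README; the capacity and user-agent values are just examples):

from reppy.cache import RobotsCache

# The cache fetches and stores a robots.txt per origin, so each URL is
# checked against the rules that actually govern its own domain.
cache = RobotsCache(capacity=100)
print(cache.allowed('http://example.com/hello', 'my-user-agent'))
print(cache.allowed('http://other.example.com/hello', 'my-user-agent'))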

pensnarik commented

What's the workaround for this? Many websites serve robots.txt rules only for the second-level domain, which means that URLs on "www.domain.com" are also reported as forbidden by the rules even though they are not. For example:

DEBUG - URL https://insurancejournal.com/news/west/ is allowed in robots.txt
DEBUG - URL https://www.insurancejournal.com/news/international/2020/10/02/584993.htm is FORBIDDEN by robots.txt, skipping

I'm thinking of removing www. from the URL before checking it, but that looks ugly.
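A less fragile workaround (just a sketch, not reppy's API; fetch_robots is a hypothetical helper that downloads and parses the robots.txt for an origin) is to key lookups by the URL's own origin instead of stripping www.:

from urllib.parse import urlsplit

def is_allowed(url, agent, robots_by_origin, fetch_robots):
    # Determine the origin of the URL itself; www.example.com and
    # example.com are treated as different origins, as the spec requires.
    parts = urlsplit(url)
    origin = (parts.scheme, parts.netloc)
    if origin not in robots_by_origin:
        # Fetch and parse the robots.txt that actually governs this origin,
        # rather than reusing rules parsed for a different host.
        robots_by_origin[origin] = fetch_robots('%s://%s/robots.txt' % origin)
    return robots_by_origin[origin].allowed(url, agent)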
