Disallow rule not working #209

robotspy==0.8.0 returns True, False instead of False, False.
Comments
Thank you, @kox-solid, for raising this issue. Let me look into it as soon as possible.
@kox-solid, I analyzed your example. It correctly returns

So the following example would have returned False, True:

```python
import robots

content = """
User-agent: mozilla
Disallow: /
"""
check_url = "https://example.com"
user_agent = "Mozilla"

parser = robots.RobotsParser.from_string(content)
print(parser.can_fetch(user_agent, check_url))  # False (Disallow)
print(parser.is_agent_valid(user_agent))        # True (Mozilla is a valid user-agent)
```

I understand that the term user-agent in the User-Agent HTTP request header and the User-agent line in robots.txt sound confusing. The latest RFC, https://www.rfc-editor.org/rfc/rfc9309, provides some elements of clarity (for example, by using the term product token instead of user-agent), but in

Other libraries or tools like https://github.com/google/robotstxt or https://github.com/jimsmart/grobotstxt follow the same approach as

Thank you again for raising your concern.
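To make the header-versus-token distinction concrete, here is a minimal sketch, assuming the robotspy 0.8.0 behavior described in this thread; the expected outputs in the comments are assumptions, not verified against a release, and the full header string is a hypothetical example:

```python
# Sketch: product token vs. full User-Agent header, assuming the 0.8.0
# behavior described in this issue (expected outputs are assumptions).
import robots

content = """
User-agent: mozilla
Disallow: /
"""
url = "https://example.com"

parser = robots.RobotsParser.from_string(content)

# A bare product token matches the robots.txt group case-insensitively.
print(parser.is_agent_valid("Mozilla"))  # True: a well-formed product token
print(parser.can_fetch("Mozilla", url))  # False: the Disallow group matches

# A full HTTP User-Agent header is legal as a request header, but it is not
# a product token, so no robots.txt group matches it here.
header = "Mozilla/5.0 (X11; Linux x86_64)"
print(parser.is_agent_valid(header))     # False: not a product token
print(parser.can_fetch(header, url))     # True: no matching group, so allowed
```

Under this reading, the True, False pair reported in the issue comes from passing the full header rather than the bare token.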
This is a question of interpretation. The RFC says, "
Why? The RFC says, "
Thank you, @kox-solid, for pointing to the 369883.textproto tests. I initially did not include these tests in robotspy, so this will allow me to dig deeper. I tested it with the Google C++ version, which behaves as you stated. The Go version behaves like robotspy. Again, I will investigate further and appreciate your insight.

You are correct regarding the last point related to the User-Agent header; I did not question that. I meant that you passed the User-Agent header to an internal function of robotspy intended to parse a crawler name, not a User-Agent header. Indeed, there is no restriction on characters in the User-Agent header, and the function

Out of curiosity, was there a particular scenario you faced that required a robots.txt parser? My starting point with robotspy was to fix a bug in the robots parser of the Python standard library, but I'm curious about concrete use cases I could leverage in my tests.
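For comparison with the Google C++ behavior mentioned above: Google's parser reduces a full User-Agent header to its leading product token before matching. A rough Python analogue of that extraction step is sketched below; `extract_product_token` is a hypothetical helper, not part of robotspy, and the accepted character set is my reading of the C++ source, not a documented guarantee:

```python
import re

def extract_product_token(user_agent: str) -> str:
    # Keep only the leading run of letters, hyphens, and underscores,
    # approximately mirroring the extraction done by Google's C++ matcher.
    return re.match(r"[A-Za-z_-]*", user_agent).group(0)

print(extract_product_token("Mozilla/5.0 (X11; Linux x86_64)"))  # Mozilla
print(extract_product_token("Googlebot/2.1"))                    # Googlebot
print(extract_product_token("foobot-news"))                      # foobot-news
```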
Hi @kox-solid, thanks again for raising this issue. As you suggested, I updated the parser to behave like Google robots. The two tests you pointed out in 369883.textproto now behave like Google robots. The new 0.9.0 version is available at https://pypi.org/project/robotspy/. Your code example will work if you set

Thank you again for pointing out this anomaly.
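A quick way to confirm the change, assuming 0.9.0 now applies Google-style token extraction inside `can_fetch` (an assumption based on this comment, not a documented behavior of the release):

```python
import robots  # assumes robotspy >= 0.9.0

content = """
User-agent: mozilla
Disallow: /
"""

parser = robots.RobotsParser.from_string(content)

# Assumption: the leading "Mozilla" token is extracted from the full header
# and matched case-insensitively, so the Disallow rule now applies.
header = "Mozilla/5.0 (X11; Linux x86_64)"
print(parser.can_fetch(header, "https://example.com"))  # False (expected, Google-like)
```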
Hello @andreburgaud, thanks for your work and the quick reply. I tested robotspy==0.9.0 against all of Google's "stress" tests and got the following result:

I need a robots.txt parser for an SEO audit crawler that interprets robots.txt the same way Google does, or close to it. At the moment, there is no such parser for Python; all of them have certain flaws.
Hi @kox-solid, thank you for pointing to the failing tests and sharing your motivation for finding a robots.txt parser. This is super helpful. I can't pretend
@kox-solid, I just released
I'm closing this according to my previous comments.