Disallow rule not working #209

robotspy==0.8.0 returns True, False instead of False, False.
Comments
Thank you, @kox-solid, for raising this issue. Let me look into it as soon as possible.
@kox-solid, I analyzed your example. It correctly returns

So the following example would have returned False, True:

```python
import robots

content = """
User-agent: mozilla
Disallow: /
"""
check_url = "https://example.com"
user_agent = "Mozilla"

parser = robots.RobotsParser.from_string(content)
print(parser.can_fetch(user_agent, check_url))  # False (Disallow)
print(parser.is_agent_valid(user_agent))        # True (Mozilla is a valid user-agent)
```

I understand that the term user-agent in the User-Agent HTTP request header and the User-agent line in robots.txt sound confusing. The latest RFC, https://www.rfc-editor.org/rfc/rfc9309, provides some elements of clarity (for example, by using the term product token instead of user-agent), but in

Other libraries or tools like https://github.com/google/robotstxt or https://github.com/jimsmart/grobotstxt follow the same approach as

Thank you again for raising your concern.
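To make the header-versus-token distinction concrete, here is a minimal sketch, assuming the robotspy 0.8.0 behavior described in this thread; the expected outputs in the comments are assumptions, not verified against a release, and the full header string is a hypothetical example:

```python
# Sketch: product token vs. full User-Agent header, assuming the 0.8.0
# behavior described in this issue (expected outputs are assumptions).
import robots

content = """
User-agent: mozilla
Disallow: /
"""
url = "https://example.com"

parser = robots.RobotsParser.from_string(content)

# A bare product token matches the robots.txt group case-insensitively.
print(parser.is_agent_valid("Mozilla"))  # True: a well-formed product token
print(parser.can_fetch("Mozilla", url))  # False: the Disallow group matches

# A full HTTP User-Agent header is legal as a request header, but it is not
# a product token, so no robots.txt group matches it here.
header = "Mozilla/5.0 (X11; Linux x86_64)"
print(parser.is_agent_valid(header))     # False: not a product token
print(parser.can_fetch(header, url))     # True: no matching group, so allowed
```

Under this reading, the True, False pair reported in the issue comes from passing the full header rather than the bare token.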
This is a question of interpretation. The RFC says, "
Why? The RFC says, "
Thank you, @kox-solid, for pointing to the 369883.textproto tests. I initially did not include these tests in robotspy, so this will allow me to dig deeper. I tested it with the Google C++ version, which behaves as you stated. The Go version behaves like robotspy. Again, I will investigate further and appreciate your insight.

You are correct regarding the last point related to the User-Agent header; I did not question that. I meant that you passed the User-Agent header to an internal function of robotspy intended to parse a crawler name, not a User-Agent header. Indeed, there is no restriction on characters in the User-Agent header, and the function

Out of curiosity, was there a particular scenario you faced that required a robots.txt parser? My starting point with robotspy was to fix a bug in the robots parser of the Python standard library, but I'm curious about concrete use cases I could leverage in my tests.
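For comparison with the Google C++ behavior mentioned above: Google's parser reduces a full User-Agent header to its leading product token before matching. A rough Python analogue of that extraction step is sketched below; `extract_product_token` is a hypothetical helper, not part of robotspy, and the accepted character set is my reading of the C++ source, not a documented guarantee:

```python
import re

def extract_product_token(user_agent: str) -> str:
    # Keep only the leading run of letters, hyphens, and underscores,
    # approximately mirroring the extraction done by Google's C++ matcher.
    return re.match(r"[A-Za-z_-]*", user_agent).group(0)

print(extract_product_token("Mozilla/5.0 (X11; Linux x86_64)"))  # Mozilla
print(extract_product_token("Googlebot/2.1"))                    # Googlebot
print(extract_product_token("foobot-news"))                      # foobot-news
```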
Hi @kox-solid, thanks again for raising this issue. As you suggested, I updated the parser to behave like Google robots. The two tests you pointed out in 369883.textproto now behave like Google robots. The new 0.9.0 version is available at https://pypi.org/project/robotspy/. Your code example will work if you set

Thank you again for pointing out this anomaly.
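A quick way to confirm the change, assuming 0.9.0 now applies Google-style token extraction inside `can_fetch` (an assumption based on this comment, not a documented behavior of the release):

```python
import robots  # assumes robotspy >= 0.9.0

content = """
User-agent: mozilla
Disallow: /
"""

parser = robots.RobotsParser.from_string(content)

# Assumption: the leading "Mozilla" token is extracted from the full header
# and matched case-insensitively, so the Disallow rule now applies.
header = "Mozilla/5.0 (X11; Linux x86_64)"
print(parser.can_fetch(header, "https://example.com"))  # False (expected, Google-like)
```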
Hello @andreburgaud, thanks for your work and the quick reply. I tested robotspy==0.9.0 against all of Google's "stress" tests and got the following result:

I need a robots.txt parser for an SEO audit crawler that interprets robots.txt the same way Google does, or close to it. At the moment, there is no such parser for Python; all of them have certain flaws.
Hi @kox-solid, thank you for pointing to the failing tests and sharing your motivation for finding a robots.txt parser. This is super helpful. I can't pretend
@kox-solid, I just released
I'm closing this according to my previous comments.