bpo-39187: robotparser does not respect longest match #17794
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
|  | @@ -224,6 +224,42 @@ class GoogleURLOrderingTest(BaseRobotTest, unittest.TestCase): | |
| bad = ['/folder1/anotherfile.html'] | ||
|  | ||
|  | ||
| class LongestMatchUserAgentTest(BaseRobotTest, unittest.TestCase): | ||
| # https://tools.ietf.org/html/draft-koster-rep-00#section-3.2 | ||
| Review comment: This document is now available as https://datatracker.ietf.org/doc/html/rfc9309. Please update all links. This test also passes if every allow rule has higher priority than any disallow rule, so please also add reversed rules (short allow, long disallow) here. | ||
| # The most specific rule should be used | ||
| robots_txt = """\ | ||
| User-agent: FooBot | ||
| Disallow: /folder1/ | ||
| Allow: /folder1/myfile.html | ||
| """ | ||
| agent = 'foobot' | ||
| good = ['/folder1/myfile.html'] | ||
| bad = ['/folder1/anotherfile.html'] | ||
|  | ||
|  | ||
| class LongestMatchDefaultUserAgentTest(BaseRobotTest, unittest.TestCase): | ||
| # https://tools.ietf.org/html/draft-koster-rep-00#section-3.2 | ||
| # The most specific rule should be used | ||
| robots_txt = """\ | ||
| User-agent: * | ||
| Disallow: /folder1/ | ||
| Allow: /folder1/myfile.html | ||
| """ | ||
| good = ['/folder1/myfile.html'] | ||
| bad = ['/folder1/anotherfile.html'] | ||
|  | ||
|  | ||
| class EquivalentRulesTest(BaseRobotTest, unittest.TestCase): | ||
| # https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.2 | ||
| # The most specific rule should be used | ||
| robots_txt = """\ | ||
| User-agent: * | ||
| Disallow: /folder1/ | ||
| Allow: /folder1/ | ||
| """ | ||
| good = ['/folder1/myfile.html', '/folder1', '/folder1'] | ||
| Review comment: '/folder1' is listed twice. | ||
|  | ||
|  | ||
| class DisallowQueryStringTest(BaseRobotTest, unittest.TestCase): | ||
| # see issue #6325 for details | ||
| robots_txt = """\ | ||
|  | @@ -367,7 +403,7 @@ def test_basic(self): | |
| def test_can_fetch(self): | ||
| self.assertTrue(self.parser.can_fetch('*', self.url('elsewhere'))) | ||
| self.assertFalse(self.parser.can_fetch('Nutch', self.base_url)) | ||
| - self.assertFalse(self.parser.can_fetch('Nutch', self.url('brian'))) | ||
| + self.assertTrue(self.parser.can_fetch('Nutch', self.url('brian'))) | ||
| self.assertFalse(self.parser.can_fetch('Nutch', self.url('webstats'))) | ||
| self.assertFalse(self.parser.can_fetch('*', self.url('webstats'))) | ||
| self.assertTrue(self.parser.can_fetch('*', self.base_url)) | ||
|  | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| Add a sort function to respect the longest match rule as per the current internet draft: https://tools.ietf.org/html/draft-koster-rep-00#section-3.2 | ||
| The sort function also takes into account equivalent rules such that allow should be used: https://tools.ietf.org/html/draft-koster-rep-00#section-2.2.2 | 
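
The ordering described in this entry can be sketched roughly as follows. This is not the code from this pull request (the `robotparser.py` change is not shown on this page); `is_allowed` and the `(pattern, allowed)` rule tuples are hypothetical names used only for illustration, and the sketch handles plain prefix matching only, with no wildcards or percent-encoding.

```python
def is_allowed(rules, path):
    """Decide whether `path` may be fetched.

    `rules` is a list of (pattern, allowed) pairs for one user-agent group,
    where `allowed` is True for an Allow line and False for a Disallow line.
    Hypothetical helper: plain prefix matching only.
    """
    # Longest pattern first; for patterns of equal length, Allow (True)
    # sorts ahead of Disallow (False) because True > False.
    ordered = sorted(rules, key=lambda rule: (len(rule[0]), rule[1]), reverse=True)
    for pattern, allowed in ordered:
        if path.startswith(pattern):
            return allowed
    # No rule matched: fetching is allowed by default.
    return True


# Long Allow inside a short Disallow, as in LongestMatchUserAgentTest.
rules = [("/folder1/", False), ("/folder1/myfile.html", True)]
print(is_allowed(rules, "/folder1/myfile.html"))
print(is_allowed(rules, "/folder1/anotherfile.html"))

# Reversed case: short Allow, long Disallow.
reversed_rules = [("/folder1/", True), ("/folder1/myfile.html", False)]
print(is_allowed(reversed_rules, "/folder1/myfile.html"))
```

Sorting once by `(length, allowed)` means the first matching rule is both the most specific one and, on a tie, the Allow rule, so a simple first-match loop is enough.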
Review comment: This test replaces GoogleURLOrderingTest. I think GoogleURLOrderingTest should now be removed, because it has lost its meaning.