disallow bug #83

Open · Sobes76rus opened this issue Feb 11, 2018 · 7 comments

@Sobes76rus commented Feb 11, 2018

Hello!

user-agent: *
disallow: */test

http://mysite.com/test is reported as allowed, but it should be disallowed. Am I right?
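
For reference, a minimal reproduction sketch (assuming reppy's Robots.parse classmethod accepts the robots.txt URL plus its raw content; 'my-user-agent' is an arbitrary agent string):

>>> from reppy.robots import Robots
>>> robots = Robots.parse('http://mysite.com/robots.txt', '\n'.join([
...     'user-agent: *',
...     'disallow: */test',
... ]))
>>> # Expected: False, since */test should match the /test path,
>>> # but the reported result is:
>>> robots.allowed('http://mysite.com/test', 'my-user-agent')
True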

@b4hand (Contributor) commented Feb 12, 2018

I can confirm this fails. However, this is probably an issue in rep-cpp rather than reppy, since rep-cpp handles the actual robots.txt parsing. I'm also not really sure how well rep-cpp supports wildcards. Technically, the original robots.txt specification did not allow wildcards, but we do support several extensions.

@b4hand (Contributor) commented Feb 12, 2018

I've figured out what the issue is. Internally, */test is being normalized to /*/test, because rules are normally absolute paths and can't be relative. But /*/test obviously doesn't match http://mysite.com/test, since that path doesn't contain two slashes. I'm not sure what the best way to handle this is, but I'll file a bug against rep-cpp.
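
As a rough illustration (a regex analogy only, not rep-cpp's actual matching code), treating * as "any sequence of characters" shows why the normalized rule can't match a single-segment path:

>>> import re
>>> bool(re.match('/.*/test', '/test'))   # normalized '/*/test': needs two slashes
False
>>> bool(re.match('.*/test', '/test'))    # original '*/test': matches /test
True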

@b4hand (Contributor) commented Feb 12, 2018

See seomoz/rep-cpp#34.

@Sobes76rus (Author)

thx =)

@jspalink (Contributor) commented Feb 19, 2018

Related to this: I'm noticing a parsing issue even when the wildcard is not leading in a disallow rule. Here's an example:

https://www.theverge.com/robots.txt

The Disallow: /users/*/replies rule essentially causes all allowed() calls to return False.

>>> robots = Robots.fetch('https://www.theverge.com/robots.txt')
>>> robots.allowed('https://www.theverge.com/anything', 'my-user-agent')
False
>>> list(robots.sitemaps)
[]

Coincidentally, it also fails to parse the sitemaps. A bit of exploring reveals this:

>>> robots.__str__()
b'{"*": [Directive(Disallow: /)]}'

Editing a local copy of their robots.txt to remove the offending line resolves the problem.

@dlecocq (Contributor) commented Feb 20, 2018

That particular site returns a 403 for Robots.fetch with the default user agent provided by requests:

>>> robots = Robots.fetch('https://www.theverge.com/robots.txt')
>>> robots
<reppy.robots.AllowNone object at 0x108c55450>
>>> requests.get('https://www.theverge.com/robots.txt')
<Response [403]>

Providing a different user agent resolves the issue:

>>> headers = {'User-Agent': 'Chrome'}
>>> robots = Robots.fetch('https://www.theverge.com/robots.txt', headers=headers)
>>> robots
<reppy.robots.Robots object at 0x109461250>
>>> robots.allowed('https://www.theverge.com/anything', 'my-user-agent')
True
>>> list(robots.sitemaps)
['https://www.theverge.com/sitemaps', 'https://www.theverge.com/sitemaps/videos', 'https://www.theverge.com/sitemaps/google_news']
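
For completeness, a small sketch of the same workaround with a guard against fetch failures (assuming AllowNone can be imported from reppy.robots, as the repr above suggests; 'Chrome' here is just a stand-in user agent string):

>>> from reppy.robots import Robots, AllowNone
>>> robots = Robots.fetch('https://www.theverge.com/robots.txt',
...                       headers={'User-Agent': 'Chrome'})
>>> isinstance(robots, AllowNone)   # True would mean the fetch failed (e.g. a 403)
False
>>> robots.allowed('https://www.theverge.com/anything', 'my-user-agent')
True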

@jspalink (Contributor)

@dlecocq - that works for me. Thank you for the tip!
